6 Sample Size and Power

The question of the size of the sample, the number of observations to be used in scientific
experiments, is of extreme importance. Most experiments raise the question of sample size.
Particularly when time and cost are critical factors, one wishes to use the minimum sample size
to achieve the experimental objectives. Even when time and cost are less crucial, the scientist
wishes to have some idea of the number of observations needed to yield sufficient data to answer
the objectives. An elegant experiment will make the most of the resources available, resulting
in a sufficient amount of information from a minimum sample size. For simple comparative
experiments, where one or two groups are involved, the calculation of sample size is relatively
simple. Knowledge of the α level (level of significance), the β level (1 − power), the standard
deviation, and a meaningful "practically significant" difference is necessary in order to calculate
the sample size.
Power is defined as 1 − β (i.e., β = 1 − power). Power is the ability of a statistical test to
show significance if a specified difference truly exists. The magnitude of power depends on the
level of significance, the standard deviation, and the sample size. Thus power and sample size
are related.
In this chapter, we present methods for computing the sample size for relatively simple
situations for normally distributed and binomial data. The concept and calculation of power
are also introduced.

6.1 INTRODUCTION
The question of sample size is a major consideration in the planning of experiments, but may not
be answered easily from a scientific point of view. In some situations, the choice of sample size
is limited. Sample size may be dictated by official specifications, regulations, cost constraints,
and/or the availability of sampling units such as patients, manufactured items, animals, and
so on. The USP content uniformity test is an example of a test in which the sample size is fixed
and specified [1].
The sample size is also specified in certain quality control sampling plans such as those
described in MIL-STD-105E [2]. These sampling plans are used when sampling products for
inspection for attributes such as product defects, missing labels, specks in tablets, or ampul
leakage. The properties of these plans have been thoroughly investigated and defined as described
in the document cited above. The properties of the plans include the chances (probability) of
rejecting or accepting batches with a known proportion of rejects in the batch (sect. 12.3).
Sample-size determination in comparative clinical trials is a factor of major importance.
Since very large experiments will detect very small, perhaps clinically insignificant, differences
as being statistically significant, and small experiments will often find large, clinically significant
differences as statistically insignificant, the choice of an appropriate sample size is critical in the
design of a clinical program to demonstrate safety and efficacy. When cost is a major factor in
implementing a clinical program, the number of patients to be included in the studies may be
limited by lack of funds. With fewer patients, a study will be less sensitive. Decreased sensitivity
means that the comparative treatments will be relatively more difficult to distinguish statistically
if they are, in fact, different.
The problem of choosing a “correct” sample size is related to experimental objectives and
the risk (or probability) of coming to an incorrect decision when the experiment and analysis
are completed. For simple comparative experiments, certain prior information is required in


order to compute a sample size that will satisfy the experimental objectives. The following
considerations are essential when estimating sample size.
1. The α level must be specified; it determines, in part, the difference needed to represent a
statistically significant result. To review, the α level is defined as the risk of concluding that
treatments differ when, in fact, they are the same. The level of significance is usually (but
not always) set at the traditional value of 5%.
2. The β error must be specified for some specified treatment difference, Δ. Beta, β, is the risk
(probability) of erroneously concluding that the treatments are not significantly different
when, in fact, a difference of size Δ or greater exists. The assessment of β and Δ, the
"practically significant" difference, prior to the initiation of the experiment, is not easy.
Nevertheless, an educated guess is required. β is often chosen to be between 5% and 20%.
Hence, one may be willing to accept a 20% (1 in 5) chance of not arriving at a statistically
significant difference when the treatments are truly different by an amount equal to (or
greater than) Δ. The consequences of committing a β error should be considered carefully.
If a true difference of practical significance is missed and the consequence is costly, β should
be made very small, perhaps as small as 1%. Costly consequences of missing an effective
treatment should be evaluated not only in monetary terms, but should also include public
health issues, such as the possible loss of an effective treatment in a serious disease.
3. The difference to be detected, Δ (that difference considered to have practical significance),
should be specified as described in (2) above. This difference should not be arbitrarily or
capriciously determined, but should be considered carefully with respect to meaningfulness
from both a scientific and commercial marketing standpoint. For example, when comparing
two formulas for time to 90% dissolution, a difference of one or two minutes might be
considered meaningless. A difference of 10 or 20 minutes, however, may have practical
consequences in terms of in vivo absorption characteristics.
4. A knowledge of the standard deviation (or an estimate) for the significance test is necessary.
If no information on variability is available, an educated guess, or results of studies reported
in the literature using related compounds, may be sufficient to give an estimate of the
relevant variability. The assistance of a statistician is recommended when estimating the
standard deviation for purposes of determining sample size.
To compute the sample size in a comparative experiment, (a) α, (b) β, (c) Δ, and (d) σ
must be specified. The computations to determine sample size are described below (Fig. 6.1).

6.2 DETERMINATION OF SAMPLE SIZE FOR SIMPLE COMPARATIVE EXPERIMENTS FOR NORMALLY DISTRIBUTED VARIABLES
The calculation of sample size will be described with the aid of Figure 6.1. This explanation is
based on normal distribution or t tests. The derivation of sample-size determination may appear
complex. The reader not requiring a “proof” can proceed directly to the appropriate formulas
below.

Figure 6.1 Scheme to demonstrate calculation of sample size based on α, β, Δ, and σ: α = 0.05, β = 0.10,
Δ = 5, σ = 7; H0: Δ = 0, Ha: Δ = 5.


6.2.1 Paired-Sample and Single-Sample Tests


We will first consider the case of a paired-sample test where the null hypothesis is that the
two treatment means are equal: H0: Δ = 0. In the case of an experiment comparing a new
antihypertensive drug candidate and a placebo, an average difference of 5 mm Hg in blood
pressure reduction might be considered of sufficient magnitude to be interpreted as a difference
of "practical significance" (Δ = 5). The standard deviation for the comparison was known, equal
to 7, based on a large amount of experience with this drug.
In Figure 6.1, the normal curve labeled A represents the distribution of differences with
mean equal to 0 and σ equal to 7. This is the distribution under the null hypothesis (i.e., drug
and placebo are identical). Curve B is the distribution of differences when the alternative, Ha:
Δ = 5,* is true (i.e., the difference between drug and placebo is equal to 5). Note that curve B is
identical to curve A except that B is displaced 5 mm Hg to the right. Both curves have the same
standard deviation, 7.
With the standard deviation, 7, known, the statistical test is performed at the 5% level as
follows [Eq. (5.4)]:

$$Z = \frac{\delta - 0}{\sigma/\sqrt{N}} = \frac{\delta - 0}{7/\sqrt{N}}. \qquad (6.1)$$

For a two-tailed test, if the absolute value of Z is 1.96 or greater, the difference is significant.
According to Eq. (6.1), to obtain significance,

$$\delta \ge \frac{\sigma Z}{\sqrt{N}} = \frac{7(1.96)}{\sqrt{N}} = \frac{13.7}{\sqrt{N}}. \qquad (6.2)$$
Therefore, values of δ equal to or greater than 13.7/√N (or equal to or less than −13.7/√N)
will lead to a declaration of significance. These points are designated as δ_L and δ_U in Figure 6.1,
and represent the cutoff points for statistical significance at the 5% level; that is, observed
differences equal to or more remote from the mean than these values result in "statistically
significant differences."
If curve B is the true distribution (i.e., Δ = 5), an observed mean difference greater than
13.7/√N (or less than −13.7/√N) will result in the correct decision; H0 will be rejected and we
conclude that a difference exists. If Δ = 5, observations of a mean difference between 13.7/√N
and −13.7/√N will lead to an incorrect decision, the acceptance of H0 (no difference) (Fig. 6.1).
By definition, the probability of making this incorrect decision is equal to β.
In the present example, β will be set at 10%. In Figure 6.1, β is represented by the area in
curve B below 13.7/√N (δ_U), equal to 0.10. (This area, β, represents the probability of accepting
H0 if Δ = 5.)
We will now compute the value of δ that cuts off 10% of the area in the lower tail of the normal
curve with a mean of 5 and a standard deviation of 7 (curve B in Fig. 6.1). Table IV.2 shows
that 10% of the area in the standard normal curve is below −1.28. The value of δ (mean difference
in blood pressure between the two groups) that corresponds to a given value of Z (−1.28, in this
example) is obtained from the formula for the Z transformation [Eq. (3.14)] as follows:

$$\delta = \Delta + Z_\beta\,\frac{\sigma}{\sqrt{N}}$$

$$Z_\beta = \frac{\delta - \Delta}{\sigma/\sqrt{N}}. \qquad (6.3)$$

Applying Eq. (6.3) to our present example, δ = 5 − 1.28(7/√N). The value of δ in Eqs. (6.2)
and (6.3) is identically the same, equal to δ_U. This is illustrated in Figure 6.1.

* Δ is considered to be the true mean difference, similar to μ. δ will be used to denote the observed mean difference.


Table 6.1 Sample Size as a Function of Beta with Δ = 5 and σ = 7: Paired Test (α = 0.05)

Beta (%)    Sample size, N
1           36
5           26
10          21
20          16


From Eq. (6.2), δ_U = 13.7/√N, satisfying the definition of α. From Eq. (6.3), δ_U = 5 −
1.28(7)/√N, satisfying the definition of β. We have two equations in two unknowns (δ_U and N),
and N is evaluated as follows:

$$\frac{13.7}{\sqrt{N}} = 5 - \frac{1.28(7)}{\sqrt{N}}$$

$$N = \frac{(13.7 + 8.96)^2}{5^2} = 20.5 \cong 21.$$

In general, Eqs. (6.2) and (6.3) can be solved for N to yield the following equation:

$$N = \left(\frac{\sigma}{\Delta}\right)^2 (Z_\alpha + Z_\beta)^2, \qquad (6.4)$$

where Z_α and Z_β† are the appropriate normal deviates obtained from Table IV.2. In our example,
N = (7/5)²(1.96 + 1.28)² ≅ 21. A sample size of 21 will result in a statistical test with 90% power
(β = 10%) against an alternative of 5, at the 5% level of significance. Table 6.1 shows how the
choice of β can affect the sample size for a test at the 5% level with Δ = 5 and σ = 7.
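Equation (6.4) is easy to program as a check on hand calculations. The following is a minimal sketch in Python (the function name and the use of scipy's normal quantiles are our own choices, not from the text); exact quantiles can give an N one unit different from the two-decimal Z arithmetic above.

```python
from math import ceil
from scipy.stats import norm

def n_paired(delta, sigma, alpha=0.05, beta=0.10):
    """Eq. (6.4): N = (sigma/delta)^2 * (Z_alpha + Z_beta)^2
    for a paired- or single-sample test with two-sided alpha."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided Z_alpha (Table 6.2)
    z_beta = norm.ppf(1 - beta)        # one-sided Z_beta
    return ceil((sigma / delta) ** 2 * (z_alpha + z_beta) ** 2)

# Reproduces Table 6.1 (delta = 5, sigma = 7):
for beta in (0.01, 0.05, 0.10, 0.20):
    print(beta, n_paired(5, 7, beta=beta))
# -> 37, 26, 21, 16 (the text's Table 6.1 lists 36 for beta = 0.01,
#    because it uses the rounded value Z = 2.32)
```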
The formula for computing the sample size if the standard deviation is known [Eq. (6.4)]
is appropriate for a paired-sample test or for the test of a mean from a single population. For
example, consider a test to compare the mean drug content of a sample of tablets to the labeled
amount, 100 mg. The two-sided test is to be performed at the 5% level. Beta is designated as
10% for a difference of −5 mg (95 mg potency or less). That is, we wish to have a power of 90%
to detect a difference from 100 mg if the true potency is 95 mg or less. If ␴ is equal to 3, how
many tablets should be assayed? Applying Eq. (6.4), we have
$$N = \left(\frac{3}{5}\right)^2 (1.96 + 1.28)^2 = 3.8.$$

Assaying four tablets will satisfy the α and β probabilities. Note that Z = 1.28 cuts off 90%
of the area under curve B (the "alternative" curve) in Figure 6.2, leaving 10% (β) of the area in
the upper tail of the curve. Table 6.2 shows values of Z_α and Z_β for various levels of α and β
to be used in Eq. (6.4). In this example, and most examples in practice, β is based on one tail of
the normal curve. The other tail contains an insignificant area relating to β (the right side of the
normal curve, B, in Fig. 6.1).
Equation (6.4) is correct for computing the sample size for a paired- or one-sample test if
the standard deviation is known.
In most situations, the standard deviation is unknown and a prior estimate of the standard
deviation is necessary in order to calculate sample size requirements. In this case, the estimate
of the standard deviation replaces σ in Eq. (6.4), but the calculation results in an answer that is
slightly too small. The underestimation occurs because the values of Z_α and Z_β are smaller than

† Z_β is taken as the positive value of Z in this formula.



Table 6.2 Values of Z_α and Z_β for Sample-Size Calculations

α or β    Z_α (one-sided)   Z_α (two-sided)   Z_β^a
1%        2.32              2.58              2.32
5%        1.65              1.96              1.65
10%       1.28              1.65              1.28
20%       0.84              1.28              0.84

^a The value of β is for a single specified alternative. For a two-sided test,
the probability of rejecting the alternative, if true (i.e., accepting H0), is virtually
all contained in the tail nearest the alternative mean.

Figure 6.2 Illustration of the calculation of N for tablet assays: X = 95 + σZ_β/√N = 100 − σZ_α/√N.

the corresponding t values that should be used in the formula when the standard deviation is
unknown. The situation is somewhat complicated by the fact that the value of t depends on the
sample size (d.f.), which is yet unknown. The problem can be solved by an iterative method,
but for practical purposes, one can use the appropriate values of Z to compute the sample size
[as in Eq. (6.4)] and add on a few extra samples (patients, tablets, etc.) to compensate for the
use of Z rather than t. Guenther has shown that the simple addition of 0.5Z_α², which is equal
to approximately 2 for a two-sided test at the 5% level, results in a very close approximation to
the correct answer [3]. In the problem illustrated above (tablet assays), if the standard deviation
were unknown but estimated as being equal to 3 based on previous experience, a better estimate
of the sample size would be N + 0.5Z_α² = 3.8 + 0.5(1.96)² ≅ 6 tablets.
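The iterative method mentioned above can also be sketched in a few lines (our own construction, assuming scipy): replace Z by t in Eq. (6.4) and repeat until N stabilizes, since the degrees of freedom depend on N itself.

```python
from math import ceil
from scipy.stats import t

def n_one_sample_t(delta, sd, alpha=0.05, beta=0.10):
    """Sample size with sigma estimated: use t in place of Z in
    Eq. (6.4) and iterate, because the d.f. depend on N."""
    n = 4  # any small starting guess; the Z-based answer also works
    for _ in range(100):  # fixed-point iteration; converges quickly here
        df = max(n - 1, 1)
        t_a = t.ppf(1 - alpha / 2, df)  # two-sided alpha
        t_b = t.ppf(1 - beta, df)       # one-sided beta
        n_new = ceil((sd / delta) ** 2 * (t_a + t_b) ** 2)
        if n_new == n:
            break
        n = n_new
    return n

print(n_one_sample_t(5, 3))  # -> 6 tablets, agreeing with the
                             #    0.5*Z_alpha^2 correction above
```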

6.2.2 Determination of Sample Size for Comparison of Means in Two Groups


For a two independent groups test (parallel design), with the standard deviation known and
an equal number of observations per group, the formula for N (where N is the sample size for each
group) is

$$N = 2\left(\frac{\sigma}{\Delta}\right)^2 (Z_\alpha + Z_\beta)^2. \qquad (6.5)$$

If the standard deviation is unknown and a prior estimate is available (s.d.), substitute the
s.d. for σ in Eq. (6.5) and compute the sample size, but add on 0.25Z_α² to the sample size for each
group.
Example 1: This example illustrates the determination of the sample size for a two independent
groups (two-sided test) design. Two variations of a tablet formulation are to be compared
with regard to dissolution time. All ingredients except for the lubricating agent were the same
in these two formulations. In this case, a decision was made that if the formulations differed by
10 minutes or more to 80% dissolution, it would be extremely important that the experiment
shows a statistically significant difference between the formulations. Therefore, the pharmaceutical
scientist decided to fix the β error at 1% in a statistical test at the traditional 5% level. Data
were available from dissolution tests run during the development of formulations of the drug

and the standard deviation was estimated as 5 minutes. With the information presented above,
the sample size can be determined from Eq. (6.5). We will add 0.25Z_α² samples to the answer
because the standard deviation is unknown.

$$N = 2\left(\frac{5}{10}\right)^2 (1.96 + 2.32)^2 + 0.25(1.96)^2 = 10.1.$$

The study was performed using 12 tablets from each formulation rather than the 10 or
11 suggested by the answer in the calculation above. Twelve tablets were used because the
dissolution apparatus could accommodate six tablets per run.
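As a check on Example 1, a hypothetical helper implementing Eq. (6.5) with the 0.25Z_α² add-on might look as follows (a sketch of our own, not the text's code; exact quantiles return 11 where the text's rounded Z values give 10.1):

```python
from math import ceil
from scipy.stats import norm

def n_two_groups(delta, sd, alpha=0.05, beta=0.01):
    """Eq. (6.5) per-group N for two independent groups, plus the
    0.25*Z_alpha^2 add-on used when sigma is estimated."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(1 - beta)
    n = 2 * (sd / delta) ** 2 * (z_a + z_b) ** 2 + 0.25 * z_a ** 2
    return ceil(n)

# Example 1: delta = 10 minutes, sd = 5, alpha = 0.05, beta = 0.01
print(n_two_groups(10, 5))  # -> 11 (text: 10.1 using Z = 2.32)
```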
Example 2: A bioequivalence study was being planned to compare the bioavailability of a
final production batch to a previously manufactured pilot-sized batch of tablets that were made
for clinical studies. Two parameters resulting from the blood-level data would be compared:
area under the plasma level versus time curves (AUC) and peak plasma concentration (Cmax ).
The study was to have 80% power (β = 0.20) to detect a difference of 20% or more between the
formulations. The test is done at the usual 5% level of significance. Estimates of the standard
deviations of the ratios of the values of each of the parameters [(final product)/(pilot batch)]
were determined from a small pilot study. The standard deviations were different for the
parameters. Since the researchers could not agree that one of the parameters was clearly critical
in the comparison, they decided to use a “maximum” number of patients based on the variable
with the largest relative variability. In this example, Cmax was most variable, the ratio having a
standard deviation of approximately 0.30. Since the design and analysis of the bioequivalence
study is a variation of the paired t test, Eq. (6.4) was used to calculate the sample size, adding
on 0.5Z_α², as recommended previously:

$$N = \left(\frac{\sigma}{\Delta}\right)^2 (Z_\alpha + Z_\beta)^2 + 0.5Z_\alpha^2 = \left(\frac{0.3}{0.2}\right)^2 (1.96 + 0.84)^2 + 0.5(1.96)^2 = 19.6. \qquad (6.6)$$

Twenty subjects were used for the comparison of the bioavailabilities of the two
formulations.
For sample-size determination for bioequivalence studies using FDA recommended
designs, see Table 6.5 and section 11.4.4.
Sometimes the sample sizes computed to satisfy the desired α and β errors can be
inordinately large when time and cost factors are taken into consideration. Under these circumstances,
a compromise must be made, most easily accomplished by relaxing the α and β requirements‡
(Table 6.1). The consequence of this compromise is that the probabilities of making an incorrect
decision based on the statistical test will be increased. Other ways of reducing the required
sample size are (a) to increase the precision of the test by improving the assay methodology
or carefully controlling extraneous conditions during the experiment, for example, or (b) to
compromise by increasing Δ, that is, accepting a larger difference that one considers to be of
practical importance.
Table 6.3 gives the sample size for some representative values of the ratio σ/Δ, α, and β,
where the s.d. (s) is estimated.

6.3 DETERMINATION OF SAMPLE SIZE FOR BINOMIAL TESTS


The formulas for calculating the sample size for comparative binomial tests are similar to those
described for normal curve or t tests. The major difference is that the value of σ², which is
assumed to be the same under H0 and Ha in the two-sample independent groups t or Z tests,
is different for the distributions under H0 and Ha in the binomial case. This difference occurs
because σ² is dependent on P, the probability of success, in the binomial. The value of P will

‡ In practice, α is often fixed by regulatory considerations and β is determined as a compromise.


Table 6.3 Sample Size Needed for Two-Sided t Test with Standard Deviation Estimated

One-sample test:

              Alpha = 0.05                Alpha = 0.01
Estimated     Beta:                       Beta:
s/Δ           0.01   0.05   0.10   0.20   0.01   0.05   0.10   0.20
4.0           296    211    170    128    388    289    242    191
2.0           76     54     44     34     100    75     63     51
1.5           44     32     26     20     58     54     37     30
1.0           21     16     13     10     28     22     19     16
0.8           14     11     9      8      19     15     13     11
0.67          11     8      7      6      15     12     11     9
0.5           7      6      5      4      10     8      8      7
0.4           6      5      4      4      8      7      6      6
0.33          5      4      4      3      7      6      6      5

Two-sample test (N units per group):

              Alpha = 0.05                Alpha = 0.01
Estimated     Beta:                       Beta:
s/Δ           0.01   0.05   0.10   0.20   0.01   0.05   0.10   0.20
4.0           588    417    337    252    770    572    478    376
2.0           148    106    86     64     194    145    121    96
1.5           84     60     49     37     110    82     69     55
1.0           38     27     23     17     50     38     32     26
0.8           25     18     15     12     33     25     21     17
0.67          18     13     11     9      24     18     15     13
0.5           11     8      7      6      14     11     10     8
0.4           8      6      5      4      10     8      7      6
0.33          6      5      4      4      8      6      6      5

be different depending on whether H0 or Ha represents the true situation. The appropriate
formulas for determining sample size for the one- and two-sample tests are as follows.
One-sample test:

$$N = \frac{1}{2}\,\frac{p_0 q_0 + p_1 q_1}{\Delta^2}\,(Z_\alpha + Z_\beta)^2, \qquad (6.7)$$

where Δ = p1 − p0; p1 is the proportion that would result in a meaningful difference, and p0 is
the hypothetical proportion under the null hypothesis.
Two-sample test:

$$N = \frac{p_1 q_1 + p_2 q_2}{\Delta^2}\,(Z_\alpha + Z_\beta)^2, \qquad (6.8)$$

where Δ = p1 − p2; p1 and p2 are prior estimates of the proportions in the experimental groups.
The values of Z_α and Z_β are the same as those used in the formulas for the normal curve or
t tests. N is the sample size for each group. If it is not possible to estimate p1 and p2 prior to
the experiment, one can make an educated guess of a meaningful value of Δ and set p1 and p2
both equal to 0.5 in the numerator of Eq. (6.8). This will maximize the sample size, resulting in a
conservative estimate of sample size.
Fleiss [4] gives a fine discussion of an approach to estimating Δ, the practically significant
difference, when computing the sample size. For example, one approach is first to estimate
the proportion for the more well-studied treatment group. In the case of a comparative clinical
study, this could very well be a standard treatment. Suppose this treatment has shown a success
rate of 50%. One might argue that if the comparative treatment is additionally successful for 30%
of the patients who do not respond to the standard treatment, then the experimental treatment
would be valuable. Therefore, the success rate for the experimental treatment should be 50% +
0.3 (50%) = 65% to show a practically significant difference. Thus, p1 would be equal to 0.5 and
p2 would be equal to 0.65.
Example 3: A reconciliation of quality control data over several years showed that the
proportion of unacceptable capsules for a stable encapsulation process was 0.8% (p0). A sample
size for inspection is to be determined so that if the true proportion of unacceptable capsules
is equal to or greater than 1.2% (Δ = 0.4%), the probability of detecting this change is 80%
(β = 0.2). The comparison is to be made at the 5% level using a one-sided test. According to
Eq. (6.7),

$$N = \frac{1}{2}\,\frac{0.008 \cdot 0.992 + 0.012 \cdot 0.988}{(0.008 - 0.012)^2}\,(1.65 + 0.84)^2 = \frac{7670}{2} = 3835.$$

The large sample size resulting from this calculation is typical of that resulting from
binomial data. If 3835 capsules are too many to inspect, α, β, and/or Δ must be increased. In
the example above, management decided to increase α. This is a conservative decision in that
more good batches would be "rejected" if α is increased; that is, the increase in α results in an
increased probability of rejecting good batches, those with 0.8% unacceptable or less.
Example 4: Two antibiotics, a new product and a standard product, are to be compared
with respect to the two-week cure rate of a urinary tract infection, where a cure is bacteriological
evidence that the organism no longer appears in urine. From previous experience, the cure rate
for the standard product is estimated at 80%. From a practical point of view, if the new product
shows an 85% or better cure rate, the new product can be considered superior. The marketing

division of the pharmaceutical company felt that this difference would support claims of better
efficacy for the new product. This is an important claim. Therefore, β is chosen to be 1% (power
= 99%). A two-sided test will be performed at the 5% level to satisfy FDA guidelines. The test
is two-sided because, a priori, the new product is not known to be better or worse than the
standard. The calculation of sample size to satisfy the conditions above makes use of Eq. (6.8);
here p1 = 0.8 and p2 = 0.85.

$$N = \frac{0.80 \cdot 0.2 + 0.85 \cdot 0.15}{(0.80 - 0.85)^2}\,(1.96 + 2.32)^2 = 2107.$$

The trial would have to include 4214 patients, 2107 on each drug, to satisfy the α and
β risks of 0.05 and 0.01, respectively. If this number of patients is greater than can be
accommodated, the β error can be increased to 5% or 10%, for example. A sample size of 1499
per group is obtained for a β of 5%, and 1207 patients per group for β equal to 10%.
Although Eq. (6.8) is adequate for computing the sample size for most situations, the
calculation of N can be improved by considering the continuity correction [4]. This would be
particularly important for small sample sizes:

$$N' = \frac{N}{4}\left(1 + \sqrt{1 + \frac{8}{N\,|p_2 - p_1|}}\right)^2,$$

where N is the sample size computed from Eq. (6.8) and N' is the corrected sample size. In the
example, for α = 0.05 and β = 0.01, the corrected sample size is

$$N' = \frac{2107}{4}\left(1 + \sqrt{1 + \frac{8}{2107\,|0.80 - 0.85|}}\right)^2 = 2186.$$
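Equations (6.7) and (6.8), with the optional continuity correction, can be sketched as follows (illustrative functions of our own; exact normal quantiles give answers that differ slightly from the text's two-decimal Z arithmetic):

```python
from math import ceil, sqrt
from scipy.stats import norm

def z_values(alpha, beta, two_sided=True):
    z_a = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    return z_a, norm.ppf(1 - beta)

def n_one_sample(p0, p1, alpha=0.05, beta=0.20, two_sided=False):
    """Eq. (6.7): N = 0.5*(p0*q0 + p1*q1)/delta^2 * (Z_a + Z_b)^2."""
    z_a, z_b = z_values(alpha, beta, two_sided)
    return ceil(0.5 * (p0*(1-p0) + p1*(1-p1)) / (p1 - p0)**2
                * (z_a + z_b)**2)

def n_two_sample(p1, p2, alpha=0.05, beta=0.01, corrected=False):
    """Eq. (6.8) per-group N, with Fleiss's continuity correction."""
    z_a, z_b = z_values(alpha, beta)
    n = (p1*(1-p1) + p2*(1-p2)) / (p1 - p2)**2 * (z_a + z_b)**2
    if corrected:
        n = n / 4 * (1 + sqrt(1 + 8 / (n * abs(p2 - p1))))**2
    return ceil(n)

print(n_one_sample(0.008, 0.012))  # Example 3 -> 3824 (text: 3835)
print(n_two_sample(0.80, 0.85))    # Example 4 -> 2113 (text: 2107)
print(n_two_sample(0.80, 0.85, corrected=True))  # -> 2193 (text: 2186)
```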

6.4 DETERMINATION OF SAMPLE SIZE TO OBTAIN A CONFIDENCE INTERVAL OF SPECIFIED WIDTH
The problem of estimating the number of samples needed to estimate the mean with a known
precision by means of the confidence interval is easily solved by using the formula for the
confidence interval (see sect. 5.1). This approach has been used as an aid in predicting election
results based on preliminary polls where the samples are chosen by simple random sampling.
For example, one may wish to estimate the proportion of voters who will vote for candidate A
within 1% of the actual proportion.
We will consider the application of this problem to the estimation of proportions. In
quality control, one can closely estimate the true proportion of percent defects to any given
degree of precision. In a clinical study, a suitable sample size may be chosen to estimate the
true proportion of successes within certain specified limits. According to Eq. (5.3), a two-sided
confidence interval with confidence coefficient p for a proportion is

$$\hat{p} \pm Z\sqrt{\frac{\hat{p}\hat{q}}{N}}.$$

To obtain a 99% confidence interval with a width of 0.01 (i.e., construct an interval that is
within ±0.005 of the observed proportion, p̂ ± 0.005),

$$Z_p\sqrt{\frac{\hat{p}\hat{q}}{N}} = 0.005$$

or

$$N = \frac{Z_p^2\,\hat{p}\hat{q}}{(W/2)^2} \qquad (6.9)$$

$$N = \frac{(2.58)^2\,\hat{p}\hat{q}}{(0.005)^2}.$$

A more exact formula for the sample size for small values of N is given in Ref. [5].
Example 5: A quality control supervisor wishes to have an estimate of the proportion of
tablets in a batch that weigh between 195 and 205 mg, where the proportion of tablets in this
interval is to be estimated within ±0.05 (W = 0.10). How many tablets should be weighed? Use
a 95% confidence interval.
To compute N, we must have an estimate of p̂ [see Eq. (6.9)]. If p̂ and q̂ are chosen to
be equal to 0.5, N will be at a maximum. Thus, if one has no inkling as to the magnitude of
the outcome, using p̂ = 0.5 in Eq. (6.9) will result in a sufficiently large sample size (probably,
too large). Otherwise, estimate p̂ and q̂ based on previous experience and knowledge. In the
present example from previous experience, approximately 80% of the tablets are expected to
weigh between 195 and 205 mg ( p̂ = 0.8). Applying Eq. (6.9),

$$N = \frac{(1.96)^2 (0.8)(0.2)}{(0.10/2)^2} = 245.9.$$

A total of 246 tablets should be weighed. In the actual experiment, 250 tablets were
weighed, and 195 of the tablets (78%) weighed between 195 and 205 mg. The 95% confidence
interval for the true proportion, according to Eq. (5.3), is

$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}\hat{q}}{N}} = 0.78 \pm 1.96\sqrt{\frac{(0.78)(0.22)}{250}} = 0.78 \pm 0.051.$$

The interval is slightly greater than ±5% because p is somewhat less than 0.8 (pq is larger
for p = 0.78 than for p = 0.8). Although 5.1% is acceptable, to ensure a sufficient sample size, in
general, one should estimate p closer to 0.5 in order to cover possible poor estimates of p.
If p̂ had been chosen equal to 0.5, we would have calculated

$$N = \frac{(1.96)^2 (0.5)(0.5)}{(0.10/2)^2} = 384.2.$$
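A sketch of Eq. (6.9) in Python (an illustrative helper of our own, assuming scipy):

```python
from math import ceil
from scipy.stats import norm

def n_for_ci_width(width, p_hat=0.5, confidence=0.95):
    """Eq. (6.9): N = Z^2 * p*q / (W/2)^2, so that a two-sided CI for
    a proportion has total width W.  p_hat = 0.5 maximizes N and is
    the conservative choice when nothing is known about p."""
    z = norm.ppf(1 - (1 - confidence) / 2)
    return ceil(z**2 * p_hat * (1 - p_hat) / (width / 2)**2)

print(n_for_ci_width(0.10, p_hat=0.8))  # Example 5 -> 246
print(n_for_ci_width(0.10, p_hat=0.5))  # conservative -> 385 (384.2 rounded up)
```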

Example 6: A new vaccine is to undergo a nationwide clinical trial. An estimate is desired
of the proportion of the population that would be afflicted with the disease after vaccination. A
good guess of the expected proportion of the population diseased without vaccination is 0.003.
Pilot studies show that the incidence will be about 0.001 (0.1%) after vaccination. What size
sample is needed so that the width of a 99% confidence interval for the proportion diseased in
the vaccinated population should be no greater than 0.0002? To ensure that the sample size is
sufficiently large, the value of p to be used in Eq. (6.9) is chosen to be 0.0012, rather than the
expected 0.0010.

$$N = \frac{(2.58)^2 (0.9988)(0.0012)}{(0.0002/2)^2} = 797{,}809.$$

The trial will have to include approximately 800,000 subjects in order to yield the desired
precision.

6.5 POWER
Power is the probability that the statistical test results in rejection of H0 when a specified
alternative is true. The "stronger" the power, the better the chance that the null hypothesis will
be rejected (i.e., the test results in a declaration of "significance") when, in fact, H0 is false. The
larger the power, the more sensitive the test. Power is defined as 1 − β. The larger the β error,
the weaker the power. Remember that β is the error resulting from accepting H0 when H0 is
false. Therefore, 1 − β is the probability of rejecting H0 when H0 is false.
From an idealistic point of view, the power of a test should be calculated before an experiment
is conducted. In addition to defining the properties of the test, power is used to help
compute the sample size, as discussed above. Unfortunately, many experiments proceed without
consideration of power (or β). This results from the difficulty of choosing an appropriate
value of β. There is no traditional value of β to use, as is the case for α, where 5% is usually
used. Thus, the power of the test is often computed after the experiment has been completed.
Power is best described by diagrams such as those shown previously in this chapter
(Figs. 6.1 and 6.2). In these figures, β is the area of the curve represented by the alternative
hypothesis that is included in the region of acceptance defined by the null hypothesis.
The concept of power is also illustrated in Figure 6.3. To illustrate the calculation of power,
we will use the data presented for the test of a new antihypertensive agent (sect. 6.2), a paired-sample
test, with σ = 7 and H0: Δ = 0. The test is performed at the 5% level of significance. Let us
suppose that the sample size is limited by cost. The sponsor of the test had sufficient funds
to pay for a study that included only 12 subjects. The design described earlier in this chapter
(sect. 6.2) used 26 patients with β specified equal to 0.05 (power = 0.95). With 12 subjects,
the power will be considerably less than 0.95. The following discussion shows how power is
calculated.
The cutoff points for statistical significance (which specify the critical region) are defined
by α, N, and σ. Thus, the values of δ that will lead to a significant result for a two-sided test are
as follows:

$$Z = \frac{\delta}{\sigma/\sqrt{N}}, \qquad \delta = \frac{\pm Z\sigma}{\sqrt{N}}.$$

In our example, Z = 1.96 (α = 0.05), σ = 7, and N = 12.

$$\delta = \frac{\pm(1.96)(7)}{\sqrt{12}} = \pm 3.96.$$

Figure 6.3 Illustration of beta or power (1 − β).

Values of δ greater than 3.96 or less than −3.96 will lead to the decision that the products
differ at the 5% level. Having defined the values of δ that will lead to rejection of H0, we obtain
the power for the alternative, Ha: Δ = 5, by computing the probability that an average result, δ,
will be greater than 3.96, if Ha is true (i.e., Δ = 5).
This concept is illustrated in Figure 6.3. Curve B is the distribution with mean equal to 5
and σ = 7. If curve B is the true distribution, the probability of observing a value of δ below
3.96 is the probability of accepting H0 if the alternative hypothesis is true (Δ = 5). This is the
definition of β. This probability can be calculated using the Z transformation.

$$Z = \frac{3.96 - 5}{7/\sqrt{12}} = -0.51.$$

Referring to Table IV.2, the area below +3.96 (Z = −0.51) for curve B is approximately
0.31. The power is 1 − β = 1 − 0.31 = 0.69. The use of 12 subjects results in a power of 0.69 to
"detect" a difference of +5, compared to the 0.95 power to detect such a difference when 26
subjects were used. A power of 0.69 means that if the true difference were 5 mm Hg, the statistical
test will result in significance with a probability of 69%; 31% of the time, such a test will result in
acceptance of H0.
A power curve is a plot of the power, 1 − β, versus alternative values of Δ. Power curves can
be constructed by computing β for several alternatives and drawing a smooth curve through
these points. For a two-sided test, the power curve is symmetrical around the hypothetical
mean, Δ = 0 in our example. The power is equal to α when the alternative is equal to the
hypothetical mean under H0. Thus, the power is 0.05 where Δ = 0 in the power curve (Fig. 6.4).
The power curve for the present example is shown in Figure 6.4.
The following conclusions may be drawn concerning the power of a test if α is kept
constant:

1. The larger the sample size, the larger the power.
2. The larger the difference to be detected (Ha), the larger the power. A large sample size will
be needed in order to have strong power to detect a small difference.
3. The larger the variability (s.d.), the weaker the power.
4. If α is increased, power is increased (β is decreased) (Fig. 6.3). An increase in α (e.g., to 10%)
results in a smaller Z. The cutoff points move closer to the hypothetical mean, and the area of curve B below the cutoff
point is smaller.

Power is a function of N, Δ, σ, and α.

Figure 6.4 Power curve for N = 12, α = 0.05, σ = 7, and H0: Δ = 0.



A simple way to compute the approximate power of a test is to use the formula for sample
size [Eqs. (6.4) and (6.5), for example] and solve for Z_β. In the previous example, a single-sample
or paired test, Eq. (6.4) is appropriate:

$$N = \left(\frac{\sigma}{\Delta}\right)^2 (Z_\alpha + Z_\beta)^2 \qquad (6.4)$$

$$Z_\beta = \frac{\Delta}{\sigma}\sqrt{N} - Z_\alpha. \qquad (6.10)$$

Once having calculated Z_β, the probability determined directly from Table IV.2 is equal to
the power, 1 − β. See the discussion and examples below.
In the problem discussed above, applying Eq. (6.10) with Δ = 5, σ = 7, N = 12, and
Z_α = 1.96,

$$Z_\beta = \frac{5}{7}\sqrt{12} - 1.96 = 0.51.$$

According to the notation used for Z (Table 6.2), β is the area above Z_β. Power is the area
below Z_β (power = 1 − β). In Table IV.2, the area above Z = 0.51 is approximately 31%. The
power is 1 − β. Therefore, the power is 69%.§
If N is small and the variance is unknown, appropriate values of t should be used in place
of Z_α and Z_β. Alternatively, we can adjust N by subtracting 0.5Z_α² or 0.25Z_α² from the actual
sample size for a one- or two-sample test, respectively. The following examples should make
the calculations clearer.

§ The value corresponding to Z in Table IV.2 gives the power directly. In this example, the area in the table
corresponding to a Z of 0.51 is approximately 0.69.
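The power computation of Eq. (6.10) is a one-liner to program. A minimal sketch (our own, assuming scipy), including the N adjustment used when the standard deviation is estimated:

```python
from math import sqrt
from scipy.stats import norm

def power_paired(delta, sigma, n, alpha=0.05, sigma_known=True):
    """Eq. (6.10): Z_beta = (delta/sigma)*sqrt(N) - Z_alpha, with
    power = Phi(Z_beta).  If sigma was estimated, subtract roughly
    0.5*Z_alpha^2 from N first, as in Example 7 below."""
    z_a = norm.ppf(1 - alpha / 2)
    if not sigma_known:
        n = n - 0.5 * z_a**2
    return norm.cdf((delta / sigma) * sqrt(n) - z_a)

print(power_paired(5, 7, 12))                         # -> ~0.70 (text: ~0.69)
print(power_paired(0.2, 0.3, 18, sigma_known=False))  # Example 7 -> ~0.76
```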
Example 7: A bioavailability study has been completed in which the ratio of the AUCs for
two comparative drugs was submitted as evidence of bioequivalence. The FDA asked for the
power of the test as part of their review of the submission. (Note that this analysis is different
from that presently required by FDA.) The null hypothesis for the comparison is H0: R = 1,
where R is the true average ratio. The test was two-sided with α equal to 5%. Eighteen subjects
took each of the two comparative drugs in a paired-sample design. The standard deviation was
calculated from the final results of the study, and was equal to 0.3. The power is to be determined
for a difference of 20% for the comparison. This means that if the test product is truly more than
20% greater or smaller than the reference product, we wish to calculate the probability that the
ratio will be judged to be significantly different from 1.0. The value of Δ to be used in Eq. (6.10)
is 0.2.

$$Z_\beta = \frac{0.2}{0.3}\sqrt{16} - 1.96 = 0.707.$$

Note that the value of N is taken as 16. This is the inverse of the procedure for determining
sample size, where 0.5Z_α² was added to N. Here we subtract 0.5Z_α² (approximately 2) from N:
18 − 2 = 16. According to Table IV.2, the area corresponding to Z = 0.707 is approximately 0.76.
Therefore, the power of this test is 76%. That is, if the true difference between the formulations
is 20%, a significant difference will be found between the formulations 76% of the time. This
is very close to the 80% power that was recommended before current FDA guidelines were
implemented for bioavailability tests (where Δ = 0.2).
Example 8: A drug product is prepared by two different methods. The average tablet
weights of the two batches are to be compared, weighing 20 tablets from each batch. The average
weights of the two 20-tablet samples were 507 and 511 mg. The pooled standard deviation was
calculated to be 12 mg. The director of quality control wishes to be “sure” that if the average
weights truly differ by 10 mg or more, the statistical test will show a significant difference. When
he was asked, "How sure?", he said 95% sure. This can be translated into a β of 5% or a power
of 95%. This is a two independent groups test. Solving for Z_β from Eq. (6.5), we have

$$Z_\beta = \frac{\Delta}{\sigma}\sqrt{\frac{N}{2}} - Z_\alpha = \frac{10}{12}\sqrt{\frac{19}{2}} - 1.96 = 0.609. \qquad (6.11)$$

As discussed above, the value of N is taken as 19 rather than 20, by subtracting 0.25Z_α²
from N for the two-sample case. Referring to Table IV.2, we note that the power is approximately
73%. The experiment does not have sufficient power according to the director's standards. To
obtain the desired power, we can increase the sample size (i.e., weigh more tablets). (See Exercise
Problem 10.)

6.6 SAMPLE SIZE AND POWER FOR MORE THAN TWO TREATMENTS
(ALSO SEE CHAP. 8)
The problem of computing power or sample size for an experiment with more than two treat-
ments is somewhat more complicated than the relatively simple case of designs with two
treatments. The power will depend on the number of treatments and the form of the null
and alternative hypotheses. Dixon and Massey [5] present a simple approach to determining
power and sample size. The following notation will be used in presenting the solution to this
problem.
Let M1, M2, M3, . . . , Mk be the hypothetical population means of the k treatments. The null
hypothesis is M1 = M2 = M3 = · · · = Mk. As for the two-sample case, we must specify the alternative
values of Mi. The alternative means are expressed as a grand mean, Mt, ± some deviation, Di,
where Σ(Di) = 0. For example, if three treatments are compared for pain, Active A, Active B,
and Placebo (P), the values for the alternative hypothesized means, based on a VAS scale for
pain relief, could be 75 + 10 (85), 75 + 10 (85), and 75 − 20 (55) for the two actives and placebo,
respectively. The sum of the deviations from the grand mean, 75, is 10 + 10 − 20 = 0. The power
is computed based on the following equation:

$$\psi^2 = \frac{\sum (M_i - M_t)^2 / k}{S^2/n}, \qquad (6.12)$$

where n is the number of observations in each treatment group (n is the same for each treatment)
and S² is the common variance. The value of ψ² is referred to Table 6.4 to estimate the required
sample size.
Consider the following example of three treatments in a study measuring the analgesic
properties of two actives and a placebo as described above. Fifteen subjects are in each treatment
group and the variance is 1000. According to Eq. (6.12),

$$\psi^2 = \frac{[(85 - 75)^2 + (85 - 75)^2 + (55 - 75)^2]/3}{1000/15} = 3.0.$$

Table 6.4 gives the approximate power for various values of ψ, at the 5% level, as a function
of the number of treatment groups and the d.f. for error for 3 and 4 treatments. (More detailed
tables, in addition to graphs, are given in Dixon and Massey [5].) Here, we have 42 d.f. and three
treatments with ψ = √3 = 1.73. The power is approximately 0.72 by simple linear interpolation
(42 d.f. for ψ = 1.7). The correct answer with more extensive tables is closer to 0.73.
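For readers without the Dixon and Massey tables, the same power can be approximated from the noncentral F distribution: the noncentrality parameter λ = nΣ(M_i − M_t)²/S² equals kψ² in the notation of Eq. (6.12). A sketch of our own, assuming scipy:

```python
from scipy.stats import f, ncf

def anova_power(means, s2, n, alpha=0.05):
    """Power of the one-way ANOVA F test for hypothesized group
    means, with n subjects per group and common variance s2."""
    k = len(means)
    grand = sum(means) / k
    lam = n * sum((m - grand)**2 for m in means) / s2  # = k * psi^2
    df1, df2 = k - 1, k * (n - 1)
    f_crit = f.ppf(1 - alpha, df1, df2)           # 5%-level critical value
    return 1 - ncf.cdf(f_crit, df1, df2, lam)     # noncentral F tail area

# Two actives (85, 85) and placebo (55), S^2 = 1000, n = 15 per group:
print(anova_power([85, 85, 55], 1000, 15))  # should come out near 0.73
# Trial and error over n (see below) shows n = 17 gives power near 0.80.
```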

Table 6.4 Factors for Computing Power for Analysis of Variance

Alpha = 0.05, k = 3
d.f. error   ψ      Power
10           1.6    0.42
             2.0    0.76
             2.4    0.80
             3.0    0.984
20           1.6    0.62
             1.92   0.80
             2.00   0.83
             3.0    >0.99
30           1.6    0.65
             1.9    0.80
             2.0    0.85
             3.0    >0.99
60           1.6    0.67
             1.82   0.80
             2.0    0.86
             3.0    >0.99
inf          1.6    0.70
             1.8    0.80
             2.0    0.88
             3.0    >0.99

Alpha = 0.05, k = 4
d.f. error   ψ      Power
10           1.4    0.48
             2.0    0.80
             2.6    0.96
20           1.4    0.56
             2.0    0.88
             2.6    0.986
30           1.4    0.59
             2.0    0.90
             2.6    >0.99
60           1.4    0.61
             2.0    0.92
             2.6    >0.99
inf          1.4    0.65
             2.0    0.94
             2.6    >0.99

Table 6.4 can also be used to determine sample size. For example, how many patients
per treatment group are needed to obtain a power of 0.80 in the above example? Applying
Eq. (6.12),

$$\frac{[(85 - 75)^2 + (85 - 75)^2 + (55 - 75)^2]/3}{1000/n} = \psi^2.$$

Solving for ψ²,

$$\psi^2 = 0.2n.$$

We can calculate n by trial and error. For example, with N = 20,

$$0.2N = 4 = \psi^2 \quad \text{and} \quad \psi = 2.$$

For ψ = 2 and N = 20 (d.f. = 57), the power is approximately 0.86 (for d.f. = 60, power =
0.86). For N = 15 (d.f. = 42, ψ = √3), we have calculated (above) that the power is approximately
0.72. A sample size of between 15 and 20 patients per treatment group would give a power of
0.80. In this example, we might guess that 17 patients per group would result in approximately
80% power. Indeed, more exact tables show that a sample size of 17 (ψ = √(0.2 × 17) = 1.85)
corresponds to a power of 0.79.
The same approach can be used for two-way designs, using the appropriate error term
from the analysis of variance.

6.7 SAMPLE SIZE FOR BIOEQUIVALENCE STUDIES (ALSO SEE CHAP. 11)
In its early evolution, bioequivalence was based on the acceptance or rejection of a hypothesis
test. Sample sizes could then be determined by conventional techniques as described in section
6.2. Because of inconsistencies in the decision process based on this approach, the criterion for
acceptance was changed to a two-sided 90% confidence interval or, equivalently, two one-sided
t tests, where the hypotheses are (μ1/μ2) < 0.8 and (μ1/μ2) > 1.25 versus the alternative of
0.8 < (μ1/μ2) < 1.25. This test is based on the antilog of the difference between the averages of
the log-transformed parameters (the geometric means). The test is equivalent to requiring a two-sided 90%
confidence interval for the ratio of means to fall in the interval 0.80 to 1.25 in order to accept
the hypothesis of equivalence. Again, for the currently accepted log-transformed data, the 90%
confidence interval for the antilog of the difference between means must lie between 0.80 and
1.25. The sample-size determination in this case is not as
simple as the conventional determination of sample size described earlier in this chapter. The
method for sample-size determination for nontransformed data has been published by Phillips
[6], along with plots of power as a function of sample size, relative standard deviation (computed
from the ANOVA), and treatment differences. Although the theory behind this computation is
beyond the scope of this book, Chow and Liu [7] give a simple way of approximating the power
and sample size. The sample size for each sequence group is approximately
$$N = (t_{\alpha,2N-2} + t_{\beta,2N-2})^2 \left(\frac{CV}{V - \delta}\right)^2, \qquad (6.13)$$

where N is the number of subjects per sequence, t the appropriate value from the t distribution, α
the significance level (usually 0.10), 1 − β the power (usually 0.8), CV the coefficient of variation,
V the bioequivalence limit, and δ the difference between products.
One would have to have an approximation of the magnitude of the required sample size
in order to approximate the t values. For example, suppose that RSD = 0.20, δ = 0.10, the power is
0.8, and an initial approximation of the sample size is 20 per sequence (a total of 40 subjects).
Applying Eq. (6.13),

$$n = (1.69 + 0.85)^2 \left[\frac{0.20}{0.20 - 0.10}\right]^2 = 25.8.$$

Use a total of 52 subjects. This agrees closely with Phillips's more exact computations.
Dilletti et al. [8] have published a method for determining sample size based on the log-transformed
variables, which is the currently preferred method. Table 6.5, showing sample sizes
for various values of CV, power, and product differences, is taken from their publication.
Based on these tables, using log-transformed estimates of the parameters would result in
a sample size estimate of 38 for a power of 0.8, ratio of 0.9, and CV = 0.20. If the assumed ratio
is 1.1, the sample size is estimated as 32.
Equation (6.13) can also be used to approximate these sample sizes using log values for V
and δ: n = (1.69 + 0.85)²[0.20/(0.223 − 0.105)]² = 19 per sequence, or 38 subjects in total, where
0.223 is the log of 1.25 and 0.105 is the absolute value of the log of 0.9.
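Equation (6.13) can be iterated on N, since the t quantiles depend on 2N − 2 degrees of freedom. A minimal sketch on the log scale (our own construction, assuming scipy; α = 0.05 in each of the two one-sided tests, i.e., the 90% confidence interval):

```python
from math import ceil, log
from scipy.stats import t

def n_per_sequence(cv, ratio, limit=1.25, alpha=0.05, beta=0.20):
    """Eq. (6.13) per-sequence N for a 2x2 crossover on the log
    scale, iterated because the t values have 2N - 2 d.f.
    (For ratio = 1.0 the text uses a two-sided t_beta instead;
    that case is not handled here.)"""
    v, delta = log(limit), abs(log(ratio))
    n = 20  # initial guess, as in the text
    for _ in range(20):
        df = 2 * n - 2
        t_a, t_b = t.ppf(1 - alpha, df), t.ppf(1 - beta, df)
        n = ceil((t_a + t_b)**2 * (cv / (v - delta))**2)
    return n

print(n_per_sequence(0.20, 0.90))  # -> 19 per sequence (38 subjects)
print(n_per_sequence(0.20, 1.10))  # -> 16 per sequence (32 subjects)
```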

Table 6.5 Sample Sizes for Given CV, Power, and Ratio (μT/μR) for Log-Transformed Parameters^a

                         Ratio (μT/μR)
CV (%)   Power (%)   0.85   0.90   0.95   1.00   1.05   1.10   1.15   1.20
5.0      70          10     6      4      4      4      4      6      16
7.5      70          16     6      6      4      6      6      10     34
10.0     70          28     10     6      6      6      8      16     58
12.5     70          42     14     8      8      8      12     24     90
15.0     70          60     18     10     10     10     16     32     128
17.5     70          80     22     12     12     12     20     44     172
20.0     70          102    30     16     14     16     26     56     224
22.5     70          128    36     20     16     20     30     70     282
25.0     70          158    44     24     20     22     38     84     344
27.5     70          190    52     28     24     26     44     102    414
30.0     70          224    60     32     28     32     52     120    490
5.0      80          12     6      4      4      4      6      8      22
7.5      80          22     8      6      6      6      8      12     44
10.0     80          36     12     8      6      8      10     20     76
12.5     80          54     16     10     8      10     14     30     118
15.0     80          78     22     12     10     12     20     42     168
17.5     80          104    30     16     14     16     26     56     226
20.0     80          134    38     20     16     18     32     72     294
22.5     80          168    46     24     20     24     40     90     368
25.0     80          206    56     28     24     28     48     110    452
27.5     80          248    68     34     28     34     58     132    544
30.0     80          292    80     40     32     38     68     156    642
5.0      90          14     6      4      4      4      6      8      28
7.5      90          28     10     6      6      6      8      16     60
10.0     90          48     14     8      8      8      14     26     104
12.5     90          74     22     12     10     12     18     40     162
15.0     90          106    30     16     12     16     26     58     232
17.5     90          142    40     20     16     20     34     76     312
20.0     90          186    50     26     20     24     44     100    406
22.5     90          232    64     32     24     30     54     124    510
25.0     90          284    78     38     28     36     66     152    626
27.5     90          342    92     44     34     44     78     182    752
30.0     90          404    108    52     40     52     92     214    888

^a Source: From Ref. [8].

For a ratio of 1.10 (log = 0.0953), the sample size is n = (1.69 + 0.85)²[0.20/(0.223 − 0.0953)]² =
16 per sequence, or 32 subjects in total.
If the difference between products is specified as zero (ratio = 1.0), the value for t_{β,2n−2}
in Eq. (6.13) should be two sided (Table 6.2). For example, for 80% power (and a large sample
size) use 1.28 rather than 0.84. In the example above with a ratio of 1.0 (0 difference between
products), a power of 0.8, and a CV = 0.2, use a value of (approximately) 1.34 for t_{β,2n−2}:

n = (1.75 + 1.34)²[0.2/0.223]² = 7.7 per group, or 16 total subjects.

An Excel program to calculate the number of subjects required for a crossover study under
various conditions of power and product differences, for both parametric and binary (binomial)
data, is available on the disk accompanying this volume.
This approach to sample-size determination can also be used for studies where the out-
come is dichotomous, often used as the criterion in clinical studies of bioequivalence (cured or
not cured) for topically unabsorbed products or unabsorbed oral products such as sucralfate.
This topic is presented in section 11.4.8.


2 DATA GRAPHICS

“The preliminary examination of most data is facilitated by the use of diagrams. Diagrams prove
nothing, but bring outstanding features readily to the eye; they are therefore no substitute for
such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in
explaining the conclusions founded upon them.” This quote is from Ronald A. Fisher, the father
of modern statistical methodology [1]. Tabulation of raw data can be thought of as the initial
and least refined way of presenting experimental results. Summary tables, such as frequency
distribution tables, are much easier to digest and can be considered a second stage of refine-
ment of data presentation. Summary statistics such as the mean, median, variance, standard
deviation, and the range are concise descriptions of the properties of data, but much informa-
tion is lost in this processing of experimental results. Graphical methods of displaying data
are to be encouraged and are important adjuncts to data analysis and presentation. Graphical
presentations clarify and also reinforce conclusions based on formal statistical analyses. Finally,
the researcher has the opportunity to design aesthetic graphical presentations that command
attention. The popular cliché “A picture is worth a thousand words” is especially apropos to
statistical presentations. We will discuss some key concepts of the various ways in which data
are depicted graphically.

2.1 INTRODUCTION
The diagrams and plots that we will be concerned with in our discussion of statistical methods
can be placed broadly into two categories:
1. Descriptive plots are those whose purpose is to transmit information. These include dia-
grams describing data distributions such as histograms and cumulative distribution plots
(see sect. 1.2.3). Bar charts and pie charts are examples of popular modes of communicating
survey data or product comparisons.
2. Plots that describe relationships between variables usually show an underlying, but unknown
analytic relationship between the variables that we wish to describe and understand. These
relationships can range from relatively simple to very complex, and may involve only two
variables or many variables. One of the simplest relationships, but probably the one with
greatest practical application, is the straight-line relationship between two variables, as
shown in the Beer’s law plot in Figure 2.1. Chapter 7 is devoted to the analysis of data
involving variables that have a linear relationship.
When analyzing and depicting data that involve relationships, we are often presented
with data in pairs (X, Y pairs). In Figure 2.1, the optical density Y and the concentration X are
the data pairs. When considering the relationship of two variables, X and Y, one variable can
often be considered the response variable, which is dependent on the selection of the second
or causal variable. The response variable Y (optical density in our example) is known as the
dependent variable. The value of Y depends on the value of the independent variable, X (drug
concentration). Thus, in the example in Figure 2.1, we think of the value of optical density as
being dependent on the concentration of drug.

2.2 THE HISTOGRAM


The histogram, sometimes known as a bar graph, is one of the most popular ways of presenting
and summarizing data. All of us have seen bar graphs, not only in scientific reports but also in
advertisements and other kinds of presentations illustrating the distribution of scientific data.

Figure 2.1 Beer’s law plot illustrating a linear relationship between two variables.

The histogram can be considered as a visual presentation of a frequency table. The frequency, or
proportion, of observations in each class interval is plotted as a bar, or rectangle, where the area
of the bar is proportional to the frequency (or proportion) of observations in a given interval.
An example of a histogram is shown in Figure 2.2, where the data from the frequency table in
Table 1.2 have been used as the data source. As is the case with frequency tables, class intervals
for histograms should be of equal width. When the intervals are of equal width, the height of
the bar is proportional to the frequency of observations in the interval. If the intervals are not of
equal width, the histogram is not easily or obviously interpreted, as shown in Figure 2.2(B).
The choice of intervals for a histogram depends on the nature of the data, the distribution
of the data, and the purpose of the presentation. In general, rules of thumb similar to those used
for frequency distribution tables (sect. 1.2) can be used. Eight to twenty equally spaced intervals
usually are sufficient to give a good picture of the data distribution.

Figure 2.2 Histogram of data derived from Table 1.2.
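As an illustration, a histogram with equal-width intervals takes only a few lines of matplotlib; the data below are hypothetical stand-ins, since the Table 1.2 values are not reproduced in this excerpt:

```python
import matplotlib.pyplot as plt

# Hypothetical tablet potencies (mg); stand-ins for the Table 1.2 data.
potencies = [89, 92, 94, 95, 96, 97, 97, 98, 99, 99, 100, 100, 100,
             101, 101, 102, 102, 103, 104, 105, 106, 108, 110, 112]

# 8 to 20 equal-width intervals is the usual rule of thumb; with equal
# widths, bar height (and area) is proportional to the class frequency.
plt.hist(potencies, bins=10, edgecolor="black")
plt.xlabel("Potency (mg)")
plt.ylabel("Frequency")
plt.title("Histogram of tablet potencies")
plt.show()
```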

2.3 CONSTRUCTION AND LABELING OF GRAPHS


Proper construction and labeling of graphs are crucial elements in graphical data representation.
The design and actual construction of graphs are not in themselves difficult. The preparation
of a good graph, however, requires careful thought and competent technical skills. One needs
not only a knowledge of statistical principles, but also, in particular, computer and drafting
competency. There are no firm rules for preparing good graphical presentations. Mostly, we
rely on experience and a few guidelines. Both books and research papers have addressed the
need for a more scientific guide to optimal graphics, whose quality, after all, is measured by how well the
graph communicates the intended message(s) to the individuals who are intended to read and
interpret the graphs. Still, no rules will cover all situations. One must be clear that no matter
how well a graph or chart is conceived, if the draftsmanship and execution are poor, the graph
will fail to achieve its purpose.
A “good” graph or chart should be as simple as possible, yet clearly transmit its intended
message. Superfluous notation, confusing lines or curves, and inappropriate draftsmanship
(lettering, etc.) that can distract the reader are signs of a poorly constructed graph. The books
Statistical Graphics, by Schmid [2], and The Visual Display of Quantitative Information by Tufte
[3] are recommended for those who wish to study examples of good and poor renderings of
graphic presentations. For example, Schmid notes that visual contrast should be intentionally
used to emphasize important characteristics of the graph. Here, we will present a few examples
to illustrate the recommendations for good graphic presentation as well as examples of graphs
that are not prepared well or fail to illustrate the facts fairly.
Figure 2.3 shows the results of a clinical study that was designed to compare an active
drug to a placebo for the treatment of hypertension. This graph was constructed from the X, Y
pairs, time and blood pressure, respectively. Each point on the graph is the average blood
pressure for either drug or placebo at some point in time subsequent to the initiation of the
study.
Proper construction and labeling of the typical rectilinear graph should include the fol-
lowing considerations:

1. A title should be given. The title should be brief and to the point, enabling the reader to
understand the purpose of the graph without having to resort to reading the text. The title
can be placed below or above the graph as in Figure 2.3.
2. The axes should be clearly delineated and labeled. In general, the zero (0) points of both axes
should be clearly indicated. The ordinate (the Y axis) is usually labeled with the description
parallel to the Y axis. Both the ordinate and abscissa (X axis) should each be appropriately

labeled and subdivided in units of equal width (of course, the X and Y axes almost always
have different subdivisions). In the example in Figure 2.3, note the units of mm Hg and
weeks for the ordinate and abscissa, respectively. Grid lines may be added [Fig. 2.4(E)] but,
if used, should be kept to a minimum, not be prominent, and should not interfere with the
interpretation of the figure.

[Figure 2.3 appears here: a plot of diastolic blood pressure (mm Hg), scale 80 to 115, versus time (weeks, 0 to 8) after initiation of the study, for the drug and placebo groups.]

Figure 2.3 Blood pressure as a function of time in a clinical study comparing drug (+, average of 50 patients) and placebo (average of 45 patients) with a regimen of one tablet per day.

[Figure 2.4 appears here: five panels (A)-(E) plotting exercise time (sec), or the difference in exercise time between drugs (sec), against time after dosing (hr) for Drug I and Drug II, drawn with different axis scales and breaks.]

Figure 2.4 Various graphs of the same data presented in different ways. Exercise time at various time intervals after administration of single doses of two nitrate products (Drug I and Drug II).
3. The numerical values assigned to the axes should be appropriately spaced so as to nicely
cover the extent of the graph. This can easily be accomplished by trial and error and a little
manipulation. The scales and proportions should be constructed to present a fair picture of
the results and should not be exaggerated so to prejudice the interpretation. Sometimes, it
may be necessary to skip or omit some of the data to achieve this objective. In these cases,
the use of a “broken line” is recommended to clearly indicate the range of data not included
in the graph (Fig. 2.4).

4. If appropriate, a key explaining the symbols used in the graph should be used. For example,
at the bottom of Figure 2.3, the key identifies the symbols for placebo and drug. In
many cases, labeling the curves directly on the graph (Fig. 2.4) results in more clarity.
5. In situations where the graph is derived from laboratory data, inclusion of the source of the
data (name, laboratory notebook number, and page number, for example) is recommended.

Usually graphs should stand on their own, independent of the main body of the text.
Examples of various ways of plotting data, derived from a study of exercise time at various
time intervals after administration of a single dose of two long-acting nitrate products to anginal
patients, are shown in Figures 2.4(A) to 2.4(E). All of these plots are accurate representations of
the experimental results, but each gives the reader a different impression. It would be wrong to
expand or contract the axes of the graph, or otherwise distort the graph, in order to convey an
incorrect impression to the reader. Most scientists are well aware of how data can be manipulated
to give different impressions. If obvious deception is intended, the experimental results will not
be taken seriously.
When examining the various plots in Figure 2.4, one could not say which plot best repre-
sents the meaning of the experimental results without knowledge of the experimental details,
in particular the objective of the experiment, the implications of the experimental outcome, and
the message that is meant to be conveyed. For example, if an improvement of exercise time of
120 seconds for one drug compared to the other is considered to be significant from a medical
point of view, the graphs labeled A, C, and E in Figure 2.4 would all seem appropriate in con-
veying this message. The graphs labeled B and D show this difference less clearly. On the other
hand, if 120 seconds is considered to be of little medical significance, B and D might be a better
representation of the data.
Note that in plot A of Figure 2.4, the ordinate (exercise time) is broken, indicating that
some values have been skipped. This is not meant to be deceptive, but is intentionally done
to better show the differences between the two drugs. As long as the zero point and the break
in the axis are clearly indicated, and the message is not distorted, such a procedure is entirely
acceptable.
Figures 2.4(B) and 2.5 are exaggerated examples of plots that may be considered not to
reflect accurately the significance of the experimental results. In Figure 2.4(B), the clinically
significant difference of approximately 120 seconds is made to look very small, tending to
diminish drug differences in the viewer’s mind. Also, fluctuations in the hourly results appear
to be less than the data truly suggest. In Figure 2.5, a difference of 5 seconds in exercise time
between the two drugs appears very large. Care should be taken when constructing (as well as
reading) graphs so that experimental conclusions come through clear and true.

6. If more than one curve appears on the same graph, a convenient way to differentiate the
curves is to use different symbols for the experimental points (e.g., ○, ×, +) and, if
necessary, to connect the points in different ways (e.g., solid, dashed, and dotted lines). A key or
label is used, which is helpful in distinguishing the various curves, as shown in Figures 2.3
to 2.6. Other ways of differentiating curves include different kinds of crosshatching and use
of different colors.

Figure 2.5 Exercise time at various time intervals after administration of two nitrate products. •, product I; +, product II.

Figure 2.6 Plot of dissolution of four successive batches of a commercial tablet product; batches 1 to 4 are plotted with different symbols.

7. One should take care not to place too many curves on the same graph, as this can result in
confusion. There are no specific rules in this regard. The decision depends on the nature of
the data, and how the data look when they are plotted. The curves graphed in Figure 2.7
are cluttered and confusing. The curves should be presented differently or separated into
two or more graphs. Figure 2.8 is a clearer depiction of the dissolution results of the five
formulations shown in Figure 2.7.
8. The standard deviation may be indicated on graphs as shown in Figure 2.9. However, when
the standard deviation is indicated on a graph (or in a table, for that matter), it should be
made clear whether the variation described in the graph is an indication of the standard
deviation (S) or the standard deviation of the mean (Sx̄ ). The standard deviation of the
mean, if appropriate, is often preferable to the standard deviation, not only because the
values on the graph are mean values, but also because Sx̄ is smaller than the s.d. and
therefore produces less clutter. Overlapping standard deviations, as shown in Figure 2.10, should
be avoided, as this representation of the experimental results is usually more confusing than
clarifying.
9. The manner in which the points on a graph should be connected is not always obvious.
Should the individual points be connected by straight lines, or should a smooth curve that
approximates the points be drawn through the data? (See Fig. 2.11.) If the graphs represent
functional relationships, the data should probably be connected by a smooth curve. For
example, the blood level versus time data shown in Figure 2.11 are described most accurately
by a smooth curve. Although, theoretically, the points should not be connected by straight
lines as shown in Figure 2.11(A), such graphs are often depicted this way. Connecting the
individual points with straight lines may be considered acceptable if one recognizes that
this representation is meant to clarify the graphical presentation, or is done for some other
appropriate reason. In the blood-level example, the area under the curve is proportional to
the amount of drug absorbed. The area is often computed by the trapezoidal rule [4], and
depiction of the data as shown in Figure 2.11(A) makes it easier to visualize and perform
such calculations.
Figure 2.12 shows another example in which connecting points by straight lines is con-
venient but may not be a good representation of the experimental outcome. The straight line
connecting the blood pressure at zero time (before drug administration) to the blood pressure
after two weeks of drug administration suggests a gradual decrease (a linear decrease) in blood

Figure 2.7 Plot of dissolution time of five different commercial formulations (products A to E) of the same drug, each plotted with a different symbol.

Figure 2.8 Individual plots of dissolution of the five formulations shown in Fig. 2.7.

pressure over the two-week period. In fact, no measurements were made during the initial
two-week interval. The 10-mm Hg decrease observed after two weeks of therapy may have
occurred before the two-week reading (e.g., in one week, as indicated by the dashed line in
Fig. 2.12). One should be careful to ensure that graphs constructed in such a manner are not
misinterpreted.
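The trapezoidal-rule calculation mentioned in item 9 is easy to carry out in a few lines of code. The following is a minimal Python sketch; the helper name and the times and concentrations below are hypothetical illustrations, not data from this chapter.

```python
# Illustrative sketch (function name and data are hypothetical, not from the text).
def trapezoidal_auc(times, levels):
    """Approximate area under a blood-level curve by summing trapezoids
    formed between successive (time, concentration) points."""
    auc = 0.0
    for i in range(len(times) - 1):
        width = times[i + 1] - times[i]               # length of the interval
        avg_height = (levels[i] + levels[i + 1]) / 2  # mean of the two levels
        auc += width * avg_height                     # area of one trapezoid
    return auc

t = [0, 1, 2, 4, 6, 8]                  # hr (hypothetical sampling times)
c = [0.0, 4.2, 3.1, 1.8, 0.9, 0.4]      # mcg/mL (hypothetical blood levels)
print(trapezoidal_auc(t, c))            # AUC in mcg*hr/mL (here 14.65)
```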

Figure 2.9 Plot of exercise time as a function of time for an antianginal drug showing mean values and standard
error of the mean.

Figure 2.10 Graph comparing two antianginal drugs that is confusing and cluttered because of the overlapping
standard deviations. •, drug A; o, drug B.

2.4 SCATTER PLOTS (CORRELATION DIAGRAMS)


Although the applications of correlation will be presented in some detail in chapter 7, we will
introduce the notion of scatter plots (also called correlation diagrams or scatter diagrams) at this
time. This type of plot or diagram is commonly used when presenting results of experiments.
A typical scatter plot is illustrated in Figure 2.13. Data are collected in pairs (X and Y) with the
objective of demonstrating a trend or relationship (or lack of relationship) between the X and
Y variables. Usually, we are interested in showing a linear relationship between the variables
(i.e., a straight line). For example, one may be interested in demonstrating a relationship (or
correlation) between the time to 80% dissolution of various tablet formulations of a particular
drug and the fraction of the dose absorbed when human subjects take the various tablets.
Figure 2.11 Plot of blood level versus time data illustrating two ways of drawing the curves.

Figure 2.12 Graph of blood pressure reduction with time of an antihypertensive drug illustrating possible misinterpretation that may occur when points are connected by straight lines.

[Figure 2.13 here] Figure 2.13 Scatter plot showing the correlation of dissolution time and in vivo absorption of six tablet formulations (A to F), each plotted with a different symbol. Ordinate: time to 80% dissolution (min), 0 to 45; abscissa: fraction of dose absorbed in vivo, 0.0 to 1.0.

The data plotted
in Figure 2.13 show pictorially that as dissolution increases (i.e., the time to 80% dissolution
decreases) in vivo absorption increases. Scatter plots involve data pairs, X and Y, both of which
are variable. In this example, dissolution time and fraction absorbed are both random variables.

2.5 SEMILOGARITHMIC PLOTS


Several important kinds of experiments in the pharmaceutical sciences result in data such
that the logarithm of the response (Y) is linearly related to an independent variable, X. The
semilogarithmic plot is useful when the response (Y) is best depicted as proportional changes
relative to changes in X, or when the spread of Y is very large and cannot be easily depicted
on a rectilinear scale. Semilog graph paper has the usual equal interval scale on the X axis and
the logarithmic scale on the Y axis. In the logarithmic scale, equal intervals represent ratios. For
example, the distance between 1 and 10 will exactly equal the distance between 10 and 100 on a
logarithmic scale. In particular, first-order kinetic processes, often apparent in drug degradation
and pharmacokinetic systems, show a linear relationship when log C is plotted versus time.
First-order processes can be expressed by the following equation:

log C = log C0 − kt/2.3        (2.1)

where C is the concentration at time t, C0 the concentration at time 0, k the first-order rate
constant, t the time, and log represents logarithm to the base 10.
Table 2.1 shows blood-level data obtained after an intravenous injection of a drug
described by a one-compartment model [3].
Figure 2.14 shows two ways of plotting the data in Table 2.1 to demonstrate the linearity
of the log C versus t relationship.
1. Figure 2.14(A) shows a plot of log C versus time. The resulting straight line is a consequence
of the relationship of log concentration and time as shown in Eq. 2.1. This is an equation of
a straight line with the Y intercept equal to log C0 and a slope equal to −k/2.3. Straight-line
relationships are discussed in more detail in chapter 8.

Table 2.1 Blood Levels After Intravenous Injection of Drug

Time after injection, t (hr) Blood level, C (␮g/mL) Log blood level
0 20 1.301
1 10 1.000
2 5 0.699
3 2.5 0.398
4 1.25 0.097

[Figure 2.14, panels (A) and (B), here] Figure 2.14 Linearizing plots of data from Table 2.1. (Plot A) log C versus time; (plot B) semilog plot. Panel A ordinate: log concentration, 0 to 1.2; panel B ordinate: concentration, 1 to 100, on a two-cycle logarithmic scale; abscissa for both: time after injection (hr), 0 to 5.

2. Figure 2.14(B) shows a more convenient way of plotting the data of Table 2.1, making use of
semilog graph paper. This paper has a logarithmic scale on the Y axis and the usual arithmetic,
linear scale on the X axis. The logarithmic scale is constructed so that the spacing corresponds
to the logarithms of the numbers on the Y axis. For example, the distance between 1 and 2 is
the same as that between 2 and 4. (Log 2−log 1) is equal to (log 4−log 2). The semilog graph
paper depicted in Figure 2.14(B) is two-cycle paper. The Y (log) axis has been repeated two
times. The decimal point for the numbers on the Y axis is accommodated to the data. In our
example, the data range from 1.25 to 20 and the Y axis is adjusted accordingly, as shown in
Figure 2.14(B). The data may be plotted directly on this paper without the need to look up
the logarithms of the concentration values.
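As a brief numerical illustration of Eq. (2.1), the following Python sketch fits a straight line to log C versus t for the data of Table 2.1 and recovers the first-order rate constant. It uses 2.303 (= ln 10) for the factor written as 2.3 in Eq. (2.1).

```python
import numpy as np

# Data of Table 2.1
t = np.array([0, 1, 2, 3, 4])           # hr
c = np.array([20, 10, 5, 2.5, 1.25])    # mcg/mL

slope, intercept = np.polyfit(t, np.log10(c), 1)  # least-squares straight line
k = -slope * 2.303                                # Eq. (2.1): slope = -k/2.3
print(round(slope, 3), round(intercept, 3), round(k, 3))
# slope = -0.301, log C0 = 1.301 (i.e., C0 = 20 mcg/mL), k = 0.693 per hr
```

The fitted intercept returns log C0 and the slope returns −k/2.303, consistent with the 1-hour half-life evident in Table 2.1.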

2.6 OTHER DESCRIPTIVE FIGURES


Most of the discussion in this chapter has been concerned with plots that show relationships
between variables such as blood pressure changes following two or more treatments, or drug
decomposition as a function of time. Often occasions arise in which graphical presentations are
better made using other more pictorial techniques. These approaches include the popular bar
and pie charts. Schmid [2] differentiates bar charts into two categories: (a) column charts in which
there is a vertical orientation and (b) bar charts in which the bars are horizontal. In general, the
bar charts are more appropriate for comparison of categorical variables, whereas the column
chart is used for data showing relationships such as comparisons of drug effect over time.
Bar charts are very simple but effective visual displays. They are usually used to compare
some experimental outcome or other relevant data where the length of the bar represents the
magnitude. There are many variations of the simple bar chart [2]; an example is shown in Figure
2.15. In Figure 2.15(A), patients are categorized as having a good, fair, or poor response. Forty
percent of the patients had a good response, 35% had a fair response, and 25% had a poor
response.
Figure 2.15(B) shows bars in pairs to emphasize the comparative nature of two treatments.
It is clear from this diagram that Treatment X is superior to Treatment Y. Figure 2.15(C) is another
way of displaying the results shown in Figure 2.15(B). Which chart do you think better sends
the message of the results of this comparative study, Figure 2.15(B) or 2.15(C)? One should be
aware that the results correspond only to the length of the bar. If the order in which the bars
are presented is not obvious, displaying bars in order of magnitude is recommended. In the
example in Figure 2.15, the order is based on the nature of the results, “Good,” “Fair,” and
“Poor.” Beyond this, the principal objective is to prepare an aesthetic presentation that
emphasizes but does not exaggerate the
results. For example, the use of graphic techniques such as shading, crosshatching, and color,
tastefully executed, can enhance the presentation.
Column charts are prepared in a similar way to bar charts. As noted above, whether or not
a bar or column chart is best to display data is not always clear. Data trends over time usually
are best shown using columns. Figure 2.16 shows the comparison of exercise time for two drugs
using a column chart. This is the same data used to prepare Figure 2.4(A) (also, see Exercise
Problem 8 at the end of this chapter).

Figure 2.15 Graphical representation of patient responses to drug therapy.


[Figure 2.16 here] Figure 2.16 Exercise time for two drugs in the form of a column chart using data of Figure 2.4. Ordinate: exercise time (sec), 250 to 450; abscissa: time after dosing (hr), 1 to 5; paired columns for Drug 1 and Drug 2.

Pie charts are popular ways of presenting categorical data. Although the principles used in
the construction of these charts are relatively simple, thought and care are necessary to convey
the correct message. For example, dividing the circle into too many categories can be confusing
and misleading. As a rule of thumb, no more than six sectors should be used. Another problem
with pie charts is that it is not always easy to differentiate two segments that are reasonably
close in size, whereas in the bar graph, values close in size are easily differentiated, since length
is the critical feature.
The circle (or pie) represents 100%, or all of the results. Each segment (or slice of pie) has an
area proportional to the area of the circle, representative of the contribution due to the particular
segment. In the example shown in Figure 2.17(A), the pie represents the anti-inflammatory
drug market. The slices are proportions of the market accounted for by major drugs in this
therapeutic class. These charts are frequently used for business and economic descriptions, but
can be applied to the presentation of scientific data in appropriate circumstances. Figure 2.17(B)
shows the proportion of patients with good, fair, and poor responses to a drug in a clinical trial
(see also Fig. 2.15).
Of course, we have not exhausted all possible ways of presenting data graphically. We
have introduced the cumulative plot in section 1.2.3. Other kinds of plots are the stick diagram
(analogous to the histogram) and frequency polygon [5]. The number of ways in which data
can be presented is limited only by our own ingenuity. An elegant pictorial presentation of
data can “make” a report or government submission. On the other hand, poor presentation of
data can detract from an otherwise good report. The book Statistical Graphics by Calvin Schmid
is recommended for those who wish detailed information on the presentation of graphs and
charts.

Figure 2.17 Examples of pie charts.



KEY TERMS
Bar charts
Bar graphs
Column charts
Correlation
Data pairs
Dependent variables
Histogram
Independent variables
Key
Pie charts
Scatter plots
Semilog plots

EXERCISES
1. Plot the following data, preparing and labeling the graph according to the guidelines out-
lined in this chapter. These data are the result of preparing various modifications of a
formulation and observing the effect of the modifications on tablet hardness.

Formulation modification

Starch (%) Lactose (%) Tablet hardness (kg)


10 5 8.3
10 10 9.1
10 15 9.6
10 20 10.2
5 5 9.1
5 10 9.4
5 15 9.8
5 20 10.4

(Hint: Plot these data on a single graph where the Y axis is tablet hardness and the X axis
is lactose concentration. There will be two curves, one at 10% starch and the other at 5%
starch.)
2. Prepare a histogram from the data of Table 1.3. Compare this histogram to that shown in
Figure 2.2(A). Which do you think is a better representation of the data distribution?
3. Plot the following data and label the graph appropriately.

Patient   X: response to product A   Y: response to product B
1         2.5                        3.8
2         3.6                        2.4
3         8.9                        4.7
4         6.4                        5.9
5         9.5                        2.1
6         7.4                        5.0
7         1.0                        8.5
8         4.7                        7.8

What conclusion(s) can you draw from this plot if the responses are pain relief scores, where
a high score means more relief?
4. A batch of tablets was shown to have 70% with no defects, 15% slightly chipped, 10%
discolored, and 5% dirty. Construct a pie chart from these data.
5. The following data from a dose–response experiment, a measure of physical activity, are the
responses of five animals at each of three doses.

ANOVA:
Analysis of Variance
The basic ANOVA situation
Two variables: 1 Categorical, 1 Quantitative

Main Question: Does the mean of the quantitative variable depend on which group (given by the categorical variable) the individual is in?

If the categorical variable has only 2 values:
• 2-sample t-test
ANOVA allows for 3 or more groups
An example ANOVA situation
Subjects: 25 patients with blisters
Treatments: Treatment A, Treatment B, Placebo
Measurement: # of days until blisters heal

Data [and means]:


• A: 5,6,6,7,7,8,9,10 [7.25]
• B: 7,7,8,9,9,10,10,11 [8.875]
• P: 7,9,9,10,10,10,11,12,13 [10.11]

Are these differences significant?


Informal Investigation
Graphical investigation:
• side-by-side box plots
• multiple histograms

Whether the differences between the groups are


significant depends on
• the difference in the means
• the standard deviations of each group
• the sample sizes

ANOVA determines P-value from the F statistic


Side-by-Side Boxplots
[Boxplots of days (y axis, about 5 to 13) by treatment A, B, P (x axis).]
What does ANOVA do?
At its simplest (there are extensions) ANOVA
tests the following hypotheses:
H0: The means of all the groups are equal.

Ha: Not all the means are equal


• doesn’t say how or which ones differ.
• Can follow up with “multiple comparisons”

Note: we usually refer to the sub-populations as


“groups” when doing ANOVA.
Assumptions of ANOVA
• each group is approximately normal
  – check this by looking at histograms and/or normal quantile plots, or rely on assumptions about the population
  – can handle some nonnormality, but not severe outliers
• standard deviations of each group are approximately equal
  – rule of thumb: ratio of largest to smallest sample st. dev. must be less than 2:1
Normality Check
We should check for normality using:
• assumptions about the population
• histograms for each group
• a normal quantile plot for each group

With such small data sets, there really isn't a good way to check normality from the data, but we make the common assumption that physical measurements of people tend to be normally distributed.
Standard Deviation Check

Variable treatment N Mean Median StDev


days A 8 7.250 7.000 1.669
B 8 8.875 9.000 1.458
P 9 10.111 10.000 1.764

Compare largest and smallest standard deviations:


• largest: 1.764
• smallest: 1.458
• 1.458 x 2 = 2.916 > 1.764

Note: variance ratio of 4:1 is equivalent.


Notation for ANOVA
• n = number of individuals all together
• I = number of groups
• x̄ = mean for the entire data set

Group i has:
• n_i = # of individuals in group i
• x_ij = value for individual j in group i
• x̄_i = mean for group i
• s_i = standard deviation for group i
How ANOVA works (outline)
ANOVA measures two sources of variation in the data and compares their relative sizes:
• variation BETWEEN groups: for each data value, look at the difference between its group mean and the overall mean, (x̄_i − x̄)²
• variation WITHIN groups: for each data value, look at the difference between that value and the mean of its group, (x_ij − x̄_i)²

The ANOVA F-statistic is a ratio of the Between Group Variation divided by the Within Group Variation:

F = Between/Within = MSG/MSE

A large F is evidence against H0, since it indicates that there is more difference between groups than within groups.
Minitab ANOVA Output
Analysis of Variance for days
Source DF SS MS F P
treatment 2 34.74 17.37 6.45 0.006
Error 22 59.26 2.69
Total 24 94.00

R ANOVA Output
Df Sum Sq Mean Sq F value Pr(>F)
treatment 2 34.7 17.4 6.45 0.0063 **
Residuals 22 59.3 2.7
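The output above can be reproduced in a few lines. A minimal Python sketch, assuming SciPy is available, run on the blister-healing data:

```python
from scipy import stats

a = [5, 6, 6, 7, 7, 8, 9, 10]               # treatment A
b = [7, 7, 8, 9, 9, 10, 10, 11]             # treatment B
p = [7, 9, 9, 10, 10, 10, 11, 12, 13]       # placebo

f_stat, p_value = stats.f_oneway(a, b, p)   # one-way ANOVA
print(round(f_stat, 2), round(p_value, 4))  # 6.45 0.0063, as in the output above
```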
How are these computations made?
We want to measure the amount of variation due to BETWEEN group variation and WITHIN group variation. For each data value, we calculate its contribution to:
• BETWEEN group variation: (x̄_i − x̄)²
• WITHIN group variation: (x_ij − x̄_i)²
An even smaller example
Suppose we have three groups
• Group 1: 5.3, 6.0, 6.7
• Group 2: 5.5, 6.2, 6.4, 5.7
• Group 3: 7.5, 7.2, 7.9
We get the following statistics:

SUMMARY
Groups Count Sum Average Variance
Column 1 3 18 6 0.49
Column 2 4 23.8 5.95 0.176667
Column 3 3 22.6 7.533333 0.123333
Excel ANOVA Output

Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        5.127333   2    2.563667   10.21575   0.008394   4.737416
Within Groups         1.756667   7    0.250952
Total                 6.884      9

Degrees of freedom: Between Groups df = 1 less than the number of groups; Within Groups df = number of data values − number of groups (equals the df for each group added together); Total df = 1 less than the number of individuals (just like other situations).
Computing the ANOVA F statistic

                            WITHIN difference:            BETWEEN difference:
data   group   group mean   data − group mean   squared   group mean − overall mean   squared
5.3    1       6.00         −0.70               0.490     −0.44                       0.194
6.0    1       6.00          0.00               0.000     −0.44                       0.194
6.7    1       6.00          0.70               0.490     −0.44                       0.194
5.5    2       5.95         −0.45               0.203     −0.49                       0.240
6.2    2       5.95          0.25               0.063     −0.49                       0.240
6.4    2       5.95          0.45               0.203     −0.49                       0.240
5.7    2       5.95         −0.25               0.063     −0.49                       0.240
7.5    3       7.53         −0.03               0.001      1.09                       1.188
7.2    3       7.53         −0.33               0.109      1.09                       1.188
7.9    3       7.53          0.37               0.137      1.09                       1.188
TOTAL                                           1.757                                 5.106
TOTAL/df                                        0.251                                 2.553

overall mean: 6.44; F = 2.553/0.251 ≈ 10.17 (10.21575 from the unrounded sums, as in the Excel output)


Minitab ANOVA Output
Analysis of Variance for days
Source      DF   SS      MS      F      P
treatment   2    34.74   17.37   6.45   0.006
Error       22   59.26   2.69
Total       24   94.00

Degrees of freedom: treatment DF = 1 less than the # of groups; Error DF = # of data values − # of groups (equals the df for each group added together); Total DF = 1 less than the # of individuals (just like other situations).
Minitab ANOVA Output
Analysis of Variance for days
Source      DF   SS      MS      F      P
treatment   2    34.74   17.37   6.45   0.006
Error       22   59.26   2.69
Total       24   94.00

SS stands for sum of squares. ANOVA splits the total sum of squares, SST = Σ_obs (x_ij − x̄)², into the between-group part SSG = Σ_obs (x̄_i − x̄)² (treatment) and the within-group part SSE = Σ_obs (x_ij − x̄_i)² (error).
Minitab ANOVA Output
Analysis of Variance for days
Source      DF   SS      MS      F      P
treatment   2    34.74   17.37   6.45   0.006
Error       22   59.26   2.69
Total       24   94.00

MSG = SSG/DFG; MSE = SSE/DFE; F = MSG/MSE. The P-value comes from the F(DFG, DFE) distribution. (P-values for the F statistic are in Table E.)
So how big is F?
Since F is Mean Square Between / Mean Square Within = MSG/MSE, a large value of F indicates relatively more difference between groups than within groups (evidence against H0).
To get the P-value, we compare to the F(I−1, n−I) distribution:
• I − 1 degrees of freedom in the numerator (# of groups − 1)
• n − I degrees of freedom in the denominator (the rest of the df)
Connections between SST, MST, and standard deviation
If we ignore the groups for a moment and just compute the standard deviation of the entire data set, we see

s² = Σ(x_ij − x̄)²/(n − 1) = SST/DFT = MST

So SST = (n − 1)s², and MST = s². That is, SST and MST measure the TOTAL variation in the data set.
Connections between SSE, MSE, and standard deviation
Remember: s_i² = Σ_j (x_ij − x̄_i)²/(n_i − 1) = SS[Within Group i]/df_i
So SS[Within Group i] = (s_i²)(df_i).
This means that we can compute SSE from the standard deviations and sizes (df) of each group:

SSE = SS[Within] = Σ SS[Within Group i] = Σ s_i²(n_i − 1) = Σ s_i²(df_i)
Pooled estimate for st. dev
One of the ANOVA assumptions is that all groups have the same standard deviation. We can estimate this with a weighted average:

s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂² + … + (n_I − 1)s_I²]/(n − I)
     = [(df₁)s₁² + (df₂)s₂² + … + (df_I)s_I²]/(df₁ + df₂ + … + df_I)

so s_p² = SSE/DFE = MSE: the MSE is the pooled estimate of variance.
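These identities can be checked numerically from the summary statistics given earlier (the N and StDev columns for groups A, B, and P). A minimal Python sketch:

```python
ns = [8, 8, 9]                # group sizes (A, B, P)
sds = [1.669, 1.458, 1.764]   # group standard deviations

sse = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))  # SSE = sum of s_i^2 * df_i
dfe = sum(ns) - len(ns)                               # DFE = n - I = 22
mse = sse / dfe                                       # pooled variance estimate
print(round(sse, 2), round(mse, 2), round(mse ** 0.5, 3))
# 59.27 2.69 1.641 -- matching SSE, MSE, and "Pooled StDev" in the Minitab
# output (up to rounding of the printed standard deviations)
```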
In Summary
SST = Σ_obs (x_ij − x̄)² = s²(DFT)
SSE = Σ_obs (x_ij − x̄_i)² = Σ_groups s_i²(df_i)
SSG = Σ_obs (x̄_i − x̄)² = Σ_groups n_i(x̄_i − x̄)²
SSE + SSG = SST;  MS = SS/DF;  F = MSG/MSE
R² Statistic
R² gives the percent of variance due to between-group variation:

R² = SS[Between]/SS[Total] = SSG/SST

We will see R² again when we study regression.
Where’s the Difference?
Once ANOVA indicates that the groups do not all
appear to have the same means, what do we do?

Analysis of Variance for days


Source DF SS MS F P
treatmen 2 34.74 17.37 6.45 0.006
Error 22 59.26 2.69
Total 24 94.00
Individual 95% CIs For Mean
Based on Pooled StDev
Level N Mean StDev ----------+---------+---------+------
A 8 7.250 1.669 (-------*-------)
B 8 8.875 1.458 (-------*-------)
P 9 10.111 1.764 (------*-------)
----------+---------+---------+------
Pooled StDev = 1.641 7.5 9.0 10.5

Clearest difference: P is worse than A (CI’s don’t overlap)


Multiple Comparisons
Once ANOVA indicates that the groups do not all have the same means, we can compare them two by two using the 2-sample t test.
• We need to adjust our p-value threshold because we are doing multiple tests with the same data.
• There are several methods for doing this.
• If we really just want to test the difference between one pair of treatments, we should set the study up that way.
Tukey's Pairwise Comparisons
Tukey's pairwise comparisons (Minitab)
95% confidence; Family error rate = 0.0500; Individual error rate = 0.0199
Critical value = 3.55. Use alpha = 0.0199 for each test.
Intervals for (column level mean) − (row level mean):

        A                    B
B   (−3.685, 0.435)
P   (−4.863, −0.859)     (−3.238, 0.766)

These give 98.01% CIs for each pairwise difference. Only P vs A is significant (both interval endpoints have the same sign); the 98% CI for A − P is (−4.86, −0.86).
Tukey’s Method in R
Tukey multiple comparisons of means
95% family-wise confidence level

diff lwr upr


B-A 1.6250 -0.43650 3.6865
P-A 2.8611 0.85769 4.8645
P-B 1.2361 -0.76731 3.2395
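The same comparisons can be reproduced in Python. A minimal sketch, assuming the statsmodels package, whose pairwise_tukeyhsd function implements Tukey's HSD:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

days = np.array([5, 6, 6, 7, 7, 8, 9, 10,            # A
                 7, 7, 8, 9, 9, 10, 10, 11,          # B
                 7, 9, 9, 10, 10, 10, 11, 12, 13])   # P
groups = ['A'] * 8 + ['B'] * 8 + ['P'] * 9

# Tukey's HSD at a 5% family-wise error rate
print(pairwise_tukeyhsd(days, groups, alpha=0.05))
# diffs: B-A 1.625, P-A 2.861, P-B 1.236 -- matching the R intervals above
```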
One-Way ANOVA

Introduction to Analysis of Variance


(ANOVA)
What is ANOVA?

• ANOVA is short for ANalysis Of VAriance
• Used with 3 or more groups to test for MEAN DIFFS, e.g., a caffeine study with 3 groups:
  – No caffeine
  – Mild dose
  – Jolt group
• A Level is the value, kind, or amount of the IV
• A Treatment Group is the people who get a specific treatment or level of the IV
• A Treatment Effect is the size of the difference in means
Rationale for ANOVA (1)

• We have at least 3 means to test, e.g., H0: μ1 = μ2 = μ3.
• Could take them 2 at a time, but we really want to test all 3 (or more) at once.
• Instead of using a mean difference, we can use the variance of the group means about the grand mean over all groups.
• The logic is just the same as for the t-test: compare the observed variance among means (the observed difference in means in the t-test) to what we would expect to get by chance.
Rationale for ANOVA (2)
Suppose we drew 3 samples from the same population. Our results might look like this: [figure]
Note that the means from the 3 groups are not exactly the same, but they are close, so the variance among means will be small.
Rationale for ANOVA (3)
Suppose we sample people from 3 different populations. Our results might look like this: [figure]
Note that the sample means are far away from one another, so the variance among means will be large.
Rationale for ANOVA (4)
Suppose we complete a study and find the following results (either graph). How would we know or decide whether there is a real effect or not?
To decide, we can compare our observed variance in means to what we would expect to get on the basis of chance given no true difference in means.
Review

• When would we use a t-test versus 1-way ANOVA?
• In ANOVA, what happens to the variance in means (between cells) if the treatment effect is large?
Rationale for ANOVA
We can break the total variance in a study into meaningful pieces that correspond to treatment effects and error. That's why we call this Analysis of Variance.

Definitions of Terms Used in ANOVA:
X̄_G    The Grand Mean, taken over all observations.
X̄_A    The mean of any level of a treatment.
X̄_A1   The mean of a specific level (1 in this case) of a treatment.
X_i    The observation or raw data for the ith person.
The ANOVA Model
A treatment effect is the difference between the overall, grand mean and the mean of a cell (treatment level):
IV Effect = X̄_A − X̄_G
Error is the difference between a score and a cell (treatment level) mean:
Error = X_i − X̄_A
The ANOVA Model:
X_i = X̄_G + (X̄_A − X̄_G) + (X_i − X̄_A)
An individual's score = the grand mean + a treatment or IV effect + error.
The ANOVA Model
X_i = X̄_G + (X̄_A − X̄_G) + (X_i − X̄_A)
(the grand mean + a treatment or IV effect + error)
The graph shows the terms in the equation. There are three cells or levels in this study. The IV effect and error for the highest-scoring cell are shown.
ANOVA Calculations
Sums of squares (squared deviations from the mean) tell the story of variance. The simple ANOVA designs have 3 sums of squares:

SS_tot = Σ(X_i − X̄_G)²   The total sum of squares comes from the distance of all the scores from the grand mean. This is the total; it's all you have.

SS_W = Σ(X_i − X̄_A)²   The within-group or within-cell sum of squares comes from the distance of the observations to the cell means. This indicates error.

SS_B = Σ N_A(X̄_A − X̄_G)²   The between-cells or between-groups sum of squares tells of the distance of the cell means from the grand mean. This indicates IV effects.

SS_tot = SS_B + SS_W
Computational Example: Caffeine on Test Scores

              G1: Control   G2: Mild      G3: Jolt
Test scores   75 = 79 − 4   80 = 84 − 4   70 = 74 − 4
              77 = 79 − 2   82 = 84 − 2   72 = 74 − 2
              79 = 79 + 0   84 = 84 + 0   74 = 74 + 0
              81 = 79 + 2   86 = 84 + 2   76 = 74 + 2
              83 = 79 + 4   88 = 84 + 4   78 = 74 + 4
Means         79            84            74
SDs (N−1)     3.16          3.16          3.16
Total Sum of Squares: SS_tot = Σ(X_i − X̄_G)²

                  X_i   X̄_G   (X_i − X̄_G)²
G1 (Control)      75    79    16
M = 79            77    79    4
SD = 3.16         79    79    0
                  81    79    4
                  83    79    16
G2                80    79    1
M = 84            82    79    9
SD = 3.16         84    79    25
                  86    79    49
                  88    79    81
G3                70    79    81
M = 74            72    79    49
SD = 3.16         74    79    25
                  76    79    9
                  78    79    1
Sum                           370

In the total sum of squares, we are finding the squared distance of each score from the Grand Mean. If we took the average, we would have a variance.
Within Sum of Squares: SS_W = Σ(X_i − X̄_A)²

                  X_i   X̄_A   (X_i − X̄_A)²
G1 (Control)      75    79    16
M = 79            77    79    4
SD = 3.16         79    79    0
                  81    79    4
                  83    79    16
G2                80    84    16
M = 84            82    84    4
SD = 3.16         84    84    0
                  86    84    4
                  88    84    16
G3                70    74    16
M = 74            72    74    4
SD = 3.16         74    74    0
                  76    74    4
                  78    74    16
Sum                           120

The within sum of squares refers to the variance within cells, that is, the difference between scores and their cell means. SS_W estimates error.
Between Sum of Squares: SS_B = Σ N_A(X̄_A − X̄_G)²

               X̄_A   X̄_G   (X̄_A − X̄_G)²
G1 (M = 79)    79    79    0     (5 observations: 5 × 0 = 0)
G2 (M = 84)    84    79    25    (5 observations: 5 × 25 = 125)
G3 (M = 74)    74    79    25    (5 observations: 5 × 25 = 125)
Sum                        250

The between sum of squares relates the Cell Means to the Grand Mean. This is related to the variance of the means.
ANOVA Source Table (1)

Source           SS    df           MS              F
Between Groups   250   k − 1 = 2    250/2 = 125     MSB/MSW = 125/10 = 12.5
Within Groups    120   N − k = 12   120/12 = 10
Total            370   N − 1 = 14
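The source table can be checked with a short computation. A minimal Python sketch of the between/within decomposition for the caffeine data:

```python
groups = {
    "control": [75, 77, 79, 81, 83],
    "mild":    [80, 82, 84, 86, 88],
    "jolt":    [70, 72, 74, 76, 78],
}
scores = [x for g in groups.values() for x in g]
grand_mean = sum(scores) / len(scores)               # 79

def mean(g):
    return sum(g) / len(g)

ss_b = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
ss_w = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

df_b = len(groups) - 1             # k - 1 = 2
df_w = len(scores) - len(groups)   # N - k = 12
f = (ss_b / df_b) / (ss_w / df_w)
print(ss_b, ss_w, f)               # 250.0 120.0 12.5, matching the table
```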
ANOVA Source Table (2)

• df – Degrees of freedom. Divide the sum of squares by degrees of freedom to get
• MS, Mean Squares, which are population variance estimates.
• F is the ratio of two mean squares. F is another distribution like z and t. There are tables of F used for significance testing.
The F Distribution
F Table – Critical Values
Numerator df: dfB

dfW 1 2 3 4 5

5 5% 6.61 5.79 5.41 5.19 5.05


1% 16.3 13.3 12.1 11.4 11.0
10 5% 4.96 4.10 3.71 3.48 3.33
1% 10.0 7.56 6.55 5.99 5.64
12 5% 4.75 3.89 3.49 3.26 3.11
1% 9.33 6.94 5.95 5.41 5.06
14 5% 4.60 3.74 3.34 3.11 2.96
1% 8.86 6.51 5.56 5.04 4.70
Review

• What are critical values of a statistic (e.g., critical values of F)?
• What are degrees of freedom?
• What are mean squares?
• What does MSW tell us?
Review: 6 Steps
1. Set alpha (.05).
2. State Null & Alternative: H0: μ1 = μ2 = μ3; H1: not all μ are equal.
3. Calculate the test statistic: F = 12.5.
4. Determine the critical value: F.05(2,12) = 3.89.
5. Decision rule: if the test statistic > the critical value, reject H0.
6. Decision: the test is significant (12.5 > 3.89). The means in the population are different.
Post Hoc Tests
• If the t-test is significant, you have a difference in population means.
• If the F-test is significant, you have a difference in population means. But you don't know where.
• With 3 means, it could be A=B>C or A>B>C or A>B=C.
• We need a test to tell which means are different. Many are available; we will use one.
Tukey HSD (1)
Use with equal sample size per cell. HSD means honestly significant difference.

HSD_α = q_α √(MSW / N_A)

where:
α is the Type I error rate (.05);
q_α is a value from a table of the studentized range statistic, based on alpha, dfW (12 in our example), and k, the number of groups (3 in our example);
MSW is the mean square within groups (10);
N_A is the number of people in each group (5).

Result for our example (q from the table): HSD.05 = 3.77 √(10/5) = 5.33.
Tukey HSD (2)
To see which means are significantly different, we compare the observed differences among our means to the critical value of the Tukey test. The differences are:
1 vs 2: 79 − 84 = −5 (say 5 to keep it positive).
1 vs 3: 79 − 74 = 5.
2 vs 3: 84 − 74 = 10. Because 10 is larger than 5.33, this result is significant (group 2 is different from group 3). The other differences are not significant. (Review the 6 steps.)
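A minimal Python sketch of the same HSD computation, using SciPy's studentized range distribution (available in SciPy 1.7 or later) in place of the printed q table:

```python
from math import sqrt
from scipy.stats import studentized_range  # requires SciPy >= 1.7

k, df_w, msw, n_per_group = 3, 12, 10, 5
q = studentized_range.ppf(0.95, k, df_w)   # q ~ 3.77 for alpha = .05
hsd = q * sqrt(msw / n_per_group)
print(round(q, 2), round(hsd, 2))          # 3.77 5.33

for pair, diff in {"1 vs 2": 5, "1 vs 3": 5, "2 vs 3": 10}.items():
    print(pair, "significant" if diff > hsd else "not significant")
```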
Review

• What is a post hoc test? What is its use?
• Describe the HSD test. What does HSD stand for?
Test

• Another name for mean square is _________.
1. standard deviation
2. sum of squares
3. treatment level
4. variance

Test

• When do we use post hoc tests?
a. after a significant overall F test
b. after a nonsignificant overall F test
c. in place of an overall F test
d. when we want to determine the impact of different factors
ANOVA ‐ Analysis of Variance

• Extends the independent‐samples t test
• Compares the means of groups of independent observations
  – Don't be fooled by the name. ANOVA does not compare variances.
• Can compare more than two groups
ANOVA – Null and Alternative Hypotheses
Say the sample contains K independent groups.

• ANOVA tests the null hypothesis
  H0: μ1 = μ2 = … = μK
  – That is, "the group means are all equal"
• The alternative hypothesis is
  H1: μi ≠ μj for some i, j
  – or, "the group means are not all equal"
Example:  
Accuracy of Implant 
Placement

Implants were placed in a 
manikin using placement 
guides of various widths.

15 implants were placed 
using each guide.

Error (discrepancies with a 
reference implant) was 
measured for each implant. 
Example: Accuracy of Implant Placement
The overall mean of the entire sample was 0.248 mm. This is called the "grand" mean, and is often denoted by X̄.
If H0 were true, then we'd expect the group means to be close to the grand mean.
Example: Accuracy of Implant Placement
The ANOVA test is based on the combined distances from X̄. If the combined distances are large, that indicates we should reject H0.
The ANOVA Statistic
To combine the differences from the grand mean we
– square the differences,
– multiply by the numbers of observations in the groups, and
– sum over the groups:

SSB = 15(X̄_4mm − X̄)² + 15(X̄_6mm − X̄)² + 15(X̄_8mm − X̄)²

where the X̄_* are the group means. "SSB" = Sum of Squares Between groups.
Note: This looks a bit like a variance.
How big is big?

• For the Implant Accuracy Data, SSB = 0.0047

• Is that big enough to reject H0?

• As with the t test, we compare the statistic to the 
variability of the individual observations.

• In ANOVA the variability is estimated by the Mean 
Square Error, or MSE
MSE: Mean Square Error
The Mean Square Error is a measure of the variability after the group effects have been taken into account:

MSE = [1/(N − K)] Σ_j Σ_i (x_ij − X̄_j)²

where x_ij is the ith observation in the jth group.
Note that the variation of the means seems quite small compared to the variance of observations within groups.
Notes on MSE
• If there are only two groups, the MSE is equal to the 
pooled estimate of variance used in the equal‐
variance t test.

• ANOVA assumes that all the group variances are 
equal.

• Other options should be considered if group 
variances differ by a factor of 2 or more.
ANOVA F Test
• The ANOVA F test is based on the F statistic

F = [SSB/(K − 1)] / MSE

where K is the number of groups.
• Under H0 the F statistic has an "F" distribution, with K−1 and N−K degrees of freedom (N is the total number of observations).
Implant Data: F test p-value
To get a p-value we compare our F statistic to an F(2, 42) distribution.
In our example

F = (0.0047/2) / (0.466/42) = 0.211

The p-value is P(F(2,42) > 0.211) = 0.81
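A minimal Python sketch, assuming SciPy, that evaluates this upper-tail F probability:

```python
from scipy.stats import f

p = f.sf(0.211, 2, 42)   # survival function: P(F(2, 42) > 0.211)
print(round(p, 2))       # 0.81
```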


ANOVA Table
Results are often displayed using an ANOVA Table:

                 Sum of Squares   df   Mean Square   F      Sig.
Between Groups   .005             2    .002          .211   .811
Within Groups    .466             42   .011
Total            .470             44

Pop Quiz!: Where are the following quantities presented in this table? The Sum of Squares Between (SSB), the Mean Square Error (MSE), the F statistic, and the p value.
Post Hoc Tests
NHANES I data, women 40–60 yrs old. Compare cholesterol between periodontal groups.
The ANOVA shows good evidence (p = 0.002) that the means are not all the same.

                 Sum of Squares   df     Mean Square   F     Sig.
Between Groups   33383            3      11128         5.1   .002
Within Groups    4417119          2007   2201
Total            4450502          2010

Which means are different? We can directly compare the subgroups using "post hoc" tests.
Least Significant Difference test

                N     Mean    Std. Deviation
Healthy         802   221.5   46.2
Gingivitis      490   223.5   45.3
Periodontitis   347   227.3   48.9
Edentulous      372   232.4   48.8

The simplest post hoc test is called the Least Significant Difference Test. The computation is very similar to the equal-variance t test: compute an equal-variance t test, but replace the pooled variance (s²) with the MSE from the ANOVA table (2201).
Least Significant Difference Test: Examples
Compare the Healthy group to the Periodontitis group:

T = (221.5 − 227.3) / √(2201(1/802 + 1/347)) = −1.92
p = 2·P(t₁₁₄₇ > 1.92) = 0.055

Compare the Gingivitis group to the Periodontitis group:

T = (223.5 − 227.3) / √(2201(1/490 + 1/347)) = −1.15
p = 2·P(t₈₃₅ > 1.15) = 0.25

(The group means and the MSE are taken from the tables above.)
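A minimal Python sketch of the LSD comparison, substituting the MSE (2201) for the pooled variance as described above; the helper name is illustrative, and the degrees of freedom follow the slide's choice of n1 + n2 − 2:

```python
from math import sqrt
from scipy.stats import t

MSE = 2201  # from the cholesterol ANOVA table above

def lsd_test(mean1, n1, mean2, n2):   # illustrative helper, not a library call
    se = sqrt(MSE * (1 / n1 + 1 / n2))
    t_stat = (mean1 - mean2) / se
    df = n1 + n2 - 2                  # df as used on the slide
    return t_stat, 2 * t.sf(abs(t_stat), df)   # two-sided p-value

print(lsd_test(221.5, 802, 227.3, 347))  # about (-1.92, 0.055)
print(lsd_test(223.5, 490, 227.3, 347))  # about (-1.15, 0.25)
```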
Post Hoc Tests: Multiple Comparisons
• Post‐hoc testing usually involves multiple comparisons.
• For example, if the data contain 4 groups, then 6 different pairwise comparisons can be made:

Healthy          Gingivitis
Periodontitis    Edentulous
(every pair of the four groups can be compared)
Post Hoc Tests: Multiple Comparisons
• Each time a hypothesis test is performed at significance level α, there is probability α of rejecting in error.
• Performing multiple tests increases the chances of rejecting in error at least once.
• For example:
  – if you did 6 independent hypothesis tests at the α = 0.05 level,
  – and if, in truth, H0 were true for all six,
  – the probability that at least one test rejects H0 is 26%:
  – P(at least one rejection) = 1 − P(no rejections) = 1 − .95⁶ = .26
Bonferroni Correction for Multiple Comparisons
• The Bonferroni correction is a simple way to adjust 
for the multiple comparisons.

Bonferroni Correction
• Perform each test at significance level α.
• Multiply each p-value by the number of tests
performed.
• The overall significance level (chance of any of the
tests rejecting in error) will be less than α.
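A minimal Python sketch of the p-value form of the correction (the helper name is illustrative); the example p-values are the six LSD p-values from the table that follows:

```python
def bonferroni(p_values):             # illustrative helper
    m = len(p_values)                 # number of tests performed
    return [min(1.0, p * m) for p in p_values]

raw = [0.46, 0.055, 0.00021, 0.25, 0.0056, 0.147]   # the six LSD p-values
print(bonferroni(raw))
# [1.0, 0.33, 0.00126, 1.0, 0.0336, 0.882] -- the table's Bonferroni column
```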
Example: Cholesterol Data post‐hoc comparisons

Group 1         Group 2         Mean Difference      LSD       Bonferroni
                                (Group 1 − Group 2)  p-value   p-value
Healthy         Gingivitis      −2.0                 .46       1.0
Healthy         Periodontitis   −5.8                 .055      .330
Healthy         Edentulous      −10.9                .00021    .00126
Gingivitis      Periodontitis   −3.9                 .25       1.0
Gingivitis      Edentulous      −8.9                 .0056     .0336
Periodontitis   Edentulous      −5.1                 .147      .88

Conclusion: The Edentulous group is significantly different than the Healthy group and the Gingivitis group (p < 0.05), after adjustment for multiple comparisons.
Summarizing Scores with
Measures of Central Tendency:
The Mean, Median, and Mode
Outline of the Course
III. Descriptive Statistics
A. Measures of Central Tendency (Chapter 3)
1. Mean
2. Median
3. Mode
B. Measures of Variability (Chapter 4)
1. Range
2. Mean deviation
3. Variance
4. Standard Deviation
C. Skewness (Chapter 2)
1. Positive skew
2. Normal distribution
3. Negative skew
D. Kurtosis
1. Platykurtic
2. Mesokurtic
3. Leptokurtic
Measures of Central Tendency

• The goal of measures of central tendency is to come up with the one single number that best describes a distribution of scores.
• Lets us know if the distribution of scores tends to be composed of high scores or low scores.
Measures of Central Tendency

• There are three basic measures of central tendency, and choosing one over another depends on two different things:
1. The scale of measurement used, so that a summary makes sense given the nature of the scores.
2. The shape of the frequency distribution, so that the measure accurately summarizes the distribution.
Measures of Central Tendency
Mode
• The most common observation in a group of scores.
• Distributions can be unimodal, bimodal, or multimodal.
• If the data are categorical (measured on the nominal scale), then only the mode can be calculated.
• The most frequently occurring score (mode) is Vanilla.

Flavor         f
Vanilla        28
Chocolate      22
Strawberry     15
Neapolitan     8
Butter Pecan   12
Rocky Road     9
Fudge Ripple   6

[Bar chart of the flavor frequencies, f from 0 to 30]
Measures of Central Tendency
Mode
• The mode can also be calculated with ordinal and higher data, but it often is not appropriate.
• If other measures can be calculated, the mode would never be the first choice!
• 7, 7, 7, 20, 23, 23, 24, 25, 26 has a mode of 7, but obviously it doesn't make much sense.
Measures of Central Tendency
Median
• The number that divides a distribution of scores exactly in half.
• The median is the same as the 50th percentile.
• Better than the mode because only one score can be the median, and the median will usually be around where most scores fall.
• If the data are perfectly normal, the mode is the median.
• The median is computed when data are ordinal scale or when they are highly skewed.
Measures of Central Tendency
Median
There are three methods for computing the median, depending on the distribution of scores:
• First, if you have an odd number of scores, pick the middle score: 1 4 6 7 12 14 18 → the median is 7.
• Second, if you have an even number of scores, take the average of the middle two: 1 4 6 7 8 12 14 16 → the median is (7 + 8)/2 = 7.5.
• Third, if you have several scores with the same value in the middle of the distribution, use the formula for percentiles (not found in your book).
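The first two rules are what Python's statistics.median implements; a minimal sketch:

```python
from statistics import median

print(median([1, 4, 6, 7, 12, 14, 18]))     # odd n: middle score -> 7
print(median([1, 4, 6, 7, 8, 12, 14, 16]))  # even n: average of middle two -> 7.5
```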
Measures of Central Tendency
Mean
• The arithmetic average, computed simply by adding together all scores and dividing by the number of scores.
• It uses information from every single score.
• For a population: μ = ΣX/N.  For a sample: X̄ = ΣX/n.
Measures of Central Tendency
Mean: Other Notes
• If data are perfectly normal, then the mean, median, and mode are exactly the same.
• I would prefer to use the mean whenever possible since it uses information from EVERY score.
• Though the preferred symbol for the mean is an X with a line over the top (X̄), creating this symbol is pretty tricky on the computer. APA style says: X̄ = M.
Measures of Central Tendency
The Shape of Distributions
• With perfectly bell-shaped distributions, the mean, median, and mode are identical.
• With positively skewed data, the mode is lowest, followed by the median and mean.
• With negatively skewed data, the mean is lowest, followed by the median and mode.
Measures of Central Tendency
Mean vs. Median: Salary Example
• On one block, the incomes of the families are (in thousands of dollars) 40, 42, 41, 45, 38, 40, 42, 500.
• ΣX = 788, so X̄ = ΣX/n = 788/8 = 98.5.
• The mean salary for this sample is $98,500, which is more than twice almost all of the scores.
• Arrange the scores: 38, 40, 40, 41, 42, 42, 45, 500.
• The middle two numbers are 41 and 42, so the median is $41,500, perhaps a more accurate measure of central tendency.
Measures of Central Tendency
Mean vs. Median: Reaction Time Example
• The data are times to complete a task (in s): 45, 34, 87, 56, 21, didn't finish, 49.
• It is not possible to compute a mean with this unknown number.
• Even though we do not know this person's time, we do know it is REALLY big.
• 21, 34, 45, 49, 56, 87, something bigger.
• The median is the middle number, 49.
Measures of Central Tendency
Mean: Algebra Revisited
• It's useful to consider the formula as being like any other algebraic formula, subject to the same rules:
  X̄ = ΣX/n, and therefore X̄·n = ΣX.
• So, if we know the mean of a group of scores, we can figure out the ΣX.
Measures of Central Tendency
Mean: Weighted Mean
• Let's pretend that one semester's class of 23 students scored M1 = 18 points on a quiz. The same quiz was then given the next semester to 34 students, who got M2 = 22 points. What is the overall (weighted) mean for these 57 students?
• ΣX1 can be computed by multiplying M1 by the sample size: ΣX1 = M1·n1 = 18·23 = 414.
• For the second class, ΣX2 = M2·n2 = 22·34 = 748.
• ΣXtotal = ΣX1 + ΣX2 = 414 + 748 = 1206.
• ntotal = n1 + n2 = 23 + 34 = 57.
• Mtotal = ΣXtotal/ntotal = 1206/57 = 21.158.
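A minimal Python sketch of the same weighted-mean computation (the helper name is illustrative):

```python
def weighted_mean(means, ns):   # illustrative helper
    total = sum(m * n for m, n in zip(means, ns))  # reconstruct sum of X per class
    return total / sum(ns)

print(round(weighted_mean([18, 22], [23, 34]), 3))  # 1206 / 57 = 21.158
```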


Measures of Central Tendency
Mean: Adding a Score
• On the first exam, 15 students had M = 85.
• One kid came in late, took the test, and scored 53. What is Mnew?
• ΣXoriginal = Moriginal·noriginal = 85·15 = 1275.
• ΣXnew = ΣXoriginal + new score = 1275 + 53 = 1328.
• nnew = noriginal + 1 = 16.
• Mnew = ΣXnew/nnew = 1328/16 = 83.
Measures of Central Tendency
Mean: Changing an Existing Score
• On the first exam, 16 students had M = 83.
• One kid came in after the test and complained; I listened and decided to give him 10 extra points. Now what?
• ΣXoriginal = Moriginal·noriginal = 83·16 = 1328.
• ΣXnew = ΣXoriginal + extra points = 1328 + 10 = 1338.
• Mnew = ΣXnew/n = 1338/16 = 83.625.
Measures of Central Tendency
Mean: Transformations
• If a constant is added to (or subtracted from) each score, the same constant will be added to (or subtracted from) the mean.
  – If M for an exam is 82 and I find that I screwed up a question and give everyone 5 extra points, M simply becomes 82 + 5 = 87.
• If every score is multiplied or divided by a constant number, then the mean will also be multiplied or divided by the same number.
  – This last property is particularly useful when converting between units of measurement.
  – If the M for the height of a group of first-graders is 47 inches, but I need to know their heights in cm, I could take every kid's height × 2.54 and then recompute M, or I could simply take the mean times 2.54 and conclude that the M height of these kids is 119.38 cm.
Measures of Central Tendency
Deviations around the Mean
A common formula we will be working with extensively is the deviation: X − X̄.

Exam Score   X − X̄
7            (7 − 9) = −2
6            (6 − 9) = −3
8            (8 − 9) = −1
9            (9 − 9) = 0
12           (12 − 9) = 3
10           (10 − 9) = 1
11           (11 − 9) = 2
9            (9 − 9) = 0

ΣX = 72, n = 8, X̄ = ΣX/n = 72/8 = 9, and Σ(X − X̄) = 0.
Measures of Central Tendency
Using the Mean to Interpret Data: Predicting Scores
• If asked to predict a score, and you know nothing else, then predict the mean.
• However, we will probably be wrong, and our error will equal X − X̄.
• A score's deviation indicates the amount of error we have when using the mean to predict an individual score.
Measures of Central Tendency
Using the Mean to Interpret Data: Describing a Score's Location
• If you take a test and get a score of 45, the 45 means nothing in and of itself. However, if you learn that M = 50, then we know more: your score was 5 units BELOW M.
• Positive deviations are above M.
• Negative deviations are below M.
• Large deviations indicate a score far from M.
• Large deviations occur less frequently.
Measures of Central Tendency
Using the Mean to Interpret Data
Describing the Population Mean
z Remember, we usually want to know population
parameters, but populations are too large.
z So, we use the sample mean to estimate the
population mean.

X̄ ≈ μ
Measures of Central Tendency
Consider the Measurements and Frequency Table
Generated in the previous lecture
87, 85, 79, 75, 81, 88, 92, 86, 77, 72, 75, 77, 81, 80, 77,
73, 69, 71, 76, 79, 83, 81, 78, 75, 68, 67, 71, 73, 78, 75,
84, 81, 79, 82, 87, 89, 85, 81, 79, 77, 81, 78, 74, 76, 82,
85, 86, 81, 72, 69, 65, 71, 73, 78, 81, 77, 74, 77, 72, 68
Class          Midpoint   Count   Relative Frequency
64.5 – 69.5    67         6       0.100
69.5 – 74.5    72         11      0.183
74.5 – 79.5    77         20      0.333
79.5 – 84.5    82         13      0.217
84.5 – 89.5    87         9       0.150
89.5 – 94.5    92         1       0.0167
Measures of Central Tendency

We have determined that the lowest and highest readings
in this set of measurements are:
• Low = 65
• High = 92

This gives a range = 92 − 65 = 27

The simplest measurement of central tendency in this
population is the midrange.

Define: midrange = (low value + high value)/2

Midrange = (65 + 92)/2 = 157/2 = 78.5


Measures of Central Tendency

The most descriptive measure of central tendency in a
population is its mean, because the mean of a sample taken
from the population can be shown to be predictive of the
population mean within some range determined by the
sampling error.

Define the population mean by the formula

μ = Σi xi / N

where
μ = the population mean
Σi xi = the sum of xi over each member of the population
N = the number of items in the population
Measures of Central Tendency

For the 60 temperature readings in this population we
obtain:
87, 85, 79, 75, 81, 88, 92, 86, 77, 72, 75, 77, 81, 80, 77,
73, 69, 71, 76, 79, 83, 81, 78, 75, 68, 67, 71, 73, 78, 75,
84, 81, 79, 82, 87, 89, 85, 81, 79, 77, 81, 78, 74, 76, 82,
85, 86, 81, 72, 69, 65, 71, 73, 78, 81, 77, 74, 77, 72, 68

μ = (87 + 85 + 79 + … + 72 + 68)/60 = 4673/60 = 77.883


Measures of Central Tendency
A third measure of central tendency is the median
The median of a population of size N is found by
1. Arranging the individual measurements in ascending
order, and
2. If N is odd, selecting the value in the middle of this list as
the median (there will be the same number of values
above and below the median)
3. If N is even, finding the values at positions N/2 and N/2 + 1 in
this list (call them x(N/2) and x(N/2+1)) and letting the median be
given by median = (x(N/2) + x(N/2+1))/2, i.e., the value
halfway between these two measurements.
Note! When N is even the median will usually not be an
actual value in the population
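The two cases translate directly into code. A minimal sketch in Python of the procedure just described (the test values are the odd- and even-N examples used later in these notes):

    # Median by the textbook procedure: sort, then take (or average) the middle.
    def median(values):
        data = sorted(values)                      # step 1: ascending order
        n = len(data)
        mid = n // 2
        if n % 2 == 1:                             # N odd: the single middle value
            return data[mid]
        return (data[mid - 1] + data[mid]) / 2     # N even: halfway between them

    print(median([3, 5, 8, 10, 11]))    # 8
    print(median([3, 3, 4, 5, 7, 8]))   # 4.5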
Measures of Central Tendency
We now find the median of the population of temperature
readings
87, 85, 79, 75, 81, 88, 92, 86, 77, 72, 75, 77, 81, 80, 77,
73, 69, 71, 76, 79, 83, 81, 78, 75, 68, 67, 71, 73, 78, 75,
84, 81, 79, 82, 87, 89, 85, 81, 79, 77, 81, 78, 74, 76, 82,
85, 86, 81, 72, 69, 65, 71, 73, 78, 81, 77, 74, 77, 72, 68
Arrange these 60 measurements in ascending order

65, 67, 68, 68, 69, 69, 71, 71, 71, 72, 72, 72, 73, 73, 73,
74, 74, 75, 75, 75, 75, 76, 76, 77, 77, 77, 77, 77, 77, 78,
78, 78, 78, 79, 79, 79, 79, 80, 81, 81, 81, 81, 81, 81, 81,
81, 82, 82, 83, 84, 85, 85, 85, 86, 86, 87, 87, 88, 89, 92
Since N/2 = 30 and both the 30th and 31st values in the list are
the same, we obtain median = 78
Measures of Central Tendency
One further parameter of a population that may give some
indication of central tendency of the data is the mode

Define: mode = most frequently occurring value in the
population

From the previous data,
65, 67, 68, 68, 69, 69, 71, 71, 71, 72, 72, 72, 73, 73, 73,
74, 74, 75, 75, 75, 75, 76, 76, 77, 77, 77, 77, 77, 77, 78,
78, 78, 78, 79, 79, 79, 79, 80, 81, 81, 81, 81, 81, 81, 81,
81, 82, 82, 83, 84, 85, 85, 85, 86, 86, 87, 87, 88, 89, 92
we see that the value 81 occurs 8 times, so mode = 81.

Note! If two different values were to occur most frequently, the
distribution would be bimodal. A distribution may be multi-modal.
Measures of Central Tendency
Next we show where each of these parameters occurs in the
frequency distribution graph for this tabulated data:
[Frequency polygon (frequency % vs. temperature) over the class
midpoints 67, 72, 77, 82, 87, 92, with the mean (77.883), median (78),
midrange (78.5), and mode (81) marked.]
Measures of Central Tendency

In the previous pages we have calculated the mean from the
raw data. We can also use the tabulated data to calculate the
mean of the population.

Use the formula

μ = Σi (fi · xi) / Σi fi

where xi = the midpoint of the ith class
and fi = the number of items in the ith class
Measures of Central Tendency
From the table we obtain
Class          Midpoint (x)   Count (f)   Relative Frequency   f·x
64.5 – 69.5    67             6           0.100                402
69.5 – 74.5    72             11          0.183                792
74.5 – 79.5    77             20          0.333                1540
79.5 – 84.5    82             13          0.217                1066
84.5 – 89.5    87             9           0.150                783
89.5 – 94.5    92             1           0.0167               92
Total                         60                               4675

μ = Σi (fi · xi) / Σi fi = 4675/60 = 77.917

The small discrepancy between these two values for the mean is due to the
way the data are accumulated into classes. The mean of the raw data is more
accurate; the mean of the tabulated data is often more convenient to obtain.
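Both versions of the calculation are easy to verify. A minimal sketch in Python comparing the raw mean with the grouped (midpoint-weighted) mean for the temperature data:

    # Raw mean vs. grouped mean for the 60 temperature readings.
    readings = [87, 85, 79, 75, 81, 88, 92, 86, 77, 72, 75, 77, 81, 80, 77,
                73, 69, 71, 76, 79, 83, 81, 78, 75, 68, 67, 71, 73, 78, 75,
                84, 81, 79, 82, 87, 89, 85, 81, 79, 77, 81, 78, 74, 76, 82,
                85, 86, 81, 72, 69, 65, 71, 73, 78, 81, 77, 74, 77, 72, 68]
    raw_mean = sum(readings) / len(readings)    # 4673 / 60 ≈ 77.883

    midpoints = [67, 72, 77, 82, 87, 92]
    counts = [6, 11, 20, 13, 9, 1]
    grouped = sum(f * x for f, x in zip(counts, midpoints)) / sum(counts)
    print(raw_mean, grouped)                    # ≈ 77.883 and ≈ 77.917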
Measure of central tendency
„ Central tendency
„ A statistical measure that identifies a single
score as representative for an entire
distribution. The goal of central tendency is
to find the single score that is most typical
or most representative of the entire group.
Measure of central tendency
„ The mean
„ Population mean vs. sample mean
μ = ΣX / N        x̄ = ΣX / n
„ N = 4: 3, 7, 4, 6
x̄ = ΣX / n = 20/4 = 5
Measure of central tendency
„ The weighted mean
„ Group A: x̄ = 6, n = 12
„ Group B: x̄ = 7, n = 8
„ Weighted mean = (ΣX1 + ΣX2) / (n1 + n2)
= (72 + 56) / (12 + 8) = 6.4
„ The mean is seriously sensitive to extreme scores.
Measure of central tendency
„ Median
„ The score that divides a distribution exactly
in half. Exactly 50 percent of the
individuals in a distribution have scores at
or below the median.
„ odd: 3, 5, 8, 10, 11 → median = 8
„ even: 3, 3, 4, 5, 7, 8 →
median = (4+5)/2 = 4.5
Measure of central tendency
„ Median
„ The median is often used as a measure of
central tendency when the number of
scores is relatively small, when the data
have been obtained by rank-order
measurement, or when a mean score is
not appropriate.
Measure of central tendency
„ Mode
„ Most frequently obtained score in the data
„ Problems:
„ There may be no mode, or more than one mode
Measure of central tendency
„ Choosing a measure of central tendency
„ the level of measurement of the variable concerned
(nominal, ordinal, interval or ratio);
„ the shape of the frequency distribution;
„ what is to be done with the figure obtained.
„ The mean is really suitable only for ratio and interval
data. For ordinal variables, where the data can be ranked
but one cannot validly talk of 'equal differences' between
values, the median, which is based on ranking, may be
used. Where it is not even possible to rank the data, as in
the case of a nominal variable, the mode may be the
only measure available.
Measure of central tendency
„ Central tendency and the shape of the
distribution
Summary
1. The purpose of central tendency is to determine the single value that best
represents the entire distribution of scores. The three standard measures of
central tendency are the mode, the median, and the mean.
2. The mean is the arithmetic average. It is computed by summing all the
scores and then dividing by the number of scores. Conceptually, the mean
is obtained by dividing the total (ΣX) equally among the number of
individuals (N or n). Although the calculation is the same for a population
or a sample mean, a population mean is identified by the symbol μ and a
sample mean is identified by X̄.
3. Changing any score in the distribution will cause the mean to be changed.
When a constant value is added to (or subtracted from) every score in a
distribution, the same constant value is added to (or subtracted from) the
mean. If every score is multiplied by a constant, the mean will be multiplied
by the same constant. In nearly all circumstances, the mean is the best
representative value and is the preferred measure of central tendency.
Summary
1. The median is the value that divides a distribution exactly in half. The
median is the preferred measure of central tendency when a
distribution has a few extreme scores that displace the value of the
mean. The median also is used when there are undetermined
(infinite) scores that make it impossible to compute a mean.
2. The mode is the most frequently occurring score in a distribution. It
is easily located by finding the peak in a frequency distribution graph.
For data measured on a nominal scale, the mode is the appropriate
measure of central tendency. It is possible for a distribution to have
more than one mode.
3. For symmetrical distributions, the mean will equal the median. If
there is only one mode, then it will have the same value, too.
4. For skewed distributions, the mode will be located toward the side
where the scores pile up, and the mean will be pulled toward the
extreme scores in the tail. The median will be located between these
two values.
Homework

Imagine that you received the following data on the vocabulary test mentioned earlier:
20 22 23 23 23
23 23 23 24 25
28 29 30 30 30
30 30 30 31 32
32 33 33 34 35
35 36 36 37 37

1. Chart the data and draw the frequency polygon.

2. Compute the mean, mode, and median of the data and decide which of the three you
believe to be best for the central tendency of the data.
Measure of variability
„ Variability provides a quantitative
measure of the degree to which scores
in a distribution are spread out or
clustered together.
Measure of variability
„ Range
„ range = Xhighest − Xlowest
„ Quartile:
„ A statistical term describing a division of observations into four defined intervals
based upon the values of the data and how they compare to the entire set of
observations. Each quartile contains 25% of the total observations. Generally, the
data are ordered from smallest to largest: observations in the lowest 25% fall in
the 1st quartile, observations between 25% and 50% in the 2nd quartile,
observations between 50% and 75% in the 3rd quartile, and the remaining
observations in the 4th quartile.
„ Interquartile: The interquartile range is a measure of spread or dispersion. It
is the difference between the 75th percentile (often called Q3) and the 25th
percentile (Q1). The formula for interquartile range is therefore: Q3-Q1.
„ Semi-interquartile: The semi-interquartile range is a measure of spread or
dispersion. It is computed as one half the difference between the 75th
percentile [often called (Q3)] and the 25th percentile (Q1). The formula for
semi-interquartile range is therefore: (Q3-Q1)/2.
„ TOEFL: (560-470)/2=45
Measure of variability
„ Variance
„ Deviation: the deviation of one score from the
mean
„ Variance: takes the distribution of all
scores into account; it is based on the
sum of squares (SS)
Measure of variability
„ Standard deviation

score   mean    deviation   squared deviation
  8     9.67     −1.67        2.79
 25     9.67    +15.33      235.01
  7     9.67     −2.67        7.13
  5     9.67     −4.67       21.81
  8     9.67     −1.67        2.79
  3     9.67     −6.67       44.49
 10     9.67     +0.33        0.11
 12     9.67     +2.33        5.43
  9     9.67     −0.67        0.45
                 sum of squared dev = 320.01

Standard deviation = square root(sum of squared deviations / (N − 1))
= square root(320.01 / (9 − 1))
= square root(40)
= 6.32
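The same computation in a few lines of Python, using the nine scores from the table:

    import math

    # Sample SD: sum of squared deviations over N - 1, then the square root.
    scores = [8, 25, 7, 5, 8, 3, 10, 12, 9]
    mean = sum(scores) / len(scores)            # ≈ 9.67

    ss = sum((x - mean) ** 2 for x in scores)   # ≈ 320.0
    sd = math.sqrt(ss / (len(scores) - 1))      # sqrt(40) ≈ 6.32
    print(sd)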
Measure of variability
„ The larger the standard deviation figure, the wider
the range of distribution away from the measure of
central tendency
Measure of variability
„ Adding a constant to each score does
not change the standard deviation.
„ Multiplying each score by a constant
causes the standard deviation to be
multiplied by the same constant.
Measure of variability
Group A Group B
11 20
8 10
10 1
9 8
8 0
12 30
10 13
11 6
Measure of variability

Reporting the standard deviation (APA):

                 Type of instrument
           Listening            Watching
           Mean      SD         Mean     SD
Males      15.72     4.43       6.94     2.26
Females     3.47     1.12       2.61     0.98
Measure of variability
„ Standard deviation and normal distribution
Homework
1. Calculate the mean, median, mode, range and standard
deviation for the following sample:

Midterm Exam
X X
100 85
88 82
83 96
105 107
78 102
98 113
126 94
85 119
67 91
88 100
88 72
77 88
114 85
Homework
2. Suppose that the following scores were obtained on administering a
language proficiency test to ten aphasics who had undergone a course
of treatment, and ten otherwise similar aphasics who had not
undergone the treatment:
Experimental group Control group
15 31
28 34
62 47
17 41
31 28
58 54
45 36
11 38
76 45
43 32
Calculate the mean score and standard deviation for each group, and
comment on the results.
Locating scores and finding
scales in a distribution
Percentiles, quartiles, deciles
Mind work

Imagine that you conducted an in-service course for ESL teachers. To receive university credit for the
course, the teachers must take examinations--in this case, a midterm and a final. The midterm was a
multiple-choice test of 50 items and the final exam presented teachers with 10 problem situations to
solve. Sue, like most teachers, was a whiz at taking multiple-choice exams, but bombed out on the
problem-solving final exam. She received a 48 on the midterm and a 1 on the final. Becky didn't do so
well on the midterm. She kept thinking of exceptions to answers on the multiple-choice exam. Her score
was 39. However, she really did shine on the final, scoring a 10. Since you expect students to do well on
both exams, you reason that Becky has done a creditable job on each and Sue has not. Becky gets the
higher grade. Yet, if you add the points together, Sue has 49 and Becky has 49. The question is whether
the points are really equal.
Should Sue also do this bit of arithmetic, she might come to your office to complain of the injustice of it
all. How will you show her that the value of each point on the two tests is different?
Locating scores and finding
scales in a distribution
„ Standard score (z-scores):
z = (X − x̄) / s
Locating scores and finding
scales in a distribution
Mind work

Suppose that we have measured the times taken by a very large number of
people to utter a particular sentence, and have shown these times to be
normally distributed with a mean of 3.45 sec and a standard deviation of
0.84 sec. Armed with this information, we can answer various questions.
1. What proportion of the (potentially infinite) population of utterance
times would be expected to fall below 3 sec?
2. What proportion would lie between 3 and 4 sec?
3. What is the time below which only 1 per cent of the times would be
expected to fall?
Mind work

„ 1. z-score for 3 sec: z = (3 − 3.45)/0.84 = −0.54
„ 2. Check the normal distribution table: about 29.46 per cent of the
times fall below 3 sec
„ 3. z-score for 4 sec: z = (4 − 3.45)/0.84 ≈ 0.66, so about 25.46 per cent
of the times fall above 4 sec
„ 4. Between 3 and 4 sec: 100 − 29.46 − 25.46 = 45.1 per cent
„ 5. z-score cutting off the lowest 1 per cent: −2.33
„ 6. −2.33 = (x − 3.45)/0.84, so x = (−2.33 × 0.84) + 3.45 = 1.49 sec
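These table lookups can be reproduced programmatically. A minimal sketch in Python using scipy.stats.norm (the exact answers differ slightly from the hand-rounded table values):

    from scipy.stats import norm

    mu, sigma = 3.45, 0.84    # utterance-time distribution from the slide

    p_below_3 = norm.cdf(3, mu, sigma)               # about 0.30
    p_between = norm.cdf(4, mu, sigma) - p_below_3   # about 0.45
    t_1_percent = norm.ppf(0.01, mu, sigma)          # about 1.50 sec
    print(p_below_3, p_between, t_1_percent)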
Normal Distribution Table
Locating scores and finding
scales in a distribution
„ T-score
„ T score = 10(z) + 50
„ For a scale with mean 500 and SD 100 (e.g., the NMET below): z = (score − 500)/100
„ X = z·s + x̄
Mind work

A foreign languages institute stipulated for its graduate program that any
student scoring below 75 on any single course examination lost the right
to write a thesis. Clearly, this is unscientific, because it effectively forces
comparisons between examinations that are not of the same kind. The
same score of 75 means different things on different examinations: on a
very easy examination it may be a rather low score, while on a difficult
examination it may be a rather high one. Barring everyone below that
score from writing a thesis is neither scientific nor fair. The scientific
approach is to convert the scores on each course into standard scores and
then set a standard-score cutoff below which students may not write a
thesis. As in the example above, once we have standard scores we can also
combine the scores on all courses into a total, or compute an average,
rank the students, and then set a criterion for what total or average score
qualifies a student to write a thesis.
Locating scores and finding
scales in a distribution
„ Distributions with nominal data
„ Implicational scaling (Guttman scaling)
„ Coefficient of scalability
Homework
I. The following scores are obtained by 50 subjects on a language aptitude test:
42 62 44 32 47 42 52 76 36 43
55 27 46 55 47 28 53 44 15 61
18 59 58 57 49 55 88 49 50 62
61 82 66 80 64 50 40 53 28 63
63 25 58 71 82 52 73 67 58 77

1. Draw a histogram to show the distribution of the scores.


2. Calculate the mean and standard deviation of the scores.
3. Suppose Lihua scored 55 in this test, what’s her position in the
whole class?
II. Suppose there will be 418,900 test takers for the NMET in 2006 in
Guangdong, the key universities in China plan to enroll altogether 32,000
students in Guangdong. What score is the lowest threshold for a student to
be enrolled by the key universities? (Remember the mean is 500, standard
deviation is 100).
Sample statistics and population
parameter: estimation
„ Standard error
„ Sampling distribution of the mean
„ Standard error of the mean
„ Standard error = s / √N
„ In order to halve the standard error, we should have to
take a sample which was four times as big.
„ Central limit theorem:
„ For any population with mean μ and standard deviation
σ, the distribution of sample means for sample size n
will approach a normal distribution with mean μ and
standard deviation σ/√n as n approaches infinity.
„ In practice, samples above 30 count as large.
Sample statistics and population
parameter: estimation
„ Interpreting standard error: confidence
limits
Sample statistics and population
parameter: estimation

„ Normal distribution: sample is large


„ t-distribution: sample is small
„ Degrees of freedom: N − 1
„ When the sample is large, t ≈ z
Sample statistics and population
parameter: estimation
„ Interpreting standard error: confidence
limits
„ Mean = 58.2
„ s = 23.6
„ N = 50
„ Standard error = s/√N = 23.6/√50 = 3.3
„ z = (58.2 − x)/3.3 with −1.96 ≤ z ≤ 1.96
gives the 95% limits 51.7 ≤ x ≤ 64.7
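A minimal sketch of the same interval in Python (58.2 and 23.6 are the sample mean and SD from the slide):

    import math

    mean, s, n = 58.2, 23.6, 50
    se = s / math.sqrt(n)         # ≈ 3.3

    lower = mean - 1.96 * se      # ≈ 51.7
    upper = mean + 1.96 * se      # ≈ 64.7
    print(lower, upper)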
Sample statistics and population
parameter: estimation
„ Confidence limits for proportions
„ Standard error = √(p(1 − p)/N)
„ Confidence limits = proportion in sample
± (critical value × standard error)
Sample statistics and population
parameter: estimation
Suppose that we have taken a random sample of 500 finite verbs from a text,
and found that 150 of them have present tense form. How can we set
confidence limits for the proportion of present tense finite verbs in the whole
text, the population from which the sample is taken?

95% confidence limits = proportion in sample ± (1.96 × standard error)
= 0.30 ± (1.96 × 0.02)
= 0.30 ± 0.04 = 0.26 to 0.34.
We can thus be 95 per cent confident that the proportion of present tense
finite verbs in the population lies between 26 and 34 per cent.
Sample statistics and population
parameter: estimation
„ Estimating required sample sizes
„ Standard error = √(p(1 − p)/N)

In a paragraph there are 46 word tokens, of which 11 are two-
letter words. The proportion of such words is thus 11/46 or 0.24.
How big a sample of words should we need in order to be 95 per
cent confident that we had measured the proportion to within an
accuracy of 1 per cent?
0.01 = 1.96 × standard error
Standard error = 0.01 / 1.96
N = (0.24 × 0.76) / (0.01/1.96)² = 7007
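A minimal sketch of this sample-size calculation in Python:

    # Required N so that the 95% margin of error on the proportion is 0.01.
    p = 0.24          # proportion of two-letter words (11/46, rounded)
    margin, z = 0.01, 1.96

    se_needed = margin / z                       # the standard error we can tolerate
    n_required = p * (1 - p) / se_needed ** 2    # ≈ 7007
    print(round(n_required))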
Homework
I. The following are the times (in seconds) taken for a group of 30 subjects
to carry out the detransformation of a sentence into its simplest form:
0.55 0.56 0.52 0.59 0.51 0.50
0.42 0.41 0.37 0.22 0.24 0.41
0.49 0.59 0.75 0.65 0.63 0.61
0.72 0.77 0.76 0.39 0.26 0.68
0.30 0.32 0.44 0.61 0.54 0.47

Calculate (i) the mean, (ii) the standard deviation, (iii) the standard error
of the mean, (iv) the 99 per cent confidence limits for the mean.

II. A random sample of 300 finite verbs is taken from a text, and it is found
that 63 of these are auxiliaries. Calculate the 95 per cent confidence
limits for the proportion of finite verbs which are auxiliaries in the text
as a whole.
III. Using the data in question II, calculate the size of the sample of finite
verbs which would be required in order to estimate the proportion of
auxiliaries to within an accuracy of 1 per cent, with 95 per cent
confidence.
Probability and Hypothesis
Testing
„ Null hypothesis (H0)
„ The null hypothesis states that in the general
population there is no change, no difference, or no
relationship. In the context of an experiment, H0
predicts that the independent variable (treatment)
will have no effect on the dependent variable for
the population. H0: μA- μB=0 or μA= μB
„ Alternative hypothesis (H1)
„ The alternative hypothesis (H1) states that there is
a change, a difference, or a relationship for the
general population. H1: μA≠ μB
Probability and Hypothesis
Testing
„ Null hypothesis (H0)
„ When we reject the null hypothesis, we want the probability to be
very low that we are wrong. If, on the other hand, we must accept
the null hypothesis, we still want the probability to be very low that
we are wrong in doing so.
„ Type I error and Type II error
„ A Type I error is made when the researcher rejects the null
hypothesis when it should not have been rejected.
„ A Type II error is made when the null hypothesis is accepted when
it should have been rejected.
„ In research, we test our hypothesis by finding the probability of
our results. Probability is the proportion of times that any
particular outcome would happen if the research were repeated
an infinite number of times.
Probability and Hypothesis
Testing
„ Two-tailed and one-tailed hypothesis
„ When we specify no direction for the null hypothesis (i.e.,
whether our score will be higher or lower than more typical
scores), we must consider both tails of the distribution. This
is called two-tailed hypothesis.
„ If we have good reason to believe that we will find a
difference (e.g., previous studies or research findings
suggest this is so), then we will use a one-tailed hypothesis.
One-tailed tests specify the direction of the predicted
difference. We use previous findings to tell us which
direction to select.
Critical values of z:   α = .05   α = .01
1-tailed                1.64      2.33
2-tailed                1.96      2.58
Probability and Hypothesis
Testing
„ Steps in hypothesis testing
1. State the null hypothesis.
2. Decide whether to test it as a one- or two-tailed hypothesis. If there is
no research evidence on the issue, select a two-tailed hypothesis. This
will allow you to reject the null hypothesis in favor of an alternative
hypothesis. If there is research evidence on the issue, select a
one-tailed hypothesis. This will allow you to reject the null
hypothesis in favor of a directional hypothesis.
3. Set the probability level (α level). Justify your choice.
4. Select the appropriate statistical test(s) for the data.
5. Collect the data and apply the statistical test(s).
6. Report the test results and interpret them correctly.
Probability and Hypothesis
Testing
„ Parametric vs. nonparametric
„ Parametric procedures
„ Make strong assumptions about the distribution of the
data
„ Assume the data are NOT frequencies or ordinal scales
but interval data
„ Data are normally distributed
„ Nonparametric procedures
„ Do not make strong assumptions about the shape of the
distribution of the data
„ Work with frequencies and rank-ordered scales
„ Used when the sample size is small
Homework
Lecture 14 chi-square test, P-
value
• Measurement error (review from lecture 13)
• Null hypothesis; alternative hypothesis
• Evidence against null hypothesis
• Measuring the Strength of evidence by P-
value
• Pre-setting significance level
• Conclusion
• Confidence interval
Some general thoughts about
hypothesis testing
• A claim is any statement made about the truth; it
could be a theory made by a scientist, or a
statement from a prosecutor, a manufacturer or a
consumer
• Data cannot prove a claim, however, because there
may be other data that could contradict the theory
• Data can be used to reject the claim if there is a
contradiction to what may be expected
• Put any claim in the null hypothesis H0
• Come up with an alternative hypothesis and put it
as H1
• Study the data and find a hypothesis-testing statistic,
which is an informative summary of the data
• The test statistic is chosen by experience or
statistical training; it depends on the
formulation of the problem and how the data
are related to the hypothesis.
• Find the strength of evidence by the P-value:
from a future set of data, compute the
probability that the summary test statistic
will be as large as or even greater than the
one obtained from the current data. If the
P-value is very small, then either the null
hypothesis is false or you are extremely
unlucky. So the statistician will argue that this is
strong evidence against the null hypothesis.
If the P-value is smaller than a pre-specified level
(called the significance level, 5% for example),
the null hypothesis is rejected.
Back to the microarray example
• H0 : true SD σ = 0.1 (denote 0.1 by σ0)
• H1 : true SD σ > 0.1 (because this is the main
concern; you don't care if SD is small)
• Summary:
• Sample SD (s) = square root of (sum of
squares / (n − 1)) = 0.18
• where sum of squares = (1.1 − 1.3)² + (1.2 − 1.3)²
+ (1.4 − 1.3)² + (1.5 − 1.3)² = 0.1, n = 4
• The ratio s/σ0 = 1.8; is it too big?
• The P-value consideration:
• Suppose a future data set (n = 4) will be
collected.
• Let s be the sample SD from this future
dataset; it is random; so what is the
probability that s/σ0 will be at least this big?
• P(s/σ0 > 1.8)
• But to find the probability we need to use the chi-
square distribution:
• Recall that sum of squares / true variance
follows a chi-square distribution;
• Therefore, equivalently, we compute
• P(future sum of squares / σ0² > sum of
squares from the currently available data / σ0²),
where σ0 is the value claimed under the null hypothesis.
Once again, if data were generated again, then sum of
squares / true variance is random and follows a chi-square
distribution with n − 1 degrees of freedom, where sum of squares = the
sum of squared distances between each data point and the sample mean.
(Note: sum of squares = (n − 1) · sample variance = (n − 1)(sample SD)².)
The value computed from the available data = 0.10/0.01 = 10
(sum of squares = 0.1, true variance claimed under H0 = σ0² = 0.01), so
P-value = P(chi-square random variable > computed value from data)
= P(chi-square random variable > 10.0)
For our case, n = 4, so look at the chi-square distribution with df = 3.
From the table, P(χ² > 9.348) = .025 and P(χ² > 11.34) = .01, so the
P-value is between .025 and .01: reject the null hypothesis at the 5%
significance level.
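The P-value can be computed directly instead of being bracketed from a table. A minimal sketch in Python using scipy.stats.chi2:

    from scipy.stats import chi2

    data = [1.1, 1.2, 1.4, 1.5]
    n = len(data)
    mean = sum(data) / n                        # 1.3
    ss = sum((x - mean) ** 2 for x in data)     # 0.1

    sigma0_sq = 0.1 ** 2                        # variance claimed under H0
    stat = ss / sigma0_sq                       # 10.0
    p_value = chi2.sf(stat, df=n - 1)           # ≈ 0.019, between .01 and .025
    print(stat, p_value)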
Confidence interval
• A 95% confidence interval for true variance σ2
is
• (Sum of squares/C2, sum of squares/C1)
• Where C1 and C2 are the cutting points from
chi-square table with d.f=n-1 so that
• P(chisquare random variable > C1)= .975
• P(chisquare random variable>C2)=.025
• This interval is derived from
P(C1 < sum of squares/σ² < C2) = .95
• For our data, sum of squares = 0.1; from the chi-square table with
d.f. = 3, C1 = .216 and C2 = 9.348, so the 95% confidence interval for σ²
is 0.1/9.348 = 0.0107 to 0.1/0.216 = 0.4630. How about a confidence
interval for σ? (Take square roots: about 0.10 to 0.68.)
Chi-Square Procedures

11.1
Chi-Square Goodness of Fit
Characteristics of the Chi-Square Distribution
1. It is not symmetric.
2. The shape of the chi-square distribution
depends upon the degrees of freedom, just like
Student’s t-distribution.
3. As the number of degrees of freedom
increases, the chi-square distribution becomes
more symmetric as is illustrated in Figure 1.
4. The values are non-negative. That is, the
values of χ² are greater than or equal to 0.
The Chi-Square Distribution
A goodness-of-fit test is an inferential
procedure used to determine whether a
frequency distribution follows a claimed
distribution.
Expected Counts
Suppose there are n independent trials of an
experiment with k ≥ 3 mutually exclusive possible
outcomes. Let p1 represent the probability of
observing the first outcome and E1 represent the
expected count of the first outcome, p2 represent
the probability of observing the second outcome
and E2 represent the expected count of the second
outcome, and so on. The expected count for
each possible outcome is given by

Ei = μi = npi for i = 1, 2, …, k
EXAMPLE Finding Expected Counts
A sociologist wishes to determine whether the distribution for
the number of years grandparents who are responsible for
their grandchildren is different today than it was in 2000.
According to the United States Census Bureau, in 2000,
22.8% of grandparents have been responsible for their
grandchildren less than 1 year; 23.9% of grandparents have
been responsible for their grandchildren 1 or 2 years; 17.6%
of grandparents have been responsible for their
grandchildren 3 or 4 years; and 35.7% of grandparents have
been responsible for their grandchildren for 5 or more years.
If the sociologist randomly selects 1,000 grandparents that
are responsible for their grandchildren, compute the
expected number within each category assuming the
distribution has not changed from 2000.
Test Statistic for Goodness-of-Fit Tests
Let Oi represent the observed count of category i,
Ei represent the expected count of category i, k
represent the number of categories, and n represent
the number of independent trials of an experiment.
Then,

χ² = Σi (Oi − Ei)² / Ei,  i = 1, 2, …, k

approximately follows the chi-square distribution
with k − 1 degrees of freedom provided (1) all
expected frequencies are greater than or equal to 1
(all Ei ≥ 1) and (2) no more than 20% of the
expected frequencies are less than 5. NOTE: Ei =
npi for i = 1, 2, ..., k.
The Chi-Square Goodness-of-Fit Test
If a claim is made regarding a distribution, we
can use the following steps to test the
claim provided
1. the data is randomly selected
2. all expected frequencies are greater than
or equal to 1.
3. no more than 20% of the expected
frequencies are less than 5.
Step 1: A claim is made regarding a
distribution. The claim is used to
determine the null and alternative
hypotheses.

H0: the random variable follows the
claimed distribution
H1: the random variable does not follow
the claimed distribution
Step 2: Calculate the expected frequencies for
each of the k categories. The expected
frequencies are npi for i = 1, 2, …, k where n is
the number of trials and pi is the probability of the
ith category assuming the null hypothesis is true.
Step 3: Verify the requirements for the
goodness-of-fit test are satisfied.
(1) all expected frequencies are greater
than or equal to 1 (all Ei ≥ 1)
(2) no more than 20% of the expected
frequencies are less than 5.
EXAMPLE Testing a Claim Using the Goodness-of-Fit
Test
A sociologist wishes to determine whether the distribution
for the number of years grandparents who are
responsible for their grandchildren is different today than
it was in 2000. According to the United States Census
Bureau, in 2000, 22.8% of grandparents have been
responsible for their grandchildren less than 1 year;
23.9% of grandparents have been responsible for their
grandchildren 1 or 2 years; 17.6% of grandparents have
been responsible for their grandchildren 3 or 4 years; and
35.7% of grandparents have been responsible for their
grandchildren for 5 or more years. The sociologist
randomly selects 1,000 grandparents that are responsible
for their grandchildren and obtains the following data.
Solution:
• Step 1. Construct the Hypothesis
• H0 : The distribution for the number of years
grandparents who are responsible for their
grandchildren is the same today as it was in 2000.

H1 : The distribution for the number of years


grandparents who are responsible for their
grandchildren is different today from what it was
in 2000.
• Step 2. Compute the expected counts for each
category, assuming that the null hypothesis is true.
Number of Years      Frequency (Oi)     Expected Frequency (Ei)
                     (observed count)   (expected count)
Less than 1 year     252                228
1 or 2 years         255                239
3 or 4 years         162                176
5 or more years      331                357
Solution (cont'd):
• Step 3. Verify that the requirements for the
goodness-of-fit test are satisfied:
1. All expected frequencies (expected counts)
are greater than or equal to 1.
2. No more than 20% of the expected
frequencies are less than 5.
• Step 4. Find the critical value and determine the
critical region.
α = 0.05, k = 4, degrees of freedom = k − 1 = 3
Looking in Table IV,
χ²α = 7.815
Critical region C = (7.815, ∞)
• Step 5. Compute the test statistic

χ² = (252 − 228)²/228 + (255 − 239)²/239
+ (162 − 176)²/176 + (331 − 357)²/357
= 6.605
• Step 6. Compare the test statistic with the critical value:
the test statistic < the critical value,
i.e., the test statistic does not lie in the critical region.
• Step 7. Conclusion:
There is not sufficient evidence at the α = 0.05 level of significance to reject
the null hypothesis, i.e., we retain the claim that the distribution for the
number of years grandparents are responsible for their grandchildren is the
same today as it was in 2000.
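The same test runs in a few lines of Python, as a cross-check of the hand calculation (scipy.stats.chisquare takes the observed and expected counts):

    from scipy.stats import chisquare

    observed = [252, 255, 162, 331]
    expected = [228, 239, 176, 357]   # 1000 x (0.228, 0.239, 0.176, 0.357)

    stat, p_value = chisquare(observed, f_exp=expected)
    print(stat, p_value)   # ≈ 6.605 and ≈ 0.086; p > 0.05, so do not reject H0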
CORRELATION
Correlation
„key concepts:
Types of correlation
Methods of studying correlation
a) Scatter diagram
b) Karl pearson’s coefficient of correlation
c) Spearman’s Rank correlation coefficient
d) Method of least squares
Correlation
„ Correlation: The degree of relationship between the
variables under consideration is measured through
correlation analysis.
„ The measure of correlation is called the correlation coefficient.
„ The degree of relationship is expressed by a coefficient which
ranges from −1 to +1 (−1 ≤ r ≤ +1).
„ The direction of change is indicated by a sign.
„ Correlation analysis enables us to have an idea about the
degree & direction of the relationship between the two
variables under study.
Correlation
„ Correlation is a statistical tool that helps
to measure and analyze the degree of
relationship between two variables.
„ Correlation analysis deals with the
association between two or more
variables.
Correlation & Causation
„ Causation means cause & effect relation.
„ Correlation denotes the interdependency among the
variables. For correlating two phenomena, it is essential
that the two phenomena have a cause-effect
relationship; if such a relationship does not exist then the
two phenomena cannot be correlated.
„ If two variables vary in such a way that movements in one
are accompanied by movements in the other, these variables
have a cause and effect relationship.
„ Causation always implies correlation but correlation does
not necessarily imply causation.
Types of Correlation
Type I

Correlation

Positive Correlation Negative Correlation


Types of Correlation Type I
„ Positive Correlation: The correlation is said to
be positive correlation if the values of two
variables changing with same direction.
Ex. Pub. Exp. & sales, Height & weight.
„ Negative Correlation: The correlation is said to
be negative correlation when the values of
variables change with opposite direction.
Ex. Price & qty. demanded.
Direction of the Correlation
„ Positive relationship – Variables change in the
same direction. Indicated by a (+) sign.
„ As X is increasing, Y is increasing
„ As X is decreasing, Y is decreasing
„ E.g., As height increases, so does weight.
„ Negative relationship – Variables change in
opposite directions. Indicated by a (−) sign.
„ As X is increasing, Y is decreasing
„ As X is decreasing, Y is increasing
„ E.g., As TV time increases, grades decrease
More examples
„ Positive relationships:
„ water consumption and temperature
„ study time and grades
„ Negative relationships:
„ alcohol consumption and driving ability
„ price & quantity demanded
Types of Correlation
Type II

Correlation

Simple Multiple

Partial Total
Types of Correlation Type II
„ Simple correlation: Under simple correlation,
only two variables are studied.
„ Multiple Correlation: Under Multiple
Correlation three or more than three variables
are studied. Ex. Qd = f ( P,PC, PS, t, y )
„ Partial correlation: analysis recognizes more
than two variables but considers only two
variables keeping the other constant.
„ Total correlation: is based on all the relevant
variables, which is normally not feasible.
Types of Correlation
Type III

Correlation

LINEAR NON LINEAR


Types of Correlation Type III
„ Linear correlation: Correlation is said to be linear
when the amount of change in one variable tends to
bear a constant ratio to the amount of change in the
other. The graph of the variables having a linear
relationship will form a straight line.
Ex X = 1, 2, 3, 4, 5, 6, 7, 8,
Y = 5, 7, 9, 11, 13, 15, 17, 19,
Y = 3 + 2x
„ Non Linear correlation: The correlation would be
non linear if the amount of change in one variable
does not bear a constant ratio to the amount of change
in the other variable.
Methods of Studying Correlation

„ Scatter Diagram Method


„ Graphic Method
„ Karl Pearson’s Coefficient of
Correlation
„ Method of Least Squares
Scatter Diagram Method

„ A scatter diagram is a graph of observed
plotted points where each point
represents the values of X & Y as a
coordinate. It portrays the relationship
between these two variables graphically.
A perfect positive correlation
[Scatterplot: the weights of persons A and B plotted against their
heights fall on a straight line, i.e., a linear relationship.]
High degree of positive correlation
„ Positive relationship, r = +.80 [scatterplot: weight vs. height]
Degree of correlation
„ Moderate positive correlation, r = +0.4 [scatterplot: shoe size vs. weight]
Degree of correlation
„ Perfect negative correlation, r = −1.0 [scatterplot: TV watching per week
vs. exam score]
Degree of correlation
„ Moderate negative correlation, r = −.80 [scatterplot: TV watching per
week vs. exam score]
Degree of correlation
„ Weak negative correlation, r = −0.2 [scatterplot: shoe size vs. weight]
Degree of correlation
„ No correlation (horizontal line), r = 0.0 [scatterplot: IQ vs. height]
Degree of correlation (r)
[Four scatterplot panels: r = +.80, r = +.60, r = +.40, r = +.20]
Advantages of Scatter Diagram
„ Simple & non-mathematical method
„ Not influenced by the size of extreme items
„ First step in investigating the relationship
between two variables
Disadvantage of scatter diagram
Cannot give an exact degree of
correlation
Karl Pearson's
Coefficient of Correlation
„ Pearson’s ‘r’ is the most common
correlation coefficient.
„ Karl Pearson’s Coefficient of Correlation
denoted by- ‘r’ The coefficient of
correlation ‘r’ measure the degree of
linear relationship between two variables
say x & y.
Karl Pearson's
Coefficient of Correlation
„ Karl Pearson's Coefficient of
Correlation is denoted by r,
−1 ≤ r ≤ +1
„ Degree of correlation is expressed by the
value of the coefficient
„ Direction of change is indicated by the sign,
(−ve) or (+ve)
Karl Pearson's
Coefficient of Correlation
„ When deviations are taken from the actual mean:
r(x, y) = Σxy / √(Σx² · Σy²)
„ When deviations are taken from an assumed mean:
r = [N Σdxdy − Σdx Σdy] /
[√(N Σdx² − (Σdx)²) · √(N Σdy² − (Σdy)²)]
Procedure for computing the
correlation coefficient
„ Calculate the means of the two series 'x' & 'y'.
„ Calculate the deviations of 'x' & 'y' in the two series from their
respective means.
„ Square each deviation of 'x' & 'y', then obtain the sums of
the squared deviations, i.e., Σx² & Σy².
„ Multiply each deviation under x by the corresponding deviation
under y & obtain the products 'xy'. Then obtain the sum of the
products of x, y, i.e., Σxy.
„ Substitute the values in the formula.
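The procedure translates directly into code. A minimal sketch in Python of the deviations-from-the-mean formula (the x and y values here are made up purely for illustration):

    import math

    # Pearson's r: r = Σxy / sqrt(Σx² · Σy²), with x and y taken as deviations.
    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 5]

    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [xi - mx for xi in x]
    dy = [yi - my for yi in y]

    sum_xy = sum(a * b for a, b in zip(dx, dy))
    r = sum_xy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    print(r)   # ≈ 0.77 for these made-up values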
Interpretation of Correlation
Coefficient (r)
„ The value of correlation coefficient ‘r’ ranges
from -1 to +1
„ If r = +1, then the correlation between the two
variables is said to be perfect and positive
„ If r = -1, then the correlation between the two
variables is said to be perfect and negative
„ If r = 0, then there exists no correlation between
the variables
Properties of Correlation coefficient
„ The correlation coefficient lies between −1 & +1,
symbolically (−1 ≤ r ≤ +1)
„ The correlation coefficient is independent of the
change of origin & scale.
„ The coefficient of correlation is the geometric mean of the
two regression coefficients:
r = √(bxy · byx)
„ If one regression coefficient is (+ve), the other regression
coefficient is also (+ve), and the correlation coefficient is (+ve)
Assumptions of Pearson’s
Correlation Coefficient
„ There is linear relationship between two
variables, i.e. when the two variables are
plotted on a scatter diagram a straight line
will be formed by the points.
„ Cause and effect relation exists between
different forces operating on the item of
the two variable series.
Advantages of Pearson’s Coefficient

„ It summarizes in one value both the
degree of correlation & the direction
of correlation.
Limitations of Pearson's Coefficient

„ Always assumes a linear relationship
„ Interpreting the value of r is difficult.
„ The value of the correlation coefficient is
affected by extreme values.
„ Time-consuming method
Coefficient of Determination
„ The convenient way of interpreting the value of
correlation coefficient is to use of square of
coefficient of correlation which is called
Coefficient of Determination.
„ The Coefficient of Determination = r2.
„ Suppose: r = 0.9, r2 = 0.81 this would mean that
81% of the variation in the dependent variable
has been explained by the independent variable.
Coefficient of Determination
„ The maximum value of r2 is 1 because it is
possible to explain all of the variation in y but it
is not possible to explain more than all of it.
„ Coefficient of Determination = Explained
variation / Total variation
Coefficient of Determination: An example
„ Suppose: r = 0.60 and r = 0.30. It does not mean that the first
correlation is twice as strong as the second;
'r' can be understood by computing the value of r².
When r = 0.60, r² = 0.36 -----(1)
When r = 0.30, r² = 0.09 -----(2)
This implies that in the first case 36% of the total
variation is explained, whereas in the second case
9% of the total variation is explained.
Spearman’s Rank Coefficient of
Correlation
„ When the variables under study in a statistical series
are not capable of quantitative measurement but can
be arranged in serial order, Pearson's correlation
coefficient cannot be used; in such a case Spearman's
rank correlation can be used.
„ R = 1 − (6 ΣD²) / [N(N² − 1)]
„ R = rank correlation coefficient
„ D = difference of ranks between paired items in the two series
„ N = total number of observations
Interpretation of Rank
Correlation Coefficient (R)
„ The value of rank correlation coefficient, R
ranges from -1 to +1
„ If R = +1, then there is complete agreement in
the order of the ranks and the ranks are in the
same direction
„ If R = -1, then there is complete agreement in
the order of the ranks and the ranks are in the
opposite direction
„ If R = 0, then there is no correlation
Rank Correlation Coefficient (R)
a) Problems where actual ranks are given:
1) Calculate the difference 'D' of the two ranks,
i.e., (R1 − R2).
2) Square the differences & calculate the sum of
the squared differences, i.e., ΣD²
3) Substitute the values obtained in the
formula.
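A minimal sketch of these steps in Python (the two rank series are made up purely for illustration):

    # Spearman's R = 1 - 6·ΣD² / (N(N² - 1)), with ranks already given.
    ranks1 = [1, 2, 3, 4, 5]
    ranks2 = [2, 1, 4, 3, 5]

    d2 = sum((a - b) ** 2 for a, b in zip(ranks1, ranks2))   # ΣD² = 4
    n = len(ranks1)
    R = 1 - 6 * d2 / (n * (n ** 2 - 1))                      # 1 - 24/120 = 0.8
    print(R)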
Rank Correlation Coefficient
b) Problems where ranks are not given: If the
ranks are not given, then we need to assign
ranks to the data series. The lowest value in the
series can be assigned rank 1 or the highest
value in the series can be assigned rank 1. We
need to follow the same scheme of ranking for
the other series.
Then calculate the rank correlation coefficient in
similar way as we do when the ranks are given.
Rank Correlation Coefficient (R)
„ Equal ranks or ties in ranks: In such cases
average ranks should be assigned to the tied
items, and the formula becomes
R = 1 − [6(ΣD² + AF)] / [N(N² − 1)]

AF = (1/12)(m1³ − m1) + (1/12)(m2³ − m2) + …, one term for each set of ties

m = the number of times an item is repeated
Merits Spearman’s Rank Correlation
„ This method is simpler to understand and easier
to apply compared to Karl Pearson's correlation
method.
„ This method is useful where we can give the
ranks and not the actual data (qualitative terms).
„ This method is used where the initial data are in
the form of ranks.
Limitation Spearman’s Correlation
„ Cannot be used for finding out correlation in a
grouped frequency distribution.
„ This method should not be applied where N
exceeds 30.
Advantages of Correlation studies
„ Show the amount (strength) of relationship
present
„ Can be used to make predictions about the
variables under study
„ Can be used in many places, including natural
settings, libraries, etc.
„ Easier to collect correlational data
Regression Analysis
„ Regression Analysis is a very
powerful tool in the field of statistical
analysis in predicting the value of one
variable, given the value of another
variable, when those variables are
related to each other.
Regression Analysis
„ Regression Analysis is mathematical measure of
average relationship between two or more
variables.
„ Regression analysis is a statistical tool used in
prediction of value of unknown variable from
known variable.
Advantages of Regression Analysis
„ Regression analysis provides estimates of
values of the dependent variables from the
values of independent variables.
„ Regression analysis also helps to obtain a
measure of the error involved in using the
regression line as a basis for estimation.
„ Regression analysis helps in obtaining a
measure of the degree of association or
correlation that exists between the two variables.
Assumptions in Regression Analysis
„ Existence of actual linear relationship.
„ The regression analysis is used to estimate the
values within the range for which it is valid.
„ The relationship between the dependent and
independent variables remains the same till the
regression equation is calculated.
„ The dependent variable takes any random value but
the values of the independent variables are fixed.
„ In regression, we have only one dependent variable
in our estimating equation. However, we can use
more than one independent variable.
Regression line
„ Regression line is the line which gives the best
estimate of one variable from the value of any
other given variable.
„ The regression line gives the average
relationship between the two variables in
mathematical form.
„ The Regression would have the following
properties: a) ∑( Y – Yc ) = 0 and
b) ∑( Y – Yc )2 = Minimum
Regression line
„ For two variables X and Y, there are always two
lines of regression –
„ Regression line of X on Y : gives the best
estimate for the value of X for any specific
given values of Y
„ X=a+bY a = X - intercept
„ b = Slope of the line
„ X = Dependent variable
„ Y = Independent variable
Regression line
„ For two variables X and Y, there are always two
lines of regression –
„ Regression line of Y on X : gives the best
estimate for the value of Y for any specific given
values of X
„ Y = a + bx a = Y - intercept
„ b = Slope of the line
„ Y = Dependent variable
„ x= Independent variable
The Explanation of Regression Line
„ In case of perfect correlation (positive or
negative) the two lines of regression coincide.
„ If the two regression lines are far from each other, the
degree of correlation is less, & vice versa.
„ The mean values of X & Y can be obtained as
the point of intersection of the two regression
lines.
„ The higher the degree of correlation between the
variables, the smaller the angle between the lines,
& vice versa.
Regression Equation / Line
& Method of Least Squares
„ Regression Equation of y on x
Y = a + bx
In order to obtain the values of ‘a’ & ‘b’
∑y = na + b∑x
∑xy = a∑x + b∑x2
„ Regression Equation of x on y
X = c + dy
In order to obtain the values of ‘c’ & ‘d’
∑x = nc + d∑y
∑xy = c∑y + d∑y2
Regression Equation / Line when
Deviation taken from Arithmetic Mean
„ Regression equation of y on x:
Y = a + bx
In order to obtain the values of 'a' & 'b':
a = Ȳ − bX̄,  b = Σxy / Σx²
„ Regression equation of x on y:
X = c + dy
c = X̄ − dȲ,  d = Σxy / Σy²
Regression Equation / Line when
Deviations taken from Arithmetic Mean
„ Regression equation of y on x:
Y − Ȳ = byx (X − X̄)
byx = Σxy / Σx²
byx = r (σy / σx)
„ Regression equation of x on y:
X − X̄ = bxy (Y − Ȳ)
bxy = Σxy / Σy²
bxy = r (σx / σy)
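A minimal sketch in Python of the y-on-x fit using the deviation formulas above (the same small made-up data as in the correlation sketch):

    # Least-squares line of y on x: byx = Σxy / Σx² (deviations), a = Ȳ - byx·X̄
    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 5]

    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))   # 6
    sxx = sum((xi - mx) ** 2 for xi in x)                      # 10

    b = sxy / sxx        # slope byx = 0.6
    a = my - b * mx      # intercept = 4 - 0.6*3 = 2.2
    print(a, b)          # the fitted line is Y = 2.2 + 0.6x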
Properties of the Regression Coefficients
„ The coefficient of correlation is the geometric mean of the two
regression coefficients: r = √(byx · bxy)
„ If byx is positive then bxy should also be positive & vice
versa.
„ If one regression coefficient is greater than one, the other
must be less than one.
„ The coefficient of correlation will have the same sign as
the regression coefficients.
„ The arithmetic mean of byx & bxy is equal to or greater than
the coefficient of correlation: (byx + bxy)/2 ≥ r
„ Regression coefficients are independent of origin but not of
scale.
Standard Error of Estimate.
„ Standard error of estimate is the measure of
variation around the computed regression line.
„ Standard error of estimate (SE) of Y measures the
variability of the observed values of Y around the
regression line.
„ Standard error of estimate gives us a measure of the
scatter of the observations about the line of regression.
Standard Error of Estimate.
„ Standard error of estimate of Y on X is:
S.E. of Y on X (SExy) = √[Σ(Y − Ye)² / (n − 2)]
Y = observed value of y
Ye = estimated value from the estimating equation that corresponds
to each y value
e = the error term (Y − Ye)
n = number of observations in the sample

„ The convenient formula:
(SExy) = √[(ΣY² − aΣY − bΣXY) / (n − 2)]
X = value of the independent variable
Y = value of the dependent variable
a = Y intercept
b = slope of the estimating equation
n = number of data points
Correlation analysis vs.
Regression analysis.
„ Regression is the average relationship between two
variables.
„ Correlation need not imply a cause & effect
relationship between the variables under study, whereas
regression analysis clearly indicates the cause and effect
relationship between the variables.
„ There may be nonsense correlation between two
variables, but there is no such thing as nonsense
regression.
What is regression?
„ Fitting a line to the data using an equation in order
to describe and predict data
„ Simple Regression
„ Uses just 2 variables (X and Y)
„ Other: Multiple Regression (one Y and many X’s)
„ Linear Regression
„ Fits data to a straight line
„ Other: Curvilinear Regression (curved line)
We’re doing: Simple, Linear Regression
From Geometry:
„ Any line can be described by an equation
„ For any point on a line for X, there will be a
corresponding Y
„ the equation for this is y = mx + b
„ m is the slope, b is the Y-intercept (when X = 0)
„ Slope = change in Y per unit change in X
„ Y-intercept = where the line crosses the Y axis
(when X = 0)
Regression equation
„ Find a line that fits the data best, i.e., find a
line that minimizes the distance from all the
data points to that line
„ Regression equation: Ŷ (Y-hat) = bX + a
„ Ŷ (Y-hat) is the predicted value of Y given a
certain X
„ b is the slope
„ a is the y-intercept
Regression Equation:
Ŷ = .823X − 4.239
We can predict a Y score from an X by
plugging a value for X into the equation and
calculating Ŷ.
What would we expect a person to get on quiz
#4 if they got a 12.5 on quiz #3?

Ŷ = .823(12.5) − 4.239 = 6.049
Chapter 5: Regression

'Regression' (Latin) means 'retreat',
'going back to', 'stepping back'.
In a 'regression' we try to (stepwise)
retreat from our data and explain
them with one or more explanatory
predictor variables. We draw a
'regression line' that serves as the
(linear) model of our observed data.
Correlation vs. regression
z Correlation
z In a correlation, we look at the
relationship between two variables
without knowing the direction of causality.
z Regression
z In a regression, we try to predict the
outcome of one variable from one or
more predictor variables. Thus, the
direction of causality can be established.
z 1 predictor = simple regression
z >1 predictor = multiple regression
Correlation vs. regression
Correlation:
For a correlation you do not need to know anything
about the possible relation between the two variables.
Many variables correlate with each other for unknown
reasons.
Correlation underlies regression but is descriptive only.

Regression:
For a regression you do want to find out about those
relations between variables, in particular, whether one
'causes' the other. Therefore, an unambiguous causal
template has to be established between the causer and
the causee before the analysis! This template is inferential.
Regression is THE statistical method underlying ALL
inferential statistics (t-test, ANOVA, etc.). All that follows
is a variation of regression.
Linear regression
Independent and dependent variables
In a regression, the predictor variables are
labelled 'independent' variables. They predict
the outcome variable labelled 'dependent'
variable.

A regression in SPSS is always a linear
regression, i.e., a straight line represents the
data as a model.
Method of least squares
In order to know which line to choose as the best
model of a given data cloud, the method of least
squares is used. We select the line for which the
sum of all squared deviations (SS) of all data
points is lowest. This line is labelled 'line of best
fit', or 'regression line'.
Regression line
Simple regression: regression coefficients
(In mathematics, a coefficient is a constant multiplicative
factor of a certain object. For example, the coefficient in
9x² is 9. http://en.wikipedia.org/wiki/Coefficient)

The linear regression equation (5.2) is:

Yi = (b0 + b1Xi) + εi

Yi = outcome we want to predict
b0 = intercept of the regression line  (b0 and b1 are the regression coefficients)
b1 = slope of the regression line
Xi = score of subject i on the predictor variable
εi = residual term, error
Slope/gradient and intercept

z Slope/gradient: steepness of the line;
negative or positive
z Intercept: where the line crosses the y-axis

Yi = (−4 + 1.33Xi) + εi
'goodness-of-fit'

The line of best fit (regression line) is compared


with the most basic model. The former should be
significantly better than the latter. The most basic
model is the mean of the data.
Relation between tobacco and alcohol consumption
[scatterplot illustrating sums of squares]
Mean of Y as basic model
The summed squared differences between the observed
values Yi and the mean Ȳ (the sum of squares total, SST)
are big; hence the mean is not a good model of the data.

Regression line as a model
The summed squared differences between the observed
values and the regression line (the sum of squares
residual, SSR) are smaller; hence this regression line is a
much better model of the data.

Model
SSM: the sum of squared differences between the mean
of Y, Ȳ, and the regression line (as our model), i.e., the
sum of squares model.
Comparing the basic model and the
regression model: R²

The improvement by the regression model
can be expressed by dividing the sum of
squares of the regression model, SSM, by the
sum of squares of the basic model, SST:

R² = SSM / SST

The basic comparison in statistics is always
to compare the amount of variance that our
model can explain with the total amount of
variation there is. If the model is good it can
explain a significant proportion of this overall
variance.

This is the same measure as the R² in chapter 4 on
correlation. Take the square root of R² and you have the
Pearson correlation coefficient r!
Comparing the basic model and the regression model: F-test

In the F-test, the ratio of the improvement due to the model (SSM) to the difference between the model and the observed data (SSR) is calculated. We take the mean sums of squares, or mean squares (MS), for the model (MSM) and for the observed data (MSR):

F = MSM / MSR

The F-ratio should be high, since the model should have improved the prediction considerably (as expressed in MSM), while MSR, the difference between the model and the observed data (the residual), should be small.
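As a sketch of these quantities, continuing the hypothetical data above (the names are illustrative, not SPSS output):

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
    b1, b0 = np.polyfit(X, Y, 1)            # least-squares line
    Y_hat = b0 + b1 * X

    SST = np.sum((Y - Y.mean()) ** 2)       # total: deviations from the basic model (the mean)
    SSR = np.sum((Y - Y_hat) ** 2)          # residual: deviations from the regression line
    SSM = SST - SSR                         # model: improvement due to the regression

    R2 = SSM / SST                          # proportion of variance explained
    F = (SSM / 1) / (SSR / (len(Y) - 2))    # MSM / MSR with df = 1 and n - 2
    print(R2, F)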
The coefficient of a predictor

The coefficient of the predictor X is b1. b1 indicates the gradient/slope of the regression line: it says how much Y changes when X is changed by one unit. In a good model, b1 should always be different from 0, since the slope is either positive or negative. Only a bad model, i.e., the basic model of the mean, has a slope of 0. If b1 = 0, this means:

• a change of one unit in the predictor X does not change the predicted variable Y
• the gradient of the regression line is 0
t-test of the coefficient of the predictor

A good predictor variable should have a b1 that is different from 0 (the regression coefficient of the basic model, the mean). Whether this difference is significant can be tested by a t-test: the b of the expected values (under the null hypothesis, i.e., 0) is subtracted from the b of the observed values and divided by the standard error of b.

t = (b_observed − b_expected) / SE_b

Since b_expected = 0:

t = b_observed / SE_b

t should be significantly different from 0.
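A sketch of this t-test in Python (scipy assumed available), using the exact ADVERTS coefficient and standard error reported in the SPSS output further below:

    from scipy import stats

    b_observed = 0.09612448597388   # slope for ADVERTS (exact value from the output below)
    se_b = 0.00963236621523         # its standard error
    t = (b_observed - 0) / se_b     # b_expected = 0 under H0 --> t = 9.979
    p = 2 * stats.t.sf(abs(t), df=198)   # two-tailed p, df = 200 - 2
    print(t, p)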
Simple regression on SPSS
(using the Record1.sav data)
Descriptive glance: scatterplot of the correlation between advertisement and record sales

Graphs --> Interactive --> Scatterplot

Under 'Fit', tick 'include constant' and 'fit line to total'.
Comparing the mean and the regression model
(using the Record1.sav data)

Graphs --> Interactive --> Scatterplot

Under 'Fit', tick 'mean'.

--> The regression line is quite different from the mean.
Simple regression on SPSS
(using the Record1.sav data)

Analyze --> Regression --> Linear

Predictor: how much money (in 1000s) you spend on advertisement.
What you want to predict: number of records (in 1000s) sold.
Output of simple regression on SPSS
(using the Record1.sav data)
Analyze --> Regression --> Linear

R is the simple Pearson correlation between 'advertisement' and 'records sold'; R² is the amount of explained variance.

R² = .335: 33.5% of the total variance can be explained by the predictor 'advertisement'; 66.5% of the variance cannot be explained.

ANOVA for the SSM (F-test): advertisement predicts sales significantly.

F = MSM / MSR = 433687.833 / 4354.87 = 99.587

The ANOVA table reports SSM, SSR, SST (sum of squares total), MSM and MSR.
The coefficients table gives the regression coefficients b0 and b1:

• b0 (intercept, where the regression line crosses the Y axis) = 134.140: when no money is spent (X = 0), 134,140 records are sold.
• b1 (gradient) = .09612: if the predictor X is increased by 1 unit (1000), then 96.12 extra records will be sold.
• t = B / SE_B, e.g. for the constant: 134.14 / 7.537 = 17.799.
A closer look at the t-values

The equation for computing the t-value is t = B / SE_B.

For the constant: 134.14 / 7.537 = 17.799.
For ADVERTS: B = 0.09612 and SE_B = .010, so B / SE_B should result in 9.612; however, t = 9.979.

What's wrong? Nothing, this is a rounding error. If you double-click on the output table 'Coefficients', more exact numbers are shown:

9.612E-02 = 0.09612448597388
.010 = 0.00963236621523

If you re-compute the equation with these numbers, the result is correct:

0.09612448597388 / 0.00963236621523 = 9.979
Using the model for prediction

Imagine the record company wants to spend 100,000 £ on advertisement. Using equation (5.2), we can fill in the values of b0 and b1:

Yi = b0 + b1Xi = 134.14 + (.09612 × Advertising Budget_i)

Example: if 100,000 £ are spent on ads, 134.14 + (.09612 × 100) = 143.75, so about 144,000 records should be sold in the first week. Is that a good deal?
Multiple regression

In a multiple regression, we predict the outcome of a dependent variable Y by a linear combination of more than one independent predictor variable Xi:

Outcome_i = (Model_i) + error_i

Every variable has its own coefficient b1, b2, ..., bn:

(5.9)  Yi = (b0 + b1X1 + b2X2 + ... + bnXn) + εi

b1X1 = 1st predictor variable with its coefficient
b2X2 = 2nd predictor variable with its coefficient, etc.
εi = residual term
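A minimal sketch of equation (5.9) with two made-up predictors (the data and coefficients below are invented for illustration; in the course itself this is done via Analyze --> Regression --> Linear):

    import numpy as np

    rng = np.random.default_rng(0)
    X1 = rng.uniform(0, 100, 50)     # e.g. advertisement budget
    X2 = rng.uniform(0, 40, 50)      # e.g. plays on radio
    Y = 20 + 0.1 * X1 + 3.0 * X2 + rng.normal(0, 5, 50)

    # Design matrix with a leading column of 1s for the intercept b0
    D = np.column_stack([np.ones_like(X1), X1, X2])
    b, *_ = np.linalg.lstsq(D, Y, rcond=None)   # b = [b0, b1, b2]
    print(b)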
Multiple Regression on SPSS
using file record2.sav

We want to predict record sales (Y) by two


predictors:
X1 = advertisement budget
X2 = number of plays on Radio 1

Record Salesi = b0 + b1Adi + b2Playi + εi

Instead of a regression line, a regression plane (2 dimensions) is now fitted to the data (3 dimensions).
3D scatterplot of the relation between record sales (Y), advertisement budget (X1), and number of plays on Radio 1 per week (X2):

Graphs --> Interactive --> Scatterplot --> 3D

Multiple regression with 2 variables can be visualized as a 3D scatterplot. More variables cannot be accommodated visually.
Regression planes and confidence intervals of multiple regression

Under the menu 'Fit', specify the corresponding options in the 3D scatterplot. If adjusted appropriately, you can see the regression plane and the confidence planes almost like lines. The regression planes are chosen so as to cover most of the data points in the three-dimensional data cloud.
Sum of squares, R, R2
The terms we encountered for simple regression,
SST, SSR, SSM, still mean the same, but are more
complicated to compute now.

Instead of the simple correlation coefficient R, we use a multiple correlation coefficient, Multiple R. Multiple R is the correlation between the predicted and observed values of the outcome. As with simple R, Multiple R should be large. Multiple R² is a measure of the variance of Y explained by the predictor variables X1-Xn.
Methods of regression
The predictors of the model should be selected
carefully, e.g., based on past research or
theoretically well motivated.
• Hierarchical method (ordered entry): first, known predictors are entered, then new ones, either blockwise (all together) or stepwise.
• Forced entry ('Enter'): all predictors are forced into the model simultaneously.
• Stepwise methods: Forward: predictors are introduced one by one, according to their predictive power. Stepwise: same as Forward, plus a removal test. Backward: predictors are judged against a removal criterion and eliminated accordingly.
How to choose one's predictors

• Based on the theoretical literature, choose predictors in their order of importance. Do not choose too many.
• Run an initial multiple regression.
• Eliminate useless predictors.
• Take ca. n = 15 subjects per predictor.
Evaluating the model

1. The model must fit the data sample.
2. The model should generalize beyond the sample.
Evaluating the model - diagnostics
1. Fitting the observed data (Analyze --> Regression --> Linear; under 'Save', specify the required diagnostics):
- Check for outliers, which bias the model and enlarge the residual.
- Look at standardized residuals (z-scores): if more than 1% lie outside the margins of +/- 2.58, the model is poor.
- Look at studentized residuals (unstandardized residuals divided by an SD that varies point by point); these yield a more exact estimate of the error variance.

Note: SPSS adds the computed scores as new columns in the data file.
Evaluating the model - diagnostics
- continued
• Identify influential cases and see how the model changes if they are excluded. This is done by running the regression without that particular case and then using the new model to predict the value of the just-excluded case (its 'adjusted predicted value'). If the case is similar to all other cases, its adjusted predicted value will not differ much from its predicted value given the model including it.
Evaluating the model - continued
• DFBeta: a measure of the influence of a case on the values of bi.
• DFFit: "...difference between the adjusted predicted value and the original predicted value of a particular case." (Field 2005, 729)
• Deleted residual: residual based on the adjusted predicted value; "...the difference between the adjusted predicted value for a case and the original observed value for that case." (Field 2005, 728)
• A way of standardizing the deleted residual is to divide it by its SD --> studentized deleted residual.
Evaluating the model
- continued
• Identify influential cases and see how the model changes if they are excluded.
• Cook's distance measures the influence of a case on the overall model's ability to predict all cases.
• Leverage estimates "the influence of the observed value of the outcome variable over the predicted values." (Field 2005, 736) Leverage values lie between 0 and 1 and may be used to define cut-off points for excluding influential cases.
• Mahalanobis distances measure the distance of cases from the means of the predictor variables.
Example for using DFBeta as an
indicator of an 'influential case'
using file dfbeta.sav
• Run a simple regression with all data (including the outlier, case 30):

Analyze --> Regression --> Linear

Specify what you want to predict (dependent) and your predictor (independent).
Example for using DFBeta as an
indicator of an 'influential case'
using file dfbeta.sav
• All data (including outlier, case 30): b0 = 29; b1 = -.90
• Case 30 removed (with Data --> Select cases --> use filter variable): b0 = 31; b1 = -1

--> Both regression coefficients, b0 (constant/intercept) and b1 (gradient/slope), changed!
Example for using DFBeta as an
indicator of an 'influential case'
using file dfbeta.sav

The DFBeta of the constant (dfb0) and of the predictor x (dfb1) are much higher for case 30 than for the other cases.
Summary of both calculations (scatterplots for both samples: with case 30, the outlier, and without case 30)

Parameter (b)   + case 30        - case 30        Difference
Constant (b0)   29.00            31.00            -2.00
Gradient (b1)   -.90             -1               .10
Model           Y = (-.9)X + 29  Y = (-1)X + 31
Predicted Y     28.10            30               -1.90
DFBetas, DFFit, CVR's
All of the following measures capture the difference between a model including and one excluding influential cases:

• Standardized DFBeta: the difference between a parameter estimated using all cases and estimated when one case is excluded, e.g. the DFBetas of the parameters b0 and b1.
• Standardized DFFit: the difference between the predicted value for a case in a model including vs. a model excluding this case.
• Covariance ratio (CVR): a measure of whether a case influences the variance of the regression parameters. This ratio should be close to 1.
Help window, topic index 'Linear Regression': window 'Save new variables'

"I find it hard to remember what all those influence statistics mean..."
"Why don't you look them up in the Help window?"
Residuals and influence statistics (using the file pubs.sav)

The correlation between the number of pubs in London districts and deaths, with and without the outlier. Scatterplot of both variables: Graphs --> Interactive --> Scatterplot.

Note: the residual for the outlier, fitted to the regression line including it, is small. However, its influence statistics are huge. Why? The outlier is the 'City of London' district, where there are a lot of pubs but only few residents. The ones drinking in those pubs are visitors; hence the ratio of deaths of citizens, given the overall consumption of alcohol, is relatively low.
Case summary: 8 London districts

Case   St. Res.   Leverage   St. DFFIT    St. DFB Intercept   St. DFB Pubs
1      -1.34      0.04       -0.74        -0.74               0.37
2      -0.88      0.03       -0.41        -0.41               0.18
3      -0.42      0.02       -0.18        -0.17               0.07
4      0.04       0.02       0.02         0.02                -0.01
5      0.5        0.01       0.2          0.19                -0.06
6      0.96       0.01       0.4          0.38                -0.1
7      1.42       0          0.68         0.63                -0.12
8      -0.28      0.86       -4.60E+08    92676016            -4.30E+08

The residual of the outlier, case 8, is small because it actually sits very close to the regression line; its influence statistics, however, are huge!
Excluding the outlier
(pubs.sav)
If you create a variable "num_dist" (number of the district) in the variables list of the pubs.sav file and simply allocate a number to each district (1-8), you can use this variable to exclude the problematic district 8:

Data --> Select cases --> If condition is satisfied --> num_dist ~= 8
Excluding the outlier – continued
(pubs.sav)
Look at the scatterplot again now that district 8 has been excluded (Graphs --> Interactive --> Scatterplot): the 7 remaining districts all line up perfectly on the (idealized) regression line.
Will our sample regression
generalize to the population?
If we want to generalize our findings from one sample to the population, we have to check some assumptions:

• Variable types: predictor variables must be quantitative (interval) or categorical (binary); the outcome variable must be quantitative, continuous and unbounded (the whole range must be instantiated).
• Non-zero variance of predictors.
• No perfect correlation between 2 or more predictors.
• Predictors are uncorrelated with any 'third variable' which was not included in the regression.
• All levels of the predictor variables should have the same variance.
Will our sample regression
generalize to the population?
- continued

• Independent errors: the residual terms of any two observations should be uncorrelated (Durbin-Watson test).
• Residuals should be normally distributed.
• All of the values of the outcome variable are independent.
• Predictors and outcome have a linear relation.

If these assumptions are not met, we cannot draw valid conclusions from our model!
Two methods for the cross-
validation of the model
If our model is generalizable, it should be able to predict the outcome of a different sample.

• Adjusted R²: indicates the loss of predictive power (shrinkage) if the model were applied to the population. By Stein's formula:

adj R² = 1 − [(n−1)/(n−k−1)] × [(n−2)/(n−k−2)] × [(n+1)/n] × (1−R²)

R² = unadjusted value
n = number of cases
k = number of predictors in the model

• Data splitting: the entire sample is split into two. Regressions are computed and compared for both halves. A nice method, but one rarely has that many data.
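A sketch of Stein's formula as reconstructed above; the values plugged in are those reported later for the record-sales model (R² = .665, n = 200, k = 3):

    def stein_adjusted_r2(r2, n, k):
        # r2: unadjusted R^2, n: number of cases, k: number of predictors
        shrink = ((n - 1) / (n - k - 1)) * ((n - 2) / (n - k - 2)) * ((n + 1) / n)
        return 1 - shrink * (1 - r2)

    print(stein_adjusted_r2(0.665, 200, 3))   # about .65, i.e. little shrinkage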
Sample size
The required sample size for a regression
depends on

• the number of predictors k
• the size of the effect
• the size of the statistical power

e.g.,
large effect --> n = 80 (for up to 20 predictors)
medium effect --> n = 200
small effect --> n = 600
(Multi-)Collinearity
If ≥ 2 predictors are inter-correlated, we speak of
collinearity. In the worst case, 2 variables have a
correlation of 1. This is bad for a regression, since
the regression cannot be computed reliably
anymore. This is because the variables become
interchangeable.
High collinearity is rare, but some degree of
collinearity is always around.
Problems with collinearity:
• It underestimates the variance of a second variable if this variable is strongly intercorrelated with the first variable: it adds little unique variance although, taken by itself, it would explain a lot.
• We can't decide which variable is important and should be included.
• The regression coefficients (b-values) become unstable.
How to deal with collinearity

SPSS has some collinearity diagnostics, found in the 'Statistics' window of the 'Linear Regression' menu:
• Variance inflation factor (VIF)
• Tolerance statistics
• ...
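SPSS computes these for you; purely as a sketch of what the variance inflation factor measures (each predictor is regressed on the remaining ones, and VIF_j = 1 / (1 − R_j²)):

    import numpy as np

    def vif(X):
        # Variance inflation factor for each column of predictor matrix X (a sketch)
        X = np.asarray(X, dtype=float)
        n, k = X.shape
        out = []
        for j in range(k):
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
            resid = X[:, j] - others @ coef
            r2 = 1 - resid.var() / X[:, j].var()
            out.append(1.0 / (1.0 - r2))
        return out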
Multiple Regression on SPSS
(using the file Record2.sav)
Example: Predicting the record sales from 3
predictors:
• X1: advertisement budget
• X2: times played on radio
• X3: attractiveness of the band

Since we already know that money for ads is a predictor, it will be entered into the regression first (1st block: variable 1), and the 2 new predictors later (2nd block: variables 2 and 3) --> hierarchical method ('Enter').
What the 'Statistics' box should look like: Analyze --> Regression --> Linear

Regression plots

Plotting *ZRESID (standardized residuals = errors) against *ZPRED (standardized predicted values) helps us determine whether the assumptions of random errors and homoscedasticity (equal variances) are met. Heteroscedasticity occurs when the residuals at each level of the predictor variables have unequal variances.
Regression diagnostics

The regression diagnostics are saved in the data file, each as a separate variable in a new column.

Options: leave them as they are.
Interpreting Multiple Regression

The 'Descriptives' give you a brief summary of the variables.

Correlations: Pearson R's between all variables and significance levels: the R of each of predictors 1, 2, 3 with the outcome, and of each predictor with the others. Predictor 2 (plays on radio) is the best predictor. Predictors should not correlate higher than .9 (collinearity).
Summary of model

• R: the correlation between the predictor(s) and the outcome. Model 1 has only advertisement as predictor; Model 2 has 3 predictors.
• R²: the explained variance by the predictor(s); it changes from 0 to .335 (Model 1), with another change of .330 (Model 2).
• Adjusted R²: how well the model generalizes; values similar to R² are good (here only 5% shrinkage).
• R² change, with F-values and degrees of freedom (df1: p − 1; df2: N − p − 1; N = sample size, p = number of predictors): the model(s) bring about a significant change.
• Durbin-Watson: checks whether the errors are independent; if the value is close to 2, then OK.
ANOVA for the model against the basic model (the mean)

• df total: equal to the number of cases minus 1 (200 − 1 = 199).
• df model: equal to the number of predictors.
• df residual: equal to the number of cases minus the number of coefficients (b0, b1): 200 − 2 = 198.
• Mean squares: SS/df, e.g. 433687.8 / 1 = 433687.8 and 862264.2 / 198 = 4354.87.
• F-values (MSM/MSR): Model 1: 433687.833 / 4354.87 = 99.587; Model 2: 287125.806 / 2217.217 = 129.498.
• Significance level: both Model 1 and Model 2 have improved the prediction significantly; Model 2 (3 predictors) even better than Model 1 (1 predictor).
(1 predictor)
Record sales increase
by .511 SD's when
the predictor (ads)
changes 1 SD;
Model parameters
With 95% confidence the b-values
b1 and b2 have equal 'gains' lie within these boundaries
Tight boundaries are good
Model 1= same
as in first
analysis

b0
b1 *
b2
b3

Pearson Corr of
predictor x outcome
controlled for each single
other predictor
Pearson Corr of
The 'Coefficients' table tells us the predictor x outcome
individual contribution of variables to the controlled for all
other predictor
regression model. The Standardized Beta's 'unique relationship'
tell us the importance of each predictor
Excluded variables

SPSS gives a summary of those predictors that were not entered in the model (here only for Model 1) and evaluates the contribution of the excluded variables: what contribution would this predictor have made to a model containing it?
Regression equation for Model 2 (including all 3 predictor variables)

Sales_i = b0 + b1·Advertising_i + b2·Airplay_i + b3·Attractiveness_i
        = -26.61 + (0.08·Ad_i) + (3.37·Airplay_i) + (11.09·Attract_i)

Interpretation: if Ad increases 1 unit, sales increase .08 units; if airplay increases 1 unit, sales increase 3.37 units; if attractiveness increases 1 unit, sales increase 11.09 units, each independent of the contributions of the other predictors.
No Multicollinearity
(In this regression, variables are not closely linearly related.)

Each predictor's variance proportions load highly on a different dimension (eigenvalue) --> they are not intercorrelated, hence no collinearity.
Casewise diagnostics

The casewise diagnostics list cases that lie outside the boundaries of 2 SD. In the z-distribution, only 5% of cases should be beyond 1.96 SD and only 1% beyond 2.58 SD. Case 169 deviates most and needs to be followed up.
Following up influential cases with 'Case summaries' --> everything OK:
• no DFBetas > 1 (all OK)
• leverage values < .06 (all OK)
• Cook's distances < 1 (all OK)
• Mahalanobis distances < 15 (all OK)
Identify influential cases by the case summary

• In the standardized residuals, no more than 5% may have absolute values exceeding 2, and no more than 1% exceeding 3.
• Cook's distances > 1 might pose a problem.
• Leverage: values must not be more than two or three times the average leverage ((number of predictors + 1) / sample size).
• Mahalanobis distance: cases with values > 25 in large samples (n = 500) and > 15 in small samples (n = 100) can be problematic.
• Absolute values of DFBeta should not exceed 1.
• Determine the upper and lower limits of the covariance ratio (CVR): upper limit = 1 + 3(average leverage); lower limit = 1 − 3(average leverage).
Checking assumptions: heteroscedasticity

(Heteroscedasticity: the residuals (errors) at each level of the predictor have different variances.) Here the variances are equal: in the plot of standardized residuals (*ZRESID) against standardized predicted values (*ZPRED), the points are randomly and evenly dispersed --> the assumptions of linearity and homoscedasticity are met.
Checking assumptions
Normality of residuals

The distribution of the residuals is normal (left-hand picture), and the observed probabilities correspond to the expected ones (right-hand side).

The Kolmogorov-Smirnov test for the standardized residuals is n.s. --> normal distribution. Boxplots, too, show the normality (note the 3 outliers!).
Checking assumptions
Partial Regression Plots

Scatterplots of the residuals of the outcome variable and each of the predictors separately. No indication of outliers; an evenly spaced-out cloud of dots (only the residual variance of 'attractiveness of band' seems to be uneven).
EPIDEMIOLOGIC MEASURES:
INCIDENCE & PREVALENCE

Measuring Epidemiological Outcomes

• Ratio: a relationship between any two numbers (e.g. males / females)
• Proportion: a ratio where the numerator is included in the denominator (e.g. males / total births)
• Rate: a proportion with the specification of time (e.g. deaths in 2000 / population in 2000)
Definitions

• Incidence is the rate of new cases of a disease or condition in a population at risk during a time period.
• Prevalence is the proportion of the population affected.
Incidence = (Number of new cases during a time period) / (Population at risk during that time period)

• Incidence is a rate
• Calculated for a given time period (time interval)
• Reflects risk of disease or condition
Prevalence = (Number of existing cases) / (Total number in the population at risk)

• Prevalence is a proportion
• Point prevalence: at a particular instant in time
• Period prevalence: during a particular interval of time (existing cases + new cases)
Prevalence = Incidence × Duration

Prevalence depends on the rate of occurrence (incidence) AND the duration or persistence of the disease. At any point in time:
• More new cases (increased risk) yields more existing cases
• Slow recovery or slow progression increases the number of affected individuals
The population perspective requires measuring disease in populations

• Science is built on classification and measurement.
• Reality is infinitely detailed, infinitely complex.
• Classification and measurement seek to capture the essential attributes.

Measurement "captures" the phenomenon

Classification and measurement are based on:
1. Objective of the classification
2. Conceptual model (understanding of the phenomenon)
3. Availability of data (technology)
An example population (N=200)

[population diagram: 200 individuals, affected cases marked O]
How can we quantify disease in populations?

[series of population diagrams: cases appear one by one, marked O, in the population of 200]

How can we quantify the frequency?
Rate of occurrence of new cases per unit time (e.g., 1 per month)

[population diagram: cumulative cases marked O in the population of 200]
Month-by-month build-up in the example population:
• 1 new case in month 1
• 1 new case in month 2
• 1 new case in month 3, for a total of 3 cases
• 2 new cases in month 4
• 1 new case in month 5 (total = 6)
• 1 case in month 6
• 1 new case in month 7
• 2 new cases in month 8
• 2 cases in month 9

Rate of occurrence of new cases during 9 months: 1 case/month to 2 cases/month.

[population diagrams omitted: each slide marks the cumulative cases with O in the population of 200]
Number of cases depends on length of interval

Divide by the length of the time interval, so we can compare across intervals:

Rate of new cases = (Number of new cases) / (Time interval)
                  = 12 cases / 9 months = 1.33 cases / month
Number of cases depends on population size

So, divide by population and time:

Incidence rate = (Number of new cases) / (Population-time)
How to estimate population-time?

Population at risk: the people eligible to become a case and to be counted as one. In this example that population declines as each case occurs. So estimate population-time as:

• Method 1: add up the time that each person is at risk
• Method 2: add up the population at risk during each time segment
• Method 3: multiply the average size of the population at risk by the length of the time interval
Estimating population-time - method 2
Total population-time over 9 months
= 200 + 199 + 198 + 197 + 195 + 194 + 193 + 192 + 190
= 1,758 person-months
= 146.5 person-years

However, cases are not at risk for a full month.
Estimating population-time - method 2
- better
Total population-time over 9 months
= 199.5 + 198.5 + 197.5 + 196 + 194.5 + 193.5 + 192.5 + 191 + 189
= 1,752 person-months
= 146 person-years

assuming that cases develop, on average, in the middle of the month.
Estimating population-time - method 3
Average size of the population at risk during the 9 months = 195.3 (1,758 / 9), or approximately (200 + 188) / 2 = 194.

Population-time = 195.3 × 9 months, or (approximately) 194 × 9 months
= 1,746 person-months
= 145.5 person-years
Equivalent to - method 3
Take the initial size of the population at risk and reduce it for the time the people were not at risk due to acquiring the disease:

200 − 12/2 = 194 (approximately)

Population-time = 194 × 9 months = 1,746 person-months = 145.5 person-years
Incidence rate ("incidence density")

Incidence rate = (Number of new cases) / (Avg population at risk × Time interval)
               = (Number of new cases) / (Population-time)
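A sketch of the person-time bookkeeping for this 9-month example (monthly case counts as in the slides above; cases are assumed to arise mid-month, as in the refined method 2):

    new_cases = [1, 1, 1, 2, 1, 1, 1, 2, 2]   # new cases in months 1-9 (total 12)
    remaining = 200                            # initial population at risk
    person_months = 0.0
    for c in new_cases:
        person_months += remaining - c / 2     # each new case counts for half the month
        remaining -= c

    print(person_months)                       # 1752.0 person-months (= 146 person-years)
    print(12 / person_months)                  # incidence rate per person-month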
What proportion of the population at risk is affected?

• after 1 month: 1/200
• after 2 months: 2/200
• after 3 months: 3/200
• after 4 months: 5/200
• after 5 months: 6/200 = 0.03 = 3% = 30 / 1,000

[population diagrams omitted: cumulative cases marked O in the population of 200]
Incidence proportion ("cumulative incidence")

5-month CI = (Number of new cases) / (Population at risk)

Incidence proportion estimates risk.

Incidence rate versus incidence proportion

• Incidence rate measures how rapidly cases are occurring.
• Incidence proportion is cumulative.
• When we care only about the "bottom line" (i.e., what has happened by the end of a given period): use the incidence proportion (CI).

Incidence rate versus incidence proportion (continued)

• If the risk period is long (e.g., cancer), we usually observe only a portion of it.
• To compare results from studies with different lengths of follow-up, use the incidence rate (IR).
• If the risk period is short, we usually observe all of it and can use the incidence proportion.
Incidence rate versus incidence proportion: worked examples for a rare disease (IR = 0.005/month) and a common disease (IR = 0.1/month) are in the spreadsheet at epidemiolog.net/studymat/.
Case fatality rate

"Case fatality rate" (but it's really a proportion) = the proportion of cases who die (in a specified time interval).

• Like a "cumulative incidence of death" in cases.
• [The "incidence rate of death" in cases = "termination rate" = 1 / (average survival time).]
Mortality rate
Mortality rate = (Number of deaths) / (Population at risk × Time interval)

Annual mortality rate = (Number of deaths) / (Mid-year population × 1 yr)
Mortality rates versus incidence rates
• Mortality data are more generally available.
• Fatality reflects many factors, so mortality rates may not be a good surrogate for incidence rates.
• Death-certificate cause of death is not always accurate or useful.
Prevalence - another important proportion

Prevalence = (Number of existing (and new) cases) / (Population at risk)
Following the population further:
• 1 new case, 1 death
• 1 new case, 1 new death
• 2 new cases, no deaths
• 2 new cases, 1 new death

What is the prevalence? 9 / 197

[population diagrams omitted: existing cases marked O; deaths leave the population at risk]
Fine points...

• Who is "at risk"?
• Endometrial cancer? Prostate cancer? Breast cancer?
• Only women who have not had a hysterectomy?

"Could" develop the condition + "would" be counted.

More fine points
• Age?
• Immunity?
• Genetically susceptible?
More fine points...

• How do we measure time?
• Are 10 people followed for 10 years the same as 100 people followed for 1 year?
• Aging of the cohort? Secular changes?
Fine points...

• Importance of stating units and scaling unless they are clear from the context
  - e.g., 120 per 100,000 person-years = 10 per 100,000 person-months
  - Hazards from lack of clarity
"You can never, never take anything for granted."

Noel Hinners, vice president for flight systems at Lockheed Martin Astronautics in Denver, concerning the loss of the Martian Climate Orbiter due to the Lockheed Martin spacecraft team's having reported measurements in English units while the orbiter's navigation team at the Jet Propulsion Laboratory (JPL) in Pasadena, California assumed the measurements were in metric units.
Relation of incidence and prevalence

• Prevalence depends on incidence.
• Higher incidence leads to higher prevalence if the duration of cases does not change.
• Limitation of the bathtub analogy: the flow rate needs to be expressed relative to the size of the source.
• Introducing a new analogy...
[diagram: the population at risk flows into the pool of existing cases, which empties through deaths, cures, etc.]
Incidence, prevalence, duration of hospitalization
Remote community of 101,000 people; one hospital, patient census = 1,000; steady state; 500 admissions per week.

Prevalence = 1,000 / 101,000 = 9.9 / 1,000
IR = 500 / 100,000 = 5 / 1,000 / week
Duration = Prevalence / IR = 2 weeks
Relation of incidence and prevalence

Under somewhat special conditions,

Prevalence odds = incidence × duration
Prevalence ≈ incidence × duration (when prevalence is low)

(see spreadsheet at www.epidemiolog.net/studymat/)
Standardization
• When the objective is comparability, we need to adjust for different distributions of other determinants.
• Strategy:
  - Analyze within each subgroup (stratum)
  - Take a weighted average across strata
  - Use the same weights for all populations

(See the Evolving Text on www.epidemiolog.net)
Familiar example of weighted averages
• Liters of petrol per kilometer (LpK) differs for Interstate (0.050 LpK) and non-Interstate (0.100 LpK) driving.
• To compare different cars, we can:
  - Compare them for each type of driving separately (stratified analysis)
  - Average for each car, using one set of weights (e.g., 80% Interstate, 20% non-Interstate):
    0.80 × 0.050 LpK + 0.20 × 0.100 LpK = 0.060 LpK
Comparing a Suburu and a Mazda
Juan drives a Suburu 800 km on Interstate highways
and 200 km on other roads. His car uses 0.050 LpK
on Interstates and 0.100 LpK on other roads, for a
total of 60 liters of petrol, an average of 0.060 LpK
(60 L / 1000 km). His overall LpK can be expressed
as a weighted average:
(800/1000) × 0.050 LpK + (200/1000) × 0.100 LpK
= 0.80 × 0.050 LpK + 0.20 × 0.100 LpK = 0.060 LpK
Comparing a Suburu and a Mazda
Shizu drives her Mazda on a different route, with
only 200 km on Interstate and 800 km on other
roads. She uses 0.045 LpK on Interstate highways and 0.080 LpK on non-Interstate, a total of 73 liters, or 0.073 LpK. Her overall LpK can be expressed as a weighted average:

(200/1,000) × 0.045 LpK + (800/1,000) × 0.080 LpK
= 0.20 × 0.045 LpK + 0.80 × 0.080 LpK = 0.073 LpK
How can we compare their fuel efficiency?

            Juan              Shizu
            Km      LpK       Km      LpK
Interstate  800     0.050     200     0.045
Other       200     0.100     800     0.080
Total       1,000   0.060     1,000   0.073

Total fuel efficiency is not comparable because the weights are different:

            Juan              Shizu
            %       LpK       %       LpK
Interstate  80      0.050     20      0.045
Other       20      0.100     80      0.080
Total       100%    0.060     100%    0.073
By adopting a "standard" set of weights we can compare fairly:

              Juan              Shizu
              %       LpK       %       LpK
Interstate    60      0.050     60      0.045
Other         40      0.100     40      0.080
Total         100     0.060     100     0.073
Standardized          0.070             0.059
Comparing a Suburu and a Mazda

• Juan's Suburu: 0.60 × 0.050 LpK + 0.40 × 0.100 LpK = 0.070 LpK
• Shizu's Mazda: 0.60 × 0.045 LpK + 0.40 × 0.080 LpK = 0.059 LpK

The choice of weights may often affect the results of the comparison.
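Standardization is just a weighted average computed with a common set of weights; a sketch using the figures above:

    weights = {"interstate": 0.60, "other": 0.40}    # one "standard" set of weights

    juan  = {"interstate": 0.050, "other": 0.100}    # stratum-specific LpK
    shizu = {"interstate": 0.045, "other": 0.080}

    def standardize(rates, weights):
        # Weighted average of stratum-specific rates using the common weights
        return sum(weights[s] * rates[s] for s in weights)

    print(standardize(juan, weights))    # 0.070
    print(standardize(shizu, weights))   # 0.059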
t- and F-tests: testing hypotheses

Overview
• Distribution & probability
• Standardized normal distribution
• t-test
• F-test (ANOVA)

Starting point
• Central aim of statistical tests: determining the likelihood of a value in a sample, given that the null hypothesis is true: P(value|H0)
• H0: no statistically significant difference between sample & population (or between samples)
• H1: statistically significant difference between sample & population (or between samples)
• Significance level: P(value|H0) < 0.05
Types of error

                        Population
                   H0                        H1
Sample  H0         1 − α                     β-error (Type II error)
        H1         α-error (Type I error)    1 − β
Distribution & probability

If we know something about the distribution of events, we know something about the probability of these events.

Mean: x̄ = (Σ xi) / n

Standard deviation: s = sqrt( Σ (xi − x̄)² / n )
Standardized normal distribution

Population: z = (x − μ) / σ        Sample: zi = (xi − x̄) / s        (z-scores have x̄_z = 0, s_z = 1)

• The z-score represents a value on the x-axis for which we know the p-value.
• 2-tailed: z = ±1.96 encloses 2 SD around the mean = 95% --> 'significant'
• 1-tailed: z = +1.65 (or −1.65) marks off 95% counting from minus (or plus) infinity
t-tests: testing hypotheses about means

t = (x̄1 − x̄2) / s_(x̄1−x̄2),   where   s_(x̄1−x̄2) = sqrt( s1²/n1 + s2²/n2 )

t = (difference between sample means) / (estimated standard error of the difference between means)
Degrees of freedom (df)
• Number of scores in a sample that are free to vary.
• n = 4 scores, mean = 10 --> df = n − 1 = 4 − 1 = 3
  - The sum must be 40 (mean = 40/4 = 10)
  - E.g.: score1 = 10, score2 = 15, score3 = 5 --> score4 must be 10
Kinds of t-tests
Formula is slightly different for each:
• Single-sample:
• tests whether a sample mean is significantly different from a
pre-existing value (e.g. norms)
• Paired-samples:
• tests the relationship between 2 linked samples, e.g. means
obtained in 2 conditions by a single group of participants
• Independent-samples:
• tests the relationship between 2 independent populations
• formula see previous slide
Independent sample t-test

Number of words recalled:

Group 1    Group 2 (Imagery)
21         22
19         25
18         27
18         24
23         26
17         24
19         28
16         26
21         30
18         28
mean = 19  mean = 26
SS = 40    SS = 50

(Pooled s² = (40 + 50) / 18 = 5, so the standard error of the difference is sqrt(5/10 + 5/10) = 1.)

df = (n1 − 1) + (n2 − 1) = 18

t = (x̄1 − x̄2) / s_(x̄1−x̄2) = (19 − 26) / 1 = −7

t(0.05,18) = ±2.101;  |t| > t(0.05,18)  --> reject H0
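The same test on the recall data, as a sketch (scipy's independent-samples t-test, equal variances assumed):

    from scipy import stats

    group1 = [21, 19, 18, 18, 23, 17, 19, 16, 21, 18]   # control, mean 19
    group2 = [22, 25, 27, 24, 26, 24, 28, 26, 30, 28]   # imagery, mean 26
    t, p = stats.ttest_ind(group1, group2)               # t = -7.0 on df = 18
    print(t, p)                                          # p well below .05 --> reject H0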
Bonferroni correction
• To control for false positives when making n comparisons, divide the significance level by n:

pc = p / n

• E.g. four comparisons: pc = 0.05 / 4 = 0.0125
F-tests / Analysis of Variance (ANOVA)
t-tests allow inferences about 2 sample means. But what if you have more than 2 conditions? E.g. placebo, drug 20mg, drug 40mg, drug 60mg:

Placebo vs. 20mg    20mg vs. 40mg
Placebo vs. 40mg    20mg vs. 60mg
Placebo vs. 60mg    40mg vs. 60mg

The chance of making a Type I error increases as you do more t-tests. ANOVA controls this error by testing all means at once: it can compare k means. Drawback = loss of specificity.
F-tests / Analysis of Variance (ANOVA)
Different types of ANOVA depending upon experimental
design (independent, repeated, multi-factorial)

Assumptions:
• observations within each sample are independent
• samples are normally distributed
• samples have equal variances
F-tests / Analysis of Variance (ANOVA)

t = (obtained difference between sample means) / (difference expected by chance (error))

F = (variance (differences) between sample means) / (variance (differences) expected by chance (error))

The difference between sample means is easy to compute for 2 samples (e.g. x̄1 = 20, x̄2 = 30, difference = 10), but if x̄3 = 35 the concept of differences between sample means gets tricky.
F-tests / Analysis of Variance (ANOVA)
The solution is to use variance, which is related to SD: standard deviation = sqrt(variance).

E.g.   Set 1: 20, 30, 35  --> s² = 58.3
       Set 2: 28, 30, 31  --> s² = 2.33

These 2 variances provide a relatively accurate representation of the size of the differences.
F-tests / Analysis of Variance (ANOVA)
Simple ANOVA example: partitioning total variability

• Between-treatments variance: measures differences due to (1) treatment effects and (2) chance
• Within-treatments variance: measures differences due to (1) chance
F-tests / Analysis of Variance (ANOVA)

F = MS_between / MS_within

When the treatment has no effect, differences between groups/treatments are entirely due to chance; the numerator and denominator will be similar, and the F-ratio should have a value around 1.00.

When the treatment does have an effect, the between-treatment differences (numerator) should be larger than chance (denominator), and the F-ratio should be noticeably larger than 1.00.
F-tests / Analysis of Variance (ANOVA)
Simple independent-samples ANOVA example: F(3, 8) = 9.00, p < 0.05

        Placebo   Drug A   Drug B   Drug C
Mean    1.0       1.0      4.0      6.0
SD      1.73      1.0      1.0      1.73
n       3         3        3        3

There is a difference somewhere; we have to use post-hoc tests (essentially t-tests corrected for multiple comparisons) to examine it further.
F-tests / Analysis of Variance (ANOVA)
It gets more complicated than that, though. A bit of notation first:
• An independent variable is called a factor, e.g. if we compare doses of a drug, then dose is our factor.
• The different values of our independent variable are our levels, e.g. 20mg, 40mg, 60mg are the 3 levels of our factor.
F-tests / Analysis of Variance (ANOVA)
We can test more complicated hypotheses, for example a 2-factor ANOVA (data modelled on Schachter, 1968). Factors:
1. Weight: normal vs. obese participants
2. Full stomach vs. empty stomach

Participants have to rate 5 types of crackers; the dependent variable is how many they eat. This experiment is a 2x2 factorial design: 2 factors x 2 levels.
F-tests / Analysis of Variance (ANOVA)
Mean number of crackers eaten:

          Empty   Full
Normal    22      15     (row total = 37)
Obese     17      18     (row total = 35)
          = 39    = 33   (column totals)

Result: no main effect for factor A (normal/obese); no main effect for factor B (empty/full).
F-tests / Analysis of Variance (ANOVA)
Mean number of crackers eaten:

          Empty   Full
Normal    22      15
Obese     17      18

[interaction plot: crackers eaten against empty vs. full stomach, separate lines for normal and obese participants; the lines cross]
F-tests / Analysis of Variance (ANOVA)
Application to imaging...

Early days => subtraction methodology => t-tests corrected for multiple comparisons, e.g.:

Pain condition − appropriate rest/visual task = statistical parametric map
F-tests / Analysis of Variance (ANOVA)

This is still a fairly simple analysis. It shows the main effect of pain (collapsing across the pain source) and the individual conditions. More complex analyses can look at interactions between factors.

Derbyshire, Whalley, Stenger, Oakley, 2004
References

• Gravetter & Wallnau, Statistics for the Behavioural Sciences
• Last year's presentation, thank you to Louise Whiteley & Elisabeth Rounis
• http://www.fil.ion.ucl.ac.uk/spm/doc/mfd-2004.html
The geometric mean is the positive number X such that:

A / X = X / B

To find the geometric mean:

Example: find the geometric mean between 4 and 16. Write a proportion with X in the two mean positions, cross-multiply, then find the square root and simplify:

4 / X = X / 16  -->  X² = 64  -->  X = 8

Another example: find the geometric mean between 4 and 20:

4 / X = X / 20  -->  X² = 80  -->  X = √80 = 4√5

What if you know the geometric mean? Example: 6 is the geometric mean between 9 and what other number? Set up the proportion this way, cross-multiply, and finish (no square root this time!):

X / 6 = 6 / 9  -->  9X = 36  -->  X = 4
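A sketch of the computation (x = sqrt(A·B) follows from cross-multiplying A/X = X/B):

    import math

    def geometric_mean(a, b):
        # Positive x with a/x = x/b, i.e. x = sqrt(a * b)
        return math.sqrt(a * b)

    print(geometric_mean(4, 16))   # 8.0
    print(geometric_mean(4, 20))   # 8.944..., i.e. 4*sqrt(5)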
Research: Hypothesis
Definition

• The word hypothesis is derived from the Greek words "hypo" (under) and "tithemi" (place): it is placed under the known facts of the problem to explain the relationship between them.
• A hypothesis is a statement subject to verification; a guess, but an experienced guess based on some facts.
• It is a hunch, assumption, suspicion, assertion or an idea about a phenomenon, relationship, or situation, the reality or truth of which one does not know. A researcher calls these assumptions, assertions, statements, or hunches hypotheses, and they become the basis of an inquiry.
• In most cases, the hypothesis will be based upon either previous studies or the researcher's own or someone else's observations.
• "Hypothesis is a conjectural statement of relationship between two or more variables" (Kerlinger, Fred N., Foundations of Behavioural Research, 3rd edition, New York: Holt, Rinehart and Winston, 1986).
Definition

• "Hypothesis is a proposition, condition or principle which is assumed, perhaps without belief, in order to draw out its logical consequences and by this method to test its accord with facts which are known or may be determined" (Webster's New International Dictionary of English).
• "A tentative statement about something, the validity of which is usually unknown" (Black, James A. & Dean J. Champion, Methods and Issues in Social Research, New York: John Wiley & Sons, Inc., 1976).
• "Hypothesis is a proposition that is stated in a testable form and that predicts a particular relationship between two or more variables. In other words, if we think that a relationship exists, we first state it as a hypothesis and then test the hypothesis in the field" (Bailey, Kenneth D., Methods of Social Research, 3rd edition, New York: The Free Press, 1978).
Definition

• "A hypothesis is written in such a way that it can be proven or disproven by valid and reliable data; it is in order to obtain these data that we perform our study" (Grinnell, Richard, Jr., Social Work Research and Evaluation, 3rd edition, Itasca, Illinois: F.E. Peacock Publishers, 1988).
• "A hypothesis may be defined as a tentative theory or supposition set up and adopted provisionally as a basis of explaining certain facts or relationships and as a guide in the further investigation of other facts or relationships" (Crisp, Richard D., Marketing Research, New York: McGraw Hill Book Co., 1957).
Characteristics
A hypothesis has the following characteristics:
• a tentative proposition
• unknown validity
• specifies a relation between two or more variables
Functions
• Brings clarity to the research problem and serves the following functions:
  - provides a study with focus
  - signifies what specific aspects of a research problem are to be investigated
  - tells what data are to be collected and what not
  - enhances the objectivity of the study
  - helps to formulate the theory
  - enables conclusions about what is true or false
Characteristics
• Simple, specific, and contextually clear
• Capable of verification
• Related to the existing body of knowledge
• Operationalisable
Typologies
Three types:
• working hypothesis
• null hypothesis
• alternate hypothesis

Working hypothesis
The working or trial hypothesis is provisionally adopted to explain the relationship between some observed facts, for guiding a researcher in the investigation of a problem. A statement that constitutes a trial or working hypothesis is to be tested and confirmed, modified or even abandoned as the investigation proceeds.
Typologies
Null hypothesis
A null hypothesis is formulated against the working hypothesis; it opposes the statement of the working hypothesis. It is contrary to the positive statement made in the working hypothesis, formulated to disprove the contrary of a working hypothesis. When a researcher rejects a null hypothesis, he/she actually supports the working hypothesis.

In statistics, a null hypothesis is usually denoted H0. For example,

H0: Q = O

where Q is the property of the population under investigation and O is the hypothesized value.
Typologies
Alternate hypothesis
An alternate hypothesis is formulated when a researcher rejects the null hypothesis; he/she develops such a hypothesis with adequate reasons. The notation used for the alternate hypothesis is H1, e.g.

H1: Q > O

i.e., Q is greater than O.
Example
• Working hypothesis: population influences the number of bank branches in a town.
• Null hypothesis (H0): population does not have any influence on the number of bank branches in a town.
• Alternate hypothesis (H1): population has a significant effect on the number of bank branches in a town. A researcher formulates this hypothesis only after rejecting the null hypothesis.
Hypothesis Testing
• Goal: Make statement(s) regarding unknown population
parameter values based on sample data
• Elements of a hypothesis test:
– Null hypothesis - Statement regarding the value(s) of unknown
parameter(s). Typically will imply no association between
explanatory and response variables in our applications (will
always contain an equality)
– Alternative hypothesis - Statement contradictory to the null
hypothesis (will always contain an inequality)
– Test statistic - Quantity based on sample data and null
hypothesis used to test between null and alternative hypotheses
– Rejection region - Values of the test statistic for which we
reject the null in favor of the alternative hypothesis
Hypothesis Testing
                          Test result: H0 True    Test result: H0 False
True state: H0 True       Correct decision        Type I error
True state: H0 False      Type II error           Correct decision

α = P(Type I error)    β = P(Type II error)

• Goal: keep α and β reasonably small.
Example - Efficacy Test for New drug
• Drug company has new drug, wishes to compare it
with current standard treatment
• Federal regulators tell company that they must
demonstrate that new drug is better than current
treatment to receive approval
• Firm runs clinical trial where some patients
receive new drug, and others receive standard
treatment
• Numeric response of therapeutic effect is obtained
(higher scores are better).
• Parameter of interest: μNew - μStd
Example - Efficacy Test for New drug
• Null hypothesis: the new drug is no better than the standard treatment

H0: μ_New − μ_Std ≤ 0   (μ_New − μ_Std = 0)

• Alternative hypothesis: the new drug is better than the standard treatment

HA: μ_New − μ_Std > 0

• Experimental (sample) data: ȳ_New, ȳ_Std; s_New, s_Std; n_New, n_Std
Sampling Distribution of Difference in Means

• In large samples, the difference in two sample means is approximately normally distributed:

Ȳ1 − Ȳ2 ~ N( μ1 − μ2, σ1²/n1 + σ2²/n2 )

• Under the null hypothesis, μ1 − μ2 = 0 and:

Z = (Ȳ1 − Ȳ2) / sqrt( σ1²/n1 + σ2²/n2 ) ~ N(0, 1)

• σ1² and σ2² are unknown and estimated by s1² and s2².
Example - Efficacy Test for New drug

• Type I error: concluding that the new drug is better than the standard (HA) when in fact it is no better (H0). An ineffective drug is deemed better.
  - Traditionally α = P(Type I error) = 0.05
• Type II error: failing to conclude that the new drug is better (HA) when in fact it is. An effective drug is deemed to be no better.
  - Traditionally a clinically important difference (Δ) is assigned and sample sizes chosen so that β = P(Type II error | μ1 − μ2 = Δ) ≤ .20
Elements of a Hypothesis Test
• Test statistic: the difference between the sample means, scaled to the number of standard deviations (standard errors) from the null difference of 0 for the population means:

T.S.:  z_obs = (ȳ1 − ȳ2) / sqrt( s1²/n1 + s2²/n2 )

• Rejection region: the set of values of the test statistic that are consistent with HA, such that the probability it falls in this region when H0 is true is α (we will always set α = 0.05):

R.R.:  z_obs ≥ z_α;   α = 0.05 ⇒ z_α = 1.645
P-value (aka Observed Significance Level)
• P-value: a measure of the strength of evidence the sample data provide against the null hypothesis: P(evidence this strong or stronger against H0 | H0 is true).

P-val:  p = P(Z ≥ z_obs)
Large-sample test of H0: μ1 − μ2 = 0 vs HA: μ1 − μ2 > 0

• H0: μ1 − μ2 = 0 (no difference in population means)
• HA: μ1 − μ2 > 0 (population mean 1 > population mean 2)
• T.S.: z_obs = (ȳ1 − ȳ2) / sqrt( s1²/n1 + s2²/n2 )
• R.R.: z_obs ≥ z_α
• P-value: P(Z ≥ z_obs)
• Conclusion: reject H0 if the test statistic falls in the rejection region, or equivalently if the P-value is ≤ α.
Example - Botox for Cervical Dystonia

• Patients - Individuals suffering from cervical dystonia


• Response - Tsui score of severity of cervical dystonia
(higher scores are more severe) at week 8 of Tx
• Research (alternative) hypothesis - Botox A
decreases mean Tsui score more than placebo
• Groups - Placebo (Group 1) and Botox A (Group 2)
• Experimental (sample) results (Wissel, et al, 2001):

ȳ1 = 10.1   s1 = 3.6   n1 = 33
ȳ2 = 7.7    s2 = 3.4   n2 = 35
Example - Botox for Cervical Dystonia
Test whether Botox A produces lower mean Tsui scores than placebo (α = 0.05):

• H0: μ1 − μ2 = 0
• HA: μ1 − μ2 > 0
• T.S.: z_obs = (10.1 − 7.7) / sqrt( (3.6)²/33 + (3.4)²/35 ) = 2.4 / 0.85 = 2.82
• R.R.: z_obs ≥ z_α = z.05 = 1.645
• P-val: P(Z ≥ 2.82) = .0024

Conclusion: Botox A produces lower mean Tsui scores than placebo (since 2.82 > 1.645 and the P-value < 0.05).
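The same computation as a sketch (scipy assumed; stats.norm.sf gives the upper-tail probability):

    import math
    from scipy import stats

    y1, s1, n1 = 10.1, 3.6, 33    # placebo
    y2, s2, n2 = 7.7, 3.4, 35     # Botox A

    se = math.sqrt(s1**2 / n1 + s2**2 / n2)   # about 0.85
    z_obs = (y1 - y2) / se                    # about 2.82
    p = stats.norm.sf(z_obs)                  # one-sided P-value, about .0024
    print(z_obs, p)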
2-Sided Tests
• Many studies don't assume a direction with respect to the difference μ1 − μ2.
• H0: μ1 − μ2 = 0;  HA: μ1 − μ2 ≠ 0
• The test statistic is the same as before.
• Decision rule:
  - Conclude μ1 − μ2 > 0 if z_obs ≥ z_α/2 (α = 0.05 ⇒ z_α/2 = 1.96)
  - Conclude μ1 − μ2 < 0 if z_obs ≤ −z_α/2 (α = 0.05 ⇒ −z_α/2 = −1.96)
  - Do not reject μ1 − μ2 = 0 if −z_α/2 ≤ z_obs ≤ z_α/2
• P-value: 2·P(Z ≥ |z_obs|)
Power of a Test
• Power: the probability a test rejects H0 (depends on μ1 − μ2)
  - H0 true: power = P(Type I error) = α
  - H0 false: power = 1 − P(Type II error) = 1 − β
• Example: H0: μ1 − μ2 = 0, HA: μ1 − μ2 > 0, with σ1² = σ2² = 25 and n1 = n2 = 25
• Decision rule: reject H0 (at the α = 0.05 significance level) if:

z_obs = (ȳ1 − ȳ2) / sqrt( σ1²/n1 + σ2²/n2 ) = (ȳ1 − ȳ2) / sqrt(2) ≥ 1.645  ⇒  ȳ1 − ȳ2 ≥ 2.326
Power of a Test
• Now suppose in reality that μ1 − μ2 = 3.0 (HA is true).
• Power now refers to the probability we (correctly) reject the null hypothesis. Note that the sampling distribution of the difference in sample means is approximately normal, with mean 3.0 and standard deviation (standard error) sqrt(2) = 1.414.
• Decision rule (from the last slide): conclude the population means differ if the sample mean for group 1 is at least 2.326 higher than the sample mean for group 2.
• Power for this case can be computed as:

P(Ȳ1 − Ȳ2 ≥ 2.326),   where  Ȳ1 − Ȳ2 ~ N(3, 1.414)
Power of a Test

Power = P(Ȳ1 − Ȳ2 ≥ 2.326) = P( Z ≥ (2.326 − 3) / 1.414 = −0.48 ) = .6844

All else being equal:
• As sample sizes increase, power increases
• As population variances decrease, power increases
• As the true mean difference increases, power increases
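A sketch of the power computation for this example:

    from scipy import stats

    se = (25/25 + 25/25) ** 0.5                # sqrt(2) = 1.414
    crit = 1.645 * se                          # reject H0 if ybar1 - ybar2 >= 2.326
    power = stats.norm.sf((crit - 3.0) / se)   # P(Z >= -0.48), about .68
    print(power)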
Power of a Test
[figure: sampling distributions of the difference under H0 and under HA]

Power curves for group sample sizes of 25, 50, 75, 100 and varying true values of μ1 − μ2 with σ1 = σ2 = 5:
• For a given μ1 − μ2, power increases with sample size
• For a given sample size, power increases with μ1 − μ2
Sample Size Calculations for Fixed Power
• Goal - Choose sample sizes to have a favorable chance of detecting a clinically meaningful difference
• Step 1 - Define an important difference in means:
– Case 1: σ approximated from prior experience or a pilot study - the difference can be stated in units of the data
– Case 2: σ unknown - the difference must be stated in units of standard deviations of the data:

δ = (μ1 − μ2)/σ

• Step 2 - Choose the desired power to detect the clinically meaningful difference (1 − β, typically at least .80). For a 2-sided test:

n1 = n2 = 2(zα/2 + zβ)² / δ²
Example - Rosiglitazone for HIV-1
Lipoatrophy
• Trts - Rosiglitazone vs Placebo
• Response - Change in Limb fat mass
• Clinically Meaningful Difference - 0.5 (std dev’s)
• Desired Power - 1-β = 0.80
• Significance Level - α = 0.05

zα/2 = 1.96   zβ = z.20 = 0.84

n1 = n2 = 2(1.96 + 0.84)² / (0.5)² = 63
Source: Carr, et al (2004)
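A minimal Python sketch of this calculation (SciPy assumed; rounding up to the next whole subject):

  from math import ceil
  from scipy import stats

  alpha, power, delta = 0.05, 0.80, 0.5

  z_a2 = stats.norm.isf(alpha / 2)      # 1.96
  z_b = stats.norm.isf(1 - power)       # 0.84
  n = 2 * (z_a2 + z_b)**2 / delta**2    # about 62.8
  print(ceil(n))                        # 63 per group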
Confidence Intervals
• Normally Distributed data - approximately 95% of
individual measurements lie within 2 standard
deviations of the mean
• Difference between 2 sample means is
approximately normally distributed in large
samples (regardless of shape of distribution of
individual measurements):

Ȳ1 − Ȳ2 ~ N( μ1 − μ2 , σ1²/n1 + σ2²/n2 )
• Thus, we can expect (with 95% confidence) that our sample
mean difference lies within 2 standard errors of the true difference
(1-α)100% Confidence Interval for μ1-μ2

• Large sample Confidence Interval for μ1-μ2:

(ȳ1 − ȳ2) ± zα/2 √(s1²/n1 + s2²/n2)
• Standard level of confidence is 95% (z.025 = 1.96 ≈ 2)
• (1-α)100% CI’s and 2-sided tests reach the same
conclusions regarding whether μ1-μ2= 0
Example - Viagra for ED
• Comparison of Viagra (Group 1) and Placebo (Group 2)
for ED
• Data pooled from 6 double-blind trials
• Subjects - White males
• Response - Percent of successful intercourse attempts in past 4 weeks (each subject reports his own percentage)

ȳ1 = 63.2   s1 = 41.3   n1 = 264
ȳ2 = 23.5   s2 = 42.3   n2 = 240

95% CI for μ1- μ2:


(63.2 − 23.5) ± 1.96 √((41.3)²/264 + (42.3)²/240) ≡ 39.7 ± 7.3 ≡ (32.4, 47.0)
Source: Carson, et al (2002)
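The interval can be checked with a few lines of Python (SciPy assumed):

  from math import sqrt
  from scipy import stats

  ybar1, s1, n1 = 63.2, 41.3, 264   # Viagra
  ybar2, s2, n2 = 23.5, 42.3, 240   # Placebo

  diff = ybar1 - ybar2
  moe = stats.norm.isf(0.025) * sqrt(s1**2/n1 + s2**2/n2)   # margin of error, about 7.3
  print(diff - moe, diff + moe)                             # about (32.4, 47.0)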
Hypothesis Testing: Preliminaries

A hypothesis is a statement that something is true.


Null hypothesis: A hypothesis to be tested. We use
the symbol H0 to represent the null hypothesis
Alternative hypothesis: A hypothesis to be
considered as an alternative to the null hypothesis. We
use the symbol Ha to represent the alternative
hypothesis.
- The alternative hypothesis is the one believed to be true, or what you are trying to prove is true.
In this course, we will always assume that the null
hypothesis for a population parameter, Θ , always specifies
a single value for that parameter. So, an equal sign always
appears:
H 0 :Θ = Θ 0

If the primary concern is deciding whether a population


parameter is different than a specified value, the alternative
hypothesis should be:
H a :Θ ≠ Θ 0

This form of alternative hypothesis is called a two-tailed


test.
Example: You suspect that the equilibrium wage of low
skilled workers is not the federal minimum wage level of
$5.15
*If the primary concern is whether a population
parameter, Θ , is less than a specified value Θ 0 , the
alternative hypothesis should be:
H a :Θ < Θ 0

A hypothesis test whose alternative hypothesis has


this form is called a left-tailed test.
*If the primary concern is whether a population
parameter, Θ , is greater than a specified value Θ 0,
the alternative hypothesis should be:
H a :Θ > Θ 0

A hypothesis test whose alternative hypothesis has


this form is called a right-tailed test.
A hypothesis test is called a one-tailed test if it is
either right- or left-tailed, i.e.,if it is not a two-tailed
test.
After we have the null hypothesis, we have to determine
whether to reject it or fail to reject it.
The decision to reject or fail to reject is based on information
contained in a sample drawn from the population of interest.
The sample values are used to compute a single number, corresponding to a point on a line, which operates as a decision maker. This decision maker is called the test statistic.
If the test statistic falls in a region that supports the alternative hypothesis, we reject the null hypothesis. This region is called the rejection region.
If it falls in a region that supports the null hypothesis, we fail to reject the null hypothesis. This region is called the acceptance region.
The point dividing the rejection region from the acceptance region is called the critical value.
We can make mistakes in the test.
Type I error: reject the null hypothesis when it is true.
probability of type I error is denoted by α
Type II error: accept the null hypothesis when it is wrong.
probability of type II error is denoted by β
Test of hypothesis for a population mean

• We are basically asking: What observed value of x


bar would be different enough from my null
hypothesis value to convince me that my null is
wrong
• We always talk in terms of type I errors, alpha,
which are always small (.1, .05, .01)
• The smaller alpha gets, the tighter your proof that the alternative is correct, because the probability of a type I error is reduced; but the chance of a type II error is increased.
Test of hypothesis for a population mean
(two tailed and large sample)
1) Hypothesis: H 0 :μ = μ0
H a :μ ≠ μ0
2) Test statistic: large sample case
zobs = (x̄ − μ0) / (σ/√n)
3) Critical value, rejection and acceptance region:
- The bigger the absolute value of z is, the more
possible to reject null hypothesis.
- The critical value depend on the significance level α
- rejection region: |zobs| > zα/2
Test of hypothesis for a population mean
(one tailed test and large sample)
1) Hypothesis: H 0 :μ = μ0
H a :μ > μ0 or H a :μ < μ0
2) Test statistic: large sample case
zobs = (x̄ − μ0) / (σ/√n)
3) Critical value, rejection and acceptance region:
rejection region: zobs > zα or zobs < − zα
Example: a sample of 60 students’ grades is taken from a large class; the average grade in the sample is 80 with a sample standard deviation of 10.
Test the hypothesis that the average grade is 75 with
5% significance level (probability of making a type I
error).
Test of hypothesis for a population mean
(two tailed and small sample)
1) Hypothesis: H 0 :μ = μ0
H a :μ ≠ μ0
2) Test statistic: small sample case
t = (x̄ − μ0) / (s/√n)
3) Critical value, rejection and acceptance region:
- The bigger the absolute value of t is, the more
possible to reject null hypothesis.
- The critical value depends on significance level α

- rejection region: | t |> tα / 2 d.f.=n-1


Test of hypothesis for a population mean
(one tailed test and small sample)
1) Hypothesis: H 0 :μ = μ0
H a : μ > μ 0 or H a :μ < μ0
2) Test statistic: small sample case
t = (x̄ − μ0) / (s/√n)
3) Critical value, rejection and acceptance region:
rejection region: t > tα or t < −tα d.f.=n-1
Example: suppose you have a sample of 11 Econ 70
midterm exam grades. The mean of that sample
is 81 with a standard deviation of 9.
1) Test hypothesis that average grade of the
population is 75 with 5% significance level.
2) Test hypothesis that average grade of the
population is greater than 80 with 5%
significance level.
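A sketch of how these two tests might be run in Python (SciPy assumed):

  from math import sqrt
  from scipy import stats

  n, xbar, s = 11, 81, 9
  se = s / sqrt(n)

  # 1) Two-tailed test of H0: mu = 75
  t1 = (xbar - 75) / se                      # about 2.21
  print(t1, stats.t.isf(0.025, df=n - 1))    # critical value 2.228; |t| < 2.228

  # 2) One-tailed test of H0: mu = 80 vs Ha: mu > 80
  t2 = (xbar - 80) / se                      # about 0.37
  print(t2, stats.t.isf(0.05, df=n - 1))     # critical value 1.812; t < 1.812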
STATA
• ttest
Test of difference between two
population means

Population 1: faculty in public schools


Population 2: faculty in private schools

μ1 =mean salary of faculty in public schools


μ 2 =mean salary of faculty in private schools
Two samples: one from public the other from
private
H 0 : μ1 = μ 2
H a : μ1 ≠ μ 2
In the large-sample case, the sampling distribution of the difference between sample means, x̄1 − x̄2, is a normal distribution with mean

μ(x̄1 − x̄2) = μ1 − μ2

and standard deviation

σ(x̄1 − x̄2) = √(σ1²/n1 + σ2²/n2)
Test of hypothesis for difference of two population means
(two tailed and large sample)
1) Hypothesis: D0 is some specified difference that you
wish to test. For many tests, you will wish to
hypothesize that there is no difference between two
means, that is D0=0
H 0 : μ1 − μ 2 = D0
H a : μ1 − μ 2 ≠ D0
2) Test statistic: large sample case
zobs = (x̄1 − x̄2 − D0) / σ(x̄1 − x̄2) = (x̄1 − x̄2 − D0) / √(σ1²/n1 + σ2²/n2)

3) Critical value, rejection and acceptance region:


rejection region: | zobs |> zα / 2
Test of hypothesis for difference of two population
means(one tailed test and large sample)

1) Hypothesis:
H 0 : μ1 − μ 2 = D0
H a : μ1 − μ 2 > D 0 or H a : μ1 − μ 2 < D0
2) Test statistic: large sample case

zobs = (x̄1 − x̄2 − D0) / σ(x̄1 − x̄2) = (x̄1 − x̄2 − D0) / √(σ1²/n1 + σ2²/n2)

3) Critical value, rejection and acceptance region:


rejection region: zobs > zα or zobs < − zα
Example: compare salary difference.

Population 1: faculty in public schools


Population 2: faculty in private schools
μ1 = mean salary of faculty in public schools
μ2 = mean salary of faculty in private schools
Sample 1: salaries of faculty members in public schools
(n=30)
Sample 2: salaries of faculty members in private schools
(n=35)
x1 = 57.48 x2 = 66.39
s1 = 9 s 2 = 9 .5

Test the hypothesis that salaries are lower for faculty in public schools, with 5% significance level
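A sketch of this one-tailed large-sample test in Python (SciPy assumed):

  from math import sqrt
  from scipy import stats

  x1, s1, n1 = 57.48, 9.0, 30    # public
  x2, s2, n2 = 66.39, 9.5, 35    # private

  z = (x1 - x2) / sqrt(s1**2/n1 + s2**2/n2)   # about -3.88
  print(z, z < -stats.norm.isf(0.05))         # -3.88 < -1.645, so reject H0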
In the small-sample case, the sampling distribution of the difference between two means is the t-distribution with mean

μ(x̄1 − x̄2) = μ1 − μ2

and estimated standard error

s √(1/n1 + 1/n2)

where

s² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)

with n1 + n2 − 2 degrees of freedom


Test of hypothesis for difference of two population
means
(two tailed and small sample)

1) Hypothesis:
H 0 : μ1 − μ 2 = D0
H a : μ1 − μ 2 ≠ D0

2) Test statistic: small sample case


tobs = (x̄1 − x̄2 − D0) / [s √(1/n1 + 1/n2)]

3) Critical value, rejection and acceptance region:


rejection region: | tobs |> tα / 2 d.f=n1+n2-2
Test of hypothesis for difference of two population means
(one tailed test and small sample)
1) Hypothesis:
H 0 : μ1 − μ 2 = D0
H a : μ 1 − μ 2 > D 0 or H a : μ1 − μ 2 < D0
2) Test statistic: small sample case
tobs = (x̄1 − x̄2 − D0) / [s √(1/n1 + 1/n2)]

3) Critical value, rejection and acceptance region:


rejection region: tobs > tα or tobs < −tα d.f.=n1+n2-2
Example: compare salary difference.

Population 1: faculty in public schools


Population 2: faculty in private schools

μ 1 =mean salary of faculty in public schools


μ 2 =mean salary of faculty in private schools

Sample 1: salaries of faculty members in public schools (n=10)


Sample 2: salaries of faculty members in private schools (n=15)

x1 = 57.48 x2 = 66.39
s1 = 9 s 2 = 9 .5
Test the hypothesis that the salaries are the same for faculty in
public and private school with 5% significance level
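A sketch of the pooled two-sample t test in Python (SciPy assumed):

  from math import sqrt
  from scipy import stats

  x1, s1, n1 = 57.48, 9.0, 10    # public
  x2, s2, n2 = 66.39, 9.5, 15    # private

  sp2 = ((n1 - 1)*s1**2 + (n2 - 1)*s2**2) / (n1 + n2 - 2)   # pooled variance
  t = (x1 - x2) / (sqrt(sp2) * sqrt(1/n1 + 1/n2))           # about -2.34
  t_crit = stats.t.isf(0.025, df=n1 + n2 - 2)               # about 2.069
  print(t, abs(t) > t_crit)                                 # |t| > 2.069, so reject H0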
Test of hypothesis for binomial proportion
1) Hypothesis: H 0 : p = p0

Two-tailed: H a : p ≠ p0

One-tailed: H a : p > p0 or H a : p < p0

2) Test statistic: large sample case


zobs = (p̂ − p0) / √(p0 q0 / n), where p̂ = x/n
3) Critical value, rejection and acceptance region:
rejection region: two-tailed: |zobs| > zα/2
one-tailed: zobs > zα or zobs < −zα
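No worked example is given here, so the following Python sketch uses made-up counts purely for illustration (SciPy assumed): 270 successes in 500 trials, testing H0: p = 0.5 two-tailed.

  from math import sqrt
  from scipy import stats

  x, n, p0 = 270, 500, 0.5          # hypothetical data

  p_hat = x / n
  z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # about 1.79
  print(z, 2 * stats.norm.sf(abs(z)))          # two-tailed p, about .074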


STATA
• prtest
Test of hypothesis for difference in binomial
proportions

1) Hypothesis:
H0: p1 − p2 = D0
HA: p1 − p2 ≠ D0 (two-tailed) or p1 − p2 > D0 or p1 − p2 < D0 (one-tailed)
2) Test statistic:

zobs = (p̂1 − p̂2 − D0) / √(p1q1/n1 + p2q2/n2)
Test of hypothesis for difference in binomial
proportions

• Because p1 and p2 are not known, use a pooled p̂ in the sample standard error when you are testing whether the difference is zero:

p̂ = (x1 + x2) / (n1 + n2)
• And when you are testing whether the difference
is something other than zero use the estimated
proportions from the two different samples
• Section 8.8 in the book has this spelled out nicely
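A sketch of the pooled version (testing D0 = 0) with hypothetical counts, in Python (SciPy assumed):

  from math import sqrt
  from scipy import stats

  x1, n1 = 45, 120                   # hypothetical data
  x2, n2 = 30, 130

  p1_hat, p2_hat = x1/n1, x2/n2
  p_pool = (x1 + x2) / (n1 + n2)     # pooled estimate, used only when D0 = 0
  se = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
  z = (p1_hat - p2_hat) / se
  print(z, 2 * stats.norm.sf(abs(z)))   # z about 2.49, two-tailed p about .013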
P-values
The smallest value of alpha for which the test results are statistically significant, in other words statistically different from the null hypothesis value; the smallest alpha at which you still reject the null.
Example 1: You see a p-value of .025
- You would fail to reject at the 1% level of significance, but reject at the 5% level
Example 2: 60 students are polled; an average of 72 is observed with a standard deviation of 10. What is the p-value of the test of whether the population average is 75?
P-value
1. Calculate the z observed value for your observation
2. Find the area to the right of this value
3. If this is a two tailed test multiply this area by 2, if this is
a one-tail test you are done

Example: 60 students are polled; an average of 72 is observed with a standard deviation of 10. What is the p-value of the test of whether the population average is 75?
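A sketch of the answer in Python (SciPy assumed), following the three steps above:

  from math import sqrt
  from scipy import stats

  n, xbar, s, mu0 = 60, 72, 10, 75

  z = (xbar - mu0) / (s / sqrt(n))   # step 1: about -2.32
  p_right = stats.norm.sf(abs(z))    # step 2: area beyond |z|, about .010
  print(2 * p_right)                 # step 3: two-tailed p-value, about .020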
Power of a statistical test
- Power = P(reject the null hypothesis when it is false) = 1 − β
- (1 − α) is the probability we accept the null when it is in fact true
- (1 − β) is the probability we reject when the null is in fact false; this is the power of the test
- You would prefer to have larger power
- The power changes depending on what the actual population parameter is
Correlation
In this section you will cover covariance and Pearson’s product moment correlation coefficient.

[Figure: quadrant diagram around the point (x̄, ȳ) showing the sign of (x − x̄)(y − ȳ) in each quadrant]

Σ(x − x̄)(y − ȳ) / n = Sxy = covariance

Pearson’s Product Moment Correlation Coefficient (PMCC)
The idea is to standardise the covariance so that it can be interpreted easily. It converts the covariance to a number between -1 and 1, where:
• -1 is a perfect negative correlation
• 1 is a perfect positive correlation
• 0 is no correlation
[Portrait: Karl Pearson, 1857-1936]

r = [Σ(x − x̄)(y − ȳ)/n] / √[ (Σ(x − x̄)²/n) (Σ(y − ȳ)²/n) ]

Pearson’s Product Moment Correlation Coefficient (cont.)

r = [(1/n) Σ(x − x̄)(y − ȳ)] / (Sx Sy)
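A minimal Python sketch of these formulas, using a small made-up data set purely for illustration:

  from math import sqrt

  x = [1, 2, 3, 4, 5]    # hypothetical paired data
  y = [2, 4, 5, 4, 5]

  n = len(x)
  xbar, ybar = sum(x)/n, sum(y)/n
  sxy = sum((xi - xbar)*(yi - ybar) for xi, yi in zip(x, y)) / n   # covariance
  sx = sqrt(sum((xi - xbar)**2 for xi in x) / n)
  sy = sqrt(sum((yi - ybar)**2 for yi in y) / n)
  print(sxy / (sx * sy))   # r, always between -1 and 1 (about 0.77 here)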
The Kruskal-Wallis H Test

• The Kruskal-Wallis H Test is a


nonparametric procedure that can be used to
compare more than two populations in a
completely randomized design.
• All n = n1+n2+…+nk measurements are jointly
ranked (i.e.treat as one large sample).
• We use the sums of the ranks of the k samples
to compare the distributions.
The Kruskal-Wallis H Test

• Rank the total measurements in all k samples from 1 to n. Tied observations are assigned the average of the ranks they would have gotten if not tied.
• Calculate Ti = rank sum for the ith sample, i = 1, 2,…,k
• And the test statistic:

H = [12/(n(n+1))] Σ (Ti²/ni) − 3(n + 1)
The Kruskal-Wallis H Test

H0: the k distributions are identical versus


Ha: at least one distribution is different
Test statistic: Kruskal-Wallis H
When H0 is true, the test statistic H has an
approximate chi-square distribution with df
= k-1.
Use a right-tailed rejection region or p-
value based on the Chi-square distribution.
Example
Four groups of students were randomly
assigned to be taught with four different
techniques, and their achievement test scores
were recorded. Are the distributions of test
scores the same, or do they differ in location?
1 2 3 4
65 75 59 94
87 69 78 89
73 83 67 80
79 81 62 88
Teaching Methods
1        2        3        4
65 (3)   75 (7)   59 (1)   94 (16)
87 (13)  69 (5)   78 (8)   89 (15)
73 (6)   83 (12)  67 (4)   80 (10)
79 (9)   81 (11)  62 (2)   88 (14)
Ti  31       35       15       55

Rank the 16 measurements from 1 to 16, and calculate the four rank sums.
H0: the distributions of scores are the same
Ha: the distributions differ in location

Test statistic: H = [12/(n(n+1))] Σ (Ti²/ni) − 3(n + 1)
= [12/(16·17)] [(31² + 35² + 15² + 55²)/4] − 3(17) = 8.96
Teaching Methods
H0: the distributions of scores are the same
Ha: the distributions differ in location

Test statistic: H = [12/(n(n+1))] Σ (Ti²/ni) − 3(n + 1)
= [12/(16·17)] [(31² + 35² + 15² + 55²)/4] − 3(17) = 8.96

Rejection region: for a right-tailed chi-square test with α = .05 and df = 4 − 1 = 3, reject H0 if H ≥ 7.81.
Conclusion: Reject H0. There is sufficient evidence to indicate that there is a difference in test scores for the four teaching techniques.
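The same H statistic can be obtained in Python (SciPy assumed; scipy.stats.kruskal also returns the chi-square P-value):

  from scipy import stats

  g1 = [65, 87, 73, 79]
  g2 = [75, 69, 83, 81]
  g3 = [59, 78, 67, 62]
  g4 = [94, 89, 80, 88]

  H, p = stats.kruskal(g1, g2, g3, g4)
  print(H, p)   # H about 8.96, p below .05, so reject H0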
Key Concepts
I. Nonparametric Methods
These methods can be used when the data cannot be measured on a
quantitative scale, or when
• The numerical scale of measurement is arbitrarily set by the
researcher, or when
• The parametric assumptions such as normality or constant
variance are seriously violated.
Key Concepts
Kruskal-Wallis H Test: Completely Randomized Design
1. Jointly rank all the observations in the k samples (treat as one
large sample of size n say). Calculate the rank sums, Ti = rank
sum of sample i, and the test statistic
H = [12/(n(n+1))] Σ (Ti²/ni) − 3(n + 1)

2. If the null hypothesis of equality of distributions is false, H


will be unusually large, resulting in a one-tailed test.
3. For sample sizes of five or greater, the rejection region for H is
based on the chi-square distribution with (k − 1) degrees of
freedom.
The Mann-Whitney U test
Peter Shaw
Introduction
• We meet our first inferential test.
• You should not get put off by the messy-looking formulae - it’s usually run on a PC anyway.
• The important bit is to understand the philosophy of the test.
Imagine..
• That you have acquired a set of measurements from 2 different sites.
– Maybe one is alleged to be polluted, the other clean, and you measure residues in the soil.
– Maybe these are questionnaire returns from students identified as M or F.
• You want to know whether these 2 sets of measurements genuinely differ. The issue here is that you need to rule out the possibility of the results being random noise.
The formal procedure:
• Involves the creation of two competing explanations for the data recorded.
– Idea 1: These are pattern-less random data. Any observed patterns are due to chance. This is the null hypothesis H0.
– Idea 2: There is a defined pattern in the data. This is the alternative hypothesis H1.
• Without the statement of the competing hypotheses, no meaningful test can be run.
Occam’s razor
• If competing explanations exist, choose the simpler unless there is good reason to reject it.
• Here, you must assume H0 to be true until you can reject it.
• In point of fact you can never ABSOLUTELY prove that your observations are non-random. Any pattern could arise in random noise, by chance. Instead you work out how likely H0 is to be true.
Example
You conduct a questionnaire survey of homes in the
Heathrow flight path, and also a control population of
homes in South west London. Responses to the question
“How intrusive is plane noise in your daily life” are
tabulated:
Noise complaints: 1 = no complaint, 5 = very unhappy
Homes near airport   Control site
5                    3
4                    2
4                    4
3                    1
5                    2
4                    1
5
Stage 1: Eyeball the data!
• These data are ordinal, but not normally distributed (allowable scores are 1, 2, 3, 4 or 5).
• Use non-parametric statistics.
• It does look as though people are less happy under the flightpath, but recall that we must state our hypotheses H0, H1:
– H0: There is no difference in attitudes to plane noise between the two areas - any observed differences are due to chance.
– H1: Responses to the question differed between the two areas.
Now we assess how likely it is that this pattern could occur by chance:
• This is done by performing a calculation. Don’t worry yet about what the calculation entails.
• What matters is that the calculation gives an answer (a test statistic) whose likelihood can be looked up in tables. Thus by means of this tool - the test statistic - we can work out an estimate of the probability that the observed pattern could occur by chance in random data.
One philosophical hurdle to go:
• The test statistic generates a probability - a number from 0 to 1, which is the probability of H0 being true.
• If p = 0, H0 is certainly false. (Actually this is over-simple, but a good approximation.)
• If p is large, say p = 0.8, H0 must be accepted as true.
• But how about p = 0.1, p = 0.01?
Significance
• We have to define a threshold, a boundary, and say that if p is below this threshold H0 is rejected and H1 accepted; otherwise H0 is accepted.
• This boundary is called the significance level. By convention it is set at p = 0.05 (1:20), but you can choose any other number - as long as you specify it in the write-up of your analyses.
• WARNING!! This means that if you analyse 100 sets of random data, the expectation (long-term average) is that 5 will generate a significant test.
The procedure:
1. Set up H0, H1.
2. Decide significance level: p = 0.05.
3. Data (airport vs control): 5, 4, 4, 3, 5, 4, 5 vs 3, 2, 4, 1, 2, 1.
4. Calculate the test statistic: U = 15.5; probability of H0 being true: p = 0.03.
5. Is p above the critical level? Yes: accept H0. No: reject H0. (Here p = 0.03 < 0.05, so reject H0.)
This particular test:
• The Mann-Whitney U test is a non-parametric test which examines whether 2 columns of data could have come from the same population (i.e. “should” be the same).
• It generates a test statistic called U (no idea why it’s U). By hand we look U up in tables; PCs give you an exact probability.
• It requires 2 sets of data - these need not be paired, nor need they be normally distributed, nor need there be equal numbers in each set.
How to do it
• 1: Rank all data into ascending order, then re-code the data set, replacing raw data with ranks.
• 2: Harmonize ranks where the same value occurs more than once (tied values each take the average of the ranks they would otherwise occupy).

Raw data     Ranks (#)          Harmonized ranks
5    3       5 #13    3 #5      12      5.5
4    2       4 #10    2 #4      8.5     3.5
4    4       4 #9     4 #7      8.5     8.5
3    1       3 #6     1 #2      5.5     1.5
5    2       5 #12    2 #3      12      3.5
4    1       4 #8     1 #1      8.5     1.5
5            5 #11              12
Once data are ranked:
• Add up the ranks for each column; call these rx and ry.
• (Optional but a good check: rx + ry = n²/2 + n/2, or you have an error.)
• Calculate:
– Ux = NxNy + Nx(Nx+1)/2 − Rx
– Uy = NxNy + Ny(Ny+1)/2 − Ry
• Take the SMALLER of these 2 values and look it up in tables. If U is LESS than the critical value, reject H0.
• NB This test is unique in one feature: here low values of the test statistic are significant - this is not true for any other test.
In this case:
Harmonized rank sums: rx = 67, ry = 24.
Check: rx + ry = 91; 13·13/2 + 13/2 = 91. CHECK.
Ux = 6·7 + 7·8/2 − 67 = 3
Uy = 6·7 + 6·7/2 − 24 = 39
Lowest U value is 3.
Critical value of U(7,6) = 4 at p = 0.01. Calculated U is < tabulated U, so reject H0.
At p = 0.01 these two sets of data differ.
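On a PC this whole calculation is one call; a Python sketch (SciPy assumed):

  from scipy import stats

  airport = [5, 4, 4, 3, 5, 4, 5]
  control = [3, 2, 4, 1, 2, 1]

  # SciPy reports U for the first sample (39 here); the smaller of the
  # two U values is 7*6 - 39 = 3, matching the hand calculation above.
  U, p = stats.mannwhitneyu(airport, control, alternative='two-sided')
  print(U, p)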
Tails.. Generally use 2-tailed tests
• 2-tailed test: These populations DIFFER (rejection regions in both the lower and upper tails of the distribution).
• 1-tailed test: Population X is Greater than Y (or Less than Y) (rejection region in one tail only).
Kruskal-Wallis: The U test’s big cousin
When we have 2 groups to compare (M/F, site 1/site 2, etc.) the U test is applicable and safe.
How do we handle cases with 3 or more groups?
The simple answer is to run the Kruskal-Wallis test. This is run on a PC, but behaves very much like the M-W U. It will give one significance value, which simply tells you whether at least one group differs from another.
[Diagram: two groups (Males vs Females) - do males differ from females? Three groups (Site 1, Site 2, Site 3) - do results differ between these sites?]
Your coursework:
I will give each of you a sheet with data collected from 3 sites. (Don’t try copying - each one is different and I know who gets which dataset!)
I want you to show me your data processing skills as follows:
1: Produce a boxplot of these data, showing how values differ between the categories.
2: Run 3 separate Mann-Whitney U tests on them, comparing 1-2, 1-3 and 2-3. Only call the result significant if the p value is < 0.01.
3: Run a Kruskal-Wallis anova on the three groups combined, and comment on your results.
Mann-Whitney U test
pair no.   S. male thorax width   P. male thorax width
1          4                      2.8
2          3                      2.7
3          2.6                    2.6
4          3.85                   2.7
5          2.65                   2.6
6          2.7                    2.6
7          2.85                   2.7
8          2.85                   2.8
9          3.2                    2.9
10         2.9                    2.6
Mann-Whitney U test

[Screenshot: spreadsheet sort button - push this to sort the data in ascending order]
Mann-Whitney U test
S. male (rank / thorax width)   P. male (rank / thorax width)
3 / 2.6                         3 / 2.6
6 / 2.65                        3 / 2.6
8.5 / 2.7                       3 / 2.6
13.5 / 2.85                     3 / 2.6
13.5 / 2.85                     8.5 / 2.7
15.5 / 2.9                      8.5 / 2.7
17 / 3                          8.5 / 2.7
18 / 3.2                        11.5 / 2.8
19 / 3.85                       11.5 / 2.8
20 / 4                          15.5 / 2.9
• Rank both lists as one combined list
• I found this a time-consuming task
Mann-Whitney U test
• Sum the ranks for each sample (from the ranked table above): R1 = 134, R2 = 76
• N1 = # obs in sample 1; N2 = # obs in sample 2 (here N1 = N2 = 10)
Mann-Whitney U test
• Normally you would now use the formulas and chart in the Brown reading.
• U1 = (N1)(N2) + [(N1)(N1+1)]/2 − R1 = 100 + 55 − 134 = 21
• U2 = (N1)(N2) + [(N2)(N2+1)]/2 − R2 = 100 + 55 − 76 = 79
• However the sample size is larger than the table will allow, because any sample greater than 20 can be assumed to mimic normality.
• We therefore use the equation to convert the U statistic to a Z-score.
Mann-Whitney U test
• U1 = 21, U2 = 79; N1 = 10, N2 = 10
• Z = {largest U value − [(N1)(N2)]/2} / √{[(N1)(N2)(N1+N2+1)]/12}
• Z = (79 − 50)/√175 ≈ 2.19
• If Z > 1.96 then P < 0.05
• Therefore there is a significant difference between the thorax width of single and mated males
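A sketch of this normal-approximation step in Python, using the rank sums from the table above:

  from math import sqrt

  N1 = N2 = 10
  R1, R2 = 134, 76                        # rank sums of the two samples
  U1 = N1*N2 + N1*(N1 + 1)/2 - R1         # 21
  U2 = N1*N2 + N2*(N2 + 1)/2 - R2         # 79 (check: U1 + U2 = N1*N2)
  z = (max(U1, U2) - N1*N2/2) / sqrt(N1*N2*(N1 + N2 + 1)/12)
  print(U1, U2, z)                        # z about 2.19, which exceeds 1.96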
Wilcoxon Signed Rank
• When N > 15, use a z-score conversion
• μT+ = N(N+1)/4
• VarT+ = N(N+1)(2N+1)/24
• Z = (T+ − μT+) / √(VarT+) = {T+ − [N(N+1)/4]} / √{[N(N+1)(2N+1)]/24}
• If Z > 1.96 then P < 0.05: reject the null hypothesis
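A sketch with hypothetical numbers (N = 20 pairs, positive-rank sum T+ = 160), purely to illustrate the conversion:

  from math import sqrt

  N, T_plus = 20, 160                 # hypothetical values

  mu = N*(N + 1)/4                    # 105
  var = N*(N + 1)*(2*N + 1)/24        # 717.5
  z = (T_plus - mu) / sqrt(var)       # about 2.05
  print(z, z > 1.96)                  # significant at the .05 level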
MEASURES OF
DISPERSION
Measures of Dispersion
• While measures of central tendency indicate what value
of a variable is (in one sense or other) “average” or
“central” or “typical” in a set of data, measures of
dispersion (or variability or spread) indicate (in one
sense or other) the extent to which the observed values
are “spread out” around that center — how “far apart”
observed values typically are from each other and
therefore from some average value (in particular, the
mean). Thus:
– if all cases have identical observed values (and thereby are also
identical to [any] average value), dispersion is zero;
– if most cases have observed values that are quite “close
together” (and thereby are also quite “close” to the average
value), dispersion is low (but greater than zero); and
– if many cases have observed values that are quite “far away”
from many others (or from the average value), dispersion is high.
• A measure of dispersion provides a summary statistic
that indicates the magnitude of such dispersion and, like
a measure of central tendency, is a univariate statistic.
Importance of the Magnitude of Dispersion Around the Average
• Dispersion around the mean test score.

• Baltimore and Seattle have about the same mean daily


temperature (about 65 degrees) but very different
dispersions around that mean.

• Dispersion (Inequality) around average household


income.
Hypothetical Ideological Dispersion
Hypothetical Ideological Dispersion (cont.)
Dispersion in Percent Democratic in CDs
Measures of Dispersion
• Because dispersion is concerned with how “close
together” or “far apart” observed values are (i.e., with the
magnitude of the intervals between them), measures of
dispersion are defined only for interval (or ratio)
variables,
– or, in any case, variables we are willing to treat as interval (like
IDEOLOGY in the preceding charts).
– There is one exception: a very crude measure of dispersion
called the variation ratio, which is defined for ordinal and even
nominal variables. (It will be discussed briefly in the Answers &
Discussion to PS #7.)

• There are two principal types of measures of dispersion:


range measures and deviation measures.
Range Measures of Dispersion
• Range measures are based on the distance between
pairs of (relatively) “extreme” values observed in the
data.
– They are conceptually connected with the median as a measure
of central tendency.

• The (“total” or “simple”) range is the maximum (highest)


value observed in the data [the value of the case at the
100th percentile] minus the minimum (lowest) value
observed in the data [the value of the case at the 0th
percentile]
– That is, it is the “distance” or “interval” between the values of the
two most extreme cases,
– e.g., range of test scores
TABLE 1 – PERCENT OF POPULATION AGED 65 OR HIGHER
IN THE 50 STATES
(UNIVARIATE DATA)

Alabama 12.4 Montana 12.5


Alaska 3.6 Nebraska 13.8
Arizona 12.7 Nevada 10.6
Arkansas 14.6 New Hampshire 11.5
California 10.6 New Jersey 13.0
Colorado 9.2 New Mexico 10.0
Connecticut 13.4 New York 13.0
Delaware 11.6 North Carolina 11.8
Florida 17.8 North Dakota 13.3
Georgia 10.0 Ohio 12.5
Hawaii 10.1 Oklahoma 12.8
Idaho 11.5 Oregon 13.7
Illinois 12.1 Pennsylvania 14.8
Indiana 12.1 Rhode Island 14.7
Iowa 14.8 South Carolina 10.7
Kansas 13.6 South Dakota 14.0
Kentucky 12.3 Tennessee 12.4
Louisiana 10.8 Texas 9.7
Maine 13.4 Utah 8.2
Maryland 10.7 Vermont 11.9
Massachusetts 13.7 Virginia 10.6
Michigan 11.5 Washington 11.8
Minnesota 12.6 West Virginia 13.9
Mississippi 12.1 Wisconsin 13.2
Missouri 13.8 Wyoming 8.9
Range in a Histogram
Problems with the [Total] Range

• The problem with the [total] range as a measure of


dispersion is that it depends on the values of just two
cases, which by definition have (possibly extraordinarily)
atypical values.
– In particular, the range makes no distinction between a polarized
distribution in which almost all observed values are close to
either the minimum or maximum values and a distribution in
which almost all observed values are bunched together but there
are a few extreme outliers.
• Recall Ideological Dispersion bar graphs =>
– Also the range is undefined for theoretical distributions that are
“open-ended,” like the normal distribution (that we will take up in
the next topic) or the upper end of an income distribution type of
curve (as in previous slides).
Two Ideological Distributions with
the Same Range
The Interdecile Range
• Therefore other variants of the range measure that do
not reach entirely out to the extremes of the frequency
distribution are often used instead of the total range.

• The interdecile range is the value of the case that stands


at the 90th percentile of the distribution minus the value
of the case that stands at the 10th percentile.
– That is, it is the “distance” or “interval” between the
values of these two rather less extreme cases.
The Interquartile Range

• The interquartile range is the value of the case that


stands at the 75th percentile of the distribution minus the
value of the case that stands at the 25th percentile.
– The first quartile is the median observed value among
all cases that lie below the overall median and the
third quartile is the median observed value among all
cases that lie above the overall median.
– In these terms, the interquartile range is third quartile
minus the first quartile.
The Standard Margin of Error Is a Range Measure
• Suppose the Gallup Poll takes a random sample of n respondents
and reports that the President's current approval rating is 62% and
that this sample statistic has a margin of error of ±3%. Here is what
this means: if (hypothetically) Gallup were to take a great many
random samples of the same size n from the same population (e.g.,
the American VAP on a given day), the different samples would give
different statistics (approval ratings), but 95% of these samples
would give approval ratings within 3 percentage points of the true
population parameter.

• Thus, if our data is the list of sample statistics produced


by the (hypothetical) “great many” random samples, the
margin of error specifies the range between the value of
the sample statistic that stands at the 97.5th percentile
minus the sample statistic that stands at the 2.5th
percentile (so that 95% of the sample statistics lie within
this range). Specifically (and letting P be the value of
the population parameter) this “95% range” is
(P + 3%) - (P -3%) = 6%, i.e., twice the margin error.
Deviation Measures of Dispersion
• Deviation measures are based on average deviations
from some average value.
– Since dispersion measures pertain to interval variables, we
can calculate means, and deviation measures are typically
based on the mean deviation from the mean value.
– Thus the (mean and) standard deviation measures are
conceptually connected with the mean as a measure of central
tendency.
• Review: Suppose we have a variable X and a set of
cases numbered 1, 2, . . . , n. Let the observed value of
the variable in each case be designated x1, x2, etc.
Deviation Measures of Dispersion: Example
Deviation Measures of Dispersion (cont.)
• The deviation from the mean for a representative case i is xi - mean
of x.
– If almost all of these deviations are close to zero, dispersion is
small.
– If many of these deviations much different from zero, dispersion
is large.
• This suggests we could construct a measure D of dispersion that
would simply be the average (mean) of all the deviations.

But this does not work because, as we saw earlier, it is a property


of the mean that all deviations from it add up to zero (regardless of
how much dispersion there is).
Deviation Measures of Dispersion: Example
(cont.)
The Mean Deviation
• A practical way around this problem is simply to ignore
the fact that some deviations are negative while others
are positive by averaging the absolute values of the
deviations.
• This measure (called the mean deviation) tells us the
average (mean) amount that the values for all cases
deviate (regardless of whether they are higher or lower)
from the average (mean) value.
• Indeed, the Mean Deviation is an intuitive, understandable,
and perfectly reasonable measure of dispersion,
and it is occasionally used in research.
The Mean Deviation (cont.)
The Variance
• Statisticians dislike this measure because the formula is
mathematically messy by virtue of being “non-algebraic”
(in that it ignores negative signs).
• Therefore statisticians, and most researchers, use
another slightly different deviation measure of dispersion
that is “algebraic.”
– This measure makes use of the fact that the square of any real
(positive or negative) number other than zero is itself always
positive.
• This measure --- the average of the squared deviations
from the mean (as opposed the average of the absolute
deviations) --- is called the variance.
The Variance (cont.)
The Variance (cont.)
• The variance is the average squared deviation from the
mean.
– The total (and average) squared deviation from the mean
value of X is smaller than the total (and average) squared deviation
from any other value of X.
• The variance is the usual measure of dispersion in
statistical theory, but it has a drawback when researchers
want to describe the dispersion in data in a practical way.
– Whatever units the original data (and its average values and its
mean dispersion) are expressed in, the variance is expressed in
the square of those units, which may not make much (or any)
intuitive or practical sense.
– This can be remedied by finding the (positive) square root of the
variance (which takes us back to the original units).
• The square root of the variance is called the standard
deviation.
The Standard Deviation
The Standard Deviation (cont.)

• In order to interpret a standard deviation, or to make a


plausible estimate of the SD of some data, it is useful to
think of the mean deviation because
– it is easier to estimate (or guess) the magnitude of the MD than
the SD; and
– the standard deviation has approximately the same numerical
magnitude as the mean deviation, though it is almost always
somewhat larger.
• The SD is never less than the MD;
• the SD is equal to the mean deviation if the data is distributed in a
maximally “polarized” fashion;
• Otherwise the SD is somewhat larger than the MD — typically about
20-50% larger.
Standard Deviation Worksheet
1. Set up a worksheet like the one shown in the previous slides.
2. In the first column, list the values of the variable X for each of the n cases.
[This is the raw data.]
3. Find the mean value of the variable in the data, by adding up the values in
each case and dividing by the number of cases.
4. In the second column, subtract the mean from each value to get, for each
case, the deviation from the mean. Some deviations are positive, others
negative, and (apart from rounding error) they must add up to zero; add
them up as an arithmetic check.
5. In the third column, square each deviation from the mean, i.e., multiply the
deviation by itself. Since the product of two negative numbers is positive,
every squared deviation is non-negative, i.e., either positive or zero (in the event
a case has a value that coincides with the mean value).
6. Add up the squared deviations over all cases.
7. Divide the sum of the squared deviations by the number of cases; this gives
the average squared deviation from the mean, commonly called the
variance.
8. The standard deviation is the (positive) square root of the variance. (The
square root of x is that number which when multiplied by itself gives x.)
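The worksheet translates directly into a short Python sketch (illustrative data; note the n divisor, as in this handout):

  from math import sqrt

  x = [10, 12, 14, 16, 18]   # hypothetical raw data (step 2)

  n = len(x)
  mean = sum(x) / n                            # step 3
  deviations = [xi - mean for xi in x]         # step 4: these sum to zero
  squared = [d**2 for d in deviations]         # step 5
  variance = sum(squared) / n                  # step 7: here 8.0
  sd = sqrt(variance)                          # step 8: here about 2.83
  print(mean, sum(deviations), variance, sd)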
The Mean, Deviations, Variance, and SD

• What is the effect of adding a constant amount to (or


subtracting from) each observed value?
• What is the effect of multiplying each observed value (or
dividing it by) a constant amount?
Adding (subtracting) the same amount to (from) every
observed value changes the mean by the same amount
but does not change the dispersion (for either range or
deviation measures)
Multiplying (or dividing) every observed value by the same
factor changes the mean and the SD [or MD] by that same
factor and changes the variance by that factor squared.
Sample Estimates of Population
Dispersion
• Random sample statistics that are percentages or averages
provide unbiased estimates of the corresponding population
parameters.

• However, sample statistics that are dispersion measures


provide estimates of population dispersion that are biased
(at least slightly) downward.
– This is most obvious in the case of the range; it should
be evident that a sample range is almost always smaller
than, and can never be larger than, the corresponding
population range.
Sample Estimates of Population Dispersion (cont.)
• The sample standard deviation (or variance) is also biased
downward, but only slightly if the sample is at all large.
– While the SD of a particular sample can be larger than the population
SD, sample SDs are on average slightly smaller than the corresponding
population SDs.
• The sample SD can be adjusted to provide an unbiased estimate of
the population SD
– This simple adjustment consists of dividing the sum of the squared
deviations by n - 1, rather than by n.
– Clearly this adjustment makes no practical difference unless the sample
is quite small.
• Notice that if you apply the SD [or MD or any Range] formula in the
event that you have just a single observation in your sample, sample
dispersion = 0 regardless of what the observed value is.
– More intuitively, you can get no sense of how much dispersion there is
in a population with respect to some variable until you observe at least
two cases and can see how “far apart” they are.
• This is why you will often see the formula for the variance and SD
with an n - 1 divisor (and scientific calculators often build in this
formula).
– However, for POLI 300 problem sets and tests, you should use the
formula given in the previous section of this handout.
Dispersion in Ratio Variables
• Given a ratio variable (e.g. income), the interesting
“dispersion question” may pertain not to the interval
between two observed values or between an observed
value and the mean value but to the ratio between the
two values.
– For example, fifty years ago, the income of the household at the
25th percentile was about $5,000 and the income of the
household at the 75th percentile was about $10,000, while today
the figures are about $40,000 and $80,000 respectively.
• While the interval between the two income levels (the interquartile
range) has increased from $5,000 to $40,000, the ratio between the
two income levels has remained a constant 2 to 1.
• Other examples pertain to income:
– One household “poverty level” is defined as half of median
household income.
– Households with more than twice the median income are
sometimes characterized as “well off.”
– The average compensation of CEOs today is about 250 times
that of the average worker, whereas 50 years ago it was only about
40 times that of the average worker.
Dispersion in Ratio Variables (cont.)

• The degree of dispersion in ratio variables can naturally
be referred to as the degree of inequality.
– For example the two sets of income levels ($5K vs.
$10K and $40K vs. $80K) at the 25th and 75th
percentiles respectively seem to be “equally unequal”
because they are in the same ratio.

• Thus the SD does not work well as a measure of


inequality (of income, etc.), because it takes no account
of the ratio property of [ratio] variables.
The Coefficient of Variation
• One ratio measure of dispersion/inequality is called the
coefficient of variation, which is simply the standard
deviation divided by the mean.
– It answers the question: how big is the SD of the distribution
relative to the mean of the distribution?

• Recall PS#6, Question #7, comparing the distributions of
height and weight among American adults.
– We naturally want to say that in some sense American
adults exhibit more dispersion in weight than height.
– But if by dispersion we mean [any kind of] range, mean
deviation, or variance/SD, the claim is strictly meaningless
because the two variables are measured in different units
(pounds, kilograms, etc. vs. inches, feet, centimeters, etc.), so
the numerical comparison is not valid.
Coefficient of Variation (cont.)
Summary statistics for WEIGHT and HEIGHT (both ratio variables) of American adults in
different units:
Weight Height
Mean 160 pounds 66 inches
72.6 kilograms 5.5 feet
.08 tons 168 centimeters

SD 30 pounds 4 inches
13.6 kilograms .33 feet
.015 tons 10.2 centimeters

Which variable [WEIGHT or HEIGHT] has greater dispersion? [No meaningful answer can be given.]
Which variable has greater dispersion relative to its average, i.e., a greater Coefficient of Variation (SD relative to mean)?

Weight: 30/160 = 13.6/72.6 = .015/.08 = .18
Height: 4/66 = .33/5.5 = 10.2/168 = .06

Note that the Coefficient of Variation is a pure number, not expressed in any units, and is
the same whatever units the variable is measured in.
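The computation is a one-line ratio; a Python sketch with the figures from the table above:

  weight_cv = 30 / 160   # SD/mean for weight, about .18 in any unit system
  height_cv = 4 / 66     # SD/mean for height, about .06 in any unit system
  print(weight_cv, height_cv)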
Coefficient of Variation

• The old and new SDs are the same.


• The old Coefficient of Variation was
SD/Mean = 2/14 = 1/7 = 0.143
• while the new Coefficient of variation is
SD/Mean = 2/4 = 0.5
Coefficient of Variation (cont)

• The old and new SDs are the same.


• The old Coefficient of Variation was
– SD/mean = 2/14 = 1/7 = 0.143
• The new Coefficient of Variation is
– SD/mean = 2/114 = 0.0175
Coefficient of Variation (cont)

• The new SD is 10 times the old SD.


• But the old and new Coefficients of Variation are the
same:
SD/mean = 2/14 = 20/140 = 1/7 = 0.143
Introduction to Hypothesis Testing

The One-Sample z Test

• Conditions of Applicability:
– One group of subjects
– Comparing to population with known mean and variance.

• Note: this is not a common situation in Psychology!



Example: Finish times for the 2005 Toronto Marathon
(Oct 16, 2005)

• Suppose your population of interest is women who ran the marathon (slightly artificial).
• You hypothesize that women in their early twenties (20-24) are
faster than the average woman who ran the marathon.
• Here the ‘treatment’ is ‘youth’.



Null Hypothesis Testing

• Largely due to English mathematician Sir R.A. Fisher (1890-1962)


• ‘Proof by contradiction’
• Suppose the null hypothesis is true
– In our example, the null hypothesis is that the finishing times for young
women are drawn from the same distribution as for the rest of the
female contestants.
– Knowing the mean and standard deviation of the population, we can
compute the sampling distribution of the mean for a sample of size n.
This is the null hypothesis distribution.
– The mean time for our sample of young women should be plausible
under this sampling distribution.
– If it is not plausible, it suggests that the null hypothesis is false.
– This lends credence to our alternate hypothesis (that young women are
faster).



How do we judge the plausibility of the null hypothesis?

• The sample mean should be plausible under the sampling distribution of the mean.
[Figure: sampling distribution p(X̄) centered at μ with standard error σX̄; sample means near μ are highly plausible, farther out fairly plausible, and far in the tail implausible]


Plausibility of the null hypothesis

• The plausibility of the null hypothesis is judged by computing the probability p of observing a sample mean that is at least as deviant from the population mean as the value we have observed.
[Figure: sampling distribution p(X̄) with the tail area p beyond the observed X̄ shaded]
Plausibility of the null hypothesis

• This computation is simplified by converting to z-scores:

z = (X̄ − μ) / σX̄

• Under the assumption of normality, we can determine this probability from a standard normal table.
[Figure: standard normal density p(z) with the tail area p beyond the observed z shaded]
Results for 2005 Toronto Marathon

n = 420
μ = 4hr 16min = 256 min
σ = 33min



Results for Random Sample
of Women Under 25

n = 38
X = 4hr 9min = 249 min



Statistical Decisions

• We now know the probability that an observation like ours could


have been drawn from the general female contestant population, i.e.
that our ‘treatment of youth’ had no effect.
• This probability is pretty small. Should we reject the null
hypothesis? This is the process of turning a continuous probability
(a real number) into a binary decision (yes or no).
• If we reject the null hypothesis, there is a chance we will be wrong.
We have to decide what chance we are willing to take, i.e. the
maximum p-value we will accept as grounds for rejecting the null
hypothesis.
• We call this probability threshold the alpha (α) level. A typical value
is .05.
• The α−level must be decided prior to the experiment.



Type I and Type II Errors

• Type I Error: the null hypothesis is true and we reject it.


• Type II Error: the null hypothesis is false and we fail to
reject it.

Actual Situation

Researcher’s Decision         Null Hypothesis is True          Null Hypothesis is False
Accept the Null Hypothesis    p(accept H0 | H0 true) = 1 − α   p(accept H0 | H0 false) = β
Reject the Null Hypothesis    p(reject H0 | H0 true) = α       p(reject H0 | H0 false) = 1 − β (power)


Type I and Type II Errors

• Which is more serious?


– Type I can be bad, as rejecting the null hypothesis (e.g., ‘This
stuff really works’), may cause actions to be taken that have no
value.
– Type II may not be so bad, if it is understood that the treatment
may still have an effect (we fail to reject the null hypothesis, but
we do not reject the alternate hypothesis).
– But Type II may be bad if it leads to inaction when action would
have produced good results (e.g., a cure for cancer).



One-Tailed vs Two-Tailed Tests

• Our marathon hypothesis was one-tailed, because we made a specific prediction about the direction of the effect (young women are faster).
• Suppose we had simply hypothesized that young women are different.


Two-Tailed Test

z = (X̄ − μ) / σX̄

[Figure: standard normal density p(z) with both tails beyond ±z shaded; the p-value is the total area in the two tails.]


One-Tailed vs Two-Tailed Tests

• Use a one-tailed test when you have a specific reason to believe the effect will be in a particular direction, and you do not care if the effect is in the opposite direction.
• When the effect is in the predicted direction, a one-tailed test yields a smaller p value (half the two-tailed p for a z test), and hence a greater chance of reaching significance for your directional hypothesis.
• Otherwise, use a two-tailed test.
• The decision of whether to perform one-tailed or two-tailed tests must be made prior to data collection.


Basic Procedure for Statistical Inference

1. State the hypothesis


2. Select the statistical test and significance level
3. Select the sample and collect the data
4. Find the region of rejection
5. Calculate the test statistic
6. Make the statistical decision



Step 1. State the Hypothesis

Null hypothesis: marathon times for young women are the same as for the general female contestant population.
Alternate hypothesis: young women are faster.

H0: μ = μ0
HA: μ < μ0


Step 2. Select the Statistical Test and the Significance Level

• We are comparing a sample mean to a population with known mean and standard deviation → z-test
• α = .05 is probably appropriate.


Step 3. Select the Sample and Collect the
Data
• Ideally, we would randomly assign the treatment to a
random sample of the population (Toronto Marathon
women). Is this possible?
• Instead, we randomly sample female contestants under
25.



Step 4. Find the Region of Rejection

• The z value defining the rejection region is called the critical value for your test, and is a function of the selected α-level. For this reason, we often denote the critical value as zα.

[Figure: standard normal density p(z) with the lower tail shaded. For α = .05, one-tailed (lower tail), zα = −1.645.]
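As a check, the critical value can be computed with SAS’s probit() function (the standard normal quantile); this sketch is not part of the original slides:

data crit;
   z_alpha = probit(0.05);   /* -1.645: lower-tail critical value for alpha = .05 */
run;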
Step 5. Calculate the Test Statistic

z = (X̄ − μ) / σX̄ = (249 − 256) / (33 / √38) ≈ −1.31


Step 6. Make the Statistical Decision

• p ≤ α: Reject the null hypothesis.
• p > α: Fail to reject the null hypothesis.


Example: Height of Female Psychology Graduate Students

Canadian adult female population: μ = 162.10 cm, σ = 6.55 cm
Sample: female students enrolled in PSYC 6130C, 2008-09


Assumptions Underlying the One-Sample z Test

• Random sampling
• Variable is normal
  – CLT: deviations from normality are OK as long as the sample is large.
• Dispersion of the sampled population is the same as for the comparison population
  – e.g., suppose the means are the same, but the dispersion of the sampled population is greater than the dispersion of the comparison population.


Limitations of the One-Sample Test

• Strongly depends on random sampling.
• Better to have two groups of subjects: a test (treatment) group and a control group.
• The problem of random sampling then reduces to the problem of random assignment to two groups: much easier!


Reporting your results

• Express your result in evocative English, then include the required numbers.
• Follow APA style.
• Example:
  – Young female runners were not found to be significantly faster than the general female contestant population, z = −1.31, p = .095, one-tailed.


More on Type I and Type II Errors

[Diagram: of all experiments that yield significant results, those where H0 is true contribute a proportion α and those where H0 is false contribute a proportion 1 − β.]

• Consistent use of a fixed alpha-level determines the proportion of null experiments that generate significant results.
• We don’t have enough information to know how many reported results are errors, because:
  – We don’t know the relative proportion of cases where H0 is true and H0 is false.
  – We don’t know the power of effective experiments.
  – Typically only significant results are reported (publication bias).


Chapter 9

Nonparametric Statistics



Learning Objectives

1. Distinguish Parametric & Nonparametric Test Procedures
2. Explain Commonly Used Nonparametric Test Procedures
3. Perform Hypothesis Tests Using Nonparametric Procedures


Hypothesis Testing Procedures

[Figure: taxonomy of hypothesis-testing procedures, parametric and nonparametric. Many more tests exist!]
Parametric Test Procedures

1. Involve Population Parameters (e.g., the Mean)
2. Have Stringent Assumptions (e.g., Normality)
3. Examples: Z Test, t Test, χ2 Test, F Test
Nonparametric Test Procedures

1. Do Not Involve Population Parameters (Examples: Probability Distributions, Independence)
2. Data Measured on Any Scale (Ratio or Interval, Ordinal or Nominal)
3. Example: Wilcoxon Rank Sum Test


Advantages of Nonparametric Tests

1. Used With All Scales
2. Easier to Compute
3. Make Fewer Assumptions
4. Need Not Involve Population Parameters
5. Results May Be as Exact as Parametric Procedures


Disadvantages of Nonparametric Tests

1. May Waste Information (a parametric model is more efficient if the data permit)
2. Difficult to Compute by Hand for Large Samples
3. Tables Not Widely Available


Popular Nonparametric Tests
1. Sign Test

2. Wilcoxon Rank Sum Test

3. Wilcoxon Signed Rank Test



Sign Test



Sign Test

1. Tests One Population Median, η
2. Corresponds to t-Test for 1 Mean
3. Assumes Population Is Continuous
4. Small-Sample Test Statistic: # Sample Values Above (or Below) the Median
5. Can Use Normal Approximation If n ≥ 10


Sign Test Concepts

• Make a null hypothesis about the true median
• Let S = number of values greater than the median
• Each sampled item is independent
• If the null hypothesis is true, S should have a binomial distribution with success probability .5


Sign Test Example

• You’re an analyst for Chef-Boy-R-Dee. You’ve asked 7 people to rate a new ravioli on a 5-point scale (1 = terrible, …, 5 = excellent). The ratings are: 2 5 3 4 1 4 5.
• At the .05 level, is there evidence that the median rating is less than 3?


Sign Test Solution

• H0: η = 3
• Ha: η < 3
• α = .05
• Test Statistic: S = 2 (ratings 1 and 2 are less than η = 3: 2, 5, 3, 4, 1, 4, 5)
• P-Value: P(S ≥ 2) = 1 − P(S ≤ 1) = 1 − .0625 = .9375 (Binomial Table, n = 7, p = 0.50)
  – Is observing 2 or more a small-probability event? No.
• Decision: Do Not Reject at α = .05
• Conclusion: There is no evidence that the median rating is less than 3
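
The binomial p-value above can be reproduced in SAS; this sketch is not from the original slides. probbnml(p, n, m) returns P(X ≤ m) for a Binomial(n, p) variable:

data signtest;
   n = 7;
   s = 2;                                  /* values below the hypothesized median */
   p_value = 1 - probbnml(0.5, n, s - 1);  /* P(S >= 2) = 1 - P(S <= 1) = .9375 */
run;
proc print data=signtest; run;

PROC UNIVARIATE with the mu0=3 option also reports a sign test for the median among its tests for location (two-sided in its standard output).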
Wilcoxon Rank Sum Test


Wilcoxon Rank Sum Test

1. Tests Two Independent Population Probability Distributions
2. Corresponds to t-Test for 2 Independent Means
3. Assumptions: Independent, Random Samples; Populations Are Continuous
4. Can Use Normal Approximation If ni ≥ 10
Wilcoxon Rank Sum Test Procedure

1. Assign Ranks, Ri, to the n1 + n2 Sample Observations
   – If Unequal Sample Sizes, Let n1 Refer to the Smaller-Sized Sample
   – Smallest Value = 1
2. Sum the Ranks, Ti, for Each Sample
   – Test Statistic Is TA (Smallest Sample)
   – Null hypothesis: both samples come from the same underlying distribution
   – The distribution of T is not as simple as the binomial, but it can be computed


Wilcoxon Rank Sum Test Example

• You’re a production planner. You want to see if the operating rates for 2 factories are the same. For factory 1, the rates (% of capacity) are 71, 82, 77, 92, 88. For factory 2, the rates are 85, 82, 94 & 97. Do the factory rates have the same probability distributions at the .10 level?


Wilcoxon Rank Sum Test Solution (setup)

• H0: Identical Distributions
• Ha: Shifted Left or Right
• α = .10
• n1 = 4, n2 = 5
• Critical Values: 12 and 28 (reject if the rank sum, Σ ranks, for the smaller sample is ≤ 12 or ≥ 28)
• The test statistic comes from the computation table below.
Wilcoxon Rank Sum Table: Table 12 (Rosner) (Portion), α = .05 two-tailed

[Table not reproduced; for n1 = 4 and n2 = 5 it gives the lower and upper critical values 12 and 28 used below.]


Wilcoxon Rank Sum Test Computation Table

Factory 1            Factory 2
Rate   Rank          Rate   Rank
71     1             85     5
82     3.5           82     3.5
77     2             94     8
92     7             97     9
88     6
Rank Sum: 19.5       Rank Sum: 25.5

(The tied rates of 82 would have received ranks 3 and 4; each is assigned the average rank, 3.5.)


Wilcoxon Rank Sum Test Solution

• H0: Identical Distributions
• Ha: Shifted Left or Right
• α = .10
• n1 = 4, n2 = 5
• Test Statistic: T2 = 5 + 3.5 + 8 + 9 = 25.5 (smallest sample)
• Critical Values: 12 and 28
• Decision: 12 < 25.5 < 28, so Do Not Reject at α = .10
• Conclusion: There is no evidence of unequal distributions
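
The same analysis can be run in SAS with PROC NPAR1WAY (a sketch, not part of the original slides). The WILCOXON option reports the rank sums (19.5 and 25.5) and a normal-approximation p-value; the EXACT statement requests the exact test, which is more appropriate for samples this small:

data factories;
   input factory rate @@;
   datalines;
1 71 1 82 1 77 1 92 1 88
2 85 2 82 2 94 2 97
;
run;

proc npar1way data=factories wilcoxon;
   class factory;
   var rate;
   exact wilcoxon;
run;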
Regression Analysis:

A statistical procedure used to find relationships among a set of variables.

In regression analysis, there is a dependent variable, which is the one you are trying to explain, and one or more independent variables that are related to it.

You can express the relationship as a linear equation, such as:

y = a + bx

• y is the dependent variable
• x is the independent variable
• a is a constant
• b is the slope of the line
• For every increase of 1 in x, y changes by an amount equal to b
• Some relationships are perfectly linear and fit this equation exactly. Your cell phone bill, for instance, may be:

Total Charges = Base Fee + 30¢ × (overage minutes)

If you know the base fee and the number of overage minutes, you can predict the total charges exactly.
Other relationships may not be so exact. Weight, for instance, is to some degree a function of height, but there are variations that height does not explain. On average, you might have an equation like:

Weight = −222 + 5.7 × Height

If you take a sample of actual heights and weights, you might see something like the graph below.

[Figure: scatter plot of Weight (roughly 100 to 220) against Height (roughly 60 to 75) with the fitted regression line.]
The line in the graph shows the average relationship described by the equation. Often, none of the actual observations lie on the line. The difference between the line and any individual observation is the error. The new equation is:

Weight = −222 + 5.7 × Height + e

This equation does not mean that people who are short enough will have a negative weight. The observations that contributed to this analysis were all for heights between 5’ and 6’4”. The model will likely provide a reasonable estimate for anyone in this height range. You cannot, however, extrapolate the results to heights outside of those observed. The regression results are only valid for the range of actual observations.
Regression finds the line that best fits the observations. It
does this by finding the line that results in the lowest sum of
squared errors. Since the line describes the mean of the
effects of the independent variables, by definition, the sum
of the actual errors will be zero. If you add up all of the
values of the dependent variable and you add up all the
values predicted by the model, the sum is the same. That
is, the sum of the negative errors (for points below the line)
will exactly offset the sum of the positive errors (for points
above the line). Summing just the errors wouldn’t be useful
because the sum is always zero. So, instead, regression
uses the sum of the squares of the errors. An Ordinary
Least Squares (OLS) regression finds the line that results
in the lowest sum of squared errors.
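
As a concrete illustration, an OLS fit such as the height-weight line above takes only a few lines in SAS. This is a sketch: the data set name htwt and its variables height and weight are assumptions, since the data are not listed here.

proc reg data=htwt;
   model weight = height;   /* least-squares fit of weight = a + b*height */
run;

The parameter estimates in the output correspond to the constant a and the slope b in the equation above.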
Multiple Regression

What if there are several factors affecting the dependent variable?

As an example, think of the price of a home as a dependent variable. Several factors contribute to the price of a home… among them are square footage, the number of bedrooms, the number of bathrooms, the age of the home, whether or not it has a garage or a swimming pool, if it has both central heat and air conditioning, how many fireplaces it has, and, of course, location.
The Multiple Regression Equation

Each of these factors has a separate relationship with the price of a home. The equation that describes a multiple regression relationship is:

y = a + b1x1 + b2x2 + b3x3 + … + bnxn + e

This equation separates each individual independent variable from the rest, allowing each to have its own coefficient describing its relationship to the dependent variable. If square footage is one of the independent variables, and it has a coefficient of $50, then every additional square foot of space adds $50, on average, to the price of the home.
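
In SAS this is again PROC REG, now with several independent variables on the model statement. A sketch, assuming a data set homes with hypothetical variable names:

proc reg data=homes;
   model price = sqft bedrooms bathrooms age garage pool;   /* one coefficient per factor */
run;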
How Do You Run a
Regression?
In a Multiple Regression Analysis of home prices,
you take data from actual homes that have sold
recently. You include the selling price, as well as
the values for the independent variables (square
footage, number of bedrooms, etc.). The multiple
regression analysis finds the coefficients for each
independent variable so that they make the line
that has the lowest sum of squared errors.
How Good is the Model?
One of the measures of how well the model
explains the data is the R2 value. Differences
between observations that are not explained by
the model remain in the error term. The R2 value
tells you what percent of those differences is
explained by the model. An R2 of .68 means that
68% of the variance in the observed values of the
dependent variable is explained by the model, and
32% of those differences remains unexplained in
the error term.
Sometimes There’s No
Accounting for Taste
Some of the error is random, and no model
will explain it. A prospective homebuyer
might value a basement playroom more than
other people because it reminds her of her
grandmother’s house where she played as a
child. This can’t be observed or measured,
and these types of effects will vary randomly
and unpredictably. Some variance will
always remain in the error term. As long as
it is random, it is of no concern.
“p-values” and Significance Levels

Each independent variable has another number attached to it in the regression results… its “p-value” or significance level.

The p-value is a probability, expressed as a percentage. It tells you how likely it is that a coefficient as large as the one estimated would have emerged by chance when there is no real relationship.

A p-value of .05 means that, if there were no real relationship, there would be only a 5% chance of obtaining an estimate like this one; the smaller the p-value, the stronger the evidence that the relationship is real.

It is generally accepted practice to consider variables with a p-value of less than .1 as significant, though the only basis for this cutoff is convention.
Significance Levels of “F”

There is also a significance level for the model as a whole. This is the “Significance F” value in Excel; some other statistical programs call it by other names. It measures the likelihood that a model fitting this well would have emerged at random, rather than from a real relationship. As with the p-value, the lower the Significance F value, the stronger the evidence that the relationships in the model are real.
Some Things to Watch Out For

• Multicollinearity

• Omitted Variables

• Endogeneity

• Other
Multicollinearity
Multicollinearity occurs when one or more of your independent variables
are related to one another. The coefficient for each independent variable
shows how much an increase of one in its value will change the dependent
variable, holding all other independent variables constant. But what if you
cannot hold them constant? If you have two houses that are exactly the
same, and you add a bedroom to one of them, the value of the house may
go up by, say, $10,000. But you have also added to its square footage.
How much of that $10,000 is a result of the extra bedroom and how much
is a result of the extra square footage? If the variables are very closely
related, and/or if you have only a small number of observations, it can be
difficult to separate these effects. Your regression gives you the
coefficients that best describe your set of data, but the independent
variables may not have a good p-value if multicollinearity is present.
Sometimes it may be appropriate to remove a variable that is related to
others, but it may not always be appropriate. In the home value example,
both the number of bedrooms and the square footage are important on
their own, in addition to whatever combined effects they may have.
Removing them may be worse than leaving them in. This does not
necessarily mean that the model as a whole is hurt, but it may mean that
the model should not be used to draw conclusions about the relationship of
individual independent variables with the dependent variable.
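
One common diagnostic for multicollinearity (a sketch, not mentioned in the original text) is the variance inflation factor, which PROC REG prints when the VIF option is added to the model statement; a rule of thumb treats values above about 10 as a warning sign. The data set and variable names below are the same hypothetical ones used earlier:

proc reg data=homes;
   model price = sqft bedrooms bathrooms age / vif;   /* prints a VIF for each predictor */
run;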
Omitted Variables
If independent variables that have significant relationships with the
dependent variable are left out of the model, the results will not be as
good as if they are included. In the home value example, any real
estate agent will tell you that location is the most important variable of
all. But location is hard to measure. Locations are more or less
desirable based on a number of factors. Some of them, like population
density or crime rate, may be measurable factors that can be included.
Others, like perceived quality of the local schools, may be more difficult.
You must also decide what level of specificity to use. Do you use the
crime rate for the whole city, a quadrant of the city, the zip code, the
street? Is the data even available at the level of specificity you want to
use? These factors can lead to omitted variable bias… variance in the
error term that is not random and that could be explained by an
independent variable that is not in the model. Such bias can distort the
coefficients on the other independent variables, as well as decreasing
the R2 and increasing the Significance F. Sometimes data just isn’t
available, and some variables aren’t measurable. There are methods
for reducing the bias from omitted variables, but it can’t always be
completely corrected.
Endogeneity
Regression measures the effect of changes in the
independent variable on the dependent variable.
Endogeneity occurs when that relationship is either
backwards or circular, meaning that changes in the
dependent variable cause changes in the independent
variable. In the home value example, we had discussed
earlier that the perceived quality of the local schools might
affect home values. But the perceived quality is likely also
related to the actual quality, and the actual quality is at
least partially a result of funding levels. Funding levels are
often related to the property tax base, or the value of local
homes. So… good schools increase home values, but high
home values also improve schools. This circular
relationship, if it is strong, can bias the results of the
regression. There are strategies for reducing the bias if
removing the endogenous variable is not an option.
Others
There are several other types of biases that can
exist in a model for a variety of reasons. As with
the types already described, there are tests to
measure the levels of bias, and there are
strategies that can be used to reduce it.
Eventually, though, one may have to accept a
certain amount of bias in the final model,
especially when there are data limitations. In that
case, the best that can be done is to describe the
problem and the effects it might have when
presenting the model.
The 136 System Model Regression Equation

Local Revenue per Pupil =
  −236 (y-intercept)
  + .0041 × County-area Property per Pupil
  + .0032 × System Unshared Property per Pupil
  + .0202 × County-area Sales per Pupil
  + .0022 × System Unshared Sales per Pupil
  + .0471 × System State-shared Taxes per Pupil
  + 296 × [County-area Commercial, Industrial, Utility and Business Personal Property Assessment ÷ Total Assessment]
  + 327 × [System Commercial, Industrial, Utility and Business Personal Property Assessment ÷ Total Assessment]
  + .0209 × County-area Median Household Income
  − 795 × System Child Poverty Rate
A Step by Step Guide to
Learning SAS

Objective

• Familiarize yourselves with the SAS programming environment and language.
• Learn how to create and manipulate data sets in SAS and how to use existing data sets outside of SAS.
• Learn how to conduct a regression analysis.
• Learn how to create simple plots to illustrate relationships.
LECTURE OUTLINE
• Getting Started with SAS
• Elements of the SAS program
• Basics of SAS programming
• Data Step
• Proc Reg and Proc Plot
• Example
• Tidbits
• Questions/Comments
Getting Started with SAS
1.1 Windows or Batch Mode?
1.1.1 Pros and Cons
1.1.2 Windows
1.1.3 Batch Mode

Reference:

www.cquest.utoronto.ca/stats/sta332s/sas.html

1.1.1 Pros and Cons
Windows:
Pros:
• SAS online help available.
• You can avoid learning any Unix commands.
• Many people like to point and click.
Cons:
• SAS online help is incredibly annoying.
• Possibly very difficult to use outside CQUEST
lab.
• Number of windows can be hard to manage.

1.1.1 cont’d…

Batch Mode:
Pros:
• Easily usable outside CQUEST labs.
• Simpler to use if you are already familiar with Unix.
• Established Unix programs perform most tasks better than SAS’s built-in utilities.
Cons:
• Can’t access SAS’s online help.
• Requires some basic knowledge of Unix.
1.1.2 Windows
• You can get started using either of these
two ways:
1. Click on Programs at the top left of the
screen and select
CQUEST_APPLICATIONS and then sas.
2. In a terminal window type: sas

A bunch of windows will appear –


don’t get scared!
1.1.3 Batch Mode
• First, make sure you have set up your account
so you can use batch mode.
• Second, you need to create a SAS program.
• Then ask SAS to run your program (foo) using
the command:
sas foo or sas foo.sas
Either way, SAS will create files with the same
name as your program with respective
extensions for a log and output file (if there were
no fatal errors).

1.2 SAS Help

• If you are running SAS in a windowed environment then there is online SAS help available.
• How is it helpful?
  You may want more information about a command or some other aspect of SAS than what you remember from today or than what is in this guide.
• How do you access SAS Help?
  1. Click on the Help button in the task bar.
  2. Use the menu command – Online documentation
• There are three tabs: Contents, Index and Find
1.3 SAS Run

• If you are running SAS in a windowed environment then simply click on the Run icon. It’s the icon with a picture of a person running!
• For batch mode, simply type the command: sas filename.sas
Elements of the SAS Software
2.1 SAS Program Editor: Enhanced Editor
2.2 Important SAS Windows: Log and
Output Windows
2.3 Other SAS Windows: Explorer and
Results Windows

2.1 SAS Program Editor
• What is the Enhanced Editor Window?
This is where you write your SAS programs. It will contain
all the commands to run your program correctly.
• What should be in it?
All the essentials to SAS programming such as the
information on your data and the required steps to
conduct your analysis as well as any comments or titles
should be written in this window (for a single problem).
See Section 3-6.
• Where should I store the files?
In your home directory. SAS will read and save files
directly from there.
2.2 Log and Output Windows
• How do you know whether your program is
syntactically correct?
Check the Log window every time you run a
program to check that your program ran
correctly – at least syntactically. It will indicate
errors and also provide you with the run time.
• You ran your program but where’s your output?
There is an output window which uses the
extension .lst to save the file.
If something went seriously wrong – evidence will
appear in either or both of these windows.

2.3 Other SAS Windows
• There are two other windows that SAS executes
when you start it up: Results and Explorer
Windows
• Both of these can be used as data/file
management tools.
• The Results Window helps to manage the
contents of the output window.
• The SAS Explorer is a kind of directory
navigation tool. (Useful for heavy SAS users).

Basics of SAS Programming
3.1 Essentials
3.1.1 A program!
3.1.2 End of a command line/statement
3.1.3 Run Statement
3.2 Extra Essentials
3.2.1 Comments
3.2.2 Title
3.2.3 Options
3.2.4 Case (in)sensitivity

3.1 Essentials
of SAS Programming
3.1.1 Program
• You need a program containing some
SAS statements.
• It should contain one or more of the
following:
1) data step: consists of statements that
create a data set
2) proc step: used to analyze the data

3.1 cont’d…

3.1.2 End of a command line or statement
• Every statement must end with a semi-colon (;), and each statement should start on a new line.
• This is a very common mistake in SAS programming – so check very carefully to see that you have placed a ; at the end of each statement.

3.1.3 Run command or keyword
• In order to run the SAS program, type the command run; at the end of the last data or proc step.
• You still need to click on the running man in order to process the whole program.
3.2 Extra Essentials
of SAS Programming
3.2.1 Comments
• In order to put comments in your SAS
program (which are words used to explain
what the program is doing but not which
SAS is to execute as commands), use /*
to start a comment and */ to end a
comment. For example,
/* My SAS commands go here. */

3.2 cont’d…

3.2.2 Title
• To create a SAS title in your output, simply type the command:
  Title 'Regression Analysis of Crime Data';
• If you have several lines of titles or titles for different steps in your program, you can number the title command. For example,
  Title1 'This is the first title';
  Title2 'This is the second title';
• You can use either single quotes or double quotes. Do not use contractions such as don’t in a single-quoted title, or the apostrophe will be read as the closing quotation mark.
3.2 cont’d…

3.2.3 Options
• There is a statement which allows you to control the line size and page size. You can also control whether you want the page numbers or date to appear. For example,
  options nodate nonumber ls=78 ps=60;

3.2.4 Case (in)sensitivity
• SAS is not case sensitive. So please don’t use the same name – once with capitals and once without – because SAS reads the word as the same variable name or data set name.
4. Data Step
• 4.1 What is it?
• 4.2 What are the ingredients?
• 4.3 What can you do within it?
• 4.4 Some Basic Examples
• 4.5 What can you do with it?
• 4.6 Some More Examples

4.1 What is a Data Step?

• A data step begins by setting up the data set. It is usually the first big step in a SAS program that tells SAS about the data.
• A data statement names the data set. The name can be anything you like as long as it starts with a letter or underscore and contains only letters, numbers or underscores (at most 8 characters in older releases; 32 from SAS V8 on).
• A data step has countless options and variations. Fortunately, almost all your DATA sets will come prepared, so there will be little or no manipulation required.
4.2 Ingredients of a Data Step

4.2.1 Input statement
• INPUT is the keyword that defines the names of the variables. You can use any name for the variables as long as it is a valid SAS name (see the length rule above).
• Variables can be either numeric or character (also called alphanumeric). SAS will assume that variables are numeric unless specified. To assign a variable name to have a character value, use the dollar sign $.

4.2.2 Datalines statement (internal raw data)
• This statement signals the beginning of the lines of data.
• A ; is placed both at the end of the datalines statement and on the line following the last line of data.
• Spacing in data lines does matter.
4.2 cont’d…

4.2.3 Raw Data Files
• The datalines statement is used when referring to internal raw data.
• The infile statement is used when your data comes from an external file. The keyword is placed directly before the input statement. The path and name are enclosed within single quotes. You will also need a filename statement before the data step.
• Here are some examples of infile statements under 1) Windows and 2) UNIX operating environments:
  1) infile 'c:\MyDir\President.dat';
  2) infile '/home/mydir/president.dat';
4.3 What can you do within it?
• A data step not only allows you to create a data
set, but it also allows you to manipulate the data
set.
• For example, you may wish to add two variables
together to get the cumulative effect or you may
wish to create a variable that is the log of
another variable (Meat example) or you may
simply want a subset of the data. This can be
done very easily within a data step.
• More information on this will be provided in a
supplementary documentation to follow.
4.4.1 Basic Example of a Data Step

options ls=79;

data meat;
input steer time pH;
datalines;
1 1 7.02
2 1 6.93
3 2 6.42
4 2 6.51
5 4 6.07
6 4 5.99
7 6 5.59
8 6 5.80
9 8 5.51
10 8 5.36
;
run;
4.4.2 Manipulating the Existing Data

options ls=79;

data meat;
input steer time pH;
logtime=log(time);
datalines;
1 1 7.02
2 1 6.93
3 2 6.42
4 2 6.51
5 4 6.07
6 4 5.99
7 6 5.59
8 6 5.80
9 8 5.51
10 8 5.36
;
run;
4.4.3 Designating a Character Variable

options ls=79;

/*
Data on Violent and Property Crimes in 23 US Metropolitan Areas
violcrim = number of violent crimes
propcrim = number of property crimes
popn = population in 1000's
*/

data crime;
/* city is a character-valued variable so it is followed by
   a dollar sign in the input statement */
input city $ violcrim propcrim popn;
datalines;
AllentownPA 161.1 3162.5 636.7
BakersfieldCA 776.6 7701.3 403.1
;
run;
4.4.4 Data from an External File

options nodate nonumber ls=79 ps=60;

filename datain 'car.dat';

data cars;
infile datain;   /* data are read from the external file, so no datalines are needed */
input mpg;
run;
4.5 What can you do with it?
4.5.1 View the data set
• Suppose that you have done some
manipulation to the original data set. If
you want to see what has been done, use
a proc print statement to view it.

proc print data=meat;


run;
4.5 cont’d…

4.5.2 Create a new from an old data set
• Suppose you already have a data set and now you want to manipulate it but want to keep the old as is. You can use the set statement to do it.

4.5.3 Merge two data sets together
• Suppose you have created two datasets about the sample (subjects) and now you wish to combine the information. You can use a merge statement. There must be a common variable in both data sets to merge (see the sketch below).
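A minimal sketch of both statements (not from the original guide); the second pair of data sets and their variables are made up for illustration:

data meat2;          /* new data set from the old one, via SET */
   set meat;
   logpH = log(pH);  /* add a variable; the original meat data set is unchanged */
run;

data demo;           /* two small data sets sharing the variable id */
   input id age;
   datalines;
1 30
2 45
;
data scores;
   input id score;
   datalines;
1 85
2 90
;
proc sort data=demo; by id; run;
proc sort data=scores; by id; run;

data combined;       /* MERGE requires both inputs sorted by the BY variable */
   merge demo scores;
   by id;
run;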
4.6 Some Comments
• If you don’t want to view all the variables, you
can use the keyword var to specify which
variables the proc print procedure should
display.
• The command by is very useful in the previous
examples and of the procedures to follow. We
will take a look at its use through some
examples.
• Let’s look at the Meat Example again using SAS
to demonstrate the steps explained in 4.5.

5. Regression Analysis
5.1 What is proc reg?
5.2 What are the important ingredients?
5.3 What does it do?
5.4 What else can you do with it?
5.5 The cigarette example
5.6 The Output – regression analysis

5.1 Proc Reg
• What is a proc procedure?
It is a procedure used to do something to the
data – sort it, analyze it, print it, or plot it.
• What is proc reg?
It is a procedure used to conduct regression
analyses. It uses a model statement to
define the theoretical model for the
relationship between the independent and
dependent variables.
5.2 Ingredients of Proc Reg
5.2.1 General Form
proc reg data=somedata <options>;
by variables;
model dependent=independent
<options>;
plot yvar*xvar <options>;
run;

5.2 cont’d…
5.2.2 What you need and don’t need?
• You need to assign 1) the data to be
analyzed, and 2) the theoretical model to
be fit to the data.
• You don’t need the other statements
shown in 5.2.1 such as the by and plot
keywords nor do you need any of the
possible <options>; however, they can
prove useful, depending on the analysis.
5.2 cont’d… options

• There are more options for each keyword and the proc reg statement itself.
• Besides defining the data set to be used in the proc reg statement, you can also use the option simple to provide descriptive statistics for each variable.
• For the model statement, here are some options (see the sketch below):
  p    prints observed, predicted and residual values
  r    prints everything above plus standard errors of the predicted values and residuals, studentized residuals and Cook’s D statistic
  clm  prints 95% confidence intervals for the mean of each observation
  cli  prints 95% prediction intervals
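For instance, using the meat data set created in Section 4 (a sketch, not from the original guide):

proc reg data=meat;
   model pH = time / p clm cli;   /* predicted values plus confidence and prediction intervals */
run;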
5.2 cont’d… more options

• And yes, there are more options….
• Within proc reg you can also plot!
• The plot statement allows you to create a plot that shows the predicted regression line.
• Use the variables in the model statement and some special variables created by SAS such as p. (predicted), r. (residuals), student. (studentized residuals), L95. and U95. (cli model option limits), and L95M. and U95M. (clm model option limits). Note the (.) at the end of each variable name. A sketch follows.
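A sketch of the plot statement inside proc reg, again using the meat data set:

proc reg data=meat;
   model pH = time;
   plot pH*time;   /* data with the fitted regression line */
   plot r.*p.;     /* residuals against predicted values */
run;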
5.3 What does it do?
• Most simply, it analyzes the theoretical
model proposed.
• However, it (SAS) may have done all the
computational work, but it is up to you to
interpret it.
• Let’s look at an example to illustrate these
various options in SAS.

5.4 What else can you do with it?
• Plot it (of course!) using another procedure.
• There are two procedures that can be used: proc plot
and proc gplot.
• These procedures are very similar (in form) but the latter
allows you to do a lot more.
• Here is the general form:
proc gplot data=somedata;
plot yvar*xvar;
run;
• Again, you need to identify a data set and the plot
statement. The plot keyword works similarly to the way
it works in proc reg.

5.4 cont’d… plot options

• Some plot options:
  yvar*xvar=‘char’   obs. plotted using the character specified
  yvar*(xvar1 xvar2)   two plots appear on separate pages
  yvar*(xvar1 xvar2)=‘char1’   two plots appear on separate pages
  yvar*(xvar1 xvar2)=‘char2’   two plots appear on the same plot, distinguished by the character specification
5.5 An Example

• Let’s take a look at a complete example. Consider the cigarette example.
• Suppose you want to (1) find the estimated regression line, (2) plot the estimated regression line, and (3) generate confidence intervals and prediction intervals. A sketch of the corresponding program follows.
• We’ll look at all the key elements needed to create the SAS program in order to perform the analysis, as well as interpreting the output.
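A sketch of such a program; the data set name cigs and the variables tar and co are assumptions, since the cigarette data are not listed in this guide:

proc reg data=cigs;
   model co = tar / clm cli;   /* (1) estimated line, (3) confidence and prediction intervals */
   plot co*tar;                /* (2) data with the fitted line */
run;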
5.6 Output
• Identify all the different components
displayed in the SAS output and determine
what they mean.
• Begin by identifying what the sources of
variation are and their respective degrees
of freedom.
• The last page contains your predicted,
observed and residual values as well as
confidence and prediction intervals.
Analysis – some questions
Now let’s answer the following questions in order
to understand all the output displayed.
• What do the sums of squares tell us? Or What
do they account for?
• How do you determine the mean square(s)?
• How do you determine the F-statistics? What is
it used for? What does the p-value indicate?
• What are the root mean square error, the
dependent mean and the coeff var? What do
they measure?

More questions….
• What is the R-square? What does it measure?
• What are the parameter estimates? What is the
fitted model expression? What does this mean?
• What do the estimated standard errors tells us?
• How do you determine t-statistics? What are
they used for? What does the p-value indicate?

Now you can….

You should now be able to:
• Create a data set using a data step in order to:
  – manipulate a data set (in various ways)
  – use external raw data files
• Use various procedures in order to:
  – find the estimated regression line
  – plot the estimated regression line with data
  – generate confidence intervals and prediction intervals
6. Hints and Tidbits

• For assignments, summarize the output and write the answers to the questions being asked, clearly interpreting the results and indicating where in the output the numbers came from.
• You will need to be able to do this for your tests too – so you might as well practice…..
• Practice with the examples provided in class and the practice problems suggested by Professor Gibbs.
• Before going into the lab to use SAS, read over the questions carefully and determine what needs to be done. Look over examples that have already been presented to you to give you an idea. It will save you lots of time!
• Always check the log file for any errors!
Last Comments & Contact
• I will provide you with a short
supplementary document to help with the
SAS language and simple programming
steps (closer to the assignment time).

Anjali Mazumder
E-mail: mazumder@utstat.toronto.edu
www.utstat.toronto.edu/mazumder
References

1. Delwiche, Lora D. (1996). The Little SAS Book: A Primer (2nd ed.).
2. Elliott, Rebecca J. (2000). Learning SAS in the Computer Lab (2nd ed.).
3. Freund, Rudolf J. and Littell, Ramon C. (2000). SAS System for Regression (3rd ed.).
Introduction to SAS
Why use statistical packages
• Built‐in functions
• Data manipulation
• Updated often to include new applications
• Different packages complete certain tasks 
more easily than others
• Packages we will introduce
– SAS
– R (S‐plus)
SAS
• Easy to input and output data sets
• Preferred for data manipulation
• “proc” used to complete analyses with built‐in 
functions
• Macros used to build your own functions

Outline
• SAS Structure
• Efficient SAS Code for Large Files
• SAS Macro Facility

Common errors
• Missing semicolon
• Misspelling
• Unmatched quotes/comments
• Mixed proc and data statement
• Using wrong options

SAS Structure
• Data Step: input, create, manipulate or output 
data
– Always start with a data line
– Ex. data one;
• Procedure Step: complete an operation on 
data
– Always start with a proc line
– Ex. proc contents;

Statements for Reading Data
• data statement names the data set you are 
making
• Can use any of the following commands to 
input data
– infile Identifies an external raw data file to read 
with an INPUT statement 
– input Lists variable names in the input file
– cards Indicates internal data 
– set Reads a SAS data set

Example

data temp;
infile 'g:\shared\BIO271summer\baby.csv' delimiter=',' dsd;
input id headcir length bwt gestwks mage mnocig mheight mppwt fage fedyrs fnocig fheig;
run;

proc print data = temp (obs=10);
run;
Delimiter Option

• blank space (default)
• DELIMITER= option specifies that the INPUT 
statement use a character other than a blank 
as a delimiter for data values that are read 
with list input 

Delimiter Example

Sometimes you want to input the data yourself. Try the following data step:

data nums;
infile datalines dsd delimiter='&';
input X Y Z;
datalines;
1&2&3
4&5&6
7&8&9
;

Notice that there are no semicolons until the end of the datalines.
DSD option
• Change how SAS treats delimiters when list input is used and 
sets the default delimiter to a comma. When you specify DSD, 
SAS treats two consecutive delimiters as a missing value and 
removes quotation marks from character values. 
• Use the DSD option and list input to read a character value 
that contains a delimiter within a quoted string. The INPUT 
statement treats the delimiter as a valid character and 
removes the quotation marks from the character string before 
the value is stored. Use the tilde (~) format modifier to retain 
the quotation marks.

Example: Reading Delimited Data

SAS data step:

data scores;
infile datalines delimiter=',';
input test1 test2 test3;
datalines;
91,87,95
97,,92
,1,1
;

Output:

Obs test1 test2 test3
1    91    87    95
2    97    92     1

Without the DSD option, consecutive delimiters are treated as one, so SAS skips the missing values and reads ahead into the next data line.
Example: Correction

SAS data step:

data scores;
infile datalines delimiter=',' dsd;
input test1 test2 test3;
datalines;
91,87,95
97,,92
,1,1
;

Output:

Obs test1 test2 test3
1    91    87    95
2    97     .    92
3     .     1     1
Modified List Input
Read data that are separated by commas and that may 
contain commas as part of a character value: 

data scores;
infile datalines dsd;
input Name : $9. Score Team : $25. Div $;
datalines;
Joseph,76,"Red Racers, Washington",AAA
Mitchel,82,"Blue Bunnies, Richmond",AAA
Sue Ellen,74,"Green Gazelles, Atlanta",AA
;
Modified List Input
Output:

Obs Name Score Team Div


1 Joseph 76 Red Racers, Washington AAA
2 Mitchel 82 Blue Bunnies, Richmond AAA
3 Sue Ellen 74 Green Gazelles, Atlanta AA

Dynamic Data Exchange (DDE)
• Dynamic Data Exchange (DDE) is a method of dynamically 
exchanging information between Windows applications. DDE 
uses a client/server relationship to enable a client application 
to request information from a server application. In Version 8, 
the SAS System is always the client. In this role, the SAS 
System requests data from server applications, sends data to 
server applications, or sends commands to server 
applications. 
• You can use DDE with the DATA step, the SAS macro facility, 
SAS/AF applications, or any other portion of the SAS System 
that requests and generates data. DDE has many potential 
uses, one of which is to acquire data from a Windows 
spreadsheet or database application. 

Dynamic Data Exchange (DDE)
• NOTAB  is used only in the context of Dynamic 
Data Exchange (DDE). This option enables you 
to use nontab character delimiters between 
variables. 

DDE Example
FILENAME biostat DDE 'Excel|book1!r1c1:r27c2';
DATA NEW;
INFILE biostat dlm='09'x notab dsd missover;
INFORMAT seqno 10. no 2.;
INPUT seqno no; RUN;

Note:
SAS reads in the first 27 rows and 2 columns of the
spreadsheet named book1 in a open Excel file
through the Dynamic Data Exchange (DDE).

Statements for Outputting Data
• file: Specifies the current output file for PUT 
statements
• put: Writes lines to the SAS log, to the SAS procedure 
output file, or to an external file that is specified in 
the most recent FILE statement.
Example:
data _null_; 
set new;
file 'c:\out.csv' delimiter=',' dsd;
put seqno no ; 
run; 

Comparisons

• The INFILE statement specifies the input file for any INPUT statements in the DATA step. The FILE statement specifies the output file for any PUT statements in the DATA step.
• Both the FILE and INFILE statements allow you to use options that provide SAS with additional information about the external file being used.
• An INFILE statement usually identifies data from an external file. A DATALINES statement indicates that data follow in the job stream. You can use the INFILE statement with the file specification DATALINES to take advantage of certain data-reading options that affect how the INPUT statement reads in-stream data.
Read Dates with Formatted Input

DATA Dates;
INPUT @1 A date11.
      @13 B ddmmyy6.
      @20 C mmddyy10.
      @31 D yymmdd8.;
duration=A-mdy(1,1,1970);   /* days since January 1, 1970 */
FORMAT A B C D mmddyy10.;
cards;
13/APR/1999 130499 04-13-1999 99 04 13
01/JAN/1960 010160 01-01-1960 60 01 01
;
RUN;

Obs  A           B           C           D           duration
1    04/13/1999  04/13/1999  04/13/1999  04/13/1999   10694
2    01/01/1960  01/01/1960  01/01/1960  01/01/1960   -3653
Procedures To Import/Export Data

• IMPORT: reads data from an external data source and writes it to a SAS data set.
• CPORT: writes SAS data sets, SAS catalogs, or SAS data libraries to sequential file formats (transport files).
• CIMPORT: imports a transport file that was created (exported) by the CPORT procedure. It restores the transport file to its original form as a SAS catalog, SAS data set, or SAS data library.
PROC IMPORT
• Syntax: 
PROC IMPORT 
DATAFILE="filename" | TABLE="tablename" 
OUT=SAS‐data‐set 
<DBMS=identifier><REPLACE>; 

PROC IMPORT

Space.txt:
MAKE MPG WEIGHT PRICE
AMC 22 2930 4099
AMC 17 3350 4749
AMC 22 2640 3799
Buick 20 3250 4816
Buick 15 4080 7827

proc import datafile="space.txt" out=mydata
     dbms=dlm replace;
getnames=yes;
datarow=4;   /* begin reading data at row 4 of the file */
run;
Common DBMS Specifications

Identifier  Input Data Source                                Extension
ACCESS      Microsoft Access database                        .MDB
DBF         dBASE file                                       .DBF
EXCEL       Excel file                                       .XLS
DLM         delimited file (default delimiter is a blank)    .*
CSV         comma-separated file                             .CSV
TAB         tab-delimited file                               .TXT
SAS Programming Efficiency
• CPU time
• I/O time
• Memory
• Data storage
• Programming time

Use ELSE statement to reduce CPU time

/* slower: every IF is evaluated for every observation */
IF agegrp=3 THEN DO;...END;
IF agegrp=2 THEN DO;...END;
IF agegrp=1 THEN DO;...END;

/* faster: evaluation stops at the first condition that is true */
IF agegrp=3 THEN DO;...END;
ELSE IF agegrp=2 THEN DO;...END;
ELSE IF agegrp=1 THEN DO;...END;
Subset a SAS Dataset

/* slower: two passes through the data */
DATA div1; SET adults;
IF division=1; RUN;
DATA div2; SET adults;
IF division=2; RUN;

/* faster: one pass, two output data sets */
DATA div1 div2;
SET adults;
IF division=1 THEN OUTPUT div1;
ELSE IF division=2 THEN OUTPUT div2;
RUN;
MODIFY is Better Than SET

/* SET rewrites the entire data set */
DATA salary;
SET salary;
wages=wages*0.1;
RUN;

/* MODIFY updates the data set in place */
DATA salary;
MODIFY salary;
wages=wages*0.1;
RUN;
Save Space by DROP or KEEP
DATA new;
SET old (KEEP=a b c);
RUN;

DATA new;
SET old (DROP=a);
RUN;

Save Space by Deleting Data Sets
DATA three;
MERGE one two;
BY type;
RUN;

PROC DATASETS;
DELETE one two;
RUN;

Save Space by Compress
DATA new (COMPRESS=YES);
SET  old;

PROC SORT DATA=a OUT=b (COMPRESS=YES);

PROC SUMMARY;
VAR score;
OUTPUT OUT=SUM1 (COMPRESS=YES) SUM=;

Read Only What You Need

/* reads the full record for every observation */
DATA large;
INFILE myDATA;
INPUT @15 type $2. @ ;
INPUT @1 X $1. @2 Y $5. ;
RUN;

/* reads the rest of the record only for the types of interest */
DATA large;
INFILE myDATA;
INPUT @15 type $2. @ ;
IF type in ('10','11','12') THEN
INPUT @1 X $1. @2 Y $5.;
RUN;
PROC FORMAT Is Better Than IF-THEN

DATA new;
SET old;
IF 0 LE age LE 10 THEN agegroup=0;
ELSE IF 10 LE age LE 20 THEN agegroup=10;
ELSE IF 20 LE age LE 30 THEN agegroup=20;
ELSE IF 30 LE age LE 40 THEN agegroup=30;
RUN;

PROC FORMAT;
VALUE age 0-9=0 10-19=10 20-29=20 30-39=30;
RUN;

DATA new;
SET old;
agegroup=PUT(age,age.);
RUN;
Shorten Expressions with Functions

/* long way: loop over an array, skipping missing values */
array c{10} cost1-cost10;
tot=0;
do i=1 to 10;
   if c{i} ne . then tot+c{i};
end;

/* shorter: the SUM function skips missing values automatically */
tot=sum(of cost1-cost10);
IF-THEN Better Than AND

IF status1=1 and status2=9 THEN OUTPUT;

IF status1=1 THEN
   IF status2=9 THEN OUTPUT;
Use SAS Functions Whenever Possible

DATA new; SET old;
meanxyz = (x+y+z)/3;   /* missing if any of x, y, z is missing */
RUN;

DATA new; SET old;
meanxyz = mean(x, y, z);   /* mean of the non-missing values */
RUN;
Use RETAIN to Initialize Constants

DATA new; SET old;
a = 5; b = 13;
(programming statements);
RUN;

DATA new; SET old;
retain a 5 b 13;
(programming statements);
RUN;
Efficient Sort

PROC SORT;
BY vara varb varc vard vare;
RUN;

DATA new; SET old;
sortvar=vara||varb||varc||vard||vare;
RUN;
PROC SORT;
BY sortvar;
RUN;
Use Arrays and Macros

Using arrays and macros can save you the time of having to repeatedly type groups of statements.

Example: Convert Missing Values to 0

data one;
input chr $ a b c;
cards;
x 2 . 9
y . 3 .
z 8 . .
;
data two; set one; drop i;
array x(*) _numeric_;
do i=1 to dim(x);
   if x(i) = . then x(i)=0;
end;
run;
When w has many missing values…

DATA new;
SET old;
wyzsum = 26 + y + z + w;
RUN;

/* skip the computation when w is missing */
DATA new;
SET old;
IF w > . THEN wyzsum = 26 + y + z + w;
RUN;
Put Loops With the Fewest Iterations Outermost

/* outer loop runs 100 times */
DATA new;
SET old;
DO i = 1 TO 100;
   DO j = 1 TO 10;
      (programming statements);
   END;
END;
RUN;

/* better: outer loop runs only 10 times */
DATA new;
SET old;
DO i = 1 TO 10;
   DO j = 1 TO 100;
      (programming statements);
   END;
END;
RUN;
IN Better Than OR

IF status=1 OR status=5 THEN newstat="single";
ELSE newstat="not single";

IF status IN (1,5) THEN newstat="single";
ELSE newstat="not single";
SAS Macro 
What can we do with Macro?

• Avoid repetitious SAS code


• Create generalizable and flexible SAS code
• Pass information from one part of a SAS job to
another
• Conditionally execute data steps and PROCs
• Dynamically create code at execution time

SAS Macro Facility
• SAS macro variable
• SAS Macro
• Autocall Macro Facility 
• Stored Compiled Macro Facility

SAS Macro Delimiters
Two delimiters will trigger the macro
processor in a SAS program.

• &macro-name
This refers to a macro variable. The current
value of the variable will replace &macro-name;

• %macro-name
This refers to a macro, which consists of one or
more complete SAS statements, or even whole data
or proc steps.

SAS Macro Variables
• SAS Macro variables can be defined and used 
anywhere in a SAS program, except in data 
lines. They are independent of a SAS dataset. 
• Macro variables contain a single character 
value that remains constant until it is 
explicitly changed.

SAS Macro Variables

%LET: assigns text to a macro variable:
  %LET macrovar = value;
  1. macrovar is the name of a global macro variable;
  2. value is the macro variable value: a character string (without quotation marks) or a macro expression.

%PUT: displays macro variable values as text in the SAS log: %put _all_; %put _user_;

&macrovar: substitutes the value of a macro variable in a program.
SAS Macro Variables 
• SAS-supplied Macro Variables:
%put &SYSDAY; Tuesday
%put &SYSDATE; 30SEP03
%put &SYSTIME; 11:02
%put &SYSVER; 8.2

• %put _all_ shows SAS-supplied


automatic and user-defined macro
variables.

SAS Macro Variables 
Combine Macro Variables with Text 
%LET first = John; 
%LET last = Smith; 
%put &first.&last; (combine)
%put &first. &last; (blank separate)
%put Mr. &first. &last; (prefix)
%put &first. &last. HSPH; (suffix)
output:
JohnSmith
John Smith
Mr. John Smith
John Smith HSPH
Create SAS Macro

• Definition:
%MACRO macro-name (parm1, parm2, …, parmk);
   macro definition (&parm1, &parm2, …, &parmk)
%MEND macro-name;

• Application:
%macro-name(values of parm1, parm2, …, parmk);
SAS Macro Example

Import Excel to SAS Datasets by a Macro

%macro excelsas(in=, out=);
proc import out=work.&out
   datafile="c:\&in"
   dbms=excel2000 replace;
getnames=yes;
run;
%mend excelsas;

%excelsas(in=class1, out=score1)
%excelsas(in=class2, out=score2)

(Since the macro defines keyword parameters, the calls must pass them by name, with no space between % and the macro name.)
SAS System Options

• System options are global instructions that affect the entire SAS session and control the way SAS performs operations. SAS system options differ from SAS data set options and statement options in that once you invoke a system option, it remains in effect for all subsequent data and proc steps in a SAS job, unless you reset it.
• In order to view which options are available and in effect for your SAS session, use proc options:
PROC OPTIONS; RUN;
SAS system options
• NOCAPS Translate quoted strings and titles to upper case?
• CENTER Center SAS output?
• DATE Date printed in title?
• ERRORS=20 Maximum number of observations with error messages
• FIRSTOBS=1 First observation of each data set to be processed
• FMTERR Treat missing format or informat as an error?
• LABEL Allow procedures to use variable labels?
• LINESIZE=96 Line size for printed output
• MISSING=. Character printed to represent numeric missing values
• NUMBER Print page number on each page of SAS output?
• OBS=MAX Number of last observation to be processed
• PAGENO=1 Resets the current page number on the print file
• PAGESIZE=54 Number of lines printed per page of output
• YEARCUTOFF=1900 Cutoff year for DATE7. informat

Log, output and procedure options
• center controls whether SAS procedure output is centered. By default, output is centered. To 
specify not centered, use nocenter.
• date prints the date and time to the log and output window. By default, the date and time is 
printed. To suppress the printing of the date, use nodate.
• label allows SAS procedures to use labels with variables. By default, labels are permitted. To 
suppress the printing of labels, use nolabel.
• notes controls whether notes are printed to the SAS log. By default, notes are printed. To 
suppress the printing of notes, use nonotes.
• number controls whether page numbers are printed. By default, page numbers are printed. 
To suppress the printing of page numbers, use nonumber.
• linesize= specifies the line size (printer line width) for the SAS log and the SAS procedure 
output file used by the data step and procedures.
• pagesize= specifies # of lines that can be printed per page of SAS output.
• missing= specifies the character to be printed for missing numeric values.
• formchar= specifies the list of graphics characters that define table boundaries. 

Example:
OPTIONS NOCENTER NODATE NONOTES LINESIZE=80 MISSING=. ;

SAS data set control options
SAS data set control options specify how SAS data sets 
are input, processed, and output. 
• firstobs= causes SAS to begin reading at a specified observation in a data 
set. The default is firstobs=1.
• obs= specifies the last observation from a data set or the last record from 
a raw data file that SAS is to read. To return to using all observations in a 
data set use obs=all 
• replace specifies whether permanently stored SAS data sets are to be 
replaced. By default, the SAS system will over‐write existing SAS data sets 
if the SAS data set is re‐specified in a data step. To suppress this option, 
use noreplace.

Example:
• OPTIONS OBS=100 NOREPLACE;
56
Error handling options
Error handling options specify how the SAS System 
reports on and recovers from error conditions. 
• errors= controls the maximum number of observations for which 
complete error messages are printed. The default maximum number of 
complete error messages is errors=20
• fmterr controls whether the SAS System generates an error message 
when the system cannot find a format to associate with a variable. SAS 
will generate an ERROR message for every unknown format it 
encounters and will terminate the SAS job without running any following 
data and proc steps. To read a SAS system data set without requiring a 
SAS format library, use nofmterr.

Example:
OPTIONS ERRORS=100 NOFMTERR;
57
Using where statement

The where statement allows us to run procedures on a 
subset of records.

Examples:

PROC PRINT DATA=auto; 
WHERE (rep78 >= 3); 
VAR make rep78; 
RUN;

PROC PRINT DATA=auto; 
WHERE (rep78 <= 2) and (rep78 ^= .) ; 
VAR make price rep78 ; 
RUN; 
58
Missing Values
As a general rule, SAS procedures that perform 
computations handle missing data by omitting 
the missing values. 

59
Summary of how missing values 
are handled in SAS procedures 
• proc means
For each variable, the number of non‐missing values 
are used 
• proc freq
By default, missing values are excluded and 
percentages are based on the number of non‐missing 
values. If you use the missing option on the tables
statement, the percentages are based on the total 
number of observations (non‐missing and missing) 
and the percentage of missing values are reported in 
the table.
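
A minimal sketch of the difference (assuming the auto data set used elsewhere in these notes, with missing values in rep78):

PROC FREQ DATA=auto; 
  TABLES rep78;             /* percentages use non-missing values only */
RUN; 

PROC FREQ DATA=auto; 
  TABLES rep78 / MISSING;   /* percentages use all observations */
RUN; 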

60
Summary of how missing values 
are handled in SAS procedures
• proc corr
By default, correlations are computed based on the 
number of pairs with non‐missing data (pairwise 
deletion of missing data). The nomiss option can be 
used to request that correlations be computed only 
for observations that have non‐missing data for all 
variables on the var statement (listwise deletion of 
missing data). 
• proc reg
If any of the variables on the model or var statement 
are missing, they are excluded from the analysis (i.e., 
listwise deletion of missing data)
61
Summary of how missing values 
are handled in SAS procedures
• proc glm
If you have an analysis with just one variable on the 
left side of the model statement (just one outcome 
or dependent variable), observations are eliminated 
if any of the variables on the model statement are 
missing. Likewise, if you are performing a repeated 
measures ANOVA or a MANOVA, then observations 
are eliminated if any of the variables in the model 
statement are missing. For other situations, see the 
SAS/STAT manual about proc glm. 

62
Missing values in assignment 
statements
• As a general rule, computations involving missing 
values yield missing values.
2 + 2 yields 4
2 + . yields .
• mean(of var1‐varn): average of the non‐
missing values in a list of variables: 
avg = mean(of var1‐var10);
• N(of var1‐varn): number of non‐
missing values in a list of variables: 
n = N(var1, var2, var3);
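
A small sketch combining the two functions in one data step (data set and variable names hypothetical):

DATA scores2; 
  SET scores; 
  avg  = MEAN(of var1-var10);   /* average of the non-missing values */
  nobs = N(of var1-var10);      /* how many of var1-var10 are non-missing */
RUN; 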

63
Missing values in logical 
statements
• SAS treats a missing value as the smallest possible 
value (e.g., negative infinity) in logical statements.
DATA times6; 
SET times ; 
if (var1 <= 1.5) then varc1 = 0; else varc1 = 1 ; 
RUN ; 

Output:
Obs id var1 varc1
1 1 1.5 0
2 2 . 0
3 3 2.1 1
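
Because missing sorts below every number, the missing var1 above lands in the varc1 = 0 group. A guarded version of the same step (a sketch) keeps missing as missing:

DATA times7; 
  SET times; 
  IF var1 = . THEN varc1 = .;            /* propagate the missing value */
  ELSE IF var1 <= 1.5 THEN varc1 = 0;    /* only real values reach here */
  ELSE varc1 = 1; 
RUN; 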

64
Subsetting Data
Subsetting variables using keep or drop statements

Example:
DATA auto2; 
SET auto; 
KEEP make mpg price; 
RUN;

DATA auto3; 
SET auto; 
DROP rep78 hdroom trunk weight length turn displ gratio foreign; 
RUN;

65
Subsetting Data
Subsetting observations using if statements

Example:
DATA auto4; 
SET auto; 
IF rep78 ^= . ;
RUN;

DATA auto5; 
SET auto; 
IF rep78 > 3 THEN DELETE ; 
RUN;

66
Labeling variables
Variable label: Use the label statement in the data step to assign 
labels to the variables. You could also assign labels to 
variables in proc steps, but then the labels only exist for that 
step. When labels are assigned in the data step they are 
available for all procedures that use that data set.

Example:
DATA auto2; 
SET auto; 
LABEL rep78 ="1978 Repair Record" mpg ="Miles Per Gallon" foreign="Where Car 
Was Made"; 
RUN; 
PROC CONTENTS DATA=auto2; 
RUN;

67
Labeling variable values
Labeling values is a two step process. First, you must create the 
label formats with proc format using a value statement. Next, 
you attach the label format to the variable with a format
statement. This format statement can be used in either proc
or data steps.
Example:
*first create the label formats forgnf and makef;
PROC FORMAT; 
VALUE forgnf 0="domestic" 1="foreign" ; 
VALUE $makef "AMC" ="American Motors" "Buick" ="Buick (GM)" "Cad." ="Cadallac (GM)" 
"Chev." ="Cheverolet (GM)" "Datsun" ="Datsun (Nissan)"; 
RUN;
*now we link them to the variables foreign and make;
PROC FREQ DATA=auto2; 
FORMAT foreign forgnf. make $makef.; 
TABLES foreign make; RUN;
68
Sort data
Use proc sort to sort this data file. 
Examples:
PROC SORT DATA=auto ; BY foreign ; RUN ; 

PROC SORT DATA=auto OUT=auto2 ; 
BY foreign ; RUN ; 

PROC SORT DATA=auto OUT=auto3; 
BY descending foreign ; RUN ; 

PROC SORT DATA=auto OUT=auto2 noduplicates;
BY foreign ; RUN ; 
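
noduplicates removes observations that are identical on every variable; the related nodupkey option instead keeps only the first observation for each value of the BY variable. A sketch:

PROC SORT DATA=auto OUT=auto6 NODUPKEY; 
  BY foreign;    /* keeps one observation per value of foreign */
RUN; 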

69
Making and using permanent SAS 
data files
• Use a libname statement.
libname diss 'c:\dissertation\'; 
data diss.salary; 
input sal1996‐sal2000 ; 
cards;  
14000 16500 18000 22000 29000 

run;
• specify the name of the data file by directly specifying the 
path name of the file 
data 'c:\dissertation\salarylong'; 
input Salary1996‐Salary2000 ; 
cards; 
14000 16500 18000 22000 29000 

run;
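
To read a permanent data set back in a later session, point a libname at the same folder again (in recent SAS versions, diss.salary is stored on disk as salary.sas7bdat):

libname diss 'c:\dissertation\'; 
proc print data=diss.salary; 
run; 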

70
Merge data files
One‐to‐one merge: there are three steps to match 
merge two data files dads and faminc on the same 
variable famid.
1. Use proc sort to sort dads on famid and save that 
file (we will call it dads2) 
PROC SORT DATA=dads OUT=dads2;   BY famid; RUN; 

2. Use proc sort to sort faminc on famid and save that 
file (we will call it faminc2) 
PROC SORT DATA=faminc OUT=faminc2;   BY famid; RUN; 

3. Merge the dads2 and faminc2 files based on famid
DATA dadfam ;   MERGE dads2 faminc2;   BY famid; RUN; 

71
Merge data files
One‐to‐many merge: there are three steps to match 
merge two data files dads and kids on the same 
variable famid.
1. Use proc sort to sort dads on famid and save that 
file (we will call it dads2) 
PROC SORT DATA=dads OUT=dads2;   BY famid; RUN; 

2. Use proc sort to sort kids on famid and save that 
file (we will call it kids2) 
PROC SORT DATA=kids OUT=kids2;   BY famid; RUN; 

3. Merge the dads2 and kids2 files based on famid
DATA dadkid;   MERGE dads2 kids2;   BY famid; RUN; 

72
Merge data files: mismatch
• Mismatching records in one‐to‐one merge: use the in
option to create a 0/1 variable
DATA merge121; 
MERGE dads(IN=fromdadx) faminc(IN=fromfamx); 
BY famid; 
fromdad = fromdadx; 
fromfam = fromfamx; 
RUN;

• Variables with the same name, but different 
information: rename variables
DATA merge121; 
MERGE faminc(RENAME=(inc96=faminc96 inc97=faminc97 inc98=faminc98)) 
dads(RENAME=(inc98=dadinc98)); 
BY famid; 
RUN; 
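
The in= flags can also drive subsetting. A sketch that keeps only the famid values present in both files:

DATA both; 
  MERGE dads(IN=fromdad) faminc(IN=fromfam); 
  BY famid; 
  IF fromdad AND fromfam;    /* keep matched records only */
RUN; 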

73
Concatenating data files in SAS
• Use set to stack data files
DATA dadmom;   SET dads moms; RUN;
• Use rename to stack two data files with different variable 
names for the same thing
DATA momdad; 
SET dads(RENAME=(dadinc=inc)) moms(RENAME=(mominc=inc));
RUN; 
• Two data files with different lengths for variables of the same 
name
DATA momdad; 
LENGTH name $ 4; 
SET dads moms; 
RUN; 

74
Concatenating data files in SAS
• The two data files have variables with the same 
name but different codes
dads:                          moms:
famid name inc   fulltime      famid name inc   fulltime
1     Bill 30000 1             1     Bess 15000 N
2     Art  22000 0             2     Amy  18000 N
3     Paul 25000 3             3     Pat  50000 Y
DATA dads; SET dads; full=fulltime; DROP fulltime;RUN;

DATA moms; SET moms; 
IF fulltime="Y" THEN full=1; IF fulltime="N" THEN full=0; 
DROP fulltime;RUN;

DATA momdad; SET dads moms;RUN;

75
SAS Macro 
What can we do with Macro?

• Avoid repetitious SAS code


• Create generalizable and flexible SAS code
• Pass information from one part of a SAS job to
another
• Conditionally execute data steps and PROCs (see the sketch after this list)
• Dynamically create code at execution time
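
A minimal sketch of conditional execution with %IF (macro name and parameter are hypothetical):

%macro summarize(stats=no);
  proc print data=auto; run;
  %if &stats = yes %then %do;    /* add a proc step only when requested */
    proc means data=auto; run;
  %end;
%mend summarize;

%summarize(stats=yes)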

76
SAS Macro Facility
• SAS macro variable
• SAS Macro
• Autocall Macro Facility 
• Stored Compiled Macro Facility

77
SAS Macro Delimiters
Two delimiters will trigger the macro
processor in a SAS program.

• &macro-name
This refers to a macro variable. The current
value of the variable will replace &macro-name;

• %macro-name
This refers to a macro, which consists of one or
more complete SAS statements, or even whole data
or proc steps.

78
SAS Macro Variables
• SAS Macro variables can be defined and used 
anywhere in a SAS program, except in data 
lines. They are independent of a SAS dataset. 
• Macro variables contain a single character‐
string value that remains constant until it is 
explicitly changed.
• The macro facility is controlled by the macro
system option (on by default): 
options macro;

79
SAS Macro Variables 
%LET: assign text to a macro variable:
%LET macrovar = value;
1. Macrovar is the name of a global macro variable;
2. Value is the macro variable value: a character string
(no quotation marks needed) or a macro expression.

%PUT: display macro variable values as text in
the SAS log: %put _all_; %put _user_;

&macrovar: substitute the value of a macro
variable in a program.
80
SAS Macro Variables 
• SAS-supplied Macro Variables:
%put &SYSDAY; Tuesday
%put &SYSDATE; 30SEP03
%put &SYSTIME; 11:02
%put &SYSVER; 8.2

• %put _all_ shows SAS-supplied
automatic and user-defined macro
variables.

81
SAS Macro Variables 
Combine Macro Variables with Text 
%LET first = John; 
%LET last = Smith; 
%put &first.&last; (combine)
%put &first. &last; (blank separate)
%put Mr. &first. &last; (prefix)
%put &first. &last. HSPH; (suffix)
output:
JohnSmith
John Smith
Mr. John Smith
John Smith HSPH
82
Create SAS Macro 
• Definition:
%MACRO macro‐name (parm1, parm2,…parmk);
Macro definition (&parm1,&parm2,…&parmk)
%MEND macro‐name;

• Application:
%macro‐name(values of parm1, parm2,…,parmk);

83
SAS Macro Example
Import Excel to SAS Datasets by a Macro
%macro excelsas(in,out);
proc import out=work.&out
datafile="c:\&in"
dbms=excel2000 replace;
getnames=yes; run;
%mend excelsas;

%excelsas(class1, score1)
%excelsas(class2, score2)
84
SAS Macro Example
Use proc means by a Macro
%macro auto(var1, var2); 
proc sort data=auto;
by &var2;
run;

proc means data=auto;
var  &var1;
by &var2;
run;
%mend auto;

%auto(price, rep78) ;
%auto(price, foreign); 
85
Inclass practice
Use the auto data to do the following
• check missing values for each variable
• create a new variable model (first part of 
make)
• get means/frequencies for each variable by 
model
• create 5 data files with 1‐5 repairs using macro

86
SCATTER DIAGRAM
A scatter diagram is a tool for analyzing relationships between two variables. One
variable is plotted on the horizontal axis and the other is plotted on the vertical axis. The
pattern of their intersecting points can graphically show relationship patterns. Most often
a scatter diagram is used to prove or disprove cause-and-effect relationships. While the
diagram shows relationships, it does not by itself prove that one variable causes the
other. In addition to showing possible cause-and-effect relationships, a scatter diagram
can show that two variables are from a common cause that is unknown or that one
variable can be used as a surrogate for the other.

When to use it:


Use a scatter diagram to examine theories about cause-and-effect relationships and to
search for root causes of an identified problem. Use a scatter diagram to design a
control system to ensure that gains from quality improvement efforts are maintained.

How to use it:


Collect data. Gather 50 to 100 paired samples of data that show a possible
relationship.
Draw the diagram. Draw roughly equal horizontal and vertical axes of the diagram,
creating a square plotting area. Label the axes in convenient multiples (1, 2, 5, etc.)
increasing on the horizontal axis from left to right and on the vertical axis from bottom
to top. Label both axes.

Plot the paired data. Plot the data on the chart, using concentric circles to indicate
repeated data points.

Title and label the diagram.


Interpret the data.
Scatter diagrams will generally show one of six possible correlations between the
variables:
Strong Positive Correlation
The value of Y clearly increases as the value of X increases.

Strong Negative Correlation


The value of Y clearly decreases as the value of X increases.

Weak Positive Correlation


The value of Y increases slightly as the value of X increases.

Weak Negative Correlation


The value of Y decreases slightly as the value of X increases.

Complex Correlation
The value of Y seems to be related to the value of X, but the relationship is not easily
determined.

No Correlation
There is no demonstrated connection between the two variables.

Scatter Diagram Example

[Figure: six example scatter diagrams, one for each pattern above -
strong positive, strong negative, weak positive, weak negative,
complex, and no correlation]
Inferential statistics

We move from descriptive to inferential
statistics. No longer are we simply comparing
data sets; we are now seeking 'cause and effect'
relationships.

Note that a 'statistical relationship' can occur
where no 'meaningful relationship' is possible.
Such a relationship is spurious.

So any positive statistical result must always be
backed up by sound reasoning.
Scatter graphs

It is usually wise to draw a scatter graph if
undertaking any correlation. It is an easy way to
highlight any relationship that may exist and its
type, whether direct or inverse.

[Figure: three scatter graphs (y against x, both axes 0-40) illustrating
a direct relationship, an inverse relationship, and no relationship]
GNP and adult literacy

Let us test whether there is a relationship between
GNP per capita and educational provision.

GNP per capita % adult literacy


Nepal 210 39.7
Sudan 290 55.5
Gambia 340 34.5
Peru 2460 89
Turkey 3160 81.4
Brazil 4570 84
Argentina 8970 97
Israel 15940 96
U.A.E. 18220 74.3
Netherlands 24760 100
GNP and adult literacy

First, construct a null hypothesis (Ho) – that there is no
relationship between GNP per capita and % adult
literacy.
GNP per capita % adult literacy
Nepal 210 39.7
Sudan 290 55.5
Gambia 340 34.5
Peru 2460 89
Turkey 3160 81.4
Brazil 4570 84
Argentina 8970 97
Israel 15940 96
U.A.E. 18220 74.3
Netherlands 24760 100
GNP and adult literacy
Spearman’s Rank correlation coefficient (Rs) is the best
method to use, as the GNP data is skewed.

GNP per capita   % adult literacy
Nepal              210    39.7
Sudan              290    55.5
Gambia             340    34.5
Peru              2460    89
Turkey            3160    81.4
Brazil            4570    84
Argentina         8970    97
Israel           15940    96
U.A.E.           18220    74.3
Netherlands      24760   100

Remember, Spearman's Rank can only be used
with ordinal (ranked) data. It is necessary,
therefore, to rank-order the data first.
Setting out Spearman’s Rank . . .
Set the data out as shown, with columns for 'ranking' the
variables, and 'n =' and 'Σd²' at the base:
GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 39.7
Sudan 290 55.5
Gambia 340 34.5
Peru 2460 89
Turkey 3160 81.4
Brazil 4570 84
Argentina 8970 97
Israel 15940 96
U.A.E. 18220 74.3
Netherlands 24760 100
n=
Ranking the X data . . .
The GNP (X) is already rank-ordered ; all it needs is to
enter the ranking, in this case from lowest to highest.
GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 1 39.7
Sudan 290 2 55.5
Gambia 340 3 34.5
Peru 2460 89
Turkey 3160 81.4
Brazil 4570 84
Argentina 8970 97
Israel 15940 96
U.A.E. 18220 74.3
Netherlands 24760 100
n=
Ranking the Y data . . .
Ranking is now complete for GNP. Do the same for %
adult literacy (again start with the lowest value. . .)
GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 1 39.7 2
Sudan 290 2 55.5 3
Gambia 340 3 34.5 1
Peru 2460 4 89
Turkey 3160 5 81.4
Brazil 4570 6 84
Argentina 8970 7 97
Israel 15940 8 96
U.A.E. 18220 9 74.3
Netherlands 24760 10 100
n=
Rank ordering completed . . .
Both variables X and Y are now ranked. It is time to find
the difference of (Rank X - Rank Y) or ‘d’.
GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 1 39.7 2
Sudan 290 2 55.5 3
Gambia 340 3 34.5 1
Peru 2460 4 89 7
Turkey 3160 5 81.4 5
Brazil 4570 6 84 6
Argentina 8970 7 97 9
Israel 15940 8 96 8
U.A.E. 18220 9 74.3 4
Netherlands 24760 10 100 10
n=
Squaring ‘d’ . . .
To get rid of the minuses, square (d) . . .

GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 1 39.7 2 -1
Sudan 290 2 55.5 3 -1
Gambia 340 3 34.5 1 2
Peru 2460 4 89 7 -3
Turkey 3160 5 81.4 5 0
Brazil 4570 6 84 6 0
Argentina 8970 7 97 9 -2
Israel 15940 8 96 8 0
U.A.E. 18220 9 74.3 4 5
Netherlands 24760 10 100 10 0
n = 10
Summing ‘d^2’ . . .

Now sum ‘d^2’ . . . . . . which gives the answer 44.

GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 1 39.7 2 -1 1
Sudan 290 2 55.5 3 -1 1
Gambia 340 3 34.5 1 2 4
Peru 2460 4 89 7 -3 9
Turkey 3160 5 81.4 5 0 0
Brazil 4570 6 84 6 0 0
Argentina 8970 7 97 9 -2 4
Israel 15940 8 96 8 0 0
U.A.E. 18220 9 74.3 4 5 25
Netherlands 24760 10 100 10 0 0
n = 10 44
Getting ‘n’ . . .

There are ten countries – so ‘n’ = 10

GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 1 39.7 2 -1 1
Sudan 290 2 55.5 3 -1 1
Gambia 340 3 34.5 1 2 4
Peru 2460 4 89 7 -3 9
Turkey 3160 5 81.4 5 0 0
Brazil 4570 6 84 6 0 0
Argentina 8970 7 97 9 -2 4
Israel 15940 8 96 8 0 0
U.A.E. 18220 9 74.3 4 5 25
Netherlands 24760 10 100 10 0 0
n = 10 44
Calculating Spearman’s Rank . . .

Insert the figures into the equation . . .

GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 1 39.7 2 -1 1
Sudan 290 2 55.5 3 -1 1
Gambia 340 3 34.5 1 2 4
Peru 2460 4 89 7 -3 9
Turkey 3160 5 81.4 5 0 0
Brazil 4570 6 84 6 0 0
Argentina 8970 7 97 9 -2 4
Israel 15940 8 96 8 0 0
U.A.E. 18220 9 74.3 4 5 25
Netherlands 24760 10 100 10 0 0
n = 10 44

Rs = 1 - ( 6 x 44 )
      (1000 – 10)
The answer to Spearman’s . . .

. . . and, hey presto, the answer to Rs is 0.733.

GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 1 39.7 2 -1 1
Sudan 290 2 55.5 3 -1 1
Gambia 340 3 34.5 1 2 4
Peru 2460 4 89 7 -3 9
Turkey 3160 5 81.4 5 0 0
Brazil 4570 6 84 6 0 0
Argentina 8970 7 97 9 -2 4
Israel 15940 8 96 8 0 0
U.A.E. 18220 9 74.3 4 5 25
Netherlands 24760 10 100 10 0 0
n = 10 44

Rs = 1 – 264/990 = 1 – 0.267 = 0.733
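
In general form this is the standard Spearman's Rank formula, with n³ − n = n(n² − 1):

$$ r_s = 1 - \frac{6\sum d^2}{n(n^2-1)} = 1 - \frac{6 \times 44}{10(10^2-1)} = 1 - \frac{264}{990} \approx 0.733 $$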
Your (Rs) findings . . .
Null hypothesis (Ho) was that there was no relationship
between GNP per capita and % adult literacy.

The degrees of freedom are (n – 1). So 'df' = 9.

Spearman's Rank correlation coefficient (Rs) result of
0.733 exceeds the 95% probability value of 0.60 at 9
degrees of freedom.

Therefore the Ho must be rejected and replaced by the
alternative hypothesis (H1) – that there is a relationship
between GNP per capita and % adult literacy.
SPSS for Beginners

1
What is in this workshop
• SPSS interface: data view and variable view
• How to enter data in SPSS
• How to import external data into SPSS
• How to clean and edit data
• How to transform variables
• How to sort and select cases
• How to get descriptive statistics 

2
Data used in the workshop
• We use 2009 Youth Risk Behavior Surveillance 
System (YRBSS, CDC) as an example.
– YRBSS monitors priority health‐risk behaviors and 
the prevalence of obesity and asthma among 
youth and young adults.  
– The target population is high school students
– Multiple health behaviors include drinking, 
smoking, exercise, eating habits, etc. 

3
SPSS interface
• Data view
– The place to enter data
– Columns: variables
– Rows: records
• Variable view
– The place to enter variables
– List of all variables
– Characteristics of all variables

4
Before the data entry
• You need a code book/scoring guide
• You give an ID number to each case (NOT the real 
identification numbers of your subjects) if you 
use a paper survey.
• If you use an online survey, you need something 
to identify your cases.
• You also can use Excel to do data entry. 

5
Example of a code book

A code book is about how you code your 
variables. What is in a code book? 
1. Variable names
2. Values for each response option
3. How to recode variables

6
Enter data in SPSS 19.0

Columns: 
variables

Rows: cases

Under Data 
View

7
Enter variables

1. Click Variable View
2. Type the variable name under the 
Name column (e.g. Q01). 
NOTE: Variable name can be 64 
bytes long, and the first 
character must be a letter or 
one of the characters @, #, or $.
3. Type: Numeric, string, etc.
4. Label: description of variables. 

8
Enter variables

Based on your 
code book!

9
Enter cases

1. Two variables in the data set.
2. They are: Code and Q01.
3. Code is an ID variable, used to identify individual case 
(NOT people’s real IDs).  
4. Q01 is about participants’ ages: 1 = 12 years or younger, 
2 = 13 years, 3 = 14 years…

Under Data 
View

10
Import data from Excel
• Select File            Open            Data
• Choose Excel as file type
• Select the file you want to import
• Then click Open 

11
Open Excel files in SPSS

12
Import data from CSV file
• CSV is a comma‐separated values file.
• If you use Qualtrics to collect data (online 
survey), you will get a CSV data file. 
• Select File             Open             Data
• Choose All files as file type
• Select the file you want to import
• Then click Open

13
[Slides 14-19: screenshots of the import steps]
Continue

Save this file as SPSS data 

20
Clean data after importing data files
• Key in values and labels for each variable
• Run frequencies for each variable
• Check outputs to see if you have variables 
with wrong values.
• Check missing values against the physical surveys if 
you use paper surveys, and make sure they 
are really missing. 
• Sometimes, you need to recode string 
variables into numeric variables
21
Continue

Wrong 
entries

22
Variable transformation
• Recode variables
1. Select Transform           Recode 
into Different Variables 
2. Select variable that you want to 
transform (e.g. Q20): we want
1= Yes and 0 = No
3. Click Arrow button to put your 
variable into the right window
4. Under Output Variable: type 
name for new variable and 
label, then click Change
5. Click Old and New Values

23
Continue
6. Type 1 under Old Value
and 1 under New Value, 
click Add. Then type 2
under Old Value, and 0
under New Value, click 
Add.
7. Click Continue after 
finishing all the changes. 
8. Click OK
24
Variable transformation
• Compute variable (use YRBSS 2009 data)
• Example 1. Create a new variable: drug_use (During the past 30 
days, any use of cigarettes, alcohol, and marijuana is defined as 
use, else as non‐use). There are two categories for the new 
variable (use vs. non‐use). Coding: 1 = Use and 0 = Non‐use
1. Use Q30, Q41, and Q47 from the 2009 YRBSS survey
2. Non‐users means those who answered 0 days/times to all three 
questions.
3. Go to Transform          Compute Variable

25
Continue
4. Type “drug_use” under 
Target Variable
5. Type “0” under Numeric 
Expression. 0 means 
Non‐use
6.  Click If button. 

26
Continue
7. With the help of the 
Arrow button, type 
Q30 = 1 & Q41 = 1 & Q47 = 1
(& means AND; | means OR),
then click Continue
8. Do the same thing for 
Use, but the numeric
expression is different:
Q30 > 1 | Q41 > 1 | Q47 > 1

27
Continue
9. Click OK
10. After click OK,
a small window asks
if you want to
change existing
variable because
drug_use was already
created when you
first define non‐use.
11. Click ok.   

28
Continue
• Compute variables
• Example 2. Create a new variable drug_N that 
assesses the total number of drugs that adolescents 
used during the last 30 days.
1. Use Q30 (cigarettes), 41 (alcohol), 47 
(marijuana), and 50 (cocaine). The number of 
drugs used should be between 0 and 4.
2. First, recode all four variables into two 
categories: 0 = non‐use (0 days), 1 = use (at least 
1 day/time) 
3. The four variables have 6 or 7 categories
29
Continue
4. Recode four  variables: 1 (old) = 0 (new), 2‐6/7 (old) = 1 (New). 
5. Then select Transform           Compute  Variable

30
Continue
6. Type drug_N under Target Variable
7. Numeric Expression: SUM (Q30r,Q41r,Q47r,Q50r)
8. Click OK

31
Continue
• Compute variables
– Example 3: Convert string variable into numeric 
variable 
1. Enter 1 at Numeric 
Expression.
2. Click If button and type 
Q2 = ‘Female’
3. Then click Ok.
4. Enter 2 at Numeric 
Expression.
5. Click If button and type 
Q2 = ‘Male’
6. Then click Ok

32
Sort and select cases
• Sort cases by variables: Data           Sort Cases 
• You can use Sort Cases to find missing. 

33
Sort and select cases
• Select cases
– Example 1. Select Females for analysis.
1. Go to Data           Select Cases
2. Under Select: check the second one
3. Click If button

34
Continue
4. Q2 (gender) = 1,
1 means Female
5. Click Continue
6. Click Ok
Unselected 
cases : 
Q2 = 2

35
Sort and select cases
7. You will see a new variable: filter_$ (Variable 
view)

36
Sort and select cases
• Select cases
– Example 2. Select cases who used any of cigarettes, alcohol, and marijuana 
during the last 30 days. 
1. Data            Select Cases
2. Click If button
3. Type Q30  > 1 | Q41 > 1 | Q47 > 1, click Continue

37
Basic statistical analysis
• Descriptive statistics
– Purposes: 
1. Find wrong entries
2. Have basic knowledge about the sample and 
targeted variables in a study
3. Summarize data

Analyze         Descriptive statistics        Frequency

38
Continue

39
Frequency table

40
1. Skewness: a measure of the 
asymmetry of a distribution.
The normal distribution is
symmetric and has a skewness
value of zero. 
Positive skewness: a long right tail. 
Negative skewness: a long left tail. 
Departure from symmetry: a
skewness value more than twice 
its standard error.
2. Kurtosis: a measure of the extent
to which observations cluster around 
a central point. For a normal 
distribution, the value of the kurtosis 
statistic is zero. Leptokurtic data 
values are more peaked, whereas 
platykurtic data values are flatter and 
more dispersed along the X axis. 
[Figure: normal curve]
41
42
• About the four windows in SPSS

• The basics of managing data files

• The basic analysis in SPSS

• Originally it is an acronym of Statistical
Package for the Social Sciences, but now it
stands for Statistical Product and Service
Solutions

• One of the most popular statistical
packages, which can perform highly
complex data manipulation and analysis
with simple instructions
• Data Editor
Spreadsheet-like system for defining, entering, editing,
and displaying data. Extension of the saved file will be
".sav"
• Output Viewer
Displays output and errors. Extension of the saved file will
be ".spv"
• Syntax Editor
Text editor for syntax composition. Extension of the
saved file will be ".sps"
• Script Window
Provides the opportunity to write full-blown programs
in a BASIC-like language. Extension of the saved file will
be ".sbs"
• Start → All Programs → SPSS Inc → SPSS 16.0 →
SPSS 16.0
• The default window will have the data editor
• There are two sheets in the window:
1. Data view  2. Variable view
• The Data View window
This sheet is visible when you first open the Data Editor
and this sheet contains the data
• Click on the tab labeled Variable View
• This sheet contains information about the data set that is stored
with the dataset
• Name
◦ The first character of the variable name must be alphabetic
◦ Variable names must be unique, and have to be less than 64
characters.
◦ Spaces are NOT allowed.
• Type
◦ Click on the 'type' box. The two basic types of variables
that you will use are numeric and string. This column
enables you to specify the type of variable.
• Width
◦ Width allows you to determine the number of
characters SPSS will allow to be entered for the
variable
• Decimals
◦ Number of decimals
◦ It has to be less than or equal to 16
• Label
◦ You can specify the details of the variable
◦ You can write characters with spaces up to 256
characters
• Values
◦ This is used to suggest which numbers
represent which categories when the
variable represents a category
• Click the cell in the values column as shown below
• For the value and the label, you can put up to 60
characters.
• After defining the values, click Add and then click OK.
• How would you put the following information into SPSS?

Value = 1 represents Male and Value = 2 represents Female
• To save the data file you created, simply click 'file' and
click 'save as.' You can save the file in different forms
by clicking "Save as type."
• Click 'Data' and then click Sort Cases
• Double click 'Name of the students.' Then click
OK.
• How would you sort the data by the
'Height' of students in descending order?
• Answer
◦ Click data, sort cases, double click 'height of
students,' click 'descending,' and finally click OK.
• Click 'Transform' and then click 'Compute Variable…'
• Example: Adding a new variable named 'lnheight' which is
the natural log of height
◦ Type lnheight in the 'Target Variable' box. Then type
'ln(height)' in the 'Numeric Expression' box. Click OK.
• A new variable 'lnheight' is added to the table
• Create a new variable named "sqrtheight"
which is the square root of height.
• Answer: type sqrtheight as the Target Variable and
'sqrt(height)' as the Numeric Expression.
• Frequencies
◦ This analysis produces frequency tables showing
frequency counts and percentages of the values of
individual variables.
• Descriptives
◦ This analysis shows the maximum, minimum,
mean, and standard deviation of the variables
• Linear regression analysis
◦ Linear Regression estimates the coefficients of the
linear equation
• Open 'Employee data.sav' from the SPSS samples
◦ Go to "File," "Open," and click Data
◦ Go to "Program Files," "SPSSInc," "SPSS16," and
"Samples" folder.
◦ Open "Employee Data.sav" file
• Click 'Analyze,' 'Descriptive statistics,' then
click 'Frequencies'
• Click gender and put it into the variable box.
• Click 'Charts.'
• Then click 'Bar charts' and click 'Continue.'
• Finally click OK in the Frequencies box.
• Click 'Analyze,' 'Descriptive statistics,' then
click 'Frequencies.'
• Put 'Gender' in the Variable(s) box.
• Then click 'Charts,' 'Bar charts,' and click
'Continue.'
• Click 'Paste.'
• Highlight the commands in the Syntax editor
and then click the run icon.
• You can do the same thing by right-clicking the
highlighted area and then clicking 'Run
Current'
• Do a frequency analysis on the
variable "minority"

• Create pie charts for it

• Do the same analysis using the
syntax editor
• Click 'Analyze,' 'Descriptive statistics,' then
click 'Descriptives…'
• Click 'Educational level' and 'Beginning
Salary,' and put them into the variable box.
• Click Options
• The Options box allows you to analyze other
descriptive statistics besides the mean and std.
• Click 'variance' and 'kurtosis'
• Finally click 'Continue'
• Finally click OK in the Descriptives box. You will
be able to see the result of the analysis.
• Click 'Analyze,' 'Regression,' then click
'Linear' from the main menu.
• For example, let's analyze the model salbegin = β0 + β1·edu + ε
• Put 'Beginning Salary' as Dependent and 'Educational Level' as
Independent.
• Clicking OK gives the result
• Click 'Graphs,' 'Legacy Dialogs,'
'Interactive,' and 'Scatterplot' from the
main menu.
• Drag 'Current Salary' into the vertical axis box
and 'Beginning Salary' into the horizontal axis box.
• Click the 'Fit' bar. Make sure the Method is
regression in the Fit box. Then click 'OK'.
• Find out whether or not the previous
experience of workers has any effect on their
beginning salary.
◦ Take the variables "salbegin" and "prevexp" as
dependent and independent variables respectively.

• Plot the regression line for the above analysis
using the "scatter plot" menu.
(Click on the "fit" tab to make
sure the method is regression.)
For further Questions:
kentaka@mail.uri.edu
Standard Deviation
Two classes took a
recent quiz. There
were 10 students in
each class, and each
class had an average
score of 81.5
Since the averages are
the same, can we
assume that the students
in both classes all did
pretty much the same on
the exam?
The answer is… No.

The average (mean)
does not tell us anything
about the distribution or
variation in the grades.
Here are Dot-Plots of the
grades in each class:
[Figure: dot-plots of the two classes' grades, with the mean marked]
So, we need to come up
with some way of
measuring not just the
average, but also the
spread of the distribution
of our data.
Why not just give an
average and the range
of data (the highest and
lowest values) to
describe the distribution
of the data?
Well, for example, let's say
from a set of data, the
average is 17.95 and the
range is 23.

But what if the data looked
like this?
[Figure: dot-plot with the average and range marked. Most of
the numbers sit in one small area and are not evenly
distributed throughout the range.]
The Standard Deviation
is a number that
measures how far away
each number in a set of
data is from their mean.
If the Standard Deviation is
large, it means the numbers
are spread out from their
mean.

If the Standard Deviation is
small, it means the numbers
are close to their mean.
Here are the scores on the
math quiz for Team A:
72, 76, 80, 80, 81, 83, 84, 85, 85, 89
Average: 81.5
The Standard Deviation measures how far away each
number in a set of data is from their mean.
For example, start with the lowest score, 72. How far
away is 72 from the mean of 81.5?
72 - 81.5 = - 9.5

Or, start with the highest score, 89. How far away is 89
from the mean of 81.5?
89 - 81.5 = 7.5
So, the first step to finding the Standard
Deviation is to find all the distances from the
mean.

Score   Distance from Mean
72      -9.5
76
80
80
81
83
84
85
85
89       7.5
Score   Distance from Mean
72      -9.5
76      -5.5
80      -1.5
80      -1.5
81      -0.5
83       1.5
84       2.5
85       3.5
85       3.5
89       7.5
Next, you need to square each of the distances
to turn them all into positive numbers.

Score   Distance   Distance²
72      -9.5       90.25
76      -5.5       30.25
80      -1.5
80      -1.5
81      -0.5
83       1.5
84       2.5
85       3.5
85       3.5
89       7.5
Score   Distance   Distance²
72      -9.5       90.25
76      -5.5       30.25
80      -1.5        2.25
80      -1.5        2.25
81      -0.5        0.25
83       1.5        2.25
84       2.5        6.25
85       3.5       12.25
85       3.5       12.25
89       7.5       56.25
Add up all of the squared distances:
90.25 + 30.25 + 2.25 + 2.25 + 0.25 + 2.25 + 6.25 + 12.25
+ 12.25 + 56.25 = 214.5
Divide by (n - 1), where n is how many numbers
you have:
214.5 / (10 - 1) = 23.8
Finally, take the square root of that average
squared distance:
√23.8 = 4.88
This is the Standard Deviation: 4.88
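
Written as a single formula (the sample standard deviation, dividing by n − 1):

$$ s = \sqrt{\frac{\sum (x-\bar{x})^2}{n-1}} = \sqrt{\frac{214.5}{10-1}} = \sqrt{23.8} \approx 4.88 $$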
Now find the Standard Deviation for the other class's grades:

Score   Distance   Distance²
57     -24.5      600.25
65     -16.5      272.25
83       1.5        2.25
94      12.5      156.25
95      13.5      182.25
96      14.5      210.25
98      16.5      272.25
93      11.5      132.25
71     -10.5      110.25
63     -18.5      342.25

Sum = 2280.5;   2280.5 / (10 - 1) = 253.4;   √253.4 = 15.91
Now, let's compare the two classes again:

                      Team A   Team B
Average on the Quiz    81.5     81.5
Standard Deviation     4.88    15.91
Variance and Standard Deviation
Variance: a measure of how data
points differ from the mean
• Data Set 1: 3, 5, 7, 10, 10
• Data Set 2: 7, 7, 7, 7, 7

What is the mean and median of each data set?
Data Set 1: mean = 7, median = 7
Data Set 2: mean = 7, median = 7

But we know that the two data sets are not identical! The variance
shows how they are different.

We want to find a way to represent these two data sets numerically.
How to Calculate?
• If we conceptualize the spread of a distribution as the
extent to which the values in the distribution differ from
the mean and from each other, then a reasonable
measure of spread might be the average deviation, or
difference, of the values from the mean:
Σ(x − X̄) / N
• Although this might seem reasonable, this expression always equals 0,
because the negative deviations about the mean always cancel out the
positive deviations about the mean.
• We could just drop the negative signs, which is the same mathematically as
taking the absolute value; this is known as the mean deviation.
• The concept of absolute value does not lend itself to the kind of advanced
mathematical manipulation necessary for the development of inferential
statistical formulas.
• The average of the squared deviations about the mean is called the variance.

σ² = Σ(x − X̄)² / N         (population variance)

s² = Σ(x − X̄)² / (n − 1)   (sample variance)
Score X (n = 5): 3, 5, 7, 10, 10   Total = 35

The mean is 35/5 = 7.


        Score X   X − X̄       (X − X̄)²
1         3       3 - 7 = -4     16
2         5       5 - 7 = -2      4
3         7       7 - 7 =  0      0
4        10      10 - 7 =  3      9
5        10      10 - 7 =  3      9
Totals   35                      38

s² = Σ(x − X̄)² / n = 38 / 5 = 7.6
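
Note that the slide divides by n = 5 here, treating the five scores as a complete population; with the sample formula (n − 1) given above, the same data yield a larger value:

$$ \sigma^2 = \frac{38}{5} = 7.6 \qquad\text{vs.}\qquad s^2 = \frac{38}{5-1} = 9.5 $$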
Example 2

Dive Mark Myrna


1 28 27
2 22 27
3 21 28
4 26 6
5 18 27
Find the mean, median, and range.

         Mark   Myrna
mean      23     23
median    22     27
range     10     22

What can be said about this data?

Due to the outlier (Myrna's dive 4 score of 6), the median is more typical of overall performance.

Which diver was more consistent?


Dive    Mark's Score X   X − X̄   (X − X̄)²
1            28             5        25
2            22            -1         1
3            21            -2         4
4            26             3         9
5            18            -5        25
Totals      115             0        64

Mark's Variance = 64 / 5 = 12.8

Myrna's Variance = 362 / 5 = 72.4

Conclusion: Mark has a lower variance, therefore he is more consistent.


standard deviation - a measure of
variation of scores about the mean
• Can think of standard deviation as the average distance to
the mean; although that's not numerically accurate, it's
conceptually helpful. All ways of saying the same thing:
higher standard deviation indicates higher spread, less
consistency, and less clustering.

• sample standard deviation:
s = √( Σ(x − X̄)² / (n − 1) )

• population standard deviation:
σ = √( Σ(x − μ)² / N )
Another formula
• Definitional formula for variance for data in a frequency
distribution:
S² = Σ( (X − X̄)² · f ) / Σf
• Definitional formula for standard deviation for data in a
frequency distribution:
S = √( Σ( (X − X̄)² · f ) / Σf )
The mean is 23.

Round-off rule: carry one more decimal place than was
present in the original data.

Myrna's Score X    f    X − X̄   (X − X̄)²   (X − X̄)² × f
28                 1      5        25           25
27                 3      4        16           48
6                  1    -17       289          289
Totals           115     5                     362

Variance = S² = 362 / 5 = 72.4

Standard Deviation = √72.4 = 8.5
Bell shaped curve
• empirical rule for data (68-95-99.7) - only applies to a set of
data having a distribution that is approximately bell-shaped:
(figure pg 220)
• ≈ 68% of all scores fall within 1 standard deviation of the
mean
• ≈ 95% of all scores fall within 2 standard deviations of the
mean
• ≈ 99.7% of all scores fall within 3 standard deviations of the
mean
T-Test
• Hypothesis testing involves making a decision concerning
some hypothesis or statement about a population parameter,
such as the population mean μ, using the sample mean X̄ to
decide whether this statement about the value of μ is valid or
not.
• The steps of hypothesis testing:

1- The first step is to formulate a null hypothesis, written H0. The
statement for H0 is usually expressed as an equation or
inequality as follows:
H0: μ = given value
H0: μ ≤ given value
H0: μ ≥ given value
1
Also in this step an alternative hypothesis, written Ha, is stated:
a statement that indicates the opinion of the conductor of the test
as to the actual value of μ. Ha is expressed as follows:
Ha: μ ≠ given value
Ha: μ < given value
Ha: μ > given value

We conduct a hypothesis test on a given value of μ to find out if
actual observation would lead us to reject the stated value.

2
T-Test
The alternative hypothesis suggests the direction of the actual
value of the parameter relative to the stated value. A statement
of Ha in the form of a "not equal" inequality indicates that the
investigator has no opinion as to whether the actual value of μ is
more than or less than the stated value, but the feeling is that the
stated value is incorrect. In this case the test is a two-tail test.
Statements in the form of a strictly greater than or strictly less than
relationship indicate that the investigator has an opinion as to the
direction of the value of the parameter relative to the stated
value. In this case it is called a one-tail test.

3
T-Test
• 2- State the level of significance of the test and the
corresponding Z values (for large sample tests) or the
corresponding T values (for small sample tests). The
hypothesis test is frequently conducted at the 5%, 1% and 10%
levels of significance. For a test
conducted at any other level of significance, we simply use the
normal distribution table to determine a corresponding Z value.
• 3- Calculate the test statistic for the sample that has been taken.
• There are three cases:

4
T-Test
• Case 1: The variable has a normal distribution and σ² is known.
In this case the test statistic is
Z = (X̄ − μ0) / (σ/√n),
which has a standard normal distribution if μ = μ0 in H0.
• Case 2: The variable has a normal distribution and σ² is
unknown. The test statistic is
T = (X̄ − μ0) / (S/√n),
which has a t(n−1) distribution if H0 is true.
5
T-Test
• Case 3: The variable is not normal but n is large (n > 30);
σ² may be known or unknown.
• The test statistic is
Z = (X̄ − μ0) / (σ/√n)   if σ² is known,
or
Z = (X̄ − μ0) / (s/√n)   if σ² is unknown.
• By the central limit theorem it has approximately a standard
normal distribution N(0, 1) if H0 is true.

6
T-Test
4- Determine the boundary (or boundaries) of the
rejection regions using either X̄c or Zc values. A critical value is
the boundary or limit value that requires us to reject the
statement of the null hypothesis.

7
T-Test

[Figure: sampling distribution with rejection regions in both tails,
below lower X̄c and above upper X̄c, centered at μ]

In a non-directional test there are two critical values, when:
Ha: μ ≠ μ0

8
T-Test

[Figure: rejection region in the upper tail, above upper X̄c]

In a directional test there is one critical value (upper boundary) when:
Ha: μ > μ0

9
T-Test

[Figure: rejection region in the lower tail, below lower X̄c]

In a directional test there is one critical value (lower boundary) when:
Ha: μ < μ0

10
• The critical value X̄c is simply the maximum or minimum value that we are
willing to accept as being consistent with the stated parameter. The mean of the
distribution is given by:
μ(X̄) = μ
• The standard deviation of the distribution is given by:
σ(X̄) = σ/√n
• 5- Formulate a decision rule on the basis of the boundary values obtained in step
4. When we conduct a hypothesis test, we are required to make one of two
decisions:
• a- Reject H0, or
• b- Accept H0

11
It is possible to make two errors in decision. One error is
called a type I error or α-error. We make a type I
error whenever we reject the statement of H0 when it is in
fact true. The probability of making a type I error is the
level of significance of the test. The second error we can
make in a hypothesis test is called a type II error, or β-
error. We commit a type II error if we fail to reject the
statement of H0 when it is in fact false. The four combinations
of truth values of H0 and the resulting decisions are
summarized below:

12
                 H0 True            H0 False
Reject H0        Type I error       Correct decision
Accept H0        Correct decision   Type II error

13
When we lower the level of significance of a
hypothesis test, we always increase the possibility of
committing a β-error.
6- State a conclusion for the hypothesis test based on the
sample data obtained and the decision rule stated in the steps.

14
• P-value of a test:
• The p-value is the probability of getting a value more
extreme than the observed value of the test statistic, Zobs.
• When Ha is ≠: p-value = 2P(Z > |Zobs|)
• When Ha is >: p-value = P(Z > Zobs)
• When Ha is <: p-value = P(Z < Zobs)

15
• If we have a T statistic with a t(n−1) distribution and
observed value tobs, these p-values become:
• ≠ alternative: p-value = 2P(t(n−1) > |tobs|)
• > alternative: p-value = P(t(n−1) > tobs)
• < alternative: p-value = P(t(n−1) < tobs)

16
• Thus H0 is rejected if p-value < α. When data is
collected from a normally distributed population and the
sample size is small, the t values of the Student t
distribution must be used in the hypothesis test, not the Z
values of the normal distribution. This is due to the fact
that the central limit theorem does not apply when n < 30.

17
• Ex:
• Suppose we measure the sulfur content (as a percent) of
15 samples of crude oil from a particular Middle Eastern
area, obtaining:
• 1.9, 2.3, 2.9, 2.5, 2.1, 2.7, 2.8, 2.6, 2.6, 2.5, 2.7, 2.2, 2.8, 2.7, 3.0
• Assume that sulfur contents are normally distributed. Can
we conclude that the average sulfur content in this area is
less than 2.6? Use a level of significance of .05.

18
n = 15   X̄ = 2.533   S = .3091   α = .05

H0: μ = 2.6
Ha: μ < 2.6

19
[Figure: t distribution with lower-tail rejection region of area .05
(acceptance region .95); critical value −t(14, .05) = −1.761]

20
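The calculation shown on the (screenshotted) slides can be reconstructed from the figures above:

$$ T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \frac{2.533 - 2.6}{0.3091/\sqrt{15}} \approx -0.84 $$

Since −0.84 lies above the critical value −1.761, the test statistic falls in the acceptance region: we cannot conclude that the average sulfur content is less than 2.6.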
• Testing for the Difference in Two Population Means:
• Often we have two populations for which we would
like to compare the means. Independent random
samples of sizes n1 and n2 are selected from the two
populations, with no relationship between the elements
drawn from the two populations. The statistical
hypotheses are given by:

23
H0: μ1 = μ2   vs   Ha: μ1 ≠ μ2
              or   Ha: μ1 > μ2
              or   Ha: μ1 < μ2

24
• There are three cases, which depend on what is known
about the population variances σ1² and σ2².
• Case 1:
• Population variances are known for normal populations
(or non-normal populations with both n1 and n2 large).
In this case the test statistic is:
Z = (X̄1 − X̄2) / √( σ1²/n1 + σ2²/n2 )

25
• Case 2:
• Population variances are unknown but assumed to be equal
(σ1² = σ2² = σ²) in normal populations. In this case, we pool our
estimates to get the pooled two-sample variance:

Sp² = ( (n1 − 1)S1² + (n2 − 1)S2² ) / (n1 + n2 − 2)

26
• And the test statistic is
T = (X̄1 − X̄2) / √( Sp² (1/n1 + 1/n2) ),
which has a t(n1+n2−2) distribution if H0 is true.

27
• Case 3:
• σ1² and σ2² are unknown and unequal in normal
populations. In this case the test statistic is given by:
T′ = (X̄1 − X̄2) / √( S1²/n1 + S2²/n2 ),
which does not have a known distribution.

28
Ex:
The amount of solar ultraviolet light of wavelength from 290 to 320
nm which reached the earth's surface in the Riyadh area was
measured for independent samples of days in cooler months
(October to March) and in warmer months (April to September):
• Cooler: 5.31, 4.36, 3.71, 3.74, 4.51, 4.58, 4.64, 3.83, 3.16, 3.67, 4.34, 2.95,
3.62, 3.29, 2.45.
• Warmer: 4.07, 3.83, 4.75, 4.84, 5.03, 5.48, 4.11, 4.15, 3.9, 4.39, 4.55, 4.91,
4.11, 3.16, 2.99, 3.01, 3.5, 3.77.

29
• Assuming normal distributions with equal variances,
test whether there is a difference in the average ultraviolet
light reaching Riyadh in the cooler and warmer months.
Use a level of significance of .05.

30
n1 = 15   n2 = 18

X̄1 = 3.877   X̄2 = 4.142

S1 = .751   S2 = .709

H0: μ1 = μ2
Ha: μ1 ≠ μ2

31
• The pooled two-sample variance is
Sp² = ( (n1 − 1)S1² + (n2 − 1)S2² ) / (n1 + n2 − 2) = .531

• And the test statistic is
T = (X̄1 − X̄2) / √( Sp² (1/n1 + 1/n2) ) = −1.033

32
[Figure: t distribution with two-tailed rejection regions of area .025
each (acceptance region .95); critical values ±t(31, .025) = ±2.0423]

33
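The decision shown on the (screenshotted) slides follows directly:

$$ |T| = 1.033 < t_{31,\,.025} = 2.0423 $$

so the test statistic falls in the acceptance region.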
• Since the value of the test statistic is in the
acceptance region, H0 is accepted at α = .05.
• It means that there is no difference in the average
ultraviolet light reaching Riyadh in the cooler and
warmer months.

36
• Dependent Samples:

• The method of comparing parameters of populations
using paired dependent samples requires that we pair the
items of data as we sample them from the two
populations. Furthermore, the size of the two samples
selected from both populations is the same,
that is n1 = n2 = n

37
• For each Xi (the elements of the sample before the
experiment) and Yi (the elements of the sample after the
experiment) obtained in the two samples, we compute a
value di of a random variable D which represents the
difference between the two populations; n is the
number of items of data obtained in each of the two
samples.

38
The samples drawn from the two populations are
therefore converted to a single sample – a sample of di's.
The mean, d̄, and the standard deviation, Sd, of the
distribution of di's are obtained as follows:

d̄ = Σdi / n = Σ(xi − yi) / n

Sd = √( Σ(di − d̄)² / (n − 1) )

39
• We are interested in testing one of the hypotheses:
H0: μd = 0   vs   Ha: μd ≠ 0
              or   Ha: μd > 0
              or   Ha: μd < 0
• Thus the quantity
T = (d̄ − μd) / (Sd/√n)
has a t(n−1) distribution.

40
Ex:
In an experiment comparing two feeding methods for
calves, eight pairs of twins were used-one twin receiving
Method A and the other twin receiving Method B. At the
end of a given time, the calves were slaughtered and
cooked, and the meat was rated for its taste (with a higher
number indicating a better taste).

41
Twin pair Method A Method B
1 27 23
2 37 28
3 31 30
4 38 32
5 29 27
6 35 29
7 41 36
8 37 31

42
Assuming approximate normality, test if the average taste
score for calves fed by Method B is less than the average
taste for calves fed by Method A. Use α = .05 .

43
di     di²
4      16
9      81
1       1
6      36
2       4
6      36
5      25
6      36
Σ: 39  235
44
H0: μd = 0   vs   Ha: μd > 0

d̄ = Σdi / n = 39/8 = 4.875

Sd = √( (Σdi² − n·d̄²) / (n − 1) ) = √( (235 − 8(4.875)²) / 7 ) = 2.532

45
The test statistic is

T = (d̄ − μd) / (Sd/√n) = 5.447

46
[Figure: t distribution with upper-tail rejection region of area .05
(acceptance region .95); critical value t(n−1, α) = t(7, .05) = 1.8946]

Since T = 5.447 > 1.8946, H0 is rejected at α = .05: the average
taste score for Method A is higher than for Method B.

47
Quality Control
• A “defect” is an instance of a failure to meet a requirement
imposed on a unit with respect to a single quality characteristic. In
inspection or testing, each unit is checked to see if it does or does
not contain any defects. If every dosage unit could
be tested, the expense would probably be prohibitive both to
manufacturer and consumer. It may also cause misclassification
of items and other errors. Quality can be accurately and precisely
estimated by testing only part of the total material (a sample). This
requires small samples for inspection or analysis.

51
• Data obtained from this sampling can then be treated
statistically to estimate population parameters. After
inspecting (n) units we will have found, say, (d) of them to
be defectives and (n − d) of them to be good ones. On the
other hand we may count and record the number of
defects, c, we find on a single unit. This count may be
0, 1, 2, …. Such an approach of counting defects on a
unit becomes especially useful if most of the units contain
one or more defects.

52
• Control charts can be applied during in-process
manufacturing operations, for finished product
characteristics, and in research and development for
repetitive procedures. We may always convert a
measurable characteristic of a unit to an attribute by
setting limits, say L (lower bound) and U (upper bound),
for x. Then if x lies between them, the unit is a good one; if
outside, it is a defective one. An example for the
control chart is the tablet weight.

53
• We are interested in ensuring that tablet weights remain
close to a target value under "statistical control". To
achieve this objective, we will periodically sample a group
of tablets, measuring the mean weight and variability.
Variability can be calculated on the basis of the standard
deviation or the range. The range is the difference between
the lowest and highest value.

54
• If the sample size is not large (<10), the range is an
efficient estimator of the standard deviation. The mean
weight and variability of each sample (subgroup) are
plotted sequentially as a function of time. The control
chart is a graph that has time or order of submission of
sequential lots on the X axis and the average test result on
the Y axis. The subgroups should be as homogeneous as
possible relative to the overall process. They are usually (but
not always) taken as units manufactured close in time.

55
• Four to five items per subgroup is usually an adequate
sample size. In our example, (10) tablets are individually
weighed at approximately (1) hour intervals. The mean
and range are calculated for each of the subgroup
samples. As long as the mean and range of the 10-tablet
samples do not vary "too much" from subgroup to
subgroup, the product is considered to be in control (it
means that the observed variation is due only to the
random, uncontrolled variation inherent in the process).

56
• We will define upper and lower limits for the mean and
range of the subgroups. The construction of these limits is
based on the normal distribution. In particular, a value more
than (3) standard deviations from the mean is highly
unlikely and can be considered to be probably due to some
systematic, assignable cause. The average line (the target
value) may be determined from the history of the product
with regular updating, or may be determined from the product
specifications.

57
• The action lines (the limits) are constructed to represent
±3 standard deviations (±3σ limits) from the
target value. The upper and lower limits for the mean (X̄)
chart are given by:
X̄ ± A·R̄,
where R̄ = ΣR / K
is the average range, K is the number of samples
(subgroups), and A is a factor which is obtained from a table
according to the sample size.

58
• The central line, the upper limit, and the lower limit for the range
chart are given by:
• Central line = R̄ = ΣR / K
• Lower limit = D_L·R̄
• Upper limit = D_U·R̄

59
• Where D_L and D_U are factors which are
obtained from a table according to the sample size. It is
noticed that the sample size is constant.
• Ex:
• Tablet weights and ranges from a tablet manufacturing
process (data are the average and range of 10 tablets):

60
Date Time Mean Range
X R
3/1 11 a.m. 302.4 16
12 p.m. 298.4 13
1 p.m. 300.2 10
2 p.m. 299 9
3/5 11 a.m. 300.4 13
12 p.m. 302.4 5
1 p.m. 300.3 12
2 p.m. 299 17
61
Date Time Mean Range R
X

3/9 11 a.m. 300.8 18


12 p.m. 301.5 6
1 p.m. 301.6 7
2 p.m. 301.3 8
3/11 11 a.m. 301.7 12
12 p.m. 303 9
1 p.m. 300.5 9
2 p.m. 299.3 11
62
Date Time Mean X Range R

3/16 11 a.m. 300 13


12 p.m. 299.1 8
1 p.m. 300.1 8
2 p.m. 303.5 10
3/22 11 a.m. 297.2 14
12 p.m. 296.2 9
1 p.m. 297.4 11
2 p.m. 296 12
63
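As a check, the average range over the K = 24 subgroups listed above is

$$ \bar{R} = \frac{\sum R}{K} = \frac{260}{24} = 10.833 $$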
X̄ chart

The central line (the target value) is X̄ = 300

A = .31 at n = 10,   R̄ = 10.833

L, U = X̄ ± A·R̄ = 300 ± (.31)(10.833) = 300 ± 3.358

Lower Limit = 296.642

Upper Limit = 303.358

64
R chart

The central line = R̄ = 10.833

D_L = .22,  D_U = 1.78 at n = 10

Lower Limit = D_L·R̄ = 2.383

Upper Limit = D_U·R̄ = 19.283

65
[Figure: X̄ control chart of the subgroup means by date (3/1 to 3/22),
with CL = 300, UCL = 303.358, LCL = 296.642]

66
[Figure: R control chart of the subgroup ranges by date (3/1 to 3/22),
with CL = 10.833, UCL = 19.283, LCL = 2.383]
67
Non‐parametric tests
• Note: When valid, use parametric
• Commonly used
Wilcoxon
Chi square etc.
• Performance comparable to parametric
• Useful for non‐normal data
• If normalization not possible
• Note: CI derivation‐difficult/impossible
Wilcoxon signed rank test

To test difference between paired 
data
STEP 1
• Exclude any differences which are zero

• Put the rest of differences in ascending order

• Ignore their signs

• Assign them ranks

• If any differences are equal, average their ranks
STEP 2

• Count up the ranks of +ives as T+

• Count up the ranks of –ives as T‐
STEP 3
• If there is no difference between drug (T+) and 
placebo (T‐), then T+  & T‐ would be similar

• If there were a difference  
one sum would be much smaller and    
the other much larger than expected

• The smaller sum is denoted as T

• T = smaller of T+ and T‐
STEP 4
• Compare the value obtained with the critical 
values (5%, 2% and 1% ) in table

• N is the number of differences that were 
ranked (not the total number of differences)

• So the zero differences are excluded
Hours of sleep Rank
Patient Drug Placebo Difference Ignoring sign
1 6.1 5.2 0.9 3.5*
2 7.0 7.9 -0.9 3.5*
3 8.2 3.9 4.3 10
4 7.6 4.7 2.9 7
5 6.5 5.3 1.2 5
6 8.4 5.4 3.0 8
7 6.9 4.2 2.7 6
8 6.7 6.1 0.6 2
9 7.4 3.8 3.6 9
10 5.8 6.3 -0.5 1
3rd & 4th ranks are tied hence averaged
T= smaller of T+ (50.5) and T- (4.5)
Here T=4.5 significant at 2% level indicating the drug (hypnotic) is
more effective than placebo
Wilcoxon rank sum test

• To compare two groups

• Consists of 3 basic steps
Non‐parametric equivalent of 
t test
Step 1

• Rank the data of both the groups in 
ascending order

• If any values are equal average their ranks
Step 2

• Add up the ranks in group with smaller 
sample size

• If the two groups are of the same size either 
one may be picked

• T= sum of ranks in group with smaller sample 
size
Step 3
• Compare this sum with the critical ranges 
given in table

• Look up the rows corresponding to the sample 
sizes of the two groups

• A range will be shown for the 5% significance 
level
Non-smokers (n=15) Heavy smokers (n=14)
Birth wt (Kg) Rank Birth wt (Kg) Rank
3.99 27 3.18 7
3.79 24 2.84 5
3.60* 18 2.90 6
3.73 22 3.27 11
3.21 8 3.85 26
3.60* 18 3.52 14
4.08 28 3.23 9
3.61 20 2.76 4
3.83 25 3.60* 18
3.31 12 3.75 23
4.13 29 3.59 16
3.26 10 3.63 21
3.54 15 2.38 2
3.51 13 2.34 1
2.71 3
Sum=272 Sum=163

* 17, 18 & 19are tied hence the ranks are averaged


NON-PARAMETRIC TESTS

Statistical tests fall into two categories:
(i) Parametric tests
(ii) Non-parametric tests

Parametric tests make the following assumptions:
• the population is normally distributed;
• the variances are homogeneous.

If any of these assumptions is untrue, the results of the test may be invalid, and it is safest to use a non-parametric test.
ADVANTAGES OF NON-PARAMETRIC TESTS
• If the sample size is small, there may be no alternative.
• They can be applied to nominal or ordinal data.
• They are much easier to apply.

DISADVANTAGES OF NON-PARAMETRIC TESTS
(i) They discard information by converting observations to ranks.
(ii) Parametric tests are more powerful.
(iii) Tables of critical values may not be easily available.
(iv) They serve only for hypothesis testing; confidence limits cannot readily be calculated.
Non-parametric tests
• Note: when parametric tests are valid, use them; the performance of non-parametric tests is otherwise comparable.
• Commonly used:
  Wilcoxon signed-rank test
  Wilcoxon rank-sum test
  Spearman rank correlation
  Chi-square test, etc.
• Useful for non-normal data; if possible, use a normalizing transformation first.
• If normalization is not possible, use a non-parametric test.
• Note: deriving confidence intervals is difficult or impossible.
Wilcoxon signed rank test

• Used to test differences between paired data.
EXAMPLE
Hours of sleep
Patient Drug Placebo

1 6.1 5.2
2 7.0 7.9
3 8.2 3.9
4 7.6 4.7
5 6.5 5.3
6 8.4 5.4
7 6.9 4.2
8 6.7 6.1
9 7.4 3.8
10 5.8 6.3

Null hypothesis: Hours of sleep are the same with the drug as with placebo.
STEP 1
• Exclude any differences which are zero.
• Ignoring their signs, put the remaining differences in ascending order.
• Assign them ranks.
• If any differences are equal, average their ranks.
STEP 2
• Count up the ranks of the positive differences as T+.
• Count up the ranks of the negative differences as T-.
STEP 3
• If there is no difference between drug (T+) and placebo (T-), then T+ and T- will be similar.
• If there is a difference, one sum will be much smaller and the other much larger than expected.
• The larger sum is denoted T: T = larger of T+ and T-. (Some tables are instead entered with the smaller sum; either convention works with its matching table of critical values.)
STEP 4
• Compare the value obtained with the critical values (5%, 2% and 1%) in the table.
• N is the number of differences that were ranked (not the total number of differences), since the zero differences are excluded.


Hours of sleep Rank
Patient Drug Placebo Difference Ignoring sign
1 6.1 5.2 0.9 3.5*
2 7.0 7.9 -0.9 3.5*
3 8.2 3.9 4.3 10
4 7.6 4.7 2.9 7
5 6.5 5.3 1.2 5
6 8.4 5.4 3.0 8
7 6.9 4.2 2.7 6
8 6.7 6.1 0.6 2
9 7.4 3.8 3.6 9
10 5.8 6.3 -0.5 1
* The 3rd and 4th ranks are tied, hence averaged. T = larger of T+ (50.5) and T- (4.5).
Here the calculated value of T = 50.5 exceeds the tabulated value of 47 (at 5%), so the result is significant at the 5% level, indicating that the drug (a hypnotic) is more effective than placebo.
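For readers who want to reproduce this result by machine, here is a minimal sketch using SciPy's implementation of the signed-rank test (scipy.stats.wilcoxon). Note that SciPy reports the smaller of T+ and T- as its statistic and returns a p-value rather than using printed critical-value tables.

from scipy.stats import wilcoxon

drug    = [6.1, 7.0, 8.2, 7.6, 6.5, 8.4, 6.9, 6.7, 7.4, 5.8]
placebo = [5.2, 7.9, 3.9, 4.7, 5.3, 5.4, 4.2, 6.1, 3.8, 6.3]

# Paired test on the differences; zero differences are dropped automatically.
result = wilcoxon(drug, placebo)
print(result.statistic)   # 4.5 -> the smaller rank sum (T-)
print(result.pvalue)      # about 0.02, significant at the 5% level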
Wilcoxon rank sum test

• Used to compare two independent groups; it is the non-parametric equivalent of the two-sample t test.
• Consists of 3 basic steps.
Non-smokers (n=15) Heavy smokers (n=14)
Birth wt (Kg) Birth wt (Kg)
3.99 3.18
3.79 2.84
3.60* 2.90
3.73 3.27
3.21 3.85
3.60* 3.52
4.08 3.23
3.61 2.76
3.83 3.60*
3.31 3.75
4.13 3.59
3.26 3.63
3.54 2.38
3.51 2.34
2.71
Null hypothesis: Birth weight is the same for non-smokers and heavy smokers.
Step 1
• Rank the data of both groups together in ascending order.
• If any values are equal, average their ranks.
Step 2
• Add up the ranks in the group with the smaller sample size.
• If the two groups are of the same size, either one may be picked.
• T = sum of the ranks in the group with the smaller sample size.
Step 3
• Compare this sum with the critical ranges given in the table.
• Look up the row corresponding to the sample sizes of the two groups.
• A range will be shown for the 5% significance level.
Non-smokers (n=15) Heavy smokers (n=14)
Birth wt (Kg) Rank Birth wt (Kg) Rank
3.99 27 3.18 7
3.79 24 2.84 5
3.60* 18 2.90 6
3.73 22 3.27 11
3.21 8 3.85 26
3.60* 18 3.52 14
4.08 28 3.23 9
3.61 20 2.76 4
3.83 25 3.60* 18
3.31 12 3.75 23
4.13 29 3.59 16
3.26 10 3.63 21
3.54 15 2.38 2
3.51 13 2.34 1
2.71 3
Sum=272 Sum=163
* Ranks 17, 18 and 19 are tied (three values of 3.60 kg), hence averaged to 18.
Here the calculated value of T = 163 (the rank sum of the smaller group, the heavy smokers); the tabulated value for group sizes (14, 15) is 151. The mean birth weights of non-smokers and heavy smokers are therefore not the same; they are significantly different.
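The same comparison can be run in Python. SciPy exposes it as the Mann-Whitney U test (scipy.stats.mannwhitneyu), which is equivalent to the Wilcoxon rank-sum test, so a p-value is reported instead of a critical range.

from scipy.stats import mannwhitneyu

non_smokers   = [3.99, 3.79, 3.60, 3.73, 3.21, 3.60, 4.08, 3.61,
                 3.83, 3.31, 4.13, 3.26, 3.54, 3.51, 2.71]
heavy_smokers = [3.18, 2.84, 2.90, 3.27, 3.85, 3.52, 3.23, 2.76,
                 3.60, 3.75, 3.59, 3.63, 2.38, 2.34]

# Two-sided test: are the two birth-weight distributions the same?
result = mannwhitneyu(non_smokers, heavy_smokers, alternative="two-sided")
print(result.pvalue)   # about 0.04 -> significant at the 5% level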
Spearman's Rank Correlation Coefficient

• Based on the ranks of the items rather than their actual values.
• Can also be used with the actual values.

Examples
• To assess the correlation between the honesty and the wisdom of the boys of a class.
• To find the degree of agreement between the judgements of two examiners or two judges.
R (rank correlation coefficient) = 1 - (6 ΣD²) / (N(N² - 1)),

where D = the difference between the ranks of the two items and N = the number of observations.

Note: -1 ≤ R ≤ 1.

(i) When R = +1: perfect positive correlation, or complete agreement in the same direction.
(ii) When R = -1: perfect negative correlation, or complete agreement in the opposite direction.
(iii) When R = 0: no correlation.
Computation
i. Assign ranks to the values of the items. Generally the item with the highest value is ranked 1, and the others are given ranks 2, 3, 4, ... in decreasing order of their values.
ii. Find the difference D = R1 - R2, where R1 = rank of x and R2 = rank of y. Note that ΣD = 0 (always).
iii. Calculate D² and then find ΣD².
iv. Apply the formula.

If there is a tie between two or more items, give each the average rank. If m is the number of items of equal rank, the factor (m³ - m)/12 is added to ΣD²; if there is more than one such tie, this factor is added once for each tie.
Student No. | Rank in Maths (R1) | Rank in Stats (R2) | D = R1 - R2 | D²
 1 |  1 |  3 | -2 |  4
 2 |  3 |  1 |  2 |  4
 3 |  7 |  4 |  3 |  9
 4 |  5 |  5 |  0 |  0
 5 |  4 |  6 | -2 |  4
 6 |  6 |  9 | -3 |  9
 7 |  2 |  7 | -5 | 25
 8 | 10 |  8 |  2 |  4
 9 |  9 | 10 | -1 |  1
10 |  8 |  2 |  6 | 36
N = 10, ΣD = 0, ΣD² = 96

Hence R = 1 - 6(96)/(10(10² - 1)) = 1 - 576/990 ≈ 0.42, a moderate positive agreement between the two rankings.
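The same coefficient can be checked in Python with scipy.stats.spearmanr, which accepts the ranks (or the raw scores) directly; the manual formula above is reproduced alongside it for comparison.

from scipy.stats import spearmanr

maths = [1, 3, 7, 5, 4, 6, 2, 10, 9, 8]    # ranks in Maths (R1)
stats = [3, 1, 4, 5, 6, 9, 7, 8, 10, 2]    # ranks in Stats (R2)

# Manual formula: R = 1 - 6*sum(D^2) / (N*(N^2 - 1))
n = len(maths)
sum_d2 = sum((r1 - r2) ** 2 for r1, r2 in zip(maths, stats))
print(sum_d2)                               # 96
print(1 - 6 * sum_d2 / (n * (n**2 - 1)))    # 0.418...

rho, p = spearmanr(maths, stats)            # SciPy gives the same coefficient
print(rho)                                  # 0.418...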
Pharmacoepidemiologic Study Designs
Study Designs

1. Case Reports

2. Case Series

3. Analyses of Secular Trends

4. Case – Control Studies

5. Cohort Studies

6. Randomized Clinical Trials

1. Case Reports
• Simply reports of events observed in single patients.
• Useful for raising hypotheses about drug effects, which can then be tested with more rigorous study designs.
• Very rarely usable to make a statement of causation.
• An exception is when the outcome is so rare and so characteristic that one knows it is due to the exposure; causation may also be accepted when rechallenge is out of the question because the outcome is life-threatening.
2. Case Series
• Collections of patients, all of whom have a single exposure, whose clinical outcomes are then evaluated and described.
• Alternatively, a case series can be a collection of patients with a single outcome, looking at their antecedent exposures.
• Useful for quantifying the incidence of an adverse reaction, or for checking whether it occurs in a larger population.
• Provides only clinical descriptions of a disease or of the patients who receive an exposure.
3. Analyses of Secular Trends
• Also called ecological studies.
• Examine trends in an exposure that is a presumed cause and trends in a disease that is a presumed effect, and test whether the trends coincide.
• Vital statistics and record linkage are often used in these studies.
• Useful for rapidly providing evidence for or against a hypothesis.
• Unable to control for confounding variables; e.g., cigarette smoking might be the presumed cause of a trend in lung cancer, yet occupational hazards still cannot be ruled out.
4. Case–Control Studies
• Compare cases with the disease to controls without the disease, looking for differences in antecedent exposure.
• Multiple possible causes of a single disease can be studied.
• Help in studying relatively rare diseases, since a smaller sample size is required.
• Information is generally obtained retrospectively from medical records, interviews, or questionnaires.
• Limitations: the validity of retrospective information, and the selection of controls, which is a challenging task; inappropriate control selection can lead to incorrect conclusions.
5. Cohort Studies
• Identify subsets of a defined population and follow them over time, looking for differences in their outcomes.
• Used to compare exposed patients with unexposed patients; can also be used to compare one exposure with another, or when multiple outcomes of a single exposure are to be studied.
• May be conducted either prospectively or retrospectively.
• Give a more reliable indication of causal association.
• But they require large sample sizes (especially for an uncommon outcome) and can require prolonged periods to study delayed outcomes.
Differences between a cohort study and a case–control study (2 × 2 layout):

                                  Disease
                          Present (Cases)   Absent (Controls)
Factor  Present (Exposed)
        Absent (Unexposed)

A case–control study selects subjects by disease status and compares the columns for differences in exposure; a cohort study selects subjects by exposure status and follows the rows for differences in outcome.
6. Randomized Clinical Trials
• Experimental studies: the investigator controls the therapy to be received by each participant.
• Their major strength is randomization.
• Problems include ethical issues and expense; they are of less importance after marketing.
Computer System in Hospital Pharmacy

1. Pattern of Computer Use in Hospital Pharmacy
• Hospital pharmacy has been very slow to adopt computers: only about 60% of pharmacies are computerized to any extent, and institutional pharmacy managers may be wary of computerization.
• This is because hospital pharmacy is a more difficult and complex operation than retail pharmacy. Retail pharmacies dispense prescriptions in more or less the same way, whereas a hospital pharmacy department distributes different types of drug products and provides different types of services, each with its own information requirements.
• Institutional pharmacists are also aware that many computer systems have performed in a less than satisfactory manner: one survey revealed that only about 69% of hospital pharmacies were fully satisfied with their computer system.
• Nearly two-thirds of hospital pharmacies believed that their computer system had improved some pharmacy operations, such as billing and the quality of drug therapy in the hospital; even so, there is considerable room for improvement.
• The most common system features are the ability to:
  1. Generate drug order labels
  2. Maintain patient profiles
  3. Generate drug use review data
  4. Maintain a drug formulary
  5. Update drug prices
  6. Transfer patients' drug charges to the billing department
  7. Perform some inventory control functions
  8. Check food and drug interactions
Capabilities of a Hospital Pharmacy Computer System
The system should be capable of performing:
1. Patient record database management
2. Medication order entry
3. Drug labels and lists
4. Intravenous solutions and admixtures
5. Patient medication profiles
6. Drug utilization review
7. Drug therapy problem detection
8. Drug therapy monitoring
9. Formulary search and update
10. Purchasing and inventory control
11. Billing procedures
12. Management information systems and decision support
13. Integration with other hospital departments
1. Patient Record Database Management
• A hospital pharmacy computer should ensure that the pharmacy's patient record database is continually updated to reflect the current status of each patient.
• Updating is done by accessing information from the admitting department's database to determine recent admissions, discharges, and patient transfers (ADT).
• The system should be capable of producing a current roster of patients, identifying name, age, sex, room number, and hospital service unit.
• The system must also be capable of displaying:
  1. Present diagnoses
  2. Other diseases present
  3. Allergies
  4. Weight
  5. Height
  6. Physician
  7. Special notes about the patient
2. Medication Order Entry
• Rapid processing of drug orders is an essential function of the computer system. Typically, orders are entered at a terminal by a technician.
• A formatted data-entry screen should allow easy entry and retrieval of orders, and a pharmacist should be able to retrieve an order for review or verification prior to administration to the patient.
• All drug orders should contain at least the following data elements:
  1. Physician
  2. Drug code
  3. Drug generic name and strength
  4. Route of administration
  5. Dosage administration schedule
  6. Start date
  7. Stop date
  8. Order status: conditional, active, or discontinued
  9. Pharmacist verification code
• The system should be capable of easily and rapidly aggregating and displaying entries by name, chart, or room number, and of separating scheduled orders from p.r.n. orders and patients with active therapy from patients with no therapy.
• The system should be capable of automatically scheduling start and stop dates for the administration of each drug. A minimal sketch of such an order record appears below.
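As an illustration of these data elements, here is a hypothetical Python sketch of a medication order record; the field names and status values are taken from the list above, but the class name and the validation rule are assumptions for illustration only, not any particular pharmacy system's design.

from dataclasses import dataclass
from datetime import date

@dataclass
class MedicationOrder:
    # Hypothetical record mirroring the data elements listed above.
    physician: str
    drug_code: str
    generic_name: str
    strength: str
    route: str               # route of administration
    schedule: str            # dosage administration schedule, e.g. "q8h"
    start_date: date
    stop_date: date
    status: str              # "conditional", "active", or "discontinued"
    pharmacist_code: str     # pharmacist verification code

    def is_active_on(self, day: date) -> bool:
        """An order is administered only while active and within its dates."""
        return self.status == "active" and self.start_date <= day <= self.stop_date

# Usage: a fill list for a given day keeps only currently active orders.
order = MedicationOrder("Dr. A", "D123", "amoxicillin", "500 mg", "oral",
                        "q8h", date(2024, 3, 1), date(2024, 3, 7),
                        "active", "PH42")
print(order.is_active_on(date(2024, 3, 3)))   # True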
3. Drug Labels and Lists
The system should be capable of generating medication container labels and reports in the form of:
1. The patient medication profile
2. A "fill list" for the preparation of individual doses in medication carts
3. A list of medication changes since the last fill list was prepared
4. A drug order renewal list for the prescriber
5. The medication administration record (MAR)
Medication Administration Record (MAR)
• The MAR is designed to provide nurses with a document for administering drugs to patients more easily and accurately.
• It contains:
  1. Name
  2. Bed number
  3. Diagnosis
  4. Sex
  5. Weight
  6. Height
  7. Allergies
• The medications currently scheduled for administration are listed, with drug, strength, dosage schedule, and start and stop dates.
4. Intravenous Solutions and Admixtures
• IV solutions are prepared separately from other medication orders and require separate computer reports.
• Orders for these solutions should contain:
  1. Patient identifying information
  2. Medication order information
  3. Start and stop dates
  4. Administration rate
  5. Order status (conditional, active)
• The system should calculate the required volumes of the IV solution components and the flow rate, and checks for incompatibilities and interactions with other medications should occur.
• The system should be capable of producing a report of the solutions to be prepared at the pharmacy, together with all necessary labels.
• Current IV medications should be listed in a separate section of the MAR.
• Finally, the system should generate lists of solutions soon to expire, for possible renewal, in the same manner as for other medication orders.
5. Patient Medication Profile
• Medication orders should be separated into current orders, discontinued orders, IVs, and other injectables, all updated in a common profile.
• The profile should be easily generated at any time for review.
• Physician drug renewal lists are used to remind the physician to consider reviewing a medication order that is about to expire. Each entry contains the medication order, drug name, dosage, administration schedule, original start date, and stop date.
Purchasing and Inventory Control
• Drug inventory is a delicate balance between not ordering too much and avoiding out-of-stock situations. Two approaches are used: the perpetual and the periodic inventory control systems.
• The perpetual system is the most sophisticated; it is practically impossible to maintain in a pharmacy except by computer, and a hospital pharmacy usually starts with a periodic control system instead. The perpetual system involves maintaining a running balance of all drugs in stock.
• All drugs are entered into the database as they are received in the pharmacy and added to the beginning inventory level, so the database reflects the current stock level of each drug. The quantities of all drugs leaving the pharmacy are similarly subtracted from the inventory balance; this is done automatically whenever a drug order is processed.
• The residual inventory balance can be checked by the inventory system manager to verify the current balance against the minimum order balance, and a list of the drugs that need to be ordered is produced on demand. A minimal sketch of this running-balance logic appears below.
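Here is a hypothetical Python sketch of the running-balance logic just described; the class and method names are illustrative assumptions, not any particular pharmacy system's API.

# Hypothetical sketch of perpetual inventory control: receipts add to the
# balance, dispensed orders subtract from it, and items at or below their
# minimum order level appear on the reorder list on demand.
class PerpetualInventory:
    def __init__(self):
        self.balance = {}     # drug code -> units on hand
        self.min_level = {}   # drug code -> reorder point

    def receive(self, drug, qty):
        self.balance[drug] = self.balance.get(drug, 0) + qty

    def dispense(self, drug, qty):
        # Called automatically whenever a drug order is processed.
        self.balance[drug] = self.balance.get(drug, 0) - qty

    def reorder_list(self):
        return [d for d, q in self.balance.items()
                if q <= self.min_level.get(d, 0)]

inv = PerpetualInventory()
inv.min_level["D123"] = 100
inv.receive("D123", 500)      # stock received into the pharmacy
inv.dispense("D123", 420)     # orders processed during the week
print(inv.reorder_list())     # ['D123'] -> 80 on hand, below the minimum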
• The perpetual inventory system is a remarkable labor-saving device and helps avoid costly stock-outs. For it to work, the physical inventory must agree with the computer-maintained file, so pharmacy personnel must be careful to ensure that all drugs entering or leaving the pharmacy are entered into the computer.
• In the periodic system, the computer file is checked against the shelf inventory periodically, usually once a week, to assure accuracy. The amount of inventory on hand is compared with the minimum and maximum stock levels; the existing stock level is entered into the computer manually, and the computer then generates a placement order copy.
• A purchase order may then be generated for each supplier, taking into account the minimum order quantity for that supplier. The computer also provides reports of drug purchase history that are invaluable in hospital inventory control management.
A sophisticated computer system can perform the following purchase-order processing and inventory management functions:
1. Detect items that have reached predetermined minimum order levels and list those items, sorted by supplier, in the form of a purchase order
2. Track purchase orders through the hospital and vendor purchasing systems to avoid duplicate orders
3. Display any requested drug to show the current stock and give the information necessary for processing orders
4. Provide aggregate drug usage statistics
5. Provide periodic (e.g., 6-month) drug order history reports containing price and stock number information
6. Automatically recalculate and update optimum reorder points based on order history and supplier lead times
7. Detect and report infrequently purchased items
8. Automatically print physical inventory reconciliation and inventory shrinkage reports for the perpetual inventory control system
9. Report the inventory on hand, stock turnover, and the value of drugs purchased from each individual supplier
10. Update the cost information in the database
Management Reports and Statistics
• The pharmacy needs to develop and maintain an information system to assist managerial decision making.
• The computer system should generate management reports relating to the pharmacy workload for a specific period of time (e.g., monthly). These reports help pharmacy managers plan and monitor work schedules and budgets and improve operational efficiency. Typical reports include:
  1. Hospital census data: number of admissions, discharges, and patient-days
  2. Aggregate drug usage: total drug orders and doses produced; types of drug orders (oral, topical, injectable); pharmacy preparation hours
  3. Drug use per patient by drug, diagnosis, or hospital service unit: the number of patients receiving each drug, the average number of doses, and the total cost; usage by drug category, with average number of doses and total cost
Use of Computers in Community Pharmacy
- SAI KUMAR

Computers have invaded every walk of life, and almost all commercial organizations and business firms have undergone significant computerization; community pharmacy establishments are no exception. At present, community pharmacies use computers for selected pharmaceutical purposes. While there are several possible purposes, the following is a list of the majority of community pharmacy functions that could be computerized.

(1) Clerical: Preparation of prescription labels. Providing a receipt for the patient. Generation of a hard-copy record of the transaction. Calculation of total prescription cost. Maintenance of a perpetual inventory record. Accumulation of suggested orders based on suggested order quantities. Automatic ordering of required inventory via electronic transmission. Calculation and storage of annual withholding statements.

(2) Managerial: Preparation of daily sales reports. Generation of complete sales analyses as required, for a day, week, month, year, and year-to-date, covering the number of prescriptions handled and the amount in cash. Estimation of profit and financial ratio analysis. Production of drug usage reports. Calculation of gross margin, reported in all manner of detail. Calculation of the number of prescriptions handled per unit time, to help in staff scheduling. Printing of billing and payment summaries.

(3) Professional: Building a patient profile. Storing information on drug and other allergies to warn about possible problems. Retrieval of the current drug regimen for review. Updating of patient information on file. Printing of drug–drug and drug–food interactions. Maintaining a physicians file, including specialty, designation, address, phone, office hours, etc.

(4) Clinical support: Patient medication profile; patient education profile; consulting pharmacist activities; drug utilization monitoring.
DRUG INFORMATION RETRIEVAL AND STORAGE

A complete search of the drug information is necessary for the clinical pharmacist so as to satisfy queries about pharmacology, drug interactions, adverse drug reactions, toxicology, etc. This job of searching can be simplified by using computers.

In 1964, the National Library of Medicine created a computerised medical information retrieval system, MEDLARS (Medical Literature Analysis and Retrieval System). In 1971, it developed a faster on-line system, MEDLINE (MEDLARS ON-LINE).

Computerised information retrieval has the following advantages over a manual search:
(i) It is time saving and pleasant.
(ii) It is more thorough and timely than a manual search.

To operate the information retrieval system, the equipment needed includes a microcomputer, a printer, a telephone line, and a modem.

For information retrieval, the choice of a database is also very important. The databases may be:
(a) Bibliographic databases,
(b) Journal information, and
(c) Textbook material.

Generally, a bibliographic database is adopted, as usually there is a medical library nearby from where one can get the articles. A medicine-oriented database, like MEDLINE, or a pharmacy-oriented one, like International Pharmaceutical Abstracts, may be chosen. Some on-line databases of the medical and pharmaceutical literature are shown in Table 15.1.

Table 15.1

Database | Produced by | Data
1. Medline | National Library of Medicine (NLM) | Around 3000 biomedical journals dating back to 1966
2. Toxicology Data Bank (TDB) | (not stated) | Toxicological data
3. International Pharmaceutical Abstracts | American Society of Hospital Pharmacists | More than 600 publications from 1970 onwards are covered
4. Biosis | Bioscience Information Service | Biological Abstracts
Some other databases available on-line are:
(i) MEDLARS (Medical Literature Analysis and Retrieval System): the earliest effort to computerize medical information retrieval resulted in the creation of MEDLARS by the NLM.
(ii) MEDLINE (MEDLARS ON-LINE).
(iii) NLM (National Library of Medicine).
(iv) MICROMEDEX: a microcomputer-based retrieval system that uses a laser disk at the site for storage of data.
(v) International Pharmaceutical Abstracts (I.P.A.): available on-line and in a print version.
(vi) Pharmaceutical News Index (PNI): contains current news about devices, cosmetics, and related health industries. It provides newsletters which are not covered in print abstracts and indexes. It is updated weekly.

Advantages: The most important advantage is the time saved in conducting literature searches. A pharmacist may require several hours to research a particular therapeutic question from a literature search covering about 10 years of articles; by computer it can be done in minutes, and the search is more pleasant. Only a few seconds are required to broaden a computer search from a specific drug to an entire therapeutic class, whereas manually it is a tedious job to search the INDEX MEDICUS.

A pharmacist should be able to retrieve orders for review prior to administration to the patient. Data may be entered by use of codes for drug name and dosing schedule. All drug orders should contain the following:
• Drug name (code)
• Drug generic name and strength
• Route of administration
• Dosage schedule
• Starting date
• Stopping date
• Physician code
• Pharmacist verification code

Features: The system should be capable of rapidly collecting and displaying entries by patient name or room number. It should be capable of scheduling starting and stopping dates automatically and of separating the orders of patients on active drug therapy from those with no drug therapy.
(3) Preparation of Lists

The computer system should be capable of producing labels and reports in the form of:
(i) The patient medication profile.
(ii) "Fill lists" for the preparation of individual doses.
(iii) A list of medication changes.
(iv) Drug order renewal lists for the prescriber.
(v) The medication administration record (MAR).

Orders entered for IV solutions, admixtures, and total parenteral nutrition (TPN) should be prepared separately. These orders should contain the following information:
• "Patient" and "medication order" identifying information.
• Start date and stop date.
• Administration rate.
• Order status (conditional or active).

The system should be capable of calculating flow rates and checking incompatibilities. It should allow conditional entry and its checking by a pharmacist. The system should be able to prepare lists of solutions soon to expire and of solutions to be prepared at the pharmacy.
CHAPTER 23
Applications of Computers in Pharmaceutical and Clinical Studies

23.1 Introduction
Nowadays computers are used in pharmaceutical industries, hospitals, and various departments for drug information, education, evaluation, analysis, medication history, and the maintenance of financial records. They have become indispensable in the development of clinical pharmacy, hospital pharmacy, and pharmaceutical research, and they co-ordinate effective communication and support clinical and financial management functions.

The effective functioning of any organization largely depends upon a continuous flow of information: receiving information, storing it, processing it, and disseminating it. An effective management information system provides the needed information in the right form, at the right time, and at the right place. Each piece of information in an organization is connected with the others through communication channels, making the organizational entity a decision-making point.

Computers play an effective role in the retrieval of information. In hospitals, data management involves creating, modifying, adding, and deleting data in patient files to generate reports, which many doctors, connected through personal computers, share for further investigation. Some popular database management system (DBMS) packages for personal computers are dBase III+ and FoxBase+. Computers thus help in maintaining the overall health care system, as can best be illustrated by listing their applications.

23.2 Patient Monitoring
Patient monitoring includes the monitoring of physiological processes in patients, such as blood pressure, pulse rate, and temperature. This information plays a special role in the detection and prevention of critical conditions: it gives warning of critical conditions for immediate nursing attention, enables the medical staff to make accurate judgments of the patient's progress, and in the long term provides data for research purposes in monitoring patients under intensive care.

Computers therefore play an important role in communication, acquiring data about the patient's metabolism and communicating it to the medical staff by displaying graphs. Detecting critical conditions and generating alarms involves both numerical and logical data processing. Computers can also control a number of instruments simultaneously to obtain samples of body fluids and have them analyzed by an auto-analyser for their physical and chemical parameters.

After the parameters are analyzed, "AND/OR" statements express a logical relationship, whereas "IF...THEN" marks a conditional computation. This combination of logical and conditional data processing makes the patient monitoring system a decision-making instrument for the interpretation of results.
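This combination of logical and conditional processing can be pictured with a small Python sketch; the vital-sign thresholds and the rule itself are invented for illustration and are not taken from any monitoring product.

# Hypothetical patient-monitoring rule combining logical (and/or) and
# conditional (if ... then) processing to raise an alarm.
def check_vitals(bp_systolic, pulse, temp_c):
    hypotensive = bp_systolic < 90     # illustrative thresholds only
    tachycardic = pulse > 120
    febrile     = temp_c > 39.0

    # AND/OR statements express the logical relationship ...
    critical = hypotensive or (tachycardic and febrile)

    # ... and IF ... THEN marks the conditional computation.
    if critical:
        return "ALARM: critical condition - notify nursing staff"
    return "stable"

print(check_vitals(bp_systolic=85, pulse=110, temp_c=38.2))   # ALARM
print(check_vitals(bp_systolic=120, pulse=80, temp_c=37.0))   # stable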
23.3 Medication Monitoring
To meet the goal of optimum drug therapy, medication monitoring is essential. The prescriptions a patient receives over a period of time are entered into the computer, which then serves as a chronological drug file of the patient and helps in suggesting drugs along with their dosage schedules. Computers provide two types of information: (a) pharmacokinetic and (b) non-pharmacokinetic.

Pharmacokinetic information: NONLIN is a computer program which can predict pharmacokinetic parameters very easily. These parameters include volume of distribution, bioavailability, rate of clearance, etc. It helps in maintaining the dosage schedules of various drugs, such as antibiotics and aminoglycosides.

Non-pharmacokinetic information: This includes allergic reactions, drug interactions, adverse drug reactions, etc. For such information two computer programs are available:
1. MEDIPHOR (Monitoring and Evaluation of Drug Interactions by a Pharmacy-Oriented Reporting system)
2. PAD (Pharmacy Automated Drug interaction screening)

23.4 Maintenance of Records
Various records, such as the patient's medication history, current treatment, and financial records, are maintained in computers by feeding in accurate data: 'DATA' is a collection of facts, and the computer works as a 'DATABASE' manager. MEDLINE is a database package used for this purpose. It gives current information on patients regarding name, age, sex, room number, weight, allergic reactions, etc. These records are stored in a 'FILE', such as a "Physician name" file, a "Direction" file, or a "Drug interaction" file. These files contain specific information, such as the physician's name, registration number, phone number, and address, and provide it whenever required.

23.5 Materials Management
Computers play a vital role in material planning, purchasing, inventory control, and the forecasting of prices. Inventory control is essential because it maintains the balance between stock in hand and excessive capital investment. Techniques such as ABC analysis and EOQ can easily be programmed, eliminating the tedious and time-consuming task of calculation. Computers are used to detect the items which have reached their minimum order level; the computer then prepares a list and purchase orders for further supplies. Generally there are two systems of inventory control:
(a) Periodic inventory control system: stock levels are checked manually, and the amount of inventory in hand is compared with the minimum and maximum stock maintained in the computer. The computer helps in the placement of orders with different suppliers, after checking their terms and conditions, because all the stock entries are present in it.
(b) Perpetual system: the computer reports the present position of all drugs because, when drugs are received, they are added to the initial stock to give the current stock. When drugs are delivered to the various departments, the quantities are subtracted accordingly. Such additions to and deletions from the inventory balance are made with the help of a database package.

The output from the computer may be obtained in various forms, such as:
• Planning of materials
• Drug formulary
• Vendor details for procurement
• Tender rates and analysis
• Pending supply orders
• Inventory analysis
• Reorder points
• Safety stocks
• Ledger for narcotics
• Over/under stocking
• Slow-moving/fast-moving items
• Expired drugs

23.6 Data Storage and Retrieval
The hospital administration computer helps in rapid data storage and retrieval, particularly when the stored data are subject to infrequent changes and when groups of items based on the stored data need to be retrieved. The admission of in-patients and their discharge from a hospital involve data which change every minute; for example, the admission of an in-patient ties up resources such as clinical and nursing staff, a bed, the operation theatre, the intensive care unit, the pharmacy department, radiological services, etc. Hence the decision to admit a new patient is not a simple one. Even the availability of a suitable bed is difficult to determine across male and female wards, isolation wards, etc. A prediction must be made that a suitable bed will be available at a future date, because if the estimate is over-optimistic, patients who are called in may be turned away at the last minute, while if the prediction is over-pessimistic, expensive resources lie idle and the waiting period for treatment is extended.

Once the patient is admitted, the computer records and stores information such as clinical information, catering information, diagnosis, sex, medication, etc. It helps in providing detailed information about the medical and paramedical staff, including their duty charts, and helps senior personnel keep a check on the ward-by-ward loading of nursing staff and allocate additional help whenever required.

23.7 Diagnostic Laboratories
Computers meet the growing demands on testing laboratories: manual procedures were lengthy and time consuming, whereas automated computerized instruments perform a number of tasks with accuracy. Generally a LIS (Laboratory Information System) is used to manage the large amount of data. Instruments contain preprocessors which convert raw data into digital format and help in transmitting numerical values for report generation. The LIS also performs administrative and managerial functions, including specimen tracking, product analysis, and quality control. Similarly, many instruments have microprocessors that facilitate all phases of the testing process, from calibration of the instrument to reporting of results.

The development of powerful computers also offers the opportunity for improved viewing and interpretation in the radiology department, where many of the latest imaging techniques, such as computerized tomography (CT) and magnetic resonance imaging (MRI), are inherently digital. Here the computer creates a "functional image" by performing complex calculations on measured data.

23.8 Pharmaceutical Education
Computer-aided instruction helps in overcoming the shortcomings of traditional teaching methods. It provides a medium for interactive learning, offers immediate student-specific feedback, supports individually tailored instruction, and finally forms a basis for objective testing.

23.9 Hospital Pharmacy and Retail Pharmacy
Computers are used in pharmacies to maintain accessible, legible, and up-to-date medication records. They help in overall patient care by maintaining patient records, drug consumption, registration numbers, and detailed records of the accounts and purchase sections. For the retail pharmacist too, computers are of valuable assistance in prescription processing, including the display of information about the patient and the drug, adverse drug reactions, causation, duplication of orders, labeling conditions, etc.

Other applications in hospital and retail pharmacy are:
• Calculation of monthly gross income
• Generating pay slips
• Updating employee information
• Placement of supply orders
• Keeping track of total payments and amounts due to suppliers
• Checking the quality and quantity of hospital supplies received and identifying any discrepancies
• Recording purchases for accounting purposes

A number of computer programs have been developed to assist physicians in dosing and scheduling drugs. There are certain drugs to which particular patients are extremely sensitive; for such patients, physicians use computer programs to forecast drug levels and to choose the amounts and schedule of drug doses that will achieve the target level. Similarly, 'HELP' is a system which identifies abnormal chemistry levels, concurrent diseases, and other related patient conditions.

23.10 Hospital Setting
The duties of the pharmacist have been changing tremendously, and it has become impossible to remember and recheck everything. Therefore the computer manages the hospital systems and allows the pharmacist to check the work. Software is available for the pharmacist to provide professional services and to automate the work of the technical staff. This ultimately results in efficient and cost-effective operation and further maximizes the clinical and patient-oriented functions of pharmacists.

23.11 Patient Counselling
Computers play an important role in patient counselling. Sophisticated software is available to educate patients by producing patient education leaflets. These leaflets provide information about the name of the medication, its uses, side effects, precautions, drug interactions, missed doses, storage, how to take the medication, etc. It should always be kept in mind that the computer program should be an intelligent one, so that it does not do harm by giving the patient too much or too little information.

23.12 Drug Interactions
Pharmacists cannot remember every medication, its therapeutic usage, its effects, and its drug interactions. Computers therefore offer knowledge-base systems to extend professional services. A computerized pharmacy can alert the physician or pharmacist to serious drug–drug, drug–food, and drug–disease interactions which are likely to occur with a prescription. Examples of such databases and online services are MEDLINE, IDIS, and PharmLine.

23.13 Community Pharmacy
Computers help in streamlining the refilling of prescriptions, ending the long-standing problem of waiting in a queue for a refill. They are becoming popular because they remind the patient about refilling and compliance with medication. These systems not only help in filling individual prescriptions but also in processing prescriptions in the right manner, and they enable the community pharmacy to manage inventory, sales, accounts, etc.

23.14 Drug Information Services
Various software, Internet, intranet, and online services are available for the pharmacist to provide a drug information service to medical and paramedical professionals and to patients. Computer-aided drug design helps the chemist to formulate a new drug molecule possessing the desired therapeutic action; these new drug entities can be generated through graphics and by changing the molecular configuration. CD-ROM technology has helped greatly in the evolution of compact electronic libraries. Various software programs are available from different companies.
Computerized Recordkeeping

Regulations relating to the storage and retrieval of prescription records include computerized recordkeeping. This administrative regulation provides standards for those desiring to use computerized recordkeeping.

Section 1. The following information shall be entered into the system:


(1) All information pertinent to a prescription shall be entered into the system, including, but
not limited to, each of the following:
(a) The prescription number;
(b) The patient’s name and address;
(c) The prescriber’s name and address;
(d) The prescriber’s Federal Drug Enforcement Administration number, if appropriate;
(e) Refill authorization;
(f) Any prescriber’s instructions or patient’s preference permitted by law or administrative
regulation;
(g) The name, strength, dosage form, and quantity of the drug dispensed originally and upon
each refill; and

(h) The date of dispensing of the prescription and the identifying designation of the dispensing pharmacist for the original filling and each refill.

(2) The entries shall be made into the system at the time the prescription is first filled and at the time of each refill, except that the format of the record may be organized so that the data already entered may appear for the prescription or refill without reentering that data. Records that are received or sent electronically may be kept electronically. The dispensing pharmacist shall be responsible for the completeness and accuracy of the entries.


(3) The original prescription and a record of each refill, if received written or oral, shall be preserved as a hard copy for a period of three (3) years and thereafter be preserved as a hard copy or electronically for no less than an additional two (2) years. The original prescription and a record of each refill, if received by facsimile, shall be preserved as a hard copy, the original electronic image, or electronically for a period of three (3) years and thereafter be preserved as a hard copy, the original electronic image, or electronically for no less than an additional two (2) years. The original and electronic prescription shall be subject to inspection by authorized agents. An original prescription shall not be obstructed in any manner.
(4) The original prescription and a record of each refill, if received as an e-prescription, shall be preserved electronically for a period of no less than five (5) years. The electronic prescription shall be subject to inspection by authorized agents. An original prescription shall not be obstructed in any manner.
(5) The required information shall be entered into the system for all prescriptions filled at the
pharmacy.
(6) The system shall provide adequate safeguards against improper manipulation or alteration
of the data.
(7) The system shall have the capability of producing a hard-copy printout of all original and
refilled prescription data as required in Section 1 of this administrative regulation. A hard-copy
printout of the required data shall be made available to an authorized agent within forty-eight
(48) hours of the receipt of a written request.
(8) The system shall maintain a record of each day’s prescription data as follows:
(a) This record shall be verified, dated, and signed by the pharmacist(s) who filled those
prescription orders either:
1. Electronically;
2. Manually; or
3. In a log.
(b) This record shall be maintained for no less than five (5) years; and
(c) This record shall be readily retrievable and shall be subject to inspection by authorized
agents.
(9) An auxiliary recordkeeping system shall be established for the documentation of refills if
the automated data processing system is inoperative for any reason. The auxiliary system shall
insure that all refills are authorized by the original prescription order and that the maximum
number of refills is not exceeded. If the automated data processing system is restored to
operation, the information regarding prescriptions filled and refilled during the inoperative
period shall be entered into the automated data processing system within seventy-two (72) hours.
(10) Controlled substance data shall be identifiable apart from other items appearing in the
record.
(11) The pharmacist shall be responsible to assure continuity in the maintenance of records
throughout any transition in record systems utilized.

Section 2. A computer malfunction or a data processing service provider's negligence shall not be a defense against charges of improper recordkeeping.

Section 3. This administrative regulation is not applicable to the recordkeeping for drugs prescribed for and administered to patients confined as inpatients in an acute care facility.
