6 SAMPLE SIZE AND POWER
The question of the size of the sample, that is, the number of observations to be used in a
scientific experiment, is of extreme importance. Virtually every experiment raises the question of
sample size.
Particularly when time and cost are critical factors, one wishes to use the minimum sample size
to achieve the experimental objectives. Even when time and cost are less crucial, the scientist
wishes to have some idea of the number of observations needed to yield sufficient data to answer
the objectives. An elegant experiment will make the most of the resources available, resulting
in a sufficient amount of information from a minimum sample size. For simple comparative
experiments, where one or two groups are involved, the calculation of sample size is relatively
simple. A knowledge of the α level (level of significance), the β level (1 − power), the standard
deviation, and a meaningful "practically significant" difference is necessary in order to calculate
the sample size.
Power is defined as 1 − β (i.e., β = 1 − power). Power is the ability of a statistical test to
show significance if a specified difference truly exists. The magnitude of power depends on the
level of significance, the standard deviation, and the sample size. Thus power and sample size
are related.
In this chapter, we present methods for computing the sample size for relatively simple
situations for normally distributed and binomial data. The concept and calculation of power
are also introduced.
6.1 INTRODUCTION
The question of sample size is a major consideration in the planning of experiments, but may not
be answered easily from a scientific point of view. In some situations, the choice of sample size
is limited. Sample size may be dictated by official specifications, regulations, cost constraints,
and/or the availability of sampling units such as patients, manufactured items, animals, and
so on. The USP content uniformity test is an example of a test in which the sample size is fixed
and specified [1].
The sample size is also specified in certain quality control sampling plans such as those
described in MIL-STD-105E [2]. These sampling plans are used when sampling products for
inspection for attributes such as product defects, missing labels, specks in tablets, or ampul leak-
age. The properties of these plans have been thoroughly investigated and defined as described
in the document cited above. The properties of the plans include the chances (probability) of
rejecting or accepting batches with a known proportion of rejects in the batch (sect. 12.3).
Sample-size determination in comparative clinical trials is a factor of major importance.
Since very large experiments will detect very small, perhaps clinically insignificant, differences
as being statistically significant, and small experiments will often find large, clinically significant
differences as statistically insignificant, the choice of an appropriate sample size is critical in the
design of a clinical program to demonstrate safety and efficacy. When cost is a major factor in
implementing a clinical program, the number of patients to be included in the studies may be
limited by lack of funds. With fewer patients, a study will be less sensitive. Decreased sensitivity
means that the comparative treatments will be relatively more difficult to distinguish statistically
if they are, in fact, different.
The problem of choosing a “correct” sample size is related to experimental objectives and
the risk (or probability) of coming to an incorrect decision when the experiment and analysis
are completed. For simple comparative experiments, certain prior information is required in
order to compute a sample size that will satisfy the experimental objectives. The following
considerations are essential when estimating sample size.
1. The α level must be specified; it determines, in part, the difference needed to represent a
statistically significant result. To review, the α level is defined as the risk of concluding that
treatments differ when, in fact, they are the same. The level of significance is usually (but
not always) set at the traditional value of 5%.
2. The β error must be specified for some specified treatment difference, Δ. Beta, β, is the risk
(probability) of erroneously concluding that the treatments are not significantly different
when, in fact, a difference of size Δ or greater exists. The assessment of β and Δ, the
"practically significant" difference, prior to the initiation of the experiment, is not easy.
Nevertheless, an educated guess is required. β is often chosen to be between 5% and 20%.
Hence, one may be willing to accept a 20% (1 in 5) chance of not arriving at a statistically
significant difference when the treatments are truly different by an amount equal to (or
greater than) Δ. The consequences of committing a β error should be considered carefully.
If a true difference of practical significance is missed and the consequence is costly, β should
be made very small, perhaps as small as 1%. Costly consequences of missing an effective
treatment should be evaluated not only in monetary terms, but should also include public
health issues, such as the possible loss of an effective treatment in a serious disease.
3. The difference to be detected, Δ (that difference considered to have practical significance),
should be specified as described in (2) above. This difference should not be arbitrarily or
capriciously determined, but should be considered carefully with respect to meaningfulness
from both a scientific and commercial marketing standpoint. For example, when comparing
two formulas for time to 90% dissolution, a difference of one or two minutes might be
considered meaningless. A difference of 10 or 20 minutes, however, may have practical
consequences in terms of in vivo absorption characteristics.
4. A knowledge of the standard deviation (or an estimate) for the significance test is necessary.
If no information on variability is available, an educated guess, or results of studies reported
in the literature using related compounds, may be sufficient to give an estimate of the
relevant variability. The assistance of a statistician is recommended when estimating the
standard deviation for purposes of determining sample size.
To compute the sample size in a comparative experiment, (a) α, (b) β, (c) σ, and (d) Δ
must be specified. The computations to determine sample size are described below (Fig. 6.1).
Figure 6.1 Scheme to demonstrate calculation of sample size based on α, β, σ, and Δ: α = 0.05, β = 0.10,
Δ = 5, σ = 7; H0: Δ = 0, Ha: Δ = 5.
Z = (δ − Δ)/(σ/√N) = (δ − 0)/(7/√N).   (6.1)

For a two-tailed test, if the absolute value of Z is 1.96 or greater, the difference is significant.
According to Eq. (6.1), to obtain significance,

δ ≥ Zσ/√N = 1.96(7)/√N = 13.7/√N.   (6.2)

Therefore, values of δ equal to or greater than 13.7/√N (or equal to or less than −13.7/√N)
will lead to a declaration of significance. These points are designated as δL and δU in Figure 6.1,
and represent the cutoff points for statistical significance at the 5% level; that is, observed
differences equal to or more remote from the mean than these values result in "statistically
significant differences."
If curve B is the true distribution (i.e., Δ = 5), an observed mean difference greater than
13.7/√N (or less than −13.7/√N) will result in the correct decision; H0 will be rejected and we
conclude that a difference exists. If Δ = 5, observations of a mean difference between 13.7/√N
and −13.7/√N will lead to an incorrect decision, the acceptance of H0 (no difference) (Fig. 6.1).
By definition, the probability of making this incorrect decision is equal to β.
In the present example, β will be set at 10%. In Figure 6.1, β is represented by the area in
curve B below 13.7/√N (δU), equal to 0.10. (This area, β, represents the probability of accepting
H0 if Δ = 5.)
We will now compute the value of δ that cuts off 10% of the area in the lower tail of the nor-
mal curve with a mean of 5 and a standard deviation of 7 (curve B in Fig. 6.1). Table IV.2 shows
that 10% of the area in the standard normal curve is below −1.28. The value of δ (mean difference
in blood pressure between the two groups) that corresponds to a given value of Z (−1.28, in this
example) is obtained from the formula for the Z transformation [Eq. (3.14)] as follows:

δ = Δ + Zσ/√N

Z = (δ − Δ)/(σ/√N).   (6.3)

Applying Eq. (6.3) to our present example, δ = 5 − 1.28(7/√N). The value of δ in Eqs. (6.2)
and (6.3) is identical, equal to δU. This is illustrated in Figure 6.1.
* Δ is considered to be the true mean difference (a population parameter); δ will be used to denote the observed mean difference.
From Eq. (6.2), δU = 13.7/√N, satisfying the definition of α. From Eq. (6.3), δU = 5 −
1.28(7)/√N, satisfying the definition of β. We have two equations in two unknowns (δU and N),
and N is evaluated as follows:

13.7/√N = 5 − 1.28(7)/√N

N = (13.7 + 8.96)²/5² = 20.5 ≅ 21.
In general, Eqs. (6.2) and (6.3) can be solved for N to yield the following equation:

N = (σ/Δ)²(Zα + Zβ)²,   (6.4)

where Zα and Zβ are the appropriate normal deviates obtained from Table IV.2. In our example,
N = (7/5)²(1.96 + 1.28)² ≅ 21. A sample size of 21 will result in a statistical test with 90% power
(β = 10%) against an alternative of 5, at the 5% level of significance. Table 6.1 shows how the
choice of β can affect the sample size for a test at the 5% level with Δ = 5 and σ = 7.
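Equation (6.4) is simple to program. The following minimal sketch in Python (our own illustration, not part of the original text; the function name and the use of scipy's normal quantiles are assumptions) reproduces the calculation above:

    from math import ceil
    from scipy.stats import norm

    def sample_size_normal(sigma, delta, alpha=0.05, beta=0.10):
        """Eq. (6.4): N = (sigma/delta)^2 * (Z_alpha + Z_beta)^2 for a
        paired or one-sample test with known standard deviation."""
        z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for two-sided alpha = 0.05
        z_beta = norm.ppf(1 - beta)         # 1.28 for beta = 0.10 (one tailed)
        return (sigma / delta) ** 2 * (z_alpha + z_beta) ** 2

    print(sample_size_normal(7, 5))         # 20.6; round up to 21

Rounding up to the next whole observation preserves at least the specified power.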
The formula for computing the sample size if the standard deviation is known [Eq. (6.4)]
is appropriate for a paired-sample test or for the test of a mean from a single population. For
example, consider a test to compare the mean drug content of a sample of tablets to the labeled
amount, 100 mg. The two-sided test is to be performed at the 5% level. Beta is designated as
10% for a difference of −5 mg (95 mg potency or less). That is, we wish to have a power of 90%
to detect a difference from 100 mg if the true potency is 95 mg or less. If σ is equal to 3, how
many tablets should be assayed? Applying Eq. (6.4), we have

N = (3/5)²(1.96 + 1.28)² = 3.8.

Assaying four tablets will satisfy the α and β probabilities. Note that Z = 1.28 cuts off 90%
of the area under curve B (the "alternative" curve) in Figure 6.2, leaving 10% (β) of the area in
the upper tail of the curve. Table 6.2 shows values of Zα and Zβ for various levels of α and β
to be used in Eq. (6.4). In this example, and in most examples in practice, β is based on one tail of
the normal curve. The other tail contains an insignificant area relating to β (the right side of the
normal curve, B, in Fig. 6.1).
Equation (6.4) is correct for computing the sample size for a paired- or one-sample test if
the standard deviation is known.
In most situations, the standard deviation is unknown and a prior estimate of the standard
deviation is necessary in order to calculate sample size requirements. In this case, the estimate
of the standard deviation replaces σ in Eq. (6.4), but the calculation results in an answer that is
slightly too small. The underestimation occurs because the values of Zα and Zβ are smaller than
the corresponding t values that should be used in the formula when the standard deviation is
unknown. The situation is somewhat complicated by the fact that the value of t depends on the
sample size (d.f.), which is yet unknown. The problem can be solved by an iterative method,
but for practical purposes, one can use the appropriate values of Z to compute the sample size
[as in Eq. (6.4)] and add on a few extra samples (patients, tablets, etc.) to compensate for the
use of Z rather than t. Guenther has shown that the simple addition of 0.5Zα², which is equal
to approximately 2 for a two-sided test at the 5% level, results in a very close approximation to
the correct answer [3]. In the problem illustrated above (tablet assays), if the standard deviation
were unknown but estimated as being equal to 3 based on previous experience, a better estimate
of the sample size would be N + 0.5Zα² = 3.8 + 0.5(1.96)² ≅ 6 tablets.

Figure 6.2 Illustration of the calculation of N for tablet assays: X̄ = 95 + Zβσ/√N = 100 − Zασ/√N.
For a test comparing the means of two independent groups, the analogous formula for the
sample size of each group is

N = 2(σ/Δ)²(Zα + Zβ)².   (6.5)

If the standard deviation is unknown and a prior estimate is available (s.d.), substitute the
s.d. for σ in Eq. (6.5) and compute the sample size, but add on 0.25Zα² to the sample size for each
group.
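When only an estimate s is available, Guenther's correction can be folded into the same computation. A minimal sketch under the same assumptions as the previous one (the function name is ours):

    from math import ceil
    from scipy.stats import norm

    def sample_size_estimated_sd(s, delta, alpha=0.05, beta=0.10,
                                 two_groups=False):
        """Eq. (6.4) or (6.5) with s replacing sigma, plus Guenther's
        correction: add 0.5*Z_alpha^2 (one sample/paired) or
        0.25*Z_alpha^2 per group (two independent groups)."""
        z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(1 - beta)
        n = (2 if two_groups else 1) * (s / delta) ** 2 * (z_a + z_b) ** 2
        n += (0.25 if two_groups else 0.5) * z_a ** 2
        return ceil(n)

    print(sample_size_estimated_sd(3, 5))   # tablet assay example: 6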
Example 1: This example illustrates the determination of the sample size for a two-
independent-groups (two-sided test) design. Two variations of a tablet formulation are to be compared
with regard to dissolution time. All ingredients except for the lubricating agent were the same
in these two formulations. In this case, a decision was made that if the formulations differed by
10 minutes or more in time to 80% dissolution, it would be extremely important that the experiment
show a statistically significant difference between the formulations. Therefore, the pharmaceu-
tical scientist decided to fix the β error at 1% in a statistical test at the traditional 5% level. Data
were available from dissolution tests run during the development of formulations of the drug,
and the standard deviation was estimated as 5 minutes. With the information presented above,
the sample size can be determined from Eq. (6.5). We will add on 0.25Zα² samples to the answer
because the standard deviation is unknown.

N = 2(5/10)²(1.96 + 2.32)² + 0.25(1.96)² = 10.1.
The study was performed using 12 tablets from each formulation rather than the 10 or
11 suggested by the answer in the calculation above. Twelve tablets were used because the
dissolution apparatus could accommodate six tablets per run.
Example 2: A bioequivalence study was being planned to compare the bioavailability of a
final production batch to a previously manufactured pilot-sized batch of tablets that were made
for clinical studies. Two parameters resulting from the blood-level data would be compared:
area under the plasma level versus time curves (AUC) and peak plasma concentration (Cmax ).
The study was to have 80% power (β = 0.20) to detect a difference of 20% or more between the
formulations. The test is done at the usual 5% level of significance. Estimates of the standard
deviations of the ratios of the values of each of the parameters [(final product)/(pilot batch)]
were determined from a small pilot study. The standard deviations were different for the
parameters. Since the researchers could not agree that one of the parameters was clearly critical
in the comparison, they decided to use a “maximum” number of patients based on the variable
with the largest relative variability. In this example, Cmax was most variable, the ratio having a
standard deviation of approximately 0.30. Since the design and analysis of the bioequivalence
study is a variation of the paired t test, Eq. (6.4) was used to calculate the sample size, adding
on 0.5Zα², as recommended previously:

N = (σ/Δ)²(Zα + Zβ)² + 0.5Zα²
  = (0.3/0.2)²(1.96 + 0.84)² + 0.5(1.96)² = 19.6.   (6.6)
Twenty subjects were used for the comparison of the bioavailabilities of the two formula-
tions.
For sample-size determination for bioequivalence studies using FDA recommended
designs, see Table 6.5 and section 11.4.4.
Sometimes the sample sizes computed to satisfy the desired α and β errors can be inordi-
nately large when time and cost factors are taken into consideration. Under these circumstances,
a compromise must be made, most easily accomplished by relaxing the α and β requirements
(Table 6.1). The consequence of this compromise is that the probabilities of making an incorrect
decision based on the statistical test will be increased. Other ways of reducing the required
sample size are (a) to increase the precision of the test by improving the assay methodology
or carefully controlling extraneous conditions during the experiment, for example, or (b) to
compromise by increasing Δ, that is, accepting a larger difference as the one considered to be of
practical importance.
Table 6.3 gives the sample size for some representative values of the ratio s/Δ, α, and β,
where the s.d. (s) is estimated.
Table 6.3 Sample Size Needed for a Two-Sided t Test with the Standard Deviation Estimated

                One-sample test                         Two-sample test (N per group)
            α = 0.05              α = 0.01              α = 0.05              α = 0.01
 s/Δ   β: 0.01 0.05 0.10 0.20   0.01 0.05 0.10 0.20   0.01 0.05 0.10 0.20   0.01 0.05 0.10 0.20
 4.0      296  211  170  128    388  289  242  191    588  417  337  252    770  572  478  376
 2.0       76   54   44   34    100   75   63   51    148  106   86   64    194  145  121   96
 1.5       44   32   26   20     58   44   37   30     84   60   49   37    110   82   69   55
 1.0       21   16   13   10     28   22   19   16     38   27   23   17     50   38   32   26
 0.8       14   11    9    8     19   15   13   11     25   18   15   12     33   25   21   17
 0.67      11    8    7    6     15   12   11    9     18   13   11    9     24   18   15   13
 0.5        7    6    5    4     10    8    8    7     11    8    7    6     14   11   10    8
 0.4        6    5    4    4      8    7    6    6      8    6    5    4     10    8    7    6
 0.33       5    4    4    3      7    6    6    5      6    5    4    4      8    6    6    5
The sample size needed for each of two groups when comparing two proportions is given
by Eq. (6.8):

N = (Zα + Zβ)²(p1q1 + p2q2)/Δ²,   (6.8)

where Δ = p1 − p2; p1 and p2 are prior estimates of the proportions in the experimental groups,
and q = 1 − p. The values of Zα and Zβ are the same as those used in the formulas for the normal curve or
t tests. N is the sample size for each group. If it is not possible to estimate p1 and p2 prior to
the experiment, one can make an educated guess of a meaningful value of Δ and set p1 and p2
both equal to 0.5 in the numerator of Eq. (6.8). This will maximize the sample size, resulting in a
conservative estimate of sample size.
Fleiss [4] gives a fine discussion of an approach to estimating , the practically significant
difference, when computing the sample size. For example, one approach is first to estimate
the proportion for the more well-studied treatment group. In the case of a comparative clinical
study, this could very well be a standard treatment. Suppose this treatment has shown a success
rate of 50%. One might argue that if the comparative treatment is additionally successful for 30%
of the patients who do not respond to the standard treatment, then the experimental treatment
would be valuable. Therefore, the success rate for the experimental treatment should be 50% +
0.3 (50%) = 65% to show a practically significant difference. Thus, p1 would be equal to 0.5 and
p2 would be equal to 0.65.
Example 3: A reconciliation of quality control data over several years showed that the
proportion of unacceptable capsules for a stable encapsulation process was 0.8% (p0). A sample
size for inspection is to be determined so that if the true proportion of unacceptable capsules
is equal to or greater than 1.2% (Δ = 0.4%), the probability of detecting this change is 80%
(β = 0.2). The comparison is to be made at the 5% level using a one-sided test. According to
Eq. (6.7), the analogous formula for a one-sample test,

N = (1.65 + 0.84)² [(0.008)(0.992) + (0.012)(0.988)] / [2(0.008 − 0.012)²]
  = 7670/2
  = 3835.
The large sample size resulting from this calculation is typical of that resulting from
binomial data. If 3835 capsules are too many to inspect, α, β, and/or Δ must be increased. In
the example above, management decided to increase α. This is a conservative decision in that
more good batches would be "rejected" if α is increased; that is, the increase in α results in an
increased probability of rejecting good batches, those with 0.8% unacceptable or less.
Example 4: Two antibiotics, a new product and a standard product, are to be compared
with respect to the two-week cure rate for a urinary tract infection, where a cure is bacteriological
evidence that the organism no longer appears in urine. From previous experience, the cure rate
for the standard product is estimated at 80%. From a practical point of view, if the new product
shows an 85% or better cure rate, the new product can be considered superior. The marketing
division of the pharmaceutical company felt that this difference would support claims of better
efficacy for the new product. This is an important claim. Therefore, β is chosen to be 1% (power
= 99%). A two-sided test will be performed at the 5% level to satisfy FDA guidelines. The test
is two-sided because, a priori, the new product is not known to be better or worse than the
standard. The calculation of sample size to satisfy the conditions above makes use of Eq. (6.8);
here p1 = 0.80 and p2 = 0.85.

N = [(0.80)(0.20) + (0.85)(0.15)] (1.96 + 2.32)² / (0.80 − 0.85)² = 2107.

The trial would have to include 4214 patients, 2107 on each drug, to satisfy the α and
β risks of 0.05 and 0.01, respectively. If this number of patients is greater than can be
accommodated, the β error can be increased to 5% or 10%, for example. A sample size of 1499
per group is obtained for a β of 5%, and 1207 patients per group for β equal to 10%.
Although Eq. (6.8) is adequate for computing the sample size in most situations, the
calculation of N can be improved by considering the continuity correction [4]. This is
particularly important for small sample sizes:

N′ = (N/4)[1 + √(1 + 8/(N|p2 − p1|))]²,

where N is the sample size computed from Eq. (6.8) and N′ is the corrected sample size. In the
example, for α = 0.05 and β = 0.01, the corrected sample size is

N′ = (2107/4)[1 + √(1 + 8/(2107 |0.80 − 0.85|))]² = 2186.
A confidence interval for a proportion has the form [Eq. (5.3)]

p̂ ± Z√(p̂q̂/N).

To obtain a 99% confidence interval with a width of 0.01 (i.e., to construct an interval that is
within ±0.005 of the observed proportion, p̂ ± 0.005), set

Z√(p̂q̂/N) = 0.005

or, in general, for an interval of total width W,

N = Z²(p̂q̂)/(W/2)².   (6.9)

Here,

N = (2.58)²(p̂q̂)/(0.005)².
A more exact formula for the sample size for small values of N is given in Ref. [5].
Example 5: A quality control supervisor wishes to have an estimate of the proportion of
tablets in a batch that weigh between 195 and 205 mg, where the proportion of tablets in this
interval is to be estimated within ±0.05 (W = 0.10). How many tablets should be weighed? Use
a 95% confidence interval.
To compute N, we must have an estimate of p̂ [see Eq. (6.9)]. If p̂ and q̂ are chosen to
be equal to 0.5, N will be at a maximum. Thus, if one has no inkling as to the magnitude of
the outcome, using p̂ = 0.5 in Eq. (6.9) will result in a sufficiently large sample size (probably,
too large). Otherwise, estimate p̂ and q̂ based on previous experience and knowledge. In the
present example from previous experience, approximately 80% of the tablets are expected to
weigh between 195 and 205 mg (p̂ = 0.8). Applying Eq. (6.9),

N = (1.96)²(0.8)(0.2)/(0.10/2)² = 245.9.
A total of 246 tablets should be weighed. In the actual experiment, 250 tablets were
weighed, and 195 of the tablets (78%) weighed between 195 and 205 mg. The 95% confidence
interval for the true proportion, according to Eq. (5.3), is

p̂ ± 1.96√(p̂q̂/N) = 0.78 ± 1.96√((0.78)(0.22)/250) = 0.78 ± 0.051.

The interval is slightly greater than ±5% because p̂ is somewhat less than 0.8 (pq is larger
for p = 0.78 than for p = 0.8). Although 5.1% is acceptable, to ensure a sufficient sample size, in
general, one should estimate p closer to 0.5 in order to guard against a poor prior estimate of p.
If p̂ had been chosen equal to 0.5, we would have calculated

N = (1.96)²(0.5)(0.5)/(0.10/2)² = 384.2.

As a further example, suppose that the proportion of subjects showing some rare event is
expected to be approximately 0.0012 and is to be estimated within ±0.0001 (W = 0.0002) with
99% confidence:

N = (2.58)²(0.9988)(0.0012)/(0.0002/2)² = 797,809.

The trial will have to include approximately 800,000 subjects in order to yield the desired
precision.
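A sketch of Eq. (6.9) (our own illustration; the function name is an assumption):

    from math import ceil

    def sample_size_ci(p_hat, width, z):
        """Eq. (6.9): N = Z^2 * p*q / (W/2)^2, where W is the total CI width."""
        return ceil(z ** 2 * p_hat * (1 - p_hat) / (width / 2) ** 2)

    print(sample_size_ci(0.8, 0.10, 1.96))       # 246 tablets (Example 5)
    print(sample_size_ci(0.5, 0.10, 1.96))       # 385, the conservative maximum
    print(sample_size_ci(0.0012, 0.0002, 2.58))  # ~797,810 subjects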
6.5 POWER
Power is the probability that the statistical test results in rejection of H0 when a specified
alternative is true. The “stronger” the power, the better the chance that the null hypothesis will
be rejected (i.e., the test results in a declaration of “significance”) when, in fact, H0 is false. The
larger the power, the more sensitive is the test. Power is defined as 1 − β. The larger the β error,
the weaker is the power. Remember that β is the error resulting from accepting H0 when H0 is
false. Therefore, 1 − β is the probability of rejecting H0 when H0 is false.
From an idealistic point of view, the power of a test should be calculated before an exper-
iment is conducted. In addition to defining the properties of the test, power is used to help
compute the sample size, as discussed above. Unfortunately, many experiments proceed with-
out consideration of power (or β). This results from the difficulty of choosing an appropriate
value of β. There is no traditional value of β to use, as is the case for α, where 5% is usually
used. Thus, the power of the test is often computed after the experiment has been completed.
Power is best described by diagrams such as those shown previously in this chapter
(Figs. 6.1 and 6.2). In these figures, β is the area of the curve represented by the alternative
hypothesis that is included in the region of acceptance defined by the null hypothesis.
The concept of power is also illustrated in Figure 6.3. To illustrate the calculation of power,
we will use the data presented for the test of a new antihypertensive agent (sect. 6.2), a paired-
sample test, with σ = 7 and H0: Δ = 0. The test is performed at the 5% level of significance. Let us
suppose that the sample size is limited by cost. The sponsor of the test had sufficient funds
to pay for a study that included only 12 subjects. The design described earlier in this chapter
(sect. 6.2) used 26 patients with β specified equal to 0.05 (power = 0.95). With 12 subjects,
the power will be considerably less than 0.95. The following discussion shows how power is
calculated.
The cutoff points for statistical significance (which specify the critical region) are defined
by α, N, and σ. Thus, the values of δ that will lead to a significant result for a two-sided test are
as follows:

Z = δ/(σ/√N)

δ = ±Zσ/√N = ±(1.96)(7)/√12 = ±3.96.
Values of δ greater than 3.96 or less than −3.96 will lead to the decision that the products
differ at the 5% level. Having defined the values of δ that will lead to rejection of H0, we obtain
the power for the alternative, Ha: Δ = 5, by computing the probability that an average result, δ,
will be greater than 3.96 if Ha is true (i.e., Δ = 5).
This concept is illustrated in Figure 6.3. Curve B is the distribution with mean equal to 5
and σ = 7. If curve B is the true distribution, the probability of observing a value of δ below
3.96 is the probability of accepting H0 if the alternative hypothesis is true (Δ = 5). This is the
definition of β. This probability can be calculated using the Z transformation:

Z = (3.96 − 5)/(7/√12) = −0.51.
Referring to Table IV.2, the area below +3.96 (Z = −0.51) for curve B is approximately
0.31. The power is 1 − β = 1 − 0.31 = 0.69. The use of 12 subjects results in a power of 0.69 to
"detect" a difference of +5, compared to the 0.95 power to detect such a difference when 26
subjects were used. A power of 0.69 means that if the true difference were 5 mm Hg, the statistical
test will result in significance with a probability of 69%; 31% of the time, such a test will result in
acceptance of H0.
A power curve is a plot of the power, 1 − β, versus alternative values of Δ. Power curves can
be constructed by computing β for several alternatives and drawing a smooth curve through
these points. For a two-sided test, the power curve is symmetrical around the hypothetical
mean, Δ = 0 in our example. The power is equal to α when the alternative is equal to the
hypothetical mean under H0. Thus, the power is 0.05 when Δ = 0 (Fig. 6.4).
The power curve for the present example is shown in Figure 6.4.
The following conclusions may be drawn concerning the power of a test if α is kept
constant:
1. Power increases as the sample size increases.
2. Power increases as the magnitude of the true difference, Δ, increases.
3. Power increases as the variability (σ) decreases.
A simple way to compute the approximate power of a test is to use the formula for sample
size [Eqs. (6.4) and (6.5), for example] and solve for Zβ. In the previous example, a single-sample
or paired test, Eq. (6.4) is appropriate:

N = (σ/Δ)²(Zα + Zβ)²   (6.4)

Zβ = (Δ/σ)√N − Zα.   (6.10)

Once Zβ has been calculated, the probability determined directly from Table IV.2 is equal to
the power, 1 − β. See the discussion and examples below.
In the problem discussed above, applying Eq. (6.10) with Δ = 5, σ = 7, N = 12, and
Zα = 1.96,

Zβ = (5/7)√12 − 1.96 = 0.51.

According to the notation used for Zβ (Table 6.2), β is the area above Zβ. Power is the area
below Zβ (power = 1 − β). In Table IV.2, the area above Z = 0.51 is approximately 31%. The
power is 1 − β. Therefore, the power is 69%.§

§ The value corresponding to Zβ in Table IV.2 gives the power directly. In this example, the area in the table
corresponding to a Z of 0.51 is approximately 0.69.
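Equation (6.10) translates directly into code. A minimal sketch (ours; with exact quantiles the answer is 0.70 rather than the rounded 0.69):

    from math import sqrt
    from scipy.stats import norm

    def power_paired(delta, sigma, n, alpha=0.05):
        """Power of a paired/one-sample two-sided test via Eq. (6.10):
        Z_beta = (delta/sigma)*sqrt(N) - Z_alpha; power = area below Z_beta."""
        z_beta = (delta / sigma) * sqrt(n) - norm.ppf(1 - alpha / 2)
        return norm.cdf(z_beta)

    print(power_paired(5, 7, 12))   # 0.70 (0.69 with rounded Z values)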
If N is small and the variance is unknown, appropriate values of t should be used in place
of Zα and Zβ. Alternatively, we can adjust N by subtracting 0.5Zα² or 0.25Zα² from the actual
sample size for a one- or two-sample test, respectively. The following examples should make
the calculations clearer.
Example 7: A bioavailability study has been completed in which the ratio of the AUCs for
two comparative drugs was submitted as evidence of bioequivalence. The FDA asked for the
power of the test as part of their review of the submission. (Note that this analysis is different
from that presently required by FDA.) The null hypothesis for the comparison is H0: R = 1,
where R is the true average ratio. The test was two-sided with α equal to 5%. Eighteen subjects
took each of the two comparative drugs in a paired-sample design. The standard deviation was
calculated from the final results of the study, and was equal to 0.3. The power is to be determined
for a difference of 20% for the comparison. This means that if the test product is truly more than
20% greater or smaller than the reference product, we wish to calculate the probability that the
ratio will be judged to be significantly different from 1.0. The value of Δ to be used in Eq. (6.10)
is 0.2.

Zβ = (0.2/0.3)√16 − 1.96 = 0.707.

Note that the value of N is taken as 16. This is the inverse of the procedure for determining
sample size, where 0.5Zα² was added to N. Here we subtract 0.5Zα² (approximately 2) from N:
18 − 2 = 16. According to Table IV.2, the area corresponding to Z = 0.707 is approximately 0.76.
Therefore, the power of this test is 76%. That is, if the true difference between the formulations
is 20%, a significant difference will be found between the formulations 76% of the time. This
is very close to the 80% power that was recommended before current FDA guidelines were
implemented for bioavailability tests (where β = 0.2).
Example 8: A drug product is prepared by two different methods. The average tablet
weights of the two batches are to be compared, weighing 20 tablets from each batch. The average
weights of the two 20-tablet samples were 507 and 511 mg. The pooled standard deviation was
calculated to be 12 mg. The director of quality control wishes to be "sure" that if the average
weights truly differ by 10 mg or more, the statistical test will show a significant difference. When
he was asked, "How sure?", he said 95% sure. This can be translated into a β of 5% or a power
of 95%. This is a two-independent-groups test. Solving for Zβ from Eq. (6.5), we have

Zβ = (Δ/σ)√(N/2) − Zα
   = (10/12)√(19/2) − 1.96 = 0.609.   (6.11)
As discussed above, the value of N is taken as 19 rather than 20, by subtracting 0.25Zα²
from N for the two-sample case. Referring to Table IV.2, we note that the power is approximately
73%. The experiment does not have sufficient power according to the director’s standards. To
obtain the desired power, we can increase the sample size (i.e., weigh more tablets). (See Exercise
Problem 10.)
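The two-group version of the same calculation, per Eq. (6.11), in a sketch of our own (N should already be reduced by 0.25Zα² when the standard deviation is estimated):

    from math import sqrt
    from scipy.stats import norm

    def power_two_groups(delta, sigma, n_adj, alpha=0.05):
        """Power for two independent groups via Eq. (6.11):
        Z_beta = (delta/sigma)*sqrt(N/2) - Z_alpha, N = adjusted group size."""
        z_beta = (delta / sigma) * sqrt(n_adj / 2) - norm.ppf(1 - alpha / 2)
        return norm.cdf(z_beta)

    print(power_two_groups(10, 12, 19))   # 0.73 (Example 8)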
6.6 SAMPLE SIZE AND POWER FOR MORE THAN TWO TREATMENTS
(ALSO SEE CHAP. 8)
The problem of computing power or sample size for an experiment with more than two treat-
ments is somewhat more complicated than the relatively simple case of designs with two
treatments. The power will depend on the number of treatments and the form of the null
and alternative hypotheses. Dixon and Massey [5] present a simple approach to determining
power and sample size. The following notation will be used in presenting the solution to this
problem.
Let M1, M2, M3, . . ., Mk be the hypothetical population means of the k treatments. The null
hypothesis is M1 = M2 = M3 = · · · = Mk. As for the two-sample case, we must specify alternative
values of Mi. The alternative means are expressed as a grand mean, Mt, plus or minus some
deviation, Di, where Σ(Di) = 0. For example, if three treatments are compared for pain, Active A,
Active B, and Placebo (P), the values for the alternative hypothesized means, based on a VAS scale for
pain relief, could be 75 + 10 (85), 75 + 10 (85), and 75 − 20 (55) for the two actives and placebo,
respectively. The sum of the deviations from the grand mean, 75, is 10 + 10 − 20 = 0. The power
is computed based on the following equation:
φ² = [Σ(Mi − Mt)²/k] / (S²/n),   (6.12)

where n is the number of observations in each treatment group (n is the same for each treatment)
and S² is the common variance. The value of φ² is referred to Table 6.4 to estimate the required
sample size.
Consider the following example of three treatments in a study measuring the analgesic
properties of two actives and a placebo as described above. Fifteen subjects are in each treatment
group and the variance is 1000. According to Eq. (6.12),

φ² = {[(85 − 75)² + (85 − 75)² + (55 − 75)²]/3} / (1000/15) = 3.0.
Table 6.4 gives the approximate power for various values of φ, at the 5% level, as a function
of the number of treatment groups and the d.f. for error, for three and four treatments. (More detailed
tables, in addition to graphs, are given in Dixon and Massey [5].) Here, we have 42 d.f. and three
treatments with φ = √3 = 1.73. The power is approximately 0.72 by simple linear interpolation
(42 d.f. for φ = 1.7). The correct answer with more extensive tables is closer to 0.73.
Table 6.4 Approximate Power as a Function of φ and the Error d.f. (α = 0.05, k = 4)

Error d.f.   φ     Power
10           1.4   0.48
             2.0   0.80
             2.6   0.96
20           1.4   0.56
             2.0   0.88
             2.6   0.98
30           1.4   0.59
             2.0   0.90
             2.6   >0.99
60           1.4   0.61
             2.0   0.92
             2.6   >0.99
∞            1.4   0.65
             2.0   0.94
             2.6   >0.99
Table 6.4 can also be used to determine sample size. For example, how many patients
per treatment group are needed to obtain a power of 0.80 in the above example? Applying
Eq. (6.12) with Σ(Mi − Mt)²/k = 200 and S² = 1000,

φ² = 200/(1000/n) = 0.2n.

Setting φ² = 0.2n = 4 gives φ = 2 and n = 20 (d.f. = 57); the power is then approximately 0.86
(for d.f. = 60, power = 0.86). For n = 15 (d.f. = 42, φ = √3), we have calculated (above) that the
power is approximately 0.72. A sample size of between 15 and 20 patients per treatment group
would give a power of 0.80. In this example, we might guess that 17 patients per group would
result in approximately 80% power. Indeed, more exact tables show that a sample size of 17
(φ = √(0.2 × 17) = 1.85) corresponds to a power of 0.79.
The same approach can be used for two-way designs, using the appropriate error term
from the analysis of variance.
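Modern software makes the table lookup unnecessary: the power of the one-way ANOVA can be computed exactly from the noncentral F distribution, whose noncentrality parameter is λ = nΣ(Mi − Mt)²/S² = kφ². The sketch below is our own illustration (it assumes scipy is available) and reproduces the example above:

    from scipy.stats import f, ncf

    def anova_power(means, s2, n, alpha=0.05):
        """Power of a one-way ANOVA with k groups of n observations each,
        using the noncentral F distribution (lambda = k * phi^2)."""
        k = len(means)
        grand = sum(means) / k
        lam = n * sum((m - grand) ** 2 for m in means) / s2
        dfn, dfd = k - 1, k * (n - 1)
        f_crit = f.ppf(1 - alpha, dfn, dfd)
        return 1 - ncf.cdf(f_crit, dfn, dfd, lam)

    print(anova_power([85, 85, 55], 1000, 15))   # ~0.73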
6.7 SAMPLE SIZE FOR BIOEQUIVALENCE STUDIES (ALSO SEE CHAP. 11)
In its early evolution, bioequivalence was based on the acceptance or rejection of a hypothesis
test. Sample sizes could then be determined by conventional techniques as described in section
6.2. Because of inconsistencies in the decision process based on this approach, the criterion for
acceptance was changed to a two-sided 90% confidence interval or, equivalently, two one-sided
t tests, where the null hypotheses are (μ1/μ2) ≤ 0.8 and (μ1/μ2) ≥ 1.25 versus the alternative of
0.8 < (μ1/μ2) < 1.25. This test is based on the antilog of the difference between the averages of
the log-transformed parameters (the geometric means). The test is equivalent to requiring a
two-sided 90% confidence interval for the ratio of means to fall in the interval 0.80 to 1.25 in
order to accept the hypothesis of equivalence. Again, for the currently accepted log-transformed
data, the 90% confidence interval for the antilog of the difference between means must lie
between 0.80 and 1.25; that is, 0.8 < antilog(μ̄1 − μ̄2) < 1.25, where μ̄1 and μ̄2 are the means of
the log-transformed parameters. The sample-size determination in this case is not as
simple as the conventional determination of sample size described earlier in this chapter. The
method for sample-size determination for nontransformed data has been published by Phillips
[6] along with plots of power as a function of sample size, relative standard deviation (computed
from the ANOVA), and treatment differences. Although the theory behind this computation is
beyond the scope of this book, Chow and Liu [7] give a simple way of approximating the power
and sample size. The sample size for each sequence group is approximately

N = (t(α, 2N−2) + t(β, 2N−2))² [CV/(V − δ)]²,   (6.13)

where N is the number of subjects per sequence, t the appropriate value from the t distribution, α
the significance level (usually 0.10), 1 − β the power (usually 0.8), CV the coefficient of variation,
V the bioequivalence limit, and δ the difference between products.
One would have to have an approximation of the magnitude of the required sample size
in order to approximate the t values. For example, suppose that RSD = 0.20, δ = 0.10, power is
0.8, and an initial approximation of the sample size is 20 per sequence (a total of 40 subjects).
Applying Eq. (6.13), with V = 0.20 and t values based on 38 d.f.,

N = (1.69 + 0.85)² [0.20/(0.20 − 0.10)]² = 25.8 ≅ 26 per sequence.

Use a total of 52 subjects. This agrees closely with Phillips's more exact computations.
Dilletti et al. [8] have published a method for determining sample size based on the log-
transformed variables, which is the currently preferred method. Table 6.5 showing sample sizes
for various values of CV, power, and product differences is taken from their publication.
Based on these tables, using log-transformed estimates of the parameters would result in
a sample size estimate of 38 for a power of 0.8, ratio of 0.9, and CV = 0.20. If the assumed ratio
is 1.1, the sample size is estimated as 32.
Equation (6.13) can also be used to approximate these sample sizes using log values for V
and δ: N = (1.69 + 0.85)²[0.20/(0.223 − 0.105)]² = 19 per sequence, or 38 subjects in total, where
0.223 is the log of 1.25 and 0.105 is the absolute value of the log of 0.9.
Table 6.5 Sample Sizes for Given CV, Power, and Ratio (μT/μR) for Log-Transformed Parameters

Power   CV                        Ratio (μT/μR)
(%)     (%)    0.85   0.90   0.95   1.00   1.05   1.10   1.15   1.20
70      5.0      10      6      4      4      4      4      6     16
        7.5      16      6      6      4      6      6     10     34
        10.0     28     10      6      6      6      8     16     58
        12.5     42     14      8      8      8     12     24     90
        15.0     60     18     10     10     10     16     32    128
        17.5     80     22     12     12     12     20     44    172
        20.0    102     30     16     14     16     26     56    224
        22.5    128     36     20     16     20     30     70    282
        25.0    158     44     24     20     22     38     84    344
        27.5    190     52     28     24     26     44    102    414
        30.0    224     60     32     28     32     52    120    490
80      5.0      12      6      4      4      4      6      8     22
        7.5      22      8      6      6      6      8     12     44
        10.0     36     12      8      6      8     10     20     76
        12.5     54     16     10      8     10     14     30    118
        15.0     78     22     12     10     12     20     42    168
        17.5    104     30     16     14     16     26     56    226
        20.0    134     38     20     16     18     32     72    294
        22.5    168     46     24     20     24     40     90    368
        25.0    206     56     28     24     28     48    110    452
        27.5    248     68     34     28     34     58    132    544
        30.0    292     80     40     32     38     68    156    642
90      5.0      14      6      4      4      4      6      8     28
        7.5      28     10      6      6      6      8     16     60
        10.0     48     14      8      8      8     14     26    104
        12.5     74     22     12     10     12     18     40    162
        15.0    106     30     16     12     16     26     58    232
        17.5    142     40     20     16     20     34     76    312
        20.0    186     50     26     20     24     44    100    406
        22.5    232     64     32     24     30     54    124    510
        25.0    284     78     38     28     36     66    152    626
        27.5    342     92     44     34     44     78    182    752
        30.0    404    108     52     40     52     92    214    888
For a ratio of 1.10 (log = 0.0953), the sample size is N = (1.69 + 0.85)²[0.20/(0.223 − 0.0953)]² =
16 per sequence, or 32 subjects in total.
If the difference between products is specified as zero (ratio = 1.0), the value for t(β, 2N−2)
in Eq. (6.13) should be two sided (Table 6.2). For example, for 80% power (and a large sample
size), use 1.28 rather than 0.84. In the example above, with a ratio of 1.0 (zero difference between
products), a power of 0.8, and a CV = 0.2, use a value of (approximately) 1.34 for t(β, 2N−2).
An Excel program to calculate the number of subjects required for a crossover study under
various conditions of power and product differences, for both parametric and binary (binomial)
data, is available on the disk accompanying this volume.
This approach to sample-size determination can also be used for studies where the out-
come is dichotomous, often used as the criterion in clinical studies of bioequivalence (cured or
not cured) for topically unabsorbed products or unabsorbed oral products such as sucralfate.
This topic is presented in section 11.4.8.
2 DATA GRAPHICS
“The preliminary examination of most data is facilitated by the use of diagrams. Diagrams prove
nothing, but bring outstanding features readily to the eye; they are therefore no substitute for
such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in
explaining the conclusions founded upon them.” This quote is from Ronald A. Fisher, the father
of modern statistical methodology [1]. Tabulation of raw data can be thought of as the initial
and least refined way of presenting experimental results. Summary tables, such as frequency
distribution tables, are much easier to digest and can be considered a second stage of refine-
ment of data presentation. Summary statistics such as the mean, median, variance, standard
deviation, and the range are concise descriptions of the properties of data, but much informa-
tion is lost in this processing of experimental results. Graphical methods of displaying data
are to be encouraged and are important adjuncts to data analysis and presentation. Graphical
presentations clarify and also reinforce conclusions based on formal statistical analyses. Finally,
the researcher has the opportunity to design aesthetic graphical presentations that command
attention. The popular cliché “A picture is worth a thousand words” is especially apropos to
statistical presentations. We will discuss some key concepts of the various ways in which data
are depicted graphically.
2.1 INTRODUCTION
The diagrams and plots that we will be concerned with in our discussion of statistical methods
can be placed broadly into two categories:
1. Descriptive plots are those whose purpose is to transmit information. These include dia-
grams describing data distributions such as histograms and cumulative distribution plots
(see sect. 1.2.3). Bar charts and pie charts are examples of popular modes of communicating
survey data or product comparisons.
2. Plots that describe relationships between variables usually show an underlying, but unknown
analytic relationship between the variables that we wish to describe and understand. These
relationships can range from relatively simple to very complex, and may involve only two
variables or many variables. One of the simplest relationships, but probably the one with
greatest practical application, is the straight-line relationship between two variables, as
shown in the Beer’s law plot in Figure 2.1. Chapter 7 is devoted to the analysis of data
involving variables that have a linear relationship.
When analyzing and depicting data that involve relationships, we are often presented
with data in pairs (X, Y pairs). In Figure 2.1, the optical density Y and the concentration X are
the data pairs. When considering the relationship of two variables, X and Y, one variable can
often be considered the response variable, which is dependent on the selection of the second
or causal variable. The response variable Y (optical density in our example) is known as the
dependent variable. The value of Y depends on the value of the independent variable, X (drug
concentration). Thus, in the example in Figure 2.1, we think of the value of optical density as
being dependent on the concentration of drug.
Figure 2.1 Beer’s law plot illustrating a linear relationship between two variables.
The histogram can be considered as a visual presentation of a frequency table. The frequency, or
proportion, of observations in each class interval is plotted as a bar, or rectangle, where the area
of the bar is proportional to the frequency (or proportion) of observations in a given interval.
An example of a histogram is shown in Figure 2.2, where the data from the frequency table in
Table 1.2 have been used as the data source. As is the case with frequency tables, class intervals
for histograms should be of equal width. When the intervals are of equal width, the height of
the bar is proportional to the frequency of observations in the interval. If the intervals are not of
equal width, the histogram is not easily or obviously interpreted, as shown in Figure 2.2(B).
The choice of intervals for a histogram depends on the nature of the data, the distribution
of the data, and the purpose of the presentation. In general, rules of thumb similar to that used
for frequency distribution tables (sect. 1.2) can be used. Eight to twenty equally spaced intervals
usually are sufficient to give a good picture of the data distribution.
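As a brief illustration of these rules, the sketch below (our own example with simulated weights standing in for real data; matplotlib and numpy are assumed) draws a histogram with ten equal-width intervals:

    import numpy as np
    import matplotlib.pyplot as plt

    # Simulated tablet weights standing in for a data set such as Table 1.2.
    rng = np.random.default_rng(1)
    weights = rng.normal(loc=200, scale=4, size=100)

    plt.hist(weights, bins=10, edgecolor="black")  # 8-20 equal intervals
    plt.xlabel("Tablet weight (mg)")
    plt.ylabel("Frequency")
    plt.title("Distribution of tablet weights (N = 100)")
    plt.show()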
1. A title should be given. The title should be brief and to the point, enabling the reader to
understand the purpose of the graph without having to resort to reading the text. The title
can be placed below or above the graph as in Figure 2.3.
2. The axes should be clearly delineated and labeled. In general, the zero (0) points of both axes
should be clearly indicated. The ordinate (the Y axis) is usually labeled with the description
parallel to the Y axis. Both the ordinate and abscissa (X axis) should each be appropriately
labeled and subdivided in units of equal width (of course, the X and Y axes almost always
have different subdivisions). In the example in Figure 2.3, note the units of mm Hg and
weeks for the ordinate and abscissa, respectively. Grid lines may be added [Fig. 2.4(E)] but,
if used, should be kept to a minimum, not be prominent, and should not interfere with the
interpretation of the figure.

Figure 2.3 Blood pressure as a function of time in a clinical study comparing drug and placebo with a regimen
of one tablet per day; diastolic blood pressure (mm Hg) is plotted against time (weeks) after initiation of the study.
○, placebo (average of 45 patients); +, drug (average of 50 patients).

Figure 2.4 Various graphs of the same data presented in different ways. Exercise time (sec) at various time
intervals after administration of single doses of two nitrate products (Drugs I and II, plotted with distinct symbols);
panels (A) to (D) plot exercise time, and panel (E) plots the difference in exercise time, against time after dosing (hr).
3. The numerical values assigned to the axes should be appropriately spaced so as to nicely
cover the extent of the graph. This can easily be accomplished by trial and error and a little
manipulation. The scales and proportions should be constructed to present a fair picture of
the results and should not be exaggerated so as to prejudice the interpretation. Sometimes, it
may be necessary to skip or omit some of the data to achieve this objective. In these cases,
the use of a “broken line” is recommended to clearly indicate the range of data not included
in the graph (Fig. 2.4).
4. If appropriate, a key explaining the symbols used in the graph should be included. For example,
at the bottom of Figure 2.3, the key defines ○ as the symbol for placebo and + for drug. In
many cases, labeling the curves directly on the graph (Fig. 2.4) results in more clarity.
5. In situations where the graph is derived from laboratory data, inclusion of the source of the
data (name, laboratory notebook number, and page number, for example) is recommended.
Usually graphs should stand on their own, independent of the main body of the text.
Examples of various ways of plotting data, derived from a study of exercise time at various
time intervals after administration of a single dose of two long-acting nitrate products to anginal
patients, are shown in Figures 2.4(A) to 2.4(E). All of these plots are accurate representations of
the experimental results, but each gives the reader a different impression. It would be wrong to
expand or contract the axes of the graph, or otherwise distort the graph, in order to convey an
incorrect impression to the reader. Most scientists are well aware of how data can be manipulated
to give different impressions. If obvious deception is intended, the experimental results will not
be taken seriously.
When examining the various plots in Figure 2.4, one could not say which plot best repre-
sents the meaning of the experimental results without knowledge of the experimental details,
in particular the objective of the experiment, the implications of the experimental outcome, and
the message that is meant to be conveyed. For example, if an improvement of exercise time of
120 seconds for one drug compared to the other is considered to be significant from a medical
point of view, the graphs labeled A, C, and E in Figure 2.4 would all seem appropriate in con-
veying this message. The graphs labeled B and D show this difference less clearly. On the other
hand, if 120 seconds is considered to be of little medical significance, B and D might be a better
representation of the data.
Note that in plot A of Figure 2.4, the ordinate (exercise time) is broken, indicating that
some values have been skipped. This is not meant to be deceptive, but is intentionally done
to better show the differences between the two drugs. As long as the zero point and the break
in the axis are clearly indicated, and the message is not distorted, such a procedure is entirely
acceptable.
Figures 2.4(B) and 2.5 are exaggerated examples of plots that may be considered not to
reflect accurately the significance of the experimental results. In Figure 2.4(B), the clinically
significant difference of approximately 120 seconds is made to look very small, tending to
diminish drug differences in the viewer’s mind. Also, fluctuations in the hourly results appear
to be less than the data truly suggest. In Figure 2.5, a difference of 5 seconds in exercise time
between the two drugs appears very large. Care should be taken when constructing (as well as
reading) graphs so that experimental conclusions come through clear and true.
6. If more than one curve appears on the same graph, a convenient way to differentiate the
curves is to use different symbols for the experimental points (e.g., ○, ×, □, △, +) and, if
necessary, connecting the points in different ways (e.g., solid, dashed, dotted, or dot-dash lines). A key or
label is used, which is helpful in distinguishing the various curves, as shown in Figures 2.3
to 2.6. Other ways of differentiating curves include different kinds of crosshatching and use
of different colors.
7. One should take care not to place too many curves on the same graph, as this can result in
confusion. There are no specific rules in this regard. The decision depends on the nature of
the data, and how the data look when they are plotted. The curves graphed in Figure 2.7
are cluttered and confusing. The curves should be presented differently or separated into
two or more graphs. Figure 2.8 is a clearer depiction of the dissolution results of the five
formulations shown in Figure 2.7.
8. The standard deviation may be indicated on graphs as shown in Figure 2.9. However, when
the standard deviation is indicated on a graph (or in a table, for that matter), it should be
made clear whether the variation described in the graph is an indication of the standard
deviation (S) or the standard deviation of the mean (Sx̄ ). The standard deviation of the
mean, if appropriate, is often preferable to the standard deviation not only because the
values on the graph are mean values, but also because Sx̄ is smaller than the s.d., and
therefore less cluttering. Overlapping standard deviations, as shown in Figure 2.10, should
be avoided, as this representation of the experimental results is usually more confusing than
clarifying.
9. The manner in which the points on a graph should be connected is not always obvious.
Should the individual points be connected by straight lines, or should a smooth curve that
approximates the points be drawn through the data? (See Fig. 2.11.) If the graphs represent
functional relationships, the data should probably be connected by a smooth curve. For
example, the blood level versus time data shown in Figure 2.11 are described most accurately
by a smooth curve. Although, theoretically, the points should not be connected by straight
lines as shown in Figure 2.11(A), such graphs are often depicted this way. Connecting the
individual points with straight lines may be considered acceptable if one recognizes that
this representation is meant to clarify the graphical presentation, or is done for some other
appropriate reason. In the blood-level example, the area under the curve is proportional to
the amount of drug absorbed. The area is often computed by the trapezoidal rule [4], and
depiction of the data as shown in Figure 2.11(A) makes it easier to visualize and perform
such calculations.
Figure 2.8 Individual plots of dissolution of the five formulations shown in Fig. 2.7.

Figure 2.12 shows another example in which connecting points by straight lines is con-
venient but may not be a good representation of the experimental outcome. The straight line
connecting the blood pressure at zero time (before drug administration) to the blood pressure
after two weeks of drug administration suggests a gradual decrease (a linear decrease) in blood
pressure over the two-week period. In fact, no measurements were made during the initial
two-week interval. The 10-mm Hg decrease observed after two weeks of therapy may have
occurred before the two-week reading (e.g., in one week, as indicated by the dashed line in
Fig. 2.12). One should be careful to ensure that graphs constructed in such a manner are not
misinterpreted.
Figure 2.9 Plot of exercise time as a function of time for an antianginal drug showing mean values and standard
error of the mean.
Figure 2.10 Graph comparing two antianginal drugs that is confusing and cluttered because of the overlapping
standard deviations. •, drug A; o, drug B.
Figure 2.11 Plot of blood level versus time data illustrating two ways of drawing the curves.
Figure 2.13 Scatter plot showing the correlation of dissolution time and in vivo absorption of six tablet formula-
tions (A to F, each plotted with a distinct symbol); time to 80% dissolution (min) is plotted against the fraction of
dose absorbed in vivo.

Figure 2.13 shows the relationship between the time to 80% dissolution of six tablet formulations
and the fraction of the dose absorbed when human subjects take the various tablets. The data plotted
in Figure 2.13 show pictorially that as dissolution increases (i.e., the time to 80% dissolution
decreases), in vivo absorption increases. Scatter plots involve data pairs, X and Y, both of which
are variable. In this example, dissolution time and fraction absorbed are both random variables.
The first-order decline of drug concentration with time is linearized by a log transformation:

log C = log C0 − kt/2.3,   (2.1)
where C is the concentration at time t, C0 the concentration at time 0, k the first-order rate
constant, t the time, and log represents logarithm to the base 10.
Table 2.1 shows blood-level data obtained after an intravenous injection of a drug
described by a one-compartment model [3].
Figure 2.14 shows two ways of plotting the data in Table 2.1 to demonstrate the linearity
of the log C versus t relationship.
1. Figure 2.14(A) shows a plot of log C versus time. The resulting straight line is a consequence
of the relationship of log concentration and time as shown in Eq. 2.1. This is an equation of
a straight line with the Y intercept equal to log C0 and a slope equal to −k/2.3. Straight-line
relationships are discussed in more detail in chapter 8.
Table 2.1

Time after injection, t (hr)   Blood level, C (μg/mL)   Log blood level
0                              20                       1.301
1                              10                       1.000
2                              5                        0.699
3                              2.5                      0.398
4                              1.25                     0.097
Figure 2.14 Linearizing plots of the data from Table 2.1: (plot A) log concentration versus time after injection (hr);
(plot B) concentration versus time after injection on two-cycle semilog paper.
2. Figure 2.14(B) shows a more convenient way of plotting the data of Table 2.1, making use of
semilog graph paper. This paper has a logarithmic scale on the Y axis and the usual arithmetic,
linear scale on the X axis. The logarithmic scale is constructed so that the spacing corresponds
to the logarithms of the numbers on the Y axis. For example, the distance between 1 and 2 is
the same as that between 2 and 4. (Log 2−log 1) is equal to (log 4−log 2). The semilog graph
paper depicted in Figure 2.14(B) is two-cycle paper. The Y (log) axis has been repeated two
times. The decimal point for the numbers on the Y axis is accommodated to the data. In our
example, the data range from 1.25 to 20 and the Y axis is adjusted accordingly, as shown in
Figure 2.14(B). The data may be plotted directly on this paper without the need to look up
the logarithms of the concentration values.
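In code, the same convenience is obtained with a logarithmic axis. A minimal sketch (ours; matplotlib is assumed) plots the Table 2.1 data without computing any logarithms:

    import matplotlib.pyplot as plt

    t = [0, 1, 2, 3, 4]            # hours after injection (Table 2.1)
    conc = [20, 10, 5, 2.5, 1.25]  # blood level (ug/mL)

    plt.semilogy(t, conc, "o-")    # log scale on Y, like semilog paper
    plt.xlabel("Time after injection (hr)")
    plt.ylabel("Concentration (µg/mL)")
    plt.title("Semilog plot of the data in Table 2.1")
    plt.show()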
Figure 2.16 Exercise time for two drugs in the form of a column chart, using the data of Figure 2.4; exercise
time (sec) is plotted against time after dosing (hr) for Drugs 1 and 2.
Pie charts are popular ways of presenting categorical data. Although the principles used in
the construction of these charts are relatively simple, thought and care are necessary to convey
the correct message. For example, dividing the circle into too many categories can be confusing
and misleading. As a rule of thumb, no more than six sectors should be used. Another problem
with pie charts is that it is not always easy to differentiate two segments that are reasonably
close in size, whereas in the bar graph, values close in size are easily differentiated, since length
is the critical feature.
The circle (or pie) represents 100%, or all of the results. Each segment (or slice of pie) has an area proportional to the contribution of that category to the total. In the example shown in Figure 2.17(A), the pie represents the anti-inflammatory
drug market. The slices are proportions of the market accounted for by major drugs in this
therapeutic class. These charts are frequently used for business and economic descriptions, but
can be applied to the presentation of scientific data in appropriate circumstances. Figure 2.17(B)
shows the proportion of patients with good, fair, and poor responses to a drug in a clinical trial
(see also Fig. 2.15).
Of course, we have not exhausted all possible ways of presenting data graphically. We
have introduced the cumulative plot in section 1.2.3. Other kinds of plots are the stick diagram
(analogous to the histogram) and frequency polygon [5]. The number of ways in which data
can be presented is limited only by our own ingenuity. An elegant pictorial presentation of
data can “make” a report or government submission. On the other hand, poor presentation of
data can detract from an otherwise good report. The book Statistical Graphics by Calvin Schmid
is recommended for those who wish detailed information on the presentation of graphs and
charts.
KEY TERMS
Bar charts
Bar graphs
Column charts
Correlation
Data pairs
Dependent variables
Histogram
Independent variables
Key
Pie charts
Scatter plots
Semilog plots
EXERCISES
1. Plot the following data, preparing and labeling the graph according to the guidelines out-
lined in this chapter. These data are the result of preparing various modifications of a
formulation and observing the effect of the modifications on tablet hardness.
Formulation modification
(Hint: Plot these data on a single graph where the Y axis is tablet hardness and the X axis
is lactose concentration. There will be two curves, one at 10% starch and the other at 5%
starch.)
2. Prepare a histogram from the data of Table 1.3. Compare this histogram to that shown in
Figure 2.2(A). Which do you think is a better representation of the data distribution?
3. Plot the following data and label the graph appropriately.
Patient   Response to product A (X)   Response to product B (Y)
1 2.5 3.8
2 3.6 2.4
3 8.9 4.7
4 6.4 5.9
5 9.5 2.1
6 7.4 5.0
7 1.0 8.5
8 4.7 7.8
What conclusion(s) can you draw from this plot if the responses are pain relief scores, where
a high score means more relief?
4. A batch of tablets was shown to have 70% with no defects, 15% slightly chipped, 10% discolored, and 5% dirty. Construct a pie chart from these data.
5. The following data from a dose–response experiment, a measure of physical activity, are the
responses of five animals at each of three doses.
ANOVA:
Analysis of Variance
The basic ANOVA situation
Two variables: 1 Categorical, 1 Quantitative
[Dot plot: days (10 to 13) on the Y axis by treatment group (A, B, P) on the X axis]
What does ANOVA do?
At its simplest (there are extensions), ANOVA tests the following hypotheses:
H0: The means of all the groups are equal.
Ha: Not all of the group means are equal.
Group i has:
• ni = number of individuals in group i
• xij = value for individual j in group i
• x̄i = mean for group i
• si = standard deviation for group i
How ANOVA works (outline)
ANOVA measures two sources of variation in the data and compares their relative sizes:
• between-group variation, based on the deviations (x̄i − x̄)²
• within-group variation, based on the deviations (xij − x̄i)²
The ANOVA F-statistic is a ratio of the between-group variation divided by the within-group variation:
F = Between / Within = MSG / MSE
A large F is evidence against H0, since it
indicates that there is more difference
between groups than within groups.
Minitab ANOVA Output
Analysis of Variance for days
Source DF SS MS F P
treatment 2 34.74 17.37 6.45 0.006
Error 22 59.26 2.69
Total 24 94.00
R ANOVA Output
Df Sum Sq Mean Sq F value Pr(>F)
treatment 2 34.7 17.4 6.45 0.0063 **
Residuals 22 59.3 2.7
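As a minimal sketch of how such output is produced in R (the data frame d, with a numeric days column and a treatment factor, is assumed; it is not supplied on the slides):

# d is a hypothetical data frame with columns 'days' and 'treatment' (A, B, P)
fit <- aov(days ~ treatment, data = d)
summary(fit)   # prints the Df, Sum Sq, Mean Sq, F value, Pr(>F) table above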
How are these computations made?
We want to measure the amount of variation due to BETWEEN-group variation and WITHIN-group variation:
• BETWEEN-group variation is based on the deviations (x̄i − x̄)
• WITHIN-group variation is based on the deviations (xij − x̄i)
Excel ANOVA Output
SUMMARY
Groups     Count   Sum    Average    Variance
Column 1   3       18     6          0.49
Column 2   4       23.8   5.95       0.176667
Column 3   3       22.6   7.533333   0.123333

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        5.127333   2    2.563667   10.21575   0.008394   4.737416
Within Groups         1.756667   7    0.250952
Total                 6.884      9
The sums of squares are:
SSE = Σobs (xij − x̄i)²    (within groups)
SSG = Σobs (x̄i − x̄)²     (between groups)
SST = Σobs (xij − x̄)²     (total)
and F = MSG / MSE.
Remember: si² = Σj (xij − x̄i)² / (ni − 1) = SS[Within Group i] / dfi
So SS[Within Group i] = (si²)(dfi), and MSE is the pooled estimate of variance:
sp² = SSE / DFE = MSE
In Summary
SST = Σobs (xij − x̄)² = s²(DFT)
SSE = Σobs (xij − x̄i)² = Σgroups si²(dfi)
SSG = Σobs (x̄i − x̄)² = Σgroups ni(x̄i − x̄)²
SSE + SSG = SST;   MS = SS/DF;   F = MSG/MSE
R² Statistic
R² gives the percent of variance due to between-group variation:
R² = SS[Between] / SS[Total] = SSG / SST
These give 98.01% CIs for each pairwise difference:
       A                     B
B   (−3.685, 0.435)
P   (−4.863, −0.859)    (−3.238, 0.766)
Only P vs. A is significant (both limits have the same sign); the 98% CI for A − P is (−4.86, −0.86).
Tukey’s Method in R
Tukey multiple comparisons of means
95% family-wise confidence level
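In R, Tukey's method is one call to TukeyHSD() on a fitted aov object; a minimal sketch (the data frame d is again assumed, as above):

fit <- aov(days ~ treatment, data = d)    # d is hypothetical
TukeyHSD(fit, conf.level = 0.95)  # pairwise differences with family-wise CIs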
Between Sum of Squares
SSW = Σ(Xi − X̄A)²   (within);   SSB = ΣNA(X̄A − X̄G)²   (between)

Each observation contributes (X̄A − X̄G)², where X̄A is its cell (group) mean and X̄G = 79 is the grand mean:

Group                              X̄A   X̄G   (X̄A − X̄G)²
G1 (Control), M = 79, SD = 3.16    79    79    0    (×5 observations)
G2, M = 84, SD = 3.16              84    79    25   (×5 observations)
G3, M = 74, SD = 3.16              74    79    25   (×5 observations)
Sum over all 15 observations                   250

The between sum of squares relates the cell means to the grand mean. This is related to the variance of the means.
SSB = ΣNA(X̄A − X̄G)²
ANOVA Source Table
Source | SS | df | MS | F
ANOVA - Analysis of Variance
• Extends independent-samples t test
• Compares the means of groups of independent observations
– Don't be fooled by the name. ANOVA does not compare variances.
• Can compare more than two groups
ANOVA – Null and Alternative Hypotheses
Say the sample contains K independent groups.
• ANOVA tests the null hypothesis
H0: μ1 = μ2 = … = μK
– That is, "the group means are all equal"
• The alternative hypothesis is
H1: μi ≠ μj for some i, j
– or, "the group means are not all equal"
Example:
Accuracy of Implant
Placement
Implants were placed in a
manikin using placement
guides of various widths.
15 implants were placed
using each guide.
Error (discrepancies with a
reference implant) was
measured for each implant.
Example:
Accuracy of Implant
Placement
The overall mean of the
entire sample was 0.248
mm.
This is called the "grand" mean, and is often denoted by X̄.
If H0 were true then we’d
expect the group means to
be close to the grand
mean.
Example:
Accuracy of Implant
Placement
The ANOVA test is based on the combined distances from X̄.
If the combined distances
are large, that indicates we
should reject H0.
The ANOVA Statistic
To combine the differences from the grand mean we
– Square the differences
– Multiply by the numbers of observations in the groups
– Sum over the groups
SSB = 15(X̄4mm − X̄)² + 15(X̄6mm − X̄)² + 15(X̄8mm − X̄)²
where the X̄* are the group means.
"SSB" = Sum of Squares Between groups
Note: This looks a bit like a variance.
How big is big?
• For the Implant Accuracy Data, SSB = 0.0047
• Is that big enough to reject H0?
• As with the t test, we compare the statistic to the
variability of the individual observations.
• In ANOVA the variability is estimated by the Mean
Square Error, or MSE
MSE
Mean Square Error
The Mean Square Error is a measure of the variability after the group effects have been taken into account:
MSE = [1 / (N − K)] Σj Σi (xij − X̄j)²
where xij is the ith observation in the jth group.
Note that the variation of the means seems quite small compared to the variance of observations within groups.
Notes on MSE
• If there are only two groups, the MSE is equal to the
pooled estimate of variance used in the equal‐
variance t test.
• ANOVA assumes that all the group variances are
equal.
• Other options should be considered if group
variances differ by a factor of 2 or more.
ANOVA F Test
• The ANOVA F test is based on the F statistic
F = [SSB / (K − 1)] / MSE
where K is the number of groups.
• Under H0 the F statistic has an “F” distribution, with
K‐1 and N‐K degrees of freedom (N is the total
number of observations)
Implant Data: F test p-value
To get a p-value we compare our F statistic to an F(2, 42) distribution. In our example
F = (0.0047 / 2) / (0.467 / 42) = 0.211
The p-value is given in the "Sig." column of the ANOVA table below.

                 Sum of Squares   df   Mean Square   F      Sig.
Between Groups   .005             2    .002          .211   .811
Total            .470             44
ANOVA Table
Results are often displayed using an ANOVA table:

                 Sum of Squares   df   Mean Square   F      Sig.
Between Groups   .005             2    .002          .211   .811
Total            .470             44
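The Sig. value can be reproduced from the F distribution; a one-line R sketch:

pf(0.211, df1 = 2, df2 = 42, lower.tail = FALSE)   # = 0.81, the p-value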
Which means are different?
                 Sum of Squares   df     Mean Square   F     Sig.
Between Groups   33383            3      11128         5.1   .002
Within Groups    4417119          2007   2201
Total            4450502          2010
Can directly compare the subgroups using "post hoc" tests.
Least Significant Difference test
The simplest post hoc test is the Least Significant Difference Test. The computation is very similar to the equal-variance t test: compute an equal-variance t test, but replace the pooled variance (s²) with the MSE.

                N     Mean    Std. Deviation
Healthy         802   221.5   46.2
Gingivitis      490   223.5   45.3
Periodontitis   347   227.3   48.9
Edentulous      372   232.4   48.8

                 Sum of Squares   df     Mean Square   F     Sig.
Between Groups   33383            3      11128         5.1   .002
Within Groups    4417119          2007   2201
Total            4450502          2010
Least Significant Difference Test: Example
Compare the Healthy group to the Periodontitis group:
T = (221.5 − 227.3) / √[2201 × (1/802 + 1/347)] = −1.92
p = 2 · P(t1147 > 1.92) = 0.055
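A minimal R sketch of this computation (numbers taken from the tables above):

t_stat <- (221.5 - 227.3) / sqrt(2201 * (1/802 + 1/347))      # -1.92
p_val  <- 2 * pt(abs(t_stat), df = 1147, lower.tail = FALSE)  # about 0.055
c(t_stat, p_val)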
Post Hoc Tests: Multiple Comparisons
• Post-hoc testing usually involves multiple comparisons.
• For example, if the data contain 4 groups (Healthy, Gingivitis, Periodontitis, Edentulous), then 6 different pairwise comparisons can be made.
Post Hoc Tests: Multiple Comparisons
• Each time a hypothesis test is performed at significance level α, there is probability α of rejecting in error.
• Performing multiple tests increases the chances of rejecting in error at least once.
• For example, if you did 6 independent hypothesis tests at the α = 0.05 level, and if, in truth, H0 were true for all six, the probability that at least one test rejects H0 is 26%:
P(at least one rejection) = 1 − P(no rejections) = 1 − 0.95⁶ = 0.26
Bonferroni Correction for Multiple Comparisons
• The Bonferroni correction is a simple way to adjust
for the multiple comparisons.
Bonferroni Correction
• Perform each test at significance level α.
• Multiply each p-value by the number of tests
performed.
• The overall significance level (chance of any of the
tests rejecting in error) will be less than α.
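In R, this adjustment is built in as p.adjust(); a minimal sketch using the six LSD p-values from the cholesterol example that follows:

p_raw <- c(0.46, 0.055, 0.00021, 0.25, 0.0056, 0.147)   # LSD p-values
p.adjust(p_raw, method = "bonferroni")   # multiplies by 6, capping at 1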
Example: Cholesterol Data post-hoc comparisons
Group 1         Group 2         Mean Difference (Group 1 − Group 2)   LSD p-value   Bonferroni p-value
Healthy         Gingivitis      −2.0                                  .46           1.0
Healthy         Periodontitis   −5.8                                  .055          .330
Healthy         Edentulous      −10.9                                 .00021        .00126
Gingivitis      Periodontitis   −3.9                                  .25           1.0
Gingivitis      Edentulous      −8.9                                  .0056         .0336
Periodontitis   Edentulous      −5.1                                  .147          .88
Conclusion: The Edentulous group is significantly different from the Healthy group and the Gingivitis group (p < 0.05), after adjustment for multiple comparisons.
Summarizing Scores with
Measures of Central Tendency:
The Mean, Median, and Mode
Outline of the Course
III. Descriptive Statistics
A. Measures of Central Tendency (Chapter 3)
1. Mean
2. Median
3. Mode
B. Measures of Variability (Chapter 4)
1. Range
2. Mean deviation
3. Variance
4. Standard Deviation
C. Skewness (Chapter 2)
1. Positive skew
2. Normal distribution
3. Negative skew
D. Kurtosis
1. Platykurtic
2. Mesokurtic
3. Leptokurtic
Measures of Central Tendency
[Bar chart of ice cream flavor preferences; recoverable frequencies: Strawberry 15, Butter Pecan 12, Rocky Road 9, Neapolitan 8, Fudge Ripple 6]
Measures of Central Tendency
Mode

Measures of Central Tendency
Mean
• For a population: μ = ΣX / N.   For a sample: X̄ = ΣX / n.

Measures of Central Tendency
Mean — Other Notes
• ΣX = 788, so X̄ = ΣX/n = 788/8 = 98.5.
• The mean salary for this sample is $98,500, which is more than twice almost all of the scores.
• Arrange the scores: 38, 40, 40, 41, 42, 42, 45, 500.
• The middle two numbers are 41 and 42, so their average, $41,500 (the median), is perhaps a more accurate measure of central tendency.
Measures of Central Tendency
Mean vs. Median — Reaction Time Example
• Data are times to complete a task (in s): 45, 34, 87, 56, 21, didn't finish, 49
• X̄ = ΣX / n, so X̄ · n = ΣX
• ntotal = n1 + n2 = 23 + 34 = 57
• A common formula we will be working with extensively is the deviation: X − X̄

Exam Score   X − X̄
7            (7 − 9) = −2
6            (6 − 9) = −3
8            (8 − 9) = −1
9            (9 − 9) = 0
12           (12 − 9) = 3
10           (10 − 9) = 1
11           (11 − 9) = 2
9            (9 − 9) = 0

ΣX = 72, n = 8, X̄ = ΣX/n = 72/8 = 9, and Σ(X − X̄) = 0
Measures of Central Tendency
Using the Mean to Interpret Data
Predicting Scores
X̄ ≈ μ
Measures of Central Tendency
Consider the Measurements and Frequency Table
Generated in the previous lecture
87, 85, 79, 75, 81, 88, 92, 86, 77, 72, 75, 77, 81, 80, 77,
73, 69, 71, 76, 79, 83, 81, 78, 75, 68, 67, 71, 73, 78, 75,
84, 81, 79, 82, 87, 89, 85, 81, 79, 77, 81, 78, 74, 76, 82,
85, 86, 81, 72, 69, 65, 71, 73, 78, 81, 77, 74, 77, 72, 68
Class         Midpoint   Frequency   Relative frequency
64.5 – 69.5   67         6           0.100
69.5 – 74.5   72         11          0.183
74.5 – 79.5   77         20          0.333
79.5 – 84.5   82         13          0.217
84.5 – 89.5   87         9           0.150
89.5 – 94.5   92         1           0.0167
Measures of Central Tendency
65, 67, 68, 68, 69, 69, 71, 71, 71, 72, 72, 72, 73, 73, 73,
74, 74, 75, 75, 75, 75, 76, 76, 77, 77, 77, 77, 77, 77, 78,
78, 78, 78, 79, 79, 79, 79, 80, 81, 81, 81, 81, 81, 81, 81,
81, 82, 82, 83, 84, 85, 85, 85, 86, 86, 87, 87, 88, 89, 92
Since N/2 = 30 and both the 30th and 31st values in the list are
the same, we obtain median = 78
Measures of Central Tendency
One further parameter of a population that may give some
indication of central tendency of the data is the mode
65, 67, 68, 68, 69, 69, 71, 71, 71, 72, 72, 72, 73, 73, 73,
74, 74, 75, 75, 75, 75, 76, 76, 77, 77, 77, 77, 77, 77, 78,
78, 78, 78, 79, 79, 79, 79, 80, 81, 81, 81, 81, 81, 81, 81,
81, 82, 82, 83, 84, 85, 85, 85, 86, 86, 87, 87, 88, 89, 92
The value 81 occurs 8 times, so the mode = 81.
For grouped data, the mean is μ = Σi(fi · xi) / Σi fi.
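A minimal R sketch of the grouped mean, using the frequency table above:

midpt <- c(67, 72, 77, 82, 87, 92)   # class midpoints
freq  <- c(6, 11, 20, 13, 9, 1)      # class frequencies
sum(freq * midpt) / sum(freq)        # grouped mean, about 77.9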
Imagine that you received the following data on the vocabulary test mentioned earlier:
20 22 23 23 23
23 23 23 24 25
28 29 30 30 30
30 30 30 31 32
32 33 33 34 35
35 36 36 37 37
2. Compute the mean, mode, and median of the data and decide which of the three you
believe to be best for the central tendency of the data.
Measure of variability
Variability provides a quantitative
measure of the degree to which scores
in a distribution are spread out or
clustered together.
Measure of variability
Range
range = Xhighest − Xlowest
Quartile:
A statistical term describing a division of observations into four defined intervals
based upon the values of the data and how they compare to the entire set of
observations.
Each quartile contains 25% of the total observations. The data are ordered from smallest to largest: observations in the lowest 25% fall in the 1st quartile, those between the 25th and 50th percentiles in the 2nd quartile, those between the 50th and 75th percentiles in the 3rd quartile, and the remaining observations in the 4th quartile.
Interquartile: The interquartile range is a measure of spread or dispersion. It
is the difference between the 75th percentile (often called Q3) and the 25th
percentile (Q1). The formula for interquartile range is therefore: Q3-Q1.
Semi-interquartile: The semi-interquartile range is a measure of spread or
dispersion. It is computed as one half the difference between the 75th
percentile [often called (Q3)] and the 25th percentile (Q1). The formula for
semi-interquartile range is therefore: (Q3-Q1)/2.
TOEFL: (560-470)/2=45
Measure of variability
Variance
Deviation: the deviation of one score from the mean.
Variance: takes the distribution of all scores into account, through the sum of squares (SS).
Measure of variability
Standard deviation
Score   Mean   Deviation   Squared deviation
8       9.67   −1.67       2.79
25      9.67   +15.33      235.01
7       9.67   −2.67       7.13
5       9.67   −4.67       21.81
8       9.67   −1.67       2.79
3       9.67   −6.67       44.49
10      9.67   +0.33       0.11
12      9.67   +2.33       5.43
9       9.67   −0.67       0.45
Sum of squared deviations = 320.01
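A minimal R sketch of the same computation:

x  <- c(8, 25, 7, 5, 8, 3, 10, 12, 9)
ss <- sum((x - mean(x))^2)   # sum of squared deviations = 320.0
sd(x)                        # sample SD = sqrt(ss/(n-1)) = sqrt(40), about 6.32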
Type of instrument
           Listening (Mean, SD)   Watching (Mean, SD)
Males      15.72, 4.43            6.94, 2.26
Females    3.47, 1.12             2.61, 0.98
Measure of variability
Standard deviation and normal distribution
Homework
1. Calculate the mean, median, mode, range and standard
deviation for the following sample:
Midterm Exam
X X
100 85
88 82
83 96
105 107
78 102
98 113
126 94
85 119
67 91
88 100
88 72
77 88
114 85
Homework
2. Suppose that the following scores were obtained on administering a
language proficiency test to ten aphasics who had undergone a course
of treatment, and ten otherwise similar aphasics who had not
undergone the treatment:
Experimental group Control group
15 31
28 34
62 47
17 41
31 28
58 54
45 36
11 38
76 45
43 32
Calculate the mean score and standard deviation for each group, and
comment on the results.
Locating scores and finding
scales in a distribution
Percentiles, quartiles, deciles
Mind work
Imagine that you conducted an in-service course for ESL teachers. To receive university credit for the
course, the teachers must take examinations--in this case, a midterm and a final. The midterm was a
multiple-choice test of 50 items and the final exam presented teachers with 10 problem situations to
solve. Sue, like most teachers, was a whiz at taking multiple-choice exams, but bombed out on the
problem-solving final exam. She received a 48 on the midterm and a 1 on the final. Becky didn't do so
well on the midterm. She kept thinking of exceptions to answers on the multiple-choice exam. Her score
was 39. However, she really did shine on the final, scoring a 10. Since you expect students to do well on
both exams, you reason that Becky has done a creditable job on each and Sue has not. Becky gets the
higher grade. Yet, if you add the points together, Sue has 49 and Becky has 49. The question is whether
the points are really equal.
Should Sue also do this bit of arithmetic, she might come to your office to complain of the injustice of it
all. How will you show her that the value of each point on the two tests is different?
Locating scores and finding
scales in a distribution
Standard score (z-scores):
z = (X − X̄) / s
Locating scores and finding
scales in a distribution
Mind work
Suppose that we have measured the times taken by a very large number of
people to utter a particular sentence, and have shown these times to be
normally distributed with a mean of 3.45 sec and a standard deviation of
0.84 sec. Armed with this information, we can answer various questions.
1. What proportion of the (potentially infinite) population of utterance
times would be expected to fall below 3 sec?
2. What proportion would lie between 3 and 4 sec?
3. What is the time below which only 1 per cent of the times would be
expected to fall?
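These three questions can be answered with the normal distribution functions in R; a minimal sketch:

pnorm(3, mean = 3.45, sd = 0.84)              # 1. P(time < 3), about 0.30
pnorm(4, 3.45, 0.84) - pnorm(3, 3.45, 0.84)   # 2. P(3 < time < 4), about 0.45
qnorm(0.01, mean = 3.45, sd = 0.84)           # 3. 1st percentile, about 1.5 sec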
Mind work
A foreign languages institute stipulated in its graduate program that any student scoring below 75 on even one course examination loses the right to write a thesis. This is clearly unscientific, because it amounts to forcing comparisons among examinations that are not of the same kind. The same score of 75 means different things on different examinations: on a very easy test it may be a rather low score, while on a difficult test it may be a rather high one. Disqualifying everyone below that score is therefore both unscientific and unfair. The scientific approach is to convert the scores on each course to standard scores and then set a standard-score cutoff below which students may not write a thesis. As in the earlier example, once standard scores are available, the scores on the various courses can also be combined into a total, or averaged, to rank the students, and a criterion can then be set for the total or average score required to qualify for thesis writing.
Locating scores and finding
scales in a distribution
Distributions with nominal data
Implicational scaling (Guttman scaling)
Coefficient of scalability
Homework
I. The following scores are obtained by 50 subjects on a language aptitude test:
42 62 44 32 47 42 52 76 36 43
55 27 46 55 47 28 53 44 15 61
18 59 58 57 49 55 88 49 50 62
61 82 66 80 64 50 40 53 28 63
63 25 58 71 82 52 73 67 58 77
−1.96 ≤ (58.2 − x̄) / 3.3 ≤ 1.96, which gives 51.7 ≤ x̄ ≤ 64.7
Sample statistics and population
parameter: estimation
Confidence limits for proportions
Standard error = √[p(1 − p) / N]
Confidence limits=proportion in sample
±(critical value x standard error)
Sample statistics and population
parameter: estimation
Suppose that we have taken a random sample of 500 finite verbs from a text,
and found that 150 of them have present tense form. How can we set
confidence limits for the proportion of present tense finite verbs in the whole
text, the population from which the sample is taken?
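A minimal R sketch of the computation for this example:

p_hat <- 150 / 500                        # sample proportion, 0.30
se    <- sqrt(p_hat * (1 - p_hat) / 500)  # standard error, about 0.0205
p_hat + c(-1.96, 1.96) * se               # 95% limits, about (0.26, 0.34)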
Calculate (i) the mean, (ii) the standard deviation, (iii) the standard error
of the mean, (iv) the 99 per cent confidence limits for the mean.
II. A random sample of 300 finite verbs is taken from a text, and it is found
that 63 of these are auxiliaries. Calculate the 95 per cent confidence
limits for the proportion of finite verbs which are auxiliaries in the text
as a whole.
III. Using the data in question II, calculate the size of the sample of finite
verbs which would. be required in order to estimate the proportion of
auxiliaries to within an accuracy of 1 per cent, with 95 per cent
confidence.
Probability and Hypothesis
Testing
Null hypothesis (H0)
The null hypothesis states that in the general
population there is no change, no difference, or no
relationship. In the context of an experiment, H0
predicts that the independent variable (treatment)
will have no effect on the dependent variable for
the population. H0: μA- μB=0 or μA= μB
Alternative hypothesis (H1)
The alternative hypothesis (H1) states that there is
a change, a difference, or a relationship for the
general population. H1: μA≠ μB
Probability and Hypothesis
Testing
Null hypothesis (H0)
When we reject the null hypothesis, we want the probability to be
very low that we are wrong. If, on the other hand, we must accept
the null hypothesis, we still want the probability to be very low that
we are wrong in doing so.
Type I error and Type II error
A type I error is made when the researcher rejects the null hypothesis when it should not have been rejected.
A type II error is made when the null hypothesis is accepted when it should have been rejected.
In research, we test our hypothesis by finding the probability of
our results. Probability is the proportion of times that any
particular outcome would happen if the research were repeated
an infinite number of times.
Probability and Hypothesis
Testing
Two-tailed and one-tailed hypothesis
When we specify no direction for the null hypothesis (i.e.,
whether our score will be higher or lower than more typical
scores), we must consider both tails of the distribution. This
is called two-tailed hypothesis.
If we have good reason to believe that we will find a
difference (e.g., previous studies or research findings
suggest this is so), then we will use a one-tailed hypothesis.
One-tailed tests specify the direction of the predicted
difference. We use previous findings to tell us which
direction to select.
Critical z values:   α = .05   α = .01
1-tailed             1.64      2.33
2-tailed             1.96      2.57
Probability and Hypothesis
Testing
Steps in hypothesis testing
1. State the null hypothesis.
2. Decide whether to test it as a one- or two-tailed hypothesis. If there is
no research evidence on the issue, select a two-tailed hypothesis. This
will allow you to reject the null hypothesis in favor of an alternative
hypothesis. If there is research evidence on the issue, select a
one-tailed hypothesis. This will allow you to reject the null
hypothesis in favor of a directional hypothesis.
3. Set the probability level (α level). Justify your choice.
4. Select the appropriate statistical test(s) for the data.
5. Collect the data and apply the statistical test(s).
6. Report the test results and interpret them correctly.
Probability and Hypothesis
Testing
Parametric vs. nonparametric
Parametric procedures
Make strong assumptions about the distribution of the
data
Assume the data are NOT frequencies or ordinal scales
but interval data
Data are normally distributed
Nonparametric procedures
Do not make strong assumptions about the shape of the
distribution of the data
Work with frequencies and rank-ordered scales
Used when the sample size is small
Homework
Lecture 14: chi-square test, P-value
• Measurement error (review from lecture 13)
• Null hypothesis; alternative hypothesis
• Evidence against null hypothesis
• Measuring the Strength of evidence by P-
value
• Pre-setting significance level
• Conclusion
• Confidence interval
Some general thoughts about hypothesis testing
• A claim is any statement made about the truth; it could be a theory made by a scientist, or a statement from a prosecutor, a manufacturer, or a consumer.
• Data cannot prove a claim, however, because there may be other data that could contradict the theory.
• Data can be used to reject the claim if there is a contradiction to what may be expected.
• Put any claim in the null hypothesis H0.
• Come up with an alternative hypothesis and put it as H1.
• Study the data and find a hypothesis-testing statistic, which is an informative summary of the data bearing on the hypotheses.
• The test statistic is chosen by experience or statistical training; it depends on the formulation of the problem and how the data are related to the hypothesis.
• Find the strength of evidence by the P-value: imagine a future set of data, and compute the probability that the summary test statistic will be as large as, or even greater than, the one obtained from the current data. If the P-value is very small, then either the null hypothesis is false or you are extremely unlucky; so statisticians argue that a small P-value is strong evidence against the null hypothesis.
• If the P-value is smaller than a pre-specified level (called the significance level, 5% for example), reject the null hypothesis.
Back to the microarray example
• H0: true SD σ = 0.1 (denote 0.1 by σ0)
• H1: true SD σ > 0.1 (because this is the main concern; you don't care if the SD is small)
• Summary:
• Sample SD (s) = √[sum of squares / (n − 1)] = 0.18
• where sum of squares = (1.1 − 1.3)² + (1.2 − 1.3)² + (1.4 − 1.3)² + (1.5 − 1.3)² = 0.1, and n = 4
• The ratio s/σ0 = 1.8; is it too big?
• The P-value consideration:
• Suppose a future data set (n = 4) will be collected.
• Let s be the sample SD from this future dataset; it is random; so what is the probability that s/σ0 will be as large as 1.8 or larger?
• P(s/σ0 > 1.8)
• But to find this probability we need to use the chi-square distribution:
• Recall that sum of squares / true variance follows a chi-square distribution.
• Therefore, equivalently, we compute
• P(future sum of squares / σ0² > sum of squares from the currently available data / σ0²), recalling that σ0 is the value claimed under the null hypothesis.
Once again, if data were generated again, then sum of squares / true variance is random and follows a chi-squared distribution with n − 1 degrees of freedom, where sum of squares = the sum of squared distances between each data point and the sample mean.
P-value = P(chi-square random variable > computed value from data) = P(chi-square random variable > 10.0)
Note: Sum of squares = (n − 1)(sample variance) = (n − 1)(sample SD)².
For our case, n = 4, so look at the chi-square distribution with 3 degrees of freedom.
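A one-line R sketch of this P-value:

pchisq(10.0, df = 3, lower.tail = FALSE)   # about 0.019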
11.1 Chi-Square Goodness of Fit
Characteristics of the Chi-Square Distribution
1. It is not symmetric.
2. The shape of the chi-square distribution depends upon the degrees of freedom, just like Student's t-distribution.
3. As the number of degrees of freedom increases, the chi-square distribution becomes more symmetric, as illustrated in Figure 1.
4. The values are non-negative; that is, the values of χ² are greater than or equal to 0.
The Chi-Square Distribution
A goodness-of-fit test is an inferential
procedure used to determine whether a
frequency distribution follows a claimed
distribution.
Expected Counts
Suppose there are n independent trials of an experiment with k ≥ 3 mutually exclusive possible outcomes. Let p1 represent the probability of observing the first outcome and E1 the expected count of the first outcome, p2 the probability of observing the second outcome and E2 the expected count of the second outcome, and so on. The expected count for each possible outcome is given by
Ei = μi = npi for i = 1, 2, …, k
EXAMPLE Finding Expected Counts
A sociologist wishes to determine whether the distribution of the number of years grandparents are responsible for their grandchildren is different today than it was in 2000. According to the United States Census Bureau, in 2000, 22.8% of grandparents had been responsible for their grandchildren for less than 1 year; 23.9% for 1 or 2 years; 17.6% for 3 or 4 years; and 35.7% for 5 or more years. If the sociologist randomly selects 1,000 grandparents that are responsible for their grandchildren, compute the expected number within each category, assuming the distribution has not changed from 2000.
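A minimal R sketch of the expected counts:

p <- c(0.228, 0.239, 0.176, 0.357)   # year-2000 proportions
1000 * p                             # expected counts: 228, 239, 176, 357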
Test Statistic for Goodness-of-Fit Tests
Let Oi represent the observed count for category i, Ei the expected count for category i, k the number of categories, and n the number of independent trials of the experiment. Then
χ² = Σ (Oi − Ei)² / Ei,   i = 1, 2, …, k
For the grandparents example:
χ² = (252 − 228)²/228 + (255 − 239)²/239 + (162 − 176)²/176 + (331 − 357)²/357 = 6.605
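In R, the whole test is one call to chisq.test(); a minimal sketch with the observed counts from the example:

obs <- c(252, 255, 162, 331)
chisq.test(obs, p = c(0.228, 0.239, 0.176, 0.357))
# X-squared = 6.605, df = 3, p-value about 0.086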
• Step 6. Compare the test statistic with the critical value:
the test statistic < the critical value, i.e., the test statistic does not lie in the critical region.
• Step 7. Conclusion:
There is not sufficient evidence at the α = 0.05 level of significance to reject the null hypothesis, i.e., the claim that the distribution of the number of years grandparents are responsible for their grandchildren is the same today as it was in 2000.
CORRELATION
Correlation
key concepts:
Types of correlation
Methods of studying correlation
a) Scatter diagram
b) Karl Pearson's coefficient of correlation
c) Spearman’s Rank correlation coefficient
d) Method of least squares
Correlation
Correlation: the degree of relationship between the variables under consideration is measured through correlation analysis.
The measure of correlation is called the correlation coefficient.
The degree of relationship is expressed by a coefficient that ranges from −1 to +1 (−1 ≤ r ≤ +1).
The direction of change is indicated by the sign.
Correlation analysis enables us to have an idea about the degree and direction of the relationship between the two variables under study.
Correlation
Correlation is a statistical tool that helps
to measure and analyze the degree of
relationship between two variables.
Correlation analysis deals with the
association between two or more
variables.
Correlation & Causation
Causation means a cause-and-effect relation.
Correlation denotes interdependency among variables. For two phenomena to be correlated, it is essential that they have a cause-and-effect relationship; if no such relationship exists, the two phenomena cannot be meaningfully correlated.
If two variables vary in such a way that movements in one are accompanied by movements in the other, the variables are said to have a cause-and-effect relationship.
Causation always implies correlation, but correlation does not necessarily imply causation.
Types of Correlation
Type I: correlation may be simple or multiple; multiple correlation may be partial or total.
Types of Correlation — Type II
Simple correlation: only two variables are studied.
Multiple correlation: three or more variables are studied. Ex. Qd = f(P, PC, PS, t, y)
Partial correlation: recognizes more than two variables but considers only two variables, keeping the others constant.
Total correlation: based on all the relevant variables, which is normally not feasible.
Types of Correlation — Type III: degree of correlation (illustrated by scatter diagrams)
• High degree of positive correlation, r = +0.80 (e.g., height of A vs. height of B)
• Moderate positive correlation, r = +0.40 (e.g., height vs. weight)
• Perfect negative correlation, r = −1.0 (e.g., TV watching per week vs. exam score)
• Moderate negative correlation, r = −0.80 (e.g., TV watching per week vs. exam score)
• Weak negative correlation, r = −0.2 (e.g., shoe size vs. weight)
• No correlation (horizontal line), r = 0.0 (e.g., IQ vs. height)
Degrees of correlation range over values such as r = +.80, +.60, +.40, +.20.
2) Direction of the Relationship — indicated by the sign, (+) or (−)
Positive relationship – variables change in the same direction: as X is increasing, Y is increasing; as X is decreasing, Y is decreasing. E.g., as height increases, so does weight.
Negative relationship – variables change in opposite directions: as X is increasing, Y is decreasing; as X is decreasing, Y is increasing.
Regression Equation of X on Y:
X − X̄ = bxy(Y − Ȳ)
bxy = Σxy / Σy²
bxy = r(σx / σy)
Properties of the Regression Coefficients
The coefficient of correlation is the geometric mean of the two regression coefficients: r = √(byx · bxy).
If byx is positive, then bxy must also be positive, and vice versa.
If one regression coefficient is greater than one, the other must be less than one.
The coefficient of correlation has the same sign as the regression coefficients.
The arithmetic mean of byx and bxy is equal to or greater than the coefficient of correlation: (byx + bxy)/2 ≥ r.
Regression coefficients are independent of origin but not of scale.
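A minimal R sketch verifying the geometric-mean property on made-up data:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)         # made-up data
byx <- cov(x, y) / var(x)     # regression coefficient of Y on X
bxy <- cov(x, y) / var(y)     # regression coefficient of X on Y
c(byx * bxy, cor(x, y)^2)     # equal: byx * bxy = r^2, so r = sqrt(byx * bxy)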
Standard Error of Estimate
The standard error of estimate is the measure of variation around the computed regression line. The standard error of estimate of Y measures the variability of the observed values of Y around the regression line; it gives us a measure of the scatter of the observations about the line of regression.
Standard Error of Estimate
The standard error of estimate of Y on X is:
S.E. of Y on X (SEyx) = √[Σ(Y − Ye)² / (n − 2)]
Y = observed value of Y
Ye = estimated value from the estimated equation corresponding to each Y value
e = the error term (Y − Ye)
n = number of observations in the sample
Correlation vs. regression
• Correlation: we look at the relationship between two variables without knowing the direction of causality.
• Regression: we try to predict the outcome of one variable from one or more predictor variables; thus, the direction of causality can be established.
• 1 predictor = simple regression; >1 predictor = multiple regression.
Correlation vs. regression
Correlation: For a correlation you do not need to know anything about the possible relation between the two variables. Many variables correlate with each other for unknown reasons. Correlation underlies regression but is descriptive only.
Regression: For a regression you do want to find out about the relations between variables, in particular whether one 'causes' the other. Therefore, an unambiguous causal template has to be established between the causer and the causee before the analysis! This template is inferential. Regression is THE statistical method underlying ALL inferential statistics (t-test, ANOVA, etc.). All that follows is a variation of regression.
Linear regression
Independent and dependent variables
In a regression, the predictor variables are
labelled 'independent' variables. They predict
the outcome variable labelled 'dependent'
variable.
Method of least squares
In order to know which line to choose as the best
model of a given data cloud, the method of least
squares is used. We select the line for which the
sum of all squared deviations (SS) of all data
points is lowest. This line is labelled 'line of best
fit', or 'regression line'.
Regression line
Simple regression: Yi = (b0 + b1Xi) + εi
Yi = outcome we want to predict
b0 = intercept of the regression line
b1 = slope of the regression line
(In mathematics, a coefficient is a constant multiplicative factor; b0 and b1 are the regression coefficients.)
• Slope/gradient: steepness of the line; negative or positive.
• Intercept: where the line crosses the Y axis.
Example: Yi = (−4 + 1.33Xi) + εi
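A minimal R sketch of fitting such a line by least squares (the data are made up to lie roughly on Y = −4 + 1.33X):

x <- c(1, 2, 3, 4, 5)
y <- c(-3.0, -1.2, 0.1, 1.0, 2.7)   # made-up data
fit <- lm(y ~ x)                    # least-squares regression line
coef(fit)                           # b0 (intercept) and b1 (slope)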
'Goodness-of-fit'
Mean of Y as basic model
The summed squared differences (Yi − Ȳ) between the observed values and the mean, SST, are big; hence the mean, Ȳ, is not a good model of the data.
F = MSM / MSR
The F-ratio should be high (since the model should
have improved the prediction considerably, as
expressed in MSM). MSR, the difference between the
model and the observed data (the residual), should be
small.
The coefficient of a predictor
The coefficient of the predictor X is b1. b1 indicates the gradient/slope of the regression line; it says how much Y changes when X is changed by one unit. In a good model, b1 should always be different from 0, since the slope is either positive or negative. Only a bad model, i.e., the basic model of the mean, has a slope of 0.
If b1 = 0, this means:
• A change of one unit in the predictor X does not change the predicted variable Y.
• The gradient of the regression line is 0.
T-Test of the coefficient of the predictor
A good predictor variable should have a b1 that is different from 0 (the regression coefficient of the basic model, the mean). Whether this difference is significant can be tested by a t-test: the b of the expected values (null hypothesis, i.e., 0) is subtracted from the b of the observed values and divided by the standard error of b:
t = (bobserved − bexpected) / SEb
Simple regression on SPSS (using the Record1.sav data)
Descriptive glance: scatterplot of the correlation between advertisement and record sales.
Predictor: how much money (in 1000s) you spend on advertisement.
What you want to predict: number of records (in 1000s) sold.
Output of simple regression on SPSS (using the Record1.sav data)
Analyze --> Regression --> Linear
• R is the simple Pearson correlation between 'advertisement' and 'records sold'; R² is the amount of explained variance.
• The ANOVA part of the output reports MSM, MSR, and SST (the sum of squares total).
• Regression coefficients b0 and b1: b0 is the intercept, where the regression line crosses the Y axis; when no money is spent (X = 0), 134,140 records are sold (b0 = 134.14). b1 = 0.09612 is the gradient: if the predictor X is increased by 1 unit (1000), then 96.12 extra records will be sold.
• t = B/SEB; for the intercept, 134.14/7.537 = 17.799.
A closer look at the t-values
What's wrong? Nothing; this is a rounding error. If you double-click on the output table "Coefficients", a more exact number is shown:
9.612E-02 = 0.09612448597388
.010 = 0.00963236621523
If you re-compute the equation with these numbers, the result is correct:
0.09612448597388 / 0.00963236621523 = 9.979
Using the model for prediction
Imagine the record company wants to spend £100,000 on advertisement. Using Equation 5.2, we can fit in the values of b0 and b1:
Yi = b0 + b1Xi = 134.14 + (0.09612 × Advertising Budgeti)
For example, if £100,000 is spent on ads, Yi = 134.14 + 0.09612 × 100 = 143.75, i.e., about 144,000 records sold. Is that a good deal?
3-D scatterplot
Under the menu 'Fit', specify the appropriate options. If adjusted appropriately, you can see the regression plane and the confidence planes almost like lines. The regression planes are chosen so as to cover most of the data points in the three-dimensional data cloud.
Sum of squares, R, R²
The terms we encountered for simple regression (SST, SSR, SSM) still mean the same, but are more complicated to compute now, with one outcome to predict and several predictors.
Example for using DFBeta as an indicator of an 'influential case' (using file dfbeta.sav)
• All data (including the outlier)   • Case 30 removed
DFBetas, DFFit, CVR's
All the following measures capture the difference between a model including and a model excluding influential cases:
• Standardized DFBeta: difference between a parameter estimated using all cases and estimated when one case is excluded, e.g., DFBetas of the parameters b0 and b1.
• Standardized DFFit: difference between the predicted values obtained with and without the case in question.
Residuals and influence statistics (using the file pubs.sav)
The correlation between the number of pubs in London districts and deaths, with and without the outlier.
Scatterplot of both variables: Graphs --> Interactive --> Scatterplot.
Note: the residual for the outlier, fitted to the regression line including it, is small. However, its influence statistics are huge.
Why? The outlier is the 'City of London' district, where there are a lot of pubs but only few residents live. The ones drinking in those pubs are visitors; hence the ratio of deaths of citizens, given the overall consumption of alcohol, is relatively low.
Case summary: 8 London districts
Case    St. Res.   Lever   St. DFFIT    St. DFB Interc   St. DFB Pubs
1       -1.34      0.04    -0.74        -0.74            0.37
2       -0.88      0.03    -0.41        -0.41            0.18
3       -0.42      0.02    -0.18        -0.17            0.07
4       0.04       0.02    0.02         0.02             -0.01
5       0.5        0.01    0.2          0.19             -0.06
6       0.96       0.01    0.4          0.38             -0.1
7       1.42       0       0.68         0.63             -0.12
8       -0.28      0.86    -4.60E+08    92676016         -4.30E+08
Total   8          8       8            8                8
The residual of the outlier #8 is small because it actually sits very close to the regression line, but its influence statistics are huge!
Excluding the outlier (pubs.sav)
If you create a variable "num_dist" (number of the district) in the variables list of the pubs.sav file and simply allocate a number to each district (1-8), you can use this variable to exclude the problematic district #8.
Data --> Select cases --> If condition is satisfied --> num_dist~=8
Excluding the outlier – continued (pubs.sav)
Look at the scatterplot again now that district #8 has been excluded:
Graphs --> Interactive --> Scatterplot
Will our sample regression generalize to the population? - continued
Assumptions include independent errors with constant variance, and that predictors and outcome have a linear relation.
Rules of thumb for required sample size, e.g.:
large effect --> n = 80 (for up to 20 predictors)
medium effect --> n = 200
small effect --> n = 600
(Multi-)Collinearity
If ≥ 2 predictors are inter-correlated, we speak of collinearity. In the worst case, 2 variables have a correlation of 1. This is bad for a regression, since the regression can no longer be computed reliably: the variables become interchangeable.
High collinearity is rare, but some degree of collinearity is always around.
Problems with collinearity:
• It underestimates the variance of a second variable if this variable is strongly intercorrelated with the first variable. The second variable adds little unique variance, although, taken by itself, it would explain a lot.
• We can't decide which variable is important.
Plot the standardized residuals (*ZRESID) against the standardized predicted values (*ZPRED) to check for heteroscedasticity and for 'random errors'. Heteroscedasticity occurs when the residuals at each level of the predictor variables have unequal variances.
Regression diagnostics
The
regression
diagnostics
are saved in
the data file,
each as a
separate
variable in a
new column
Options: leave them as they are.
Interpreting Multiple Regression
The 'Descriptives'
give you a brief
summary of the
variables
Interpreting Multiple Regression
• The model summary reports R (the Pearson correlation), R² (the variance explained by the predictors; here 3 predictors), and adjusted values similar to R² that indicate how well the model generalizes (values close to R² are good; here only 5% shrinkage). F-values for the R² change show whether the model(s) bring about a significant change.
• The ANOVA table tests the model against the basic model (the mean) and reports SSM, SSR, the degrees of freedom, F, and the significance level. Total df = number of cases minus 1 (200 − 1 = 199); residual df = number of cases minus the number of coefficients b0, b1 (200 − 2 = 198). F = MSM/MSR: 433687.833/4354.87 = 99.587 and 287125.806/2217.217 = 129.498.
• The 'Coefficients' table tells us the individual contribution of the variables (the coefficients b0, b1, b2, b3) to the regression model. The standardized Betas tell us the importance of each predictor.
• Partial correlation: the Pearson correlation of predictor × outcome, controlled for each single other predictor.
• Part correlation: the Pearson correlation of predictor × outcome, controlled for all other predictors (the 'unique relationship').
Excluded variables
Interpretation: if Ad increases 1 unit, sales increase 0.08 units; if airplay increases 1 unit, sales increase 3.37 units; if attract increases 1 unit, sales increase 11 units, independent of the contributions of the other predictors.
No multicollinearity: in this regression, the variables are not closely linearly related.
Measuring Epidemiological Outcomes
Incidence = (number of new cases during a time period) / (population at risk during that time period)
• Incidence is a rate
• Calculated for a given time period (time interval)
• Reflects risk of disease or condition
Prevalence = (number of existing cases) / (total number in the population at risk)
• Prevalence is a proportion
• Point prevalence: at a particular instant in time
• Period prevalence: during a particular interval of time (existing cases + new cases)
Prevalence = Incidence × Duration
Prevalence depends on the rate of occurrence (incidence)
AND the duration or persistence of the disease
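A minimal sketch of this relationship in R, with made-up numbers (the incidence matches the rare-disease example later in these slides; the duration is our own assumption):

incidence <- 0.005   # new cases per person-month
duration  <- 24      # average disease duration in months (hypothetical)
incidence * duration # approximate prevalence: 0.12, i.e., 12%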
Measurement “captures” the phenomenon
An example population (N = 200)
[Diagram: a population of 200 individuals, with O marking affected cases]
How can we quantify disease in populations?
[A series of population diagrams shows cases accumulating one by one.]
How can we quantify the frequency?
Rate of occurrence of new cases per unit time (e.g., 1 per month)
Month-by-month counts in the example population:
1 new case in month 1
1 new case in month 2
1 new case in month 3, for a total of 3 cases
2 new cases in month 4
1 new case in month 5 (total = 6)
1 case in month 6
1 new case in month 7
2 new cases in month 8
2 cases in month 9
Rate of occurrence of new cases during 9 months: 1 case/month to 2 cases/month
Number of cases depends on the length of the interval.
Number of cases depends on the population size.
How to estimate population-time?
[Population diagram]
What proportion of the population is affected after 1 month? (1/200)
After 2 months? (2/200)
After 3 months? (3/200)
After 4 months? (5/200)
After 5 months? 6/200 = 0.03 = 3% = 30 per 1,000
Incidence proportion (“cumulative incidence”)
44
Incidence rate versus incidence proportion
(rare disease, IR = 0.005 / month) (see spreadsheet at epidemiolog.net/studymat/)
45
Incidence rate versus incidence proportion
(common disease, IR = 0.1 / month)
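The two comparisons can be reproduced numerically. A minimal SAS sketch, assuming the standard constant-rate relation between an incidence rate and the incidence proportion, CI = 1 - exp(-IR x t); the data set and variable names are illustrative:
/* Incidence proportion implied by a constant incidence rate */
data ir_vs_ip;
  do t = 1 to 24;                    /* months of follow-up */
    ip_rare   = 1 - exp(-0.005*t);   /* rare disease, IR = 0.005/month */
    ip_common = 1 - exp(-0.1*t);     /* common disease, IR = 0.1/month */
    approx_rare = 0.005*t;           /* IR x t approximation, close when rare */
    output;
  end;
run;
proc print data=ir_vs_ip; run;
For the rare disease, IR x t tracks the incidence proportion closely; for the common disease the two diverge quickly, which is the point of the spreadsheet comparison.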
46
Case fatality rate
47
Mortality rate
Mortality rate = Number of deaths / (Population at risk × Time interval)
Annual mortality rate = Number of deaths / (Mid-year population × 1 yr)
48
Mortality rate (more notes)
Mortality rate = Number of deaths / (Population at risk × Time interval)
Annual mortality rate = Number of deaths / Mid-year population
49
Mortality rates versus incidence rates
• Mortality data are more generally available
• Fatality reflects many factors, so mortality
rates may not be a good surrogate of incidence
rates
• Death certificate cause of death not always
accurate or useful
50
Prevalence – another important proportion
51
1 new case, 1 death
[Slides 51-55 show the population diagram again, with cases marked O and deaths removed from the population]
52
1 new case, 1 new death
53
2 new cases, no deaths
54
2 new cases, 1 new death
55
What is the prevalence? (9 / 197)
56
Fine points . . .
• Who is "at risk"?
• Endometrial cancer? Prostate cancer?
Breast cancer?
• Only women who have not had a
hysterectomy?
“Could” develop the condition + “would” be
counted.
57
More fine points
• Age?
• Immunity?
• Genetically susceptible?
58
More fine points . . .
59
Fine points . . .
• Importance of stating units and scaling
unless they are clear from the context
– e.g., 120 per 100,000 person-years =
10 per 100,000 person-months
– Hazards from lack of clarity
60
“You can never, never take anything
for granted.”
Noel Hinners, vice president for flight
systems at Lockheed Martin Astronautics in
Denver, concerning the loss of the Martian
Climate Orbiter due to the Lockheed Martin
spacecraft team’s having reported
measurements in English units while the
orbiter’s navigation team at the Jet
Propulsion Laboratory (JPL) in Pasadena,
California assumed the measurements
were in metric units.
61
Relation of incidence and prevalence
62
[Flow diagram: Population at risk → Existing cases → Deaths, cures, etc.]
63
Incidence, prevalence, duration of hospitalization
Remote community of 101,000 people
One hospital, patient census = 1,000
Steady state
500 admissions per week
Prevalence = 1,000 / 101,000 = 9.9 / 1,000
IR = 500 / 100,000 = 5 / 1,000 / week
Duration = Prevalence / IR ≈ 2 weeks
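The slide's arithmetic can be checked in a short data step; a sketch with the numbers above (variable names are mine):
data _null_;
  pop = 101000; census = 1000; admits = 500;   /* admissions per week */
  prevalence = census / pop;                   /* 9.9 per 1,000 */
  ir = admits / (pop - census);                /* 5 per 1,000 per week */
  duration = prevalence / ir;                  /* about 2 weeks */
  put prevalence= ir= duration=;
run;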
64
Relation of incidence and prevalence
65
Standardization
• When objective is comparability, need to adjust
for different distributions of other determinants
• Strategy:
• Analyze within each subgroup (stratum)
• Take a weighted average across strata
• Use same weights for all populations
(See the Evolving Text on www.epidemiolog.net)
66
Familiar example of weighted averages
• Liters of petrol per kilometer - differs for Interstate
(0.050 LpK) and non-Interstate (0.100 LpK) driving.
• To compare different cars, can:
• Compare them for each type of driving separately
(stratified analysis)
• Average for each car, using one set of weights
(e.g., 80% Interstate, 20% non-Interstate)
• E.g. = 0.80 x 0.050 LpK + 0.20 x 0.100 LpK = 0.060 LpK
67
Comparing a Subaru and a Mazda
Juan drives a Subaru 800 km on Interstate highways
and 200 km on other roads. His car uses 0.050 LpK
on Interstates and 0.100 LpK on other roads, for a
total of 60 liters of petrol, an average of 0.060 LpK
(60 L / 1000 km). His overall LpK can be expressed
as a weighted average:
(800/1000) x 0.050 LpK + (200/1000) x 0.100 LpK
= 0.80 x 0.050 LpK + 0.20 x 0.100 LpK = 0.060 LpK
68
Comparing a Subaru and a Mazda
Shizu drives her Mazda on a different route, with
only 200 km on Interstate and 800 km on other
roads. She uses 0.045 LpK on Interstate highways
and 0.080 LpK on non-Interstate. She uses a total
of 73 liters, or 0.073 LpK. Her overall LpK can be
expressed as a weighted average:
(200/1,000) x 0.045 LpK + (800/1,000) x 0.080 LpK
= 0.20 x 0.045 LpK + 0.80 x 0.080 LpK = 0.073 LpK
69
How can we compare their fuel efficiency?
             Juan              Shizu
             Km      LpK       Km      LpK
Interstate   800     0.050     200     0.045
Other        200     0.100     800     0.080
Total        1,000   0.060     1,000   0.073
70
Total fuel efficiency is not comparable
because weights are different
             Juan              Shizu
             %       LpK       %       LpK
Interstate   80      0.050     20      0.045
Other        20      0.100     80      0.080
Total        100%    0.060     100%    0.073
71
By adopting a "standard" set of weights we
can compare fairly
               Juan              Shizu
               %       LpK       %       LpK
Interstate     60      0.050     60      0.045
Other          40      0.100     40      0.080
Total          100     0.060     100     0.073
Standardized           0.070             0.059
72
Comparing a Subaru and a Mazda
• Juan's Subaru:
= 0.60 x 0.050 LpK + 0.40 x 0.100 LpK = 0.070 LpK
• Shizu's Mazda:
= 0.60 x 0.045 LpK + 0.40 x 0.080 LpK = 0.059 LpK
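A small data-step sketch of the standardization, using the common weights and the stratum-specific rates from the slides:
data _null_;
  w_int = 0.60; w_oth = 0.40;               /* standard weights */
  juan_std  = w_int*0.050 + w_oth*0.100;    /* 0.070 LpK */
  shizu_std = w_int*0.045 + w_oth*0.080;    /* 0.059 LpK */
  put juan_std= shizu_std=;
run;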
                        True state
Sample decision    H0                       H1
H0                 1-α (correct)            β-error (Type II error)
H1                 α-error (Type I error)   1-β (power)
Distribution & Probability
If we know something about the distribution of events, we know
something about the probability of these events
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$
Standardised normal distribution
Population: $z = \frac{x - \mu}{\sigma}$    Sample: $z_i = \frac{x_i - \bar{x}}{s}$    (so that $\bar{x}_z = 0$, $s_z = 1$)
• the z-score represents a value on the x-axis for which we know the
p-value
$t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}}$, where $s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
$p_c = \frac{p}{n}$
ANOVA controls this error by testing all means at once - it can compare k
means. Drawback = loss of specificity
F-tests / Analysis of Variance (ANOVA)
Different types of ANOVA depending upon experimental
design (independent, repeated, multi-factorial)
Assumptions
• observations within each sample were independent
• samples must be normally distributed
• samples must have equal variances
F-tests / Analysis of Variance (ANOVA)
Total variability
Mean number of crackers eaten:
           Empty   Full
Normal     22      15     = 37
Obese      17      18     = 35
           = 39    = 33
Result: no main effect for factor A (normal/obese);
no main effect for factor B (empty/full)
[Line plot of the cell means (y-axis 14-23 crackers, x-axis Empty/Full stomach):
the obese line rises from Empty to Full while the normal line falls, an interaction
despite the absent main effects]
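Were the cracker data in a SAS data set, the two-factor analysis would be a single PROC GLM step. A sketch, assuming a hypothetical data set crackers with variables weight (normal/obese), stomach (empty/full), and eaten:
proc glm data=crackers;
  class weight stomach;
  model eaten = weight stomach weight*stomach;  /* main effects + interaction */
run;
With the cell means above, the weight*stomach interaction term is what the F-test would flag.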
F-tests / Analysis of Variance (ANOVA)
Application to imaging…
F-tests / Analysis of Variance (ANOVA)
Application to imaging…
Early days => subtraction methodology => T-tests corrected for multiple comparisons
The geometric mean is the positive
number x such that:
A / x = x / B
To find the geometric mean:
Example: Find the geometric mean
between 4 and 20.
4 / x = x / 20
x2 = 80
x = √80 = 4√5
What if you know the geometric mean?
Example: 6 is the geometric mean
between 9 and what other number?
x / 6 = 6 / 9, so x = 36 / 9 = 4 (no square root needed!)
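SAS also offers a GEOMEAN function (assuming SAS 9 or later); a quick check of both worked examples:
data _null_;
  g1 = geomean(4, 20);   /* sqrt(80) = 4*sqrt(5), about 8.94 */
  g2 = 6**2 / 9;         /* the number whose geometric mean with 9 is 6 */
  put g1= g2=;           /* g2 = 4 */
run;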
Research: Hypothesis
Definition
the word hypothesis is derived from the Greek words
• "hypo" means under
• "tithemi" means place
Working hypothesis
The working or trial hypothesis is provisionally adopted to explain the
relationship between some observed facts for guiding a researcher in the
investigation of a problem.
A statement constituting a trial or working hypothesis is to be tested and
confirmed, modified, or even abandoned as the investigation proceeds.
Typologies
Null hypothesis
A null hypothesis is formulated against the working hypothesis; it opposes the
statement of the working hypothesis
....it is contrary to the positive statement made in the working hypothesis;
it is formulated so that disproving it lends support to the working hypothesis
When a researcher rejects a null hypothesis, he/she thereby supports the working
hypothesis
Null hypothesis (Ho): Population size does not have any influence on the number of
bank branches in a town.
                       Decision
True state        Accept H0            Reject H0
H0 true           Correct decision     Type I error
H0 false          Type II error        Correct decision
Summary statistics for the new vs. standard treatment: $\bar{y}_{New}, \bar{y}_{Std}$; $s_{New}, s_{Std}$; $n_{New}, n_{Std}$
Sampling Distribution of Difference in Means
$Z = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \sim N(0,1)$
• $\sigma_1^2$ and $\sigma_2^2$ are unknown and estimated by $s_1^2$ and $s_2^2$
Example - Efficacy Test for New drug
• Type I error - Concluding that the new drug is better than the
standard (HA) when in fact it is no better (H0). Ineffective drug is
deemed better.
– Traditionally α = P(Type I error) = 0.05
T.S.: $z_{obs} = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
R.R.: $z_{obs} \geq z_{\alpha}$
P-value: $P(Z \geq z_{obs})$
$\bar{y}_1 = 10.1$, $s_1 = 3.6$, $n_1 = 33$;  $\bar{y}_2 = 7.7$, $s_2 = 3.4$, $n_2 = 35$
Source: Wissel, et al (2001)
Example - Botox for Cervical Dystonia
Test whether Botox A produces lower mean Tsui
scores than placebo (α = 0.05)
• $H_0: \mu_1 - \mu_2 = 0$
• $H_A: \mu_1 - \mu_2 > 0$
• T.S.: $z_{obs} = \frac{10.1 - 7.7}{\sqrt{\frac{(3.6)^2}{33} + \frac{(3.4)^2}{35}}} = \frac{2.4}{0.85} = 2.82$
• R.R.: $z_{obs} \geq z_{\alpha} = z_{.05} = 1.645$
• P-val: $P(Z \geq 2.82) = .0024$
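The statistic and p-value are easy to verify in a data step, using the summary numbers above:
data _null_;
  z = (10.1 - 7.7) / sqrt(3.6**2/33 + 3.4**2/35);
  p = 1 - probnorm(z);   /* one-sided upper-tail p-value */
  put z= p=;             /* z ~ 2.82, p ~ .0024 */
run;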
· Example:
· $H_0: \mu_1 - \mu_2 = 0$   $H_A: \mu_1 - \mu_2 > 0$
• $\sigma_1^2 = \sigma_2^2 = 25$,  $n_1 = n_2 = 25$
· Decision Rule: Reject H0 (at α=0.05 significance level) if:
$z_{obs} = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \geq 1.645 \;\Rightarrow\; \bar{y}_1 - \bar{y}_2 \geq 2.326$
Power of a Test
• Now suppose in reality that μ1-μ2 = 3.0 (HA is true)
• Power now refers to the probability we (correctly)
reject the null hypothesis. Note that the sampling
distribution of the difference in sample means is
approximately normal, with mean 3.0 and standard
deviation (standard error) 1.414.
• Decision Rule (from last slide): Conclude population
means differ if the sample mean for group 1 is at least
2.326 higher than the sample mean for group 2
• Power for this case can be computed as:
$\text{Power} = P(\bar{Y}_1 - \bar{Y}_2 \geq 2.326) = P\left(Z \geq \frac{2.326 - 3}{1.414} = -0.48\right) = .6844$
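The same power calculation in a data step (numbers from the example above):
data _null_;
  se = sqrt(25/25 + 25/25);               /* standard error = 1.414 */
  cutoff = 1.645*se;                      /* 2.326 on the difference scale */
  power = 1 - probnorm((cutoff - 3)/se);  /* true difference = 3 */
  put power=;                             /* ~ .684 */
run;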
$\delta = \frac{\mu_1 - \mu_2}{\sigma}$
• Step 2 - Choose the desired power to detect the clinically
meaningful difference (1-β, typically at least .80). For 2-sided test:
$n_1 = n_2 = \frac{2(z_{\alpha/2} + z_{\beta})^2}{\delta^2}$
Example - Rosiglitazone for HIV-1
Lipoatrophy
• Trts - Rosiglitazone vs Placebo
• Response - Change in Limb fat mass
• Clinically Meaningful Difference - 0.5 (std dev’s)
• Desired Power - 1-β = 0.80
• Significance Level - α = 0.05
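Plugging into the formula, n1 = n2 = 2(1.96 + 0.84)²/0.5² ≈ 63 per group. If SAS/STAT is available, PROC POWER performs the same computation; a sketch:
proc power;
  twosamplemeans test=diff
    meandiff  = 0.5     /* clinically meaningful difference, in SD units */
    stddev    = 1
    alpha     = 0.05
    power     = 0.80
    npergroup = .;      /* solve for the per-group sample size */
run;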
$\bar{Y}_1 - \bar{Y}_2 \sim N\left(\mu_1 - \mu_2,\; \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)$
• Thus, we can expect (with 95% confidence) that our sample
mean difference lies within 2 standard errors of the true difference
(1-α)100% Confidence Interval for μ1-μ2
$(\bar{y}_1 - \bar{y}_2) \pm z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
• Standard level of confidence is 95% (z.025 = 1.96 ≈ 2)
• (1-α)100% CI's and 2-sided tests reach the same
conclusions regarding whether μ1-μ2 = 0
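Applied to the Tsui-score summary data above, the 95% interval is roughly 2.4 ± 1.96(0.85) = (0.73, 4.07); a data-step sketch:
data _null_;
  diff = 10.1 - 7.7;
  se = sqrt(3.6**2/33 + 3.4**2/35);
  lower = diff - 1.96*se;   /* ~ 0.73 */
  upper = diff + 1.96*se;   /* ~ 4.07 */
  put lower= upper=;
run;
Since the interval excludes 0, it agrees with the 2-sided test's rejection of μ1-μ2 = 0.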
Example - Viagra for ED
• Comparison of Viagra (Group 1) and Placebo (Group 2)
for ED
• Data pooled from 6 double-blind trials
• Subjects - White males
• Response - Percent of successful intercourse attempts in
past 4 weeks (Each subject reports his own percentage)
$\sigma_{(\bar{x}_1 - \bar{x}_2)} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
Test of hypothesis for difference of two population means
(two tailed and large sample)
1) Hypothesis: D0 is some specified difference that you
wish to test. For many tests, you will wish to
hypothesize that there is no difference between two
means, that is D0=0
$H_0: \mu_1 - \mu_2 = D_0$
$H_a: \mu_1 - \mu_2 \neq D_0$
2) Test statistic: large sample case
$z_{obs} = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sigma_{(\bar{x}_1 - \bar{x}_2)}} = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
1) Hypothesis:
$H_0: \mu_1 - \mu_2 = D_0$
$H_a: \mu_1 - \mu_2 > D_0$ or $H_a: \mu_1 - \mu_2 < D_0$
2) Test statistic: large sample case
$z_{obs} = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sigma_{(\bar{x}_1 - \bar{x}_2)}} = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
Test the hypothesis that the salaries are less for faculty in
public school with 5% significance level
In small sample case, the sampling
distribution of the difference between two
means is the t-distribution with mean
$\mu_{(\bar{x}_1 - \bar{x}_2)} = \mu_1 - \mu_2$
1) Hypothesis:
$H_0: \mu_1 - \mu_2 = D_0$
$H_a: \mu_1 - \mu_2 \neq D_0$
$\bar{x}_1 = 57.48$, $\bar{x}_2 = 66.39$, $s_1 = 9$, $s_2 = 9.5$
Test the hypothesis that the salaries are the same for faculty in
public and private school with 5% significance level
Test of hypothesis for binomial proportion
1) Hypothesis: $H_0: p = p_0$
Two-tailed: $H_a: p \neq p_0$
Test of hypothesis for difference in binomial proportions
1) Hypothesis:
$H_0: p_1 - p_2 = D_0$;  $H_A: p_1 - p_2 \neq D_0$ (or $>$, $<$ for one-tail tests)
2) Test statistic:
$z_{obs} = \frac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}}$
Example:
60 students are polled and an average of 72 is observed with a
standard deviation of 10; what is the p-value of the test of
whether the population average is 75?
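A sketch of the answer as a large-sample one-sample z test: z = (72 - 75)/(10/√60) ≈ -2.32, so the two-sided p-value is about .020.
data _null_;
  z = (72 - 75) / (10/sqrt(60));   /* ~ -2.32 */
  p = 2*probnorm(-abs(z));         /* two-sided p ~ .020 */
  put z= p=;
run;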
Power of a statistical test
- Power = P(reject the null hypothesis when it is false) = 1-β
- (1-α) is the probability we accept the null when it is in
fact true
- (1-β) is the probability we reject the null when it is in fact
false - this is the power of the test.
- You would prefer to have a larger power.
- The power changes depending on what the actual
population parameter is.
Correlation
[Quadrant diagram: the sign of (x - x̄)(y - ȳ) around the point (x̄, ȳ): positive where both deviations share a sign, negative where they differ]
$\frac{\sum (x - \bar{x})(y - \bar{y})}{n} = S_{xy} = \text{covariance}$
Pearson's product Moment Correlation Coefficient (PMCC)
standardises the covariance so that it
can be interpreted easily. It converts the
covariance to a number between -1 and 1,
where:
• -1 is a perfect negative correlation
• 1 is a perfect positive correlation
• 0 is no correlation
(Karl Pearson, 1857 - 1936)
$r = \frac{\frac{1}{n}\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left(\frac{\sum (x - \bar{x})^2}{n}\right)\left(\frac{\sum (y - \bar{y})^2}{n}\right)}} = \frac{\frac{1}{n}\sum (x - \bar{x})(y - \bar{y})}{S_x S_y}$
The Kruskal-Wallis H Test
Test statistic: $H = \frac{12}{n(n+1)} \sum \frac{T_i^2}{n_i} - 3(n+1)$
$= \frac{12}{16(17)}\left(\frac{31^2 + 35^2 + 15^2 + 55^2}{4}\right) - 3(17) = 8.96$
Rejection region: For a right-tailed chi-square test with α =
.05 and df = 4 - 1 = 3, reject H0 if H ≥ 7.81.
Reject H0. There is sufficient evidence to indicate that there
is a difference in test scores for the four teaching techniques.
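In SAS, the Kruskal-Wallis test comes from PROC NPAR1WAY with the WILCOXON option when the CLASS variable has more than two levels. A sketch, assuming a hypothetical data set scores with variables method (the four teaching techniques) and score:
proc npar1way data=scores wilcoxon;
  class method;   /* four teaching techniques */
  var score;      /* test scores */
run;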
Key Concepts
I. Nonparametric Methods
These methods can be used when the data cannot be measured on a
quantitative scale, or when
• The numerical scale of measurement is arbitrarily set by the
researcher, or when
• The parametric assumptions such as normality or constant
variance are seriously violated.
Key Concepts
Kruskal-Wallis H Test: Completely Randomized Design
1. Jointly rank all the observations in the k samples (treat as one
large sample of size n say). Calculate the rank sums, Ti = rank
sum of sample i, and the test statistic
$H = \frac{12}{n(n+1)} \sum \frac{T_i^2}{n_i} - 3(n+1)$
Data (two groups):
Group 1: 5, 4, 4, 3, 5, 4, 5
Group 2: 3, 2, 4, 1, 2, 1
Test statistic U = 15.5; probability of H0 being true p = 0.03
[Flowchart: Is p above the critical level? Yes: accept H0; No: reject H0]
This particular test:
The Mann-Whitney U test is a non-parametric
test which examines whether 2 columns of data
could have come from the same population (ie
“should” be the same)
It generates a test statistic called U (no idea why
it’s U). By hand we look U up in tables; PCs
give you an exact probability.
It requires 2 sets of data - these need not be
paired, nor need they be normally distributed,
nor need there be equal numbers in each set.
How to do it
1: rank all data into ascending order, then re-code the data
set replacing raw data with ranks.
2: harmonize ranks where the same value occurs more than
once.
The simple answer is to run the Kruskal-Wallis test. This is run on a PC,
but behaves very much like the M-W U. It will give one significance
value, which simply tells you whether at least one group differs from one
other.
Males vs. Females: do males differ from females?
Site 1, Site 2, Site 3: do results differ between these sites?
Your coursework:
I will give each of you a sheet with data collected from 3 sites. (Don’t
try copying – each one is different and I know who gets which dataset!).
[Screenshot: use the sort button to sort the data in ascending order]
Mann-Whitney U test
S. male                     P. male
rank    thorax width        rank    thorax width
3       2.6                 3       2.6
6       2.65                3       2.6
8.5     2.7                 3       2.6
13.5    2.85                3       2.6
13.5    2.85                8.5     2.7
15.5    2.9                 8.5     2.7
17      3                   8.5     2.7
18      3.2                 11.5    2.8
19      3.85                11.5    2.8
20      4                   15.5    2.9
• $Z = \frac{T^+ - \mu_{T^+}}{\sqrt{\text{Var}(T^+)}} = \frac{T^+ - N(N+1)/4}{\sqrt{N(N+1)(2N+1)/24}}$
• If Z > 1.96 then P < 0.05
• reject null hypothesis
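A data-step sketch of this normal approximation, with hypothetical values T+ = 80 and N = 15:
data _null_;
  N = 15; Tplus = 80;           /* hypothetical signed-rank inputs */
  mu  = N*(N+1)/4;              /* 60 */
  var = N*(N+1)*(2*N+1)/24;     /* 310 */
  z = (Tplus - mu)/sqrt(var);   /* ~ 1.14, below 1.96, so P > 0.05 here */
  put z=;
run;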
MEASURES OF
DISPERSION
Measures of Dispersion
• While measures of central tendency indicate what value
of a variable is (in one sense or other) “average” or
“central” or “typical” in a set of data, measures of
dispersion (or variability or spread) indicate (in one
sense or other) the extent to which the observed values
are “spread out” around that center — how “far apart”
observed values typically are from each other and
therefore from some average value (in particular, the
mean). Thus:
– if all cases have identical observed values (and thereby are also
identical to [any] average value), dispersion is zero;
– if most cases have observed values that are quite “close
together” (and thereby are also quite “close” to the average
value), dispersion is low (but greater than zero); and
– if many cases have observed values that are quite “far away”
from many others (or from the average value), dispersion is high.
• A measure of dispersion provides a summary statistic
that indicates the magnitude of such dispersion and, like
a measure of central tendency, is a univariate statistic.
Importance of the Magnitude of
Dispersion Around the Average
• Dispersion around the mean test score.
       WEIGHT             HEIGHT
SD     30 pounds          4 inches
       13.6 kilograms     .33 feet
       .015 tons          10.2 centimeters
Which variable [WEIGHT or HEIGHT] has greater dispersion? [No meaningful answer can
be given]
Which variable has greater dispersion relative to its average, e.g., greater Coefficient of
Variation (SD relative to mean)?
Note that the Coefficient of Variation is a pure number, not expressed in any units, and is
the same whatever units the variable is measured in.
Coefficient of Variation
• Conditions of Applicability:
– One group of subjects
– Comparing to population with known mean and variance.
[Sampling-distribution diagrams, p(X̄) with standard error σ_X̄: sample means far from μ are implausible under H0; means near μ are fairly to highly plausible]
Plausibility of the null hypothesis
$z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$
[Standard normal curve: the p-value is the tail area beyond the observed z]
Results for 2005 Toronto Marathon
General female contestant population: n = 420, μ = 4 hr 16 min = 256 min, σ = 33 min
Sample of young women: n = 38, X̄ = 4 hr 9 min = 249 min
Actual Situation
$z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$
Null hypothesis: marathon times for young women are the same
as for the general female contestant population.
$H_0: \mu = \mu_0$
$H_A: \mu < \mu_0$
α = .05, $z_{\alpha} = -1.65$
Step 5. Calculate the Test Statistic
$z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$
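With the marathon numbers above, σ_X̄ = 33/√38 ≈ 5.35 and z = (249 - 256)/5.35 ≈ -1.31, which does not reach -1.65; a data-step sketch:
data _null_;
  se = 33/sqrt(38);     /* standard error of the mean, ~5.35 */
  z = (249 - 256)/se;   /* ~ -1.31 */
  p = probnorm(z);      /* lower-tail p ~ .095: do not reject H0 */
  put z= p=;
run;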
Canadian Adult Female Population: μ = 162.10 cm, σ = 6.55 cm
Sample: Female students enrolled in PSYC 6130C 2008-09
• Random sampling
• Variable is normal
– CLT: Deviations from normality ok as long as sample is large.
If H0 is true, P(reject) = α; if H0 is false, P(reject) = 1-β.
Nonparametric Statistics
2. Difficult to Compute by
hand for Large Samples
3. Tables Not Widely Available
Wilcoxon Rank Sum Test
Solution
• H0: Identical Distributions
• Ha: Shifted Left or Right
• α = .10
• n1 = 4, n2 = 5
• Critical Value(s), Test Statistic, Decision, Conclusion: [left blank on the slide template]
Wilcoxon Rank Sum
Table 12 (Rosner) (Portion)
α = .05 two-tailed
[Progressive slide build; the completed ranking is shown below, with the tied 82s sharing rank 3.5]
Factory 1               Factory 2
Rate    Rank            Rate    Rank
71      1               85      5
82      3.5             82      3.5
77      2               94      8
92      7               97      9
88      6               ...     ...
Rank Sum   19.5                 25.5
y = a + bx
y = a + bx
• y is the dependent variable
• x is the independent variable
• a is a constant
• b is the slope of the line
• For every increase of 1 in x, y changes by an amount equal
to b
• Some relationships are perfectly linear and fit this equation exactly;
your cell phone bill, for instance, may be a perfect linear function of usage.
Other relationships hold only on average, with variation the equation does
not explain. On average, you might have an equation like:
Weight = -222 + 5.7*Height
If you take a sample of actual heights and weights, you might see
something like the graph to the right.
[Scatter plots: Weight (lb) versus Height (in), with the fitted regression line]
The line in the graph shows the average relationship described by
the equation. Often, none of the actual observations lie on the line.
The difference between the line and any individual observation is
the error. The new equation is:
Weight = -222 + 5.7*Height + error
• Multicollinearity
• Omitted Variables
• Endogeneity
• Other
Multicollinearity
Multicollinearity occurs when one or more of your independent variables
are related to one another. The coefficient for each independent variable
shows how much an increase of one in its value will change the dependent
variable, holding all other independent variables constant. But what if you
cannot hold them constant? If you have two houses that are exactly the
same, and you add a bedroom to one of them, the value of the house may
go up by, say, $10,000. But you have also added to its square footage.
How much of that $10,000 is a result of the extra bedroom and how much
is a result of the extra square footage? If the variables are very closely
related, and/or if you have only a small number of observations, it can be
difficult to separate these effects. Your regression gives you the
coefficients that best describe your set of data, but the independent
variables may not have a good p-value if multicollinearity is present.
Sometimes it may be appropriate to remove a variable that is related to
others, but it may not always be appropriate. In the home value example,
both the number of bedrooms and the square footage are important on
their own, in addition to whatever combined effects they may have.
Removing them may be worse than leaving them in. This does not
necessarily mean that the model as a whole is hurt, but it may mean that
the model should not be used to draw conclusions about the relationship of
individual independent variables with the dependent variable.
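A common diagnostic is the variance inflation factor, available as a MODEL option in PROC REG; a sketch with a hypothetical home-value data set:
proc reg data=homes;
  model price = bedrooms sqft / vif tol;  /* VIF well above ~10 signals collinearity trouble */
run;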
Omitted Variables
If independent variables that have significant relationships with the
dependent variable are left out of the model, the results will not be as
good as if they are included. In the home value example, any real
estate agent will tell you that location is the most important variable of
all. But location is hard to measure. Locations are more or less
desirable based on a number of factors. Some of them, like population
density or crime rate, may be measurable factors that can be included.
Others, like perceived quality of the local schools, may be more difficult.
You must also decide what level of specificity to use. Do you use the
crime rate for the whole city, a quadrant of the city, the zip code, the
street? Is the data even available at the level of specificity you want to
use? These factors can lead to omitted variable bias… variance in the
error term that is not random and that could be explained by an
independent variable that is not in the model. Such bias can distort the
coefficients on the other independent variables, as well as decreasing
the R2 and increasing the Significance F. Sometimes data just isn’t
available, and some variables aren’t measurable. There are methods
for reducing the bias from omitted variables, but it can’t always be
completely corrected.
Endogeneity
Regression measures the effect of changes in the
independent variable on the dependent variable.
Endogeneity occurs when that relationship is either
backwards or circular, meaning that changes in the
dependent variable cause changes in the independent
variable. In the home value example, we had discussed
earlier that the perceived quality of the local schools might
affect home values. But the perceived quality is likely also
related to the actual quality, and the actual quality is at
least partially a result of funding levels. Funding levels are
often related to the property tax base, or the value of local
homes. So… good schools increase home values, but high
home values also improve schools. This circular
relationship, if it is strong, can bias the results of the
regression. There are strategies for reducing the bias if
removing the endogenous variable is not an option.
Others
There are several other types of biases that can
exist in a model for a variety of reasons. As with
the types already described, there are tests to
measure the levels of bias, and there are
strategies that can be used to reduce it.
Eventually, though, one may have to accept a
certain amount of bias in the final model,
especially when there are data limitations. In that
case, the best that can be done is to describe the
problem and the effects it might have when
presenting the model.
The 136 System Model Regression Equation
Local Revenue
per Pupil = -236 (y-intercept)
+ .0041 x County-area Property per Pupil
+ .0032 x System Unshared Property per Pupil
+ .0202 x County-area Sales per Pupil
+ .0022 x System Unshared Sales per Pupil
+ .0471 x System State-shared Taxes per Pupil
+ 296 x [County-area Commercial, Industrial,
Utility and Business Personal Property
Assessment ÷ Total Assessment]
+ 327 x [System Commercial, Industrial, Utility
and Business Personal Property
Assessment ÷ Total Assessment]
+ .0209 x County-area Median Household Income
+ -795 x System Child Poverty Rate
A Step by Step Guide to
Learning SAS
1
Objective
• Familiarize yourselves with the SAS
programming environment and language.
• Learn how to create and manipulate data
sets in SAS and how to use existing data
sets outside of SAS.
• Learn how to conduct a regression
analysis.
• Learn how to create simple plots to
illustrate relationships.
2
LECTURE OUTLINE
• Getting Started with SAS
• Elements of the SAS program
• Basics of SAS programming
• Data Step
• Proc Reg and Proc Plot
• Example
• Tidbits
• Questions/Comments
3
Getting Started with SAS
1.1 Windows or Batch Mode?
1.1.1 Pros and Cons
1.1.2 Windows
1.1.3 Batch Mode
Reference:
www.cquest.utoronto.ca/stats/sta332s/sas.html
4
1.1.1 Pros and Cons
Windows:
Pros:
• SAS online help available.
• You can avoid learning any Unix commands.
• Many people like to point and click.
Cons:
• SAS online help is incredibly annoying.
• Possibly very difficult to use outside CQUEST
lab.
• Number of windows can be hard to manage.
5
1.1.1 cont’d…
Batch Mode:
Pros:
• Easily usable outside CQUEST labs.
• Simpler to use if you are already familiar with
Unix.
• Established Unix programs perform most tasks
better than SAS's built-in utilities.
Cons:
• Can't access SAS's online help.
• Requires some basic knowledge of Unix.
6
1.1.2 Windows
• You can get started using either of these
two ways:
1. Click on Programs at the top left of the
screen and select
CQUEST_APPLICATIONS and then sas.
2. In a terminal window type: sas
8
1.2 SAS Help
• If you are running SAS in a window environment then
there is online SAS Help available.
• How is it helpful?
You may want more information about a command or
some other aspect of SAS than what you remember from
today or that is in this guide.
• How to access SAS Help?
1. Click on the Help button in task bar.
2. Use the menu command – Online documentation
• There are three tabs: Contents, Index and Find
9
1.3 SAS Run
• If you are running SAS in a window
environment then simply click on the Run
Icon. It’s the icon with a picture of a
person running!
• For Batch mode, simply type the
command: sas filename.sas
10
Elements of the SAS Software
2.1 SAS Program Editor: Enhanced Editor
2.2 Important SAS Windows: Log and
Output Windows
2.3 Other SAS Windows: Explorer and
Results Windows
11
2.1 SAS Program Editor
• What is the Enhanced Editor Window?
This is where you write your SAS programs. It will contain
all the commands to run your program correctly.
• What should be in it?
All the essentials to SAS programming such as the
information on your data and the required steps to
conduct your analysis as well as any comments or titles
should be written in this window (for a single problem).
See Section 3-6.
• Where should I store the files?
In your home directory. SAS will read and save files
directly from there.
12
2.2 Log and Output Windows
• How do you know whether your program is
syntactically correct?
Check the Log window every time you run a
program to check that your program ran
correctly – at least syntactically. It will indicate
errors and also provide you with the run time.
• You ran your program but where’s your output?
There is an output window which uses the
extension .lst to save the file.
If something went seriously wrong – evidence will
appear in either or both of these windows.
13
2.3 Other SAS Windows
• There are two other windows that SAS executes
when you start it up: Results and Explorer
Windows
• Both of these can be used as data/file
management tools.
• The Results Window helps to manage the
contents of the output window.
• The SAS Explorer is a kind of directory
navigation tool. (Useful for heavy SAS users).
14
Basics of SAS Programming
3.1 Essentials
3.1.1 A program!
3.1.2 End of a command line/statement
3.1.3 Run Statement
3.2 Extra Essentials
3.2.1 Comments
3.2.2 Title
3.2.3 Options
3.2.4 Case (in)sensitivity
15
3.1 Essentials
of SAS Programming
3.1.1 Program
• You need a program containing some
SAS statements.
• It should contain one or more of the
following:
1) data step: consists of statements that
create a data set
2) proc step: used to analyze the data
16
3.1 cont’d…
3.1.2 End of a command line or statement
• Every statement must end with a semi-colon (;). Each
statement should be on a new line.
• This is a very common mistake in SAS programming –
so check very carefully to see that you have placed a ; at
the end of each statement.
3.1.3 Run command or keyword
• In order to run the SAS program, type the command:
run; at the end of the last data or proc step.
• You still need to click on the running man in order to
process the whole program.
17
3.2 Extra Essentials
of SAS Programming
3.2.1 Comments
• In order to put comments in your SAS
program (which are words used to explain
what the program is doing but not which
SAS is to execute as commands), use /*
to start a comment and */ to end a
comment. For example,
/* My SAS commands go here. */
18
3.2 cont’d…
3.2.2 Title
• To create a SAS title in your output, simply type the
command:
Title ‘Regression Analysis of Crime Data’;
• If you have several lines of titles or titles for different
steps in your program, you can number the title
command. For example,
Title1 ‘This is the first title’;
Title2 ‘This is the second title’;
• You can use either single quotes or double quotes. Do
not use contractions in your title such as don't, or the
apostrophe will be confused with the closing quotation mark.
19
3.2 cont’d…
3.2.3 Options
• There is a statement which allows you to control
the line size and page size. You can also
control whether you want the page numbers or
date to appear. For example,
options nodate nonumber ls=78 ps=60;
3.2.4 Case (in)sensitivity
• SAS is not case sensitive. So please don’t use
the same name - once with capitals and once
without, because SAS reads the word as the
same variable name or data set name.
20
4. Data Step
• 4.1 What is it?
• 4.2 What are the ingredients?
• 4.3 What can you do within it?
• 4.4 Some Basic Examples
• 4.5 What can you do with it?
• 4.6 Some More Examples
21
4.1 What is a Data Step?
• A data step begins by setting up the data set. It
is usually the first big step in a SAS program that
tells SAS about the data.
• A data statement names the data set. It can
have any name you like as long as it starts with
a letter and has no more than eight characters of
numbers, letters or underscores.
• A data step has countless options and
variations. Fortunately, almost all your DATA
sets will come prepared so there will be little or
no manipulation required.
22
4.2 Ingredients of a Data Step
4.2.1 Input statement
• INPUT is the keyword that defines the names of the
variables. You can use any name for the variables as
long as it is at most 8 characters long.
• Variables can be either numeric or character (also called
alphanumeric). SAS will assume that variables are
numeric unless specified. To assign a variable name to
have a character value use the dollar sign $.
4.2.2 Datalines statement (internal raw data)
• This statement signals the beginning of the lines of data.
• A ; is placed both at the end of the datalines statement
and on the line following the last line of data.
• Spacing in data lines does matter.
23
4.2 cont’d…
4.2.3 Raw Data Files
• The datalines statement is used when referring to
internal raw data files.
• The infile statement is used when your data comes
from an external file. The keyword is placed directly
before the input statement. The path and name are
enclosed within single quotes. You will also need a
filename statement before the data step.
• Here are some examples of infile statements under 1)
windows and 2) UNIX operating environments:
1) infile ‘c:\MyDir\President.dat’;
2) infile ‘/home/mydir/president.dat’;
24
4.3 What can you do within it?
• A data step not only allows you to create a data
set, but it also allows you to manipulate the data
set.
• For example, you may wish to add two variables
together to get the cumulative effect or you may
wish to create a variable that is the log of
another variable (Meat example) or you may
simply want a subset of the data. This can be
done very easily within a data step.
• More information on this will be provided in a
supplementary documentation to follow.
25
4.4.1 Basic Example of a Data
Step
options ls=79;
data meat;
input steer time pH;
datalines;
1 1 7.02
2 1 6.93
3 2 6.42
4 2 6.51
5 4 6.07
6 4 5.99
7 6 5.59
8 6 5.80
9 8 5.51
10 8 5.36
;
26
4.4.2 Manipulating the Existing
Data
options ls=79;
data meat;
input steer time pH;
logtime=log(time);
datalines;
1 1 7.02
2 1 6.93
3 2 6.42
4 2 6.51
5 4 6.07
6 4 5.99
7 6 5.59
8 6 5.80
9 8 5.51
10 8 5.36
;
27
4.4.3 Designating a Character
Variable
options ls=79;
/*
Data on Violent and Property Crimes in 23 US Metropolitan Areas
violcrim = number of violent crimes
propcrim = number of property crimes
popn = population in 1000's
*/
data crime;
/* city is a character valued-variable so it is followed by
a dollar sign in the input statement */
input city $ violcrim propcrim popn;
datalines;
AllentownPA 161.1 3162.5 636.7
BakersfieldCA 776.6 7701.3 403.1
;
28
4.4.4 Data from an External File
options nodate nonumber ls=79 ps=60;
/* assumes the raw data file lives at this (hypothetical) path */
filename datain 'c:\MyDir\cars.dat';
data cars;
infile datain;
input mpg;
run;
29
4.5 What can you do with it?
4.5.1 View the data set
• Suppose that you have done some
manipulation to the original data set. If
you want to see what has been done, use
a proc print statement to view it.
31
4.6 Some Comments
• If you don’t want to view all the variables, you
can use the keyword var to specify which
variables the proc print procedure should
display.
• The command by is very useful in the previous
examples and of the procedures to follow. We
will take a look at its use through some
examples.
• Let’s look at the Meat Example again using SAS
to demonstrate the steps explained in 4.5.
32
5. Regression Analysis
5.1 What is proc reg?
5.2 What are the important ingredients?
5.3 What does it do?
5.4 What else can you do with it?
5.5 The cigarette example
5.6 The Output – regression analysis
33
5.1 Proc Reg
• What is a proc procedure?
It is a procedure used to do something to the
data – sort it, analyze it, print it, or plot it.
• What is proc reg?
It is a procedure used to conduct regression
analyses. It uses a model statement to
define the theoretical model for the
relationship between the independent and
dependent variables.
34
5.2 Ingredients of Proc Reg
5.2.1 General Form
proc reg data=somedata <options>;
by variables;
model dependent=independent
<options>;
plot yvar*xvar <options>;
run;
35
5.2 cont’d…
5.2.2 What you need and don’t need?
• You need to assign 1) the data to be
analyzed, and 2) the theoretical model to
be fit to the data.
• You don’t need the other statements
shown in 5.2.1 such as the by and plot
keywords nor do you need any of the
possible <options>; however, they can
prove useful, depending on the analysis.
36
5.2 cont’d… options
• There are more options for each keyword and the proc
reg statement itself.
• Besides defining the data set to be used in the proc
reg statement, you can also use the option simple to
provide descriptive statistics for each variable.
• For the model option, here are some options:
p prints observed, predicted and residual values
r prints everything above plus standard errors of the
predicted and residuals, studentized residuals and
Cook’s D-statistic.
clm prints 95% confidence intervals for mean of each obs
cli prints 95% prediction intervals
37
5.2 cont’d… more options
• And yes there are more options….
• Within proc reg you can also plot!
• The plot statement allows you to create a plot
that shows the predicted regression line.
• Use the variables in the model statement and
some special variables created by SAS such as
p. (predicted), r. (residuals), student.
(studentized residuals), L95. and U95. (cli
model option limits), and L95M. and U95M.
(clm. Model option limits). *Note the (.) at the
end of each variable name.
38
5.3 What does it do?
• Most simply, it analyzes the theoretical
model proposed.
• However, it (SAS) may have done all the
computational work, but it is up to you to
interpret it.
• Let’s look at an example to illustrate these
various options in SAS.
39
5.4 What else can you do with it?
• Plot it (of course!) using another procedure.
• There are two procedures that can be used: proc plot
and proc gplot.
• These procedures are very similar (in form) but the latter
allows you to do a lot more.
• Here is the general form:
proc gplot data=somedata;
plot yvar*xvar;
run;
• Again, you need to identify a data set and the plot
statement. The plot keyword works similarly to the way
it works in proc reg.
40
5.4 cont’d… plot options
• Some plot options:
yvar*xvar='char' obs. plotted using the character
specified
yvar*(xvar1 xvar2) two plots appear on
separate pages
yvar*(xvar1 xvar2)='char1' two plots
appear on separate pages
yvar*(xvar1 xvar2)='char2' two plots
appear on the same plot, distinguished by the
character specification
41
5.5 An Example
• Let’s take a look at a complete example.
Consider the cigarette example.
• Suppose you want to (1) find the estimated
regression line, (2) plot the estimated regression
line, and (3) generate confidence intervals and
prediction intervals.
• We’ll look at all the key elements needed to
create the SAS program in order to perform the
analysis as well as interpreting the output.
42
5.6 Output
• Identify all the different components
displayed in the SAS output and determine
what they mean.
• Begin by identifying what the sources of
variation are and their respective degrees
of freedom.
• The last page contains your predicted,
observed and residual values as well as
confidence and prediction intervals.
43
Analysis – some questions
Now let’s answer the following questions in order
to understand all the output displayed.
• What do the sums of squares tell us? Or What
do they account for?
• How do you determine the mean square(s)?
• How do you determine the F-statistics? What is
it used for? What does the p-value indicate?
• What are the root mean square error, the
dependent mean and the coeff var? What do
they measure?
44
More questions….
• What is the R-square? What does it measure?
• What are the parameter estimates? What is the
fitted model expression? What does this mean?
• What do the estimated standard errors tells us?
• How do you determine t-statistics? What are
they used for? What does the p-value indicate?
45
Now you can….
You should now be able to:
• Create a data set using a data step in order to:
- manipulate a data set (in various ways)
- use external raw data files
• Use various procedures in order to:
- find the estimated regression line
- plot the estimated regression line with data
- generate confidence intervals and prediction
intervals
46
6. Hints and Tidbits
• For assignments, summarize the output, and write the
answer to the questions being asked as well as clearly
interpreting and indicating where in the output the
numbers came from.
• You will need to be able to do this for your tests too – so
you might as well practice…..
• Practice with the examples provided in class and the
practice problems suggested by Professor Gibbs.
• Before going into the lab to use SAS, read over the
questions carefully and determine what needs to be
done. Look over examples that have already been
presented to you to give you an idea. It will save you lots
of time!
• Always check the log file for any errors!
47
Last Comments & Contact
• I will provide you with a short
supplementary document to help with the
SAS language and simple programming
steps (closer to the assignment time).
Anjali Mazumder
E-mail: mazumder@utstat.toronto.edu
www.utstat.toronto.edu/mazumder
48
Introduction to SAS
1
Why use statistical packages
• Built‐in functions
• Data manipulation
• Updated often to include new applications
• Different packages complete certain tasks
more easily than others
• Packages we will introduce
– SAS
– R (S‐plus)
2
SAS
• Easy to input and output data sets
• Preferred for data manipulation
• “proc” used to complete analyses with built‐in
functions
• Macros used to build your own functions
3
Outline
• SAS Structure
• Efficient SAS Code for Large Files
• SAS Macro Facility
4
Common errors
• Missing semicolon
• Misspelling
• Unmatched quotes/comments
• Mixed proc and data statement
• Using wrong options
5
SAS Structure
• Data Step: input, create, manipulate or output
data
– Always start with a data line
– Ex. data one;
• Procedure Step: complete an operation on
data
– Always start with a proc line
– Ex. proc contents;
6
Statements for Reading Data
• data statement names the data set you are
making
• Can use any of the following commands to
input data
– infile Identifies an external raw data file to read
with an INPUT statement
– input Lists variable names in the input file
– cards Indicates internal data
– set Reads a SAS data set
7
Example
data temp;
infile 'g:\shared\BIO271summer\baby.csv' delimiter=','
dsd;
input id headcir length bwt gestwks mage mnocig
mheight mppwt fage fedyrs fnocig fheig;
run;
proc print data = temp (obs=10);
run;
8
Delimiter Option
• blank space (default)
• DELIMITER= option specifies that the INPUT
statement use a character other than a blank
as a delimiter for data values that are read
with list input
9
Delimiter Example
Sometimes you want to input the data yourself
Try the following data step:
data nums;
infile datalines dsd delimiter='&';
input X Y Z;
datalines;
1&2&3
4&5&6
7&8&9 ;
Notice that there are no semicolons until the end of the
datalines
10
DSD option
• Change how SAS treats delimiters when list input is used and
sets the default delimiter to a comma. When you specify DSD,
SAS treats two consecutive delimiters as a missing value and
removes quotation marks from character values.
• Use the DSD option and list input to read a character value
that contains a delimiter within a quoted string. The INPUT
statement treats the delimiter as a valid character and
removes the quotation marks from the character string before
the value is stored. Use the tilde (~) format modifier to retain
the quotation marks.
11
Example: Reading Delimited Data
SAS data step:
data scores;
infile datalines delimiter=',';
input test1 test2 test3;
datalines;
91,87,95
97,,92
,1,1
;
Output:
Obs test1 test2 test3
1 91 87 95
2 97 92 1
12
Example: Correction
SAS data step
data scores;
infile datalines delimiter=',' dsd;
input test1 test2 test3;
datalines;
91,87,95
97,,92
,1,1
;
Output:
Obs test1 test2 test3
1 91 87 95
2 97 . 92
3 . 1 1
13
Modified List Input
Read data that are separated by commas and that may
contain commas as part of a character value:
data scores;
infile datalines dsd;
input Name : $9. Score Team : $25. Div $;
datalines;
Joseph,76,"Red Racers, Washington",AAA
Mitchel,82,"Blue Bunnies, Richmond",AAA
Sue Ellen,74,"Green Gazelles, Atlanta",AA
;
14
Modified List Input
Output: [data set with Name, Score, Team, and Div read correctly, quotation marks removed]
15
Dynamic Data Exchange (DDE)
• Dynamic Data Exchange (DDE) is a method of dynamically
exchanging information between Windows applications. DDE
uses a client/server relationship to enable a client application
to request information from a server application. In Version 8,
the SAS System is always the client. In this role, the SAS
System requests data from server applications, sends data to
server applications, or sends commands to server
applications.
• You can use DDE with the DATA step, the SAS macro facility,
SAS/AF applications, or any other portion of the SAS System
that requests and generates data. DDE has many potential
uses, one of which is to acquire data from a Windows
spreadsheet or database application.
16
Dynamic Data Exchange (DDE)
• NOTAB is used only in the context of Dynamic
Data Exchange (DDE). This option enables you
to use nontab character delimiters between
variables.
17
DDE Example
FILENAME biostat DDE 'Excel|book1!r1c1:r27c2';
DATA NEW;
INFILE biostat dlm='09'x notab dsd missover;
INFORMAT seqno 10. no 2.;
INPUT seqno no; RUN;
Note:
SAS reads in the first 27 rows and 2 columns of the
spreadsheet named book1 in a open Excel file
through the Dynamic Data Exchange (DDE).
18
Statements for Outputting Data
• file: Specifies the current output file for PUT
statements
• put: Writes lines to the SAS log, to the SAS procedure
output file, or to an external file that is specified in
the most recent FILE statement.
Example:
data _null_;
set new;
file 'c:\out.csv' delimiter=',' dsd;
put seqno no ;
run;
19
Comparisons
• The INFILE statement specifies the input file for any INPUT
statements in the DATA step. The FILE statement specifies the
output file for any PUT statements in the DATA step.
• Both the FILE and INFILE statements allow you to use options
that provide SAS with additional information about the
external file being used.
• An INFILE statement usually identifies data from an external
file. A DATALINES statement indicates that data follow in the
job stream. You can use the INFILE statement with the file
specification DATALINES to take advantage of certain data‐
reading options that affect how the INPUT statement reads in‐
stream data.
20
Read Dates with Formatted Input
DATA Dates;
INPUT @1 A date11.
@13 B ddmmyy6.
@20 C mmddyy10.
@31 D yymmdd8.;
duration=A-mdy(1,1,1970);
FORMAT A B C D mmddyy10.;
cards;
13/APR/1999 130499 04-13-1999 99 04 13
01/JAN/1960 010160 01-01-1960 60 01 01;
RUN;
Obs A B C D duration
1 04/13/1999 04/13/1999 04/13/1999 04/13/1999 10694
2 01/01/1960 01/01/1960 01/01/1960 01/01/1960 -3653
21
Procedures To Import/Export
Data
• IMPORT: reads data from an external data source
and writes it to a SAS data set.
• CPORT: writes SAS data sets, SAS catalogs, or SAS
data libraries to sequential file formats (transport
files).
• CIMPORT: imports a transport file that was created
(exported) by the CPORT procedure. It restores the
transport file to its original form as a SAS catalog, SAS
data set, or SAS data library.
22
PROC IMPORT
• Syntax:
PROC IMPORT
DATAFILE="filename" | TABLE="tablename"
OUT=SAS‐data‐set
<DBMS=identifier><REPLACE>;
23
PROC IMPORT
Space.txt:
MAKE MPG WEIGHT PRICE
AMC 22 2930 4099
AMC 17 3350 4749
AMC 22 2640 3799
Buick 20 3250 4816
Buick 15 4080 7827
proc import datafile="space.txt" out=mydata
dbms=dlm replace;
getnames=yes;
datarow=4;
run;
24
Common DBMS Specifications
Identifier   Input Data Source                               Extension
ACCESS       Microsoft Access database                       .MDB
DBF          dBASE file                                      .DBF
EXCEL        EXCEL file                                      .XLS
DLM          delimited file (default delimiter is a blank)   .*
CSV          comma-separated file                            .CSV
TAB          tab-delimited file                              .TXT
25
SAS Programming Efficiency
• CPU time
• I/O time
• Memory
• Data storage
• Programming time
26
Use ELSE statement to reduce CPU
time
Less efficient:
IF agegrp=3 THEN DO;...END;
IF agegrp=2 THEN DO;...END;
IF agegrp=1 THEN DO;...END;
More efficient:
IF agegrp=3 THEN DO;...END;
ELSE IF agegrp=2 THEN DO;...END;
ELSE IF agegrp=1 THEN DO;...END;
27
Subset a SAS Dataset
Less efficient (two passes through the data):
DATA div1; SET adults;
IF division=1; RUN;
DATA div2; SET adults;
IF division=2; RUN;
More efficient (one pass):
DATA div1 div2;
SET adults;
IF division=1 THEN OUTPUT div1;
ELSE IF division=2 THEN OUTPUT div2;
28
MODIFY is Better Than SET
Less efficient (SET rewrites the whole data set):
DATA salary;
SET salary;
wages=wages*0.1;
More efficient (MODIFY updates in place):
DATA salary;
MODIFY salary;
wages=wages*0.1;
29
Save Space by DROP or KEEP
DATA new;
SET old (KEEP=a b c);
RUN;
DATA new;
SET old (DROP=a);
RUN;
30
Save Space by Deleting Data Sets
DATA three;
MERGE one two;
BY type;
RUN;
PROC DATASETS;
DELETE one two;
RUN;
31
Save Space by Compress
DATA new (COMPRESS=YES);
SET old;
PROC SORT DATA=a OUT=b (COMPRESS=YES);
PROC SUMMARY;
VAR score;
OUTPUT OUT=SUM1 (COMPRESS=YES) SUM=;
32
Read Only What You Need
DATA large;
INFILE myDATA;
INPUT @15 type $2. @ ;
INPUT @1 X $1. @2 Y $5. ;
More efficient: read the rest of the line only for the types you need:
DATA large;
INFILE myDATA;
INPUT @15 type $2. @ ;
IF type in ('10','11','12') THEN
INPUT @1 X $1. @2 Y $5.;
33
PROC FORMAT Is Better Than
IF‐THEN
DATA new;
SET old;
IF 0 LE age LE 10 THEN agegroup=0;
ELSE IF 10 LE age LE 20 THEN agegroup=10;
ELSE IF 20 LE age LE 30 THEN agegroup=20;
ELSE IF 30 LE age LE 40 THEN agegroup=30;
RUN;
PROC FORMAT;
VALUE age 0-09=0 10-19=10 20-29=20 30-39=30;
RUN;
DATA new;
SET old;
agegroup=PUT(age,age.);
RUN;
34
Shorten Expressions with Functions
array c{10} cost1-cost10;
tot=0;
do I=1 to 10;
if c{i} ne . then do;
tot+c{i};
end;
end;
Simpler, using the SUM function:
tot=sum(of cost1-cost10);
35
IF‐THEN Better Than AND
Instead of (assumed contrast; the slide shows only the nested form):
IF status1=1 AND status2=9 THEN OUTPUT;
use nested IF‐THEN, which skips the second test when the first fails:
IF status1=1 THEN
IF status2=9 THEN OUTPUT;
36
Use SAS Functions Whenever
Possible
Instead of:
DATA new; SET old;
meanxyz = (x+y+z)/3;
RUN;
use the MEAN function (which also handles missing values):
meanxyz = MEAN(x, y, z);
37
Use RETAIN to Initialize Constants
Instead of a = 5; b = 13; (re-assigned on every iteration), use:
DATA new; SET old;
RETAIN a 5 b 13;
(programming statements); RUN;
38
Efficient Sort
PROC SORT;
BY vara varb varc vard vare;
RUN;
40
When w has many missing values:
DATA new;
SET old;
wyzsum = 26 + y + z + w;
RUN;
More efficient: skip the computation when w is missing:
DATA new;
SET old;
IF w > . THEN wyzsum = 26 + y + z + w;
RUN;
41
Put Loops With the Fewest
Iterations Outermost
Less efficient:
DATA new;
SET old;
DO i = 1 TO 100;
DO j = 1 TO 10;
(programming statements);
END;
END;
RUN;
More efficient:
DATA new;
SET old;
DO i = 1 TO 10;
DO j = 1 TO 100;
(programming statements);
END;
END;
RUN;
42
IN Better Than OR
Instead of: IF status=1 OR status=5 THEN newstat="single";
use:
IF status IN (1, 5) THEN
newstat="single";
ELSE newstat="not single";
43
SAS Macro
What can we do with Macro?
44
SAS Macro Facility
• SAS macro variable
• SAS Macro
• Autocall Macro Facility
• Stored Compiled Macro Facility
45
SAS Macro Delimiters
Two delimiters will trigger the macro
processor in a SAS program.
• &macro-variable-name
This refers to a macro variable. The current
value of the variable will replace &macro-variable-name;
• %macro-name
This refers to a macro, which consists of one or
more complete SAS statements, or even whole data
or proc steps.
46
SAS Macro Variables
• SAS Macro variables can be defined and used
anywhere in a SAS program, except in data
lines. They are independent of a SAS dataset.
• Macro variables contain a single character
value that remains constant until it is
explicitly changed.
47
SAS Macro Variables
%LET: assign text to a macro variable;
%LET macrovar = value;
1. Macrovar is the name of a global macro variable;
2. Value is macro variable value, which is a character string
without quotation or macro expression.
49
SAS Macro Variables
Combine Macro Variables with Text
%LET first = John;
%LET last = Smith;
%put &first.&last; (combine)
%put &first. &last; (blank separate)
%put Mr. &first. &last; (prefix)
%put &first. &last. HSPH; (suffix)
output:
JohnSmith
John Smith
Mr. John Smith
John Smith HSPH
50
Create SAS Macro
• Definition:
%MACRO macro‐name (parm1, parm2,…parmk);
Macro definition (&parm1,&parm2,…&parmk)
%MEND macro‐name;
• Application:
%macro‐name(values of parm1, parm2,…,parmk);
51
SAS Macro Example
Import Excel to SAS Datasets by a Macro
%macro excelsas(in=, out=);
proc import out=work.&out
datafile="c:\&in"
dbms=excel2000 replace;
getnames=yes; run;
%mend excelsas;
%excelsas(class1, score1)
%excelsas(class2, score2)
52
SAS System Options
• System options are global instructions that affect the
entire SAS session and control the way SAS performs
operations. SAS system options differ from SAS data
set options and statement options in that once you
invoke a system option, it remains in effect for all
subsequent data and proc steps in a SAS job, unless
you reset it.
• In order to view which options are available and in
effect for your SAS session, use proc options.
PROC OPTIONS; RUN;
53
SAS system options
• NOCAPS Translate quoted strings and titles to upper case?
• CENTER Center SAS output?
• DATE Date printed in title?
• ERRORS=20 Maximum number of observations with error messages
• FIRSTOBS=1 First observation of each data set to be processed
• FMTERR Treat missing format or informat as an error?
• LABEL Allow procedures to use variable labels?
• LINESIZE=96 Line size for printed output
• MISSING=. Character printed to represent numeric missing values
• NUMBER Print page number on each page of SAS output?
• OBS=MAX Number of last observation to be processed
• PAGENO=1 Resets the current page number on the print file
• PAGESIZE=54 Number of lines printed per page of output
• YEARCUTOFF=1900 Cutoff year for DATE7. informat
54
Log, output and procedure options
• center controls whether SAS procedure output is centered. By default, output is centered. To
specify not centered, use nocenter.
• date prints the date and time to the log and output window. By default, the date and time is
printed. To suppress the printing of the date, use nodate.
• label allows SAS procedures to use labels with variables. By default, labels are permitted. To
suppress the printing of labels, use nolabel.
• notes controls whether notes are printed to the SAS log. By default, notes are printed. To
suppress the printing of notes, use nonotes.
• number controls whether page numbers are printed. By default, page numbers are printed.
To suppress the printing of page numbers, use nonumber.
• linesize= specifies the line size (printer line width) for the SAS log and the SAS procedure
output file used by the data step and procedures.
• pagesize= specifies # of lines that can be printed per page of SAS output.
• missing= specifies the character to be printed for missing numeric values.
• formchar= specifies the list of graphics characters that define table boundaries.
Example:
OPTIONS NOCENTER NODATE NONOTES LINESIZE=80 MISSING=. ;
55
SAS data set control options
SAS data set control options specify how SAS data sets
are input, processed, and output.
• firstobs= causes SAS to begin reading at a specified observation in a data
set. The default is firstobs=1.
• obs= specifies the last observation from a data set or the last record from
a raw data file that SAS is to read. To return to using all observations in a
data set use obs=all
• replace specifies whether permanently stored SAS data sets are to be
replaced. By default, the SAS system will over‐write existing SAS data sets
if the SAS data set is re‐specified in a data step. To suppress this option,
use noreplace.
Example:
• OPTIONS OBS=100 NOREPLACE;
56
Error handling options
Error handling options specify how the SAS System
reports on and recovers from error conditions.
• errors= controls the maximum number of observations for which
complete error messages are printed. The default maximum number of
complete error messages is errors=20
• fmterr controls whether the SAS System generates an error message
when the system cannot find a format to associate with a variable. SAS
will generate an ERROR message for every unknown format it
encounters and will terminate the SAS job without running any following
data and proc steps. To read a SAS system data set without requiring a
SAS format library, use nofmterr.
Example:
OPTIONS ERRORS=100 NOFMTERR;
57
Using where statement
where statement allows us to run procedures on a
subset of records.
Examples:
PROC PRINT DATA=auto;
WHERE (rep78 >= 3);
VAR make rep78;
RUN;
PROC PRINT DATA=auto;
WHERE (rep78 <= 2) and (rep78 ^= .) ;
VAR make price rep78 ;
RUN;
58
Missing Values
As a general rule, SAS procedures that perform
computations handle missing data by omitting
the missing values.
59
Summary of how missing values
are handled in SAS procedures
• proc means
For each variable, the number of non‐missing values
are used
• proc freq
By default, missing values are excluded and
percentages are based on the number of non‐missing
values. If you use the missing option on the tables
statement, the percentages are based on the total
number of observations (non‐missing and missing)
and the percentage of missing values are reported in
the table.
60
Summary of how missing values
are handled in SAS procedures
• proc corr
By default, correlations are computed based on the
number of pairs with non‐missing data (pairwise
deletion of missing data). The nomiss option can be
used to request that correlations be computed only
for observations that have non‐missing data for all
variables on the var statement (listwise deletion of
missing data).
• proc reg
If any of the variables on the model or var statement
are missing, they are excluded from the analysis (i.e.,
listwise deletion of missing data)
61
Summary of how missing values
are handled in SAS procedures
• proc glm
If you have an analysis with just one variable on the
left side of the model statement (just one outcome
or dependent variable), observations are eliminated
if any of the variables on the model statement are
missing. Likewise, if you are performing a repeated
measures ANOVA or a MANOVA, then observations
are eliminated if any of the variables in the model
statement are missing. For other situations, see the
SAS/STAT manual about proc glm.
62
Missing values in assignment
statements
• As a general rule, computations involving missing
values yield missing values.
2 + 2 yields 4
2 + . yields .
• mean(of var1-varn): averages the non-missing values in a list of variables.
avg = mean(of var1-var10)
• N(of var1-varn): counts the number of non-missing values in a list of variables.
n = N(var1, var2, var3)
63
Missing values in logical
statements
• SAS treats a missing value as the smallest possible
value (e.g., negative infinity) in logical statements.
DATA times6;
SET times ;
if (var1 <= 1.5) then varc1 = 0; else varc1 = 1 ;
RUN ;
Output:
Obs id var1 varc1
1 1 1.5 0
2 2 . 0
3 3 2.1 1
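To keep missing values from being lumped into the low group, test for missing explicitly; a minimal variant of the step above:
DATA times6;
SET times;
if var1 = . then varc1 = .;          /* keep missing as missing */
else if var1 <= 1.5 then varc1 = 0;
else varc1 = 1;
RUN;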
64
Subsetting Data
Subsetting variables using keep or drop statements
Example:
DATA auto2;
SET auto;
KEEP make mpg price;
RUN;
DATA auto3;
SET auto;
DROP rep78 hdroom trunk weight length turn displ gratio foreign;
RUN;
65
Subsetting Data
Subsetting observations using if statements
Example:
DATA auto4;
SET auto;
IF rep78 ^= . ;
RUN;
DATA auto5;
SET auto;
IF rep78 > 3 THEN DELETE ;
RUN;
66
Labeling variables
Variable label: Use the label statement in the data step to assign
labels to the variables. You could also assign labels to
variables in proc steps, but then the labels only exist for that
step. When labels are assigned in the data step they are
available for all procedures that use that data set.
Example:
DATA auto2;
SET auto;
LABEL rep78 ="1978 Repair Record" mpg ="Miles Per Gallon" foreign="Where Car
Was Made";
RUN;
PROC CONTENTS DATA=auto2;
RUN;
67
Labeling variable values
Labeling values is a two-step process. First, you must create the
label formats with proc format using a value statement. Next,
you attach the label format to the variable with a format
statement. This format statement can be used in either proc
or data steps.
Example:
*first create the label formats forgnf and makef;
PROC FORMAT;
VALUE forgnf 0="domestic" 1="foreign" ;
VALUE $makef "AMC" ="American Motors" "Buick" ="Buick (GM)" "Cad." ="Cadillac (GM)"
"Chev." ="Chevrolet (GM)" "Datsun" ="Datsun (Nissan)";
RUN;
*now we link them to the variables foreign and make;
PROC FREQ DATA=auto2;
FORMAT foreign forgnf. make $makef.;
TABLES foreign make; RUN;
68
Sort data
Use proc sort to sort this data file.
Examples:
PROC SORT DATA=auto ; BY foreign ; RUN ;
PROC SORT DATA=auto OUT=auto2 ;
BY foreign ; RUN ;
PROC SORT DATA=auto OUT=auto3;
BY descending foreign ; RUN ;
PROC SORT DATA=auto OUT=auto2 noduplicates;
BY foreign ; RUN ;
69
Making and using permanent SAS
data files
• Use a libname statement.
libname diss 'c:\dissertation\';
data diss.salary;
input sal1996‐sal2000 ;
cards;
14000 16500 18000 22000 29000
;
run;
• specify the name of the data file by directly specifying the
path name of the file
data 'c:\dissertation\salarylong';
input Salary1996‐Salary2000 ;
cards;
14000 16500 18000 22000 29000
;
run;
70
Merge data files
One‐to‐one merge: there are three steps to match
merge two data files dads and faminc on the same
variable famid.
1. Use proc sort to sort dads on famid and save that
file (we will call it dads2)
PROC SORT DATA=dads OUT=dads2; BY famid; RUN;
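Steps 2 and 3, sketched to match (the output names faminc2 and merge121 follow the naming convention used here; merge121 reappears in the mismatch examples below):
PROC SORT DATA=faminc OUT=faminc2; BY famid; RUN;
DATA merge121;
MERGE dads2 faminc2;
BY famid;
RUN;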
71
Merge data files
One‐to‐many merge: there are three steps to match
merge two data files dads and kids on the same
variable famid.
1. Use proc sort to sort dads on famid and save that
file (we will call it dads2)
PROC SORT DATA=dads OUT=dads2; BY famid; RUN;
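Steps 2 and 3 look the same for the one-to-many case; each dads record is repeated for every matching kids record (the names kids2 and merge12m are illustrative):
PROC SORT DATA=kids OUT=kids2; BY famid; RUN;
DATA merge12m;
MERGE dads2 kids2;
BY famid;
RUN;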
72
Merge data files: mismatch
• Mismatching records in one‐to‐one merge: use the in
option to create a 0/1 variable
DATA merge121;
MERGE dads(IN=fromdadx) faminc(IN=fromfamx);
BY famid;
fromdad = fromdadx;
fromfam = fromfamx;
RUN;
• Variables with the same name, but different
information: rename variables
DATA merge121;
MERGE faminc(RENAME=(inc96=faminc96 inc97=faminc97 inc98=faminc98))
dads(RENAME=(inc98=dadinc98));
BY famid;
RUN;
73
Concatenating data files in SAS
• Use set to stack data files
DATA dadmom; SET dads moms; RUN;
• Use rename to stack two data files with different variable
names for the same thing
DATA momdad;
SET dads(RENAME=(dadinc=inc)) moms(RENAME=(mominc=inc));
RUN;
• Two data files with different lengths for variables of the same
name
DATA momdad;
LENGTH name $ 4;
SET dads moms;
RUN;
74
Concatenating data files in SAS
• The two data files have variables with the same
name but different codes
dads moms
famid name inc   fulltime     famid name inc   fulltime
1     Bill 30000 1            1     Bess 15000 N
2     Art  22000 0            2     Amy  18000 N
3     Paul 25000 3            3     Pat  50000 Y
DATA dads; SET dads; full=fulltime; DROP fulltime;RUN;
DATA moms; SET moms;
IF fulltime="Y" THEN full=1; IF fulltime="N" THEN full=0;
DROP fulltime;RUN;
DATA momdad; SET dads moms;RUN;
75
SAS Macro
What can we do with Macro?
SAS Macro Example
Use proc means by a Macro
%macro auto(var1, var2);
proc sort data=auto;
by &var2;
run;
proc means data=auto;
var &var1;
by &var2;
run;
%mend auto;
%auto(price, rep78) ;
%auto(price, foreign);
85
In-class practice
Use the auto data to do the following
• check missing values for each variable
• create a new variable model (first part of
make)
• get means/frequencies for each variable by
model
• create 5 data files with 1‐5 repairs using macro
86
SCATTER DIAGRAM
A scatter diagram is a tool for analyzing relationships between two variables. One
variable is plotted on the horizontal axis and the other is plotted on the vertical axis. The
pattern of their intersecting points can graphically show relationship patterns. Most often
a scatter diagram is used to prove or disprove cause-and-effect relationships. While the
diagram shows relationships, it does not by itself prove that one variable causes the
other. In addition to showing possible cause-and-effect relationships, a scatter diagram
can show that two variables are from a common cause that is unknown or that one
variable can be used as a surrogate for the other.
Plot the paired data. Plot the data on the chart, using concentric circles to indicate
repeated data points.
Complex Correlation
The value of Y seems to be related to the value of X, but the relationship is not easily
determined.
No Correlation
There is no demonstrated connection between the two variables.
[Three example scatter plots of y against x (both 0-40) illustrating complex and absent correlation.]
Spearman's rank correlation coefficient: an inferential statistic used with ordinal data.
GNP and adult literacy
              GNP per capita    % adult literacy
              X       Rank X    Y      Rank Y     d    d^2
Nepal         210     1         39.7   2         -1     1
Sudan         290     2         55.5   3         -1     1
Gambia        340     3         34.5   1          2     4
Peru          2460    4         89     7         -3     9
Turkey        3160    5         81.4   5          0     0
Brazil        4570    6         84     6          0     0
Argentina     8970    7         97     9         -2     4
Israel        15940   8         96     8          0     0
U.A.E.        18220   9         74.3   4          5    25
Netherlands   24760   10        100    10         0     0
n = 10                                      Sum d^2 = 44

Calculating Spearman's Rank . . .
Rs = 1 - (6 x 44) / (10^3 - 10) = 1 - 264/990 = 0.73
A coefficient of 0.73 indicates a strong positive association between GNP per capita and adult literacy.
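In general form, with n paired ranks and d_i the rank difference for pair i:

\[
r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
\]

Here n(n^2 - 1) = 10(100 - 1) = 990, matching the denominator above.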
What is in this workshop
• SPSS interface: data view and variable view
• How to enter data in SPSS
• How to import external data into SPSS
• How to clean and edit data
• How to transform variables
• How to sort and select cases
• How to get descriptive statistics
2
Data used in the workshop
• We use 2009 Youth Risk Behavior Surveillance
System (YRBSS, CDC) as an example.
– YRBSS monitors priority health‐risk behaviors and
the prevalence of obesity and asthma among
youth and young adults.
– The target population is high school students
– Multiple health behaviors include drinking,
smoking, exercise, eating habits, etc.
3
SPSS interface
• Data view
– The place to enter data
– Columns: variables
– Rows: records
• Variable view
– The place to enter variables
– List of all variables
– Characteristics of all variables
4
Before the data entry
• You need a code book/scoring guide
• You give ID number for each case (NOT real
identification numbers of your subjects) if you
use paper survey.
• If you use online survey, you need something
to identify your cases.
• You also can use Excel to do data entry.
5
Example of a code book
A code book is about how you code your
variables. What are in code book?
1. Variable names
2. Values for each response option
3. How to recode variables
6
Enter data in SPSS 19.0
Columns:
variables
Rows: cases
Under Data
View
7
Enter variables
1. Click Variable View.
2. Type the variable name under the Name column (e.g. Q01). NOTE: a variable name can be up to 64 bytes long, and the first character must be a letter or one of the characters @, #, or $.
3. Type: numeric, string, etc.
4. Label: a description of the variable.
Enter variables
Based on your
code book!
9
Enter cases
1. Two variables in the data set.
2. They are: Code and Q01.
3. Code is an ID variable, used to identify individual case
(NOT people’s real IDs).
4. Q01 is about participants’ ages: 1 = 12 years or younger,
2 = 13 years, 3 = 14 years…
Under Data
View
10
Import data from Excel
• Select File → Open → Data
• Choose Excel as file type
• Select the file you want to import
• Then click Open
11
Open Excel files in SPSS
12
Import data from CSV file
• CSV is a comma-separated values file.
• If you use Qualtrics to collect data (online
survey), you will get a CSV data file.
• Select File → Open → Data
• Choose All files as file type
• Select the file you want to import
• Then click Open
13
Continue
(Several import wizard screens follow; continue through each of them.)
Save this file as SPSS data.
Clean data after import data files
• Key in values and labels for each variable
• Run frequency for each variable
• Check outputs to see if you have variables
with wrong values.
• Check missing values and physical surveys if
you use paper surveys, and make sure they
are real missing.
• Sometimes, you need to recode string
variables into numeric variables
21
Continue
(The frequency output reveals wrong entries.)
22
Variable transformation
• Recode variables
1. Select Transform → Recode
into Different Variables
2. Select variable that you want to
transform (e.g. Q20): we want
1= Yes and 0 = No
3. Click Arrow button to put your
variable into the right window
4. Under Output Variable: type
name for new variable and
label, then click Change
5. Click Old and New Values
23
Continue
6. Type 1 under Old Value
and 1 under New Value,
click Add. Then type 2
under Old Value, and 0
under New Value, click
Add.
7. Click Continue after
finish all the changes.
8. Click OK.
Variable transformation
• Compute variable (use YRBSS 2009 data)
• Example 1. Create a new variable: drug_use (during the past 30 days, any use of cigarettes, alcohol, or marijuana is defined as use, else as non-use). There are two categories for the new variable (use vs. non-use). Coding: 1 = Use and 0 = Non-use.
1. Use Q30, Q41, and Q47 from 2009 YRBSS survey
2. Non‐users means those who answered 0 days/times to all three
questions.
3. Go to Transform → Compute Variable
25
Continue
4. Type “drug_use” under
Target Variable
5. Type “0” under Numeric
Expression. 0 means
Non‐use
6. Click If button.
26
Continue
7. With the help of the Arrow button, type
Q30 = 1 & Q41 = 1 & Q47 = 1
(& means AND, | means OR), then click Continue.
8. Do the same thing for Use, but the numeric
expression is different:
Q30 > 1 | Q41 > 1 | Q47 > 1
27
Continue
9. Click OK
10. After clicking OK, a small window asks if you
want to change the existing variable, because
drug_use was already created when you first
defined non-use.
11. Click OK.
28
Continue
• Compute variables
• Example 2. Create a new variable drug_N that
assesses total number of drugs that adolescents
used during the last 30 days.
1. Use Q30 (cigarettes), 41 (alcohol), 47
(marijuana), and 50 (cocaine). The number of
drugs used should be between 0 and 4.
2. First, recode all four variables into two
categories: 0 = non‐use (0 days), 1 = use (at least
1 day/time)
3. The four variables have 6 or 7 categories.
Continue
4. Recode four variables: 1 (old) = 0 (new), 2‐6/7 (old) = 1 (New).
5. Then select Transform → Compute Variable
30
Continue
6. Type drug_N under Target Variable
7. Numeric Expression: SUM (Q30r,Q41r,Q47r,Q50r)
8. Click OK
31
Continue
• Compute variables
– Example 3: Convert string variable into numeric
variable
1. Enter 1 at Numeric
Expression.
2. Click If button and type
Q2 = ‘Female’
3. Then click Ok.
4. Enter 2 at Numeric
Expression.
5. Click If button and type
Q2 = ‘Male’
6. Then click Ok
32
Sort and select cases
• Sort cases by variables: Data → Sort Cases
• You can use Sort Cases to find missing.
33
Sort and select cases
• Select cases
– Example 1. Select Females for analysis.
1. Go to Data → Select Cases
2. Under Select: check the second one
3. Click If button
34
Continue
4. Type Q2 = 1 (1 means Female).
5. Click Continue.
6. Click OK.
(Unselected cases: Q2 = 2.)
35
Sort and select cases
7. You will see a new variable: filter_$ (Variable
view)
36
Sort and select cases
• Select cases
– Example 2. Select cases who used any of cigarettes, alcohol, and marijuana
during the last 30 days.
1. Data → Select Cases
2. Click If button
3. Type Q30 > 1 | Q41 > 1 | Q47 > 1, click Continue
37
Basic statistical analysis
• Descriptive statistics
– Purposes:
1. Find wrong entries
2. Have basic knowledge about the sample and
targeted variables in a study
3. Summarize data
Analyze → Descriptive statistics → Frequency
38
Continue
39
Frequency table
40
1. Skewness: a measure of the
asymmetry of a distribution.
The normal distribution is
symmetric and has a skewness
value of zero.
Positive skewness: a long right tail.
Negative skewness: a long left tail.
Departure from symmetry : a
skewness value more than twice
its standard error.
2. Kurtosis: a measure of the extent
to which observations cluster around
a central point. For a normal
distribution, the value of the kurtosis
statistic is zero. Leptokurtic data
values are more peaked, whereas
platykurtic data values are flatter and
more dispersed along the X axis.
41
42
• About the four windows in SPSS
• The Variable View sheet contains information about the data set that is stored with the dataset.
• Name
◦ The first character of the variable name must be alphabetic.
◦ Variable names must be unique, and have to be less than 64 characters.
◦ Spaces are NOT allowed.
• Type
◦ Click on the 'type' box. The two basic types of variables that you will use are numeric and string. This column enables you to specify the type of variable.
• Width
◦ Width allows you to determine the number of characters SPSS will allow to be entered for the variable.
• Decimals
◦ Number of decimals (e.g. 3.14159265).
◦ It has to be less than or equal to 16.
• Label
◦ You can specify the details of the variable.
◦ You can write characters with spaces, up to 256 characters.
• Values
◦ This is used to indicate which numbers represent which categories when the variable represents a category.
◦ Click the cell in the values column. For each value and label, you can put up to 60 characters.
◦ After defining the values, click Add and then click OK.
• How would you put the following information into SPSS?
• Click 'Data' and then click Sort Cases.
• Double click 'Name of the students.' Then click OK.
• How would you sort the data by the 'Height' of students in descending order?
• Answer
◦ Click Data, Sort Cases, double click 'height of students,' click 'descending,' and finally click OK.
• Click 'Transform' and then click 'Compute Variable…'
• Example: adding a new variable named 'lnheight' which is the natural log of height.
◦ Type lnheight in the 'Target Variable' box. Then type 'ln(height)' in the 'Numeric Expression' box. Click OK.
• A new variable 'lnheight' is added to the table.
• Exercise: create a new variable named "sqrtheight" which is the square root of height (use 'sqrt(height)' as the numeric expression).
• Frequencies
◦ This analysis produces frequency tables showing frequency counts and percentages of the values of individual variables.
• Descriptives
◦ This analysis shows the maximum, minimum, mean, and standard deviation of the variables.
• Finally, click OK in the Frequencies box.
` Click ‘Analyze,’ ‘Descriptive statistics,’ then
click ‘Frequencies.’
` Put ‘Gender’ in the Variable(s) box.
` Then click ‘Charts,’ ‘Bar charts,’ and click
‘Continue.’
` Click ‘Paste.’
Click
` Highlight the commands in the Syntax editor
and then click the run icon.
` You can do the same thing by right clicking the
highlighted area and then by clicking ‘Run
Current’
Right
Click Click!
` Do a frequency analysis on the
variable “minority”
Click
• The Options dialog allows you to analyze other descriptive statistics besides the mean and std. deviation.
• Click 'variance' and 'kurtosis,' then click 'Continue.'
• Finally, click OK in the Descriptives box. You will be able to see the result of the analysis.
` Click ‘Analyze,’ ‘Regression,’ then click
‘Linear’ from the main menu.
` For example let’s analyze the model salbegin = β 0 + β1edu + ε
` Put ‘Beginning Salary’ as Dependent and ‘Educational Level’ as
Independent.
Click
Click
• Clicking OK gives the result.
• Click 'Graphs,' 'Legacy Dialogs,' 'Interactive,' and 'Scatterplot' from the main menu.
• Drag 'Current Salary' into the vertical axis box and 'Beginning Salary' into the horizontal axis box.
• Click the 'Fit' bar. Make sure the Method is Regression in the Fit box. Then click 'OK.'
• Exercise: find out whether or not the previous experience of workers has any effect on their beginning salary.
◦ Take the variables "salbegin" and "prevexp" as the dependent and independent variables respectively.
Start with the first score, 72. How far away is 72 from the mean of 81.5? 72 - 81.5 = -9.5. Or take the highest score, 89: 89 - 81.5 = 7.5.

     Score   Distance from Mean   Distance Squared
      72          -9.5                 90.25
      76          -5.5                 30.25
      80          -1.5                  2.25
      80          -1.5                  2.25
      81          -0.5                  0.25
      83           1.5                  2.25
      84           2.5                  6.25
      85           3.5                 12.25
      85           3.5                 12.25
      89           7.5                 56.25
                              Sum:    214.5

Add up all of the squared distances and divide by (n - 1), where n is the number of scores: 214.5 / (10 - 1) = 23.8. Finally, take the square root: √23.8 = 4.88. This is the Standard Deviation.

Now find the Standard Deviation for the other class's grades:

     Score   Distance from Mean   Distance Squared
      57         -24.5                600.25
      65         -16.5                272.25
      83           1.5                  2.25
      94          12.5                156.25
      95          13.5                182.25
      96          14.5                210.25
      98          16.5                272.25
      93          11.5                132.25
      71         -10.5                110.25
      63         -18.5                342.25
                              Sum:   2280.5

2280.5 / (10 - 1) = 253.4;  √253.4 = 15.91

Now, let's compare the two classes again:

                        Team A    Team B
Average on the Quiz      81.5      81.5
Standard Deviation       4.88     15.91
Variance and Standard Deviation
Variance: a measure of how data points differ from the mean.
• Data Set 1: 3, 5, 7, 10, 10
  Data Set 2: 7, 7, 7, 7, 7
Both sets have a mean of 7, but we know that the two data sets are not identical! The variance shows how they are different.
• A first thought is to average the deviations about the mean, Σ(x - x̄)/N. Although this might seem reasonable, this expression always equals 0, because the negative deviations about the mean always cancel out the positive deviations about the mean.
• We could just drop the negative signs, which is the same mathematically as taking the absolute value; this gives the mean deviation. But the concept of absolute value does not lend itself to the kind of advanced mathematical manipulation necessary for the development of inferential statistical formulas.
• The average of the squared deviations about the mean is called the variance. For the sample variance:
  s² = Σ(x - x̄)² / (n - 1)

Worked example for Data Set 1 (x̄ = 35/5 = 7):

     Score X    X - x̄        (X - x̄)²
1        3     3 - 7 = -4      16
2        5     5 - 7 = -2       4
3        7     7 - 7 =  0       0
4       10    10 - 7 =  3       9
5       10    10 - 7 =  3       9
Totals  35                     38

s² = Σ(x - x̄)²/n = 38/5 = 7.6
(Note: with the sample formula above, which divides by n - 1, the result would be 38/4 = 9.5.)
Example 2: X = 28, 22, 21, 26, 18

Summary statistics for two data sets:
          Set 1    Set 2
mean       23       23
median     22       27
range      10       22

Deviations for Set 1:

     X      X - x̄     (X - x̄)²
1   28        5         25
2   22       -1          1
3   21       -2          4
4   26        3          9
5   18       -5         25
Totals 115    0         64

s = √( Σ(x - x̄)² / (n - 1) ) = √(64/4) = 4

• Population standard deviation:
  σ = √( Σ(x - μ)² / N )

Another formula
• Definitional formula for the variance for data in a frequency distribution:
  S² = Σ(X - x̄)²·f / Σf
• Definitional formula for the standard deviation for data in a frequency distribution:
  S = √( Σ(X - x̄)²·f / Σf )

Example (Myrna's scores; the mean is 23):

     X     f    X - x̄   (X - x̄)²   (X - x̄)²·f
    28     1      5        25          25
    27     3      4        16          48
     6     1    -17       289         289
Totals   Σf = 5, ΣXf = 115            362

(Round-off rule: carry one more decimal place than was present in the original data.)
S² = 362/5 = 72.4;  S = √72.4 ≈ 8.5
2
T-Test
The alternative hypothesis suggests the direction of the actual value of the parameter relative to the stated value. A statement of Ha in the form of an inequality (≠) indicates that the investigator has no opinion as to whether the actual value of μ is more than or less than the stated value, but feels that the stated value is incorrect; in this case the test is a two-tail test. Statements in the form of a strictly greater than or strictly less than relationship indicate that the investigator has an opinion as to the direction of the value of the parameter relative to the stated value; in this case it is called a one-tail test.
3
T-Test
• 2. State the level of significance of the test and the corresponding Z values (for large-sample tests) or the corresponding T values (for small-sample tests). Hypothesis tests are frequently conducted at the 5%, 1%, and 10% levels of significance, for which the Z values are tabulated. For a test conducted at any other level of significance, we simply use the normal distribution table to determine the corresponding Z value.
• 3. Calculate the test statistic for the sample that has been taken. There are three cases:

T-Test
• Case 1: The variable has a normal distribution and σ² is known. In this case the test statistic is
  Z = (x̄ - μ0) / (σ/√n),
which has a standard normal distribution if μ = μ0 in H0.
• Case 2: The variable has a normal distribution and σ² is unknown, estimated by S². In this case the test statistic is
  T = (x̄ - μ0) / (S/√n),
which has a t distribution with n - 1 degrees of freedom if μ = μ0 in H0.
• Case 3: The distribution is not normal but n is large; by the central limit theorem, the Z statistic above (with S in place of σ) is approximately standard normal.
T-Test
4. Determine the boundary (or boundaries) of the rejection regions using either X_c or Z_c values. A critical value is the boundary or limit value that requires us to reject the statement of the null hypothesis.
7
T-Test
[Figure: two-tail test, rejection regions below the lower X_c and above the upper X_c, centered at μ.]
In a directional test there is one critical value (an upper boundary) when Ha: μ > μ0.
[Figure: rejection region above the upper X_c.]
In a directional test there is one critical value (a lower boundary) when Ha: μ < μ0.
[Figure: rejection region below the lower X_c.]
• The critical value X_c is simply the maximum or minimum value that we are willing to accept as being consistent with the stated parameter. The mean of the sampling distribution is given by:
  μ_x̄ = μ
• The standard deviation of the sampling distribution is given by:
  σ_x̄ = σ/√n
• 5. Formulate a decision rule on the basis of the boundary values obtained in step 4. When we conduct a hypothesis test, we are required to make one of two decisions:
  (a) Reject H0, or
  (b) Accept H0.
11
It is possible to make two errors in this decision. One error is called a Type I error, or α-error. We make a Type I error whenever we reject the statement of H0 when it is in fact true. The probability of making a Type I error is the level of significance of the test. The second error we can make in a hypothesis test is called a Type II error, or β-error. We commit a Type II error if we fail to reject the statement of H0 when it is in fact false. The four combinations of the truth value of H0 and the resulting decisions are summarized below:

                 H0 True             H0 False
Reject H0        Type I error        Correct decision
Accept H0        Correct decision    Type II error

When we lower the level of significance of a hypothesis test, we always increase the possibility of committing a β-error.
6. State a conclusion for the hypothesis test based on the sample data obtained and the decision rule stated in the steps above.
14
• P-value of a test: the p-value is the probability of getting a value more extreme than the observed value of the test statistic, denoted Z_obs.
  When Ha is ≠ : p-value = 2·P(Z > |Z_obs|)
  When Ha is > : p-value = P(Z > Z_obs)
  When Ha is < : p-value = P(Z < Z_obs)
• If we have a T statistic with a t(n-1) distribution and observed value t_obs, these p-values become:
  ≠ alternative: p-value = 2·P(t(n-1) > |t_obs|)
  > alternative: p-value = P(t(n-1) > t_obs)
  < alternative: p-value = P(t(n-1) < t_obs)
• Thus H0 is rejected if p-value < α. When data are collected from a normally distributed population and the sample size is small, the t values of the Student t distribution must be used in the hypothesis test, not the Z values of the normal distribution, because the central limit theorem does not apply when n < 30.
17
• Ex: Suppose we measure the sulfur content (as a percent) of 15 samples of crude oil from a particular Middle Eastern area, obtaining:
1.9, 2.3, 2.9, 2.5, 2.1, 2.7, 2.8, 2.6, 2.6, 2.5, 2.7, 2.2, 2.8, 2.7, 3.0
Assume that sulfur contents are normally distributed. Can we conclude that the average sulfur content in this area is less than 2.6? Use a level of significance of .05.
18
n = 15, x̄ = 2.533, S = .3091, α = .05
H0: μ = 2.6
Ha: μ < 2.6
[Figure: lower-tail rejection region of area .05; acceptance region of area .95.]
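Carrying the computation through (a worked completion; the critical value t_{14,.05} = 1.761 comes from standard t tables):

\[
T = \frac{\bar{x} - \mu_0}{S/\sqrt{n}} = \frac{2.533 - 2.6}{0.3091/\sqrt{15}} \approx -0.84
\]

Since -0.84 > -1.761, the test statistic does not fall in the rejection region and H0 is not rejected: the data do not establish that the average sulfur content is below 2.6.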
• Testing for the difference in two population means:
• Often we have two populations for which we would like to compare the means. Independent random samples of sizes n1 and n2 are selected from the two populations, with no relationship between the elements drawn from the two populations. The statistical hypotheses are given by:
23
H0: μ1 = μ2 vs Ha: μ1 ≠ μ2
or Ha: μ1 > μ2
or Ha: μ1 < μ2
24
• There are three cases, which depend on what is known about the population variances σ1² and σ2².
• Case 1: Population variances are known for normal populations (or non-normal populations with both n1 and n2 large). In this case the test statistic is
  Z = (x̄1 - x̄2) / √(σ1²/n1 + σ2²/n2)
• Case 2: Population variances are unknown but assumed equal (σ1² = σ2² = σ²) in normal populations. In this case, we pool our estimates to get the pooled two-sample variance
  Sp² = ((n1 - 1)S1² + (n2 - 1)S2²) / (n1 + n2 - 2)
and the test statistic is
  T = (x̄1 - x̄2) / √( Sp²(1/n1 + 1/n2) ),
which has a t distribution with n1 + n2 - 2 degrees of freedom if H0 is true.
• Case 3: σ1² and σ2² are unknown and unequal, with normal populations. In this case the test statistic is given by
  T′ = (x̄1 - x̄2) / √( S1²/n1 + S2²/n2 ),
which does not have a known distribution.
28
Ex:
The amount of solar ultraviolet light of wavelength from 290 to 320 nm which reached the earth's surface in the Riyadh area was measured for independent samples of days in cooler months (October to March) and in warmer months (April to September):
• Cooler: 5.31, 4.36, 3.71, 3.74, 4.51, 4.58, 4.64, 3.83, 3.16, 3.67, 4.34, 2.95, 3.62, 3.29, 2.45.
• Warmer: 4.07, 3.83, 4.75, 4.84, 5.03, 5.48, 4.11, 4.15, 3.9, 4.39, 4.55, 4.91, 4.11, 3.16, 2.99, 3.01, 3.5, 3.77.
29
• Assuming normal distributions with equal variances, test whether there is a difference in the average ultraviolet light reaching Riyadh in the cooler and warmer months. Use a level of significance of .05.
30
n1 = 15, n2 = 18
x̄1 = 3.877, x̄2 = 4.142
S1 = .751, S2 = .709
H0: μ1 = μ2
Ha: μ1 ≠ μ2
• The pooled two-sample variance is
  Sp² = ((n1 - 1)S1² + (n2 - 1)S2²) / (n1 + n2 - 2) = .531
• The test statistic is
  T = (x̄1 - x̄2) / √( Sp²(1/n1 + 1/n2) ) = -1.033
[Figure: two-tail rejection regions of area .025 each; critical values ±t(31, .025) = ±2.0423.]
• Since the value of the test statistic is in the acceptance region, H0 is accepted at α = .05.
• It means that there is no difference in the average ultraviolet light reaching Riyadh in the cooler and warmer months.
• Dependent Samples:
37
• For each Xi (an element of the sample before the experiment) and Yi (the corresponding element of the sample after the experiment) we obtain in the two samples, we compute a value di of a random variable D which represents the difference between the two populations; n is the number of items of data obtained in each of the two samples.
• The samples drawn from the two populations are therefore converted to a single sample, a sample of di's. The mean, d̄, and the standard deviation, Sd, of the distribution of di's are obtained as follows:
  d̄ = Σdi/n = Σ(xi - yi)/n
  Sd = √( Σ(di - d̄)² / (n - 1) )
• We are interested in testing one of the hypotheses:
  H0: μd = 0 vs Ha: μd ≠ 0, or Ha: μd > 0, or Ha: μd < 0
• The quantity
  T = (d̄ - μd) / (Sd/√n)
has a t distribution with n - 1 degrees of freedom.
40
Ex:
In an experiment comparing two feeding methods for calves, eight pairs of twins were used, one twin receiving Method A and the other twin receiving Method B. At the end of a given time, the calves were slaughtered and cooked, and the meat was rated for its taste (with a higher number indicating a better taste).
41
Twin pair Method A Method B
1 27 23
2 37 28
3 31 30
4 38 32
5 29 27
6 35 29
7 41 36
8 37 31
42
Assuming approximate normality, test if the average taste
score for calves fed by Method B is less than the average
taste for calves fed by Method A. Use α = .05 .
43
  di     di²
   4      16
   9      81
   1       1
   6      36
   2       4
   6      36
   5      25
   6      36
Totals: Σdi = 39, Σdi² = 235

H0: μd = 0 vs Ha: μd > 0
d̄ = Σdi/n = 39/8 = 4.875
Sd = √( (Σdi² - n·d̄²) / (n - 1) ) = √( (235 - 8(4.875)²) / 7 ) = 2.532

The test statistic is
  T = (d̄ - μd) / (Sd/√n) = 5.447

[Figure: upper-tail rejection region of area .05; critical value t(n-1, α) = t(7, .05) = 1.8946.]
Since T = 5.447 > 1.8946, H0 is rejected at α = .05: the average taste score for calves fed by Method B is lower than for those fed by Method A.
Quality Control
• A "defect" is an instance of a failure to meet a requirement imposed on a unit with respect to a single quality characteristic. In inspection or testing, each unit is checked to see if it does or does not contain any defects. If every dosage unit could be tested, the expense would probably be prohibitive both to manufacturer and consumer; testing everything may also cause misclassification of items and other errors. Quality can be accurately and precisely estimated by testing only part of the total material (a sample), which requires only small samples for inspection or analysis.
51
• Data obtained from this sampling can then be treated statistically to estimate population parameters. After inspecting n units we will have found, say, d of them to be defective and (n - d) of them to be good ones. On the other hand, we may count and record the number of defects, c, found on a single unit. This count may be 0, 1, 2, …. Such an approach of counting defects on a unit becomes especially useful if most of the units contain one or more defects.
52
• Control charts can be applied during in-process manufacturing operations, for finished product characteristics, and in research and development for repetitive procedures. We may always convert a measurable characteristic of a unit to an attribute by setting limits, say L (lower bound) and U (upper bound), for x. Then if x lies between the limits the unit is a good one; if outside, it is defective. Tablet weight serves as the control chart example below.
53
• We are interested in ensuring that tablet weights remain close to a target value under "statistical control". To achieve this objective, we periodically sample a group of tablets, measuring the mean weight and variability. Variability can be calculated on the basis of the standard deviation or the range. The range is the difference between the lowest and highest values.
54
• If the sample size is not large (<10), the range is an efficient estimator of the standard deviation. The mean weight and variability of each sample (subgroup) are plotted sequentially as a function of time. The control chart is a graph that has time or order of submission of sequential lots on the x axis and the average test result on the y axis. The subgroups should be as homogeneous as possible relative to the overall process. They are usually (but not always) taken as units manufactured close in time.
55
• Four to five items per subgroup is usually an adequate sample size. In our example, 10 tablets are individually weighed at approximately 1-hour intervals. The mean and range are calculated for each of the subgroup samples. As long as the mean and range of the 10-tablet
samples do not vary “ too much” from subgroup to
subgroup, the product is considered to be in control (it
means that the observed variation is due only to the
random, uncontrolled variation inherent in the process).
56
• We will define upper and lower limits for the mean and range of the subgroups. The construction of these limits is based on the normal distribution. In particular, a value more than 3 standard deviations from the mean is highly unlikely and can be considered to be probably due to some systematic, assignable cause. The average line (the target value) may be determined from the history of the product, with regular updating, or may be determined from the product specifications.
57
• The action lines (the limits) are constructed to represent ±3 standard deviations (±3σ limits) from the target value. The upper and lower limits for the mean (X̄) chart are given by:
  X̄ ± A·R̄, where R̄ = ΣR/K
is the average range, K is the number of samples (subgroups), and A is a factor obtained from a table according to the sample size.
• The central line and the upper and lower limits for the range chart are given by:
  Central line = R̄ = ΣR/K
  Lower limit = D_L·R̄
  Upper limit = D_U·R̄
• where D_L and D_U are factors obtained from a table according to the sample size. Note that the sample size is constant.
• Ex: Tablet weights and ranges from a tablet manufacturing process (data are the average and range of 10 tablets):
60
Date Time Mean Range
X R
3/1 11 a.m. 302.4 16
12 p.m. 298.4 13
1 p.m. 300.2 10
2 p.m. 299 9
3/5 11 a.m. 300.4 13
12 p.m. 302.4 5
1 p.m. 300.3 12
2 p.m. 299 17
61
X̄ chart: A = .31 at n = 10, R̄ = 10.833
  UCL = X̄ + A·R̄ = 303.358, CL = 300, LCL = X̄ - A·R̄ = 296.642
R chart: D_L = .22, D_U = 1.78 at n = 10
  UCL = D_U·R̄ = 19.283, CL = R̄ = 10.833, LCL = D_L·R̄ = 2.383
[Charts: subgroup means and ranges plotted from 3/1 through 3/22 against these limits.]
67
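Charts like these can also be produced in SAS, tying back to the SAS material earlier. A minimal sketch, assuming SAS/QC is licensed and the individual tablet weights sit in a data set with one row per tablet (the data set and variable names here are illustrative, not from the original):

/* tablets: one row per tablet; subgrp identifies the hourly subgroup */
proc shewhart data=tablets;
   xrchart weight*subgrp;   /* produces the X-bar chart and R chart */
run;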
Non‐parametric tests
• Note: when parametric tests are valid, use them.
• Commonly used: Wilcoxon tests, Chi square, etc.
• Performance comparable to parametric tests.
• Useful for non-normal data, if normalization is not possible.
• Note: deriving confidence intervals is difficult or impossible.
Wilcoxon signed rank test
To test the difference between paired data
STEP 1
• Exclude any differences which are zero
• Put the rest of differences in ascending order
• Ignore their signs
• Assign them ranks
• If any differences are equal, average their ranks
STEP 2
• Count up the ranks of +ives as T+
• Count up the ranks of –ives as T‐
STEP 3
• If there is no difference between drug (T+) and
placebo (T‐), then T+ & T‐ would be similar
• If there were a difference
one sum would be much smaller and
the other much larger than expected
• The smaller sum is denoted as T
• T = smaller of T+ and T‐
STEP 4
• Compare the value obtained with the critical
values (5%, 2% and 1% ) in table
• N is the number of differences that were
ranked (not the total number of differences)
• So the zero differences are excluded
Hours of sleep Rank
Patient Drug Placebo Difference Ignoring sign
1 6.1 5.2 0.9 3.5*
2 7.0 7.9 -0.9 3.5*
3 8.2 3.9 4.3 10
4 7.6 4.7 2.9 7
5 6.5 5.3 1.2 5
6 8.4 5.4 3.0 8
7 6.9 4.2 2.7 6
8 6.7 6.1 0.6 2
9 7.4 3.8 3.6 9
10 5.8 6.3 -0.5 1
3rd & 4th ranks are tied hence averaged
T= smaller of T+ (50.5) and T- (4.5)
Here T = 4.5, significant at the 2% level, indicating the drug (a hypnotic) is
more effective than placebo.
Wilcoxon rank sum test
• To compare two groups
• Consists of 3 basic steps
Non‐parametric equivalent of
t test
Step 1
• Rank the data of both the groups in
ascending order
• If any values are equal average their ranks
Step 2
• Add up the ranks in group with smaller
sample size
• If the two groups are of the same size either
one may be picked
• T= sum of ranks in group with smaller sample
size
Step 3
• Compare this sum with the critical ranges
given in table
• Look up the rows corresponding to the sample
sizes of the two groups
• A range will be shown for the 5% significance
level
Non-smokers (n=15) Heavy smokers (n=14)
Birth wt (Kg) Rank Birth wt (Kg) Rank
3.99 27 3.18 7
3.79 24 2.84 5
3.60* 18 2.90 6
3.73 22 3.27 11
3.21 8 3.85 26
3.60* 18 3.52 14
4.08 28 3.23 9
3.61 20 2.76 4
3.83 25 3.60* 18
3.31 12 3.75 23
4.13 29 3.59 16
3.26 10 3.63 21
3.54 15 2.38 2
3.51 13 2.34 1
2.71 3
Sum=272 Sum=163
Null hypothesis (for the sleep example above): hours of sleep are the same using placebo & the drug.
* Ranks 17, 18 & 19 are tied, hence they are averaged.
The calculated value of T = 163; the tabulated value of T for group sizes (14, 15) is 151.
The mean birth weights are not the same for non-smokers and smokers;
they are significantly different.
Spearman’s Rank Correlation Coefficient
Example: to assess the correlation between the honesty and wisdom of the boys of a class.
Study Designs
1. Case Reports
2. Case Series
3. Analyses of Secular Trends
4. Case-Control Studies
5. Cohort Studies
6. Randomized Clinical Trials

1. Case Reports

2. Case Series
3. Analyses of Secular Trends
• Called as ecological studies.
• Examines trends in an exposure that is a presumed cause and
trends in a disease that is a presumed effect and test whether the
trends coincide.
• Vital statistics and record linkage are often used in these studies.
• Useful for rapidly providing evidence for or against a hypothesis.
• Unable to control confounding variables. E.g., cigarette smoking might appear to be the cause of lung cancer, but the chance of occupational hazards still cannot be ruled out.
4. Case – Control Studies
• Compare cases with the disease to controls without the
disease, looking for differences in exposure.
• Multiple possible causes of a single disease can be
studied.
• Helps in studying relatively rare diseases; requires a smaller sample size.
• Information is generally obtained retrospectively from medical records, by interviews, or by questionnaires.
• Limitations are the validity of retrospective information and the challenge of control selection; inappropriate control selection can lead to incorrect conclusions.
5. Cohort Studies
• Identify subsets of a defined population and follow them over time, looking for differences in their outcome.
• Used to compare exposed patients to unexposed patients; can also be used to compare one exposure to another, or when multiple outcomes from a single exposure are to be studied.
• Done either prospectively or retrospectively.
• Provides a more reliable causal association.
• But requires a large sample size (even for an uncommon outcome) and can require a prolonged time period to study delayed outcomes.
Differences between Cohort study and Case-Control study

                              Disease
                        Present (Cases)    Absent (Controls)
Factor   Present (Exposed)
         Absent (Unexposed)

(Cohort studies classify subjects by exposure and look forward to disease; case-control studies classify by disease and look back at exposure.)
6. Randomized Clinical Trials
Computer System in Hospital
Pharmacy
1. Pattern of Computer Uses in Hospital Pharmacy
• Hospital pharmacy has been very slow to adopt computers; only about 60% of pharmacies are computerized to any extent.
• Institutional pharmacy managers may be wary about computerization, because hospital pharmacy is a more difficult and complex operation than retail pharmacy.
• Retail pharmacies dispense prescriptions in more or less the same way, whereas a hospital pharmacy department distributes different types of drug products and provides different types of services, each with its own information requirements.
• Institutional pharmacists are also aware that many computer systems have performed in a less than satisfactory manner.
• One survey revealed that only about 69% of hospital pharmacies were fully satisfied with their computer system, although nearly two-thirds believed that their computer system had improved some pharmacy operations, such as billing and the quality of drug therapy in the hospital.
• There is therefore considerable room for improvement.
Pattern of Computer Uses in Hospital Pharmacy
• The most common system features are the ability to:
1. Generate drug order labels
2. Maintain patient profiles
3. Generate drug use review data
4. Maintain a drug formulary
5. Update drug prices
6. Transfer patients' drug charges to the billing department
7. Provide some inventory control functions
8. Flag food and drug interactions
Capabilities of a Hospital Computer System
• The system should be capable of performing:
1. Patient record database management
2. Medication order entry
3. Drug labels & lists
4. Intravenous solutions and admixtures
5. Patient medication profiles
6. Drug utilization review
7. Drug therapy problem detection
8. Drug therapy monitoring
9. Formulary search and update
10. Purchasing and inventory control
11. Billing procedures
12. Management information systems and decision support
13. Integration with other hospital departments
1. Patient record database management
• A hospital pharmacy computer assures that the pharmacy's patient record database is continually updated to reflect the current status of each patient.
• Updating is done by accessing information from the admitting department database to determine recent admissions, discharges, and patient transfers (ADT).
• The computer system should be capable of producing a current roster of patients, identifying name, age, sex, room number, and hospital service unit.
• The computer system must be capable of displaying:
1. Present diagnoses
2. Other diseases present
3. Allergies
4. Weight
5. Height
6. Physician
7. Special notes about the patient
Medication order entry
• Rapid processing of drug orders is an essential function of the computer system. Typically, orders are entered at a terminal by a technical person.
• A formatted data entry screen should allow easy entry and retrieval of orders. A pharmacist should be able to retrieve an order for review or verification prior to administration to the patient.
• All drug orders should contain at least the following data elements:
1. Physician
2. Drug code
3. Drug generic name and strength
4. Route of administration
5. Dosage administration schedule
6. Start date
7. Stop date
8. Order status: conditional, active, discontinued
9. Pharmacist verification code
• The system should be capable of easily and rapidly aggregating and displaying entries by name, chart, or room number. It should be possible to separate scheduled and PRN orders, and patients with active therapy from patients with no therapy.
• The system should be capable of automatically scheduling start and stop dates for the administration of each drug.
Drug labels and lists
• The system should be capable of generating medication container labels and reports in the form of:
1. Patient medication profiles
2. "Fill lists" for the preparation of individual doses in medication charts
3. Lists of medication changes since the last "fill list" was prepared
4. Drug order renewal lists for the prescriber
5. Medication administration records (MAR)

Medication administration record
• The MAR also lists: 6. Weight, 7. Height, 8. Allergies.
• Medications currently scheduled for administration are listed.
• Each entry contains the drug, strength, dosage schedule, and start and stop dates.

Intravenous solutions and admixtures
• IV solutions are prepared separately from other medication orders and require separate computer reports.
• Orders for these solutions should contain:
1. Patient identifying information
2. Medication order information
3. Start and stop dates
4. Administration rate
5. Order status (conditional, active)
[Sample screens: intravenous solutions and admixtures; patient medication profile.]
• A typical IV admixture entry shows: medication order, drug name, dosage, administration schedule, original start date, and stop date.
Purchasing and inventory control
• Drug inventory is a delicate balance between not ordering too much and avoiding out-of-stock situations.
• Either a perpetual or a periodic inventory control system can be used. The perpetual system is the most sophisticated; it is impossible to maintain in a pharmacy except by computer.
• The perpetual system involves maintaining a running balance of all drugs in stock, although a hospital pharmacy typically starts with a periodic control system.
• All drugs are entered into the database as they are received in the pharmacy and added to the beginning inventory level, so the database reflects the current stock level of each drug.
• Quantities of all drugs leaving the pharmacy are similarly subtracted from the inventory balance; this is done automatically whenever a drug order is processed.
• The residual inventory balance can be checked by the inventory system manager to verify the current balance against the minimum order balance, and a list of the drugs that need to be ordered is produced on demand.
Purchasing and inventory control (continued)
• The perpetual inventory system is a remarkable labor-saving device and is helpful in avoiding costly stock-outs.
• The physical inventory must accurately match the computer-maintained file, so pharmacy personnel must be careful to assure that all drugs leaving or entering the pharmacy are entered into the computer.
• Finally, the computer file is periodically checked against the shelf inventory to assure accuracy, usually once a week.
• The amount of inventory on hand is compared to minimum and maximum stock levels. The existing stock level is entered into the computer manually, and the system then generates a placement order copy.
• A purchase order may then be generated for each supplier, observing the minimum order quantity for each supplier.
• The computer provides reports of drug purchase history information that are invaluable in hospital inventory control management.
Purchasing and Inventory
Control
• Sophisticated computer system can
perform Purchasing order processing and
Inventory management functions
• 1. Detects items that have reached pre-
determined minimum order levels and list
of those items and sorted by suppliers, in
the form of a purchase order
www.revolutionpharmd.com
Purchasing and Inventory
Control
• 2. Track purchase order through the
hospital and vendor purchasing system to
avoid duplicate order
• 3. Display requested drug to know current
stock and gives necessary information for
processing orders
• 4.Provides aggregate drug usage
statistics
www.revolutionpharmd.com
Purchasing and Inventory
Control
• 5.Provide periodic drug order history
report - ( 6 months) – Contains
information of price and stock number
• 6.Automatically recalculate and update
optimum reorder point based on order
history and lead time from suppliers
• 7.Detects and reports infrequently
purchased items
Purchasing and Inventory Control
• 8. Automatically print physical inventory reconciliation and inventory shrinkage reports for the perpetual inventory control system
• 9. Give reports of inventory on hand, stock turnover and the value of drugs purchased from each supplier
• 10. Update the cost information in the database
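A sketch of the reorder-point recalculation in item 6, using the common rule of average daily usage × lead time + safety stock (all figures invented):

```python
def reorder_point(daily_usage, lead_time_days, safety_stock=0):
    """Classic reorder-point rule: average daily usage over the order
    history, multiplied by the supplier's lead time, plus safety stock."""
    avg_daily = sum(daily_usage) / len(daily_usage)
    return round(avg_daily * lead_time_days + safety_stock)

# 14 days of usage history for one drug, 5-day supplier lead time
usage = [12, 9, 15, 11, 10, 8, 14, 13, 9, 12, 10, 11, 16, 10]
print(reorder_point(usage, lead_time_days=5, safety_stock=20))  # -> 77
```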
Management Reports and Statistics
• A pharmacy needs to develop and maintain an information system to assist managerial decision making
• The computer system should generate management reports relating to the pharmacy workload for a specific period of time (e.g., monthly)
Management Reports and Statistics
• These reports help pharmacy managers to plan and monitor work schedules and budgets, and improve operational efficiency
• 1. Hospital census data:
• Number of admissions
• Discharges
• Patient days
Management Reports and Statistics
• 2. Aggregate drug usage:
• Total drug orders and doses produced
• Types of drug orders (oral, topical, injectable)
• Pharmacy preparation hours
Management Reports and Statistics
• 3. Drug use per patient, by drug, diagnosis or hospital service unit:
• Number of patients receiving each drug
• Average number of doses and total cost
• Usage by drug category – average number of doses and total cost (a sketch of such a report follows)
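A toy sketch of how such per-category usage figures could be aggregated (events, categories and costs are invented):

```python
from collections import Counter

# Each dispensing event: (drug category, doses, cost) -- illustrative data
events = [
    ("oral", 10, 4.50), ("injectable", 2, 18.00),
    ("oral", 6, 2.10), ("topical", 1, 5.25), ("injectable", 4, 36.00),
]

doses = Counter()
cost = Counter()
for category, n, c in events:
    doses[category] += n
    cost[category] += c

for category in doses:
    print(f"{category:11s} doses={doses[category]:3d} cost=${cost[category]:.2f}")
```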
Use of Computers in Community Pharmacy
– SAI KUMAR
Computers have invaded every walk of life, and almost all commercial organizations and business firms have undergone significant computerization; community pharmacy establishments are no exception. At present, community pharmacies use computers for selective pharmaceutical purposes. While there are several possible purposes, the following is a list of the majority of community pharmacy functions that could be computerized.
(1) Clerical: Preparation of prescription labels. Providing a receipt for the patient. Generation of a hard-copy record of the transaction. Calculation of total prescription cost. Maintenance of a perpetual inventory record. Accumulation of suggested orders based on suggested order quantities. Automatic ordering of required inventory via electronic transmission. Calculation and storage of annual withholding statements.
(2) Managerial: Preparation of the daily sales report. Generation of a complete sales analysis as required for a day, week, month, year and year-to-date, for the number of prescriptions handled and the amount in cash. Estimation of profit and financial ratio analysis. Production of drug usage reports. Calculation of gross margin, reported in all manner of detail. Calculation of the number of prescriptions handled per unit time, to help in staff scheduling (see the sketch below). Printing of billing and payment summaries.
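Two of the managerial figures above are simple enough to sketch directly (the numbers are invented):

```python
def prescriptions_per_hour(rx_count, hours_open):
    """Throughput figure used for staff scheduling."""
    return rx_count / hours_open

def gross_margin(sales, cost_of_goods):
    """Gross margin as a percentage of sales."""
    return 100.0 * (sales - cost_of_goods) / sales

print(prescriptions_per_hour(rx_count=184, hours_open=8))           # 23.0 per hour
print(round(gross_margin(sales=5200.0, cost_of_goods=3900.0), 1))   # 25.0 (%)
```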
(3) Professional: Building a patient profile. Storing of information on drug and other allergies to warn about possible problems. Retrieval of the current drug regimen for review. Updating of patient information in the file. Printing of drug–drug and drug–food interactions. Maintaining a physicians file including specialty, designation, address, phone, office hours, etc.
(4) Clinical support:
– Patient medication profile – Patient education profile
(a) Bibliographic database,
(b) Journal information and
(c) Textbook material.
Generally, a bibliographic database is adopted, as usually there is a medical library nearby. A medicine-oriented database, like MEDLINE, or a pharmacy-oriented one, like International Pharmaceutical Abstracts, may be chosen. Some on-line databases of the medical literature are listed in Table 15.1.
Table 15.1 (extract): 3. International Pharmaceutical Abstracts – American Society of Hospital Pharmacists – more than 600 publications from 1970 are covered; it is updated weekly.
Advantages: The most important advantage is the time saved in conducting literature searches. A pharmacist may require several hours to research a particular therapeutic question from a literature search covering about 10 years of articles; by computer it can be done in minutes. Only a few seconds are required to broaden a computer search from a specific drug to an entire therapeutic class, whereas manually it is a tedious job to search the Index Medicus.
A pharmacist should be able to retrieve orders for review prior to administration to the patient. Data may be entered by use of codes for drug name and dosing schedule. All drug orders should contain the following:
Drug name (code)
Drug generic name and strength
Route of administration
Dosage schedule
Starting date
Stopping date
Physician code
Pharmacist verification code.
Features: The system should be capable of rapidly collecting and displaying entries by patient name or room number. It should be capable of scheduling starting and stopping dates automatically and of separating the orders of patients on active drug therapy from those with no drug therapy.
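A minimal sketch of an order record carrying the fields listed above, with retrieval by patient name or room number (field names and sample codes are illustrative, not a standard):

```python
from dataclasses import dataclass

@dataclass
class DrugOrder:
    patient: str
    room: str
    drug_code: str
    generic_name_strength: str
    route: str
    dosage_schedule: str
    start_date: str
    stop_date: str
    physician_code: str
    pharmacist_verification: str

def orders_for(orders, *, patient=None, room=None):
    """Retrieve entries for review by patient name or room number."""
    return [o for o in orders
            if (patient is None or o.patient == patient)
            and (room is None or o.room == room)]

orders = [DrugOrder("J. Doe", "12B", "AMX500", "amoxicillin 500 mg", "oral",
                    "1 cap q8h", "2024-03-01", "2024-03-08", "MD017", "RPH04")]
print(orders_for(orders, room="12B"))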
(3) Preparation of Lists
The computer system should be capable of producing labels and reports in the form of:
(i) Patient medication profile,
(ii) "Fill-lists" for preparation of individual doses,
(iii) List of medications charged,
(iv) Drug order renewal lists for the prescriber, and
(v) Medication administration record (MAR).
Orders entered for IV solutions, admixtures and total parenteral nutrition (TPN) should be prepared separately. These orders should contain the following information:
"Patient" and "Medication Order" identifying information,
Start date and stop date,
Administration rate, and
Order status (conditional or active).
The system should be capable of calculating flow rates and checking incompatibilities (a sketch follows). It should allow conditional entry and its checking by a pharmacist. The system should also be able to prepare lists of solutions that will soon expire and of those to be prepared in the pharmacy.
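The flow-rate calculation mentioned above follows standard infusion arithmetic: volume over time for mL/hr, and volume × drop factor over minutes for a gravity drip. A sketch (volumes and times invented):

```python
def flow_rate_ml_per_hr(volume_ml, hours):
    """Infusion rate in mL/hr."""
    return volume_ml / hours

def drip_rate_gtt_per_min(volume_ml, minutes, drop_factor_gtt_per_ml):
    """Drip rate for a gravity set with the given drop factor."""
    return volume_ml * drop_factor_gtt_per_ml / minutes

print(flow_rate_ml_per_hr(1000, 8))           # 125.0 mL/hr
print(drip_rate_gtt_per_min(1000, 480, 15))   # 31.25 gtt/min
```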
CHAPTER 23
Applications of Computers in Pharmaceutical and Clinical Studies

23.1 Introduction
Now-a-days computers are used in pharmaceutical industries, hospitals and various departments for drug information, education, evaluation, analysis, medication history and the maintenance of financial records. They have become indispensable in the development of clinical pharmacy, hospital pharmacy and pharmaceutical research. They co-ordinate effective communication and support clinical and financial management functions.
Effective functioning of any organization largely depends upon the continuous flow of information: receiving the information, storing it, processing it and disseminating it. An effective management information system always provides the needed information in the right form, at the right time and at the right place. Each piece of information in the organization is connected with the others through communication channels, making the organizational entity a decision-making point. Management involves creating, modifying, adding and deleting data in patient files to generate reports, and these reports are shared among doctors for further investigation, as they are connected through personal computers. Some popular Data Base Management System (DBMS) packages for personal computers are dBase III+ and FoxBase+. Hence, computers help in maintaining the overall health care system, and this is best illustrated by enlisting their applications.

23.2 Patient Monitoring
Patient monitoring includes monitoring of physiological processes in patients, such as blood pressure, pulse rate, temperature, etc. This information plays a special role in the detection and prevention of critical conditions in patients. It helps in giving warning of critical conditions for immediate nursing attention and enables the medical staff to make accurate judgments of the patient's progress. It further provides data for research purposes when monitoring patients under intensive care.
Hence computers play an important role in communication by acquiring data about the patient's metabolism and communicating it to the medical staff as displayed graphs. Detecting critical conditions and generating alarms involve both numerical and logical data processing. This processing helps in giving warning of critical conditions, enables the medical staff to judge the patient's progress properly and, in the long term, provides data for medical research.
After analysis of the parameters, "AND"/"OR" statements indicate a logical relationship, whereas IF…THEN marks a conditional computation. This combination of logical and conditional data processing makes the patient monitoring system a decision-making instrument for the interpretation of results.

23.3 Medication Monitoring
To meet the goal of optimum drug therapy, medication monitoring is essential. The prescriptions a patient receives over a period of time are entered into the computer, which then serves as a chronological drug file for the patient. It helps in suggesting the number of drugs along with their dosage schedule. Computers provide two types of information:
(a) Pharmacokinetic (b) Non-pharmacokinetic
Pharmacokinetic information: Computers calculate pharmacokinetic parameters very easily. These parameters include volume of distribution, bioavailability, rate of clearance, etc. This helps in maintaining the dosage schedule of drugs such as antibiotics and aminoglycosides.
Non-pharmacokinetic information: This includes allergic reactions, drug interactions, adverse drug reactions, etc. For such information two computer programmes are available:
1. MEDIPHOR (Monitoring and Evaluation of Drug Interactions by a PHarmacy-Oriented Reporting system)
2. PAD (Pharmacy Automated Drug interaction screening)
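As an illustration of how such pharmacokinetic parameters feed a dosage schedule, here is a one-compartment IV-bolus sketch (all values are invented and this is not dosing advice):

```python
import math

def concentration(dose_mg, vd_l, half_life_hr, t_hr):
    """Plasma concentration after an IV bolus, one-compartment model:
    C(t) = (dose / Vd) * exp(-k t), with k = ln 2 / half-life."""
    k = math.log(2) / half_life_hr
    return (dose_mg / vd_l) * math.exp(-k * t_hr)

# Aminoglycoside-like numbers, purely illustrative: 120 mg dose,
# Vd = 18 L, half-life = 2.5 h, concentration 8 h after the dose.
print(round(concentration(120, 18, 2.5, 8), 2))   # -> about 0.73 mg/L
```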
23.4 Maintenance of Records
Various records, like the patient's medication history, current treatment and financial records, are maintained in computers by feeding in accurate data: 'DATA' is a collection of facts, and the computer works as a 'DATABASE' manager. MEDLINE is a database package used for this purpose. It gives current information on patients regarding their name, age, sex, room number, weight, allergic reactions, etc. These records are stored in a 'FILE', like the "Physician name" file, "Direction" file or "Drug Interaction" file. These files contain specific information such as the physician's name, registration number, phone number, address, etc., and provide it whenever required.

23.5 Materials Management
Computers play a vital role in material planning, purchasing, inventory control and price forecasting. Inventory control is essential because it maintains the balance between stock-in-hand and excessive capital investment. Techniques such as ABC analysis and EOQ can easily be programmed, eliminating the tedious and time-consuming task of calculation. Computers are used to detect the items that have reached their minimum order level; the system then prepares a list and purchase orders for further supplies. Generally there are two systems for inventory control:
(a) Periodic Inventory Control System: In this system stock levels are checked manually and the amount of inventory in hand is compared with the minimum and maximum stock levels maintained in the computer. The computer helps in the placement of orders with different suppliers, after checking their terms and conditions, because all the stock entries are present in it.
(b) Perpetual System: In this system the computer reports the present position of all drugs: when drugs are received they are added to the initial stock to give the current stock, and when drugs are delivered to the various departments the quantities are subtracted accordingly. Such additions to and deletions from the inventory balance are made with the help of a database package.
The output from the computer may be obtained in various forms, such as:
Planning of material
Reorder points
Safety stocks
Over/under stocking
Slow-moving/fast-moving items
Expired drugs
Ledger for narcotics
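The EOQ technique mentioned in 23.5 is indeed easily programmed; a sketch with invented figures:

```python
import math

def eoq(annual_demand, order_cost, holding_cost_per_unit):
    """Economic order quantity: sqrt(2 * D * S / H)."""
    return math.sqrt(2 * annual_demand * order_cost / holding_cost_per_unit)

# 2400 units/year, $30 per order placed, $4 per unit held per year
print(round(eoq(2400, 30.0, 4.0)))   # -> 190 units per order
```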
23.6 Data Storage and Retrieval
The hospital administration computer helps in rapid data storage and retrieval, particularly when the stored data is subject to infrequent change and when groups of items based on the stored data need to be retrieved. The admission of in-patients and their discharge from a hospital involve data that changes every minute: admitting an in-patient ties up resources such as clinical and nursing staff, a bed, the operation theatre, the intensive care unit, the pharmacy department, radiological services, etc. Hence the decision to admit a new patient is not a simple one. Even the availability of a suitable bed is difficult to determine across the male and female wards, the isolation ward, etc. A prediction must be made that a suitable bed will be available at a future date, because if the estimate is over-optimistic, patients who are called in may be turned away at the last minute, and if it is over-pessimistic, expensive resources lie idle and the waiting period for treatment is extended.
Once the patient is admitted, the computer records and stores information such as clinical information, catering information, diagnosis, sex and medication. It provides detailed information about the medical and paramedical staff, including their duty chart, and helps senior personnel keep a check on the ward-by-ward loading of nursing staff and allocate additional help whenever required.

23.7 Diagnostic Laboratories
Computers meet the growing demand on testing laboratories: manual procedures were lengthy and time-consuming, whereas automated computerized instruments perform a large number of tasks with accuracy. Generally a LIS (Laboratory Information System) is used to manage the large amount of data. Here, instruments contain preprocessors that convert raw data into digital format and help in transmitting numerical values for report generation. The LIS also performs quality control. Similarly, many instruments have microprocessors that facilitate all phases of the testing process, from calibration of the instrument through to the reporting of results.
…the information supplied must be balanced, so that the patient is not affected adversely by being given too much or too little information.

23.10 Hospital Setting
The duties of the pharmacist have been changing tremendously, and it has become impossible to remember and recheck everything. Therefore computers are used in pharmacies to maintain accessible, legible and up-to-date medication records. They support overall patient care by maintaining patient records, drug consumption, registration numbers and detailed records of the accounts and purchase sections. Even for the retail pharmacist, computers have been of valuable assistance in prescription processing, including the display of information about the patient and the drug, adverse drug reactions, causation, duplication of orders, labelling conditions, etc.
Following are the other applications in hospital and retail pharmacy:
Calculation of monthly gross income
Generating pay slips
Updating the employee information
Placement of supply orders
Keeping track of total payments and amounts due to suppliers
Checking the quality and quantity of hospital supplies recorded and identifying any discrepancies
Recording purchases for accounting purposes.
A number of computer programs have been developed to assist physicians in dosing and scheduling drugs. Certain drugs, however, are extremely sensitive in certain patients; for such patients, physicians use computer programs to forecast drug levels and to choose the amounts and schedule of drug doses that will achieve the target level. Similarly, 'HELP' is a system that identifies abnormal chemistry levels, concurrent diseases and other related patient conditions.

23.12 Drug Interactions
Pharmacists cannot remember each and every medication, its therapeutic usage, its effects and its drug interactions. Therefore computers offer knowledge-base systems to extend our professional services. A computerized pharmacy can alert the physician or pharmacist to serious drug–drug, drug–food and drug–disease interactions that are likely to occur in a prescription. Examples of such databases and online services are MEDLINE, IDIS and Pharmline.
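A toy version of the kind of interaction screen such systems perform (the pair table is a two-entry illustration, not the MEDIPHOR or PAD knowledge base):

```python
# Tiny knowledge base keyed by unordered drug pairs (the pairs are
# well-known interactions, but the table is illustrative, not clinical).
INTERACTIONS = {
    frozenset({"warfarin", "aspirin"}): "increased bleeding risk",
    frozenset({"sildenafil", "nitroglycerin"}): "severe hypotension",
}

def screen(prescription):
    """Return alerts for every interacting pair in one prescription."""
    alerts = []
    drugs = list(prescription)
    for i in range(len(drugs)):
        for j in range(i + 1, len(drugs)):
            pair = frozenset({drugs[i], drugs[j]})
            if pair in INTERACTIONS:
                alerts.append((drugs[i], drugs[j], INTERACTIONS[pair]))
    return alerts

print(screen(["warfarin", "aspirin", "omeprazole"]))
```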
23.13 Community Pharmacy
Computers help in streamlining the refilling of prescriptions and have ended the long-standing problem of waiting in a queue for a refill. These systems are becoming popular because they remind the patient about refills and medication compliance. They not only help in filling individual prescriptions but also in processing the prescription in the right manner, and they make it possible to manage inventory, sales, accounts, etc. in the community pharmacy.

23.14 Drug Information Services
Various software, Internet, intranet and online services are available for the pharmacist to provide a drug information service to medical and paramedical professionals and to patients. Computer-aided drug design helps the chemist formulate a new drug molecule possessing the desired therapeutic action; these new drug entities can be generated through graphics and by changing the molecular configuration. CD-ROM technology has helped greatly in the evolution of compact electronic libraries. Various software programs of different companies are listed here.

Computerized Recordkeeping
(h) The date of dispensing of the prescription and the identifying designation of the dispensing pharmacist for the original filling and each refill.
(2) The entries shall be made into the system at the time the prescription is first filled and at the time of each refill, except that the format of the record may be organized so that the data already entered may appear for the prescription or refill without reentering that data. Records that are received or sent electronically may be kept electronically.
(3) The dispensing pharmacist shall ensure that the original prescription and a record of each refill are preserved as a hard copy for a period of three (3) years and thereafter preserved as a hard copy or electronically for no less than an additional two (2) years. The original prescription and a record of each refill, if received by facsimile, shall be preserved as a hard copy, the original electronic image, or electronically for a period of three (3) years and thereafter be preserved as a hard copy, the original electronic image, or electronically for no less than an additional two (2) years. The original and electronic prescription shall be subject to inspection by authorized agents. An original prescription shall not be obstructed in any manner.
(4) The original prescription and a record of each refill, if received as an e-prescription, shall be preserved electronically for a period of no less than five (5) years. The electronic prescription shall be subject to inspection by authorized agents. An original prescription shall not be obstructed in any manner.
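The 3 + 2 year retention arithmetic in items (3) and (4) lends itself to a small calculator; a sketch, assuming a simple anniversary-date reading of the rule (not legal guidance):

```python
from datetime import date

def retention_deadlines(dispensed_on, hard_copy_years=3, additional_years=2):
    """Hard-copy retention deadline and the later deadline after which the
    record (hard copy or electronic) may be discarded, per the 3 + 2 rule."""
    def add_years(d, n):
        try:
            return d.replace(year=d.year + n)
        except ValueError:          # Feb 29 -> Feb 28 in a non-leap year
            return d.replace(year=d.year + n, day=28)
    hard_copy_until = add_years(dispensed_on, hard_copy_years)
    keep_until = add_years(hard_copy_until, additional_years)
    return hard_copy_until, keep_until

print(retention_deadlines(date(2024, 3, 1)))  # -> (2027-03-01, 2029-03-01)
```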
(5) The required information shall be entered into the system for all prescriptions filled at the
pharmacy.
(6) The system shall provide adequate safeguards against improper manipulation or alteration
of the data.
(7) The system shall have the capability of producing a hard-copy printout of all original and
refilled prescription data as required in Section 1 of this administrative regulation. A hard-copy
printout of the required data shall be made available to an authorized agent within forty-eight
(48) hours of the receipt of a written request.
(8) The system shall maintain a record of each day’s prescription data as follows:
(a) This record shall be verified, dated, and signed by the pharmacist(s) who filled those
prescription orders either:
1. Electronically;
2. Manually; or
3. In a log.
(b) This record shall be maintained for no less than five (5) years; and
(c) This record shall be readily retrievable and shall be subject to inspection by authorized
agents.
(9) An auxiliary recordkeeping system shall be established for the documentation of refills if
the automated data processing system is inoperative for any reason. The auxiliary system shall
insure that all refills are authorized by the original prescription order and that the maximum
number of refills is not exceeded. When the automated data processing system is restored to
operation, the information regarding prescriptions filled and refilled during the inoperative
period shall be entered into the automated data processing system within seventy-two (72) hours.
(10) Controlled substance data shall be identifiable apart from other items appearing in the
record.
(11) The pharmacist shall be responsible to assure continuity in the maintenance of records
throughout any transition in record systems utilized.
Section 2. A computer malfunction or a data processing service provider's negligence shall not be a defense against charges of improper recordkeeping.
Section 3. This administrative regulation is not applicable to recordkeeping for drugs prescribed for and administered to patients confined as inpatients in an acute care facility.