Part III
STATISTICS IN RESEARCH
Much of a researcher’s time is spent accumulating and analysing data: data about
temperatures, occurrences, concentrations, durations, etc. The field of statistics deals
with the analysis of data.
Almost always one has imperfect data about a small subset (the sample) of the
intended population. The data is imperfect because of experimental, recording and
other errors. One has data only about a subset of the population because it is impossible
to examine the entire population. A botanist who proposes a law of genetics can only
test this law on a tiny fraction of the plants on the planet (to say nothing about those
plants not yet growing); sooner or later one has to stop and conclude that the patterns
found in the sample are also found in the whole.
These chapters proceed as follows. We start with data and how to organize it. We
then look at probability and probability distributions, especially the normal distribution.
This brings us to the main purpose of this part: explaining estimation and statistical
tests. These are used to answer questions such as: how confidently can one extrapolate
from the sample data, and does the sample data agree with a preconceived theory?
In this book we examine only the situations a researcher is most likely to encounter.
If you need more information then you should consider a statistics book (e.g. Net82,
Hog93, Fre90, Moo85, Mul87, Wad90) or a friendly statistician. A computer statistics
package can also be useful, both for performing the necessary calculations and for
representing the data in various ways. An example of a full statistics package is the SAS
program; more modest packages include SPSS and Minitab. A spreadsheet program
(such as QuattroPro, Excel or Lotus 1-2-3) can also be useful.
8 Basic Statistics: Data Organization
In this chapter we introduce common statistical measures and discuss the represen-
tation and organisation of data. The chapter concludes with a discussion of regression,
which is a technique for fitting curves to data.
Data can be qualitative or quantitative. Quantitative data has numerical values, e.g.
in the range 0 to 100. Qualitative data has values that fall into categories, e.g. animal,
vegetable or mineral. For example, whether it rained or not yesterday is a qualitative
question (the categories are rained and not rained); how much rain was measured is a
quantitative question. Quantitative data can be discrete or continuous: discrete if it
takes on only whole values and continuous if it takes on any real value in some interval.
The amount of rain yesterday is continuous data; the number of days on which it rained
last year is discrete data.
A statistical measure, or simply a statistic, is a summary of the data. This data can
be for the entire population or for the chosen sample. We now discuss some population
statistics.
Most statistics of quantitative data are either measures of central tendency (where the
data is centred) or measures of dispersion (how spread out the data is). In the former
category are the mean, the median and the mode.
The population mean is the average and is denoted by the Greek letter µ. If the
population data values are x_1, x_2, \ldots, x_n, then:

\mu = \frac{1}{n} \sum_{i=1}^{n} x_i

The symbol \Sigma (pronounced 'sigma') means 'the sum of', so the above formula says that

\mu = \frac{x_1 + x_2 + x_3 + \ldots + x_n}{n}
The median is the middle value were the data to be sorted from smallest to biggest. If
there is an even number of values, then the median is usually taken to be the average
of the middle two values. The mode is defined as the most common value(s) in the
data set.
Aside: In this book we use the decimal point in real numbers instead of the comma
(e.g. 5.42 rather than 5,42).
While the mean is more commonly used, the median can be a better summary of
the data if there are extreme values. For example, in statistics on the price of houses
in an area, the median is often given, since the mean is very much influenced by a large
mansion being sold. Say selling prices in thousands of rands were 100, 100, 110, 120,
130, 140, and 980. The mean here is 240 and the median is 120. The median is a better
summary as most people paid around 120 thousand rand for their house.
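These measures are easy to compute by machine. As a minimal sketch (Python is just one choice; a spreadsheet or statistics package does the same job), the following checks the house-price figures above.

import statistics

# House prices in thousands of rands (the example above)
prices = [100, 100, 110, 120, 130, 140, 980]

print(statistics.mean(prices))    # 240 -- pulled up by the mansion
print(statistics.median(prices))  # 120 -- a better summary here
print(statistics.mode(prices))    # 100 -- the most common value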
For qualitative variables (e.g. colour) the mode is the only possible measure. For
ranked data (e.g. 1st, 2nd, 3rd) both the median and the mode are possible, but the
median is preferred as it considers the rankings rather than just frequencies.
One simple statistic is the range: the gap between the largest and smallest values.
This is seldom useful, as there can be eccentric data. For example, the annual incomes of
South Africans range from zero to many million rands—however, the spread of incomes
has very few people at the top extreme and very many at the bottom, and most incomes
lie in a much narrower range.
One is more likely to use percentiles. For example, the 90th percentile is the value
below which lies 90% of the data. In this terminology, the median is the 50th percentile.
The population variance, denoted \sigma^2, is the average squared deviation from the mean:

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2

The (positive) square root of the variance is called the standard deviation of the
data; the population standard deviation is denoted by \sigma. The standard deviation is the
most used measure of dispersion.
One property of the mean and the standard deviation is that if the data comes from
a ‘reasonable’ population (to be defined later), then about 68% of the data lies between
µ − σ and µ + σ (that is, within one standard deviation of the mean), while 95% of the
data lies within two standard deviations of the mean.
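As a rough sketch of how these population statistics might be computed, again using the (hypothetical) house prices purely for illustration:

import statistics

prices = [100, 100, 110, 120, 130, 140, 980]   # hypothetical data

mu = statistics.mean(prices)       # population mean
sigma = statistics.pstdev(prices)  # population standard deviation (divides by n)

print(mu, sigma)
# The single extreme value (980) makes sigma very large, which is one reason
# the 68%/95% rule only applies to 'reasonable' (normal-like) populations.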
Sometimes a picture explains the situation much more clearly than a jumble of numbers.
However, do not include a picture just to impress people; pictures should be relevant
and useful. There are several common types of pictures. Spreadsheets and many other
computer packages contain inbuilt graphing functions.
A common device for displaying qualitative data is a pie chart. For example, if 60% of
a country’s music exports is kwaito, 25% rock and 15% rap, then this could be depicted
by the pie chart in Figure 1.
Figure 1. A pie chart of music exports: kwaito, rock and rap.
In frequency data, one divides the range of values into intervals and counts the
number of data items that lie in each interval. For example, examination scores are
given as a mark out of 100 and a schoolteacher records the number of A-plus scores (90
– 99), As (80 –89), Bs (70 – 79), Cs (60 – 69), etc. The result is known as a frequency
distribution. An example of a frequency distribution is given in Table 1.
Figure 2a. The frequency distribution depicted graphically (frequency against grade, with intervals centred at 34.5, 44.5, ..., 94.5).
One can also depict the data with a frequency polygon. In this, one plots for
each interval a point above the midpoint of each interval giving the number of data
items recorded, and then one joins up consecutive plots (see Figure 2b). Note that the
frequency polygon starts and ends at zero: one assumes that the intervals immediately
before and after those given both have 0 data items.
Figure 2b. A frequency polygon (frequency against grade).
In an ogive one looks at the data cumulatively. One calculates for each interval
the cumulative count (or percentage)—the number of data items up to and including
that interval. Then one plots for each interval a point at the maximum of the interval
giving the cumulative count (or percentage). These points are then joined up to form
the ogive. (If the data is discrete, then the end of an interval is halfway between the
maximum of that interval and the minimum of the next interval. So in our example
the ends are 39.5, 49.5, etc.) The ogive increases from 0 to total count (or 100 percent)
and is often S-shaped. An example is presented in Figure 2c.
Figure 2c. An ogive (cumulative percentage against grade).
An ogive is also useful in calculating percentiles. For example, if one wishes to know
the 70th percentile, one draws a horizontal line from 70% on the Y-axis and determines
where it intersects the ogive. The corresponding value on the X-axis is the desired
percentile.
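A frequency distribution and the cumulative counts needed for an ogive are easy to tabulate by hand or by computer. A minimal Python sketch, using made-up examination scores (the interval boundaries follow the grade example above):

# Hypothetical examination scores out of 100
scores = [42, 55, 61, 63, 67, 71, 72, 74, 78, 81, 83, 88, 91, 95]

# Intervals 40-49, 50-59, ..., 90-99
intervals = [(lo, lo + 9) for lo in range(40, 100, 10)]

cumulative = 0
for lo, hi in intervals:
    count = sum(1 for s in scores if lo <= s <= hi)
    cumulative += count
    percent = 100 * cumulative / len(scores)
    print(f"{lo}-{hi}: frequency {count}, cumulative {cumulative} ({percent:.0f}%)")

# Reading off where the cumulative percentage first reaches, say, 70%
# gives (approximately) the 70th percentile, as with the ogive.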
In paired data there is a series of observations, each with two values: say (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n).
For example, a chemical engineer measures the arsenic levels in a stream at various dis-
tances from a factory. He obtains a set of paired (distance, level) data. This data can
be represented by a scattergram on X-Y axes in which there is one plot for each data
point (see Figure 4). Such paired data often results from experimental research—e.g.
a physicist suspends a spring from the roof, attaches different weights to it and records
the stretch for each weight.
Figure 4. A scattergram.
Based on the data, one might propose a curve which ‘fits’ the data. The data
depicted in the example scattergram suggests a straight line fit. We discuss this type
of problem next.
8.3 REGRESSION
Typically one starts with a scattergram. Then one looks at it to see whether there
seems to be any pattern. The simplest case is whether there is a straight-line relationship
and, if so, what that straight line is. The process one uses to find the best
estimate of a straight-line fit for data is known as linear regression. We deal with
the question of how good the fit is in Section 11.1 (linear correlation).
One proposes a relationship of the form y = A + Bx, with parameters A and B. (Recall that B is the slope of the line and A the y-intercept.)
The task is to find the best values of the parameters.
In linear regression the best values of the parameters are given by the following
rather intimidating formulas:
B = \frac{n\left(\sum x_i y_i\right) - \left(\sum x_i\right)\left(\sum y_i\right)}{n\sum x_i^2 - \left(\sum x_i\right)^2} \qquad \text{and} \qquad A = \frac{\left(\sum y_i\right) - B\left(\sum x_i\right)}{n}
The key is to calculate, for each pair (x_i, y_i) of data, the values of x_i^2 and x_i y_i, and
then calculate the sums. Fortunately, many calculators have a statistical mode for
performing such a calculation.
Example: Consider the arsenic data from the scattergram in Figure 4. The calculations are set out in the following table; the bottom row gives the column totals.
x_i      y_i      x_i^2    x_i y_i
 2      24.5        4       49.0
 4      18.7       16       74.8
 6      16.3       36       97.8
 8      12.0       64       96.0
10       9.3      100       93.0
12       6.2      144       74.4
14       2.5      196       35.0
------------------------------------
56      89.5      560      520.0
Substituting the column totals into the formulas gives

B = \frac{7 \times 520 - 56 \times 89.5}{7 \times 560 - 56^2} = -1.75 \qquad \text{and} \qquad A = \frac{89.5 - (-1.75) \times 56}{7} = 26.79

This means that the best straight-line fit is y = 26.79 − 1.75x. Figure 6 shows this fit.
Figure 6. The straight-line fit y = 26.79 − 1.75x plotted over the data.
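The regression formulas are easy to automate. Here is a minimal Python sketch that recomputes B and A for the arsenic data in the table above (a statistical calculator or spreadsheet will do the same job):

# Arsenic data: x = distance (km), y = arsenic level
xs = [2, 4, 6, 8, 10, 12, 14]
ys = [24.5, 18.7, 16.3, 12.0, 9.3, 6.2, 2.5]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

B = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
A = (sum_y - B * sum_x) / n

print(A, B)   # approximately 26.79 and -1.75, i.e. y = 26.79 - 1.75x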
While linear relationships are the most common, there are many other possible
relationships. For example, the data in the scattergram in Figure 7 suggests a parabolic
fit. To find the best fit, one can use regression, but the actual formulas are beyond the
scope of this book.
The process of regression is guaranteed to give a fit. For example, one can fit a line
to the data in Figure 7, but this is clearly inappropriate. The best fit is not necessarily
a good fit. Testing whether a fit does in fact reveal a pattern in the data is discussed
in Section 11.1.
2. For each of the measures defined in the text, find a situation (a) where the measure
is a good summary of the data and (b) one where it is bad.
4. A geographer measures the relationship between the population size of a city and
the number of cinemas. Data is
population (ten thousands) 1 5 22 23 40 67
cinemas 1 1 10 10 17 28
(a) Plot the data on a scattergram.
(b) Find the best straight-line fit.
(c) Use the line to estimate the number of cinemas for a city of 500 000 inhabitants.
9.1 PROBABILITY
The probability of an event is the proportion of time the event can be expected to
occur over the long run. For a simple example, consider tossing a coin. If one repeats
this 1 000 times, and heads comes up 300 times, then it appears that the probability
of heads is 3/10. (Could a coin act this way? Think about it.)
For example, let D denote the result of rolling a fair die. Then the probability of rolling a five is

\Pr(D = 5) = \frac{1}{6}
We need some definitions: The rolling of the die is an experiment. (In statistics,
an experiment is any action where the answer is not predetermined.) The experiment
has a number of outcomes, and associated with each outcome is a probability. For a
repeatable experiment, the probability of a specific outcome amounts to the proportion
of time the outcome is likely to occur. Proportions always lie between 0 and 1, so that
if something occurs 40% of the time, then its probability is 0.4. A probability of 0
means that the event never occurs, a probability of 1 means that it always occurs.
If one rolls a die 600 times, it is very very unlikely that one will get each number
exactly 100 times. However, if some numbers seem to come up a lot more than others,
then one would suspect that the die was crooked (not fair). A simulation the authors
ran produced the following frequencies of the numbers from 1 to 6: 86, 112, 122, 95,
98, 87. What do you think about the die?
The question of when to conclude that a die is crooked is precisely the type of
question that statistics is designed for. And the short answer is: if the probability of a
result as extreme is sufficiently small, then one may conclude that the die is crooked.
Often one has an experiment with just two outcomes, such as tossing a coin. These
outcomes are often referred to as success and failure, and the experiment as a trial.
To obtain useful data one repeats the experiment several times and counts the number
of successes.
If there are n trials, each with probability p of success, then the probability of exactly k successes is given by the binomial distribution:

\Pr(S = k) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}

where S is the number of successes, n the number of trials, k the number of successes being considered, and p the probability of success on a single trial.
Figure 8. An example of a binomial probability distribution (probability against outcome).
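The probability formula above can be evaluated directly. A minimal Python sketch (the values of n, p and k here are made up, purely for illustration):

from math import comb

def binomial_prob(n, k, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# For example: 10 tosses of a fair coin, probability of exactly 6 heads
print(binomial_prob(10, 6, 0.5))   # about 0.205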
We have made one assumption which one must be explicit about: the success or
failure of one trial does not affect the (probability of ) success or failure of another
trial. Statisticians would say the trials must be independent events. If the trials are
dependent, i.e. not independent, then the above formula is not valid. If I toss one
coin and then another coin, the results are independent. If I measure the height and
mass of one child, the results are dependent.
There are other useful probability distributions, including the normal, exponential,
geometric, and Poisson distributions. Of these, the normal distribution is the most
important.
The most common probability distribution for a continuous random variable is the
normal distribution. Figure 9 shows the smooth bell-shaped curve of the normal
distribution.
Figure 9. The normal distribution: about 68% of the data lies between µ − σ and µ + σ.
A normal distribution has two parameters: its mean and standard deviation. The
distribution with µ = 0 and σ = 1 is called the standard normal distribution, and the
associated random variable is often denoted by Z. Many natural occurring things have
a normal distribution, including heights of people, speeds of ostriches, and birth-mass
of babies.
A table of the standard normal distribution is given in Appendix B. This gives the
proportion of the population that will be less than or equal to z for various values of z.
For example, suppose that the heights of South African men had a mean of 1.77
m and a standard deviation of 4 cm, and one wanted to know what proportion is less
than or equal to 1.87 m tall. The method comprises two steps: first convert the value to a
z-score using z = (x − µ)/σ, and then look the z-value up in the table.
For this example, z = (187 − 177)/4 = 2.5. Now Appendix B is consulted. The proportion is
0.994.
If you need to find what proportion is greater than z, simply use 1 minus the
proportion less than z. For example, the proportion of men that are more than 1.87m
tall is 1 − 0.994 = 0.006 or 0.6%.
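Instead of the table in Appendix B, one can use a computer. The following Python sketch reproduces the height example (NormalDist is in the standard library):

from statistics import NormalDist

heights = NormalDist(mu=177, sigma=4)   # work in centimetres

p_below = heights.cdf(187)   # proportion at most 1.87 m tall
print(p_below)               # about 0.994
print(1 - p_below)           # proportion taller than 1.87 m, about 0.006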
The value α (the Greek letter alpha) denotes the probability of error (that is, the
measure lying outside the interval). So the probability of being correct is 1 − α. For
example, for a 95% interval α is 0.05 (5%) and for a 99% interval α = 0.01 (1%).
Confidence intervals are (usually) constructed such that the probability of error is
shared equally between the two sides: the probability that the actual measure is above
the interval is α/2 and the probability that it is below the interval is α/2.
In this book we discuss the problem of estimating the population mean. The sample
mean is denoted by x̄ and is defined as the average of the n sample values:
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
The sample standard deviation is denoted by s and is defined by

s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

This is similar to the formula for the population standard deviation σ except that there
is the value n − 1 in the denominator rather than n. (There is a deep statistical reason
for this difference which we do not explore here.) There is an alternative formula which
is easier for calculations:
s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}{n-1}}
For example, a sample of 6 values with \bar{x} = 11 and \sum x_i^2 = 782 gives

s = \sqrt{\frac{782 - 6 \times 11^2}{5}} = \sqrt{\frac{56}{5}} = 3.35
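The two formulas for s give the same answer; a short Python check on some made-up sample data:

import math
import statistics

sample = [4.2, 5.1, 6.0, 7.3, 8.8]   # hypothetical sample data

n = len(sample)
xbar = statistics.mean(sample)

# Shortcut formula
s_shortcut = math.sqrt((sum(x * x for x in sample) - n * xbar ** 2) / (n - 1))

# Library version of the defining formula (divides by n - 1)
s_library = statistics.stdev(sample)

print(s_shortcut, s_library)   # the two agree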
In this section we discuss two common situations where one can provide confidence
intervals for the mean.
The large sample method can be reliably applied if there are at least 30 data values.
Given some choice of α, a 1 − α confidence interval for µ is given by:
\bar{x} - z \times \frac{s}{\sqrt{n}} < \mu < \bar{x} + z \times \frac{s}{\sqrt{n}}
In this expression, the value z is obtained from the table in Appendix C by looking up
the entry in the two-tailed column corresponding to the chosen value of the error α.
(The two-tailed values are always used for confidence intervals.)
For example, suppose one wanted to be 95% sure that the value of µ lay within the
interval. Then choose α to be 5% (i.e. 0.05), and the value 1.96 is obtained for z in the
column corresponding to 95% and two tailed. (Recall that one divides the chance of
error equally on both sides or tails, so that there is a 2.5% chance of the actual value
of µ being smaller than the calculated lower bound, and a 2.5% chance of the actual
value of µ being larger than the calculated upper bound.)
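As a minimal sketch of the large sample method in Python (the sample statistics here are made up; the z-value is obtained from the standard normal distribution rather than from Appendix C):

from math import sqrt
from statistics import NormalDist

# Hypothetical sample statistics (n >= 30 for the large sample method)
n, xbar, s = 40, 12.4, 2.3
alpha = 0.05                                # for a 95% confidence interval

z = NormalDist().inv_cdf(1 - alpha / 2)     # two-tailed z-value, about 1.96
half_width = z * s / sqrt(n)

print(xbar - half_width, xbar + half_width) # the 1 - alpha confidence interval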
One can estimate the population mean from a small sample if one knows that the
underlying population has a normal distribution, or one has good reason to believe
this. In this case the sample mean is related to Student’s t distribution (so-called
because the discoverer Gosset published the work under the pseudonym ‘Student’).
The process of estimation is the same as the large sample method, except that one
looks up values in the t-table given in Appendix D rather than the z-table.
The t distribution has a bell-shaped curve. It has one parameter df, known
as its degrees of freedom. For small df the curve is slightly flatter, with heavier tails,
than the normal distribution, but for large df the two curves are indistinguishable.
\bar{x} - t \times \frac{s}{\sqrt{n}} < \mu < \bar{x} + t \times \frac{s}{\sqrt{n}}
where the t-value is that for df =n − 1 degrees of freedom. The value t is found in
the t-table on the row for n − 1 degrees of freedom in the appropriate column for the
α-value required (again, the two-tailed values are used).
Example revisited : In the previous example we used the large sample method
to determine a 99% confidence interval for mangrove salinity. Say our biolo-
gist could only obtain 20 leaves, and again obtained a sample mean of 5.39,
but a sample standard deviation of 1.84. Since he believes that the salinity
has a normal distribution, he uses the small sample method. Degrees of
freedom df = 20 − 1 = 19. α = 0.01 (1% chance of error). Value in table is
2.861. So the confidence interval is
\left[5.39 - 2.861 \times \frac{1.84}{\sqrt{20}}, \; 5.39 + 2.861 \times \frac{1.84}{\sqrt{20}}\right] = [4.21, 6.57]
The above example shows that a confidence interval can be narrowed by taking n
larger (think about why). This means: the larger the sample, the better the estimate.
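A minimal Python sketch of the small sample calculation for the mangrove example (the t-value 2.861 is taken from the table for df = 19 and α = 0.01, two-tailed, since the standard library has no t-table):

from math import sqrt

n, xbar, s = 20, 5.39, 1.84
t = 2.861                    # Appendix D: df = 19, alpha = 0.01, two-tailed

half_width = t * s / sqrt(n)
print(xbar - half_width, xbar + half_width)   # about [4.21, 6.57]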
1. If you roll two honest dice, what is the probability of getting double-sixes? Any
double? An odd number for the total?
2. From the population of the digits 0 through 9 take a random sample of 3 digits.
Calculate µ, σ, and s.
3. Say the size of king protea flowers is normally distributed with mean 127 mm and
standard deviation 13 mm. What proportion is more than 140 mm?
4. Find a 95% confidence interval for the population mean from the sample data
7.2, 9.3, 10.2, 11.4, 14.8, 16, 10.3, 11.4, 12.3, 9.9, 8
5. Find a 99% confidence interval for the population mean if the sample size is 50,
the sample mean is 43.41, and the sample standard deviation is 2.61.
10 Testing Hypotheses
10.1.1 Hypotheses
Any statistical test revolves around the choice between two hypotheses. These are
labelled H0 and H1 . H0 is often called the null hypothesis.
We asked in a previous chapter: how would one decide if a coin was fair? In this
case the null hypothesis H0 would be that the coin is fair. The alternative H1 is that
it is biased.
10.1.2 Errors
Consider the following example: A biotechnology firm develops a new drug against
tuberculosis. To test the drug, they administer the drug to some patients and a placebo
to others. They then compare the results to determine whether the drug is effective or
not. Suppose H0 is that the drug has no effect, H1 that the drug is effective.
There are two possible ways the answer could be wrong. One possibility is that one
concludes that the drug is effective when in fact it has no effect. The other possibility
is that one concludes that the drug has no effect when in fact it does. These are known
as Type I and Type II errors respectively.
Ideally, one would like neither error to occur, but this is impossible. The smaller the
risk of one type of error, the greater the risk of the other type. If one is not prepared to
accept any risk of a Type I error, then one must accept a high risk of a Type II error,
and vice versa. (Think about why this should be so.)
66
In a hypothesis test, one takes as H0 that situation which one doesn’t mind as
much being accepted as true when it is in fact false, and takes an alternative as H1 .
The benefit of the doubt goes to the hypothesis H0 . In other words, one focuses on the
probability of a Type I error. This probability is denoted by α. The probability of a
Type II error is denoted by β (beta).
              H0 true         H1 true
H0 accepted   okay            Type II error
H1 accepted   Type I error    okay
An agricultural researcher wishes to test a new fertiliser, and see if it increases maize
yield. There are four possible hypotheses:
H0: the fertiliser does not change the maize yield;
H1a: the fertiliser changes the maize yield (in either direction);
H1b: the fertiliser increases the maize yield;
H1c: the fertiliser decreases the maize yield.
When the test is one-tailed, the null hypothesis must be adjusted accordingly. In
our example above, if H1b is used then H0 is that the fertiliser does not increase maize
yield. Similarly if H1c is used then H0 is that the fertiliser does not decrease maize
yield. In all cases H0 is the opposite of H1 —that is, any situation not covered by H1 is
covered by H0 . Figures 10a, 10b and 10c show this diagrammatically.
Figures 10a, 10b and 10c. The regions covered by H0 and by H1a, H1b and H1c respectively.
A hypothesis test is based on the probability of sample data being as extreme as the
data encountered, assuming that H0 is true. This is α, the probability of Type I error.
The researcher must choose a level of significance—an upper bound for α. Common
levels are 5% and 1%; for a test at the 5% level of significance there is at most a 5%
chance of Type I error. In reporting the results of a hypothesis test one must state the
level of significance.
3. Formulate H0 and H1 .
4. Determine whether the test is one-tailed or two-tailed. If it is one-tailed,
determine whether it is an upper- or a lower-tail test.
5. Calculate the test statistic.
6. Based on the chosen level for α, compare the test statistic with a value
from a table (the critical value).
7. Conclude: Either accept or reject H1 .
One takes a sample and has a value in mind for the population mean µ. Then the
question is, does the sample x̄ contradict this value significantly? This is similar to the
estimation of the mean from the sample mean described in the previous chapter.
The large sample test can be used if the sample size is greater than 30; the small
sample test can be used if the sample comes from a normal distribution (regardless of
sample size).
In a test, the H1 hypothesis is that the population mean is either different from
(two-tailed), greater than (one-tailed upper tail) or less than (one-tailed lower tail) a
particular value. H0 is the complement of this. That is,
H0: µ = c, H1: µ ≠ c;
H0: µ ≤ c, H1: µ > c; or
H0: µ ≥ c, H1: µ < c.
The sample statistics are denoted by n, x̄ and s. The large sample test uses the
following test statistic:
z = \frac{\bar{x} - \mu}{s}\sqrt{n}
The test statistic calculated is compared to a value from the z-table. The small sample
test uses the following test statistic:
t = \frac{\bar{x} - \mu}{s}\sqrt{n}
The test statistic calculated is compared to a value from the t-table with df =n − 1.
Example: A sample of 20 rugby players is taken (not randomly, but
forming a cluster) and tested for I.Q. The sample mean is 102.7 and the
sample standard deviation 14.8. The mean I.Q. of the general population is
100. Are rugby players' I.Q.s higher?
Procedure: The test is the small sample t-test. H0 is that rugby players
have normal or below normal I.Q. H1 is that rugby players have higher
than normal I.Q. This is a one-tailed upper tail test. The calculated test
statistic is t = 0.816. The degrees of freedom df = 19. Value in the table
is 1.73 for 5% level of significance. As 0.816 < 1.73, at 5% significance level
one cannot conclude that rugby players have higher I.Q.s.
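A rough Python sketch of this test statistic, assuming (as the example does) a general population mean I.Q. of 100 and using the tabled critical value:

from math import sqrt

n, xbar, s = 20, 102.7, 14.8
mu0 = 100                       # hypothesized population mean

t = (xbar - mu0) / s * sqrt(n)  # about 0.816
critical = 1.73                 # Appendix D: df = 19, 5% level, one-tailed

print(t, t > critical)          # False: cannot conclude higher I.Q.s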
These hypothesis tests of the population mean are linked to the estimation of the population
mean discussed in Section 9.4. For example, in a two-sided test, H0 is rejected if and
only if the hypothesized value c lies outside the confidence interval (for the same α).
Related sample pairs can be formed in two ways. One is where two sets of measure-
ments are compared (for example, the results of ‘before and after’ tests on the weights of
mice). The other is where the data is gathered from measurements of matched samples
(see Section 5.4).
To perform the test on pairs (ai , bi ), one calculates the difference data di = ai − bi .
This difference data is itself a sample. For a two-tailed test, the null hypothesis H0
is that the two population means are equal. This is the same as testing whether the
population mean µD of the difference data is zero, i.e. H0 is that µD = 0. Hypothesis
H1 is that µD is non-zero. We use the difference data for a t-test. The sample statistics
for the difference data are denoted n, x̄d and sd . Note that for matched data n denotes
the number of data pairs.
The t-test for equivalence of related samples uses the following test statistic:
t = \frac{\bar{x}_d}{s_d}\sqrt{n}
Degrees of freedom: df =n − 1.
Calculations: \bar{x}_d = 1.9 and s_d = 3.07, so the test statistic is t = \frac{1.9}{3.07}\sqrt{10} =
1.96. The number of degrees of freedom is df = 9, so the critical values are
1.83 and 2.82. So one can conclude an improvement at the 5% level of
significance but not at the 1% level of significance. (The educator can be
95% certain that H1 is true, but not 99% certain.)
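A minimal Python sketch of the related-samples t-test, computed from 'before and after' pairs. The data below is made up purely for illustration; the summary statistics of the example above can be plugged into the same formula directly.

from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """t statistic for related samples, computed from the difference data."""
    d = [a - b for b, a in zip(before, after)]
    n = len(d)
    return mean(d) / stdev(d) * sqrt(n), n - 1   # (t, degrees of freedom)

# Hypothetical before/after scores for ten subjects
before = [52, 60, 45, 70, 63, 55, 48, 66, 59, 62]
after  = [55, 61, 50, 69, 68, 60, 47, 70, 64, 65]

t, df = paired_t(before, after)
print(t, df)   # compare t with the t-table value for df degrees of freedom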
Notes:
The necessary assumptions for hypothesis tests must be checked. If they are not known
beforehand to be true, this may involve further statistical tests!
1. Explain the meaning of: hypothesis, null hypothesis, tail, level of significance,
hypothesis test, test statistic, critical value.
2. In drug testing the null hypothesis is normally taken that the drug has no effect
and the alternative is that the drug is effective.
(a) What are Type I and Type II errors?
(b) Which is preferable?
(c) What about a disease such as HIV/Aids which is rampant and for which no
cure has been found yet? Should one deny people the drug just because one is
not totally convinced that it is effective? Discuss.
3. The sample data of Question 4 from Chapter 9 is believed to come from a popu-
lation with mean 12.8. Test the hypothesis that:
(a) this is not true; and
(b) the actual mean is smaller than 12.8.
4. If we always use a 0.05 level of significance, does this mean that on average 1 out
of 20 conclusions will be wrong?
In Section 8.3 we described how to find the best line fit for paired data. Determining
whether there is in fact a linear relationship requires another hypothesis test. Pearson’s
product-moment coefficient of linear correlation is calculated by the formula:
r = \frac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\;\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}
(Wow!) This parameter lies between −1 and 1. A value of 1 indicates a perfect linear
dependence with positive slope. (An increase in the value of variable X is associated
with a proportionate increase in the value of variable Y .) A value of −1 indicates a
perfect linear dependence with negative slope. (An increase in the value of variable X
is associated with a proportionate decrease in the value of variable Y .) A value of 0 or
thereabouts says very little.
In this test the hypothesis H1 is that there is a linear relationship between the two
variables: either any linear relationship (two-tailed), an increasing one (upper tail) or a
decreasing one (lower tail). The hypothesis H0 is, as usual, the complement of H1. The
test statistic is the Pearson coefficient of linear correlation r. The test statistic is
compared with the critical values obtained from the table in Appendix E, with n − 2
degrees of freedom (where n is the number of data pairs), for the desired value of α.
Example: Recall the chemical engineer's arsenic data:

x (distance in km)      2     4     6     8    10    12    14
y (arsenic in mg/kl)  24.5  18.7  16.3  12.0   9.3   6.2   2.5
Procedure: This is a one-tailed lower tail test with H0 : arsenic levels do
not decrease linearly with distance from factory, and H1 : arsenic levels do
decrease linearly with distance from the factory.
Calculations give n = 7, Σ x_i = 56, Σ y_i = 89.5, Σ x_i y_i = 520.0, Σ x_i^2 = 560
and Σ y_i^2 = 1490.81. Hence
r = \frac{7 \times 520 - 56 \times 89.5}{\sqrt{7 \times 560 - 56^2}\;\sqrt{7 \times 1490.81 - 89.5^2}} = -0.995
The df =7 − 2 = 5. Table values are 0.669 (5% level) and 0.833 (1% level).
So the engineer can accept H1 at the 1% level (he can be 99% certain that
H1 is true).
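A short Python check of the engineer's calculation of r (the same arsenic data as before):

from math import sqrt

xs = [2, 4, 6, 8, 10, 12, 14]
ys = [24.5, 18.7, 16.3, 12.0, 9.3, 6.2, 2.5]

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sx2 = sum(x * x for x in xs)
sy2 = sum(y * y for y in ys)

r = (n * sxy - sx * sy) / (sqrt(n * sx2 - sx ** 2) * sqrt(n * sy2 - sy ** 2))
print(r)   # about -0.995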
It is important to note that the existence of high linear correlation between two variables
does not necessarily imply a cause-and-effect relationship between the two variables. It
is likely that ownership of motor vehicles and ownership of television sets have a high
correlation. This does not mean that owning a car causes someone to own a television
set. Some other factor could be at work, e.g. wealth.
Partial correlation tries to remove the effect of the third variable and thereby find
the 'true' relationship between the first and second variables. The formula for partial
correlation is:

r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{1 - r_{13}^2}\;\sqrt{1 - r_{23}^2}}
where:
r12.3 is the partial correlation between variables 1 and 2 with the effect of
variable 3 removed;
r12 is the correlation between variables 1 and 2;
r13 is the correlation between variables 1 and 3;
r23 is the correlation between variables 2 and 3;
and correlation is measured by the Pearson coefficient of linear correlation (see previous
section).
To continue our example, say a researcher found that owning TVs (variable
1) had a correlation of 0.7 with owning cars (variable 2). It was also found
that the correlation of TV ownership with wealth (variable 3, which the
researcher defined as assets over R50 000) was 0.8, and the correlation of
car ownership with wealth was 0.9. The formula gives:
r_{12.3} = \frac{0.7 - 0.8 \times 0.9}{\sqrt{1 - 0.8^2}\;\sqrt{1 - 0.9^2}} = -0.08
In other words, the researcher finds almost no correlation at all (−0.08 is
a tiny negative correlation) between TV ownership and car ownership once
the effect of wealth has been removed.
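The partial correlation formula is equally simple to evaluate; a Python sketch using the TV/car/wealth correlations from the example:

from math import sqrt

def partial_corr(r12, r13, r23):
    """Correlation between variables 1 and 2 with variable 3 partialled out."""
    return (r12 - r13 * r23) / (sqrt(1 - r13 ** 2) * sqrt(1 - r23 ** 2))

print(partial_corr(0.7, 0.8, 0.9))   # about -0.08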
In this section we describe the chi-square test for dependence of two qualitative vari-
ables. For example, the colour of a particular chemical solution can be red, orange,
purple or blue and its acidity can be categorised as high, medium or low. We wish to
know if colour and acidity are related.
Say the first qualitative variable divides the entire population up into R classes and
the second divides the entire population up into C classes. In our example above R = 4
and C = 3. To apply the chi-square test, one takes a sample of the population, and
categorises each item according to both variables. The sample items are divided into
R classes according to the first variable and into C classes according to the second. The
resulting R-by-C table of counts is called the (observed) contingency table.
Alongside that table, one constructs the expected contingency table which
would result if the two variables were independent. To calculate the entry eij in row i
and column j of the expected contingency table, one multiplies the total in row i by the
total in column j and divides the result by the overall total. The expected contingency
table for our example is given in Table 6.
In the test, the hypothesis H0 is that the two variables are independent and H1 is
that they are dependent. The test statistic is:
\chi^2 = \sum \frac{(o_{ij} - e_{ij})^2}{e_{ij}}
where oij denotes the observed frequency in row i, column j, and eij the expected
frequency in row i, column j, and the summation is over all entries in the contingency
table.
For the above data, one calculates the contribution of each entry to the test statistic χ².
For example: the entry 2.670 is calculated by (5 − 10.22)²/10.22.
One needs to compare this with the chi-square tables. The degrees of freedom for a
χ2 test are given by (R − 1)(C − 1). In our example above, which is a 4-by-3 table, this
is df = (4 − 1)(3 − 1) = 6. The concept of a lower tail does not exist for the χ2 test
statistic as the statistic is always positive. As all the error is thus in the upper tail, a
two-tail test becomes an upper-tailed test.
The null hypothesis H0 is that colour and acidity are independent, and H1 is that
colour and acidity are dependent. The test statistic χ2 = 13.546 with df = 2 × 3 = 6.
So critical values are 12.59 (5%) and 16.81 (1%). Accept at the 5% level of significance
but not at the 1% level of significance.
There are limitations on the use of this test. An important rule of thumb is that
every entry in the expected contingency table must be at least 5.
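The bookkeeping of the chi-square test is easily automated. A minimal Python sketch for a small 2-by-3 contingency table; the observed counts here are invented purely to show the mechanics:

# Hypothetical observed contingency table (rows x columns)
observed = [
    [12, 18, 20],
    [28, 22, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# Expected table: (row total x column total) / grand total
expected = [[rt * ct / grand for ct in col_totals] for rt in row_totals]

chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(chi2, df)   # compare chi2 with the chi-square table at df degrees of freedom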
A version of the chi-square test can be used where one has frequency data and wishes
to check whether this conforms with a theorized distribution. For example, one might
wish to test whether data comes from a random distribution or not.
There is one qualitative variable. Say this divides the population into M classes.
The data gives the observed frequencies oi and the theorized distribution gives the
expected frequencies ei . Then the test statistic is given by a similar formula to the one
above:
\chi^2 = \sum \frac{(o_i - e_i)^2}{e_i}
Example: The crooked die story completed. Recall that we rolled a (simu-
lated) die 600 times and got 86, 112, 122, 95, 98, 87. Is this die crooked?
Procedure: Hypothesis H0 is that the die is fair, hypothesis H1 that it is
crooked. Expected values are 100 each. The test has df = 6 − 1 = 5 degrees of freedom.
The test statistic is

\chi^2 = \frac{(-14)^2}{100} + \frac{12^2}{100} + \frac{22^2}{100} + \frac{(-5)^2}{100} + \frac{(-2)^2}{100} + \frac{(-13)^2}{100} = 10.22

Testing at the 5% level, we find that the table value is 11.07.
So we cannot reject the hypothesis that the die is fair.
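As a check, the same calculation can be done in a few lines of Python:

observed = [86, 112, 122, 95, 98, 87]   # frequencies of the faces 1 to 6
expected = [100] * 6                    # a fair die rolled 600 times

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)   # 10.22, below the 5% critical value 11.07 for df = 5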
In general, the chi-square test is a test of dependence. However, in the special case where
df = 1, one can test for a one-tailed hypothesis. One case is a 2-by-2 contingency table.
For example, one may be testing a new drug with two categories: drug and placebo, and
with two results: recover or not recover. One can use a chi-square test to test whether
the drug is effective or not. Another case is illustrated in the next example.
Another situation one may encounter is where one has two independent samples from
different populations but wonders whether the means are the same.
We describe two tests. These are the large sample (z) test and the small sample
(t) test. The t-test can only be applied if both samples come from normal populations,
and the standard deviations of the normal populations are similar. (The equality of
standard deviations can be checked by using what is known as the F -test.)
In both tests there are two samples which we denote by A and B. The null hypoth-
esis H0 in a two-tailed test is that the two population means (which we denote by µA
and µB ) are equal. The sample statistics for A are denoted by nA , x̄A and sA , and for
B are denoted by nB , x̄B and sB .
The z-test for equality of means (large samples) uses the test statistic:
z = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}
The t-test for equality of means (normal populations) uses the test statistic:
t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}} \sqrt{\frac{n_A n_B (n_A + n_B - 2)}{n_A + n_B}}

with df = n_A + n_B − 2 degrees of freedom.
Large sample example: A large software house uses the languages C and
Natural for programming. They wish to compare the development time
as they believe that Natural is faster to program in. They know that re-
cently they have completed 42 projects in Natural and 30 in C. The sample
statistics in person-months are x̄N = 3.3, sN = 1.1, x̄C = 4.0 and sC = 0.8.
Procedure: The test is the two-sample z-test. H0 is that Natural is the same
or worse than C, and H1 that Natural is better. This is a one-tailed lower
tail test. Test statistic: z = −3.13. Critical values −1.64 (5%) and −2.33
(1%). As the test statistic is less than both critical values, one can conclude
with 99% certainty that Natural is quicker to develop.
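A minimal Python sketch of the two-sample z-test for the software house example:

from math import sqrt

# Sample statistics: Natural (A) and C (B)
nA, xbarA, sA = 42, 3.3, 1.1
nB, xbarB, sB = 30, 4.0, 0.8

z = (xbarA - xbarB) / sqrt(sA ** 2 / nA + sB ** 2 / nB)
print(z)   # about -3.13; beyond the 1% one-tailed critical value of -2.33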
3. Pick 10 number pairs at random. Each number in each pair should be a whole
number in the range 1 to 10. Is there a linear correlation among the number pairs
chosen?
4. The drug company finally does the trial on a new treatment for tuberculosis.
There are 32 patients who complete the trial. Of the 18 patients on placebo, 8
recover and 10 do not. Of the 15 patients on the drug, 12 recover and 3 do not.
(a) Is there any difference between the effectiveness of the drug and that of the
placebo?
(b) What does your result in (a) tell you about the drug?
8. Two samples from different normal populations yield the following sample statis-
tics: nA = 6, x̄A = 100, sA = 10, nB = 8, x̄B = 80, and sB = 9. Test for
inequality of population means.