Statistical Reasoning

Newborn Services, The Royal Women's Hospital, Melbourne, and the Departments of Obstetrics and Gynaecology and Paediatrics, The University of Melbourne, Parkville, Victoria, Australia
Correspondence: Associate Professor LW Doyle, Department of Obstetrics & Gynaecology, The University of Melbourne, Parkville 3010, Victoria, Australia. Fax: (03) 9347 1761; Email: l.doyle@obgyn-rwh.unimelb.edu.au
JB Carlin, PhD, Statistician. LW Doyle, MD, MSc, FRACP, Paediatrician.
Accepted for publication 13 July 2000.

In this article we begin to discuss the techniques of formal statistical analysis or statistical inference, to be distinguished from the descriptive statistical analysis that is involved in obtaining tables of frequencies, scatterplots of data and so on, as described in the previous article of this series.1 This discussion requires an understanding of a number of basic statistical concepts and terms, key among them being the idea of sampling variability. To explain the fundamental role of this concept we need to introduce the notions of population, sample and probability. We will then explain how the concept is used in the form of standard errors and confidence intervals.

POPULATIONS AND SAMPLES

The essential role of formal statistical analysis is to account for the fact that research studies are performed on finite groups of subjects. The group of patients (or other individuals) in a study is regarded as a sample from a larger population. Statistical inference addresses the question of what can be said about the population based just on the sample, allowing for the crucial fact that another sample or samples would not produce identical results.

Sociologists, epidemiologists and others are familiar with the concept of a population, meaning a group of individuals with distinctive characteristics. Individuals can be humans, animals, or other objects; in this series, unless stated otherwise, we will use the term 'individuals' to refer to humans. What makes the individuals distinctive to sociologists or epidemiologists may be that they live in the same country (e.g. Australia), or the same region (e.g. the state of Victoria), or perhaps they are of the same gender or some other subgroup of interest. Since populations in the geographical and sociological world are continually changing, extrapolation from one region or time to another can be hazardous. This is a major problem for the researcher, since it can cause systematic differences between groups (bias; see the previous article in the series1), but it will not be discussed further, except to say that statistical inference is of little value, or can even be misleading, unless a study has been designed to avoid major biases.

In clinical research, the population of interest may be rather hard to define explicitly, but it will usually be some fairly general notion of the 'universe' of all patients of a given type. For example, in the study of long-term outcomes in children of very low birthweight (VLBW) introduced in our previous article,1 the researchers' underlying interest is not just in the particular group of patients that was followed but in the population of all VLBW children of similar 'sociobiological' characteristics to those who were studied. When seeking publication in an international journal, the unspoken assumption is often that the population of interest extends beyond national boundaries, although the precise definition of the population to which researchers seek to generalize is often left unstated.

As with the word population, we all have some familiarity with the concept of a sample, meaning a small amount of a larger amount. In statistical analysis, a sample means a smaller number of individuals taken from a population of interest. For the valid application of most statistical methods, we strictly require that such samples be randomly selected from the populations of interest. In practice, this assumption can be difficult or impossible to sustain, but it is widely agreed that a reasonable substitute is to be able to argue convincingly that one's sample is representative of the population. In other words, we need to be able to think of our study group or sample as if it were a random sample from the population of interest. (Warning: further discussion of this delicate point may descend rapidly into a philosophical quagmire!)

The mechanics of much statistical analysis concern the use of summary statistics obtained from a sample to provide estimates of population values, which are formally called parameters. (This can be confusing to the clinician, who may sometimes describe a measured attribute of a patient, such as blood pressure, as a parameter.) Examples of parameters in the VLBW study are the proportion of children requiring mechanical ventilation, the mean verbal intelligence quotient (IQ) score at age 5 years and the difference in mean verbal IQ score at age 5 years between children of birthweight < 1000 g and those of birthweight 1000–1499 g. In statistical texts, parameters are often symbolized by Greek letters, for example µ (the Greek 'm') and σ (Greek 's') for a mean and standard deviation, respectively, and π (Greek 'p') for a proportion. It is important to have distinctive notation because the crux of statistical
analysis is the fact that parameters, the unattainable 'true values' in the population, are distinct from the corresponding sample values that we use as their estimates. For example, we showed that the sample proportion of VLBW children requiring mechanical ventilation was 0.75 or 75%;1 this is obviously a good estimate of the true population value (under our crucial assumption of random/representative sampling) but it is not equal to it, unless we have been lucky.

If ever a sample were to comprise all of a population of interest there would be no need for statistical inference. This is the case with a census, where sample values and population parameters are identical. It is the nature of research, however, that it seeks to generalize from the 'local result' to a broader target, and this is why statistical methods play such a central role.

STATISTICAL REASONING

Statistical reasoning seems convoluted to the non-statistician, especially when it comes to the use of hypothesis tests and P values. In order to defer some of these complications, we focus in the present article entirely on the reasoning involved in the estimation of population parameters using sample values. From a statistical point of view, such estimation must involve quantification of the precision of estimation, which is captured in the calculation of a standard error and the closely related confidence interval. These calculations simply quantify the extent to which variability from sample to sample (of the same size as the one in your study) could be expected to lead to different estimates of the same parameter. Such quantification of uncertainty requires the language of probability, and some familiarity with the famous normal distribution.

PROBABILITY

Probability forms the basis for statistical inference. Most of us have some understanding of probability, even if we think we don't. Perhaps we learnt something about probability at school. Alternatively, most of us have had the occasional wager, perhaps on events such as a horse race, and governments may be forcing us to a better understanding of probability by their increasing reliance on the gambling dollar as a source of revenue.

If we toss a coin fairly, we know there is an equal chance of a head or a tail. The probability of a head is 0.5, and of a tail is 0.5. The total probability is 1, as it must always be for the sum of probabilities of all possible (mutually exclusive) events. With two coins, things become a little more complex. The different combinations and their probabilities are shown in Table 1. With many coins, the number of possible outcomes increases rapidly. Calculating the probability of particular combinations (e.g. the probability of two heads in 10 tosses) is simplified by using the binomial probability distribution, which will be discussed further in a later article.

Table 1 Possible outcomes of tossing two coins, with their probabilities. It is easy to see from this table that the probabilities of obtaining two heads, one head and no heads are 0.25, 0.5 and 0.25, respectively

Coin 1    Coin 2    Probability
Head      Head      0.25
Head      Tail      0.25
Tail      Head      0.25
Tail      Tail      0.25
Total               1.0

All statistical inference is based on probability models, which propose that we can think of the observed value of a variable (e.g. did this child require ventilation? what was the child's verbal IQ score?) as if it were the result of a random experiment (like a coin toss). Usually this is a fiction based on the underlying assumption, already discussed, of random sampling from a population. To understand statistical inference fully requires an understanding of a number of different probability models or distributions, such as the binomial, mentioned above, and Poisson. The appropriate model to apply in any given analysis is a technical matter beyond the scope of this series. For the purposes of statistical inference, however, a remarkable mathematical fact (the 'Central Limit Theorem') says that many statistical inferences can be created using tools based on the normal probability distribution. For this reason we give a brief review of the important features of the normal distribution.

The normal distribution

Many readers will have a general familiarity with the bell-shaped curve that represents the normal distribution (Figs 1,2). (It is sometimes called the Gaussian distribution to avoid confusion with other meanings of the word normal.) Certain variables are by their nature normally distributed, meaning that if we create a histogram based on a very large sample its shape will approach that of the bell curve. To have this property, a variable needs to have a continuous range of possible values. An example of a variable in our data set that is approximately normally distributed is height at age 5 years (Fig. 2).

An obvious feature of the normal distribution is that it is symmetric, with a mean value in the centre of the distribution and an even spread on both sides of the mean.

Fig. 1 Graph of the normal curve. Values on the y-axis are probability density, meaning that the probability of a z-value falling within any specified range is the area under the curve between the upper and lower values of the range. For example, the probability of a value less than –1 (left-hand shaded region) may be calculated as 0.159; the probability of a value greater than 2 (right-hand shaded region) is 0.023.
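The two-coin probabilities in Table 1 are a special case of the binomial formula, which also handles cases such as the probability of exactly two heads in 10 tosses mentioned above. A minimal sketch in Python (the helper `binom_pmf` is written for this illustration and is not part of the original article):

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials,
    each with success probability p: C(n, k) * p**k * (1-p)**(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two fair coins, as in Table 1: P(2 heads), P(1 head), P(0 heads)
two_coins = [binom_pmf(k, 2, 0.5) for k in (2, 1, 0)]
print(two_coins)                          # [0.25, 0.5, 0.25]

# Probability of exactly two heads in 10 fair tosses
print(round(binom_pmf(2, 10, 0.5), 4))    # 0.0439
```

Enumerating outcomes by hand (Table 1) and the binomial formula agree; the formula simply avoids listing all 2^n combinations.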
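The two shaded tail areas quoted in the Fig. 1 caption can be reproduced from the cumulative distribution function of the standard normal, which can be written in terms of the error function in the Python standard library. A minimal sketch (the name `phi` is simply our shorthand for the standard normal CDF, defined here for the example):

```python
from math import erf, sqrt

def phi(z: float) -> float:
    """Standard normal cumulative distribution function:
    Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Shaded areas in Fig. 1
print(round(phi(-1.0), 3))        # 0.159  P(Z < -1)
print(round(1.0 - phi(2.0), 3))   # 0.023  P(Z > 2)
```

The same values can be read from any printed table of the standard normal distribution.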
SAMPLING DISTRIBUTIONS

CONFIDENCE INTERVALS FOR MEANS

The confidence interval (CI) is the accepted statistical technique for expressing the precision of an estimate. The simplest CI are constructed very directly from the standard error of the estimate. We illustrate this by describing how a CI is created for estimating a population mean. As an example, suppose we wish to present the sample mean IQ score at age 5 years in our study as an estimate of a true population value. (If we were not willing to do this, why would we claim our study, based on these 202 Melbourne children, is of any interest to a national or international readership?) In the 138 children tested at 5 years of age, the observed mean IQ score was 98.7 and the SD was 15.3, giving SEM = 1.3.

A CI for the mean is created by taking the observed mean (the estimate) and in turn adding and subtracting a multiple of the SEM, where the multiple is taken from the standard normal distribution and depends on the level of 'confidence' required in the interval. The conventional level chosen is 95%, which corresponds to using a multiple (normal 'z' value) of 1.96, so that a 95% CI for the mean is given by the range:

x̄ – (1.96 × SEM) to x̄ + (1.96 × SEM).

In our example, for mean IQ, this works out as 96.1–101.3. The interpretation is, rather loosely, that we can be 95% confident that the true population mean lies between 96.1 and 101.3. More precise interpretations depend on whether one takes the 'frequentist' or 'Bayesian' view of statistics; for an introduction to these issues see the text by Motulsky.2 Although the 95% level is conventional in most scientific reports, it should be emphasized that this choice is essentially arbitrary. Some authors have argued for lower confidence levels, which produce narrower intervals (e.g. a 90% CI for the mean would use the 'z' multiplier 1.645 instead of 1.96), since these focus greater attention on parameter values that are supported more strongly by the data.3

Figure 3 gives an example of the graphical representation of means and 95% CI for verbal IQ at age 5 years in each of the two birthweight subgroups, displayed with the dotplot of individual data points for each subgroup that we showed in the previous article in this series.1 The sample sizes, means, SD and SEM for the two birthweight subgroups were as follows: < 1000 g: n = 51, mean 94.7, SD 17.0, SEM 2.4; 1000–1499 g: n = 89, mean 100.2, SD 14.6, SEM 1.5. Note that the CI for the means are narrow relative to the spread of the data points. The CI is wider in the subgroup < 1000 g birthweight for two reasons: the greater variation within this group (higher SD) and its smaller sample size. Figure 3 begs the question 'Can we conclude that verbal IQ in the two birthweight subgroups comes from the same population distribution?' We will discuss how this question might be answered in the next article in the series.

Fig. 3 Mean and 95% confidence interval for mean of verbal intelligence quotient (IQ) at 5 years of age in birthweight subgroups above and below 1000 g, displayed with a dotplot of individual observations in each group.

CONFIDENCE INTERVALS FOR OTHER PARAMETERS

The underlying theory behind the CI for a mean is that the sampling distribution of the estimate (in this case, the sample mean) is normal, and this fact holds true for a large number of other sample statistics that are used for estimating population parameters of interest. A proportion is in fact a particular type of mean (an average of '0's and '1's), and the same method works for constructing CI, with large samples, except that a different formula is required for the standard error. The principle of using standard errors to construct CI also extends to making comparisons, for example examining the difference between the mean IQ in the two subgroups considered above. Finally, it is important to remember that the methods discussed in this article are based on 'large-sample theory' and generally need to be modified for smaller samples. We will return to many of these issues in later articles in the series.

Our next task, however, is to relate the ideas of sampling variability and CI to the well-known statistical method of hypothesis or significance testing. We will tackle this in the next article, which considers in more detail the comparison of continuous distributions between two groups, using the t-test.

REFERENCES

1 Carlin JB, Doyle LW. Statistics for clinicians. 2: Describing and displaying data. J. Paediatr. Child Health 2000; 36: 270–4.
2 Motulsky H. Intuitive Biostatistics. Oxford University Press, New York, 1995.
3 Tukey JW. Tightening the clinical trial. Control Clin. Trials 1993; 14: 266–85.
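The 95% CI quoted above for mean IQ can be verified directly from the summary statistics given in the text (n = 138, mean 98.7, SD 15.3), using SEM = SD/√n and the multiplier 1.96. A minimal sketch in Python (the helper `mean_ci` is written for this illustration only, not part of the original analysis):

```python
from math import sqrt

def mean_ci(mean: float, sd: float, n: int, z: float = 1.96):
    """Large-sample CI for a mean: mean +/- z * SEM, where SEM = SD / sqrt(n)."""
    sem = sd / sqrt(n)
    return mean - z * sem, mean + z * sem

# Verbal IQ at age 5 years: n = 138, mean 98.7, SD 15.3 (values from the text)
lo, hi = mean_ci(98.7, 15.3, 138)
print(round(lo, 1), round(hi, 1))   # 96.1 101.3

# A 90% CI uses z = 1.645 and is correspondingly narrower
lo90, hi90 = mean_ci(98.7, 15.3, 138, z=1.645)
print((hi90 - lo90) < (hi - lo))    # True
```

The same function reproduces the subgroup intervals shown in Fig. 3 when supplied with the subgroup summaries quoted above.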