Lecture 1 NOTES Variables and Distributions 2020-21
Objectives
After completing this session, students should be able to:
1/ Identify the variables in a study, their types (binary, (ordered) categorical,
quantitative), and whether they are explanatory or response variables.
2/ Display and interpret distributions of variables in frequency and cumulative
frequency tables, histograms and cumulative frequency curves.
3/ Construct and interpret two-way frequency tables comparing distributions.
4/ Construct and interpret a summary of a distribution of a quantitative variable using
the mean, standard deviation, median, quartiles, and other centiles.
5/ Distinguish a skewed from a symmetric distribution; use logarithms to make a skewed
distribution more symmetric; calculate a geometric mean of such a distribution.
INTRODUCTION
We will look at a simple example first, then refer back to it to introduce some key concepts.
Example 1:-
A group of 2266 Canadian newborns is classified by birthweight and by whether the
mother had to lift as part of her job (yes or no). Interest is in whether babies born to
mothers who had to lift at work are likely to be smaller. The data from the study
might be held in a computer file looking something like this:
Table 1
Subject ID    Birthweight    Lifting at work
number        (g)            (0 = no, 1 = yes)
1             3300           0
2             4100           0
3             3970           1
4             3840           0
…………
…………
2265          4000           0
2266          3300           0
The purpose of this lecture is to introduce concepts and methods to help you produce and
interpret summaries of these and similar data to answer questions like that posed above.
SUBJECTS AND VARIABLES
In example 1, and in most health studies, information (data) was collected on the
characteristics of persons included in the study (study subjects). In example 1, these
characteristics are birthweight and mother lifting at work. Other examples of characteristics
are sex, environmental exposures, treatments, blood pressure, and experience of disease. We
call the characteristics variables, because the value a variable takes (for example birthweight
in grams, whether lifted at work) varies from person to person.
In Table 1, each row represents a subject, and each column a variable. This is the way data
are usually kept by computers.
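As a rough illustration of this layout, the sketch below holds the first four rows of Table 1 as a list of records, one per subject. The field names (subject_id, birthweight_g, lifting) are assumptions made for this example; they are not names used in the study.

```python
# First four subjects from Table 1; the field names are illustrative assumptions.
subjects = [
    {"subject_id": 1, "birthweight_g": 3300, "lifting": 0},
    {"subject_id": 2, "birthweight_g": 4100, "lifting": 0},
    {"subject_id": 3, "birthweight_g": 3970, "lifting": 1},
    {"subject_id": 4, "birthweight_g": 3840, "lifting": 0},
]

# A variable is one column: the same field read across all subjects.
birthweights = [row["birthweight_g"] for row in subjects]
lifting = [row["lifting"] for row in subjects]

print(birthweights)  # [3300, 4100, 3970, 3840]
print(lifting)       # [0, 0, 1, 0]
```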
In this example, and often, study subjects are persons. However, sometimes, rather than
measuring characteristics (variables) for persons, we measure them for other “units”, for
example households, towns, hospitals, or mosquitoes. In statistical terminology, we still call
these units subjects. A value of each variable is needed for each subject.
Types of variable
Variables can also be classified by the types of values they can take. The main ones are:
Qualitative
o Binary (dichotomous) variables, where the values are two different
categories; for example, sex, or vaccination status for a particular vaccine, or
being of low birthweight or not.
o Categorical variables, taking as values several different categories that are
distinct from each other; for example marital status (never married, married,
widowed, divorced), or blood group.
o Ordered categorical variables, for which the different categories are
ordered on some scale; for example, severity of disease (mild, moderate,
severe) or a disability score.
Quantitative (numerical) variables, where some quantity is measured on a well-defined
scale with units; for example, weight, blood pressure, number of episodes of asthma in a
fixed period.
Another example: A randomised controlled trial for a new drug for the treatment of
hypertension. The response (outcome) is blood pressure or change in blood pressure. The
principal explanatory variable is whether a subject is assigned to the new drug or control.
Question 1
What would be the response and explanatory variables in the following studies?
1a A questionnaire survey of whooping cough vaccination status in a sample of boys and
girls, aiming to identify if socio-economic or ethnic status determined whether children were
vaccinated.
1b A study of the occurrence of whooping cough in children, aiming to identify how effective
vaccination was at preventing whooping cough.
Question 2 Write down the types of the variables you identified in question 1.
each value of the variable (here NO and YES). For quantitative variables such tables are
seldom helpful, unless the number of observations is quite small. It is more useful to group
the values taken by the variable and to report the numbers and the frequencies (or percentage
frequencies) of subjects in each group. We show a grouped frequency table for the
distribution of birthweight in babies born to women who did not lift during pregnancy:
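Table 3. The distribution of birthweight for 1310 women who did not lift at work
----------------------------------------------------------
Birthweight    No. of women    Percentage
(g)            (frequency)     (relative frequency)
----------------------------------------------------------
 500- 999            3              0.2
1000-1499            9              0.7
1500-1999           16              1.2
2000-2499           68              5.2
2500-2999          301             23.0
3000-3499          484             37.0
3500-3999          327             25.0
4000-4499           94              7.2
4500-4999            8              0.6
----------------------------------------------------------
Total             1310            100.0
----------------------------------------------------------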
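The percentage column of Table 3 can be recomputed from the counts. The following is a minimal sketch using the frequencies printed in the table; the variable names are assumptions for illustration only.

```python
# Frequencies from Table 3, keyed by birthweight group (g).
counts = {
    "500-999": 3,
    "1000-1499": 9,
    "1500-1999": 16,
    "2000-2499": 68,
    "2500-2999": 301,
    "3000-3499": 484,
    "3500-3999": 327,
    "4000-4499": 94,
    "4500-4999": 8,
}

total = sum(counts.values())

# Relative frequency (%) = 100 * frequency / total, shown to one decimal place.
percentages = {group: round(100 * n / total, 1) for group, n in counts.items()}

for group, n in counts.items():
    print(f"{group:>10} {n:6d} {percentages[group]:7.1f}")
print(f"{'Total':>10} {total:6d}")
```

The recomputed percentages agree with Table 3 to within 0.1 (the 3000-3499 group comes out at 36.9).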
rectangles: 500-999; 1000-1499; etc. These are what the recorded values were. A
(pedantic) case could also be made for touching rectangles with boundaries at 8.95,
9.95 etc. Both these would be a mistake. The point is to give a simple visual picture
of the distribution, not to display a lot of detail, which would distract your readers.
The cumulative relative frequency (also called cumulative percentage) of babies whose birth
weight is below 1500g is 0.2+0.7=0.9%, the cumulative relative frequency below 2000g is
0.2+0.7+1.2=2.1%, and so on. In a cumulative frequency curve the cumulative percentage
frequencies are plotted against the right hand ends of their intervals:
[Figure: cumulative relative frequency (%), from 0 to 100, plotted against birth weight, from 0 to 5000]
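The cumulative percentages can be built up as running totals. This sketch uses the counts from Table 3 and assumes that the right-hand end of each interval is the start of the next one (1000, 1500, and so on); that choice of boundary is an assumption of the example.

```python
from itertools import accumulate

# Assumed right-hand ends of the birthweight intervals (g) and the Table 3 counts.
upper_ends_g = [1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000]
counts = [3, 9, 16, 68, 301, 484, 327, 94, 8]

total = sum(counts)

# Running total of the counts, then converted to a percentage of all subjects.
cumulative_counts = list(accumulate(counts))
cumulative_pct = [100 * c / total for c in cumulative_counts]

for end, pct in zip(upper_ends_g, cumulative_pct):
    print(f"below {end} g: {pct:5.1f}%")
```

The second and third lines printed reproduce the 0.9% and 2.1% worked out in the text.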
It is not as easy to study the shape of a distribution from a cumulative curve as from a
histogram, but cumulative curves are easier to use when there are different grouping intervals
(they avoid the problems of calculating the heights of the bars to equalise the areas) and also
have other rather more specialised applications. They are useful for finding medians (see
below; briefly, the median birth weight is obtained by drawing a horizontal line through the
50% point on the vertical axis and noting the birthweight at the point where it cuts the curve).
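Reading the median off the curve can be imitated numerically. The sketch below joins the cumulative points for Table 3 with straight lines and finds where the 50% line crosses. The interval boundaries (500, 1000, ... 5000) and the function name are assumptions of this example, and the straight-line curve is only an approximation to the true distribution.

```python
def value_at_cumulative_pct(boundaries, cumulative_pct, target_pct):
    """Find where a straight-line cumulative curve reaches target_pct."""
    points = list(zip(boundaries, cumulative_pct))
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if p0 <= target_pct <= p1 and p1 > p0:
            # Linear interpolation within the interval that contains the target.
            return x0 + (x1 - x0) * (target_pct - p0) / (p1 - p0)
    raise ValueError("target percentage lies outside the curve")


# Assumed interval boundaries (g) and the counts from Table 3.
boundaries_g = [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000]
counts = [3, 9, 16, 68, 301, 484, 327, 94, 8]

total = sum(counts)
cumulative_pct = [0.0]  # 0% of babies lie below the lowest boundary
running = 0
for n in counts:
    running += n
    cumulative_pct.append(100 * running / total)

median_g = value_at_cumulative_pct(boundaries_g, cumulative_pct, 50)
print(f"approximate median birthweight: {median_g:.0f} g")
```

Under these assumptions the median comes out at a little under 3270 g. The same function gives other centiles by changing the target percentage.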
Table 5
                        Lifting at work
Birthweight          no                yes
(g)               N.      %         N.      %
<2000             28     2.1        25     2.6
2000-2499         68     5.2        49     5.1
2500-2999        301    23.0       244    25.5
3000-3499        484    37.0       345    36.1
3500-            429    32.7       293    30.6
Total           1310   100.0       956   100.0
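In a two-way table like Table 5 the percentages are calculated within each column, so that the two groups of mothers can be compared even though they differ in size. A minimal sketch using the counts from the table (the variable names are assumptions):

```python
# Counts from Table 5, one list per category of "lifting at work".
groups = ["<2000", "2000-2499", "2500-2999", "3000-3499", "3500-"]
counts = {
    "no": [28, 68, 301, 484, 429],
    "yes": [25, 49, 244, 345, 293],
}

# Column percentages: each count divided by the total of its own column.
column_pct = {}
for lifting, column in counts.items():
    column_total = sum(column)
    column_pct[lifting] = [100 * n / column_total for n in column]

for i, group in enumerate(groups):
    print(f"{group:>10}  no: {column_pct['no'][i]:5.1f}%   yes: {column_pct['yes'][i]:5.1f}%")
```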
There are several summary measures of central value and of spread, but the most common are
the mean and standard deviation.
The mean
The most commonly used measure of the central value of a distribution is the arithmetic
mean, or the average. It is the sum of the observations divided by the number of
observations. The mean birthweight in Example 1 (non-lifting mothers) is 3240g. As is
usual, it lies quite centrally in the histogram. We shall now review the calculation of the mean
in a simpler example:
Example 2
The plasma volumes of 8 healthy men are
2.75 2.86 3.37 2.76 2.62 3.49 3.05 3.12 litres, respectively
The arithmetic mean plasma volume is
\[ \frac{2.75 + 2.86 + 3.37 + 2.76 + 2.62 + 3.49 + 3.05 + 3.12}{8} = \frac{24.02}{8} = 3.0 \text{ litres} \]
A formula that corresponds to this calculation is
\[ \text{mean} = \bar{x} = \frac{\sum x_i}{n} \]
where n denotes the number of observations or values of the variable and the value of the i'th
observation is denoted x_i, so that x_1 = 2.75, x_2 = 2.86, etc. The Σ (Greek capital sigma)
symbol indicates that the values of x_i must be added.
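The same calculation in code, as a short sketch using the eight plasma volumes of Example 2 (the variable names are assumptions):

```python
# Plasma volumes (litres) of the 8 healthy men in Example 2.
plasma_volumes = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

n = len(plasma_volumes)          # number of observations
mean = sum(plasma_volumes) / n   # sum of the observations divided by n

print(f"n = {n}, mean = {mean:.2f} litres")
```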
\[ \text{SD} = \sqrt{\text{variance}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
We do not actually calculate SDs using this procedure, which is described here to give a feel
of what the SD is – a kind of average of deviations about the mean. (The divisor n-1 is used
rather than n for technical reasons beyond the scope of this Unit.)
Both the SD and the mean can be obtained on your calculator by using the statistics mode.
The Appendix shows how this is done for most Casio and other makes of calculator. SD and
mean are also given by many computer programs. (SD is denoted $\sigma_{n-1}$ by most calculators.)
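To give a feel for the arithmetic, the sketch below follows the SD formula step by step for the plasma volumes of Example 2 and then checks the result against the stdev function in Python's standard statistics module, which also uses the divisor n-1. The variable names are assumptions for illustration.

```python
import math
import statistics

# Plasma volumes (litres) of the 8 healthy men in Example 2.
plasma_volumes = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

n = len(plasma_volumes)
mean = sum(plasma_volumes) / n

# Squared deviation of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in plasma_volumes]

# Variance uses the divisor n - 1; the SD is its square root.
variance = sum(squared_deviations) / (n - 1)
sd = math.sqrt(variance)

# Cross-check against the library routine (sample SD, divisor n - 1).
assert math.isclose(sd, statistics.stdev(plasma_volumes))

print(f"mean = {mean:.2f} litres, SD = {sd:.2f} litres")
```

The SD comes out at about 0.31 litres, in the same units as the observations.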
The mean and SD are usually the best summary measures of symmetrical distributions. For
these, the interval one standard deviation either side of the mean includes roughly 70% of the
distribution and two standard deviations includes roughly 95%.
Tables of means and standard deviations are an alternative way of comparing the distribution
of a quantitative response variable in two or more groups. Example 1 again:
Table 6
─────────────────────────────────────────────────────
Lifting    Number of          Birthweight in g
at work    deliveries (%)     -----------------------
                              Mean        SD
─────────────────────────────────────────────────────
NO         1,310 (57.8%)      3239.5      559.0
YES          956 (42.2%)      3190.6      553.4
─────────────────────────────────────────────────────
Example 3
The numbers of days spent in hospital by 17 subjects after an operation, arranged in
increasing order, were:
3 4 4 6 8 8 8 10 10 12 14 14 17 25 27 37 42
The distribution is not symmetric (asymmetric) because low values are closer together
and often repeated, compared with the string of high values. The mean is 14.6 days.
This is not in the centre of the distribution.
For a distribution with a large number of observations the quartiles are most easily found
from the cumulative relative frequency distribution (such as in the example above), by
reading off the values that correspond to 25%, 50%, and 75%.
For a smaller number of observations the median can be found directly by arranging the
observations in order from the lowest to the highest value and striking off values at both ends
until only one or two remain. If one, this value is the median; if two the median is half way
between them. The median is then used to divide the data into two halves and the medians of
each of the halves found in the same way - these are the upper and lower quartiles. (If the
median is the single central value, include it in each half).
For Example 3, the median stay in hospital is 10 days. The 1st and 3rd quartiles are 8 and 17
days. (Formally, there are a number of definitions of the quartiles; one computer package
even gives four different versions! The details of these are not important to us.)
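The striking-off procedure described above can be written out as a short sketch. It follows the rule given in these notes (the central value is included in both halves when the number of observations is odd), so its quartiles may differ slightly from those of a statistics package. The function names are assumptions, and the data are assumed to be non-empty.

```python
def median_by_striking(ordered):
    """Strike off one value at each end until only one or two remain."""
    values = list(ordered)
    while len(values) > 2:
        values = values[1:-1]
    if len(values) == 1:
        return values[0]
    return (values[0] + values[1]) / 2  # half way between the two that remain


def quartiles_by_halves(data):
    """Return (lower quartile, median, upper quartile) using the halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # Odd n: the single central value is included in each half.
        lower, upper = ordered[: half + 1], ordered[half:]
    else:
        lower, upper = ordered[:half], ordered[half:]
    return (
        median_by_striking(lower),
        median_by_striking(ordered),
        median_by_striking(upper),
    )


# Days in hospital for the 17 subjects of Example 3.
stay_days = [3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42]

q1, median, q3 = quartiles_by_halves(stay_days)
print(q1, median, q3)  # 8 10 17
```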
Centiles
The quartiles are the values which correspond to the cumulative percentages 25, 50 and 75,
but there is no need to stick to these percentages. When using a distribution as a standard, for
example the distribution of weight standardised for age in young children, it is common
practice to report the values corresponding to percentages such as 5%, 10% and 25%. These are
known as the 5th, 10th, 25th centiles (or percentiles) of the distribution.
To every positive number x there corresponds its natural logarithm ln(x). The two main arithmetical
properties of natural logarithms (and any other kind of logarithm) are
ln(xy) = ln(x) + ln(y)
ln(x/y) = ln(x) - ln(y).
That is, they convert multiplication to addition and division to subtraction.
To convert a number to its natural log with your calculator use the ln key. To convert the log of
x back to x the anti-logarithm function is used; this is the key SHIFT ln (or e^x) on Casio and
many other calculators, sometimes written as exp(x) in text.
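These properties are easy to check numerically. In the sketch below, math.log is Python's natural logarithm (the ln key) and math.exp is the anti-logarithm (the e^x key); the two numbers are arbitrary values chosen for illustration.

```python
import math

x, y = 8.0, 42.0  # any two positive numbers will do

# ln converts multiplication to addition and division to subtraction.
assert math.isclose(math.log(x * y), math.log(x) + math.log(y))
assert math.isclose(math.log(x / y), math.log(x) - math.log(y))

# The anti-logarithm exp() takes ln(x) back to x.
back = math.exp(math.log(x))
print(f"ln({x}) = {math.log(x):.4f}, exp of that = {back:.4f}")
```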
We now consider the use of logarithms with skewed data.
Example 3 - continued
We return to the 17 observations of duration of stay in hospital. As has been said, the
distribution is skewed to the right with a few rather large observations. We recall that the
mean duration (14.6 days) is not usually a satisfactory measure of the centre and that the
median (10 days) is better for this purpose. The figure below repeats the point plot of
the observations, then shows the equivalent plot of their logarithms.
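One summary that uses the log scale, listed in the objectives for this session, is the geometric mean: take the logarithm of each observation, find the mean of the logarithms, and convert back with the anti-logarithm. The following is a minimal sketch for the 17 durations of stay in Example 3; the variable names are assumptions.

```python
import math

# Days in hospital for the 17 subjects of Example 3.
stay_days = [3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42]

# Step 1: natural logarithm of each observation.
log_days = [math.log(d) for d in stay_days]

# Step 2: arithmetic mean of the logarithms.
mean_log = sum(log_days) / len(log_days)

# Step 3: anti-log of that mean gives the geometric mean, back in days.
geometric_mean = math.exp(mean_log)

arithmetic_mean = sum(stay_days) / len(stay_days)
print(f"arithmetic mean = {arithmetic_mean:.1f} days")
print(f"geometric mean  = {geometric_mean:.1f} days")
```

For these data the geometric mean is about 11.1 days, closer to the median (10 days) than the arithmetic mean (14.6 days) is.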
Some General Rules for Reporting Summaries
Always report the number of observations on which the summary (percentage, mean, etc) is
based.
If the central value of a quantitative distribution is measured using the mean give the
standard deviation as well.
If the central value of a quantitative distribution is measured using the median, give the
lower and upper quartiles as well.
For binary responses (two possible values, A or B) report the percentage of A's or B's but
not usually of both.
Question 1a
             Variable                                   Type
Response     Whether or not vaccinated                  binary (yes, no)
Explanatory  Socio-economic status                      ordered categorical (usually)
             Ethnic status                              categorical (unordered)
             Sex (not mentioned as of interest, but     binary
             might be required to control
             confounding; see epidemiology)
             Other variables (not mentioned) might also be required to control
             confounding, e.g. age (quantitative)
Question 1b
             Variable                                   Type
Response     Whooping cough                             binary (yes, no)
Explanatory  Whether or not vaccinated                  binary (yes, no)
             Other variables (not mentioned) might also be required to control
             confounding, e.g. age (quantitative)
Note: These two questions illustrate that whether a variable is a response or explanatory
variable depends on the context – vaccination status was the response variable in 1a, but an
explanatory variable in 1b.