Lecture 1 NOTES Variables and Distributions 2020-21

This document introduces key concepts for analyzing variables and distributions in studies. It defines explanatory and response variables, and different types of variables including binary, categorical, and quantitative. It discusses displaying and interpreting frequency tables and distributions for variables, including histograms. The purpose is to help students produce and interpret summaries of data to answer research questions.

Uploaded by

Hannah Matthews

LECTURE 1

VARIABLES AND DISTRIBUTIONS

Objectives
After completing this session, students should be able to:
1/ Identify the variables in a study, their types (binary, (ordered) categorical,
quantitative), and whether they are explanatory or response variables.
2/ Display and interpret distributions of variables in frequency and cumulative
frequency tables, histograms and cumulative frequency curves.
3/ Construct and interpret two-way frequency tables comparing distributions.
4/ Construct and interpret a summary of a distribution of a quantitative variable using
the mean, standard deviation, median, quartiles, and other centiles.
5/ Distinguish a skewed from a symmetric distribution; use logarithms to make more
symmetric a skewed distribution; calculate a geometric mean of such a distribution.

INTRODUCTION
We will look at a simple example first, then refer back to it to introduce some key concepts.

Example 1:-
A group of 2266 Canadian newborns is classified by birthweight and by whether the
mother had to lift as part of her job (yes or no). Interest is in whether babies born to
mothers who had to lift at work are likely to be smaller. The data from the study
might be held in a computer file looking something like this:
Table 1
Subject ID    Birthweight    Lifting at work
number        (g)            (0 = no, 1 = yes)

1             3300           0
2             4100           0
3             3970           1
4             3840           0
…             …              …
…             …              …
2265          4000           0
2266          3300           0

The purpose of this lecture is to introduce concepts and methods to help you produce and
interpret summaries of these and similar data to answer questions like that posed above.

SUBJECTS AND VARIABLES

In example 1, and in most health studies, information (data) was collected on the
characteristics of persons included in the study (study subjects). In example 1, these
characteristics are birthweight and mother lifting at work. Other examples of characteristics
are sex, environmental exposures, treatments, blood pressure, and experience of disease. We
call the characteristics variables, because the value a variable takes (for example birthweight
in grams, whether lifted at work) varies from person to person.
In Table 1, each row represents a subject, and each column a variable. This is the way data
are usually kept by computers.
In this example, and often, study subjects are persons. However sometimes rather than
measuring characteristics (variables) for persons we measure them for other “units”, for
example households, towns, hospitals, or mosquitoes. In statistical terminology, we still call
these units subjects. A value of each variable is needed for each subject.

Explanatory and response variables


A variable will usually be measured for one of two purposes:-
• It is an outcome of interest. In example 1 above, birthweight is the outcome.
  Outcome variables are also called response variables.
• It is a factor that influences (or might influence) the outcome. In the example, lifting
  is such a variable. These are often called explanatory variables.
More advanced point: In our example, we have suggested looking at lifting as the explanatory
variable and birthweight as the response variable. Note that in other studies, for example a
sociological one looking at the association between ethnicity and lifting, lifting could be a
response variable. In a study looking at the association of birthweight with diseases later on in
life, birthweight would be an explanatory variable. The distinction between explanatory and
response variable is context-specific, not an intrinsic attribute of the variable.

Types of variable
Variables can also be classified by the types of values they can take. The main ones are:
• Qualitative
  o Binary (dichotomous) variables, where the values are two different
    categories; for example, sex, or vaccinated status for a particular vaccine, or
    being of low birthweight or not.
  o Categorical variables, taking as values several different categories that are
    distinct from each other; for example marital status (never married, married,
    widowed, divorced), or blood group.
  o Ordered categorical variables, for which the different categories are
    ordered on some scale; for example, severity of disease (mild, moderate,
    severe) or a disability score.
• Quantitative (numerical) variables, where some quantity is measured on a well defined
  scale with units; for example, weight, blood pressure, number of episodes of asthma in a
  fixed period.

Coding variable values


When the values of a binary or categorical variable are recorded, they are usually given
numerical codes for computer use. For example, the values "yes" and "no" of lifting at work
were coded 1 and 0 in Example 1. However, this does not make these variables quantitative.
Information is sometimes missing for some subjects, in which case there should also be a
special code for "missing value", to allow us to omit those subjects from analyses. This
applies to all types of variables.
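
As a minimal sketch of these coding conventions, the Python below stores the four rows visible in Table 1, decodes the lifting codes, and leaves out a subject with a missing value. The code -9 for "missing" and the extra subject 9999 are hypothetical, introduced only for illustration; the notes do not say which code the study used.

```python
# Codes used in Table 1 for lifting at work.
LIFTING_LABELS = {0: "no", 1: "yes"}
MISSING = -9  # hypothetical missing-value code, not taken from the study

# (subject ID, birthweight in g, lifting code): the first four rows of Table 1,
# plus one invented subject whose lifting status was not recorded.
rows = [
    (1, 3300, 0),
    (2, 4100, 0),
    (3, 3970, 1),
    (4, 3840, 0),
    (9999, 3500, MISSING),  # invented row for illustration
]

# Omit subjects with a missing value before analysing the variable.
complete = [row for row in rows if row[2] != MISSING]

for subject_id, birthweight, code in complete:
    print(subject_id, birthweight, LIFTING_LABELS[code])

# The codes are labels, not quantities. For a 0/1 variable the average of
# the codes happens to equal the proportion coded 1.
proportion_lifting = sum(code for _, _, code in complete) / len(complete)
```

Choosing a missing-value code that cannot be mistaken for a real value (here a negative number) is what makes the filtering step safe.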

Another example: A randomised controlled trial for a new drug for the treatment of
hypertension. The response (outcome) is blood pressure or change in blood pressure. The
principal explanatory variable is whether a subject is assigned to the new drug or control.
Question 1
What would be the response and explanatory variables in the following studies?
1a A questionnaire survey of whooping cough vaccination status in a sample of boys and
girls, aiming to identify if socio-economic or ethnic status determined whether children were
vaccinated.
1b A study of the occurrence of whooping cough in children, aiming to identify how effective
vaccination was at preventing whooping cough.
Question 2 Write down the types of the variables you identified in question 1.

FREQUENCY TABLES AND DISTRIBUTIONS


We can see how the values change from subject to subject just by eye-balling a table like
Table 1. But some way of summarising what we see is usually necessary.
The set of frequencies with which the different possible values of a variable occur in a
group of subjects is called the frequency distribution of the variable in the group. For a
very simple example, consider the frequency distribution of lifting at work in example 1,
which is shown in Table 2 below. Notice that it was useful to include in the table not just
the counts of subjects lifting and not lifting (frequencies), but also the percentages
(relative frequencies). A table like Table 2 is called a frequency table.

Table 2
--------------------------------------------
Lifting      Number of deliveries
at work      Frequency    Relative frequency (%)
--------------------------------------------
NO           1,310        57.8
YES            956        42.2

All          2,266        100.0
--------------------------------------------
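
A frequency table of this kind can be sketched in a few lines of Python. The individual records are not reproduced in these notes, so the list of codes below is rebuilt from the two counts in Table 2; that reconstruction is an assumption made only to have something to count.

```python
from collections import Counter

# Lifting codes (0 = no, 1 = yes) for all 2266 deliveries, rebuilt from the
# counts in Table 2 because the raw records are not given in the notes.
lifting = [0] * 1310 + [1] * 956

frequencies = Counter(lifting)
total = sum(frequencies.values())

# Relative frequency = frequency / total, expressed as a percentage.
relative = {code: 100 * count / total for code, count in frequencies.items()}

for code, label in [(0, "NO"), (1, "YES")]:
    print(f"{label:<4}{frequencies[code]:>6}{relative[code]:>7.1f}%")
print(f"{'All':<4}{total:>6}{100.0:>7.1f}%")
```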
For qualitative variables (such as lifting at work) frequency tables usually include one row for
each value of the variable (here NO and YES). For quantitative variables such tables are
seldom helpful, unless the number of observations is quite small. It is more useful to group
the values taken by the variable and to report the numbers and the relative frequencies
(percentages) of subjects in each group. We show a grouped frequency table for the
distribution of birthweight in babies born to women who did not lift during pregnancy:

Table 3. The distribution of birth weight in 1310 women who did not lift at work
------------------------------------------------------
Birthweight    N. of women    Percentage
(g)            (frequency)    (relative frequency)
------------------------------------------------------
 500- 999          3             0.2
1000-1499          9             0.7
1500-1999         16             1.2
2000-2499         68             5.2
2500-2999        301            23.0
3000-3499        484            37.0
3500-3999        327            25.0
4000-4499         94             7.2
4500-4999          8             0.6
------------------------------------------------------
Total           1310           100.0
------------------------------------------------------

Showing distributions graphically; histograms


For qualitative variables frequency distributions can be displayed as bar charts (see notes on
The display of results in this manual). For quantitative variables a grouped frequency
distribution (Table 3) can be displayed in a histogram; see the figure below.

[Figure: histogram of birth weight (horizontal axis, 0 to 5000) against relative frequency in % (vertical axis, 0 to 40)]

Notice:
1. A histogram is made up of a rectangle for each group (row of the grouped frequency
table).
2. In histograms, the rectangles touch; there is no gap between them. This distinguishes
them from bar charts showing distributions of categorical variables, in which the bars
generally do not touch.
3. We can tell the shape of a distribution from a histogram, in particular whether it
is symmetrical. In this example the distribution is not quite symmetrical. It is
“skewed to the left”.
4. In this example, and others where the widths of the groups are equal, the height of
each rectangle represents the frequency. By changing the scale of the y-axis it can
also represent the relative frequency.
5. [Optional advanced point.] If histograms have groups of unequal width, the area of
the rectangle rather than the height represents frequency or relative frequency.
6. [Optional advanced point.] Novices might be tempted to draw non-touching
rectangles: 500-999; 1000-1499; etc. These are what the recorded values were. A
(pedantic) case could also be made for touching rectangles with boundaries at 499.5,
999.5 etc. Both these would be a mistake. The point is to give a simple visual picture
of the distribution, not to display a lot of detail, which would distract your readers.
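
Point 5 can be made concrete with a short calculation. In the sketch below the three lowest groups of Table 3 are merged into one wide group (500 to under 2000 g); the merge is an illustrative choice, not something done in the histogram above. Heights are expressed per 500 g of width so that area stays proportional to frequency.

```python
# (lower bound, upper bound exclusive, frequency) from Table 3, except that
# the three lowest groups (3 + 9 + 16 = 28 babies) are merged into one wide
# group. The merge is an illustrative assumption.
groups = [
    (500, 2000, 28),
    (2000, 2500, 68),
    (2500, 3000, 301),
    (3000, 3500, 484),
    (3500, 4000, 327),
    (4000, 4500, 94),
    (4500, 5000, 8),
]
total = sum(freq for _, _, freq in groups)

# Height = relative frequency (%) per 500 g of width, so that
# area (height x width) stays proportional to frequency.
heights = [
    100 * freq / total / ((upper - lower) / 500)
    for lower, upper, freq in groups
]

for (lower, upper, _), height in zip(groups, heights):
    print(f"{lower}-{upper - 1}: height {height:.2f}% per 500 g")
```

The wide group is drawn a third as tall as its percentage alone would suggest, because it is three times as wide as the others.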

Cumulative relative frequency tables and curves


An alternative to the histogram for quantitative variables is to display the cumulative
frequencies. These are calculated below for the birthweight data.
Table 4
--------------------------------------------------------------------------------
Birthweight    N. of women    Percentage              Cumulative percentage
(g)            (frequency)    (relative frequency)    (cumulative relative frequency)
--------------------------------------------------------------------------------
 500- 999          3             0.2                     0.2
1000-1499          9             0.7                     0.9
1500-1999         16             1.2                     2.1
2000-2499         68             5.2                     7.3
2500-2999        301            23.0                    30.3
3000-3499        484            37.0                    67.3
3500-3999        327            25.0                    92.2
4000-4499         94             7.2                    99.4
4500-4999          8             0.6                   100.0
--------------------------------------------------------------------------------
Total           1310           100.0                   100.0
--------------------------------------------------------------------------------

The cumulative relative frequency (also called cumulative percentage) of babies whose birth
weight is below 1500g is 0.2+0.7=0.9%, the cumulative relative frequency below 2000g is
0.2+0.7+1.2=2.1%, and so on. In a cumulative frequency curve the cumulative percentage
frequencies are plotted against the right hand ends of their intervals:

[Figure: cumulative relative frequency (%), 0 to 100, plotted against birth weight, 0 to 5000]

It is not as easy to study the shape of a distribution from a cumulative curve as from a
histogram, but cumulative curves are easier to use when there are different grouping intervals
(they avoid the problems of calculating the heights of the bars to equalise the areas) and also
have other rather more specialised applications. They are useful for finding medians (see
below; briefly, the median birth weight is obtained by drawing a horizontal line through the
50% point on the vertical axis and noting the birthweight at the point where it cuts the curve).
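
The cumulative percentages, and the reading-off of a median from the curve, can be sketched as below. Joining the plotted points by straight lines is an assumption of this sketch (the notes read the value off a drawn curve), and the function name `centile` is hypothetical.

```python
from itertools import accumulate

# Right-hand ends of the intervals (g) and frequencies from Table 3,
# treating the groups as 500 to under 1000, 1000 to under 1500, and so on.
right_ends = [1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000]
frequencies = [3, 9, 16, 68, 301, 484, 327, 94, 8]

total = sum(frequencies)
cumulative_pct = [100 * c / total for c in accumulate(frequencies)]

def centile(p):
    """Birthweight at cumulative percentage p, by straight-line interpolation."""
    # The curve starts at 0% at the left-hand end of the first interval.
    xs = [500] + right_ends
    ys = [0.0] + cumulative_pct
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if y0 <= p <= y1:
            return x0 + (x1 - x0) * (p - y0) / (y1 - y0)
    raise ValueError("p must be between 0 and 100")

median_estimate = centile(50)
print([round(c, 1) for c in cumulative_pct])
print(round(median_estimate))
```

The same function gives any other cumulative percentage, so it serves equally for the quartiles (25 and 75).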

Comparing distributions in two groups of subjects


Up to here we have shown a frequency table, histogram, and cumulative frequency curve,
which showed the distribution of a variable in one group of subjects. These “one-way”
presentations are purely descriptive; they give useful information about the values the
variable takes, but cannot show how it is related to anything else.
Two-way frequency tables allow two (or more) groups to be compared. Usually, the
interesting two-way table shows the distribution of the response variable (eg birthweight) in
groups defined by the explanatory variable (lifting at work):

Table 5
                       Lifting at work
Birthweight          no                 yes
(g)              N.       %         N.       %

<2000            28      2.1        25      2.6
2000-2499        68      5.2        49      5.1
2500-2999       301     23.0       244     25.5
3000-3499       484     37.0       345     36.1
3500-           429     32.7       293     30.6

Total          1310    100.0       956    100.0
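
The percentages in a two-way table such as Table 5 are taken within each group of the explanatory variable. A minimal sketch, using the counts from Table 5 (the variable names are illustrative):

```python
# Counts from Table 5: birthweight group -> (not lifting, lifting).
counts = {
    "<2000": (28, 25),
    "2000-2499": (68, 49),
    "2500-2999": (301, 244),
    "3000-3499": (484, 345),
    "3500-": (429, 293),
}

# Column totals: the number of deliveries in each lifting group.
total_no = sum(no for no, _ in counts.values())
total_yes = sum(yes for _, yes in counts.values())

# Percentages are taken within each column (explanatory group), so the
# response distributions can be compared despite unequal group sizes.
column_pct = {
    group: (100 * no / total_no, 100 * yes / total_yes)
    for group, (no, yes) in counts.items()
}

for group, (pct_no, pct_yes) in column_pct.items():
    print(f"{group:<10}{pct_no:>6.1f}{pct_yes:>6.1f}")
```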


SUMMARY MEASURES FOR QUANTITATIVE VARIABLES


Distributions of categorical variables (including binary and ordered categorical variables) can
be expressed as percentages; these percentages can be presented as such or contrasted
between groups of subjects. There is little more that can be done to summarise these
distributions. The distribution of a binary variable is however expressible in only one
percentage (with its accompanying total, for completeness). For example “42.2% of mothers
lifted at work (n = 2,266)”.
For a quantitative variable, there are other measures that aim to summarise the information in
the values taken by the variable. Two summaries are usually given:
• one for central value or "location" of the distribution
• one for spread: a measure that indicates how widely values are spread above and below
  the central value.

There are several summary measures of central value and of spread, but the most common are
the mean and standard deviation.
The mean
The most commonly used measure of the central value of a distribution is the arithmetic
mean, or the average. It is the sum of the observations divided by the number of
observations. The mean birthweight in Example 1 (non-lifting mothers) is 3240g. As is
usual, it lies quite centrally in the histogram. We shall now review the calculation of the mean
in a simpler example:

Example 2
The plasma volumes of 8 healthy men are
2.75 2.86 3.37 2.76 2.62 3.49 3.05 3.12 litres, respectively
The arithmetic mean plasma volume is
\[ \frac{2.75 + 2.86 + 3.37 + 2.76 + 2.62 + 3.49 + 3.05 + 3.12}{8} = \frac{24.02}{8} \approx 3.0 \text{ litres} \]
A formula that corresponds to this calculation is
\[ \text{mean} = \bar{x} = \frac{\sum x_i}{n} \]
where n denotes the number of observations or values of the variable and the value of the i’th
observation is denoted x_i, so that x_1 = 2.75, x_2 = 2.86, etc. The Σ (Greek capital sigma)
symbol indicates that the values of x_i must be added.

The standard deviation


This is the measure of spread used in conjunction with the mean. It is based on the deviations
of the observations from the mean, that is on the difference between each observation and the
mean.
These deviations are squared and added. The result is divided by (n-1). This gives a
kind of mean of squared deviations, which is called the variance. The standard deviation is
the square root of the variance. The abbreviation SD is often used for the standard deviation.
For those of you who find formulae help your understanding:

\[ \text{SD} = \sqrt{\text{variance}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
We do not actually calculate SDs using this procedure, which is described here to give a feel
of what the SD is – a kind of average of deviations about the mean. (The divisor n-1 is used
rather than n for technical reasons beyond the scope of this Unit.)
Both the SD and the mean can be obtained on your calculator by using the statistics mode.
The Appendix shows how this is done for most Casio and other makes of calculator. SD and
mean are also given by many computer programs. (SD is denoted $\sigma_{n-1}$ by most calculators.)
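
For readers who prefer to see the procedure spelled out, here is a minimal Python sketch applied to the plasma volumes of Example 2. It follows the steps described above and then checks the result against the standard library's `statistics.mean` and `statistics.stdev`, which also uses the divisor n-1.

```python
from math import sqrt
from statistics import mean, stdev

plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # litres

n = len(plasma)
x_bar = sum(plasma) / n

# Sum of squared deviations from the mean, divided by (n - 1): the variance.
variance = sum((x - x_bar) ** 2 for x in plasma) / (n - 1)
sd = sqrt(variance)

# The standard library gives the same answers; stdev() also divides by n - 1.
assert abs(x_bar - mean(plasma)) < 1e-9
assert abs(sd - stdev(plasma)) < 1e-9

print(f"mean = {x_bar:.2f} litres, SD = {sd:.2f} litres")
```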

The mean and SD are usually the best summary measures of symmetrical distributions. For
these, the interval one standard deviation either side of the mean includes roughly 70% of the
distribution and two standard deviations includes roughly 95%.
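
As a small worked illustration, the intervals can be computed for the birthweights of babies born to non-lifting mothers, using the mean and SD reported for that group in Table 6. This assumes the birthweight distribution is close enough to symmetric for the rule of thumb to apply; the histogram suggests it is only roughly so.

```python
mean_g, sd_g = 3239.5, 559.0  # non-lifting mothers, from Table 6

# Interval one SD either side of the mean (roughly 70% of babies).
one_sd = (mean_g - sd_g, mean_g + sd_g)

# Interval two SDs either side of the mean (roughly 95% of babies).
two_sd = (mean_g - 2 * sd_g, mean_g + 2 * sd_g)

print(one_sd, two_sd)
```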
Tables of means and standard deviations are an alternative way of comparing the distribution
of a quantitative response variable in two or more groups. Example 1 again:
Table 6
─────────────────────────────────────────────────────
Lifting      Number of            Birthweight in g
at work      deliveries (%)       Mean        SD
─────────────────────────────────────────────────────
NO           1,310 (57.8%)        3239.5      559.0
YES            956 (42.2%)        3190.6      553.4

All          2,266 (100%)         3219.4      557.0
─────────────────────────────────────────────────────
Would you choose this or table 5 to present a comparison of birthweight in babies born to
mothers lifting during pregnancy and those not lifting?

Non Symmetric Distributions

Example 3
The numbers of days spent in hospital by 17 subjects after an operation, arranged in
increasing order, were:
3 4 4 6 8 8 8 10 10 12 14 14 17 25 27 37 42
The distribution is not symmetric (asymmetric) because low values are closer together
and often repeated, compared with the string of high values. The mean is 14.6 days.
This is not in the centre of the distribution.

The median and quartiles of a distribution


The median is an alternative measure of central value that works better for such a skewed
distribution. It is the value which halves the distribution, with 50% of the observations
below it and 50% above.
The three values which divide the distribution into quarters are called the quartiles. The
middle quartile is the median, and the distance between the lower quartile and the upper
quartile, called the inter-quartile range, is used as a measure of spread.

For a distribution with a large number of observations the quartiles are most easily found
from the cumulative relative frequency distribution (such as in the example above), by
reading off the values that correspond to 25%, 50%, and 75%.
For a smaller number of observations the median can be found directly by arranging the
observations in order from the lowest to the highest value and striking off values at both ends
until only one or two remain. If one, this value is the median; if two the median is half way
between them. The median is then used to divide the data into two halves and the medians of
each of the halves found in the same way - these are the upper and lower quartiles. (If the
median is the single central value, include it in each half).
For example 3, the median stay in hospital is 10 days. The 1st and 3rd quartiles are 8 and 17
days. (Formally, there are a number of definitions of the quartiles; one computer package
even gives four different versions! The details of these are not important to us).
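
The halving procedure described above can be sketched in Python as follows. The function name `quartiles_by_halves` is hypothetical, and this is only one of the several quartile definitions just mentioned.

```python
from statistics import median

def quartiles_by_halves(values):
    """Lower quartile, median, upper quartile by the halving method."""
    ordered = sorted(values)
    half = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd n: the median is a single value, included in both halves.
        lower, upper = ordered[: half + 1], ordered[half:]
    else:
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)

# Days in hospital from Example 3.
days = [3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42]
q1, q2, q3 = quartiles_by_halves(days)
iqr = q3 - q1  # inter-quartile range

print(q1, q2, q3, iqr)
```

For the 17 hospital stays this gives 8, 10 and 17 days, matching the values quoted above, and an inter-quartile range of 9 days.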

Centiles
The quartiles are the values which correspond to the cumulative percentages 25, 50 and 75,
but there is no need to stick to these percentages. When using a distribution as a standard, for
example the distribution of weight standardised for age in young children, it is common
practice to report the values corresponding to the percentages, say, 5%, 10%, 25%. These are
known as the 5th, 10th, 25th centiles (or percentiles) of the distribution.

Overall summary of a distribution


The following five numbers give a general purpose summary of a distribution, and `work’
for both non-symmetric and symmetric distributions:-
The smallest value
The lower quartile, Q25
The median (Q50)
The upper quartile, Q75
The largest value
These numbers are sometimes shown in a figure called a box plot. In this, a bar represents
the median, the box goes between the quartiles, “whiskers” include 95% of a well-behaved
distribution, and outlying values are *s or “o”s.

[Figure: box plot for example 3]
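
A five-number summary for Example 3 can be obtained with Python's standard library, as sketched below. `statistics.quantiles` implements two of the several quartile definitions in use; its "inclusive" method is chosen here as an assumption, and other definitions can give slightly different quartiles for other data.

```python
from statistics import quantiles

# Days in hospital from Example 3.
days = [3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42]

# n=4 asks for the three cut points that divide the data into quarters.
q25, q50, q75 = quantiles(days, n=4, method="inclusive")

five_numbers = (min(days), q25, q50, q75, max(days))
print(five_numbers)
```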

Logarithms and Distributions


Another way to cope with skewed distributions, where (as with the days of stay in hospital) the
skew is "to the right", is by using the logarithms, or logs, of the data values for statistical
analysis, in place of the values themselves.
Before considering this procedure, recall a few points concerning logs.
There are different kinds of logarithm. Logarithms "to the base 10" were invented to do
multiplication and division by way of addition and subtraction, but they no longer play this role
in arithmetic because multiplication and division can be done with a calculator. We shall use
another kind of logarithm called the natural logarithm.

For every positive number x there corresponds its logarithm ln(x). The two main arithmetical
properties of natural logarithms (and any other kind of logarithm) are
ln(xy) = ln(x) + ln(y)
ln(x/y) = ln(x) - ln(y).
That is, they convert multiplication to addition and division to subtraction.
To convert a number to its natural log with your calculator use the ln key. To convert the log of
x back to x the anti-logarithm function is used; this is the key SHIFT ln (or e^x) on Casio and
many other calculators, sometimes written as exp(x) in text.
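
These properties are easy to check numerically. In Python, `math.log` is the natural logarithm and `math.exp` is its anti-logarithm; the two numbers used below are arbitrary positive values chosen for illustration.

```python
import math

x, y = 8.0, 42.0  # any two positive numbers

# Multiplication becomes addition, division becomes subtraction.
assert math.isclose(math.log(x * y), math.log(x) + math.log(y))
assert math.isclose(math.log(x / y), math.log(x) - math.log(y))

# exp() is the anti-logarithm: it undoes the natural log.
round_trip = math.exp(math.log(x))
assert math.isclose(round_trip, x)
```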
We now consider the use of logarithms with skewed data.

Example 3 - continued
We return to the 17 observations of duration of stay in hospital. As has been said, the
distribution is skewed to the right with a few rather large observations. We recall that the
mean duration (14.6 days) is not usually a satisfactory measure of the centre and the
median (10 days) is better for this purpose. The figure below repeats the point plot of
the observations, then shows the equivalent plot of their logarithms.

The distribution of the logs is more symmetric.


The mean log duration is 2.41 and is a satisfactory measure of the central value of the
distribution of log duration. The anti-logarithm of this mean is 11.13 and is known as
the geometric mean. It is usually a more useful measure of the central value of the
distribution of duration than the original mean and is close to the median for positively
skewed distributions.
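
The geometric mean calculation for Example 3 can be sketched as follows: take natural logs, average them, and take the anti-logarithm of the average.

```python
import math
from statistics import mean, median

# Days in hospital from Example 3.
days = [3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42]

log_days = [math.log(d) for d in days]   # natural logs of the durations
mean_log = mean(log_days)                # mean on the log scale
geometric_mean = math.exp(mean_log)      # back-transform to days

print(f"arithmetic mean = {mean(days):.1f} days")
print(f"median          = {median(days)} days")
print(f"mean of logs    = {mean_log:.2f}")
print(f"geometric mean  = {geometric_mean:.2f} days")
```

The geometric mean lies between the median and the arithmetic mean here, which is typical of a distribution that is skewed to the right.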

Some General Rules for Reporting Summaries
• Always report the number of observations on which the summary (percentage, mean, etc) is
  based.
• If the central value of a quantitative distribution is measured using the mean, give the
  standard deviation as well.
• If the central value of a quantitative distribution is measured using the median, give the
  lower and upper quartiles as well.
• For binary responses (two possible values, A or B) report the percentage of A's or B's but
  not usually of both.

Answers to reader exercise questions


Answer to Questions 1 and 2:
Response or      Variable                                  Type (Question 2)
explanatory

Question 1a
Response         Whether or not vaccinated                 binary (yes, no)
Explanatory      socio-economic status                     ordered categorical (usually)
                 ethnic status                             categorical (unordered)
                 sex (not mentioned as of interest,        binary
                 but might be required to control
                 confounding; see epidemiology)
                 Other variables (not mentioned)
                 might also be required to control
                 confounding, eg age (quantitative)

Question 1b
Response         Whooping cough                            binary (yes, no)
Explanatory      Whether or not vaccinated                 binary (yes, no)
                 Other variables (not mentioned)
                 might also be required to control
                 confounding, eg age (quantitative)

Note: These two questions illustrate that whether a variable is a response or explanatory
variable depends on the context – vaccination status was the response variable in 1a, but an
explanatory variable in 1b.
