Statistics 1
J.S. Abdey
ST104a
2024
Undergraduate study in
Economics, Management,
Finance and the Social Sciences
This subject guide is for a 100 course offered as part of the University of London’s
undergraduate study in Economics, Management, Finance and the Social
Sciences. This is equivalent to Level 4 within the Framework for Higher Education
Qualifications in England, Wales and Northern Ireland (FHEQ).
For more information see: london.ac.uk
This guide was prepared for the University of London by:
James S. Abdey, BA (Hons), MSc, PGCertHE, PhD, Department of Statistics, London
School of Economics and Political Science.
This is one of a series of subject guides published by the University. We regret that
due to pressure of work the author is unable to enter into any correspondence
relating to, or arising from, the guide. If you have any comments on this subject
guide, please communicate these through the discussion forum on the virtual
learning environment.
University of London
Publications Office
Stewart House
32 Russell Square
London WC1B 5DN
United Kingdom
london.ac.uk
Contents
0 Preface 1
0.1 Route map to the subject guide . . . . . . . . . . . . . . . . . . . . . . . 1
0.2 Introduction to the subject area . . . . . . . . . . . . . . . . . . . . . . . 1
0.3 Syllabus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
0.4 Aims and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
0.5 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
0.6 Employability outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
0.7 Overview of learning resources . . . . . . . . . . . . . . . . . . . . . . . . 3
0.7.1 The subject guide . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
0.7.2 Mathematical background . . . . . . . . . . . . . . . . . . . . . . 4
0.7.3 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
0.7.4 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
0.7.5 Online study resources . . . . . . . . . . . . . . . . . . . . . . . . 5
0.7.6 The VLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
0.7.7 Making use of the Online Library . . . . . . . . . . . . . . . . . . 7
0.8 Examination advice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Probability theory 51
3.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5 Interval estimation 91
5.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3 Recommended reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.1 Principle of confidence intervals . . . . . . . . . . . . . . . . . . . 92
5.5 Interval estimation for a population mean . . . . . . . . . . . . . . . . . 94
5.5.1 Variance known (σ 2 known) . . . . . . . . . . . . . . . . . . . . . 94
5.5.2 Variance unknown (σ 2 unknown) . . . . . . . . . . . . . . . . . . 95
5.5.3 Student’s t distribution . . . . . . . . . . . . . . . . . . . . . . . . 96
5.5.4 Confidence interval for a single mean (σ 2 known) . . . . . . . . . 98
5.5.5 Confidence interval for a single mean (σ 2 unknown) . . . . . . . . 99
5.6 Confidence interval for a single proportion . . . . . . . . . . . . . . . . . 100
5.7 Sample size determination . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.8 Estimation of differences between parameters of two populations . . . . . 103
5.9 Difference between two population means . . . . . . . . . . . . . . . . . . 104
5.9.1 Unpaired samples: variances known . . . . . . . . . . . . . . . . . 104
5.9.2 Unpaired samples: variances unknown and unequal . . . . . . . . 106
5.9.3 Unpaired samples: variances unknown and equal . . . . . . . . . . 107
5.9.4 Paired (dependent) samples . . . . . . . . . . . . . . . . . . . . . 109
5.10 Difference between two population proportions . . . . . . . . . . . . . . . 111
5.11 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.12 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.13 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 114
5.14 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 114
A Mathematics primer and the role of statistics in the research process 211
A.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
A.2 Practice problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
A.3 Solutions to Practice problems . . . . . . . . . . . . . . . . . . . . . . . . 216
Chapter 0
Preface
You may choose to take ST104B Statistics 2 so that you can study the concepts
introduced here in greater depth. A natural continuation of this half course and
ST104B Statistics 2 are the advanced half courses ST2133 Advanced
statistics: distribution theory and ST2134 Advanced statistics: statistical
inference.
Two applied statistics courses for which this half course is a prerequisite are
ST2187 Business analytics, applied modelling and prediction and ST3188
Statistical methods for market research.
You may wish to develop your economic statistics by taking EC2020 Elements of
econometrics, which requires ST104B Statistics 2 as well.
The chapters are not a series of self-contained topics; rather, they build on each other
sequentially. As such, you are strongly advised to follow the subject guide in chapter
order. There is little point in rushing past material which you have only partially
understood in order to reach the final chapter. Once you have completed your work on
all of the chapters, you will be ready for examination revision. A good place to start is
the sample examination paper which you will find at the end of the subject guide.
Colour has been included in places to emphasise important items. Formulae in the main
body of chapters are in blue – these exclude formulae used in examples. Key terms and
concepts when introduced are shown in magenta. References to other courses and half
courses are shown in purple (such as above). Terms in italics are shown in purple for
emphasis. References to chapters, sections, figures and tables are shown in teal.
Statistics forms a core component of our programmes. All of the courses mentioned
above require an understanding of the concepts and techniques introduced in this
course. You will develop analytical skills on this course that will help you with your
future studies and in the world of work.
0.3 Syllabus
The up-to-date course syllabus for ST104A Statistics 1 can be found in the course
information sheet, which is available on the course VLE page.
be familiar with the key ideas of statistics that are accessible to a student with a
moderate mathematical competence
be able to summarise the ideas of randomness and variability, and the way in which
these link to probability theory to allow the systematic and logical collection of
statistical techniques of great practical importance in many applied areas
have a grounding in probability theory and some grasp of the most common
statistical methods
be able to use correlation analysis and simple linear regression and know when it is
appropriate to do so.
0.6 Employability outcomes
1. complex problem-solving
2. decision making
3. communication.
Week Chapter
1 Chapter 1: Mathematics primer and the role of statistics in the research process
2 Chapter 2: Data visualisation and descriptive statistics
3 Chapter 3: Probability theory
4 Chapter 4: Random variables, the normal and sampling distributions
5 Chapter 5: Interval estimation of means and proportions
6 Chapter 6: Hypothesis testing principles
7 Chapter 7: Hypothesis testing of means and proportions
8 Chapter 8: Contingency tables and the chi-squared test
9 Chapter 9: Sampling and experimental design
10 Chapter 10: Correlation and linear regression
The last step is the most important. It is easy to think that you have understood the
material after reading it, but working through problems is the crucial test of
understanding. Problem-solving should take up most of your study time.
To prepare for the examination, you will only need to read the material in the subject
guide, but it may be helpful from time to time to look at the suggested ‘Further
reading’ below.
Calculators
A calculator may be used when answering questions on the examination paper for
ST104A Statistics 1. It must comply in all respects with the specification given in
the Regulations. You should also refer to the admission notice you will receive when
entering the examination and the ‘Notice on permitted materials’.
0.7 Overview of learning resources
Statistical tables
Lindley, D.V. and W.F. Scott New Cambridge Statistical Tables. (Cambridge:
Cambridge University Press, 1995) 2nd edition [ISBN 9780521484855].
The relevant extracts can be found at the end of this subject guide, and are the same as
those distributed for use in the examination. It is advisable that you become familiar
with them, rather than those at the end of a textbook which may differ in presentation.
This textbook shows many real-world business applications of all statistical methods
covered in ST104A Statistics 1. The textbook is also useful for ST2187 Business
analytics, applied modelling and prediction and ST3188 Statistical methods
for market research so if you study either of these courses you can benefit from a
single textbook!
If you have forgotten these login details, please click on the ‘Forgot Password’ link on
the login page.
Course materials: Subject guides and other course materials available for
download. In some courses, the content of the subject guide is transferred into the
VLE and additional resources and activities are integrated with the text.
Discussion forums: A space where you can share your thoughts and questions
with fellow students. Many forums will be supported by a ‘course moderator’, a
subject expert employed by LSE to facilitate the discussion and clarify difficult
topics.
Study skills: Expert advice on getting started with your studies, preparing for
examinations and developing your digital literacy skills.
Some of these resources are available for certain courses only, but we are expanding our
provision all the time and you should check the VLE regularly for updates.
0.8 Examination advice
where available, past examination papers and Examiners’ commentaries for the
course which give advice on how each question might best be answered.
Chapter 1
Mathematics primer and the role of
statistics in the research process
recall and use common signs: square, square root, ‘greater than’, ‘less than’ and
absolute value
demonstrate use of the summation operator and work with the ‘i’, or index, of x
1.4 Introduction
This opening chapter introduces some basic concepts and mathematical tools upon
which the rest of the half course is built. Before proceeding to the rest of the subject
guide, it is essential that you have a solid understanding of these fundamental concepts
and tools.
You should be a confident user of the basic mathematical operations (addition,
subtraction, multiplication and division) and be able to use these operations on a
calculator. The content of this chapter is expected to be a ‘refresher’ of the elementary
algebraic and arithmetic rules from schooldays. Some material featured in this chapter
may be new to you, such as the summation operator and graphs of linear functions. If
so, you should master these new ideas before progressing.
Finally, remember that although it is unlikely that an examination question would test
you on the topics in this chapter alone, the material covered here may well be an
important part of the answer!
Brackets
Orders (i.e. powers and square roots)
Division
Multiplication
Addition
Subtraction.
Example 1.1 What is (35 ÷ 7 + 2) − (4² − 8 × 3)?
BODMAS tells us to work out the brackets first. Here there are two sets of brackets,
so let us deal with them one at a time.
First bracket: 35 ÷ 7 + 2
• do division first: 35 ÷ 7 + 2 = 5 + 2
• then perform the addition: 5 + 2 = 7.
Second bracket: 4² − 8 × 3
• do order first: 4² − 8 × 3 = 16 − 8 × 3
• next do multiplication: 16 − 8 × 3 = 16 − 24
• then perform the subtraction: 16 − 24 = −8.
Now the problem has been simplified we complete the calculation with the final
subtraction: 7 − (−8) = 7 + 8 = 15. Note that the two negatives become positive!
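The same BODMAS evaluation can be checked by machine, since most programming languages apply the same precedence rules. A quick sketch in Python (an illustration, not part of the guide; note that `/` returns a decimal result):

```python
# Example 1.1 revisited: (35 ÷ 7 + 2) − (4² − 8 × 3).
# Python follows the same precedence: brackets, then powers ('orders'),
# then division/multiplication, then addition/subtraction.
first_bracket = 35 / 7 + 2      # division before addition: 5 + 2 = 7
second_bracket = 4**2 - 8 * 3   # power, then multiplication: 16 - 24 = -8
print(first_bracket - second_bracket)  # 7 - (-8) = 15.0
```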
In the common fraction, the top number is the numerator and the bottom number is
the denominator. In practice, decimal fractions are more commonly used.
When multiplying fractions together, just multiply all the numerators together to
obtain the new numerator, and do the same with the denominators. For example:
4/9 × 1/3 × 2/5 = (4 × 1 × 2)/(9 × 3 × 5) = 8/135.
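This multiplication can be verified exactly with Python's standard fractions module; a minimal sketch:

```python
from fractions import Fraction

# Multiply numerators together and denominators together:
# (4 × 1 × 2) / (9 × 3 × 5) = 8/135
product = Fraction(4, 9) * Fraction(1, 3) * Fraction(2, 5)
print(product)  # 8/135
```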
One useful sign in statistics is | | which denotes the absolute value. This is the
numerical value of a real number regardless of its sign (positive or negative). The
absolute value of x, sometimes referred to as the modulus of x, or ‘mod x’, is |x|. So
|7.1| = |−7.1| = 7.1.
Statisticians sometimes want to indicate that they only want to use the positive value of
a number. For example, let the distance between town X and town Y be 5 miles.
Suppose someone walks from X to Y – a distance of 5 miles. A mathematician would
write this as +5 miles. Later, after shopping, the person returns to X and the
mathematician would record him as walking −5 miles (taking into account the direction
of travel). Hence this way the mathematician can show the person ended up where they
started. We, however, may be more interested in the fact that the person has had some
exercise that day! So, we need notation to indicate this. The absolute value enables us
to take only the positive values of our variables. The distance, d, from Y to X may well
be expressed mathematically as −5 miles, but you will probably be interested in the
absolute amount, so |−d| = d.
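In code, the absolute value corresponds to the built-in `abs` function; a small sketch of the examples above:

```python
# |7.1| = |−7.1| = 7.1
print(abs(7.1), abs(-7.1))  # 7.1 7.1

# The walker's signed displacement on the return leg is -5 miles,
# but the quantity of interest is the (positive) distance walked.
displacement = -5
print(abs(displacement))  # 5
```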
1.8.2 Inequalities
1.9 Summation operator, ∑

The summation operator, ∑, is likely to be new to many of you. It is widely used in
statistics and you will come across it frequently in ST104A Statistics 1, so make sure
you are comfortable using it before proceeding further!
Statistics involves data analysis, so to use statistical methods we need data! Individual
observations are typically represented using a subscript notation. For example, the
heights of n people1 would be represented by x1 , x2 , . . . , xn , where the subscript denotes
the order in which the heights are observed (x1 represents the height of the first observed
person, x2 the height of the second observed person etc.). Hence xi represents the height
of the ith individual and, in order to list them all, the subscript i must take all integer
values from 1 to n, inclusive. So, the whole set of observations is {xi : i = 1, 2, . . . , n}
which can be read as ‘a set of observations xi such that i goes from 1 to n’.
Summation operator, ∑

∑_{i=1}^{n} x_i = x_1 + x_2 + · · · + x_n.   (1.1)

We see that the summation is said to be over i, where i is the index of summation
and the range of i, in (1.1), is from 1 to n. The lower bound of the range is the value of
i written underneath ∑, and the upper bound is written above it. Note that the lower
bound can be any integer (positive, negative or zero), such that the summation is over
all values of the index of summation in step increments of size one from the lower
bound to the upper bound, inclusive.
As stated above, ∑ appears frequently in statistics. For example, in Chapter 2 you will
meet descriptive statistics including the arithmetic mean of observations which is
defined as:

x̄ = (1/n) ∑_{i=1}^{n} x_i.

Rather than write out ∑_{i=1}^{n} x_i in full, when all the x_i s are summed we sometimes write
short-cuts, such as ∑_1^n x_i, or (when the range of summation is obvious) just ∑ x_i.
Note that the resulting sum does not involve i in any form. Hence the sum is unaffected
by (or invariant to) the choice of letter used for the index of summation. Hence, for
¹ Throughout this half course, n will denote a sample size.
example, the following summations are all equal:

∑_{i=1}^{n} x_i = ∑_{j=1}^{n} x_j = ∑_{k=1}^{n} x_k

and:

∑_{i=4}^{5} x_i(x_i − 2) = ∑_{i=4}^{5} (x_i² − 2x_i) = ((−2)² − 2 × (−2)) + (9² − 2 × 9) = 71.
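These properties are easy to verify numerically. A Python sketch (the first dataset below is an illustrative assumption; the final sum uses the worked values x_4 = −2 and x_5 = 9):

```python
x = [3, 1, 4, 1, 5]  # illustrative observations x_1, ..., x_n
n = len(x)

# Arithmetic mean: x_bar = (1/n) * sum over i of x_i
x_bar = sum(x) / n
print(x_bar)  # 14/5 = 2.8

# The sum is invariant to the letter used for the index of summation.
total_i = sum(x[i] for i in range(n))
total_k = sum(x[k] for k in range(n))
print(total_i == total_k)  # True

# The worked sum over i = 4, 5 with x_4 = -2 and x_5 = 9:
values = {4: -2, 5: 9}
print(sum(values[i] * (values[i] - 2) for i in (4, 5)))  # 71
```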
1.10 Graphs
In Chapter 2 you will spend some time learning how to present data in graphical form,
and also in the representation of the normal distribution in Chapter 4. You should make
sure you have understood the following material. If you are taking MT105A
Mathematics 1, you will need to use these ideas there as well.
When a variable y depends on another variable x, we can represent the relationship
mathematically using functions. In general we write this as y = f (x), where f is the
rule which allows us to determine the value of y when we input the value of x. Graphs
are diagrammatic representations of such relationships, using coordinates and axes. The
graph of a function y = f (x) is the set of all points in the plane of the form (x, f (x)).
Sketches of graphs can be very useful. To sketch a graph, we begin with the x-axis and
y-axis as shown in Figure 1.1.
We then plot all points of the form (x, f (x)). Therefore, at x units from the origin (the
point where the axes cross), we plot a point whose height above the x-axis (that is,
whose y coordinate) is f (x), as shown in Figure 1.2.
Joining all points together of the form (x, f (x)) results in a curve (or sometimes a
straight line), which is called the graph of f (x). A typical curve might look like that
shown in Figure 1.3.
1.11 The graph of a linear function
However, you should not imagine that the correct way to sketch a graph is to plot a few
points of the form (x, f (x)) and join them up – this approach rarely works well in
practice and more sophisticated techniques are needed. There are two function types
which you need to know about for this half course:
f(x) = a + bx
and their graphs are straight lines which are characterised by a gradient (or slope), b,
and a y-intercept (where x = 0) at the point (0, a).
A sketch of the function y = 3 + 2x is provided in Figure 1.4, and the function
y = 2 − x is shown in Figure 1.5.
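Tabulating a few points of f(x) = a + bx makes the gradient and y-intercept visible; a short sketch for the two lines shown in Figures 1.4 and 1.5:

```python
def f(x, a, b):
    """Linear function f(x) = a + bx: y-intercept a, gradient b."""
    return a + b * x

# y = 3 + 2x: crosses the y-axis at (0, 3) and rises by 2 per unit of x.
print([f(x, 3, 2) for x in range(-2, 3)])   # [-1, 1, 3, 5, 7]

# y = 2 - x: crosses the y-axis at (0, 2) and falls by 1 per unit of x.
print([f(x, 2, -1) for x in range(-2, 3)])  # [4, 3, 2, 1, 0]
```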
1.12 The role of statistics in the research process
Before we get into details, let us begin with the ‘big picture’. First, some definitions.
Research may be about almost any topic: physics, biology, medicine, economics, history,
literature etc. Most of our examples will be from the social sciences: economics,
management, finance, sociology, political science, psychology etc. Research in this sense
is not just what universities do. Governments, businesses, and all of us as individuals do
it too. Statistics is used in essentially the same way for all of these.
Understanding the gender pay gap: what has competition got to do with it?
Heeding the push from below: how do social movements persuade the rich to
listen to the poor?
We can think of the empirical research process as having five key stages.
2. Research design: deciding what kinds of data to collect, how and from where.
The main job of statistics is the analysis of data, although it also informs other stages
of the research process. Statistics are used when the data are quantitative, i.e. in the
form of numbers.
Statistical analysis of quantitative data has the following features.
It can cope with large volumes of data, in which case the first task is to provide an
understandable summary of the data. This is the job of descriptive statistics.
It can deal with situations where the observed data are regarded as only a part (a
sample) from all the data which could have been obtained (the population). There
is then uncertainty in the conclusions. Measuring this uncertainty is the job of
statistical inference.
We continue with an example of how statistics can be used to help answer a research
question.
Gill, M. and A. Spriggs ‘Assessing the impact of CCTV’, Study 292, Home
Office Research, 2005.
Intervention: CCTV cameras installed in the target area but not in the
control area.
Comparison of measures of crime and the fear of crime in the target and
control areas in the 12 months before and 12 months after the intervention.
Level of crime: the number of crimes recorded by the police, in the 12 months
before and 12 months after the intervention.
• In each area, one sample before the intervention date and one about 12
months after.
• Sample sizes:
Before After
Target area 172 168
Control area 215 242
• Question considered here: ‘In general, how much, if at all, do you worry
that you or other people in your household will be victims of crime?’ (from
1 = ‘all the time’ to 5 = ‘never’).
Statistical analysis of the data.
It is possible to calculate various statistics, for example the Relative Effect Size
RES = ([d]/[c])/([b]/[a]) = 0.98 is a summary measure which compares the
changes in the two areas.
RES < 1, which means that the observed change in the reported fear of crime
was slightly less favourable in the target area than in the control area.
However, there is uncertainty because of sampling: only 168 and 242 individuals
were actually interviewed at each time in each area, respectively.
The confidence interval for RES includes 1, which means that changes in the
self-reported fear of crime in the two areas are ‘not statistically significantly
different’ from each other.
Now the RES > 1, which means that the observed change in the number of
crimes has been worse in the control area than in the target area.
However, the numbers of crimes in each area are fairly small, which means that
these estimates of the changes in crime rates are fairly uncertain.
The confidence interval for RES again includes 1, which means that the changes
in crime rates in the two areas are not statistically significantly different from
each other.
In summary, this study did not support the claim that the introduction of CCTV
reduces crime or the fear of crime.
If you want to read more about research of this question, see Welsh, B.C. and
D.P. Farrington ‘Effects of closed circuit television surveillance on crime’,
Campbell Systematic Reviews 17 2008, pp. 1–73.
Many of the statistical terms and concepts mentioned above have not been explained yet
– that is what the rest of the course is for! However, it serves as an interesting example
of how statistics can be employed in the social sciences to investigate research questions.
1.16 Solutions to Sample examination questions
2. Suppose that y_1 = −2, y_2 = −5, y_3 = 1, y_4 = 16, y_5 = 10, and z_1 = 8, z_2 = −5,
z_3 = 6, z_4 = 4, z_5 = 10. Calculate the following quantities:

(a) ∑_{i=1}^{3} z_i²

(b) ∑_{i=4}^{5} √(y_i z_i)

(c) z_4² + ∑_{i=1}^{3} 1/y_i.
(b) We have:

∑_{i=1}^{2} 1/(x_i y_i) = 1/((−0.2) × (−0.2)) + 1/(2.5 × 8.0) = 25 + 0.05 = 25.05.

(c) We have:

y_4³ + ∑_{i=4}^{5} y_i²/x_i = (−2.0)³ + (−2.0)²/0.8 + 0²/7.4 = −8 + 5 = −3.
2. (a) We have:

∑_{i=1}^{3} z_i² = 8² + (−5)² + 6² = 64 + 25 + 36 = 125.

(b) We have:

∑_{i=4}^{5} √(y_i z_i) = √(16 × 4) + √(10 × 10) = 8 + 10 = 18.

(c) We have:

z_4² + ∑_{i=1}^{3} 1/y_i = 4² + (−1/2 − 1/5 + 1) = 16.3.
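Answers like these can be double-checked mechanically. A Python sketch (lists are 0-indexed, so y[0] holds y_1):

```python
# Data from sample examination question 2.
y = [-2, -5, 1, 16, 10]   # y_1, ..., y_5
z = [8, -5, 6, 4, 10]     # z_1, ..., z_5

# (a) sum over i = 1..3 of z_i squared
part_a = sum(zi**2 for zi in z[:3])
print(part_a)  # 125

# (b) sum over i = 4..5 of sqrt(y_i * z_i)
part_b = sum((y[i] * z[i]) ** 0.5 for i in (3, 4))
print(part_b)  # 18.0

# (c) z_4 squared plus the sum over i = 1..3 of 1/y_i
part_c = z[3]**2 + sum(1 / yi for yi in y[:3])
print(round(part_c, 1))  # 16.3
```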
Chapter 2
Data visualisation and descriptive
statistics
incorporate labels and titles correctly in your diagrams and state the units which
you have used
2.4 Introduction
Both themes considered in this chapter (data visualisation and descriptive statistics)
could be applied to population data, but in most cases (namely here) they are applied to
a sample. The notation would change slightly if a population was being represented.
Most visual representations are very tedious to construct in practice without the aid of
a computer. However, you will understand much more if you try a few by hand (as is
commonly asked in examinations). You should also be aware that spreadsheets do not
always use correct terminology when discussing and labelling graphs. It is important,
once again, to go over this material slowly and make sure you have mastered the basic
statistical definitions introduced here before you proceed to more theoretical ideas.
Discrete variables: These have outcomes you can count. Examples include the
number of passengers on a flight and the number of telephone calls received each
day in a call centre. Observed values for these will be 0, 1, 2, . . . (i.e. non-negative
integers).
Continuous variables: These have outcomes you can measure. Examples include
height, weight and time, all of which can be measured to several decimal places,
and typically have units of measurement (such as metres, kilograms and hours).
Many of the problems for which people use statistics to help them understand and make
decisions involve types of variables which can be measured. When we are dealing with a
continuous variable – for which there is a generally recognised method of determining
its value – we can also call it a measurable variable. The numbers which we then
obtain come ready-equipped with an ordered relation, i.e. we can always tell if two
measurements are equal (to the available accuracy) or if one is greater or less than the
other.
Of course, before we do any sort of data analysis, we need to collect data. Chapter 9
will discuss a range of different techniques which can be employed to obtain a sample.
For now, we just consider some simple examples of situations where data might be
collected, such as a:
pre-election opinion poll asking 1,000 people about their voting intentions
market research survey asking adults how many hours of television they watch per
week
census interviewer asking parents how many of their children are receiving full-time
education (note that a census is the total enumeration of a population, hence this
would not be a sample!).
¹ Note that the word ‘data’ is plural, but is very often used as if it was singular. You will probably
see both forms used when reading widely.
In cases (a) and (b) we are doing simple counts, within a sample, of a single category
– graduates and Party XYZ supporters, respectively – while in cases (c) and (d) we
are looking at some kind of cross-tabulation between two categorical variables – a
scenario which will be considered in Chapter 8.
There is no obvious and generally recognised way of putting political preferences in
order (in the way that we can certainly say that 1 < 2). It is similarly impossible to
rank (as the technical term has it) many other categories of interest: in combatting
discrimination against people, for instance, organisations might want to look at the
effects of gender, religion, nationality, sexual orientation, disability etc. but the
Before we see our first graphical representation you should be aware when reading
articles in newspapers, magazines and even within academic journals, that it is easy to
mislead the reader by careless or poorly-defined diagrams. As such, presenting data
effectively with diagrams requires careful planning.
A good diagram:
• provides a clear summary of the data
• is a fair and honest representation
• highlights underlying patterns
• allows the extraction of a lot of information quickly.
A bad diagram:
• confuses the viewer
• misleads (either accidentally or intentionally).
Advertisers and politicians are notorious for ‘spinning’ data to portray a particular
narrative for their own objectives!
1. Obtain the range of the dataset (the values spanned by the data), and draw a
horizontal line to accommodate this range.
2. Place dots (hence the name ‘dot plot’!) corresponding to the values above the line,
resulting in the empirical distribution.
•
• • •
• • • • • •
• • • • • • • •
11.50 11.60 11.70 11.80 11.90 12.00 12.10 12.20
Instantly, some interesting features emerge from the dot plot which are not
immediately obvious from the raw data. For example, most clerical assistants earn
less than £12 per hour and nobody (in the sample) earns more than £12.20 per hour.
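A dot plot can be generated directly from raw data with a few lines of code. A Python sketch (the wage values below are illustrative, not the dataset behind the plot above):

```python
from collections import Counter

# One printed line per distinct value, with one dot per observation.
wages = [11.60, 11.70, 11.70, 11.80, 11.80, 11.80, 11.90, 12.00]
counts = Counter(wages)
for value in sorted(counts):
    print(f"{value:.2f}  {'•' * counts[value]}")
# 11.60  •
# 11.70  ••
# 11.80  •••
# 11.90  •
# 12.00  •
```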
2.6.3 Histogram
Histograms are excellent diagrams to use when we want to visualise the frequency
distribution of discrete or continuous variables. Our focus will be on how to construct a
density histogram.
Data are first organised into a table which arranges the data into class intervals (also
called bins) – disjoint subdivisions of the total range of values which the variable
takes. Let K denote the number of class intervals. These K class intervals should be
mutually exclusive (meaning they do not overlap, such that each observation belongs to
at most one class interval) and collectively exhaustive (meaning that each observation
belongs to at least one class interval).
Recall that our objective is to represent the distribution of the data. As such, when
choosing K, too many class intervals will dilute the distribution, while too few will
concentrate it (using technical jargon, will tend to degenerate the distribution). Either
way, the pattern of the distribution will be lost – defeating the purpose of the
histogram. As a guide, K = 6 or 7 should be sufficient, but remember to always exercise
common sense!
To each class interval, the corresponding frequency is determined, i.e. the number of
observations of the variable which fall within each class interval. Let fk denote the
frequency of class interval k, and let wk denote the width of class interval k, for
k = 1, 2, . . . , K.
The relative frequency of class interval k is r_k = f_k/n, where n = ∑_{k=1}^{K} f_k is the sample
size, i.e. the sum of all the class interval frequencies.
The density of class interval k is dk = rk /wk , and it is this density which is plotted on
the y-axis (the vertical axis). It is preferable to construct density histograms only if
each class interval has the same width.
Example 2.3 Consider the weekly production output of a factory over a 50-week
period (you can choose what the manufactured good is!). Note that this is a discrete
variable since the output will take integer values, i.e. something which we can count.
The data are (in ascending order for convenience):
350 354 354 358 358 359 360 360 362 362
363 364 365 365 365 368 371 372 372 379
381 382 383 385 392 393 395 396 396 398
402 404 406 410 420 437 438 441 444 445
450 451 453 454 456 458 459 460 467 469
We construct the following table, noting that a square bracket ‘[’ includes the class
interval endpoint, while a round bracket ‘)’ excludes the class interval endpoint.
Note that here we have K = 7 class intervals each of width 20, i.e. wk = 20 for
k = 1, 2, . . . , 7. From the raw data, check to see how each of the frequencies, fk , has
been obtained. For example, f1 = 6 represents the first six observations (350, 354,
354, 358, 358 and 359).
We have n = 50, hence the relative frequencies are r_k = f_k/50 for k = 1, 2, . . . , 7. For
example, r_1 = f_1/n = 6/50 = 0.12. The density values can then be calculated. For
example, d_1 = r_1/w_1 = 0.12/20 = 0.006.
The table above includes an additional column of ‘Cumulative frequency’, which is
obtained by simply determining the running total of the class frequencies (for
example, the cumulative frequency up to the second class interval is 6 + 14 = 20).
Note the final column is not required to construct a density histogram, although the
computation of cumulative frequencies may be useful when determining medians and
quartiles (to be discussed later in this chapter).
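As an illustration (a sketch, not part of the guide's materials), the whole table can be built from the raw data in a few lines of Python; the class interval endpoints below are an assumption consistent with the frequencies quoted in the text:

```python
# Building the grouped-frequency table of Example 2.3.
output = [
    350, 354, 354, 358, 358, 359, 360, 360, 362, 362,
    363, 364, 365, 365, 365, 368, 371, 372, 372, 379,
    381, 382, 383, 385, 392, 393, 395, 396, 396, 398,
    402, 404, 406, 410, 420, 437, 438, 441, 444, 445,
    450, 451, 453, 454, 456, 458, 459, 460, 467, 469,
]
n = len(output)                      # sample size, n = 50
edges = list(range(340, 481, 20))    # K = 7 intervals of width w_k = 20

rows, cumulative = [], 0
for lo, hi in zip(edges[:-1], edges[1:]):
    f = sum(lo <= x < hi for x in output)   # frequency f_k: '[' includes, ')' excludes
    r = f / n                               # relative frequency r_k = f_k / n
    d = r / (hi - lo)                       # density d_k = r_k / w_k
    cumulative += f                         # running total of class frequencies
    rows.append((f"[{lo}, {hi})", f, r, d, cumulative))

for row in rows:
    print(row)
```

The frequencies produced by this loop can be checked directly against the raw data, as suggested above.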
To construct the histogram, adjacent bars are drawn over the respective class
intervals such that the histogram has a total area of one. The histogram for the
above example is shown in Figure 2.1.
Figure 2.1: Density histogram of weekly production output for Example 2.3.
Example 2.4 Continuing with Example 2.3, the stem-and-leaf diagram is:
Note the informative title and labels for the stems and leaves.
For the stem-and-leaf diagram in Example 2.4, note the following points.
These stems are formed of the ‘10s’ part of the observations.
Leaves are vertically aligned, hence rotating the stem-and-leaf diagram 90 degrees
anti-clockwise reproduces the shape of the data’s distribution, similar to what
would be revealed with a density histogram.
The leaves are placed in ascending order within the stems, so it is a good idea to
sort the raw data into ascending order first of all (fortunately the raw data in
Example 2.3 were already arranged in ascending order, but for other datasets this
may not be the case).
Unlike the histogram, the actual data values are preserved. This is advantageous if
we want to calculate various descriptive statistics later on.
Measures of location – a central point about which the data tend (also known
as measures of central tendency).
2.7 Measures of location
In what follows, we will refer to the following small dataset of n = 10 observations:

32, 28, 67, 39, 19, 48, 32, 44, 37 and 24. (2.1)
2.7.1 Mean
The preferred measure of location/central tendency, which is simply the ‘average’ of the
data. It will be frequently applied in various statistical inference techniques in later
chapters.
(Sample) mean
Using the summation operator, Σ, which remember is just a form of ‘notational shorthand’, we define the sample mean, x̄, as:

x̄ = (1/n) Σ_{i=1}^{n} xi = (x1 + x2 + · · · + xn) / n.
To note, the notation x̄ will be used to denote an observed sample mean for a sample
dataset, while µ will denote its population counterpart, i.e. the population mean.
Of course, it is possible to encounter datasets in frequency form, that is each data value
is given with the corresponding frequency of observations for that value, fk , for
k = 1, 2, . . . , K, where there are K different variable values. In such a situation, use the
formula:
x̄ = (Σ_{k=1}^{K} fk xk) / (Σ_{k=1}^{K} fk).    (2.2)
Note that this preserves the idea of ‘adding up all the observations and dividing by the
total number of observations’. This is an example of a weighted mean, where the weights
are the relative frequencies (as seen in the construction of density histograms).
² These three measures can be the same in special cases, such as the normal distribution (introduced in Chapter 4) which is symmetric about the mean (and so mean = median) and achieves a maximum at this point, i.e. mean = median = mode.
If the data are given in grouped-frequency form, such as that shown in the table in Example 2.3, then the individual data values are unknown³ – all we know is the class interval in which each observation lies. The sensible solution is to use the midpoint of the interval as a proxy for each observation recorded as belonging within that class interval. Hence you still use the grouped-frequency mean formula (2.2), but each xk value will be substituted with the appropriate class interval midpoint.
Example 2.6 Using the weekly production data in Example 2.3, the interval midpoints are: 350, 370, 390, 410, 430, 450 and 470, respectively. These will act as the data values for the respective class intervals. The mean is then calculated as:

x̄ = (Σ_{k=1}^{7} fk xk) / (Σ_{k=1}^{7} fk) = ((6 × 350) + (14 × 370) + · · · + (3 × 470)) / (6 + 14 + · · · + 3) = 400.4.
Compared to the true mean of the raw data (which is 399.72), we see that using the
midpoints as proxies gives a mean very close to the true sample mean value. Note
the mean is not rounded up or down since it is an arithmetic result.
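The calculation in Example 2.6 can be checked with a short sketch (not part of the guide); the frequencies and midpoints are read from the grouped-frequency table of Example 2.3:

```python
# Grouped-frequency mean using class interval midpoints as proxies.
freqs = [6, 14, 10, 4, 3, 10, 3]                  # f_k
midpoints = [350, 370, 390, 410, 430, 450, 470]   # class interval midpoints x_k

grouped_mean = sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)
print(grouped_mean)   # 400.4, close to the true sample mean of 399.72
```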
A drawback with the mean is its sensitivity to outliers, i.e. extreme observations. For example, suppose we record the net worth of 10 randomly chosen people. If Elon Musk (one of the world's richest people at the time of writing), say, were included, his substantial net worth would pull the mean upward considerably! Increasing the sample size n would dilute the effect of his inclusion, but it would still be non-negligible, assuming we were not just sampling from the population of billionaires!
2.7.2 Median
The (sample) median, m, is the middle value of the ordered dataset, where observations
are arranged in ascending order. By definition, 50 per cent of the observations are
greater than or equal to the median, and 50 per cent are less than or equal to the
median.
(Sample) median
Arrange the n numbers in ascending order, x(1) , x(2) , . . . , x(n) , (known as the order
statistics, such that x(1) is the first order statistic, i.e. the smallest observed value,
and x(n) is the nth order statistic, i.e. the largest observed value), then the sample
median, m, depends on whether the sample size is odd or even. If:

• n is odd, then there is an explicit middle value, hence m = x((n+1)/2)

• n is even, then there is no explicit middle value, so take the average of the values either side of the ‘midpoint’, hence m = (x(n/2) + x(n/2+1) )/2.
³ Of course, we do have the raw data for the weekly production output and so we could work out the exact sample mean, but here suppose we did not have access to the raw data, and instead we were just given the table of class interval frequencies as shown in Example 2.3.
Example 2.7 For the dataset in (2.1), the ordered observations are:

19, 24, 28, 32, 32, 37, 39, 44, 48 and 67.

Here n = 10, i.e. there is an even number of observations, so we compute the average of the fifth and sixth ordered observations, that is:

m = (x(n/2) + x(n/2+1) )/2 = (x(5) + x(6) )/2 = (32 + 37)/2 = 34.5.
If we only had data in grouped-frequency form (as in Example 2.3), then we can make
use of the cumulative frequencies. Since n = 50, the median is the 25.5th ordered
observation which must lie in the [380, 400) class interval because once we exhaust the
ordered data up to the [360, 380) class interval we have only accounted for the smallest
20 observations, while once the [380, 400) class interval is exhausted we have accounted
for the smallest 30 observations, meaning the median must lie in this class interval.
Assuming the raw data are not accessible, we could use the midpoint (i.e. 390) as
denoting the median. Alternatively, we could use an interpolation method which uses
the following ‘general’ formula for grouped data, once you have identified the class
which includes the median (such as [380, 400) above):
endpoint of previous bin + (bin width × number of remaining observations) / bin frequency.
Example 2.8 Returning to the weekly production output data from Example 2.3,
the median would be:
380 + (20 × (25.5 − 20)) / 10 = 391.
For comparison, using the raw data, x(25) = 392 and x(26) = 393, gives the ‘true’
sample median of 392.5.
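The interpolation formula above translates directly into code. A sketch (the function and argument names are illustrative, not from the guide), using the figures of Example 2.8:

```python
# Grouped-data interpolation for the median.
def interpolated_median(lower_endpoint, width, median_position,
                        cum_freq_before, class_freq):
    # endpoint of previous bin + width * remaining observations / bin frequency
    return lower_endpoint + width * (median_position - cum_freq_before) / class_freq

# [380, 400) contains the median: 20 observations lie below it, it holds 10.
m = interpolated_median(380, 20, 25.5, 20, 10)
print(m)   # 391.0
```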
Skewness
Figure 2.2: Examples of a positively-skewed distribution and a negatively-skewed distribution.
Graphically, skewness can be determined by identifying where the long ‘tail’ of the
distribution lies. If the long tail is heading toward +∞ (positive infinity) on the x-axis
(i.e. on the right-hand side), then this indicates a positively-skewed (right-skewed)
distribution. Similarly, if the long tail is heading toward −∞ (negative infinity) on the
x-axis (i.e. on the left-hand side) then this indicates a negatively-skewed (left-skewed)
distribution, as illustrated in Figure 2.2.
Example 2.9 The hourly wage rates used in Example 2.2 are skewed to the right,
due to the influence of the relatively large values 12.00, 12.10, 12.10 and 12.20. The
effect of these (similar to Elon Musk’s effect mentioned above, albeit far less extreme
here) is to ‘drag’ or ‘pull’ the mean upward, hence mean > median.
Example 2.10 For the weekly production output data in Example 2.3, we have
calculated the mean and median to be 399.72 and 392.50, respectively. Since the
mean is greater than the median, the data form a positively-skewed distribution, as
confirmed by the histogram in Figure 2.1.
2.7.3 Mode
Our final measure of location is the mode.
(Sample) mode

The (sample) mode is the value of the variable which occurs most frequently in the dataset.
Example 2.11 The modal value of the dataset in (2.1) is 32, since it occurs twice while the other values only occur once each.
Example 2.12 For the weekly production output data in Example 2.3, looking at
the stem-and-leaf diagram in Example 2.4, we can quickly see that 365 is the modal
value (the three consecutive 5s opposite the second stem stand out). If just given
grouped frequency data, then instead of reporting a modal value we can determine
the modal class interval, which is [360, 380) with 14 observations. (The fact that this
includes 365 here is a coincidence – the modal class interval and modal value are not
equivalent.)
2.8 Measures of dispersion

2.8.1 Range
Our first measure of spread is the range.
Range
The range is the largest value minus the smallest value, that is:

range = x(n) − x(1) .
Clearly, the range is very sensitive to extreme observations since (when they occur) they
are going to be the smallest and/or largest observations (x(1) and/or x(n) , respectively),
and so this measure is of limited appeal. If we were confident that no outliers were
present (or decided to remove any outliers), then the range would better represent the
true spread of the data.
However, the range motivates our consideration of the interquartile range (IQR) instead.
The IQR is the upper (third) quartile, Q3 , minus the lower (first) quartile, Q1 . The
upper quartile divides ordered data into the bottom 75% and the top
25%, while the lower quartile divides ordered data into the bottom 25% and the top
75%. Unsurprisingly the median, given our earlier definition, is the middle (second)
quartile, i.e. m = Q2 . By discarding the top 25% and bottom 25% of observations,
respectively, we restrict attention solely to the central 50% of observations.
Interquartile range
IQR = Q3 − Q1
where Q3 and Q1 are the third (upper) and first (lower) quartiles, respectively.
Example 2.14 Continuing with the dataset in (2.1), computation of the quartiles
can be problematic since, for example, for the lower quartile we require the value
such that the smallest 2.5 observations are below it and the largest 7.5 observations
are above it. A suggested approach (motivated by the median calculation when n is
even) is to use:
Q1 = (x(2) + x(3) )/2 = (24 + 28)/2 = 26.
Similarly:
Q3 = (x(7) + x(8) )/2 = (39 + 44)/2 = 41.5.
Hence IQR = Q3 − Q1 = 41.5 − 26 = 15.5. Contrast this with the range of 48
(derived in Example 2.13) which is much larger due to the effects of x(1) and x(n) .
There are many different methodologies for computing quartiles, and conventions vary
from country to country, from textbook to textbook, and even from software package to
software package! Any reasonable approach is perfectly acceptable in the examination.
For example, interpolation methods, as demonstrated previously for the case of the
median, are valid. The approach shown in Example 2.14 is the simplest, and so it is
recommended.
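The simple convention of Example 2.14 can be sketched as follows (one of the many quartile conventions mentioned above, not a definitive rule):

```python
# Quartiles and IQR for dataset (2.1), averaging the order statistics
# either side of positions n/4 = 2.5 and 3n/4 = 7.5.
data = sorted([32, 28, 67, 39, 19, 48, 32, 44, 37, 24])   # dataset (2.1)

q1 = (data[1] + data[2]) / 2    # (x_(2) + x_(3)) / 2, zero-based indexing
q3 = (data[6] + data[7]) / 2    # (x_(7) + x_(8)) / 2
iqr = q3 - q1
print(q1, q3, iqr)   # 26.0 41.5 15.5
```

Note that built-in library routines (such as those in NumPy or the `statistics` module) may use a different convention and return slightly different values.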
2.8.2 Boxplot
At this point, it is useful to introduce another graphical method, the boxplot, also
known as a box-and-whisker plot, no prizes for guessing why!
In a boxplot, the middle horizontal line is the median and the upper and lower ends of
the box are the upper and lower quartiles, respectively. The whiskers extend from the
box to the most extreme data points within 1.5 times the IQR from the quartiles. Any
data points beyond the whiskers are considered outliers and are plotted individually.
Sometimes we distinguish between outliers and extreme outliers, with the latter plotted
using a different symbol. An example of a (generic) boxplot is shown in Figure 2.3.
If you are presented with a boxplot, then it is easy to obtain all of the following: the
median, quartiles, IQR, range and skewness. Recall that skewness (the departure from
symmetry) is characterised by a long tail, attributable to outliers, which are readily
apparent from a boxplot.
Figure 2.3: A generic boxplot. The middle line marks Q2 (the median), and the box extends from Q1 to Q3 , so 50% of cases have values within the box.
Example 2.15 From the boxplot shown in Figure 2.4, it can be seen that the
median, Q2 , is around 74, Q1 is about 63, and Q3 is approximately 77. The many
outliers provide a useful indicator that this is a negatively-skewed distribution as the
long tail covers lower values of the variable. Note also that Q3 − Q2 < Q2 − Q1 ,
which tends to indicate negative skewness.
The variance and standard deviation are much better and more useful statistics for
representing the dispersion of a dataset. You need to be familiar with their definitions
and methods of calculation for a sample of data values x1 , x2 , . . . , xn .
Begin by computing the so-called ‘corrected sum of squares’, Sxx , the sum of the
squared deviations of each data value from the (sample) mean, where:
Sxx = Σ_{i=1}^{n} (xi − x̄)² = Σ_{i=1}^{n} xi² − n x̄².    (2.3)
Recall from earlier that x̄ = Σ_{i=1}^{n} xi / n. To see why (2.3) holds:

Sxx = Σ_{i=1}^{n} (xi − x̄)²
    = Σ_{i=1}^{n} (xi² − 2x̄xi + x̄²)                    (expansion of quadratic)
    = Σ_{i=1}^{n} xi² − Σ_{i=1}^{n} 2x̄xi + Σ_{i=1}^{n} x̄²    (separating into three summations)
    = Σ_{i=1}^{n} xi² − 2x̄ Σ_{i=1}^{n} xi + nx̄²        (noting that x̄ is a constant added n times)
    = Σ_{i=1}^{n} xi² − 2nx̄² + nx̄²                     (substituting in nx̄)
    = Σ_{i=1}^{n} xi² − nx̄²                            (simplifying)

which uses the fact that x̄ = Σ_{i=1}^{n} xi / n, and so Σ_{i=1}^{n} xi = nx̄.
We now define the sample variance.

Sample variance

The sample variance, s², is:

s² = Sxx / (n − 1) = (1/(n − 1)) (Σ_{i=1}^{n} xi² − n x̄²).
Note the divisor used to compute s² is n − 1, not n. Do not worry about why (this is covered in ST104B Statistics 2) – just remember to divide by n − 1 when computing a sample variance.⁴ To obtain the sample standard deviation, s, we just take the (positive) square root of the sample variance, s².

Sample standard deviation

The sample standard deviation is s = √s².
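Both forms of Sxx in (2.3), and the n − 1 divisor, can be verified numerically. A sketch using dataset (2.1) (an illustration, not part of the guide):

```python
# Verifying that the two forms of S_xx agree, then computing the sample
# variance and standard deviation of dataset (2.1).
import math

data = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]
n = len(data)
xbar = sum(data) / n                                      # sample mean, 37.0

sxx_deviations = sum((x - xbar) ** 2 for x in data)       # sum of squared deviations
sxx_shortcut = sum(x ** 2 for x in data) - n * xbar ** 2  # computational form in (2.3)

s2 = sxx_deviations / (n - 1)    # divide by n - 1, not n
s = math.sqrt(s2)                # positive square root
print(round(s2, 4), round(s, 4))
```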
When data are given in grouped-frequency form, the sample variance is calculated as
follows.
For grouped-frequency data with K classes, to compute the sample variance we use
the formula:
s² = (Σ_{k=1}^{K} fk (xk − x̄)²) / (Σ_{k=1}^{K} fk) = (Σ_{k=1}^{K} fk xk²) / (Σ_{k=1}^{K} fk) − ((Σ_{k=1}^{K} fk xk) / (Σ_{k=1}^{K} fk))².
Recall that the last bracketed squared term is simply the mean formula for grouped data
shown in (2.2). Note that for grouped-frequency data we can ignore the ‘divide by n − 1’
rule, since we would expect n to be very large in such cases, such that n − 1 ≈ n and so dividing by n or n − 1 makes negligible difference in practice, noting that Σ_{k=1}^{K} fk = n.
⁴ In contrast, for population data the population variance is σ² = Σ_{i=1}^{N} (xi − µ)² / N, i.e. we use the N divisor here, where N denotes the population size while n denotes the sample size. Also, note the use of µ (the population mean) instead of x̄ (the sample mean).
or, alternatively, [120, 130), [130, 140) etc. We now proceed to determine the density
values to plot (and cumulative frequencies, for later). We construct the following
table:
Class interval | Interval width, wk | Frequency, fk | Relative frequency, rk = fk/n | Density, dk = rk/wk | Midpoint, xk | fk xk | fk xk²
[120, 130) | 10 | 1 | 0.0345 | 0.00345 | 125 | 125 | 15,625
[130, 140) | 10 | 4 | 0.1379 | 0.01379 | 135 | 540 | 72,900
[140, 150) | 10 | 5 | 0.1724 | 0.01724 | 145 | 725 | 105,125
[150, 160) | 10 | 6 | 0.2069 | 0.02069 | 155 | 930 | 144,150
[160, 170) | 10 | 7 | 0.2414 | 0.02414 | 165 | 1,155 | 190,575
[170, 180) | 10 | 5 | 0.1724 | 0.01724 | 175 | 875 | 153,125
[180, 190) | 10 | 1 | 0.0345 | 0.00345 | 185 | 185 | 34,225
Total | | 29 | | | | 4,535 | 715,725
Figure 2.5: Density histogram of trading volume data for Example 2.17.
2.9 Test your understanding

Let us now consider an extended example bringing together many of the issues considered in this chapter.
At a time of economic growth but political uncertainty, a random sample of n = 40
economists (from the population of all economists) produces the following forecasts for
the growth rate of an economy in the next year:
1.3 3.8 4.1 2.6 2.4 2.2 3.4 5.1 1.8 2.7
3.1 2.3 3.7 2.5 4.1 4.7 2.2 1.9 3.6 2.8
4.3 3.1 4.2 4.6 3.4 3.9 2.9 1.9 3.3 8.2
5.4 3.3 4.5 5.2 3.1 2.5 3.3 3.4 4.4 5.2
Solution:
(a) It would be sensible to have class interval widths of 1 unit, which conveniently
makes the density values the same as the relative frequencies! We construct the
following table and plot the density histogram.
Class interval | Interval width, wk | Frequency, fk | Relative frequency, rk = fk/n | Density, dk = rk/wk
[1.0, 2.0) | 1 | 4 | 0.100 | 0.100
[2.0, 3.0) | 1 | 10 | 0.250 | 0.250
[3.0, 4.0) | 1 | 13 | 0.325 | 0.325
[4.0, 5.0) | 1 | 8 | 0.200 | 0.200
[5.0, 6.0) | 1 | 4 | 0.100 | 0.100
[6.0, 7.0) | 1 | 0 | 0.000 | 0.000
[7.0, 8.0) | 1 | 0 | 0.000 | 0.000
[8.0, 9.0) | 1 | 1 | 0.025 | 0.025
Note that we still show the ‘6’ and ‘7’ stems even though they have no
corresponding leaves. If we omitted these stems (so that the ‘8’ stem is immediately
below the ‘5’ stem) then this would distort the true shape of the sample
distribution, which would be misleading.
(c) The density histogram and stem-and-leaf diagram show that the data are
positively-skewed (skewed to the right), due to the outlier forecast of 8.2%.
Note if you are ever asked to comment on the shape of a distribution, consider:
• Is the distribution (roughly) symmetric?
• Is the distribution bimodal?
• Is the distribution skewed (an elongated tail in one direction)? If so, what is
the direction of the skewness?
• Are there any outliers?
43
2. Data visualisation and descriptive statistics
(d) There are n = 40 observations, so the median is the average of the 20th and 21st
ordered observations. Using the stem-and-leaf diagram in part (b), we see that
x(20) = 3.3 and x(21) = 3.4. Therefore, the median is (3.3 + 3.4)/2 = 3.35%.
(e) Since Q2 is the median, which is 3.35, we now need the first and third quartiles, Q1
and Q3 , respectively. There are several methods for determining the quartiles, and
any reasonable approach would be acceptable in an examination. For simplicity,
here we will use the following since n is divisible by 4:
Q1 = x(n/4) = x(10) = 2.5% and Q3 = x(3n/4) = x(30) = 4.2%.
Hence the interquartile range (IQR) is Q3 − Q1 = 4.2 − 2.5 = 1.7%. Therefore, the
whisker limits must satisfy:
max(x(1) , Q1 − 1.5 × IQR) and min(x(n) , Q3 + 1.5 × IQR)
which is:
max(1.3, −0.05) = 1.30 and min(8.2, 6.75) = 6.75.
We see that there is just a single observation which lies outside the interval
[1.30, 6.75], which is x(40) = 8.2% and hence this is plotted individually in the
boxplot. Since this is less than Q3 + 3 × IQR = 4.2 + 3 × 1.7 = 9.3%, then this
observation is an outlier, rather than an extreme outlier.
The boxplot is (a horizontal orientation is also fine):
Note that the upper whisker terminates at 5.4, which is the most extreme data
point within 1.5 times the IQR above Q3 , i.e. the maximum value no larger than
6.75% as easily seen from the stem-and-leaf diagram in part (b). The lower whisker
terminates at x(1) = 1.3%, since the minimum value of the dataset is within 1.5
times the IQR below Q1 .
It is important to note that boxplot conventions may vary, and some software or
implementations might use slightly different methods for calculating whiskers.
Additionally, different multipliers (other than 1.5) might be used in practice
depending on the desired sensitivity to outliers.
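The whisker rule described in part (e) can be sketched in code (an illustration, not part of the guide), using the quartiles computed above:

```python
# Whisker limits and outliers for the n = 40 growth forecasts (1.5 x IQR rule).
forecasts = [
    1.3, 3.8, 4.1, 2.6, 2.4, 2.2, 3.4, 5.1, 1.8, 2.7,
    3.1, 2.3, 3.7, 2.5, 4.1, 4.7, 2.2, 1.9, 3.6, 2.8,
    4.3, 3.1, 4.2, 4.6, 3.4, 3.9, 2.9, 1.9, 3.3, 8.2,
    5.4, 3.3, 4.5, 5.2, 3.1, 2.5, 3.3, 3.4, 4.4, 5.2,
]
q1, q3 = 2.5, 4.2                     # quartiles from part (e)
iqr = q3 - q1                         # 1.7

lower_limit = q1 - 1.5 * iqr          # -0.05
upper_limit = q3 + 1.5 * iqr          # 6.75
lower_whisker = min(x for x in forecasts if x >= lower_limit)
upper_whisker = max(x for x in forecasts if x <= upper_limit)
outliers = [x for x in forecasts if x < lower_limit or x > upper_limit]
print(lower_whisker, upper_whisker, outliers)   # 1.3 5.4 [8.2]
```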
(f) We have sample data, not population data, hence the (sample) mean is denoted by
x̄ and the (sample) standard deviation is denoted by s. We have:
x̄ = (1/n) Σ_{i=1}^{n} xi = 140.4/40 = 3.51%

and:

s² = (1/(n − 1)) (Σ_{i=1}^{n} xi² − n x̄²) = (1/39) (557.26 − 40 × (3.51)²) = 1.6527.

Therefore, the standard deviation is s = √1.6527 = 1.29%.
(g) In (c) it was concluded that the density histogram and stem-and-leaf diagram of
the data were positively-skewed, and this is consistent with the mean being larger
than the median. It is possible to quantify skewness, although this is beyond the
scope of the syllabus.
(h) We calculate x̄ ± s = 3.51 ± 1.29, giving the interval [2.22, 4.80], and also x̄ ± 2s = 3.51 ± 2.58, giving the interval [0.93, 6.09].
Now we use the stem-and-leaf diagram to see that 29 observations are between 2.22
and 4.80 (i.e. the interval [2.22, 4.80]), and 39 observations are between 0.93 and
6.09 (i.e. the interval [0.93, 6.09]). So the proportion (or percentage) of the data in
each interval, respectively, is:
29/40 = 0.725 = 72.5%  and  39/40 = 0.975 = 97.5%.
• Many ‘bell-shaped’ distributions we meet – that is, distributions which look a bit
like the normal distribution (introduced in Chapter 4) – have the property that
68% of the data lie within approximately one standard deviation of the mean, and
95% of the data lie within approximately two standard deviations of the mean. The
percentages in (h) are fairly similar to these.
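The counts in (h) can be reproduced directly from the data (a sketch, not part of the guide), using the rounded mean and standard deviation from part (f):

```python
# Proportions of forecasts within one and two standard deviations of the mean.
forecasts = [
    1.3, 3.8, 4.1, 2.6, 2.4, 2.2, 3.4, 5.1, 1.8, 2.7,
    3.1, 2.3, 3.7, 2.5, 4.1, 4.7, 2.2, 1.9, 3.6, 2.8,
    4.3, 3.1, 4.2, 4.6, 3.4, 3.9, 2.9, 1.9, 3.3, 8.2,
    5.4, 3.3, 4.5, 5.2, 3.1, 2.5, 3.3, 3.4, 4.4, 5.2,
]
xbar, s = 3.51, 1.29

within_1sd = sum(xbar - s <= x <= xbar + s for x in forecasts)        # [2.22, 4.80]
within_2sd = sum(xbar - 2*s <= x <= xbar + 2*s for x in forecasts)    # [0.93, 6.09]
print(within_1sd / len(forecasts), within_2sd / len(forecasts))   # 0.725 0.975
```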
• The exercise illustrates the importance of working with (at least) one more decimal place than in the original data. If we had used 3.5% and 1.3% for the mean and standard deviation,
respectively, the ‘boundaries’ for the interval with one standard deviation would
have been 3.5 ± 1.3 ⇒ [2.2, 4.8]. Since 2.2 is a data value which appears twice, we
would have had to worry about which side of the ‘boundary’ to allocate these.
(This type of issue can still happen with the extra decimal place, but much less
frequently.)
• too few class intervals (which is the same as too wide class intervals)
• too many class intervals (which is the same as too narrow class intervals).
For example, with too many class intervals, you mainly get 0, 1 or 2 items per class interval, so any (true) peak is hidden by the subdivisions which you have used.
• The best number of (equal-sized) class intervals depends on the sample size. For
large samples, many class intervals will not lose the pattern, while for small
samples they will. However, with the datasets which tend to crop up in ST104A
Statistics 1, somewhere between 6 and 10 class intervals are likely to work well.
2. The data below contain measurements of the low-density lipoproteins, also known
as the ‘bad’ cholesterol, in the blood of 30 patients. Data are measured in
milligrams per decilitres (mg/dL).
95 96 96 98 99
99 101 101 102 102
103 104 104 107 107
111 112 113 113 114
115 117 121 123 124
127 129 131 135 143
3. The average daily intakes of calories, measured in kcals, for a random sample of 12
athletes were:
(a) Construct a boxplot of the data. (The boxplot does not need to be exactly to
scale, but values of box properties and whiskers should be clearly labelled.)
(b) Based on the shape of the boxplot you have drawn, describe the distribution of
the data.
(c) Name two other types of graphical displays which would be suitable to
represent the data. Briefly explain your choices.
2.13 Solutions to Sample examination questions

2. (a) We have:
Class interval | Interval width, wk | Frequency, fk | Relative frequency, rk = fk/n | Density, dk = rk/wk
[90, 100) | 10 | 6 | 0.200 | 0.0200
[100, 110) | 10 | 9 | 0.300 | 0.0300
[110, 120) | 10 | 7 | 0.233 | 0.0233
[120, 130) | 10 | 5 | 0.167 | 0.0167
[130, 140) | 10 | 2 | 0.067 | 0.0067
[140, 150) | 10 | 1 | 0.033 | 0.0033
(b) We have:

x̄ = 3,342/30 = 111.4 mg/dL

and:

s = √((1/29) × (377,076 − 30 × (111.4)²)) = 12.83 mg/dL.
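These figures can be reproduced from the raw cholesterol measurements (a sketch, not part of the guide):

```python
# Sample mean and standard deviation of the LDL data (mg/dL).
import math

ldl = [
    95, 96, 96, 98, 99, 99, 101, 101, 102, 102,
    103, 104, 104, 107, 107, 111, 112, 113, 113, 114,
    115, 117, 121, 123, 124, 127, 129, 131, 135, 143,
]
n = len(ldl)
xbar = sum(ldl) / n                                          # 3,342 / 30
s = math.sqrt((sum(x**2 for x in ldl) - n * xbar**2) / (n - 1))
print(round(xbar, 1), round(s, 2))   # 111.4 12.83
```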
(c) The data exhibit positive skewness, as shown by the mean being greater than
the median.
Note that no label of the x-axis is necessary and that the plot can be
transposed.
(b) Based on the shape of the boxplot above, we can see that the distribution of
the data is positively skewed, equivalently skewed to the right, due to the
presence of the outlier of 3,061 kcals.
(c) A density histogram, stem-and-leaf diagram or a dot plot are other types of
suitable graphical displays. The reason is that the variable is measurable and
these graphs are suitable for displaying the distribution of such variables.
Chapter 3
Probability theory
3.1 Synopsis of chapter
The world is full of unpredictability. Will a country’s economic growth increase or
decrease next year? Will artificial intelligence replace the majority of human jobs in a
particular sector in the next decade? What will be the cost of borrowing next month?
These are some instances of uncertainty. While we can anticipate potential outcomes
which could happen (like economic growth increasing, decreasing, or no change), we do
not know with certainty in advance what will happen. Probability allows us to model
uncertainty and in this chapter we explore probability theory.
In other courses, particularly in ST104B Statistics 2 and ST2187 Business
analytics, applied modelling and prediction, you will make full use of probability
in both theory and in decision trees, and highlight the ways in which such information
can be used. We will look at probability at quite a superficial level in this half course.
Even so, you may find that, although the concepts of probability introduced are simple,
their application in particular circumstances may be very difficult.
• apply the ideas and notation used for sets in simple examples.
3.4 Introduction
Chance is what makes life worth living – if everything was known in advance, imagine
the disappointment! If we had perfect information about the future, as well as the
present and the past, there would be no need to consider the concepts of probability.
However, it is usually the case that uncertainty cannot be eliminated and hence its presence should be recognised and attempts made to quantify it.
Probability theory is used to determine how likely various events are to occur, such
as:
• the likelihood of selecting a candidate with specific skills during the hiring process

• the possibility of a project meeting its deadlines and milestones based on historical data.
Sample space
We define the sample space, S, as the set of all possible outcomes of an experiment.
1. Coin toss: S = {H, T }, where H and T denote ‘heads’ and ‘tails’, respectively,
and are called the elements or members of the sample space.
So the coin toss sample space has two elementary outcomes, H and T , while the score
on a die has six elementary outcomes. These individual elementary outcomes are
themselves events, but we may wish to consider slightly more exciting events of interest.
For example, for the die score, we may be interested in the event of obtaining an even
score, or a score greater than 4 etc. Hence we proceed to define an event.
Event
An event is a collection of elementary outcomes from the sample space S of an experiment, i.e. a subset of S.
Typically, we can denote events by letters for brevity. For example, A = ‘an even score’,
and B = ‘a score greater than 4’. Hence A = {2, 4, 6} and B = {5, 6}.
The universal convention is that we define probability to lie on a scale from 0 to 1
inclusive. We could, of course, multiply by 100% to express a probability as a
percentage. This means that the probability of any event A is denoted P (A) and is a
real number somewhere in the unit interval, i.e. we always have that:
0 ≤ P (A) ≤ 1.
Note the following.
So, we have a probability scale from 0 to 1 on which we are able to rank events, as
evident from the P (A) < P (B) result above. However, we need to consider how best to
quantify these probabilities. Let us begin with the experiments where each elementary
outcome is equally likely, hence our (fair) coin toss and (fair) die score fulfil this
criterion (conveniently).
Let N be the total number of equally likely elementary outcomes of the experiment, and let n be the number of these elementary outcomes which are favourable to our event of interest, A. Then:

P(A) = n/N.
1. For the coin toss, if A is the event ‘heads’, then N = 2 (H and T ) and n = 1
(H). So, for a fair coin, P (A) = 1/2 = 0.50.
53
3. Probability theory
2. For the die score, if A is the event ‘an even score’, then N = 6 (1, 2, 3, 4, 5 and
6) and n = 3 (2, 4 and 6). So, for a fair die, P (A) = 3/6 = 1/2 = 0.5.
Finally, if B is the event ‘score greater than 4’, then N = 6 (as before) and
n = 2 (5 and 6). Hence P (B) = 2/6 = 1/3.
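The ‘equally likely outcomes’ rule P(A) = n/N can be sketched by enumeration (an illustration, not part of the guide):

```python
# Classical probability for a fair die: count favourable outcomes.
sample_space = {1, 2, 3, 4, 5, 6}                  # N = 6 equally likely scores
A = {x for x in sample_space if x % 2 == 0}        # 'an even score'
B = {x for x in sample_space if x > 4}             # 'a score greater than 4'

p_A = len(A) / len(sample_space)                   # n / N
p_B = len(B) / len(sample_space)
print(p_A, p_B)   # 0.5 0.3333333333333333
```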
Suppose the event A associated with some experiment either does or does not occur.
Also, suppose we conduct this experiment independently F times. (‘Independent’ is an
important term, discussed later.) Suppose that, following these repeated experiments,
the event A occurs f times.
The ‘frequentist’ approach to probability would regard:

P(A) = f/F as F → ∞.
Example 3.4 For a coin toss with event A = {H}, if the coin is fair we would
expect that repeatedly tossing the coin F times would result in approximately
f = F/2 heads, hence P (A) = (F/2)/F = 1/2. Of course, this approach is not
confined to fair coins!
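The frequentist idea can be illustrated by simulation (a sketch, not part of the guide; the seed is arbitrary): as the number of trials F grows, the relative frequency f/F settles towards P(A).

```python
# Monte Carlo illustration of f/F approaching P(A) = 0.5 for a fair coin.
import random

random.seed(42)
for F in (100, 10_000, 100_000):
    f = sum(random.random() < 0.5 for _ in range(F))   # count of 'heads'
    print(F, f / F)   # relative frequency drifts towards 0.5
```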
To estimate the probability using the relative frequency approach, the analyst
divides the number of favourable outcomes (days when the stock reached the target
price) by the total number of outcomes (total days examined). Calculating, we have:
P(A) = 2,023/10,000 = 0.2023 ≈ 0.20.

So, the estimated probability of the stock reaching the specified price level is approximately 0.20, or 20%.
3.7 ‘Randomness’
Statistical inference is concerned with the drawing of conclusions from data which are
subject to randomness, perhaps due to the sampling procedure, perhaps due to
observational errors, perhaps for some other reason.
Let us stop and think why, when we repeat an experiment under apparently identical
conditions, we get different results.
The answer is that although the conditions may be as identical as we are able to control
them to be, there will inevitably be a large number of uncontrollable (and frequently
unknown) variables which we do not measure and which have a cumulative effect on the
result of the sample or experiment. For example, weather conditions may affect the
outcomes of biological or other ‘field’ experiments.
Therefore, the cumulative effect is to cause variation in our results. It is this variation
which we term randomness and, although we never fully know the true generating
mechanism for our data, we can take the random component into account via the
concept of probability, which is, of course, why probability plays such an important role
in data analysis.
Axioms of probability

1. For any event A, P(A) ≥ 0.

2. For the sample space S, P(S) = 1.

3. If A1 , A2 , . . . are mutually exclusive events, then P(A1 ∪ A2 ∪ · · ·) = P(A1 ) + P(A2 ) + · · ·.
The first two axioms should not be surprising. The third may appear quite difficult.
Events are called mutually exclusive when they cannot both occur simultaneously.
Example 3.7 When rolling a die once, the event A = ‘obtain an even score’ and
the event B = ‘obtain an odd score’ are mutually exclusive.
Example 3.9 Suppose you are managing a manufacturing process involving two
alternative production methods, Method A and Method B, and are interested in
whether Method A or Method B is more effective for a specific task. If the
production methods are such that a particular task can only be completed using one
method at a time, then the events A = ‘Method A is employed’ and B = ‘Method B
is employed’ are mutually exclusive.
All the above examples highlight the concept of mutually exclusive events, where the
occurrence of one event prevents the occurrence of the other event.
Extending this, a collection of events is pairwise mutually exclusive if no two events can
occur simultaneously. For instance the three events A, B and C are pairwise mutually
exclusive if A and B cannot occur together and B and C cannot occur together and A
and C cannot occur together. Another way of putting this is that a collection of events
is pairwise mutually exclusive if at most one of them can occur.
Related to this is the concept of a collection of events being collectively exhaustive.
This means at least one of them must occur, i.e. all possible experimental outcomes are
included among the collection of events.
3.8. Properties of probability
Also, do make sure that you distinguish between a set and the probability of a set. This
distinction is important. A set, remember, is a collection of elementary outcomes from
S, whereas a probability (from axioms 1 and 2) is a number on the unit interval, [0, 1].
For example, A = ‘an even die score’, while P (A) = 0.50, for a fair die.
Having defined these events, it is therefore possible to insert every element in the
sample space S into a Venn diagram, as shown in Figure 3.1.
The box represents S, so every possible outcome of the experiment (the total score
when a die is rolled twice) appears within the box. Three (overlapping) circles are
drawn representing the events A, B and C. Each element of S is then inserted into the
appropriate area. For example, the area where the three circles all intersect represents
the event A ∩ B ∩ C into which we place the element ‘6’, since this is the only member
of S which satisfies all three events A, B and C.
Example 3.10 Using Figure 3.1, we can determine the following sets:

A ∩ B = {2, 4, 6}
A ∩ C = {6, 8}
A ∩ B ∩ C = {6}
(A ∪ B ∪ C)c = {11}
A ∩ B ∩ Cc = {2, 4}
Ac ∩ B = {3, 5, 7}
(A ∪ C)c ∩ B = {3}
A | C = {6, 8}.
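These set relations can be verified mechanically. A minimal sketch in Python, where the elements of A, B and C are read off the Venn diagram in Figure 3.1 (their verbal definitions are given earlier in the chapter, so the sets below are taken from the diagram, not from the text):

```python
# Elements of each event, read off the Venn diagram in Figure 3.1
S = set(range(2, 13))          # possible totals when a die is rolled twice
A = {2, 4, 6, 8, 10, 12}
B = {2, 3, 4, 5, 6, 7}
C = {5, 6, 7, 8, 9}

assert A & B == {2, 4, 6}
assert A & C == {6, 8}
assert A & B & C == {6}
assert S - (A | B | C) == {11}     # (A u B u C)^c
assert (A & B) - C == {2, 4}       # A n B n C^c
assert (S - A) & B == {3, 5, 7}    # A^c n B
assert (S - (A | C)) & B == {3}    # (A u C)^c n B
```

Set operations (`&`, `|`, `-`) mirror the event operations ∩, ∪ and complement relative to S directly.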
Figure 3.1: Venn diagram for pre-defined sets A, B and C recording the total score when a die is rolled twice.
Let A and B be any two events. The additive law states that:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Example 3.11 We can think about this using a Venn diagram. The total area of
the Venn diagram in Figure 3.2 is assumed to be 1, so area represents probability.
Event A is composed of all points in the left-hand circle, and event B is composed of
all points in the right-hand circle. Hence:
P (A) = area x + area z P (B) = area y + area z
Figure 3.2: Venn diagram for two overlapping events A and B, with areas x, z and y.
If A and B are mutually exclusive events, i.e. they cannot occur simultaneously, then P(A ∩ B) = 0. Hence:

P(A ∪ B) = P(A) + P(B).
Such events can be depicted by two non-overlapping sets in a Venn diagram as shown
in Figure 3.3. Now revisit axiom 3, to see this result generalised for n mutually
exclusive events.
Figure 3.3: Venn diagram of two mutually exclusive (non-overlapping) events A and B.
The complement of an event A, denoted Ac, has probability:

P(Ac) = 1 − P(A).
The multiplicative law is concerned with the probability of two events occurring at
the same time – specifically when the two events have the special property of
independence. An informal definition of independence is that two events are said to
be independent if one has no influence on the other.
Formally, events A and B are independent if and only if:

P(A ∩ B) = P(A) P(B).
Example 3.13 Consider rolling two fair dice. The score on one die has no influence
on the score on the other die. Hence the respective scores are independent events,
and so:
P(two sixes) = 1/6 × 1/6 = 1/36.
Note the multiplicative (or product) law does not hold for dependent events, which is
the subject of conditional probability, discussed shortly. Also, take a moment to
ensure you are comfortable with the terms ‘mutually exclusive’ and ‘independent’.
These are not the same thing, so do not get these terms confused!
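Both ideas can be checked by enumerating the 36 equally likely outcomes for two fair dice; a short sketch in Python:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely pairs

p_two_sixes = Fraction(sum(1 for d1, d2 in outcomes if d1 == d2 == 6),
                       len(outcomes))
assert p_two_sixes == Fraction(1, 36)

# Independent events: the product law holds for the two scores
assert p_two_sixes == Fraction(1, 6) * Fraction(1, 6)

# Mutually exclusive events (even total vs. odd total) are NOT independent:
p_even = Fraction(sum(1 for d1, d2 in outcomes if (d1 + d2) % 2 == 0), 36)
p_odd = 1 - p_even
p_both = Fraction(0)               # they cannot occur together
assert p_both != p_even * p_odd    # product law fails
```

The last assertion makes the point in the text concrete: mutually exclusive events with non-zero probabilities can never be independent, since P(A ∩ B) = 0 while P(A) P(B) > 0.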
3.9. Conditional probability and Bayes’ formula
P(A | F) = (4/144)/(53/144) = 4/53 = P(A ∩ F)/P(F).

Similarly:

P(F | A) = (4/144)/(18/144) = 4/18 = P(A ∩ F)/P(A). (Note P(A ∩ F) = P(F ∩ A).)
Note also another important relationship involving conditional probability is the ‘total
probability formula’ (discussed in greater depth shortly). This expresses an
unconditional probability in terms of other, conditional probabilities.
P(A) = 18/144 = (4/53) × (53/144) + (14/91) × (91/144) = P(A | F) P(F) + P(A | M) P(M).
Conditional probability

P(A | B) = P(A ∩ B)/P(B)  and  P(B | A) = P(A ∩ B)/P(A)   (3.1)
This is the simplest form of Bayes’ formula, and this can be expressed in other ways.
Rearranging (3.1), we obtain:
P (A ∩ B) = P (A | B) P (B) = P (B | A) P (A)
Bayes’ formula
P(A | B) = P(B | A) P(A) / P(B).
Fulfilment of these criteria (being mutually exclusive and collectively exhaustive) allows
us to view B and B c as a partition of the sample space.
P(A) = P(A | B) P(B) + P(A | Bc) P(Bc).
In words: ‘the probability of an event is equal to its conditional probability on a
second event times the probability of the second event, plus its probability conditional
on the second event not occurring times the probability of that non-occurrence’.
Figure: a partition of the sample space S into B1, B2, B3 and B4; an event A intersecting each member of the partition.
For a general partition of the sample space S into B1 , B2 , . . . , Bn , and for some event
A, then:
P(Bk | A) = P(A | Bk) P(Bk) / Σ_{i=1}^{n} P(A | Bi) P(Bi).
Example 3.17 Suppose that 1 in 10,000 people (0.01%) has a particular disease. A
diagnostic test for the disease has 99% sensitivity – if a person has the disease, the
test will give a positive result with a probability of 0.99. The test has 99% specificity
– if a person does not have the disease, the test will give a negative result with a
probability of 0.99.
Let B denote the presence of the disease, and B c denote no disease. Let A denote a
positive test result. We want to calculate P (A).
The probabilities we need are P (B) = 0.0001, P (B c ) = 0.9999, P (A | B) = 0.99 and
P (A | B c ) = 0.01, and hence:
P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= 0.99 × 0.0001 + 0.01 × 0.9999
= 0.010098.
We want to calculate P (B | A), i.e. the probability that a person has the disease,
given that the person has received a positive test result.
The probabilities we need were given above. Hence:
P(B | A) = P(A | B) P(B) / [P(A | B) P(B) + P(A | Bc) P(Bc)] = (0.99 × 0.0001)/0.010098 ≈ 0.0098.
3.10. Probability trees
Why is this so small? The reason is because most people do not have the disease and
the test has a small, but non-zero, false positive rate P (A | B c ). Therefore, most
positive test results are actually false positives.
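The arithmetic in Example 3.17 is easy to check programmatically; a short sketch in Python:

```python
p_B = 0.0001          # prevalence: P(disease)
p_A_given_B = 0.99    # sensitivity: P(positive | disease)
p_A_given_Bc = 0.01   # false positive rate: P(positive | no disease)

# Total probability formula
p_A = p_A_given_B * p_B + p_A_given_Bc * (1 - p_B)

# Bayes' formula
p_B_given_A = p_A_given_B * p_B / p_A

print(round(p_A, 6))          # 0.010098
print(round(p_B_given_A, 4))  # 0.0098
```

Changing `p_B` shows how sensitive the posterior probability is to the prevalence: for a rarer disease, an even larger share of positives are false positives.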
Recall that two events A and B are independent when knowledge of B does not change the probability of A, i.e. when:

P(A | B) = P(A).
Recall, from the multiplicative law in Section 3.8.4, that under independence
P (A ∩ B) = P (A) P (B). Substituting this into our conditional probability formula gives
the required result:
P(A | B) = P(A ∩ B)/P(B) = P(A) P(B)/P(B) = P(A)
provided P (B) > 0. Hence if A and B are independent, knowledge of B, i.e. ‘ | B’, is of
no value to us in determining the probability of A occurring.
Suspect site                            A     B     C
Probability of it being site of fault   0.5   0.2   0.3
The next test they will do is expected to improve the identification of the correct
site, but (like most tests) it is not entirely accurate.
Draw a probability tree for this problem, and use it to answer the following:
(a) What is the probability that the new test will (rightly or wrongly) identify A as
the site of the fault?
(b) If the new test does identify A as the site of the fault, find:
i. the company’s revised probability that C is the site of the fault
ii. the company’s revised probability that B is not the site of the fault.
Let A, B and C stand for the events: ‘fault is at A’, ‘fault is at B’ and ‘fault is at C’,
respectively. Also, let a, b and c stand for the events: ‘the test says the fault is at A’,
‘the test says the fault is at B’ and ‘the test says the fault is at C’, respectively. The
probability tree is shown in Figure 3.6.
(a) The probability P(a) is the sum of the three values against ‘branches’ which include the event a, that is:

P(a) = 0.35 + 0.04 + 0.03 = 0.42.

i. The conditional probability P(C | a) is the value for the C ∩ a branch divided by P(a), that is:

0.03/0.42 ≈ 0.071.
ii. The conditional probability P (B c | a) is the sum of the values for the A ∩ a
and C ∩ a branches divided by P (a), that is:
(0.35 + 0.03)/0.42 ≈ 0.905.
Alternatively: P(Bc | a) = 1 − P(B | a) = 1 − 0.04/0.42 ≈ 0.905.
Figure 3.6: Probability tree, with branch probabilities:

A (0.5):  a 0.7  → p = 0.5 × 0.7 = 0.35
          b 0.15 → p = 0.5 × 0.15 = 0.075
          c 0.15 → p = 0.5 × 0.15 = 0.075
B (0.2):  a 0.2  → p = 0.2 × 0.2 = 0.04
          b 0.6  → p = 0.2 × 0.6 = 0.12
          c 0.2  → p = 0.2 × 0.2 = 0.04
C (0.3):  a 0.1  → p = 0.3 × 0.1 = 0.03
          b 0.1  → p = 0.3 × 0.1 = 0.03
          c 0.8  → p = 0.3 × 0.8 = 0.24
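The tree computations can be reproduced in a few lines of Python (a sketch; the dictionaries simply encode the branch probabilities of Figure 3.6):

```python
# Prior probabilities of the fault site, and P(test reading | true site),
# read off the branches of Figure 3.6
prior = {'A': 0.5, 'B': 0.2, 'C': 0.3}
test = {
    'A': {'a': 0.7, 'b': 0.15, 'c': 0.15},
    'B': {'a': 0.2, 'b': 0.6,  'c': 0.2},
    'C': {'a': 0.1, 'b': 0.1,  'c': 0.8},
}

# (a) P(a): total probability over the branches ending in reading 'a'
p_a = sum(prior[s] * test[s]['a'] for s in prior)
print(round(p_a, 2))           # 0.42

# (b) i. revised probability that C is the site, given reading 'a'
p_C_given_a = prior['C'] * test['C']['a'] / p_a
# (b) ii. revised probability that B is NOT the site, given reading 'a'
p_Bc_given_a = 1 - prior['B'] * test['B']['a'] / p_a
print(round(p_C_given_a, 3))   # 0.071
print(round(p_Bc_given_a, 3))  # 0.905
```

Each leaf probability is prior × branch probability, so summing leaves is the total probability formula and dividing a leaf by P(a) is Bayes' formula.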
3. Suppose there are two boxes; the first box contains three green balls and one red
ball, whereas the second box contains two green balls and two red balls. Suppose a box is chosen at random and then a ball is drawn randomly from this box.
(a) What is the probability that the ball drawn is green?
(b) If the ball drawn was green, what is the probability that the first box was
chosen?
3. (a) Let B1 and B2 denote boxes 1 and 2, respectively. Let G denote a green ball
and R denote a red ball. Applying the total probability formula, we have:
P(G) = P(G | B1) P(B1) + P(G | B2) P(B2) = 3/4 × 1/2 + 1/2 × 1/2 = 5/8.

(b) By Bayes’ formula, the probability that the first box was chosen, given that the ball drawn was green, is:

P(B1 | G) = P(G | B1) P(B1) / P(G) = (3/8)/(5/8) = 3/5.
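A quick check of both parts, using exact fractions in Python:

```python
from fractions import Fraction

p_B1 = p_B2 = Fraction(1, 2)    # each box chosen at random
p_G_B1 = Fraction(3, 4)         # 3 green out of 4 balls in box 1
p_G_B2 = Fraction(2, 4)         # 2 green out of 4 balls in box 2

# (a) Total probability formula
p_G = p_G_B1 * p_B1 + p_G_B2 * p_B2
assert p_G == Fraction(5, 8)

# (b) Bayes' formula: probability the first box was chosen, given green
p_B1_G = p_G_B1 * p_B1 / p_G
assert p_B1_G == Fraction(3, 5)
```

Using `Fraction` avoids floating-point rounding, so the answers match the hand calculation exactly.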
Chapter 4
Random variables, the normal and
sampling distributions
define a random variable and distinguish it from the values which it takes
state and apply sampling distributions of the sample mean, including the central
limit theorem.
4.4 Introduction
A random variable is a variable which contains the outcomes of a chance experiment.
An alternative view is that a random variable is a description of all possible outcomes of
an experiment together with the probabilities of each outcome occurring. These, and
other possible definitions, are somewhat abstract, so we illustrate with some examples.
Example 4.1 Consider the outcomes when two fair dice are rolled. We can read off
the various possibilities for the pair of scores observed from the sample space as
follows in the form (first score, second score):
Suppose we define the random variable X to be the sum of the two scores. For
example, if we observe (1, 1), then the sum is 1 + 1 = 2. We could write the possible
outcomes along with their respective probabilities, pX (x) (where lower case x is a
specific value of the random variable X), in a table depicting the probability
distribution of the random variable X as follows.
Sum of scores, X         2     3     4     5     6     7     8     9     10    11    12
Probability of outcome   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
Note that each of the 36 pairs of scores is equally likely, hence why P (X = 2) = 1/36
(since (1, 1) is the only one of the 36 possible outcomes for the sum to equal 2),
P (X = 3) = 2/36 (resulting from the two possible outcomes (1, 2) and (2, 1)), and so
on.
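The probability distribution of the sum can be generated by enumeration; a sketch in Python:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely pairs give each total
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
p_X = {x: Fraction(n, 36) for x, n in sorted(counts.items())}

assert p_X[2] == Fraction(1, 36)    # only (1, 1)
assert p_X[3] == Fraction(2, 36)    # (1, 2) and (2, 1)
assert p_X[7] == Fraction(6, 36)    # the most likely total
assert sum(p_X.values()) == 1       # probabilities sum to one
```

The final assertion checks a property every probability distribution must satisfy: the probabilities over all possible values sum to 1.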
Example 4.2 We determine the sample space when two fair coins are tossed (each
resulting in heads, H, or tails, T ), and the associated random variable, X, which
counts the number of tails. The sample space is:
S = {HH, HT, T H, T T }.
Number of tails, X 0 1 2
Probability of outcome 0.25 0.50 0.25
So, X is a random variable taking values 0, 1 and 2 with probabilities 0.25, 0.50 and
0.25, respectively.
Example 4.3 Suppose you are assessing the success or failure of a specific project
management approach. The two possible outcomes are project success (Success) and
project failure (Failure). Let X be a random variable representing the project
outcome in a randomly selected project. The probabilities are as follows:
4.5. Discrete random variables
Investment preference, X A B AB O
Probability of outcome 0.40 0.25 0.15 0.20
In each of the examples, the discrete random variable is numeric (the number of . . .).
This will be the case for most discrete random variable examples. For some of these,
they are clearly finite (such as the number of product launches). Others are finite in
Example 4.5 If we wanted P (1 < X < 3), say, then we would compute the area
under the curve defined by f (x) and above the x-axis interval (1, 3). This is
illustrated in Figure 4.1 for an arbitrary pdf.
In this way the pdf will give us the probabilities associated with any interval of
interest. To calculate this area requires the mathematical technique of integration,
which is very important in the theory of continuous random variables because of its
role in determining areas. However, we will not require integration in this course
(again, study ST104B Statistics 2 if interested).
4.7. Expectation of a discrete random variable
Figure 4.1: For an arbitrary pdf, P (1 < X < 3) is shown as the area under the pdf and
above the x-axis interval (1, 3).
P(1 < X < 3) = the area under f(x) above the x-axis interval (1, 3). Note also that the total area under the curve is 1, since this represents the probability of X taking any possible value.
Number of tails, X 0 1 2
Probability of outcome 0.25 0.50 0.25
In the penultimate line, we can see that the first term is 0 × pX (0), the second term is
1 × pX (1) and the third term is 2 × pX (2).
Therefore, the mean value of X is:

E(X) = Σ_{x=0}^{2} x pX(x) = 0 × 0.25 + 1 × 0.50 + 2 × 0.25 = 1.
We can think of E(X) as the long-run average when the experiment is carried out a
large number of times.
Example 4.6 Suppose I buy a lottery ticket for £1. I can win £500 with a
probability of 0.001 or £100 with a probability of 0.003. What is my expected profit?
We begin by defining the random variable X to be my profit (in pounds). Its
distribution is:
Profit, X −1 99 499
Probability of outcome 0.996 0.003 0.001
The expected profit is E(X) = −1 × 0.996 + 99 × 0.003 + 499 × 0.001 = −0.20, so I expect to make a loss of £0.20 (which will go to funding the prize money or, possibly, charity).
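The same expectation can be computed directly; a sketch in Python (in place of the Excel calculation):

```python
# Profit distribution: value -> probability
profits = {-1: 0.996, 99: 0.003, 499: 0.001}

# E(X) = sum of x * p(x) over all values x
expected_profit = sum(x * p for x, p in profits.items())
print(round(expected_profit, 2))   # -0.2, i.e. an expected loss of £0.20
```

Note the probabilities 0.996, 0.003 and 0.001 already sum to 1, as required of a probability distribution.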
4.8. Functions of a random variable
Example 4.7 Consider an economic scenario where you are analysing the
distribution of successful investment outcomes based on two investment decisions of
a group of investors. The random variable X represents the number of investments
(out of two) resulting in profitable returns. Suppose the probabilities are
pX (0) = 0.25, pX (1) = 0.50, and pX (2) = 0.25. Using (4.1), the expected number of
profitable investments is:
E(X) = Σ_{x=0}^{2} x pX(x) = 0 × 0.25 + 1 × 0.50 + 2 × 0.25 = 1.
Score, X 1 2 3 4 5 6
Probability of outcome 1/6 1/6 1/6 1/6 1/6 1/6
These take the values derived from the function given, and the associated probabilities
are derived from those of X. Therefore, from the distribution of X we can derive, for
example, the distribution of X1 = 1/X.
Square of score, X2 = X 2 1 4 9 16 25 36
Probability of outcome 1/6 1/6 1/6 1/6 1/6 1/6
Random variable, X3 0 1 2
Probability of outcome 1/2 1/3 1/6
Just as we defined:

E(X) = Σ_x x pX(x)
For a discrete random variable X where g(X) is the function of X being considered,
the expectation of this function of X is given by:
E(g(X)) = Σ_x g(x) pX(x)

where we sum over all the x values which are taken by the random variable X.
Example 4.8 For the random variables X1 , X2 and X3 defined above, we have the
following.
For X1 , its expectation is:
E(X1) = E(1/X) = Σ_{x=1}^{6} (1/x) pX(x) = 1/1 × 1/6 + 1/2 × 1/6 + 1/3 × 1/6 + 1/4 × 1/6 + 1/5 × 1/6 + 1/6 × 1/6 = 49/120 = 0.4083.
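The sum can be checked exactly with fractions; a sketch in Python (in place of the Excel calculation):

```python
from fractions import Fraction

p = Fraction(1, 6)   # fair die: each score has probability 1/6

# E(X1) = E(1/X) = sum of (1/x) * p(x) over x = 1, ..., 6
e_X1 = sum(Fraction(1, x) * p for x in range(1, 7))
assert e_X1 == Fraction(49, 120)
print(round(float(e_X1), 4))   # 0.4083
```

This illustrates a general caution: E(1/X) = 49/120, which is not the same as 1/E(X) = 1/3.5 — in general E(g(X)) ≠ g(E(X)).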
4.9. Variance of a discrete random variable
Example 4.9 For the two investment decisions in Example 4.7, we saw that
E(X) = µ = 1. Therefore, the variance is (using the first method):
σ² = Σ_{x=0}^{2} (x − µ)² pX(x) = (0 − 1)² × 0.25 + (1 − 1)² × 0.50 + (2 − 1)² × 0.25 = 0.25 + 0 + 0.25 = 0.50.

Using the second method:

σ² = E(X²) − µ² = (0 + 0.50 + 1) − 1² = 0.50

giving a standard deviation of √0.50 = 0.7071.
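Both methods of computing the variance can be checked side by side in Python:

```python
p_X = {0: 0.25, 1: 0.50, 2: 0.25}   # distribution from Example 4.7

mu = sum(x * p for x, p in p_X.items())                  # E(X)
var1 = sum((x - mu) ** 2 * p for x, p in p_X.items())    # definition
var2 = sum(x ** 2 * p for x, p in p_X.items()) - mu**2   # E(X^2) - mu^2

assert mu == 1.0
assert abs(var1 - 0.50) < 1e-12 and abs(var2 - 0.50) < 1e-12
print(round(var1 ** 0.5, 4))   # 0.7071, the standard deviation
```

The two formulas agree, as they must; the E(X²) − µ² form is usually quicker by hand.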
Number of quarters, X 0 1 2 3 4
Probability of outcome 0.25 0.30 0.25 0.15 0.05
and:

σ² = Var(X) = Σ_{x=0}^{4} (x − µ)² pX(x) = 1.3475, where µ = E(X) = 1.45.
(b) What is the probability that the number of quarters with negative GDP growth
exceeds:
i. µ + 2σ?
ii. µ + 3σ?
p √
We have that the standard deviation is σ = Var(X) = 1.3475 = 1.16.
i. Hence P(X > µ + 2σ) = P(X > 1.45 + 2 × 1.16) = P(X > 3.77) = P(X = 4) = 0.05.
4.10. The normal distribution
Many variables have distributions which are approximately normal – for example,
weights of humans and animals.
Normal distribution

A random variable X has a normal distribution, denoted X ∼ N(µ, σ²), if its probability density function is:

f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))  for −∞ < x < ∞

where π is the mathematical constant (i.e. π = 3.14159 . . .), and µ and σ² are parameters, with −∞ < µ < ∞ and σ² > 0.
The mean (i.e. expected value) of X is:
E(X) = µ
The normal distribution is the so-called ‘bell curve’. The two parameters affect it as follows.

The mean, µ, determines the location (or central tendency) of the curve.

The variance, σ², determines the dispersion (or spread) of the curve.
For example, in Figure 4.2, N (0, 1) and N (5, 1) have the same dispersion but different
locations – the N (5, 1) curve is identical to the N (0, 1) curve, but shifted 5 units to the
right, while N (0, 1) and N (0, 9) have the same location but different dispersions – the
N (0, 9) curve is centred at the same value as the N (0, 1) curve, but spread out more
widely.
Figure 4.2: Probability density functions of N(0, 1), N(5, 1) and N(0, 9).
The mean can also be inferred from the observation that the normal distribution is
symmetric about µ. This also implies that the median of the normal distribution is also
µ, and we also note that since the distribution reaches a maximum at µ, then the mean
and median are also equal to the mode. Hence for any normal distribution: mean = median = mode = µ.
The most important normal distribution is the special case when µ = 0 and σ 2 = 1. We
call this the standard normal distribution, denoted by Z, i.e. Z ∼ N (0, 1). A
standardised variable has
√ a zero mean and a variance of one (hence also a standard
deviation of one since 1 = 1). Tabulated probabilities which appear in statistical
tables are for the standard normal distribution.
Table 4 of the New Cambridge Statistical Tables lists ‘lower-tail’ probabilities, which
can be represented as:
P (Z ≤ z) = Φ(z) for z ≥ 0
using the conventional Z notation for a standard normal random variable.1
¹ Although Z is the conventional letter used to denote a standard normal random variable, Table 4 uses (somewhat confusingly) ‘x’ to denote ‘z’.
A cumulative probability is the probability of being less than or equal to some particular
value. Note the cumulative probability for the Z distribution, P (Z ≤ z), is often
denoted Φ(z). We now consider some examples of working out probabilities from
Z ∼ N (0, 1).
Figure 4.3: Standard normal distribution with shaded area depicting P(Z > 1.20).
P (0 ≤ Z ≤ 1.86) = P (Z ≤ 1.86) − P (Z ≤ 0)
= Φ(1.86) − Φ(0.00)
= 0.9686 − 0.50
= 0.4686.
P (−1.24 ≤ Z ≤ 0) = P (Z ≤ 0) − P (Z ≤ −1.24)
= Φ(0.00) − Φ(−1.24)
= Φ(0.00) − (1 − Φ(1.24))
= 0.50 − (1 − 0.8925)
= 0.3925.
Figure 4.4: Standard normal distribution depicting P(−1.24 < Z < 1.86) as the shaded areas.
4.11. Sampling distributions
Example 4.13 Suppose X ∼ N (5, 4). What is P (5.8 < X < 7.0)?
We have:
P(5.8 < X < 7.0) = P((5.8 − µ)/σ < (X − µ)/σ < (7.0 − µ)/σ)
= P((5.8 − 5)/√4 < (X − 5)/√4 < (7.0 − 5)/√4)
= P (0.40 < Z < 1.00)
= P (Z ≤ 1.00) − P (Z ≤ 0.40)
= 0.8413 − 0.6554 (from Table 4)
= 0.1859.
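The same probability can be obtained with Python’s standard library, where `statistics.NormalDist` plays the role of Table 4:

```python
from statistics import NormalDist

X = NormalDist(mu=5, sigma=2)   # X ~ N(5, 4), so sigma = sqrt(4) = 2

p = X.cdf(7.0) - X.cdf(5.8)
print(round(p, 4))   # 0.1859

# Equivalently, after standardising to Z ~ N(0, 1):
Z = NormalDist()
assert abs(p - (Z.cdf(1.00) - Z.cdf(0.40))) < 1e-12
```

The assertion confirms the standardisation step: probabilities for X ∼ N(5, 4) and for the standardised Z are identical.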
parameters? To know these values precisely would require data on the heights of the
entire adult population, which for a large population is just not feasible!
Population sizes, denoted by N , are typically very large and clearly no-one has the
time, money or patience to undertake such a marathon data collection exercise. Instead
we opt to collect a sample (i.e. a subset of the population) of size n.2 Having collected
our sample, we then estimate the unknown population parameters based on the known
(i.e. observed) sample data. Specifically, we estimate population quantities based on
their respective sample counterparts.
A statistic (singular noun) is simply a known function of data. A sample statistic is
calculated from sample data. At this point, be aware of the following distinction between an estimator and an estimate.
Example 4.14 The sample mean X̄ = Σ_{i=1}^{n} Xi/n is an estimator of the population mean, µ. If we had drawn from the population the sample data:

x1 = 4, x2 = 8, x3 = 2 and x4 = 6

then the estimate of µ would be x̄ = (4 + 8 + 2 + 6)/4 = 20/4 = 5. If instead the sample data had been:

x1 = 2, x2 = 9, x3 = 1 and x4 = 4

(drawn from the same population), then the estimator of µ would still be X̄ = Σ_{i=1}^{n} Xi/n, but the estimate would now be x̄ = (2 + 9 + 1 + 4)/4 = 16/4 = 4.
Example 4.14 demonstrates that the value of sample statistics varies from sample to
sample due to the random nature of the sampling process. Hence estimators are random
variables with corresponding probability distributions, known as sampling distributions.
We proceed to study these sampling distributions.
Sampling distribution

The sampling distribution of an estimator is the probability distribution of the estimator, i.e. the distribution of its values over all possible samples.

² If n = N, and we sample without replacement, then we have obtained a complete enumeration of the population, i.e. a census. Most of the time n ≪ N, where ‘≪’ means ‘much less than’.
4.12. Sampling distribution of X̄
Before we proceed, let us take a moment to review some population quantities and their
respective sample counterparts, as shown in Table 4.1.
When the values of parameters are unknown, we must estimate them using sample data.
If the sample is (approximately) representative of the population from which it is
drawn, then the sample characteristics should be (approximately) equal to their
corresponding population characteristics. Hence a density histogram of the sample data
should look very similar to the corresponding population distribution. For example, if
X ∼ N (µ, σ 2 ), then a histogram of sample data drawn from this distribution should be
(approximately) bell-shaped. Similarly, common descriptive statistics should be
(approximately) equal to the corresponding population quantities. For example, x̄ ≈ E(X) and s² ≈ Var(X), i.e. the sample mean should be ‘close to’ the expected value of X, and the sample
variance should be ‘close to’ the variance of X.
The precision (or quality) of point estimates such as x̄ and s2 will depend on the sample
size n, and in principle on the population size N , if finite. In practice if N is very large
relative to n (i.e. n ≪ N), then we can use approximations which are more than
adequate for practical purposes, but would only be completely accurate if the
population truly was infinite. In what follows, we assume N is large enough to be
treated as infinite.
When stating the sampling distribution of the sample mean, X̄, we distinguish between
sampling from populations which have a normal distribution and populations which
have a non-normal distribution.
When sampling from N (µ, σ 2 ), the sampling distribution of the sample mean is:
X̄ ∼ N(µ, σ²/n).
The central limit theorem applies to determine the sampling distribution of X̄ when
sampling from non-normal distributions.
The central limit theorem (CLT) says that for a random sample from (nearly) any non-normal distribution with mean µ and variance σ², then, approximately:

X̄ ∼ N(µ, σ²/n).
So the difference between sampling from normal and non-normal populations is that X̄
is exactly normally distributed in the former case, but only approximately in the latter
case. The approximation is reasonable when n is at least 30, as a rule-of-thumb.
Because this is an asymptotic approximation (i.e. as n → ∞), the larger n is, the better the normal approximation.
Notice that Var(X̄) = σ 2 /n depends on the sample size n. It is easy to see that as n
increases (i.e. when we take a larger sample), the variance of X̄ gets smaller. That is,
sample means are less variable than single values (individual observations) from the
population. Indeed, as n → ∞ (as n tends to infinity), Var(X̄) → 0, and so the sample
mean tends to the true value µ (the value we are trying to estimate). Hence the larger
the sample size, the greater the accuracy in estimation (a good thing), but the greater
the total data collection cost (a bad thing). As happens so often in life we face a
trade-off, here between accuracy and cost.
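A small simulation illustrates both the CLT and the shrinking variance of X̄. This is a sketch with arbitrary choices (a uniform population on [0, 1), n = 30, 10,000 replications), not a calculation from the text:

```python
import random
import statistics

random.seed(42)

# Sample means of n = 30 draws from a decidedly non-normal (uniform)
# population, which has mean 0.5 and variance 1/12
n, reps = 30, 10_000
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(reps)]

print(round(statistics.fmean(means), 3))     # close to mu = 0.5
print(round(statistics.variance(means), 5))  # close to sigma^2/n = (1/12)/30
```

A histogram of `means` would look bell-shaped even though the parent population is flat, and rerunning with larger n shows the variance of the sample means shrinking in proportion to 1/n.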
Up to now, we have referred to the square root of the variance of a random variable as
the standard deviation. In sampling theory, the square root of the variance of an
estimator is called the standard error.
Standard error

The standard error of X̄ is the square root of Var(X̄), i.e. σ/√n.

When sampling from a normal population, X̄:

is normally distributed
has mean µ
has variance σ²/n and standard error σ/√n.

When sampling from a non-normal population with a large sample size, X̄ (approximately):

is normally distributed
has mean µ
has variance σ²/n and standard error σ/√n.
x −1 1 2
pX (x) 0.20 k 4k
(a) Determine the constant k and, hence, write down the probability distribution
of X.
(b) Find E(X) (the expected value of X).
(c) Find Var(X) (the variance of X).
2. The scores on a verbal reasoning test are normally distributed with a population
mean of µ = 100 and a population standard deviation of σ = 10.
(a) What is the probability that a randomly chosen person scores at least 105?
4.16. Solutions to Sample examination questions
3. The weights of a large population of animals have a mean of 7.3 kg and a standard
deviation of 1.9 kg.
(a) Assuming that the weights are normally distributed, what is the probability
that a random selection of 40 animals from this population will have a mean
weight between 7.0 kg and 7.4 kg?
(b) A researcher stated that the probability you calculated is approximately
correct even if the distribution of the weights is not normal. Do you agree?
Justify your answer.
(c) We have:
E(X²) = Σᵢ xᵢ² p(xᵢ) = (−1)² × 0.20 + 1² × 0.16 + 2² × 0.64 = 2.92
hence:
Var(X) = E(X 2 ) − (E(X))2 = 2.92 − (1.24)2 = 1.3824.
An alternative method to find the variance is through the formula Σᵢ (xᵢ − µ)² p(xᵢ), where µ = E(X) was found in part (b).
2. The first part just requires knowledge of the fact that X is a normal random
variable with mean µ = 100 and variance σ 2 = 100. However, for the second part of
the question it is important to note that X̄, the sample mean, is also a normal
random variable with mean µ and variance σ 2 /n. Direct application of this fact
then yields that:
X̄ ∼ N(µ, σ²/n) = N(100, 5).
For both parts, the basic property of normal random variables for this question is
that if X ∼ N (µ, σ 2 ), then:
Z = (X − µ)/σ ∼ N(0, 1).
Chapter 5
Interval estimation
construct and interpret confidence intervals for a population mean and proportion
state the sampling distributions of the difference between two sample means and
two sample proportions
construct and interpret confidence intervals for the difference between two
population means and two population proportions.
5.4 Introduction
Researchers often find themselves facing decisions related to populations, which can
consist of various members, such as a group of consumers in a marketing study or a set
of manufactured goods on a production line. The types of information needed for these
decisions can take the form of statistical parameters like a population mean (denoted
µ) – for example, ‘What is the average number of consumers per week?’ – or a
population proportion (denoted π) – for example, ‘What proportion of manufactured
goods is defective?’. The decisions stemming from collected data may range from
adjusting prices, to modifying the production process for optimal quality.
In most cases it is impossible to gather information about the whole population due to
time and resource constraints. Consequently, researchers must rely on collecting data
from a representative sample and then make inferences about the broader
population based on the sample. We consider different types of sampling techniques in
Chapter 9, but for now we note that a random sample is free of selection bias with
the expectation (but no guarantee) of a representative sample. We would expect the
representativeness to improve as the sample size increases.
Now, let’s delve into the world of statistical inference and explore concepts like
5 confidence intervals and sample size determination. How do these elements influence the
accuracy of estimating population means and proportions? And how can researchers
effectively utilise this information in their research?
Note that inferring information about a parent (or theoretical) population
using observations from a (random) sample is the primary concern of
statistical inference.
The width of a confidence interval depends on the:

confidence level

sample size
This method of expressing the accuracy of an estimate is easily understood and requires
no statistical sophistication for interpretation.
More formally, an x% confidence interval covers the unknown parameter with x%
probability over repeated samples. A visual illustration is provided in Figure 5.1. The
red and blue lines each represent a confidence interval (each confidence interval is
obtained from a different sample). In total there are 10 lines (8 blue, 2 red) reflecting 10
independent random samples drawn from the same population. In this example, 80% of
the time (8 out of the 10 confidence intervals) happen to cover µ (whose true value is
indicated by the green arrow). If this 80% figure was the long-run percentage (i.e. over
many repeated samples), then such confidence intervals would have an 80% coverage
probability. Hence 80% of the time we would obtain a confidence interval for µ which
covers (or spans, or includes) µ. In practice, though, we may only have one sample,
hence one confidence interval. With respect to Figure 5.1 there is a 20% risk that it is a
‘red’ interval. If it was the left red confidence interval, this would lead us to think µ is
smaller than it actually is; if it was the right red confidence interval, this would lead us
to think µ is larger than it actually is.
Figure 5.1: Coverage (in blue) and non-coverage (in red) of µ for several confidence
intervals (one confidence interval per sample).
cumulative is a logical value: for the cumulative distribution function, use 1 (or
TRUE); for the probability density function, use 0 (or FALSE).
Since σ/√n > 0 (a standard error is always strictly positive), then:

0.95 = P(−1.96 < (X̄ − µ)/(σ/√n) < 1.96)   (from above)

= P(−1.96 × σ/√n < X̄ − µ < 1.96 × σ/√n)   (multiply through by σ/√n)

= P(−1.96 × σ/√n < µ − X̄ < 1.96 × σ/√n)   (multiply through by −1)

= P(X̄ − 1.96 × σ/√n < µ < X̄ + 1.96 × σ/√n).   (et voilà!)
5.5. Interval estimation for a population mean
Note that when we multiply by −1 to go from the second to the third line, the
inequality sign is reversed.
When sampling from a normal distribution, a 95% confidence interval for µ has endpoints X̄ ± 1.96 × σ/√n. Hence the reported confidence interval would be:

(x̄ − 1.96 × σ/√n, x̄ + 1.96 × σ/√n).
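The endpoint formula translates directly into code. A sketch in Python, where the helper name `mean_ci` and the numerical values of x̄, σ and n are illustrative choices, not taken from the text:

```python
from statistics import NormalDist

def mean_ci(xbar, sigma, n, coverage=0.95):
    """Confidence interval for mu when the population variance is known."""
    z = NormalDist().inv_cdf((1 + coverage) / 2)   # 1.96 for 95% coverage
    half_width = z * sigma / n ** 0.5              # z * standard error
    return xbar - half_width, xbar + half_width

# Hypothetical sample: xbar = 25, known sigma = 4, n = 100
lo, hi = mean_ci(25, 4, 100)
print(round(lo, 3), round(hi, 3))   # 24.216 25.784
```

Passing `coverage=0.90` or `coverage=0.99` swaps in the appropriate z-value automatically, which is exactly the change of multiplier coefficient discussed below.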
This is a simple, but very important, result. As we shall see, it can be applied to give
confidence intervals in many different situations such as for the estimation of a mean, a
proportion, as well as a difference between means and a difference between proportions
(covered later in this chapter).
The above derivation was for a 95% confidence interval, i.e. with a 95% coverage
probability, which is a generally accepted confidence requirement. Of course, it is
possible to have different levels of confidence, say 90% or 99% (the 80% demonstrated
in Figure 5.1 is much less common). Fortunately, we can use the same argument as
above. However, a different multiplier coefficient drawn from the standard normal
distribution is required (i.e. not 1.96). For convenience, key values are given below,
where zα denotes the z-value which cuts off 100α% probability in the upper tail of the
standard normal distribution.
Unfortunately, the method used so far in this section is limited by the assumption that
σ 2 is known. This means, in effect, that we need to know the true population variance,
but we do not know the population mean – since this is what we are trying to estimate.
This seems implausible. Why would you know σ 2 , but not know µ?
In such cases it will be necessary to estimate the standard error from the data. This
requires a modification both of the approach and, in particular, of the general formula
for the endpoints of a confidence interval.
i.e. that the distribution remains the standard normal distribution, but because of the
additional sampling variability of the estimated standard error, this new, transformed
function of the data will have a more dispersed distribution than the standard normal –
the Student's t distribution. Indeed we have that:

(X̄ − µ)/(S/√n) ∼ tn−1.
For our purposes, we will use this distribution whenever we are performing statistical
inference for population means when population variances are unknown, and hence are
estimated from the data. The correct degrees of freedom will depend on the degrees of
freedom used to estimate the variance.
Assuming a 95% coverage probability, we can find t0.025, n−1 such that:

P(−t0.025, n−1 < (X̄ − µ)/(S/√n) < t0.025, n−1) = 0.95

where t0.025, n−1 is the t-value which cuts off 2.5% probability in the upper tail of the
t distribution with n − 1 degrees of freedom. On rearranging the inequality within the
brackets we get:

P(X̄ − t0.025, n−1 × S/√n < µ < X̄ + t0.025, n−1 × S/√n) = 0.95.
The t-values can be found manually in Table 10 of the New Cambridge Statistical
Tables, and they can also be obtained in Excel using the T.INV function,
=T.INV(probability, deg_freedom), where deg_freedom is the number of degrees of
freedom with which to characterise the distribution.
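If neither tables nor Excel are to hand, the t-value can be approximated numerically. The sketch below (standard-library Python only; it mimics =T.INV under the stated accuracy assumptions and is not part of the course material) integrates the t density with Simpson's rule and inverts the CDF by bisection:

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) via Simpson's rule over [0, |x|], using symmetry about zero."""
    a, h = abs(x), abs(x) / steps
    area = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = area * h / 3                  # integral of the density from 0 to |x|
    return 0.5 + half if x >= 0 else 0.5 - half

def t_inv(p, df):
    """Like Excel's =T.INV(p, df): the x with t_cdf(x, df) = p, by bisection."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# t_inv(0.975, 11) is approximately 2.201 (cf. Table 10 and =T.INV(0.975,11))
```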
Important note
In the following applications, whenever the t distribution is used and we have large
sample size(s) (hence large degrees of freedom), it is acceptable to use standard
normal values as approximations due to the tendency of the t distribution to the
standard normal distribution as the degrees of freedom approach infinity, i.e. we
have that:
tν → N (0, 1) as ν → ∞.
What constitutes a ‘large’ sample size is rather subjective. However, as a rule of
thumb, treat anything over 30 as ‘large’.
where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the
standard normal distribution to ensure a 100(1 − α)% confidence interval.
For example, for α = 0.05, we have a 100(1 − 0.05)% = 95% confidence interval, and
we require the z-value which cuts off α/2 = 0.025, i.e. 2.5% probability in the upper
tail of the standard normal distribution, which is 1.96, and can be obtained using
=NORM.S.INV(0.975) or from the bottom row of Table 10.
where z0.025 = 1.96 is the z-value which cuts off 100α/2% = 2.5% probability in the
upper tail of the standard normal distribution obtained using =NORM.S.INV(0.975),
or Table 10 (using the bottom row, since tν → N (0, 1) as ν → ∞). In other words,
the interval is (0.818, 0.830) which covers the true mean with a probability of 95%.
To compute a 99% confidence interval (where α = 0.01), since σ is known we require
zα/2 = z0.005 , that is the z-value which cuts off 0.5% probability in the upper tail of
the standard normal distribution.
We have that z0.005 = 2.576, obtained using =NORM.S.INV(0.995) (or Table 10), so a
99% confidence interval has endpoints:
x̄ ± 2.576 × σ/√n = 0.824 ± 2.576 × 0.042/√200 = 0.824 ± 0.008.

In other words, the interval is (0.816, 0.832). Note the higher level of confidence
has resulted in a wider confidence interval. This is as expected since, other things
equal, the 'price' of the benefit of a higher confidence level is the cost of a wider
confidence interval.
where tα/2, n−1 is the t-value which cuts off 100α/2% probability in the upper tail of
the t distribution with n − 1 degrees of freedom, obtained from T.INV, alternatively
using Table 10.
Example 5.3 A researcher carries out a sampling exercise in order to estimate the
average height of a species of tree. A random sample of 12 trees gives the following
descriptive statistics (in feet):
x̄ = 41.625 and s = 7.840.
We seek a 95% confidence interval for µ, the mean height of all trees of this species.
The estimated standard error of the sample mean is:

s/√n = 7.840/√12 = 2.2632.
Hence a 95% confidence interval for µ has endpoints:

41.625 ± 2.201 × 2.2632 ⇒ (36.64, 46.61).
Make sure you see where 2.201 comes from. It is t0.025, 11 , i.e. the t-value above
which lies 2.5% probability for a Student’s t distribution with 11 degrees of
freedom. In Excel, this is obtained using =T.INV(0.975,11). This can also be
found in Table 10.
Make sure you report confidence intervals in the form (36.64, 46.61). That is,
you must compute the actual endpoints and report these as an interval, as that
is what a confidence interval is! Note the lower endpoint should be given first.
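The arithmetic of Example 5.3 can be checked with a few lines of code (a sketch only; Python is not part of the course, and the t-value is taken from Table 10 as in the text):

```python
from math import sqrt

# Sample summaries from Example 5.3
n, xbar, s = 12, 41.625, 7.840
t_value = 2.201                      # t_{0.025, 11}, e.g. from =T.INV(0.975, 11)

ese = s / sqrt(n)                    # estimated standard error, about 2.2632
lo, hi = xbar - t_value * ese, xbar + t_value * ese
print(round(lo, 2), round(hi, 2))    # 36.64 46.61
```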
This result is a consequence of the central limit theorem applied to the proportion of
successes for a ‘binomial’ distribution, or equivalently the parameter of a ‘Bernoulli’
distribution (both introduced in ST104B Statistics 2).
The variance of the sample proportion is:

Var(P) = π(1 − π)/n

and so the standard error of the sample proportion (the square root) is:

S.E.(P) = √(π(1 − π)/n).
Unfortunately, this depends on π, precisely what we are trying to estimate, hence the
true standard error is unknown, so must itself be estimated. As π is unknown, the best
we can do is replace it with our point estimate of it (the sample proportion p = r/n,
where there are r 'successes' in the sample of size n), hence the estimated standard error
is:

E.S.E.(p) = √(p(1 − p)/n) = √((r/n)(1 − r/n)/n).
where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the
standard normal distribution, obtained using NORM.S.INV. Such z-values can also be
obtained from the bottom row of Table 10.
Note that although we are estimating a variance, for proportions we do not use the t
distribution for the following two reasons.
The sample size n has to be large for the central limit theorem normal
approximation to hold, and so the standard normal distribution is appropriate in
this case.
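Putting the pieces together, the confidence interval for a proportion has endpoints p ± z × E.S.E.(p). A sketch follows (Python standard library; the sample counts are invented for illustration):

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(r, n, conf=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    p = r / n                                    # sample proportion
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    ese = sqrt(p * (1 - p) / n)                  # estimated standard error
    return p - z * ese, p + z * ese

# Hypothetical sample: r = 40 successes out of n = 100 trials
lo, hi = proportion_ci(40, 100)                  # approximately (0.304, 0.496)
```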
5.7 Sample size determination
researcher requires that there should be a 95% chance that the estimation error should
be no larger than e units (we refer to e as the tolerance on the estimation error), then
this is equivalent to having a 95% confidence interval of width 2e. Note here e
represents the half-width of the confidence interval since the point estimate is, by
construction, at the centre of the confidence interval.
n ≥ (2.576)² × 4²/1² = 106.17.

Remembering that n must be an integer, the smallest n satisfying this is 107. (Note
that we round up, otherwise had we rounded down it would lead to less precision.)
So, 57 more observations are required.
Example 5.6 The reaction time of a patient to a certain stimulus is known to have
a standard deviation of 0.05 seconds. How large a sample of measurements must a
psychologist take in order to be 95% confident and 99% confident, respectively, that
the error in the estimate of the mean reaction time will not exceed 0.01 seconds?
For 95% confidence, we use z0.025 = 1.96. So, using (5.2), n is to be chosen such that:
n ≥ (1.96)² × (0.05)²/(0.01)².
Hence we find that n ≥ 96.04. Since n must be an integer, 97 observations are
required to achieve an error of 0.01 or less with 95% confidence.
For 99% confidence, we use z0.005 = 2.576. So, using (5.2), n is to be chosen such that:
n ≥ (2.576)² × (0.05)²/(0.01)².
Hence we find that n ≥ 165.89. Since n must be an integer, 166 observations are
required to achieve an error of 0.01 or less with 99% confidence.
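The sample size formula n ≥ (zα/2 × σ/e)² used in Example 5.6 can be sketched as follows (Python, standard library; not part of the course):

```python
from math import ceil
from statistics import NormalDist

def sample_size_mean(sigma, e, conf=0.95):
    """Smallest n so the estimation error is at most e with probability conf."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil((z * sigma / e) ** 2)            # round up, never down

n95 = sample_size_mean(0.05, 0.01)               # 97, as in Example 5.6
n99 = sample_size_mean(0.05, 0.01, conf=0.99)    # 166
```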
5.8. Estimation of differences between parameters of two populations
Note that a higher level of confidence requires a larger sample size as more
information (sample data) is required to achieve a higher level of confidence for a
given tolerance, e.
In (5.3), p should be an approximate value of π, perhaps obtained from a pilot study, or
alternatively we make an assumption of this value based on judgement and/or
experience. If a pilot study is not feasible and a value cannot be assumed, then set
p = 0.50 in (5.3) as a ‘conservative’ choice, as this value gives the maximum possible
standard error (this can be shown using calculus, but the proof is beyond the scope of
this course).
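The corresponding sample size calculation for a proportion, n ≥ zα/2² × p(1 − p)/e², with the conservative choice p = 0.50, might be sketched as follows (the tolerance e = 0.03 is an invented illustration):

```python
from math import ceil
from statistics import NormalDist

def sample_size_proportion(e, conf=0.95, p=0.5):
    """Sample size to estimate a proportion to within e; p = 0.5 is conservative."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil(z ** 2 * p * (1 - p) / e ** 2)

n = sample_size_proportion(0.03)    # 1068 with the conservative p = 0.5
```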
intervals provide this essential framework, offering a range of plausible values for the
true difference and thereby aiding researchers in drawing robust conclusions.
Furthermore, confidence intervals offer a bridge between sample data and population
parameters, allowing researchers to communicate the precision of their findings to a
wider audience. This transparency enhances the credibility of research results and
fosters a deeper understanding of the underlying uncertainty in scientific measurements.
We proceed to look at how to construct confidence intervals in two contexts:
5.9 Difference between two population means
which arises due to X̄1 − X̄2 being a linear combination of two independent normal
random variables, such that:
E(X̄1 − X̄2 ) = E(X̄1 ) − E(X̄2 ) = µ1 − µ2
and, due to independence of X̄1 and X̄2 , we have that (note we add the variances):
Var(X̄1 − X̄2) = Var(X̄1) + Var(X̄2) = σ1²/n1 + σ2²/n2.
The result in (5.4) follows since a linear combination of independent normal random
variables also has a normal distribution, and we have just derived its mean and variance.
Recall that the standard error is the (positive) square root of the variance of a statistic,
hence the standard error of X̄1 − X̄2 for the case of known variances σ1² and σ2² is:

S.E.(X̄1 − X̄2) = √Var(X̄1 − X̄2) = √(σ1²/n1 + σ2²/n2).
Noting the ‘template’ for confidence intervals seen so far in this chapter of:
point estimate ± confidence coefficient × standard error
we can now state the confidence interval endpoints for the difference between two
population means with known variances.
If the population variances σ1² and σ2² are known, a 100(1 − α)% confidence interval
for µ1 − µ2 has endpoints:

x̄1 − x̄2 ± zα/2 × √(σ1²/n1 + σ2²/n2)   (5.5)

where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the
standard normal distribution, obtained using NORM.S.INV, alternatively using the
bottom row of Table 10.
Example 5.8 Two companies supplying a similar service are compared for their
reaction times (in days) to complaints. Company 1 does not offer an online reporting
portal for complaints. In a sample of n1 = 12 complaints, x̄1 = 8.5 days and it is
known that the population standard deviation is σ1 = 3.6 days (hence a known
variance of σ1² = (3.6)² = 12.96 days²).

Company 2 does offer an online reporting portal for complaints. In a sample of
n2 = 10 complaints, x̄2 = 4.8 days and it is known that the population standard
deviation is σ2 = 2.1 days (hence a known variance of σ2² = (2.1)² = 4.41 days²).

Using (5.5), we compute a 95% confidence interval for µ1 − µ2, given by:

8.5 − 4.8 ± 1.96 × √((3.6)²/12 + (2.1)²/10) ⇒ (1.28, 6.12).
Hence we are 95% confident that µ1 − µ2 lies between 1.28 days and 6.12 days.

Since this interval excludes zero (both endpoints are positive), this suggests that
µ1 > µ2, i.e. Company 1 has a slower average reaction time to complaints than
Company 2 (slower since x̄1 > x̄2, which suggests that µ1 > µ2). The presence of
the online reporting portal seems to speed up the average reaction time to
complaints.
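The computation in Example 5.8 can be verified directly (a sketch; Python is not part of the course):

```python
from math import sqrt

# Summary data from Example 5.8 (known population variances)
n1, xbar1, sigma1 = 12, 8.5, 3.6
n2, xbar2, sigma2 = 10, 4.8, 2.1
z = 1.96                                        # z_{0.025}

se = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)  # standard error of the difference
diff = xbar1 - xbar2
lo, hi = diff - z * se, diff + z * se           # (1.28, 6.12), as in the text
```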
If the population variances σ1² and σ2² are unknown, provided sample sizes n1 and n2
are large (greater than 30), an approximate 100(1 − α)% confidence interval for
µ1 − µ2 has endpoints:

x̄1 − x̄2 ± zα/2 × √(s1²/n1 + s2²/n2)   (5.6)

where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the
standard normal distribution, obtained using NORM.S.INV, or alternatively using the
bottom row of Table 10.
Example 5.9 Continuing Example 5.8, we now assume that the population
variances are unknown. Suppose random samples of complaint reaction times for
these two companies produced the following (in days):
Since we have large sample sizes, n1 = 45 > 30 and n2 = 35 > 30, we can use (5.6) to
derive an approximate 95% confidence interval for µ1 − µ2. We have:

8.5 − 4.8 ± 1.96 × √((3.6)²/45 + (2.1)²/35) ⇒ (2.44, 4.96).
Hence we are 95% confident that µ1 − µ2 lies between 2.44 days and 4.96 days.
Again, this excludes zero indicating that µ1 > µ2 .
Note that the only difference in values with respect to Example 5.8 is the larger
sample sizes (the means and standard deviations are the same values). Comparing the
widths of the two intervals, the larger samples in Example 5.9 produce a narrower, i.e.
more precise, confidence interval.

Suppose now that the two population variances are unknown but can be assumed equal
to a common variance σ², such that:

X̄1 ∼ N(µ1, σ²/n1) and X̄2 ∼ N(µ2, σ²/n2).
The pooled variance estimator, where S1² and S2² are sample variances from
samples of size n1 and n2, respectively, is:

Sp² = ((n1 − 1)S1² + (n2 − 1)S2²)/(n1 + n2 − 2).   (5.7)

Hence Sp² is the weighted average of the sample variances S1² and S2², where the
weights are:

(n1 − 1)/(n1 + n2 − 2) and (n2 − 1)/(n1 + n2 − 2)

respectively. So if n1 = n2, then we give the sample variances equal weight. Intuitively,
this should make sense. As the sample size increases, a sample variance provides a more
accurate estimate of σ². Hence if n1 ≠ n2, the sample variance calculated from the
larger sample is more reliable, and so it is given greater weight in the pooled variance
estimator. Of course, if n1 = n2, then the variances are equally reliable, hence they are
given equal weight.
If the population variances σ1² and σ2² are unknown but assumed equal, a 100(1−α)%
confidence interval for µ1 − µ2 has endpoints:

x̄1 − x̄2 ± tα/2, n1+n2−2 × √(sp²(1/n1 + 1/n2))   (5.8)

where sp² is the estimate from the pooled variance estimator (5.7), and where
tα/2, n1+n2−2 is the t-value which cuts off 100α/2% probability in the upper tail of
the Student's t distribution with n1 + n2 − 2 degrees of freedom, obtained using
T.INV, alternatively using Table 10.
An obvious problem is how to decide whether to assume the unknown variances are
equal or unequal. Consider the following points.
If σ1² = σ2², then we would expect approximately equal sample variances, i.e.
s1² ≈ s2², since both sample variances would be estimating the same (common)
variance. If the sample variances are very different, then this would suggest σ1² ≠ σ2².
If we are sampling from two ‘similar’ populations (for example, similar species of
animals) then an assumption of equal variability in these ‘similar’ populations
would be reasonable.
Example 5.10 Extending Examples 5.8 and 5.9, suppose random samples of
complaint reaction times for these two companies produced the following (in days):
Since we do not have large sample sizes, n1 = 12 < 30 and n2 = 10 < 30, we cannot
use (5.6), so instead we assume that σ1² = σ2² (a reasonable assumption for these two
populations) and so use (5.7) to estimate the common variance σ², and then (5.8) to
calculate the confidence interval.
where t0.025, 20 = 2.086 (we have estimated the common variance, so we use the t
distribution, here with 20 degrees of freedom, obtained in Excel using
=T.INV(0.975,20), or using Table 10).

Hence we are 95% confident that µ1 − µ2 lies between 1.01 days and 6.39 days.
Again, this excludes zero indicating that µ1 > µ2.
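The pooled-variance calculation of Example 5.10 can be replicated as follows (a sketch only; the endpoints differ from the text's (1.01, 6.39) by about 0.01 because the text rounds intermediate values):

```python
from math import sqrt

# Summary statistics as in Example 5.10 (small samples, equal variances assumed)
n1, xbar1, s1 = 12, 8.5, 3.6
n2, xbar2, s2 = 10, 4.8, 2.1
t_value = 2.086                                  # t_{0.025, 20}, e.g. =T.INV(0.975, 20)

# Pooled variance estimate (5.7): weighted average of the sample variances
sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
half = t_value * sqrt(sp2 * (1 / n1 + 1 / n2))
lo, hi = (xbar1 - xbar2) - half, (xbar1 - xbar2) + half   # approximately (1.00, 6.40)
```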
Note that the confidence interval width here is 6.39 − 1.01 = 5.38 days, which is
wider than the confidence intervals computed in Examples 5.8 and 5.9. This is due
to two factors:
the use of a t-value of 2.086, which is larger than the z-value of 1.96, since the t
distribution has fatter tails than the standard normal distribution, and hence a
t-value confidence coefficient is always larger than any z-value confidence
coefficient (for any column in Table 10, the t-values are greater than the
z-values in the bottom row, recalling that t∞ = N (0, 1))
with respect to Example 5.9 the sample sizes are smaller (but they are the same
as in Example 5.8).
This scenario is easy to analyse as the paired data can simply be reduced to a ‘one
sample’ analysis by working with differenced data. That is, suppose two samples
generated sample values x1 , x2 , . . . , xn and y1 , y2 , . . . , yn , respectively (note the same
number of observations, n, in each sample). We compute the differences, di for
i = 1, 2, . . . , n, using:
d1 = x1 − y1 , d2 = x2 − y2 , ..., dn = xn − yn .
Of interest is the population mean difference, µd , where:
µd = µX − µY
which is estimated using the sample mean of the differenced data, x̄d , equivalently the
difference in the sample means, such that:
x̄d = x̄ − ȳ.
Using the differenced data we then compute a confidence interval for µd using the
technique in Section 5.5.5.
Example 5.11 The table below shows the before and after weights (in pounds) of 8
adults after trying an experimental diet. We determine a 95% confidence interval for
the mean weight loss due to the experimental diet. Based on this, we can then judge
whether we are convinced that the experimental diet reduces weight, on average.
Before After
127 122
130 120
114 116
139 132
150 144
147 138
167 155
153 152
For example, d1 = 127 − 122 = 5.
We have n = 8 pairs of observations, and the sample mean and sample standard
deviation of the differenced data are, respectively, x̄d = 6 and sd = 4.66.
So a 95% confidence interval for the mean difference in weight before and after the
experimental diet is x̄d ± t0.025, n−1 × sd/√n, that is:

6 ± 2.365 × 4.66/√8 ⇒ (2.1, 9.9).
Hence we are 95% confident that the average weight loss due to the experimental
diet lies between 2.1 pounds and 9.9 pounds.
Since zero is not included in this confidence interval, we conclude that the
experimental diet does appear to reduce weight, i.e. the average weight loss appears
to be positive.
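Example 5.11 can be reproduced from the raw data (a sketch; note that `stdev` computes the sample standard deviation with divisor n − 1, as used in the text):

```python
from math import sqrt
from statistics import mean, stdev

before = [127, 130, 114, 139, 150, 147, 167, 153]
after = [122, 120, 116, 132, 144, 138, 155, 152]
d = [b - a for b, a in zip(before, after)]   # weight losses: 5, 10, -2, ...

n = len(d)
xbar_d, s_d = mean(d), stdev(d)              # 6.0 and about 4.66
t_value = 2.365                              # t_{0.025, 7}
half = t_value * s_d / sqrt(n)
lo, hi = xbar_d - half, xbar_d + half        # approximately (2.1, 9.9)
```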
We conclude the chapter by considering confidence intervals for the difference between
two population proportions.
5.10 Difference between two population proportions
As with the comparison of two population means, estimating differences between
population proportions in research allows for meaningful comparisons. For example, we
may wish to estimate the difference between population proportions to assess the
effectiveness of a marketing campaign aimed at increasing brand awareness.
The correct approach to the comparison of two population proportions, π1 and π2 , is via
the difference between the population proportions, i.e. π1 − π2 . As seen in Section 5.6,
the sample proportions P1 and P2 are, by the central limit theorem for large sample
sizes n1 and n2 , respectively, (approximately) normally distributed as:
P1 ∼ N(π1, π1(1 − π1)/n1) and P2 ∼ N(π2, π2(1 − π2)/n2).
When independent random samples are drawn from two separate populations, then
these distributions are statistically independent. Therefore, the difference between P1
and P2 is also (approximately) normally distributed such that:
P1 − P2 ∼ N(π1 − π2, π1(1 − π1)/n1 + π2(1 − π2)/n2)

and, due to independence of P1 and P2, we have that (note we add the variances):

Var(P1 − P2) = Var(P1) + Var(P2) = π1(1 − π1)/n1 + π2(1 − π2)/n2.
Recall that a linear combination of independent normal random variables also has a
normal distribution, and we have just derived its mean and variance.
The standard error is the (positive) square root of the variance of a statistic, hence the
standard error of P1 − P2 is:

S.E.(P1 − P2) = √Var(P1 − P2) = √(π1(1 − π1)/n1 + π2(1 − π2)/n2).

We see that both Var(P1 − P2) and S.E.(P1 − P2) depend on the unknown parameters
π1 and π2. So we must resort to the estimated standard error:

E.S.E.(P1 − P2) = √(P1(1 − P1)/n1 + P2(1 − P2)/n2).
We can now state the confidence interval endpoints for the difference between two
population proportions.
where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the
standard normal distribution, obtained using NORM.S.INV, alternatively using the
bottom row of Table 10.
Example 5.12 We use (5.9) to calculate 95% and 90% confidence intervals for the
difference between the population proportions of the general public who are aware of
a particular commercial product before and after an advertising campaign. Two
surveys were conducted and the results of the two random samples were:
5.11 Overview of chapter
This chapter has introduced the concept of parameter estimation, focusing on means
and proportions for one and two populations. As the values of parameters are often
unknown, we draw a random sample from the population and estimate the parameter
using an appropriate statistic (the sample mean, x̄, for µ; the sample proportion, p, for
π). While we hope a random sample is representative of the population, there is no
guarantee. This is why we convert a point estimate into an interval estimate, known as
a confidence interval. The width of a confidence interval is affected by the confidence
level (i.e. the coverage probability, often set at 95%), the sample size (larger samples
produce more accurate estimates), and the amount of variation in the
population/sample (the more heterogeneous, i.e. diverse, the population, the more
difficult it is to capture this variation in a random sample). Matters of sample size
determination were considered, and we saw how to construct and interpret confidence
intervals for means and proportions.
p = (0.3676 + 0.5324)/2 = 0.45.

(b) The (estimated) standard error when estimating a single proportion is:

√(p(1 − p)/n) = √(0.45 × 0.55)/√n = 0.4975/√n.
5.14 Solutions to Sample examination questions
Since this is a 100(1 − α)% = 99% confidence interval, then α = 0.01, so the
confidence coefficient is zα/2 = z0.005 = 2.576. Therefore, to determine n we
need to solve:
2.576 × 0.4975/√n = 0.5324 − 0.45 = 0.0824 ⇒ n = 241.89.
The correct sample size is n = 242.
Note that in questions regarding sample size determination remember to round
up when the solution is not an integer.
2. (a) Let p1 , n1 refer to the proportion of large retailers using regression and to the
total number of large retailers, respectively. Similarly, denote by p2 and n2 the
corresponding quantities for small retailers. We have p1 = 85/120 = 0.7083,
n1 = 120, p2 = 78/163 = 0.4785 and n2 = 163.
The estimated standard error is:

E.S.E.(p1 − p2) = √(p1(1 − p1)/n1 + p2(1 − p2)/n2) = 0.0570.
The correct z-value is z0.01 = 2.326, leading to the lower and upper bounds of
0.0971 and 0.3625, respectively. Presented as an interval this is (0.0971, 0.3625).
(b) We are 98% confident that the difference between the two population
proportions is between 0.0971 and 0.3625. The interval excludes zero,
suggesting there is a difference between the true proportions at the 2%
significance level.
(c) For a 94% confidence interval, the correct confidence coefficient is z0.03 = 1.88.
The sample proportions and standard error are unaffected, hence the new
interval is (0.1226, 0.3370).
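The numbers in this solution can be checked programmatically (a sketch only; Python is not part of the course):

```python
from math import sqrt

p1, n1 = 85 / 120, 120     # large retailers using regression
p2, n2 = 78 / 163, 163     # small retailers using regression
z = 2.326                  # z_{0.01} for a 98% confidence interval

ese = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # about 0.0570
diff = p1 - p2
lo, hi = diff - z * ese, diff + z * ese               # approximately (0.097, 0.362)
```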
3. (a) The confidence interval formula is:
x̄ ± tα/2, n−1 × s/√n.
The degrees of freedom are 21 − 1 = 20, and the correct t value from tables is
t0.01, 20 = 2.528. The computed confidence interval is:
165 ± 2.528 × 9/√21 ⇒ (160.04, 169.96).
(b) We seek n such that 2.528 × 9/√n ≤ 1.5 (for a confidence interval of width 3).
Solving gives n ≥ 230.07, so n = 231 (remembering to round up to the nearest
integer).
(c) The confidence interval formula remains as:
x̄ ± tα/2, n−1 × s/√n.
The correct t value is t0.005,14 = 2.977. The computed interval is:
160 ± 2.977 × 9/√15 ⇒ (153.08, 166.92).
Note the confidence intervals are not directly comparable since the one in part
(a) is for 98% confidence, while the one in part (c) is for 99% confidence.
Other things equal, a 99% confidence interval is wider than a 98% one. Also,
the sample sizes are different and, other things equal, a smaller sample size
leads to a wider confidence interval. Although there is some overlap of the
computed confidence intervals (suggesting that the mean heights are
potentially the same), a formal hypothesis test should be performed.
Chapter 6
Hypothesis testing principles
define and explain the types of errors which can be made in hypothesis testing
explain a significance level and describe the different types of statistical significance
define a p-value and explain how it is used to decide whether or not to reject the
null hypothesis
6.4 Introduction
In hypothesis testing, our objective is to choose between two opposite statements about
the population, where these statements are known as hypotheses. By convention these
are denoted by H0 and H1 , such that:
From Example 6.1 we see that we use the null hypothesis, H0 , to represent ‘no
difference’, ‘no effect’, ‘no increase’ etc., while the alternative hypothesis, H1 , represents
‘a difference’, ‘an effect’, ‘an increase’ etc.
Many statistical procedures can be represented as statements about the values of
population parameters such as the mean, µ, or variance, σ 2 . The first step in any
hypothesis testing problem is to ‘translate’ the real problem into its technical
counterpart. For example, hypotheses can be ‘translated’ into technical forms similar to:
H0 : µ = µ0 vs. H1 : µ > µ0
H 0 : µ = µ0 .
In contrast the alternative hypothesis, H1, will take one of three forms, i.e. using '≠', '<'
or '>', that is:

H1 : µ ≠ µ0 or H1 : µ < µ0 or H1 : µ > µ0.
¹ Such a null hypothesis is called a simple null hypothesis. It is possible to have a composite null
hypothesis, such as H0 : µ ≥ µ0, which allows for more than one parameter value, although we will only
focus on simple forms of H0 in this course.
Note that only one of these forms will be used per test. To determine which form to use
will require careful consideration of the wording of the problem.
The form H1 : µ ≠ µ0 is an example of a two-tailed test (or two-sided test) and we use
this form with problems worded such as 'test the hypothesis that µ is zero'. Here, there
is no indication of the value of µ if it is not zero, that is, do we assume µ > 0 or µ < 0
in such cases? We cannot be sure, so we take the safe option of a two-sided test, i.e. we
specify H1 : µ ≠ 0.
In contrast, had the problem been phrased as ‘test whether µ is greater than zero’, then
unambiguously we would opt for H1 : µ > 0. This is an example of an upper-tailed
test (or one-sided test). Similarly, ‘test whether µ is less than zero’ leads to H1 : µ < 0
which is a lower-tailed test (also a one-sided test).
Later, in Chapter 7, when testing for differences between two population means or two
population proportions, you need to look out for comparative phrases indicating if one
population parameter value should exceed the other (for example, testing whether
group A is on average faster/older/taller than group B). Practising problems will make
you proficient in correctly specifying your hypotheses.
6.5 Types of error
In any hypothesis test there are two types of inferential decision error which could be
committed. Clearly, we would like to reduce the probabilities of these errors as much as
possible. These two types of error are called Type I error and Type II error.
Both errors are undesirable and, depending on the context of the hypothesis test, it
could be argued that either one is worse than the other. (For example, which is worse, a
medical test incorrectly concluding a healthy person has an illness, or incorrectly
concluding that an ill person is healthy?) However, on balance, a Type I error is usually
considered to be more problematic. The possible decision space in hypothesis testing
can be presented as shown in Table 6.1.
Decision made
H0 not rejected H0 rejected
True state H0 true Correct decision Type I error
of nature H1 true Type II error Correct decision
For example, if H0 was being ‘innocent’ and H1 was being ‘guilty’, a Type I error would
be finding an innocent person guilty (bad for them), while a Type II error would be
finding a guilty person innocent (bad for the victim/society/justice, but admittedly
good for them!).
The complement of a Type II error probability, that is 1 − β, is called the power of the
test – the probability that the test will reject a false null hypothesis. Hence power
measures the ability of the test to reject a false H0 , and so we seek the most powerful
test for any testing situation. Hence by seeking the ‘best’ test, we mean the best in the
‘most powerful’ sense.
Example 6.2 During the Covid-19 pandemic, there was a choice of tests (in
particular, rapid antigen tests and PCR tests), so which test would you use? The
'best' test, of course. What is 'best', though? It could be in terms of speed and/or
convenience (such as the rapid antigen tests), or in terms of Covid detection
accuracy (such as the PCR tests). In terms of power alone, we would opt for PCR
tests over rapid antigen tests.
In statistical testing, we would choose the best test in terms of power alone, but we
should be mindful that on occasions we may accept use of a less powerful test for the
sake of expediency (analogous to the use of rapid antigen tests).
Example 6.3 In the rush to develop Covid-19 vaccines (who wants to be locked
down forever?), health authorities in each country would need to licence and approve
each vaccine for use. In matters of life and death, to obtain approval would require
very ‘powerful’ evidence to justify rapid rollouts. As such, each candidate vaccine
would have to undergo a series of clinical trials, with the number of test patients
increasing at each stage. For example:
Phase 1: n1 ≈ 20–100, checking for safety of the vaccine on a small scale, while
monitoring for any side-effects.
While different countries may prescribe different numerical values for each of n1, n2
and n3 (perhaps due to different clinical opinions on these requirements), note that:

n1 < n2 < n3.
6.6 Significance level
Decision made
H0 not rejected H0 rejected
True state H0 true 1−α P (Type I error) = α
of nature H1 true P (Type II error) = β Power = 1 − β
We have:

P(Type I error) = P(reject H0 | H0 is true) = α

and:

P(Type II error) = P(do not reject H0 | H1 is true) = β.
Other things being equal, if you decrease α you increase β and vice-versa. Hence there
is a trade-off. However, treating Type I errors as being more serious, this is why we
control the value of α through the significance level and then we seek the most powerful
test to minimise β, equivalently to maximise 1 − β.
a 90% confidence level in estimation has parallels with a 10% significance level in
testing (α = 0.10)
The most common significance levels are 10%, 5% and 1%, such that:
rejecting H0 at the 10% significance level reflects a weakly significant result, with
weak evidence
A sensible strategy to follow is to initially test at the 5% significance level and then test
either at the 1% or 10% significance level, depending on whether or not you have
rejected at the 5% significance level. A decision tree depiction of the procedure to
follow is presented in Figure 6.1.
[Figure 6.1: Decision tree for the choice of significance level.]
6.7 Critical values
Once this goalpost has been set, how do we actually use α to decide whether or not to
reject H0 ? For that, we can use either critical values or p-values.
The critical values are drawn from the distribution of the test statistic used for testing
H0 . The use of critical values ensures a systematic and objective approach to hypothesis
testing, providing a clear decision rule based on statistical evidence.
Figure 6.2: Rejection regions shown in red for a two-tailed z test at the 10% (left), 5%
(centre) and 1% (right) significance levels.
i.e. an upper-tailed test. Suppose a z-test is conducted, hence critical values will be
drawn from N (0, 1). For an upper-tailed test, only the right tail of the test statistic’s
distribution will form the rejection region, which has a total area of α. Using the bottom
row of Table 10 of the New Cambridge Statistical Tables, the critical values are for:
Suppose instead H1 specifies values less than that under H0, i.e. a lower-tailed test. Suppose a z-test is conducted, hence critical values will be drawn from N(0, 1). For a lower-tailed test, only the left tail of the test statistic's distribution will form the rejection region, which has a total area of α. Using the bottom row of Table 10 of the New Cambridge Statistical Tables, and noting the symmetry of the standard normal distribution about zero, the critical values are −1.282, −1.645 and −2.326 for α = 0.10, 0.05 and 0.01, respectively.
Figure 6.3: Rejection regions shown in red for an upper-tailed z test at the 10% (left),
5% (centre) and 1% (right) significance levels.
Figure 6.4: Rejection regions shown in red for a lower-tailed z test at the 10% (left), 5%
(centre) and 1% (right) significance levels.
In each of Figures 6.2 to 6.4 we see that as α decreases, the size of the rejection region
also decreases (since α is the size of the rejection region!). For t tests, the critical values
would be obtained from a Student’s t distribution with the appropriate degrees of
freedom (rather than from N (0, 1) for z tests). In all cases, if the test statistic value falls
in the rejection region, then we reject H0 as per the decision rule using critical values.
If we consider again the significance level decision tree in Figure 6.1, we can appreciate
why after rejecting H0 at the 5% significance level, say, we would not then test at the
10% significance level, since the rejection region when α = 0.05 is a subset of the
rejection region when α = 0.10, hence given the decision rule using critical values, the
test statistic value must also fall in the larger rejection region. As such, there is no
added value from moving to a 10% significance level having rejected H0 at the 5%
significance level. However, it is worthwhile to move to the 1% significance level as the
rejection region when α = 0.01 is a subset of that for α = 0.05, so there is a possibility
the test statistic value does not fall in this smaller rejection region.
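The nesting of rejection regions can be checked numerically. The sketch below uses Python's standard library `statistics.NormalDist` (my choice of tool — the guide itself works with Excel and the New Cambridge Statistical Tables) to obtain the two-tailed z-test critical values and confirm that the rejection region shrinks as α decreases:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution, N(0, 1)

# Two-tailed critical values: +/- z_{alpha/2} for each significance level.
critical = {alpha: z.inv_cdf(1 - alpha / 2) for alpha in (0.10, 0.05, 0.01)}

for alpha, c in critical.items():
    print(f"alpha = {alpha:4.2f}: reject H0 if |z| > {c:.4f}")

# As alpha decreases the critical value increases, so the rejection region
# {|z| > c} shrinks: rejecting at 1% implies rejecting at 5% and at 10%.
assert critical[0.01] > critical[0.05] > critical[0.10]
```

The printed values match Figure 6.2: ±1.645, ±1.96 and ±2.576 at the 10%, 5% and 1% significance levels, respectively.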
We now consider an alternative, but equivalent, approach to hypothesis testing, which
uses p-values instead of critical values.
6.8 P-values
We introduce p-values, which provide an alternative way for deciding whether or not to reject H0.
Definition
A p-value is the probability that the test statistic (a known function of our data) takes the observed value or a more extreme (i.e. less likely) value under H0. It is a measure of the discrepancy between the null hypothesis, H0, and the data evidence.
So, p-values may be seen as a measure of how compatible our data are with the null hypothesis, such that as the p-value gets closer to 1 the data evidence becomes more compatible with H0 (i.e. H0 seems more credible), while as the p-value gets closer to 0 the data evidence becomes less compatible with H0 (i.e. H0 seems less credible).
H0 : The new drug has no significant effect on the medical condition; it is not
effective in treating it.
Now, let us consider two scenarios in this scientific context: (a) strong evidence for
the effectiveness of the drug, and (b) weak or inconclusive evidence for the
effectiveness of the drug.
(a) In this scenario, the researchers conduct a well-designed clinical trial. The
results clearly show that patients who received the new drug experienced
significant improvements in their condition. The data are so compelling that
they provide strong evidence against the null hypothesis, H0 . The researchers
present robust evidence, including data and expert opinions from medical
professionals, supporting the effectiveness of the drug. This is similar to having
a very small p-value in hypothesis testing, indicating that the observed data are
highly inconsistent with H0 . Consequently, the researchers conclude that the
new drug is effective beyond a reasonable doubt and should be considered for
approval.
(b) In this scenario, the experimental results are inconclusive. While there might be
some indications that the new drug could be effective, the evidence is not strong
enough to justify claiming its effectiveness with confidence. The researchers’
statistical analysis shows that the p-value is relatively large, suggesting that the
data do not provide strong support for the alternative hypothesis, H1.
Additionally, there may be counterarguments presented by experts in the field,
casting doubt on the effectiveness of the drug. As a result, the researchers are
unable to confidently conclude that the new drug is effective beyond a
reasonable doubt, similar to a situation where the p-value is not sufficiently
small. Further research and evidence may be needed to make a definitive
decision about the drug’s effectiveness.
Example 6.5 Suppose we are interested in researching the mean body mass index
(BMI) of a certain population of infants. Suppose that BMI in the population is
modelled as N(µ, 25), i.e. we assume a normal distribution with mean µ and a known variance of σ² = 25. A random sample of n = 25 individuals is taken, yielding
a sample mean of 17, i.e. x̄ = 17.
Independently of the data, three experts give their own opinions as follows.
How can we assess these experts’ contradictory statements (at most one expert is
correct, they could all be incorrect)?
Here, the sampling distribution of the sample mean is:
X̄ ∼ N(µ, σ²/n) = N(µ, 25/25) = N(µ, 1).
Under H0 : µ = 16 (Dr A's claim), the probability that X̄ is at least one standard error above or below 16 is:
P(|X̄ − 16| ≥ 1) = P(X̄ ≤ 15) + P(X̄ ≥ 17)
= P(Z ≤ (15 − 16)/1) + P(Z ≥ (17 − 16)/1)
= P(Z ≤ −1) + P(Z ≥ 1)
= 0.3173
using =NORM.S.DIST(-1,1)+(1-NORM.S.DIST(1,1)), or Table 4 of the New
Cambridge Statistical Tables.
Under H0 : µ = 15 (Ms B's claim), the probability that X̄ is at least two standard errors above or below 15 is:
P(|X̄ − 15| ≥ 2) = P(X̄ ≤ 13) + P(X̄ ≥ 17)
= P(Z ≤ (13 − 15)/1) + P(Z ≥ (17 − 15)/1)
= P(Z ≤ −2) + P(Z ≥ 2)
= 0.0455
Similarly, under H0 : µ = 14 (the third expert's claim), the probability that X̄ is at least three standard errors above or below 14 is:
P(|X̄ − 14| ≥ 3) = P(Z ≤ −3) + P(Z ≥ 3) = 0.0027
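These tail probabilities are quick to reproduce. A sketch using Python's `statistics.NormalDist` (my choice of tool; the guide uses Excel's NORM.S.DIST), with X̄ ∼ N(µ0, 1) under each claimed value of µ and observed x̄ = 17:

```python
from statistics import NormalDist

x_bar = 17  # observed sample mean

def p_value(mu0):
    """Two-sided p-value for H0: mu = mu0, given X-bar ~ N(mu0, 1)."""
    sampling = NormalDist(mu=mu0, sigma=1)  # standard error of X-bar is 1 here
    k = abs(x_bar - mu0)                    # distance of x-bar from the claim
    # P(|X-bar - mu0| >= k) = P(X-bar <= mu0 - k) + P(X-bar >= mu0 + k)
    return sampling.cdf(mu0 - k) + (1 - sampling.cdf(mu0 + k))

for mu0 in (16, 15, 14):
    print(f"H0: mu = {mu0}  ->  p-value = {p_value(mu0):.4f}")
```

The three printed p-values (0.3173, 0.0455 and 0.0027) match the probabilities computed for the experts' claims in the text.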
These conditional probabilities are the p-values for the respective sets of hypotheses:
H0 : µ = µ0 vs. H1 : µ ≠ µ0
such that the greater the difference (the greater the incompatibility) between the
data evidence and the claim (in the null hypothesis), the smaller the p-value.
In summary, of the three claims the one we would be most willing to reject would be
the claim that µ = 14, because if the hypothesis µ = 14 is true, the probability of
observing x̄ = 17, or more extreme values (i.e. x̄ ≤ 11 or x̄ ≥ 17), would be as small
as 0.0027. We are comfortable with this decision, as such a small probability event
would be very unlikely to occur in a single experiment.
On the other hand, we would be far less comfortable rejecting the claim that µ = 16,
because if the hypothesis µ = 16 is true, the probability of observing x̄ = 17, or more
extreme values (i.e. x̄ ≤ 15 or x̄ ≥ 17) is much larger at 0.3173. However, this does
not imply that Dr A’s claim is necessarily true.
It is important to remember that:
We have explained that we control the probability of a Type I error through our choice of significance level, α, where α is a value in the interval [0, 1]. Since p-values are also probabilities (that is what the 'p' stands for), we simply compare p-values with our chosen benchmark significance level, α.
We now present the p-value decision rule.
When testing at the 100α% significance level, for α in the interval [0, 1]:
if the p-value ≤ α, then reject H0
if the p-value > α, then do not reject H0.
6.9. Overview of chapter
Figure 6.8: Using the p-value decision rule with a 5% significance level.
2. Define what a p-value is, and explain how it is used in hypothesis testing.
3. Compare and contrast the two general approaches to hypothesis testing: (a) the
‘critical value approach’, and (b) the ‘p-value approach’.
2. A p-value is the probability of obtaining the test statistic value, or a more extreme
value, conditional on the null hypothesis being true.
A ‘small’ p-value indicates that the data are inconsistent with H0 , while a ‘large’
p-value indicates that the data are consistent with H0 .
When testing at the 100α% significance level, for α in the interval [0, 1]:
if the p-value ≤ α, then reject H0
if the p-value > α, then do not reject H0.
6.12. Solutions to Sample examination questions
3. (a) The critical value approach does not require a p-value to be calculated, but does require critical value(s) to be obtained for a given significance level, α.
(b) The p-value approach does not require critical value(s) to be obtained, but does require the p-value to be determined for direct comparison with a given significance level, α.
Chapter 7
Hypothesis testing of means and
proportions
perform hypothesis tests for the difference between two population means and two
population proportions
7.4 Introduction
We begin by listing the necessary steps to perform a statistical test. Once introduced,
we simply apply this ‘recipe’ to different testing scenarios.
Calculating p-values
Let the test statistic be X (a random variable), and the test statistic value be x. For a test statistic distribution which is symmetric about zero, such as N(0, 1) or a Student's t distribution, the p-value is an area under the curve of the test statistic distribution. The calculation of the p-value then depends on the form of the alternative hypothesis, H1, as follows.
Lower-tailed test: p-value = P(X ≤ x)
Upper-tailed test: p-value = P(X ≥ x)
Two-tailed test: p-value = 2 × P(X ≥ |x|)
7.5. Testing a population mean claim
When testing at the 100α% significance level, for α in the interval [0, 1], then if the test statistic value:
falls in the rejection region (beyond the critical value), reject H0
does not fall in the rejection region, do not reject H0.
When testing at the 100α% significance level, for α in the interval [0, 1]:
if the p-value ≤ α, then reject H0
if the p-value > α, then do not reject H0.
The most common significance levels are 10%, 5% and 1%, such that:
rejecting H0 at the 10% significance level reflects a weakly significant result,
with weak evidence
rejecting H0 at the 5% significance level reflects a moderately significant result,
with moderate evidence
rejecting H0 at the 1% significance level reflects a highly significant result, with
strong evidence.
6. Draw conclusions. It is always important to draw conclusions in the context of the original hypotheses. This is an important final step which guides us to make a better decision about the original research hypothesis, and final conclusions should be drawn in terms of the problem.
As an example, suppose monthly expenses (in £000s) are modelled as X ∼ N(µ, σ²). To assess whether the mean monthly expenses have decreased from the assumed value of 500, sample data will be required. Suppose a random sample of n = 100 is taken, and let us assume that σ = 10 (in £000s). From Chapter 4, we know that:
X̄ ∼ N(µ, σ²/n) = N(µ, (10)²/100) = N(µ, 1).
Further, suppose that the sample mean in our random sample of n = 100 is x̄ = 497 (in
£000s). Clearly, we see that:
x̄ = 497 ≠ 500
where 500 is the claimed value of µ being tested in H0 .
The question is whether we judge the difference between x̄ = 497 and the claim µ = 500
to be:
(a) small, and hence attributable to sampling error (so we think H0 is true)
(b) large, and hence classified as statistically significant (so we think H1 is true).
Adopting the p-value approach to testing, the p-value will allow us to choose between
explanations (a) and (b). We proceed by standardising X̄, such that:
Z = (X̄ − µ)/(σ/√n) ∼ N(0, 1)
acts as our test statistic.
Using our sample data, we now obtain the test statistic value:
z = (x̄ − µ)/(σ/√n) = (497 − 500)/(10/√100) = −3.
The p-value is the probability of our test statistic value or a more extreme value, conditional on H0. Noting that H1 : µ < 500, 'more extreme' here means a z-value ≤ −3. This can be expressed as:
p-value = P(Z ≤ −3) = 0.00135.
Note this value can easily be obtained using Excel (or, in the examination, Table 4 of the New Cambridge Statistical Tables) as:
=NORM.S.DIST(-3,1)
Although the p-value is very small, indicating it is highly unlikely that this is a Type I
error, unfortunately we cannot be certain which outcome has actually occurred!
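The calculation in this example can be checked in a few lines. Here is a minimal sketch using Python's standard `statistics` module rather than Excel (the variable names are mine):

```python
from math import sqrt
from statistics import NormalDist

mu_0, sigma, n, x_bar = 500, 10, 100, 497  # H0 value, known sigma, sample size, sample mean

z = (x_bar - mu_0) / (sigma / sqrt(n))     # test statistic value
p_value = NormalDist().cdf(z)              # lower-tailed test: P(Z <= z)

print(f"z = {z:.2f}, p-value = {p_value:.5f}")
```

This reproduces z = −3 and the p-value of 0.00135 shown in Figure 7.1.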
7.6 Hypothesis test for a single mean (σ² known)
Figure 7.1: The standard normal distribution, indicating the p-value of 0.00135 in red.
Note the right-hand plot is a zoomed-in version of the left-hand tail of the left-hand plot.
We now show that the same conclusion is reached had we used the critical value
approach instead. At the 5% significance level, the critical value for this lower-tailed z
test is −1.645 (shown in Figure 6.4). Applying the critical value decision rule, since
−3 < −1.645 the test statistic value falls in the rejection region hence we reject H0 . 7
Moving to the 1% significance level (as per the significance level decision tree in Figure
6.1), the new critical value is −2.326 (also shown in Figure 6.4), hence we again reject
H0 and have a highly significant result with the same conclusion as the p-value approach.
Note that if the p-value is less than 0.01 (here it is 0.00135) then this must mean that
the test statistic value falls in the rejection region at the 1% significance level (since the
rejection region area is α). Hence using critical values or p-values will always result in
the same conclusion!
Z = (X̄ − µ0)/(σ/√n) ∼ N(0, 1).    (7.1)
Hence critical values and p-values are obtained from the standard normal
distribution, i.e. using the bottom row of Table 10 and Table 4, respectively.
Example 7.1 The mean lifetime of 100 components in a sample is 1,570 hours and
their standard deviation is known to be 120 hours. Let µ be the mean lifetime of all
the components produced. Is it likely the sample comes from a population whose
mean is 1,600 hours?
We perform a two-tailed test since we are testing whether or not µ is 1,600 hours.
(Common sense might lead us to perform a lower-tailed test since 1,570 < 1,600,
suggesting that if µ is not 1,600 hours, then it is likely to be less than 1,600 hours.
However, since this is framed as a two-tailed test, a justification for performing a
lower-tailed test would be required, should you decide to opt for a lower-tailed test.
Indeed, in principle the alternative hypothesis should be determined before data are
collected, to avoid the data biasing our choice of alternative hypothesis!)
Hence we test:
H0 : µ = 1,600 vs. H1 : µ ≠ 1,600.
Since σ (and hence σ 2 ) is known, we use (7.1) to calculate the test statistic value,
which is:
z = (x̄ − µ0)/(σ/√n) = (1,570 − 1,600)/(120/√100) = −2.50.
For this two-tailed test the p-value is 2 × P(Z ≤ −2.50) = 2 × 0.0062 = 0.0124.
7.7 Hypothesis test for a single mean (σ² unknown)
Figure 7.2: The standard normal distribution, indicating the p-value of 0.0124 in red.
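For a two-tailed z-test such as Example 7.1, the p-value doubles the probability in the more extreme tail. A hedged sketch in Python (names are mine; the guide itself uses NORM.S.DIST):

```python
from math import sqrt
from statistics import NormalDist

mu_0, sigma, n, x_bar = 1600, 120, 100, 1570  # hours

z = (x_bar - mu_0) / (sigma / sqrt(n))        # test statistic value
# Two-tailed p-value: double the probability in the more extreme tail.
p_value = 2 * NormalDist().cdf(-abs(z))

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

This reproduces z = −2.50 and the p-value of 0.0124 shown in Figure 7.2.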
Hence critical values and p-values are obtained from the Student's t distribution with n − 1 degrees of freedom, i.e. using Table 10 or the T.DIST function in Excel, respectively.
The determination of the p-value will depend on whether the test is a lower-tailed, upper-tailed or two-tailed test. In the list below, suppose the test statistic value is x with df degrees of freedom:
Lower-tailed test: =T.DIST(x, df, 1)
Upper-tailed test: =1-T.DIST(x, df, 1)
Two-tailed test: =2-2*T.DIST(ABS(x), df, 1)
We test:
H0 : µ = 10.2 vs. H1 : µ > 10.2.
Note that this is an upper-tailed test as this is the region of interest (we hypothesise that banks exercising comprehensive planning perform better than a 10.2% annual return).
The summary statistics are n = 26, x̄ = 10.5 and s = 0.714. Hence, using (7.2), the
test statistic value is:
t = (x̄ − µ0)/(s/√n) = (10.5 − 10.2)/(0.714/√26) = 2.14.
For this upper-tailed test the p-value is P(T ≥ 2.14) = 0.0212, where T ∼ t25.
7.8. Hypothesis test for a single proportion
Figure 7.3: The Student’s t distribution with 25 degrees of freedom, indicating the p-value
of 0.0212 in red.
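The test statistic value in this example follows directly from the summary statistics. Since Python's standard library has no Student's t CDF, the sketch below (names are mine) computes only the statistic; the p-value is then found from Table 10 or Excel's T.DIST, as in the text:

```python
from math import sqrt

mu_0 = 10.2                    # hypothesised annual return (%)
n, x_bar, s = 26, 10.5, 0.714  # sample size, sample mean, sample standard deviation

t = (x_bar - mu_0) / (s / sqrt(n))  # t statistic, df = n - 1
print(f"t = {t:.2f} on {n - 1} degrees of freedom")
```

This reproduces the value t = 2.14 on 25 degrees of freedom.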
Validity requires nπ0 > 5 and n(1 − π0 ) > 5. Hence critical values and p-values are
obtained from the standard normal distribution, i.e. using the bottom row of Table
10 and Table 4, respectively.
Example 7.3 To illustrate this, let us extend Example 5.4 where we consider a
survey conducted to estimate the proportion of bank customers who would be
interested in using a proposed new mobile telephone banking service. If we denote
the population proportion of customers who are interested by π, and it is found that
68 out of a random sample of 150 sampled customers were interested, then we would
estimate π with p = 68/150 = 0.453.
Suppose that other surveys have shown that 40% of the population of customers are
interested and it is proposed to test whether or not the above survey agrees with
this figure, i.e. we conduct a two-tailed test. Hence:
z = (p − π0)/√(π0(1 − π0)/n) = (0.453 − 0.40)/√(0.40 × 0.60/150) = 1.325.
For this two-tailed test the p-value is 2 × P(Z ≥ 1.325) = 2 × 0.0926 = 0.1852, and since 0.1852 > 0.10 we do not reject H0, even at the 10% significance level. Suppose instead we had anticipated a higher level of interest and conducted an upper-tailed test of H1 : π > 0.40.
The test statistic value would remain at 1.325, but now the p-value would be halved,
since =1-NORM.S.DIST(1.325,1) returns 0.0926, and since 0.0926 < 0.10 we would
now reject H0 at the 10% significance level. The test is now weakly significant, with
weak evidence that more than 40% of the population of customers is interested. Note
the p-value of 0.0926 is the red-shaded area in the right tail only in Figure 7.4.
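The single-proportion test in Example 7.3 can be sketched as follows (Python with `statistics.NormalDist` is my choice of tool; as in the guide's working, p is rounded to 0.453 before the statistic is formed):

```python
from math import sqrt
from statistics import NormalDist

n, r, pi_0 = 150, 68, 0.40
p = round(r / n, 3)          # 0.453, rounded as in the guide's working

z = (p - pi_0) / sqrt(pi_0 * (1 - pi_0) / n)  # test statistic value
upper = 1 - NormalDist().cdf(z)               # upper-tailed p-value
two_tailed = 2 * upper                        # two-tailed p-value

print(f"z = {z:.3f}, upper-tailed p = {upper:.4f}, two-tailed p = {two_tailed:.4f}")
```

This reproduces z = 1.325, the upper-tailed p-value of 0.0926 and the two-tailed p-value of 0.1852.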
7.9. Hypothesis testing of differences between parameters of two populations
While this is a different decision from that of the two-tailed test, it is not inconsistent, since the power of the test has increased with the adoption of an upper-tailed H1: halving the two-tailed p-value makes it easier to reject H0 by increasing the chance that the p-value falls below the chosen significance level.
Using critical values, for a two-tailed test at the 5% significance level, the critical
values are ±1.96 (shown in Figure 6.2). Since 1.325 < 1.96 the test statistic value
does not fall in the rejection region hence we do not reject H0 . Moving to the 10%
significance level (as per Figure 6.1), the new critical values are ±1.645 (also shown
in Figure 6.2), hence again we do not reject H0 since 1.325 < 1.645 and so the test
statistic value does not fall in the rejection region. The test is not statistically
significant, as per above.
Figure 7.4: The standard normal distribution, indicating the p-value of 0.1852 in red.
Hence critical values and p-values are obtained from the standard normal
distribution, i.e. using the bottom row of Table 10 and Table 4, respectively.
Note if testing for the equality of means, then µ1 − µ2 = 0 under H0 . Hence, in (7.4),
we set the term (µ1 − µ2 ) equal to 0.
7.10. Difference between two population means
Example 7.4 Suppose we are interested in researching the average response time
of two different customer support teams, 1 and 2, in a company. Managers want to
determine if there is a statistically significant difference in the mean response time of
these two support teams.
We test:
H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2.
We assume that the data for each team follow a normal distribution, and that the population variances are known to be σ1² = 400 minutes² and σ2² = 425 minutes², respectively.
Managers collected a random sample of 30 customer support interactions from each
team (hence n1 = n2 = 30) and measured their response times, with sample means
of x̄1 = 150 minutes and x̄2 = 165 minutes.
Using (7.4), the test statistic value is:
z = (x̄1 − x̄2)/√(σ1²/n1 + σ2²/n2) = (150 − 165)/√(400/30 + 425/30) = −2.86.
For this two-tailed test the p-value is 2 × P(Z ≤ −2.86) = 0.0042, so we reject H0 at both the 5% and 1% significance levels: there is strong evidence of a difference in the mean response times of the two teams.
Figure 7.5: The standard normal distribution, indicating the p-value of 0.0042 in red.
If the population variances σ1² and σ2² are unknown, provided the sample sizes n1 and n2 are large (greater than 30):
Z = (X̄1 − X̄2 − (µ1 − µ2))/√(S1²/n1 + S2²/n2) ∼ N(0, 1) (approximately, for large n1 and n2).    (7.5)
Hence critical values and p-values are obtained from the standard normal
distribution, i.e. using the bottom row of Table 10 and Table 4, respectively.
Note if testing for the equality of means, then µ1 − µ2 = 0 under H0 . Hence, in (7.5),
we set the term (µ1 − µ2 ) equal to 0.
Example 7.5 Extending Example 7.4, suppose the population variances are now
unknown. Using the samples, suppose the sample variances are s21 = 550 minutes2
and s22 = 575 minutes2 .
We still test:
H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2.
Using (7.5), the test statistic value is:
z = (x̄1 − x̄2)/√(s1²/n1 + s2²/n2) = (150 − 165)/√(550/30 + 575/30) = −2.45.
For this two-tailed test the p-value is 2 × P(Z ≤ −2.45) = 0.0143.
Hence we reject H0 only at the 5% significance level (not at the 1% significance level,
since 0.0143 > 0.01) and conclude that the test is moderately significant, with
moderate evidence of a difference in the mean response times of the two customer
support teams. Again, since x̄1 < x̄2 , it is likely that µ1 < µ2 (rather than merely
concluding that they are not equal).
Using critical values, for a two-tailed test at the 5% significance level, the critical
values are ±1.96 (shown in Figure 6.2). Since −2.45 < −1.96 the test statistic value
falls in the rejection region hence we reject H0 . Moving to the 1% significance level
(as per Figure 6.1), the new critical values are ±2.576 (also shown in Figure 6.2),
hence we do not reject H0 since −2.45 > −2.576 and so the test statistic value does
not fall in the rejection region. The test is moderately significant, as per above.
Note that the (absolute) test statistic value is smaller here relative to that in
Example 7.4 due to the sample variances being larger than the (known) population
variances. Given the inverse relationship between the test statistic value and the
p-value, the p-value in this example is inevitably larger (0.0143 vs. 0.0042), and now
only significant at the 5% significance level.
Figure 7.6: The standard normal distribution, indicating the p-value of 0.0143 in red.
Although still unknown, if we assume the population variances are equal to some common variance, i.e. that σ1² = σ2² = σ², then we only have one (common) unknown variance to estimate. As with confidence intervals in Section 5.9.3, we utilise the pooled variance estimator, given by:
If the population variances σ1² and σ2² are unknown but assumed equal, then:
T = (X̄1 − X̄2 − (µ1 − µ2))/√(Sp²(1/n1 + 1/n2)) ∼ t_{n1+n2−2}    (7.6)
where Sp² is the pooled variance estimator. Hence critical values and p-values are obtained from the Student's t distribution with n1 + n2 − 2 degrees of freedom, i.e. using Table 10 or the T.DIST function in Excel, respectively.
Note if testing for the equality of means, then µ1 − µ2 = 0 under H0. Hence, in (7.6), we set the term (µ1 − µ2) equal to 0.
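The pooled two-sample t statistic can be sketched as a small function. Since the reaction-times data of Example 7.6 are not reproduced above, the quick check below uses hypothetical summary statistics of my own, chosen only so that the degrees of freedom match the example (12 + 10 − 2 = 20):

```python
from math import sqrt

def pooled_t(n1, x_bar_1, s1_sq, n2, x_bar_2, s2_sq):
    """Two-sample t statistic assuming sigma1^2 = sigma2^2, testing H0: mu1 = mu2."""
    # Pooled variance estimator: weighted average of the two sample variances.
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    t = (x_bar_1 - x_bar_2) / sqrt(sp_sq * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical illustration: sample sizes 12 and 10, hence df = 20 as in Example 7.6.
t, df = pooled_t(12, 8.0, 4.0, 10, 6.0, 4.0)
print(f"t = {t:.2f} on {df} degrees of freedom")
```

With equal sample variances the pooled estimator simply returns that common variance, which is a useful sanity check on the weighting.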
Example 7.6 To illustrate this, let us extend Example 5.10 and consider the
complaint reaction times of two similar companies. Random samples gave the
following statistics (in days):
We want to test for a difference between mean reaction times, i.e. we test:
H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2.
There are 12 + 10 − 2 = 20 degrees of freedom, hence we obtain the p-value from the
t20 distribution.
The p-value is returned by 2-2*T.DIST(ABS(2.87),20,1) which gives 0.0095. It is
shown as the red-shaded area in Figure 7.7. Since 0.0095 < 0.01 (just!), we reject H0
at the 1% significance level. On this basis, we have a highly significant result, and we
conclude that there is strong evidence that the mean reaction times of the two
companies are different. Indeed, it appears that Company 1 is slower, on average,
than Company 2, since x̄1 > x̄2 .
Using critical values, at the 5% significance level, the critical values for this
two-tailed t test are ±2.086 (using Table 10 for a t20 distribution). Since
2.87 > 2.086 the test statistic value falls in the rejection region hence we reject H0 .
Moving to the 1% significance level (as per Figure 6.1), the new critical values are
±2.845 (from Table 10), hence we reject H0 since 2.87 > 2.845 (just!) and so the test
statistic value again falls in the rejection region. We have a highly significant result
with the same conclusion as the p-value approach.
Figure 7.7: The Student’s t distribution with 20 degrees of freedom, indicating the p-value
of 0.0095 in red.
Using the sample mean and sample standard deviation of the differenced data:
T = (X̄d − µd)/(Sd/√n) ∼ t_{n−1}.    (7.7)
Hence critical values and p-values are obtained from the Student’s t distribution with
n−1 degrees of freedom, i.e. using Table 10 or T.DIST function in Excel, respectively.
Example 7.7 Extending Example 5.11, the table below shows the before and after
weights (in pounds) of 8 adults after trying an experimental diet. We test whether
there is evidence that the diet is effective.
Before After
127 122
130 120
114 116
139 132
150 144
147 138
167 155
153 152
We want to test:
H0 : µ1 = µ2 vs. H1 : µ1 > µ2
which is equivalent to testing:
H0 : µd = 0 vs. H1 : µd > 0
where we choose a one-tailed test because we are looking for a reduction (if there is any change, we would expect weight loss from a diet!) and we define µd = µ1 − µ2 since we anticipate that this way round the values will (more likely) be positive.
The differences (calculated as 'Before − After') are:
5, 10, −2, 7, 6, 9, 12 and 1.
For example, the first difference is 127 − 122 = 5. The sample mean and sample standard deviation of the differences are x̄d = 6 and sd = 4.66, hence, using (7.7), the test statistic value is t = 6/(4.66/√8) = 3.64, giving a p-value of P(T ≥ 3.64) = 0.0041 where T ∼ t7.
7.11. Difference between two population proportions
Figure 7.8: The Student’s t distribution with 7 degrees of freedom, indicating the p-value
of 0.0041 in red.
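The paired-differences working in Example 7.7 can be reproduced directly from the table. A sketch using Python's `statistics` module (names are mine; the p-value would still come from the t7 distribution via tables or Excel):

```python
from math import sqrt
from statistics import mean, stdev

before = [127, 130, 114, 139, 150, 147, 167, 153]
after = [122, 120, 116, 132, 144, 138, 155, 152]

d = [b - a for b, a in zip(before, after)]  # differences: Before - After
n = len(d)

x_bar_d = mean(d)                    # sample mean of the differences
s_d = stdev(d)                       # sample standard deviation of the differences
t = (x_bar_d - 0) / (s_d / sqrt(n))  # test statistic under H0: mu_d = 0

print(f"differences = {d}")
print(f"t = {t:.2f} on {n - 1} degrees of freedom")
```

This gives x̄d = 6 and t = 3.64 on 7 degrees of freedom, matching the p-value of 0.0041 in Figure 7.8.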
We derive the test statistic by standardising the (approximate, by the central limit theorem) sampling distribution of the difference between two independent sample proportions, P1 − P2, given by:
P1 − P2 ∼ N(π1 − π2, π1(1 − π1)/n1 + π2(1 − π2)/n2)
which, upon standardising, gives:
(P1 − P2 − (π1 − π2))/√(π1(1 − π1)/n1 + π2(1 − π2)/n2) ∼ N(0, 1)
approximately, for large n1 and n2 . However, when evaluating this test statistic, which
values do we use for π1 and π2 ? In the test of a single proportion, we had H0 : π = π0 ,
where π0 is the tested (known) value.
When comparing two proportions, under H0 no value is given for π1 and π2 , only that
they are equal, that is:
π1 = π2 = π
where π is the common proportion whose value, of course, is still unknown! Hence we
need to estimate π from the sample data using the pooled proportion estimator.
Z = (P1 − P2 − (π1 − π2))/√(P(1 − P)(1/n1 + 1/n2)) ∼ N(0, 1)    (7.9)
(approximately, for large n1 and n2).
Hence critical values and p-values are obtained from the standard normal
distribution, i.e. using the bottom row of Table 10 and Table 4, respectively.
Example 7.8 To illustrate this, let us extend Example 5.12 by testing for a difference between the population proportions of the general public who are aware of a particular commercial product before and after an advertising campaign. Two surveys were conducted and the results of the two random samples were:
If π1 and π2 are the true population proportions for ‘after’ and ‘before’ the
campaign, respectively, then we wish to test:
H0 : π1 = π2 vs. H1 : π1 > π2 .
Note that we use an upper-tailed test on the assumption that the advertising
campaign would increase the proportion aware – an example of the importance of
using common sense in determining the alternative hypothesis!
For the sample proportions, we have:
p1 = r1/n1 = 65/120 = 0.5417 and p2 = r2/n2 = 68/150 = 0.4533.
On the assumption that H0 is true, we estimate the common proportion, π, using (7.8), to be:
(65 + 68)/(120 + 150) = 133/270 = 0.4926.
Using (7.9), the test statistic value is then:
z = (0.5417 − 0.4533)/√(0.4926 × (1 − 0.4926) × (1/120 + 1/150)) = 1.44
giving a p-value for this upper-tailed test of P(Z ≥ 1.44) = 0.0749.
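The two-proportion test of Example 7.8 can be sketched as follows (Python with `statistics.NormalDist` is my choice of tool; note that working with unrounded proportions gives a p-value of about 0.0746 rather than the 0.0749 obtained when z is first rounded to 1.44):

```python
from math import sqrt
from statistics import NormalDist

r1, n1 = 65, 120  # aware, after the campaign
r2, n2 = 68, 150  # aware, before the campaign

p1, p2 = r1 / n1, r2 / n2
p_pool = (r1 + r2) / (n1 + n2)  # pooled proportion estimate under H0

z = (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
p_value = 1 - NormalDist().cdf(z)  # upper-tailed: H1 is pi1 > pi2

print(f"pooled proportion = {p_pool:.4f}, z = {z:.2f}, p-value = {p_value:.4f}")
```

This reproduces the pooled proportion 0.4926 and z = 1.44, consistent with Figure 7.9.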
7.12. Overview of chapter
Figure 7.9: The standard normal distribution, indicating the p-value of 0.0749 in red.
7.15. Solutions to Sample examination questions
1. (a) Let µ1 denote the mean examination mark for the intensive tutoring group and
µ2 denote the mean examination mark for the standard tutoring group. The
wording ‘whether there is a difference between the mean examination marks’
implies a two-tailed test, hence the hypotheses can be written as:
H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2.
If equal variances are assumed, the test statistic value is 2.1449. If equal
variances are not assumed, the test statistic value is 2.1164.
The variances are unknown but the sample size is large enough, so the standard
normal distribution can be used. The t40 distribution (the nearest degrees of
freedom available in Table 10) is also correct and will be used in what follows.
The critical values at the 5% significance level are ±2.021, hence we reject the
null hypothesis. Moving to the 1% significance level, the critical values are
±2.704, so we do not reject H0 . We conclude that the test is moderately
significant such that there is moderate evidence of a difference between the two
tutoring groups.
(b) The assumptions for part (a) relate to the following.
• Assumption about whether the variances are equal, i.e. whether σ1² = σ2² or σ1² ≠ σ2².
• Assumption about whether n1 + n2 is ‘large’ so that the normality
assumption is satisfied.
• Assumption about the samples being independent.
• Assumption about whether a normal or t distribution is used.
2. (a) Let πT denote the true probability for the new treatment to work. We test:
H0 : πT = 0.50 vs. H1 : πT > 0.50.
Under H0, the test statistic is:
(P − 0.50)/√(0.50 × (1 − 0.50)/n) ∼ N(0, 1)
approximately due to the central limit theorem, since here n = 30 is (just about) large enough. Hence the test statistic value is:
(20/30 − 0.50)/0.0913 = 1.83.
For α = 0.05, the critical value is 1.645. Since 1.83 > 1.645, we reject H0 at the
5% significance level. Moving to the 1% significance level, the critical value is
2.326, so we do not reject H0 . The test is moderately significant, with moderate
evidence that the treatment is effective.
(b) Let πP denote the true probability for the patient to recover with the placebo.
We test:
H0 : πT = πP vs. H1 : πT > πP .
For reference, the test statistic is:
(PT − PP)/√(P(1 − P)(1/n1 + 1/n2)) ∼ N(0, 1)
approximately, due to the central limit theorem.
The standard error is:
S.E.(PT − PP) = √((41/70) × (29/70) × (1/40 + 1/30)) = 0.119
and so the test statistic value is:
(20/30 − 21/40)/0.119 = 1.191.
For α = 0.05, the critical value is 1.645. Since 1.191 < 1.645, we do not reject
H0 at the 5% significance level. Moving to the 10% significance level, the
critical value is 1.282, so again we do not reject H0 . The test is not statistically
significant. There is insufficient evidence of higher effectiveness.
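Both parts of this solution can be checked in a few lines. A sketch (Python rather than the examination tables; names are mine):

```python
from math import sqrt

# Part (a): single proportion, H0: pi_T = 0.50, with 20 recoveries out of n = 30.
n_T, r_T = 30, 20
p_T = r_T / n_T
z_a = (p_T - 0.50) / sqrt(0.50 * 0.50 / n_T)

# Part (b): two proportions, pooled estimate (20 + 21)/(30 + 40) = 41/70.
n_P, r_P = 40, 21
p_P = r_P / n_P
p_pool = (r_T + r_P) / (n_T + n_P)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_T + 1 / n_P))  # standard error under H0
z_b = (p_T - p_P) / se

print(f"(a) z = {z_a:.2f}   (b) z = {z_b:.3f}")
```

This reproduces the test statistic values 1.83 and 1.191 quoted in the solutions.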
Chapter 8
Contingency tables and the
chi-squared test
8.4 Introduction
In Chapter 7 we focused on testing the value of a population parameter of interest, such
as a mean, µ, or a proportion, π (and differences between means and proportions). Being
able to perform such statistical tests is of particular use when undertaking research.
Here we shall look at two additional testing procedures, one which deals with testing for
association between two categorical variables (introduced in Chapter 2), and a second
which considers the shape of the distribution from which the sample data were drawn.
Example 8.1 Suppose that we are sampling people, and that one factor of interest
is hair colour (black, blonde, brown etc.) while another factor of interest is eye
colour (blue, brown, green etc.). In this example, each sampled person has one level
of each factor. We wish to test whether or not these factors are associated.
Therefore, we have the following.
8.6. Tests for association
So, under H0 , the distribution of eye colour is the same regardless of hair colour,
whereas if H1 is true it may be attributable to blonde-haired people having a
(significantly) higher proportion of blue eyes, say.
The association might also depend on the sex of the person, and that would be a
third factor which was associated with (i.e. interacted with) both of the other
factors, however we will not consider interactions in this course.
The main way of analysing these questions is by using a contingency table, discussed
in the next section.
In a contingency table, also known as a cross-tabulation, the data are in the form
of frequencies (counts), where the observations are organised in cross-tabulated
categories. We sample a certain number of units (people, perhaps) and classify them
according to the two factors of interest.
Example 8.2 In three areas of a city, a record has been kept of the numbers of
burglaries, robberies and car thefts which take place in a year. The total number of
offences was 150, and they were divided into the various categories as shown in the
following contingency table:
The cell frequencies are known as observed frequencies and show how the data
are spread across the different combinations of factor levels. The first step in any
analysis is to complete the row and column totals (as already done in this table).
1 When conducting tests for association, the null hypothesis can be expressed either as ‘There is no
association between categorical variables X and Y ’, or as ‘Categorical variables X and Y are independent’.
The corresponding alternative hypothesis would then replace ‘no association’ or ‘independent’ with ‘an
association’ or ‘not independent (dependent)’, respectively.
Example 8.3 For the data in Example 8.2, if a record was selected at random from
the 150 records:
The expected frequency, E_ij, for the cell in row i and column j of a contingency
table with r rows and c columns, is:

E_ij = (row i total × column j total) / (total number of observations)

where i = 1, 2, . . . , r and j = 1, 2, . . . , c.
Example 8.4 The completed expected frequency table for the data in Example 8.2
is (rounding to two decimal places, which is recommended):
Make sure you can replicate these expected frequencies using your own calculator.
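As a check on hand calculations, the expected frequencies can also be computed programmatically. The sketch below uses a small hypothetical 2 × 3 table of observed counts (these are not the Example 8.2 data, which you should still replicate by hand):

```python
# Sketch: expected frequencies for a contingency table.
# The observed counts below are hypothetical, purely for illustration.
observed = [
    [10, 20, 30],   # row 1
    [20, 30, 40],   # row 2
]

n = sum(sum(row) for row in observed)            # total number of observations
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# E_ij = (row i total * column j total) / n
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

for row in expected:
    print([round(e, 2) for e in row])
```

Each expected frequency uses only the corresponding row total, column total and the overall total, exactly as in the formula above.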
χ2 test of association
Let the contingency table have r rows and c columns, then formally the test statistic
used for tests of association is:
Σ_{i=1}^{r} Σ_{j=1}^{c} (O_ij − E_ij)² / E_ij ∼ χ²_{(r−1)(c−1)}.   (8.1)
Hence critical values are obtained from the χ² distribution with (r − 1)(c − 1) degrees
of freedom in Table 8 of the New Cambridge Statistical Tables, and p-values are
obtained in Excel using:

=CHISQ.DIST.RT(test statistic value, degrees of freedom)
Notice the ‘double summation’ here just means summing over all rows and all columns.
This test statistic follows an (approximate) chi-squared distribution with (r − 1)(c − 1)
degrees of freedom, where r and c denote the number of rows and columns, respectively,
in the contingency table. The approximation is reasonable provided all the expected
frequencies are at least 5.
For an r × c contingency table, we begin with rc cells. We lose one degree of freedom for
needing to use the total number of observations to compute the expected frequencies.
However, we also use the row and column totals in these calculations, but we only need
r − 1 row totals and c − 1 column totals, as the final one in each case can be deduced
using the total number of observations. Hence we only lose r − 1 degrees of freedom for
the row totals. Similarly, we only lose c − 1 degrees of freedom for the column totals.
Hence the overall degrees of freedom are:
k = rc − (r − 1) − (c − 1) − 1 = (r − 1)(c − 1).
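The whole test of association can be sketched in a few lines of Python. The contingency table below is again hypothetical; the helper `chi2_sf_even` computes the chi-squared survival function (the CHISQ.DIST.RT value) for even degrees of freedom only, which suffices for both degrees of freedom used here, and is checked against the p-value quoted later for Example 8.5:

```python
import math

def chi2_sf_even(x, df):
    """P(X > x) for a chi-squared distribution with EVEN degrees of freedom,
    via the Poisson-sum identity:
    P(X > x) = exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!"""
    assert df % 2 == 0 and df > 0
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(df // 2))

def association_test(observed):
    """Test statistic (8.1) and degrees of freedom (r - 1)(c - 1)."""
    r, c = len(observed), len(observed[0])
    n = sum(map(sum, observed))
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    stat = sum((observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i in range(r) for j in range(c))
    return stat, (r - 1) * (c - 1)

# Hypothetical 2x3 table (not the Example 8.2 crime data):
stat, df = association_test([[10, 20, 30], [20, 30, 40]])
print(round(stat, 3), df)   # df = (2 - 1)(3 - 1) = 2

# Check against Example 8.5 (test statistic 23.13 with 4 degrees of freedom):
print(round(chi2_sf_even(23.13, 4), 6))
```

For even degrees of freedom the survival function has a closed form, so no statistical library is needed; for odd degrees of freedom you would use a library routine instead.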
[Figure: probability density functions of the χ² distribution for various degrees of freedom: k = 1, 2, 4 and 6 (left panel) and k = 10, 20, 30 and 40 (right panel).]
Example 8.5 Using the data in Example 8.2, we proceed with the hypothesis test.
We test:
H0 : No association between area and type of crime
versus:
H1 : Association between area and type of crime.
Note it is advisable to present your calculations as an extended contingency table as
shown below, where the three rows in each cell correspond to the observed
frequencies, the expected frequencies and the test statistic contributors, respectively.
=CHISQ.DIST.RT(23.13,4)
which returns a p-value of 0.000119. Using the p-value decision rule, since
0.000119 ≤ 0.01, we reject H0 at the 1% significance level. The test is highly
significant, with strong evidence of an association between area and type of crime.
The conclusions in Example 8.5 are fairly obvious, given the small dimensions of the
contingency table. However, for data involving more factors, and more factor levels,
conclusions are rarely so obvious by inspection, which is why a formal test is needed.

8.7 Goodness-of-fit tests
This justifies the name of a goodness-of-fit test, since we are testing whether or not a
particular probability distribution provides an adequate fit to the observed data. The
null hypothesis will assert that a specific hypothetical population distribution is the
true one. The alternative hypothesis is that this specific distribution is not the true one.
Goodness-of-fit tests can be performed for a variety of probability distributions.
However, there is a special case which we shall consider here, applicable when dealing
with a single variable: testing whether the sample data are drawn from a discrete
uniform distribution, i.e. whether each characteristic is equally likely.
Example 8.7 Extending Example 8.6, for a die the obvious classifications would be
the six faces.
If the die is thrown n times, then our observed frequency data would be the number
of times each face appeared. Here k = 6.
Recall that in hypothesis testing we always assume that the null hypothesis, H0 , is true.
In order to conduct a goodness-of-fit test, the expected frequencies are computed
conditional on the probability distribution expressed in H0 . The test statistic will then
involve a comparison of the observed and expected frequencies. In broad terms, if H0 is
true, then we would expect small differences between these two sets of frequencies, while
large differences would be inconsistent with H0 .
We now consider how to compute the expected frequencies for discrete uniform
probability distributions.
For discrete uniform probability distributions, the expected frequencies are computed
as:
E_i = n × (1/k)   for i = 1, 2, . . . , k

where n denotes the sample size and 1/k is the uniform (i.e. equal, same) probability
for each characteristic.
Expected frequencies should not be rounded, just as we do not round sample means,
say. Note that the final expected frequency (for the kth category) can easily be
computed using the formula:
E_k = n − Σ_{i=1}^{k−1} E_i.
This is because we have a constraint that the sum of the observed and expected
frequencies must be equal,2 i.e. we have:
Σ_{i=1}^{k} O_i = Σ_{i=1}^{k} E_i.
2 Recall the motivation for computing expected frequencies in the first place. Assuming H0 is true, we
want to know how a random sample of size n is expected to be distributed across the k categories.
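A quick sketch, with hypothetical values n = 100 and k = 6 (say, a fair die thrown 100 times), illustrates both the 'do not round' advice and the shortcut for the final expected frequency:

```python
# Expected frequencies under a discrete uniform distribution.
# Hypothetical case: n = 100 observations over k = 6 categories (die faces).
n, k = 100, 6

expected = [n * (1 / k) for _ in range(k)]   # E_i = n/k, left unrounded

# Shortcut for the final category: E_k = n - sum of the first k-1 values
e_last = n - sum(expected[:-1])

print(expected[0])                           # 16.666..., not rounded to 17
print(abs(e_last - expected[-1]) < 1e-9)     # the shortcut agrees
```

Note that the expected frequencies sum to n exactly, which is the constraint used above.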
Note that this test statistic does not have a true χ2k−1 distribution under H0 , rather it is
only an approximating distribution. An important point to note is that this
approximation is only good enough provided all the expected frequencies are at least 5.
(In cases where one or more expected frequencies are less than 5, we merge categories
with neighbouring ones until the condition is satisfied.)
As seen in (8.2), there are k − 1 degrees of freedom when testing a discrete uniform
distribution. k is the number of categories (after merging), and we lose one degree of
freedom due to the constraint that:
Σ_{i=1}^{k} O_i = Σ_{i=1}^{k} E_i.
As with tests of association, goodness-of-fit tests are upper-tailed tests as, under H0 , we
would expect to see small differences between the observed and expected frequencies, as
the expected frequencies are computed conditional on H0 . Hence large test statistic
values are considered extreme under H0 , since these arise due to large differences
between the observed and expected frequencies.
We test:

H0 : There is no difference in preference for the wrapper types

versus:

H1 : There is a difference in preference for the wrapper types.
Note that under H0 , this is the same as testing for the suitability of the discrete
uniform distribution, i.e. that each wrapper type is equally likely.
How do we work out the expected frequencies? Well, for equal preferences, across
three wrapper types (hence k = 3), the expected frequencies will be:
E_i = 33 × (1/3) = 11   for i = 1, 2 and 3.

Applying (8.2), the test statistic value is:

Σ_{i=1}^{3} (O_i − E_i)² / E_i = (8 − 11)²/11 + (10 − 11)²/11 + (15 − 11)²/11 = 2.364.
In Excel:

=CHISQ.DIST.RT(2.364,2)

which returns a p-value of 0.3067. Using the p-value decision rule, since 0.3067 > 0.10,
we cannot reject H0 , even at the 10% significance level. Hence the test is not
significant. Therefore, there is insufficient evidence of a difference in preference for
the wrapper types.
Using critical values, at the 5% significance level, the critical value for this
upper-tailed chi-squared test is 5.991 (using Table 8 with ν = 2). Since 2.364 < 5.991
the test statistic value does not fall in the rejection region hence we do not reject H0 .
Moving to the 10% significance level (as per Figure 6.1), the new critical value is
4.605 (again using Table 8), hence we also do not reject H0 since 2.364 < 4.605 and
so the test statistic value again does not fall in the rejection region. We have an
insignificant result with the same conclusion as the p-value approach.
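The whole wrapper-type calculation can be replicated in Python; for ν = 2 the chi-squared survival function reduces to exp(−x/2), so no statistical library is needed:

```python
import math

# Reproducing the wrapper-type goodness-of-fit calculation: observed
# frequencies for the three wrapper types, with E_i = 33 * (1/3) = 11 under H0.
observed = [8, 10, 15]
n, k = sum(observed), len(observed)
expected = [n / k] * k

stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = k - 1

# For df = 2 the chi-squared survival function is exp(-x/2),
# matching Excel's =CHISQ.DIST.RT(2.364,2).
p_value = math.exp(-stat / 2)

print(round(stat, 3))     # 2.364
print(round(p_value, 4))  # 0.3067
print(stat > 4.605)       # in the 10% rejection region? False
```

Both decision rules agree: the p-value exceeds 0.10, and the test statistic falls short of even the 10% critical value.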
(a) Based on the data in the table, and without conducting any significance test,
would you say there is an association between age group and watch preference?
Provide a brief justification for your answer.
(b) Is there any evidence of an association between age group and watch
preference?
3. Set out the null and alternative hypotheses, degrees of freedom, expected
frequencies, and 10%, 5% and 1% critical values for the following problem. The
following figures give the number of births by season in a town.
The number of days per season in this country are 93 (Spring), 80 (Summer), 100
(Autumn) and 92 (Winter). Is there any evidence that births vary over the year?
Hint: You would expect, if the births are uniformly distributed over the year, that
the number of births would be proportional to the number of days per season. So
work out your expected frequencies by taking the number of days per season divided
by the number of days in the year and multiplying by the total number of births over
the year.
8.11 Solutions to Sample examination questions

2. (a) There are some differences between younger and older people regarding watch
preference. More specifically, 16% of younger people prefer an analogue watch
compared to 48% for people over 30. Hence there seems to be an association
between age and watch preference, although this needs to be investigated
further.
(b) Set out the null hypothesis that there is no association between age and watch
preference against the alternative, that there is an association. Be careful to
get these the correct way round!
H0 : No association between age group and watch preference.
H1 : Association between age group and watch preference.
which gives a test statistic value of 24.146 (make sure you can replicate this).
This is a 2 × 3 contingency table so the degrees of freedom are
(2 − 1) × (3 − 1) = 2.
For α = 0.05, the critical value is 5.991, hence reject H0 since 24.146 > 5.991.
For α = 0.01, the critical value is 9.210, hence reject H0 since 24.146 > 9.210.
We conclude that the test is highly significant, with strong evidence of an
association between age group and watch preference.
Spring: 730 × 93/365 = 186
Summer: 730 × 80/365 = 160
Autumn: 730 × 100/365 = 200
Winter: 730 × 92/365 = 184
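The same four expected frequencies can be reproduced in a couple of lines of Python:

```python
# Expected births per season, proportional to the number of days in each
# season (730 births in total over a 365-day year).
days = {"Spring": 93, "Summer": 80, "Autumn": 100, "Winter": 92}
total_births = 730

expected = {season: total_births * d / 365 for season, d in days.items()}
print(expected)   # {'Spring': 186.0, 'Summer': 160.0, 'Autumn': 200.0, 'Winter': 184.0}
```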
The degrees of freedom are the number of categories minus 1 so, as there are four
seasons given, we have 4 − 1 = 3 degrees of freedom. Looking at Table 8 of the New
Cambridge Statistical Tables, we see that the critical value at the 5% significance
level is 7.815, hence we reject H0 since 62.35 > 7.815. Moving to the 1% significance
level, the critical value is now 11.34, so again we reject H0 as 62.35 > 11.34. We
conclude that the test is highly significant, with very strong evidence that the
number of births varies over the year by season.
If we look again at the observed and expected values we see:
Only the observed winter births are anywhere near the expected value – both
summer and autumn show higher births than expected, while spring is much lower.
Chapter 9
Sampling and experimental design
discuss the various methods of contact which may be used in a survey and the
related implications
9.4 Introduction
We proceed to describe the main stages of a survey and the sources of error at each
stage. This part of the course is the foundation for your future work in applied social
science, including business and management. There is not much point in learning the
various statistical techniques we have introduced you to in the rest of the subject guide
unless you understand the sources and limitations of the data you are using. This is
important to academics and policymakers alike!
9.6 Types of sampling techniques
Primary data are new data collected by researchers for a particular purpose.
Secondary data refer to existing data which have already been collected by others
or for another purpose.
Here we focus on the collection of primary data using various sampling techniques.
Non-probability (non-random) sampling methods have the following characteristics.

Some population units have no chance (i.e. a zero probability) of being selected.

Units which can be selected have an unknown (non-zero) probability of selection.

Sampling errors cannot be quantified.

They are used in the absence of a sampling frame, which is a list of all the
individual units in a population, serving as the basis for selecting a sample.
Convenience sampling
Judgemental sampling
Quota sampling
In quota sampling researchers divide the population into subgroups based on certain
characteristics, such as age, gender, or socio-economic status, known as quota
controls. The researcher then sets quotas (hence the name!) for each subgroup and
selects participants non-randomly until each quota is filled. This method aims to ensure
representation of key characteristics within the sample and requires the distribution of
these characteristics in the population to be (approximately) known in order to
replicate it in the sample. While quota sampling provides a structured approach, it may
introduce selection bias if the quotas are not well-defined or if the selection within each
quota is not random. Quota sampling is useful in the absence of a sampling frame, since
non-probability sampling techniques do not require a sampling frame. Another reason
for conducting a quota sample, instead of a random sample, might be speed. We may be
in a hurry and not want to spend time organising interviewers for a random sample – it
is much quicker to set target numbers (quotas) to interview.
Example 9.4 In a market survey for a new tech product, a company employs
quota sampling to ensure diverse participant representation. The population is
categorised by age groups and income levels. Quotas are set for each category, and
researchers purposefully select participants to meet these quotas. This method helps
capture insights from various demographic segments, guiding the company’s
marketing strategy for a more comprehensive market understanding.
Example 9.5 You would likely use a quota sample in the following situations.
When accuracy is not important. You may not need to have an answer to your
research question(s) to a high, and known, level of accuracy (which is only
possible using a random sample); rather you may merely require a rough idea
about a subject. Perhaps you only need to know if people, on the whole, like
your new flavour of ice cream in order to judge whether or not there is likely to
be sufficient demand to justify full-scale production and distribution. In this
case, asking a reasonably representative group of people (a quota) would be
perfectly adequate for your needs.
Although there may be several reasons to justify the use of a quota sample, you should
be aware of the problem caused by the omission of non-respondents. Because you only
count the individuals who reply (unlike random sampling where your estimate has to
allow for bias through non-response), the omission of non-respondents can lead to
serious errors as your results would be misleading. Listing non-response as it occurs in
quota samples is regarded as good practice.
Random sampling means that every population unit has a known (not necessarily
equal), non-zero probability of being selected in the sample. In all cases selection is
performed through some form of randomisation. For example, a pseudo-random number
generator can be used to generate a sequence of synthetic ‘random-like’ numbers.
Relative to non-random sampling methods, random sampling can be expensive and
time-consuming, and also requires a sampling frame. We aim to minimise both the
(random) sampling error and the systematic sampling bias. Since the probability of
selection is known, standard errors can be computed which allows confidence intervals
to be determined and hypothesis tests to be performed.
We consider five types of probability sampling.
Systematic sampling
Cluster sampling
Cluster sampling is a random sampling method where the population is divided into
clusters (hence the name!), and entire clusters are randomly selected for inclusion in the
study. Unlike other sampling techniques, cluster sampling involves selecting groups
(clusters) rather than individual elements. Once the clusters are chosen:
in a one-stage cluster sample all members within the selected clusters are included
in the sample
in a two-stage cluster sample a simple random sample is drawn from each selected
cluster (suitable when cluster sizes are large).
This method is often more practical and cost-effective when studying large and
geographically dispersed populations, as it reduces the need for extensive travel or data
collection across the entire population.
Ideally each cluster is as variable as the overall population (i.e. heterogeneity within
clusters is permitted (and likely), while there should be homogeneity between clusters).
Stratified sampling: all strata chosen, some units randomly selected from each
stratum (stratum = singular form of strata).
One-stage cluster sampling: some clusters chosen, all units selected in each
sampled cluster.
Two-stage cluster sampling: some clusters chosen, some units selected in each
sampled cluster.
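The distinction between stratified and one-stage cluster sampling can be illustrated with a small simulation on a hypothetical population of four groups of ten units (all names below are made up for illustration):

```python
import random

# Hypothetical population: units in four groups, which we treat as strata
# (stratified sampling) or as clusters (cluster sampling).
random.seed(42)  # for reproducibility
groups = {g: [f"{g}-unit{i}" for i in range(10)] for g in ["A", "B", "C", "D"]}

# Stratified sampling: ALL strata are used; some units drawn from EACH stratum.
stratified_sample = [u for stratum in groups.values()
                     for u in random.sample(stratum, 3)]

# One-stage cluster sampling: SOME clusters chosen; ALL their units included.
chosen_clusters = random.sample(list(groups), 2)
cluster_sample = [u for g in chosen_clusters for u in groups[g]]

print(len(stratified_sample))   # 3 units from each of the 4 strata = 12
print(len(cluster_sample))      # all 10 units from each of 2 clusters = 20
```

A two-stage cluster sample would simply apply `random.sample` again within each chosen cluster rather than taking all of its units.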
Multistage sampling
Multistage sampling combines several sampling methods such as cluster, stratified,
and simple random sampling. In the first
stage, clusters are randomly selected, then in subsequent stages, additional levels of
sampling occur, which may involve further random selection of subgroups or individual
elements. This method is often employed when it is impractical or too costly to survey
an entire population directly, providing a compromise between accuracy and efficiency.
During the first stage, large compound units are sampled, known as primary units.
During the second stage, smaller units, known as secondary units, are sampled from the
primary units. From here, additional sampling stages of this type may be performed, as
required, until we finally sample the basic units.
9.7 Sources of error

Sampling error
Sampling error refers to the difference between the characteristics of a sample and the
entire population from which the sample was drawn. It is an inherent part of the
sampling process and occurs because a sample is only a subset of the larger population.
The goal of sampling is to minimise this error, but it cannot be entirely eliminated.
Sampling error can lead to discrepancies between sample statistics and population
parameters, affecting the generalisability of study findings to the broader population.
Non-sampling error
Non-sampling error refers to errors in research which are not related to the act of
sampling itself but can still affect the accuracy of study results.
Selection bias occurs when the sample chosen is not representative of the entire
population, leading to a systematic difference between the characteristics of the sample
and the population. This bias arises when certain factors influence the selection process,
causing a non-random and non-representative sample. Selection bias can impact the
generalisability of study findings and may result in inaccurate conclusions if the selected
sample does not adequately reflect the characteristics of the broader population.
Response bias occurs when there is a systematic pattern of inaccurate responses from
participants, leading to distorted or unreliable data.
Response error refers to any deviation between the true value of the variable being
measured and the response obtained from a participant. This can result from factors
such as sampling error, measurement error, or processing errors during data collection.
The main sources of response error include:
role of the interviewer due to the characteristics and/or opinions of the interviewer,
asking leading questions and the incorrect recording of responses
role of the respondent who may lack knowledge, forget information or be reluctant
to give the correct answer due to the sensitivity of the subject matter.
Addressing and minimising non-sampling errors is crucial for improving the reliability
and validity of research findings. Such kinds of error can be controlled or allowed for
more effectively by a pilot survey, also known as a pilot study or pretest. This is a
small-scale research initiative conducted before the main study to test and refine the
research design, survey instruments, and procedures. It involves collecting data from a
small sample that is representative of the target population to identify potential issues,
assess the feasibility of the study, and make necessary adjustments. The pilot survey is
a crucial step in the research process, serving several purposes.
By conducting a pilot survey, researchers can identify and address any logistical,
methodological, or practical challenges that may arise during the full-scale study.
Pilot surveys provide insights into the appropriateness of the chosen sampling
strategy. Researchers can assess whether the selected sampling method is feasible
and whether the sample is representative.
Researchers can test data analysis procedures on pilot data to ensure they are
suitable for the main study. This includes refining coding schemes, statistical tests,
and other analytical approaches.
Pilot surveys help gauge the feasibility of the entire research process, including
data collection, analysis, and interpretation. This allows researchers to make
adjustments to enhance the overall feasibility of the study.
9.8 Non-response bias

Classification of non-response
Individuals with specific attitudes or opinions may be more or less likely to respond
to a survey, creating a bias in the collected data.
People with busy schedules or time constraints may be less likely to participate,
potentially biasing the sample towards individuals with more time available.
To mitigate non-response bias, researchers can employ strategies such as using follow-up
reminders, offering incentives, and analysing available demographic information of
non-respondents to understand potential biases. However, complete elimination of
non-response bias is challenging, and researchers should be transparent about the
limitations and potential biases in their findings.
9.9 Method of contact

Telephone surveys depend very much on whether your target population is on the
telephone (and how good the telephone system is).
We now explore some of the advantages and disadvantages of various contact methods.
Face-to-face interviews
Advantages:
Allow for in-depth and detailed responses, enabling interviewers to gather rich
qualitative data that might be challenging to obtain through other survey methods.
Allow for the observation of non-verbal cues, such as body language and facial
expressions, providing additional insights into respondents’ feelings and attitudes.
Can result in higher response rates compared to other methods, as the personal
interaction can build trust and rapport with respondents.
Disadvantages:
Can be expensive and time-consuming, involving travel, training, and the need for
skilled interviewers. This may limit the feasibility for large-scale surveys.
The format may limit the anonymity of responses, potentially affecting the
willingness of respondents to share candid feedback, particularly in situations
where privacy is a concern.
Telephone interviews
Advantages:
They enable rapid data collection, making them suitable for surveys that require
timely responses or when efficiency is a priority.
They offer a structured and standardised approach to data collection, ensuring that
each respondent receives the same set of questions in a consistent way.
They allow for random digit dialling, providing a more random and representative
sample compared to convenience sampling in some other survey methods.
Disadvantages:
Such surveys often face challenges related to low response rates, as many
individuals may screen calls or refuse to participate, leading to potential
non-response bias.
Complex or lengthy survey questions may be less suitable for telephone interviews,
as respondents might find it difficult to engage in detailed discussions over the
phone.
Some populations, such as those without access to a phone or with specific phone
preferences (for example, mobile-only users), may be underrepresented in telephone
surveys, impacting the sample’s representativeness.
Self-completion interviews
Advantages:
Such surveys, particularly those conducted online, are often cost-effective as they
eliminate the need for interviewers and associated expenses.
Online surveys, in particular, can reach a large and diverse audience, making them
suitable for studies involving widespread or international populations.
The absence of an interviewer minimises the potential for interviewer bias, ensuring
that respondents interpret and answer questions independently.
Disadvantages:
Respondents may misinterpret or have questions about certain survey items, and
self-completion surveys offer limited opportunities for clarification compared to
interviewer-administered methods.
Online surveys may exclude individuals with limited internet access, contributing
to a digital divide and potentially introducing bias against certain demographic
groups.
We see that the choice of contact method involves a series of trade-offs such that often
there is no single ‘right’ answer to deciding which is the best approach. Ultimately, a
value judgement is often required.
9.10 Experimental design
As seen in Example 9.15, it is important to control for confounding factors, that is,
factors which are correlated with the observed variables (such as the rival’s actions).
Failure to properly control for such factors may lead us to treat a false positive as a
genuine causal relationship! Let’s consider a further example.
participants. However, there might be confounding factors that could affect the
results.
The company notices an increase in productivity among employees who participated
in the training programme. However, it turns out that these employees were also
given a pay rise around the same time as the training. The increase in productivity
could be attributed to the salary increase rather than the training programme. In
this case, the pay rise is a confounding factor because it is associated with both the
independent variable (the training programme) and the dependent variable (the
productivity). It creates a potential alternative explanation for the observed increase
in productivity, making it challenging to attribute the changes solely to the training
programme.
Without considering the confounding factor of the pay rise, the company may
incorrectly conclude that the training programme was the primary driver of
increased productivity. The true effect of the training programme is confounded by
the simultaneous influence of the pay rise, leading to a potential misinterpretation of
the results.
To address this confounding factor, the company should carefully analyse the data,
control for variables like salary increases, or use statistical techniques like regression
analysis to separate the effects of the training programme from other potential
influences. By doing so, the company can obtain a more accurate understanding of
the true impact of the training programme on employee productivity.
In an observational study data are collected on units (not necessarily people) without
any intervention. Researchers do their best not to influence the observations in any way.
A sample survey is a good example of such a study, where data are collected in the form
of questionnaire responses. As discussed previously, every effort is made to ensure
response bias is minimised (if not completely eliminated).
Example 9.17 To assess the likely effect on tooth decay of adding fluoride to the
water supply, you can look at the data for your non-fluoridated water population
and compare it with one of the communities with naturally-occurring fluoride in
their water and measure tooth decay in both populations, but be careful! A lot of
other things may be different. Are the following the same for both communities?
Eating habits.
Age distribution.
Think of other relevant attributes which may differ between the two communities. If
you can match in this way (i.e. find two communities which share the same
characteristics and only differ in the fluoride concentration of their water supply),
your results may have some significance. However, finding such matching
communities may be easier said than done in practice!
So, to credibly establish a causal link in an observational study, all other relevant
factors need to be adequately controlled for, such that any change between observation
periods can be explained by only one variable.
9.12 Longitudinal surveys

Advantages:

They are well-suited for studying dynamic processes, such as learning trajectories,
career progression, or the development of health conditions, as researchers can
capture changes at multiple points.
Longitudinal data are suitable for event history analysis, allowing researchers to
study the occurrence and timing of specific events or transitions in the lives of
participants.
Longitudinal studies help control for cohort effects, where individuals from different
birth cohorts may exhibit different characteristics or behaviours due to shared
experiences.
Disadvantages:
Longitudinal surveys are often resource-intensive in terms of time, cost and effort.
Tracking participants over an extended period requires sustained funding and
logistical support.
Over time, participants may drop out or become unavailable, leading to attrition.
Loss of participants can compromise the representativeness of the sample and affect
the generalisability of findings.
Despite the disadvantages, such studies are widely regarded as being the best way of
studying change over time.
9.15 Sample examination questions
2. You work for a market research company and your manager has asked you to carry
out a random sample survey for a laptop company to identify whether a new laptop
model is attractive to consumers. The main concern is to produce results of high
accuracy. You are being asked to prepare a brief summary containing the items
below.
(a) Choose an appropriate probability sampling scheme. Provide a brief
justification for your answer.
(b) Describe the sampling frame and the method of contact you will use. Briefly
explain the reasons for your choices.
(c) Provide an example in which non-response bias may occur. State an action
which you would take to address this issue.
(d) State the main research question of the survey. Identify the variables
associated with this question.
3. You have been asked to design a stratified random sample survey from the
employees of a certain large company to examine whether job satisfaction of
employees varies between different job types.
(a) Discuss how you will choose your sampling frame. Also discuss any
limitation(s) of your choice.
(b) Propose two relevant stratification factors. Justify your choice.
(c) Provide an action to reduce response bias and explain why you think this
would be successful.
(d) Briefly discuss the statistical methodology you would use to analyse the
collected data.
9.16 Solutions to Sample examination questions
2. One of the main things to avoid in this part is to write ‘essays’ without any
structure. This question asks for specific things and each one of them requires only
one or two lines of response. If you are unsure of what these things are, do not
write lengthy answers. This is a waste of your valuable examination time. If you
can identify what is being asked, keep in mind that the answer should not be long.
Note also that in most cases there is no single ‘right’ answer to the question. Some
suggested answers are given below.
(a) We are asked for accuracy and random (probability) sampling, so a reasonable
option is the use of stratified random sampling which is known to produce
results of high accuracy. An example of a sampling scheme could be a
stratified sample of those customers who bought this laptop recently.
(b) The question requires:
◦ a description of a sampling frame
◦ a justification of its choice
◦ mentioning a (sensible) contact method.
Use a list provided by retailers to identify people who bought this laptop
model recently. The list could include the postal address, telephone or email
address of purchasers. Stratification could be made by gender of buyer. Finally,
an explanation should be provided as to which contact method you would
prefer – for example, email is fast but there may be a lot of non-response.
(c) The question requires an example of non-response bias and an action
suggested to address this issue.
For example, selected respondents may simply ignore the survey. Offering
(possibly financial) incentives could help mitigate non-response.
(d) A suggested answer for the question is ‘How satisfied are you with your new
laptop model?’. In terms of variables one could mention ‘satisfaction’ and
possible demographic attributes of respondents such as age or gender.
3. (a) An indicative answer here would be to use an email list. A limitation with this
choice is that this list may not contain all current employees (new starters may
not have their email account activated, and recent leavers may not yet have
their email account deactivated).
(b) Examples of stratification factors are income level, gender, age group etc., as
we suspect job satisfaction may differ across these attributes. In order for
stratified sampling to be effective, within strata the members should be
homogeneous.
(c) Employees may be reluctant to express negative opinions (in case this
negatively impacts their career). By ensuring anonymity of responses, honest
answers are more likely to be expressed.
(d) Examples here are appropriate graphs (boxplots, density histograms etc.),
confidence intervals and hypothesis tests of job satisfaction across different job
types.
Chapter 10
Correlation and linear regression
10.4 Introduction
In Chapter 8, you were introduced to the idea of testing for evidence of an association
between different attributes of two categorical variables using the chi-squared
distribution. We did this by looking at the number of individuals falling into a category,
or experiencing a particular contingency.
Correlation and linear regression enable us to see the connection between the actual
dimensions of two or more measurable variables. The work we will do in this chapter
will only involve looking at two variables at a time, but you should be aware that
statisticians use these theories and similar formulae to look at the relationship between
many variables, so-called ‘multivariate’ analysis.
When we use these terms we are concerned with using models for prediction and
decision making. So, how do we model the relationship between two variables? We are
going to look at:
◦ correlation
◦ linear regression.
It is important you understand what these two terms have in common, but also the
differences between them.
Example 10.1 An example of paired data is the following which represents the
number of people unemployed and the corresponding monthly reported crime figures
for twelve areas of a city.
When dealing with paired data, the first action is to construct a scatter diagram (also
called a scatterplot) of the data, and visually inspect it for any apparent relationship
between the two measurable variables.
Figure 10.1 shows such a scatter diagram for these data, which gives an impression of a
positive, linear relationship, i.e. it can be seen that x (the number unemployed) and y
(the number of offences) increase together, roughly in a straight line, but subject to a
certain amount of scatter. So, the relationship between x and y is not exactly linear –
the points do not lie exactly on a straight line.
Data showing a general ‘upward shape’ like this are said to be positively correlated, and
we shall see how to quantify this correlation. Other possible scatter patterns are shown
in Figure 10.2.
The left-hand plot shows data which have a negative correlation, i.e. y decreases as x
increases, and vice versa. The right-hand plot shows uncorrelated data, i.e. no clear
relationship between x and y.
Note that correlation assesses the strength of the linear relationship between two
measurable variables. Hence uncorrelated data, in general, just means an absence of
linearity. It is perfectly possible that uncorrelated data are related, just not linearly –
for example, x and y may exhibit a quadratic relationship.
Example 10.2 Below is a list of variables, along with their expected correlation.
Figure 10.2: Scatter diagrams – negatively correlated variables (left) and uncorrelated
variables (right).
Example 10.3 Let us consider a few more examples. In each case we observe
strong correlations.
(a) The correlation is likely due to the fact that both variables are influenced by a
third variable – temperature. Warmer weather leads to increased ice cream sales
and more people swimming, thereby increasing the risk of drowning.
(b) This correlation is likely coincidental. Changes in divorce rates and margarine
consumption are influenced by various social and economic factors, but there is
no direct causal relationship between the two.
(c) The more young people there are, the more juvenile offenders, scholarship
winners, and students there are likely to be. Connecting these two figures is
pretty meaningless.
It is quite common in examination questions to be given certain summary statistics (for example, Σᵢ xᵢ, Σᵢ xᵢ², Σᵢ yᵢ, Σᵢ yᵢ² and Σᵢ xᵢyᵢ) to save you time from computing such quantities directly using raw data. Hence it may be easier for you to remember the
expressions for Sxx , Syy and Sxy (the ‘corrected sum of squares’ for x and y, and
corresponding cross-products, respectively), and how they combine to calculate r.
The sample correlation coefficient measures how closely the points in a scatter diagram
lie around a straight line, and the sign of r tells us the direction of this line, i.e.
upward-sloping or downward-sloping, for positive and negative r, respectively. It does
not tell us the gradient of the line – this is what we will determine in linear regression.
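As a quick illustration, the corrected sums of squares and the correlation coefficient can be computed in a few lines of Python. This is a minimal sketch with made-up data; the combination r = Sxy/√(Sxx·Syy) is the standard formula the guide is alluding to.

```python
# Sample correlation via corrected sums of squares.
# The data below are made up purely for illustration.
from math import sqrt

x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [1.0, 3.5, 4.0, 6.0, 8.5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Corrected sums of squares and cross-products:
# Sxx = sum(x_i^2) - n*x_bar^2, and similarly for Syy and Sxy.
Sxx = sum(xi**2 for xi in x) - n * x_bar**2
Syy = sum(yi**2 for yi in y) - n * y_bar**2
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar

r = Sxy / sqrt(Sxx * Syy)   # standard formula combining the three quantities
print(round(r, 4))
```

For these near-linear data the result is close to +1, consistent with an upward-sloping scatter diagram.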
Example 10.4 For the dataset in Example 10.1, we have n = 12, x̄ = 1,665 and
ȳ = 5,567, and so:
This is a strong, positive correlation. We also note that the value of r agrees with
the scatter diagram shown in Figure 10.1, i.e. positive.
The sample correlation coefficient, r:
◦ is symmetric; that is, the correlation of x and y is the same as the correlation of y and x
◦ can only take values between ±1, i.e. −1 ≤ r ≤ 1, or |r| ≤ 1, i.e. sample correlation coefficients always have an absolute value less than or equal to 1
◦ has r ≈ 0 when x and y are not linearly related, i.e. the variables are uncorrelated.
For example, if x and y are related by:
y = x(1 − x) for 0 ≤ x ≤ 1
then the correlation is zero (as they are not linearly related), but, clearly, there is a
well-defined relationship between the two variables, so they are certainly not
independent. Figure 10.3 demonstrates this point for simulated sample data, where
we see a clear relationship between x and y, but it is clearly not a linear
relationship.1 Data of this kind would have a sample correlation near zero (here,
r = 0.15).
We saw in Chapter 2 that we can use the median and interquartile range as measures of
location and dispersion, respectively, instead of the mean and standard deviation (or
variance), when dealing with datasets that may have skewed distributions or outliers.
Similarly, we may calculate the Spearman rank correlation, rs , instead of r.
To compute rs we rank the xi and yi values in ascending order. Of course, it may be
that we only have the ranks, in which case we would have to use this method.
¹ In fact, these data are scattered around a parabola with (approximate) equation y = 2(x − 15)(85 − x).
If there are no tied rankings of xi and yi , the Spearman rank correlation is:
rs = 1 − (6 Σᵢ₌₁ⁿ dᵢ²) / (n(n² − 1))     (10.2)
where the di s are the differences in the ranks between each xi and yi .
As with other order statistics, such as the median and quartiles, it is helpful to use the
Spearman rank correlation if you are worried about the effect of extreme observations
(outliers) in your sample. The limits for rs are the same as for r, that is, −1 ≤ rs ≤ 1, i.e. |rs| ≤ 1.
Staff member A B C D E F G H I J
Rank order in test 2 3 5 1 4 9 10 6 7 8
Rank order in productivity 1 2 3 4 5 6 7 8 9 10
Staff member A B C D E F G H I J
di 1 1 2 −3 −1 3 3 −2 −2 −2
d2i 1 1 4 9 1 9 9 4 4 4
and so:
Σᵢ₌₁¹⁰ dᵢ² = 46
and hence, using (10.2):
rs = 1 − (6 × 46)/(10 × (10² − 1)) = 1 − 276/990 ≈ 0.72
which is quite strong, indicating that the test is a reasonably good predictor of sales
ability.
Note that we are not implying a causal connection here – the claim is only that performance in the aptitude test is a useful predictor of sales ability, not that it causes it.
10.8 Linear regression
y is the dependent variable (or response variable), i.e. that which we are trying
to explain.
x is the independent variable (or explanatory variable), i.e. the variable we think
influences y.
Multiple linear regression is just a natural extension of this set-up, but with more than
one independent variable (covered in EC2020 Elements of econometrics).
There can be a number of reasons for wanting to establish a mathematical relationship
between a dependent variable and an independent variable, for example:
y = β0 + β1 x
where β0 and β1 are fixed, but unknown, population parameters. Our objective is to
estimate β0 and β1 using (paired) sample data (xi , yi ), for i = 1, 2, . . . , n.
Note the use of the word approximate. Particularly in the social sciences, we would not
expect a perfect linear relationship between the two variables. Therefore, we modify this
basic model to:
y = β0 + β1 x + ε
where ε is some random disturbance from the initial ‘approximate’ line. In other words,
each y observation almost lies on the line, but ‘jumps’ off the line according to the
random variable ε. This disturbance is often referred to as the error term.
For each pair of observations (xi , yi ), for i = 1, 2, . . . , n, we can write this as:
y i = β 0 + β 1 xi + εi .
The random error terms ε1 , ε2 , . . . , εn corresponding to the n data points are assumed
to be independent and identically normally distributed, with zero mean and constant
(but unknown) variance, σ 2 . That is:
εi ∼ N (0, σ 2 ) for i = 1, 2, . . . , n.
The existence of three model parameters: the linear equation parameters β0 and
β1 , and the error term variance, σ 2 .
Var(εi ) = σ 2 for all i = 1, 2, . . . , n, i.e. the error term variance is constant and
does not depend on the independent variable.
You may feel that some of these assumptions are particularly strong and restrictive. For
example, why should the error term variance be constant across all observations?
Indeed, your scepticism serves you well! In a more comprehensive discussion of linear
regression, such as in EC2020 Elements of econometrics, model assumptions would
be properly tested to assess their validity. Given the limited scope of linear regression in
this course, sadly we are too time-constrained to consider such tests in detail. However,
do be aware that with any form of modelling, a thorough critique of model assumptions
is essential. Analysis based on false assumptions leads to invalid results, a bad thing!
ŷ = β̂0 + β̂1x
where β̂0 and β̂1 denote our estimates of β0 and β1, respectively. The derivation of the formulae for β̂0 and β̂1 is not required for this course, although you do need to know how to calculate point estimates of β0 and β1.
The least squares estimates are:
β̂1 = (Σᵢ₌₁ⁿ xᵢyᵢ − nx̄ȳ) / (Σᵢ₌₁ⁿ xᵢ² − nx̄²)   (10.3)
and:
β̂0 = ȳ − β̂1x̄.   (10.4)
Note in practice that you need to compute the value β̂1 first, since this is needed to calculate β̂0.
10.8.3 Prediction
Having estimated the regression line, an important application of it is for prediction.
That is, for a given value of the independent variable, we can use it in the estimated
regression line to obtain a prediction of y.
For a given value of the independent variable, x0, the predicted value of the dependent variable, ŷ, is:
ŷ = β̂0 + β̂1x0.
Remember to attach the appropriate units to the prediction (i.e. the units of
measurement of the original y data). Also, ensure the value you use for x0 is correct –
for example, if the original x data is in 000s, then a prediction of y when the
independent variable is 10,000, say, would mean x0 = 10, and not 10,000!
and:
β̂0 = ȳ − β̂1x̄ = 5,445/12 − 3.221 × 410/12 = 343.7.
Hence the estimated regression line is:
ŷ = 343.7 + 3.221x.
For example, when advertising expenditure is £35,000, the predicted sales are ŷ = 343.7 + 3.221 × 35 = 456.4, which is £456,400. Note that since the advertising costs were given in £000s, we used x0 = 35, and then converted the predicted sales into pounds.
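The arithmetic of this advertising example can be reproduced in a few lines of Python. The slope 3.221 is taken as given from the guide's calculation; only the intercept and the prediction are recomputed here.

```python
# Reproduce the estimated regression line and prediction for the
# advertising example: n = 12, sum(x) = 410, sum(y) = 5,445 (x in £000s).
n = 12
x_bar = 410 / n
y_bar = 5445 / n

b1 = 3.221                    # slope, as computed in the guide
b0 = y_bar - b1 * x_bar       # intercept, via formula (10.4)

x0 = 35                       # £35,000 expressed in £000s
y_hat = b0 + b1 * x0          # predicted sales, in £000s
print(round(b0, 1), round(y_hat, 1))   # prints: 343.7 456.4
```

Multiplying the prediction by 1,000 converts it back into pounds, giving £456,400 as in the text.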
Non-linear relationships
We have only seen here how to use a straight line to model the best fit. So, we could be
missing some quite important non-linear relationships, particularly if we were working
in the natural sciences (recall Figure 10.3).
Extrapolation
In Example 10.7, we used our line of best fit to predict the value of y for a given value
of x, i.e. advertising expenditure of £35,000. Such predictions are only reliable if we are
dealing with x0 values which lie within the range of available x data, known as
interpolation. If we use the estimated regression line for prediction using x0 values
which lie outside the range of the available x data, then this is known as
extrapolation, for which any predictions should be viewed with caution.
For Example 10.7, it may not be immediately obvious that the relationship between
advertising expenditure and sales could change, but a moment’s thought should
convince you that, were you to quadruple advertising expenditure, you would be
unlikely to get a nearly 4 × 3.221 ≈ 13-fold increase in sales! Basic economics would
suggest some diminishing marginal returns to advertising expenditure!
Sometimes it is very easy to see that the relationship must change. For instance,
consider Example 10.8, which shows an anthropologist’s data on years of education of a
mother and the number of children she has, based on a Pacific island.
ŷ = 8 − 0.6x
based on data of women with between 5 and 8 years of education who had 0 to 8 live
births. This looks sensible. We predict ŷ = 8 − 0.6 × 5 = 5 live births for those with
5 years of education, and ŷ = 8 − 0.6 × 10 = 2 live births for those with 10 years of
education.
This is all very convincing, but say a woman on the island went to university and
completed a doctorate, and so had 15 years of education. She clearly cannot have
ŷ = 8 − 0.6 × 15 = −1 children! Also, if someone missed school entirely, is she really
likely to have ŷ = 8 − 0.6 × 0 = 8 children? We have no way of knowing. The
relationship shown by the existing data will probably not hold beyond the x data
range of 5 to 8 years of education. So, exercise caution when extrapolating!
As already mentioned, examiners frequently give you the following summary statistics:
Σᵢ₌₁ⁿ xᵢ, Σᵢ₌₁ⁿ xᵢ², Σᵢ₌₁ⁿ yᵢ, Σᵢ₌₁ⁿ yᵢ² and Σᵢ₌₁ⁿ xᵢyᵢ
in order to save you time. If you do not know how to take advantage of these, you will waste valuable time which you really need for the rest of the question.
Note that if you use your calculator, show no working and get the answer wrong, you
are unlikely to get any credit.
Examination mark 50 80 70 40 30 75 95
Project mark 75 60 55 40 50 80 65
You want to know if students who had relatively high project marks in the subject
also excel in examinations.
(a) Calculate the Spearman rank correlation.
(b) Based on your answer to part (a), do you think students who have the highest
project marks are also likely to score well in examinations? Briefly justify your
view.
Week #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
x 9 11 12 13 15 18 16 14 12 10
y 420 350 360 300 225 200 230 280 315 410
(a) Draw a scatter diagram of these data. Carefully label the diagram.
(b) Calculate the sample correlation coefficient. Interpret its value.
(c) Calculate and report the least squares line of y on x. Draw the line on the
scatter diagram.
(d) Based on the regression model above, what will be the predicted loss from
shoplifting when there are 17 workers on duty? Would you trust this value?
Justify your answer.
10.12 Solutions to Sample examination questions
(b) There is some correlation, but it is not strong. It looks as if there is a positive
connection between project marks and examination marks.
2. (a) A scatter diagram is:
[Scatter diagram titled ‘Stolen merchandise vs number of workers’: value of merchandise in $’s lost to shoplifters (vertical axis, 200 to 400) plotted against number of workers on duty (horizontal axis, 10 to 18), showing a clear downward-sloping pattern.]
(b) The summary statistics can be substituted into the formula for the sample
correlation coefficient to obtain the value r = −0.9688. An interpretation of
this value is the following – the data suggest that the higher the number of
workers, the lower the loss from shoplifters. The fact that the value is very
close to −1 suggests that this is a strong, negative linear relationship.
(c) The regression line can be written as either:
ŷ = β̂0 + β̂1x or y = β̂0 + β̂1x + ε.
The formula for β̂1 is:
β̂1 = (Σ xᵢyᵢ − nx̄ȳ) / (Σ xᵢ² − nx̄²).
Appendix A
Mathematics primer and the role of
statistics in the research process
(b) Σᵢ₌₁ⁿ xᵢ²
(c) Σᵢ₌₁ⁿ (xᵢ − 2)
(d) Σᵢ₌₁ⁿ (xᵢ − 2)²
(e) (Σᵢ₌₁ⁿ xᵢ)²
(f) Σᵢ₌₁ⁿ 2.
Solution:
(a) We have:
Σᵢ₌₁ⁿ 2xᵢ = 2x₁ + 2x₂ + 2x₃ + 2x₄ + 2x₅ + 2x₆ + 2x₇
= 2 × (x₁ + x₂ + x₃ + x₄ + x₅ + x₆ + x₇)
= 2 × (1 + 1 + 1 + 2 + 4 + 8 + 9)
= 52.
(b) We have:
Σᵢ₌₁ⁿ xᵢ² = x₁² + x₂² + x₃² + x₄² + x₅² + x₆² + x₇²
= 1² + 1² + 1² + 2² + 4² + 8² + 9²
= 168.
(c) We have:
Σᵢ₌₁ⁿ (xᵢ − 2) = (x₁ − 2) + (x₂ − 2) + (x₃ − 2) + (x₄ − 2) + (x₅ − 2) + (x₆ − 2) + (x₇ − 2)
= (1 − 2) + (1 − 2) + (1 − 2) + (2 − 2) + (4 − 2) + (8 − 2) + (9 − 2)
= 12.
(d) We have:
Σᵢ₌₁ⁿ (xᵢ − 2)² = (−1)² + (−1)² + (−1)² + 0² + 2² + 6² + 7² = 92.
(e) We have:
(Σᵢ₌₁ⁿ xᵢ)² = (x₁ + x₂ + x₃ + x₄ + x₅ + x₆ + x₇)²
= (1 + 1 + 1 + 2 + 4 + 8 + 9)²
= (26)²
= 676.
Note that:
(Σᵢ₌₁ⁿ xᵢ)² ≠ Σᵢ₌₁ⁿ xᵢ².
(f) We have:
Σᵢ₌₁ⁿ 2 = 2 + 2 + 2 + 2 + 2 + 2 + 2 = 14.
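These summation results are easy to verify in code; the following Python sketch checks each part for x = (1, 1, 1, 2, 4, 8, 9).

```python
# Check the summation results for the worked example above.
x = [1, 1, 1, 2, 4, 8, 9]

a = sum(2 * xi for xi in x)       # sum of 2*x_i
b = sum(xi**2 for xi in x)        # sum of x_i^2
c = sum(xi - 2 for xi in x)       # sum of (x_i - 2)
e = sum(x)**2                     # (sum of x_i)^2, the 'square of the sum'
f = sum(2 for _ in x)             # sum of the constant 2, n times

print(a, b, c, e, f)   # prints: 52 168 12 676 14
```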
2. Suppose that x1 = 7, x2 = 3, x3 = 1, x4 = 0, x5 = −6, and y1 = −3, y2 = 5,
y3 = −8, y4 = 9, y5 = 1. Calculate the following quantities:
(a) Σᵢ₌₂⁴ 2yᵢ
(b) Σᵢ₌₁³ 4(xᵢ − 1)
(c) y₁² + Σᵢ₌₃⁵ (xᵢ² + 2yᵢ²).
Solution:
(a) We have:
Σᵢ₌₂⁴ 2yᵢ = 2 × (y₂ + y₃ + y₄) = 2 × (5 − 8 + 9) = 12.
(b) We have:
Σᵢ₌₁³ 4(xᵢ − 1) = 4 × Σᵢ₌₁³ (xᵢ − 1)
= 4 × ((x₁ − 1) + (x₂ − 1) + (x₃ − 1))
= 4 × ((7 − 1) + (3 − 1) + (1 − 1))
= 4 × (6 + 2 + 0)
= 32.
(c) We have:
y₁² + Σᵢ₌₃⁵ (xᵢ² + 2yᵢ²) = y₁² + (x₃² + 2y₃²) + (x₄² + 2y₄²) + (x₅² + 2y₅²)
= (−3)² + (1² + 2 × (−8)²) + (0² + 2 × 9²) + ((−6)² + 2 × 1²)
= 9 + 129 + 162 + 38
= 338.
(c) y₄³ + Σᵢ₌₄⁵ yᵢ²/xᵢ.
Solution:
(a) We have:
Σᵢ₌₃⁵ xᵢ² = x₃² + x₄² + x₅² = (−2.8)² + (0.4)² + (6.1)² = 7.84 + 0.16 + 37.21 = 45.21.
(b) We have:
Σᵢ₌₁² 1/(xᵢyᵢ) = 1/(x₁y₁) + 1/(x₂y₂) = 1/((−0.5) × (−0.5)) + 1/(2.5 × 4.0) = 4 + 0.1 = 4.1.
(c) We have:
y₄³ + Σᵢ₌₄⁵ yᵢ²/xᵢ = y₄³ + y₄²/x₄ + y₅²/x₅ = (−2.0)³ + (−2.0)²/0.4 + 0²/6.1 = −8 + 10 + 0 = 2.
4. Explain why:
Σᵢ₌₁ⁿ xᵢ² ≠ (Σᵢ₌₁ⁿ xᵢ)²  and  Σᵢ₌₁ⁿ xᵢyᵢ ≠ (Σᵢ₌₁ⁿ xᵢ)(Σᵢ₌₁ⁿ yᵢ).
Solution:
Writing out the full summations, we obtain:
Σᵢ₌₁ⁿ xᵢ² = x₁² + x₂² + · · · + xₙ² ≠ (x₁ + x₂ + · · · + xₙ)² = (Σᵢ₌₁ⁿ xᵢ)².
Therefore, the ‘sum of squares’ is, in general, not equal to the ‘square of the sum’
because the expansion of the quadratic gives:
(x₁ + x₂ + · · · + xₙ)² = x₁² + x₂² + · · · + xₙ² + 2x₁x₂ + 2x₁x₃ + · · · + 2xₙ₋₁xₙ
so, unless all the cross-product terms sum to zero, the two expressions are not the same. Hence, in general, the two expressions are different. Similarly:
Σᵢ₌₁ⁿ xᵢyᵢ = x₁y₁ + x₂y₂ + · · · + xₙyₙ ≠ (x₁ + x₂ + · · · + xₙ)(y₁ + y₂ + · · · + yₙ) = (Σᵢ₌₁ⁿ xᵢ)(Σᵢ₌₁ⁿ yᵢ).
Therefore, the ‘sum of the products’ is, in general, not equal to the ‘product of the
sums’ because the expansion of the products gives:
(x1 + x2 + · · · + xn )(y1 + y2 + · · · + yn ) = x1 y1 + x2 y2 + · · · + xn yn + x1 y2 + x1 y3 + · · · .
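A short numerical example makes the two inequalities concrete; the data below are made up purely for illustration.

```python
# Illustrate that the 'square of the sum' differs from the 'sum of squares',
# and the 'product of the sums' from the 'sum of the products'.
x = [1, 2, 3]
y = [4, 5, 6]

sum_of_squares = sum(xi**2 for xi in x)                  # 1 + 4 + 9 = 14
square_of_sum = sum(x)**2                                # 6^2 = 36

sum_of_products = sum(xi * yi for xi, yi in zip(x, y))   # 4 + 10 + 18 = 32
product_of_sums = sum(x) * sum(y)                        # 6 * 15 = 90

assert sum_of_squares != square_of_sum
assert sum_of_products != product_of_sums
print(sum_of_squares, square_of_sum, sum_of_products, product_of_sums)
```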
A.2 Practice problems
6. Suppose x₁ = 4, x₂ = 1 and x₃ = 2. For these figures, compute Σᵢ₌₁² xᵢ³.
7. If n = 4, x1 = 2, x2 = 3, x3 = 5 and x4 = 7, compute:
(a) Σᵢ₌₁³ xᵢ
(b) (1/n) Σᵢ₌₁⁴ xᵢ².
8. Given x₁ = 3, x₂ = 1, x₃ = 4, x₄ = 6 and x₅ = 8, find:
(a) Σᵢ₌₁⁵ xᵢ
(b) Σᵢ₌₃⁴ xᵢ².
Given also that p₁ = 1/4, p₂ = 1/8, p₃ = 1/8, p₄ = 1/3 and p₅ = 1/6, find:
(c) Σᵢ₌₁⁵ pᵢxᵢ
(d) Σᵢ₌₃⁵ pᵢxᵢ².
A.3 Solutions to Practice problems
(b) Here the ‘of’ and ÷ take precedence and this is:
(1/3) of 12 (= 4) minus 4 ÷ 2 (= 2)
i.e. 4 minus 2, i.e. +2.
(5/5) × 2, which is 1 × 2, which is 2.
(b) (0.07)2 = 0.07 × 0.07 = 0.0049. Be careful with the decimal points!
(c) 0.49 is 0.7 × 0.7 and −0.7 × −0.7, so the square root of 0.49 is ±0.7.
Be sure that you understand the rules for placing decimal points. When you work
on interval estimation or hypothesis tests of proportions you will need this.
3. (a) The answer is (98/100) × 200 which is 98 × 2, or 196.
(b) To get a percentage, multiply the given fraction by 100, i.e. we obtain:
(17/25) × 100 = 17 × 4 = 68%.
So 17/25 can be written as 68%.
(c) 25% of 98/144 is another way of writing (25/100) × (98/144) which is
(1/4) × (49/72) or 49/288 which is 0.1701 to four decimal places.
7. (b) We have (1/n) Σᵢ₌₁⁴ xᵢ² = (2² + 3² + 5² + 7²)/4 = 87/4 = 21.75.
8. (a) We have:
Σᵢ₌₁⁵ xᵢ = x₁ + x₂ + x₃ + x₄ + x₅ = 3 + 1 + 4 + 6 + 8 = 22.
(b) We have:
Σᵢ₌₃⁴ xᵢ² = x₃² + x₄² = 4² + 6² = 16 + 36 = 52.
(c) We have:
Σᵢ₌₁⁵ pᵢxᵢ = p₁x₁ + p₂x₂ + p₃x₃ + p₄x₄ + p₅x₅
= (1/4) × 3 + (1/8) × 1 + (1/8) × 4 + (1/3) × 6 + (1/6) × 8
= 3/4 + 1/8 + 1/2 + 2 + 4/3
= 4 17/24.
(d) We have:
Σᵢ₌₃⁵ pᵢxᵢ² = p₃x₃² + p₄x₄² + p₅x₅² = (1/8) × 4² + (1/3) × 6² + (1/6) × 8²
= (1/8) × 16 + (1/3) × 36 + (1/6) × 64
= 2 + 12 + 10 2/3
= 24 2/3.
Check this carefully if you could not do it yourself first time. Note that i = 3 and
i = 5 mark the beginning and end, respectively, of the summation, as did i = 3 and
i = 4 in (b).
9. (a) Here the straight line graph has a slope of 1 and the line cuts the y-axis at y = 3 (when x is zero, y is 3).
[Sketch: the line y = x + 3, crossing the y-axis at 3 and the x-axis at −3.]
(b) Here the straight line still has a positive slope of 3 (y increases by 3 when x
increases by 1) but the line crosses the y-axis at a minus value (−2).
[Sketch: the line y = 3x − 2, crossing the y-axis at −2 and the x-axis at 2/3.]
Appendix B
Data visualisation and descriptive
statistics
1. A survey has been conducted on the wage rates paid to clerical assistants in twenty
companies in your locality in order to make proposals for a review of wages in your
own firm. The results of this survey are as follows:
Hourly rate in £
12.50 11.80 12.10 11.90 12.40 12.10 11.90 12.00 11.80 11.90
12.20 12.00 11.90 12.40 12.00 11.90 12.00 12.10 12.20 12.30
Solution:
A good idea is to first list the data in increasing order:
11.80 11.80 11.90 11.90 11.90 11.90 11.90 12.00 12.00 12.00
12.00 12.10 12.10 12.10 12.20 12.20 12.30 12.40 12.40 12.50
[Dot plot of the twenty ordered hourly rates between £11.80 and £12.50, with the tallest column of dots at £11.90.]
2. The following two sets of data represent the lengths (in minutes) of students’
attention spans during a one-hour class. Construct density histograms for each of
these datasets and use these to comment on comparisons between the two
distributions.
Statistics class
01 43 16 28 27 25 26 25 22 26
47 40 14 36 23 32 15 31 19 25
21 07 28 49 31 22 24 26 14 45
38 48 36 22 29 12 32 11 34 42
55 27 06 23 42 21 58 23 35 13
Economics class
60 39 30 41 37 27 38 04 25 43
58 60 21 53 26 47 08 51 19 31
29 21 31 60 48 30 28 37 07 60
50 60 51 24 41 03 37 14 46 60
60 48 25 32 59 11 60 28 54 18
60 42 04 26 60 41 60 11 43 28
Solution:
The exact shape of the density histograms will depend on the class intervals you
have used. A sensible approach is to choose round numbers. Below the following
have been used:
Statistics (n = 50)
Interval Relative
width, Frequency, frequency, Density,
Class interval wk fk rk = fk /n dk = rk /wk
[0, 10) 10 3 0.06 0.006
[10, 20) 10 8 0.16 0.016
[20, 30) 10 20 0.40 0.040
[30, 40) 10 9 0.18 0.018
[40, 50) 10 8 0.16 0.016
[50, 60) 10 2 0.04 0.004
[60, 70) 10 0 0.00 0.000
Economics (n = 60)
Interval Relative
width, Frequency, frequency, Density,
Class interval wk fk rk = fk /n dk = rk /wk
[0, 10) 10 5 0.083 0.0083
[10, 20) 10 5 0.083 0.0083
[20, 30) 10 12 0.200 0.0200
[30, 40) 10 10 0.167 0.0167
[40, 50) 10 10 0.167 0.0167
[50, 60) 10 7 0.117 0.0117
[60, 70) 10 11 0.183 0.0183
For the statistics class, the density histogram of students’ attention spans is
approximately symmetric. Attention spans for the economics class students are
more variable and higher on average than those of statistics students, and they are
not symmetric due to a group of students who maintain interest throughout the
class. Note we have more data on economics students, which is perfectly
acceptable. Sample sizes of different groups may very well be different in practice.
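The density-histogram table for the statistics class can be reproduced in a few lines of Python; the frequencies are those in the table above, and density = relative frequency / interval width.

```python
# Build the density-histogram table for the statistics class.
freqs = {  # class interval (lower, upper): frequency
    (0, 10): 3, (10, 20): 8, (20, 30): 20,
    (30, 40): 9, (40, 50): 8, (50, 60): 2, (60, 70): 0,
}
n = sum(freqs.values())   # total sample size, 50

densities = {}
for (lo, hi), f in freqs.items():
    width = hi - lo           # interval width, w_k
    rel_freq = f / n          # relative frequency, r_k = f_k / n
    densities[(lo, hi)] = rel_freq / width   # density, d_k = r_k / w_k

print(n, densities[(20, 30)])   # prints: 50 0.04
```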
3. Sales of footwear in a store were recorded for 52 weeks and these are shown below.
30 60 67 63 69 54 68 60 62 83 66 70 68
61 74 94 87 66 69 66 62 78 90 98 93 73
70 68 47 40 51 56 56 58 57 47 71 76 80
79 77 77 73 64 67 59 46 54 53 49 58 62
We can see from the stem-and-leaf diagram that the data look approximately
symmetric, centred at about 66.
4. Calculate the mean, median and mode of the following sample values:
4, 5, 7, 7, 8, 10 and 12.
Solution:
The mean is x̄ = (4 + 5 + 7 + 7 + 8 + 10 + 12)/7 = 53/7 ≈ 7.57.
Given there is an odd number of observations, since n = 7, then the median is:
x((n+1)/2) = x(4) = 7.
The mode is 7 since this occurs twice, with the other values only occurring once
each.
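The three measures of location can be checked with Python's standard library; this is a minimal sketch using the sample values from the question.

```python
# Mean, median and mode for the sample 4, 5, 7, 7, 8, 10, 12.
from statistics import mean, median, mode

x = [4, 5, 7, 7, 8, 10, 12]
print(round(mean(x), 2), median(x), mode(x))   # prints: 7.57 7 7
```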
5. Consider again the wage rate data in Question 1. Calculate the mean, median and
mode for these data. Explain which measure should be used in deciding on a rate
to be used in your company.
Solution:
The mean is:
x̄ = Σₖ₌₁⁸ fₖxₖ / Σₖ₌₁⁸ fₖ = (2 × 11.80 + 5 × 11.90 + · · · + 1 × 12.50)/20 = 241.4/20 = £12.07.
The median is (x(10) + x(11))/2 = (12.00 + 12.00)/2 = £12.00, and the mode is £11.90 (which occurs five times).
The mode is probably a poor measure of location in this case (the mode is often a
poor measure when we are dealing with measurable variables), so the median or
mean is preferred. Depending on the company’s negotiating strategy, it might
choose a figure slightly higher than the mean, say £12.20, since it is higher than the
average and beaten by fewer than one-third of rivals.
6. Hard!
In a sample of n = 6 objects the mean of the data is 15 and the median is 11.
Another observation is then added to the sample, and this takes the value
x₇ = 12.
(a) Calculate the mean of the seven observations.
(b) What can you conclude about the median of the seven observations?
Solution:
The ‘new’ mean is:
x̄new = (Σᵢ₌₁⁶ xᵢ + x₇)/(6 + 1) = (nx̄original + x₇)/7 = (6 × 15 + 12)/7 = 102/7 ≈ 14.6.
6+1 6+1 7 7
Note the easy way to calculate the new sum of data, using the old mean.
There is not enough information in the question to calculate the new median
exactly, but it must be somewhere between 11 and 12 (inclusive), in part because
for the original data:
x(1) , x(2) , x(3) ≤ 11
and:
x(4) , x(5) , x(6) ≥ 11.
x7 = 12 fits somewhere in the second group, increasing the median.
• If x(3) < x7 < x(4) then the new median will be x7 = 12.
• If x(3) < x(4) < x7 then the new median will be x(4) , where 11 ≤ x(4) ≤ 12.
Other cases can also be worked out if we had enough information, or estimated by
a bit more algebra.
(c) The company uses the data to claim that ‘40% of airline passengers travel with
baggage over the weight allowance’. Explain whether or not you think this
claim is valid. (Think about how the data were collected!)
Solution:
(b) The sample mean, using midpoints of the class intervals, is:
x̄ = Σₖ₌₁⁷ fₖxₖ / Σₖ₌₁⁷ fₖ = ((21 × 2.5) + (2 × 7.5) + · · · + (3 × 32.5))/(21 + 2 + · · · + 3) = 1,650/100 = 16.5 kg.
For the median, since n = 100 we seek the 50.5th ordered value, which must be
located within the [15, 20) class interval (using the cumulative frequency
column in the table above). Since we do not have the raw data, we use the standard interpolation method within this interval to approximate the median.
(c) The claim is not valid. The data were taken from one route and only one flight
on that route. We cannot extrapolate to make conclusions about all flights.
There needs to be more variety in the sampling. For example, include flights to
many destinations, both domestic and international flights etc.
30 60 67 63 69 54 68 60 62 83 66 70 68
61 74 94 87 66 69 66 62 78 90 98 93 73
70 68 47 40 51 56 56 58 57 47 71 76 80
79 77 77 73 64 67 59 46 54 53 49 58 62
(a) Use the stem-and-leaf diagram to find the median and the quartiles of the data.
Solution:
(a) There are 52 values so the median is halfway between the 26th and 27th
ordered values. It is easy to read off from the graph that both the 26th and
27th ordered values equal 66, so the median equals 66.
(b) Again, we can read off from the graph that the lower and upper quartiles are
about 57 and 74, respectively. On this basis, the interquartile range is
74 − 57 = 17. However, there exist slightly different definitions of the quartiles
so you might have obtained a slightly different answer (but not too different!)
depending on the definition you have used.
A more formal approach gives:
Q1 = (x(13) + x(14))/2 = (57 + 58)/2 = 57.5
and:
Q3 = (x(39) + x(40))/2 = (73 + 74)/2 = 73.5.
Using this method, the interquartile range is 73.5 − 57.5 = 16.
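These ordered-value calculations are easy to check in code; the following Python sketch uses the 52 sales figures from the question and the definitions in the solution.

```python
# Median and quartiles of the 52 weekly footwear sales figures.
sales = [30, 60, 67, 63, 69, 54, 68, 60, 62, 83, 66, 70, 68,
         61, 74, 94, 87, 66, 69, 66, 62, 78, 90, 98, 93, 73,
         70, 68, 47, 40, 51, 56, 56, 58, 57, 47, 71, 76, 80,
         79, 77, 77, 73, 64, 67, 59, 46, 54, 53, 49, 58, 62]

s = sorted(sales)    # ordered values x_(1), ..., x_(52); n = 4k with k = 13
k = len(s) // 4

Q1 = (s[k - 1] + s[k]) / 2          # (x_(13) + x_(14)) / 2
Q2 = (s[2*k - 1] + s[2*k]) / 2      # (x_(26) + x_(27)) / 2
Q3 = (s[3*k - 1] + s[3*k]) / 2      # (x_(39) + x_(40)) / 2
print(Q1, Q2, Q3, Q3 - Q1)          # prints: 57.5 66.0 73.5 16.0
```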
9. Calculate the range, variance and standard deviation of the following sample values: 4, 5, 7, 7, 8, 10 and 12.
Solution:
The range is x(n) − x(1) = 12 − 4 = 8, where x(1) is the minimum value in the dataset, and x(n) is the maximum value.
We have Σᵢ xᵢ = 53 and Σᵢ xᵢ² = 447, hence x̄ = 53/7 ≈ 7.57 and the (sample) variance is:
s² = (1/(n − 1)) (Σᵢ xᵢ² − nx̄²) = (447 − 7 × (53/7)²)/6 ≈ 7.62
so the standard deviation is s ≈ 2.76. (Carry x̄ = 53/7 exactly here; rounding it to 7.57 first gives 7.64.)
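A quick Python check of these measures of dispersion; carrying the mean exactly gives a sample variance of about 7.62 (rounding the mean to 7.57 before squaring produces 7.64 instead, which is why exact intermediate values are preferable).

```python
# Range, sample variance and standard deviation for 4, 5, 7, 7, 8, 10, 12.
from math import sqrt

x = [4, 5, 7, 7, 8, 10, 12]
n = len(x)

rng = max(x) - min(x)                                   # range = 8
x_bar = sum(x) / n                                      # 53/7, kept exact
s2 = (sum(xi**2 for xi in x) - n * x_bar**2) / (n - 1)  # sample variance
s = sqrt(s2)                                            # sample standard deviation
print(rng, round(s2, 2), round(s, 2))                   # prints: 8 7.62 2.76
```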
10. Compare the means and standard deviations of the sample datasets shown in
Question 2 and comment. You may wish to use the following summary statistics.
Solution:
For the statistics students:
x̄ = (1/50) Σᵢ₌₁⁵⁰ xᵢ = 1,395/50 = 27.9 and s = √[(1/49) × (46,713 − (1,395)²/50)] = 12.6.
So the statistics students’ mean and standard deviation of attention spans are 27.9
minutes and 12.6 minutes, respectively. For the economics students:
x̄ = (1/60) Σ xᵢ = 2,225/60 = 37.1 and s = √[(1/59) × (100,387 − (2,225)²/60)] = 17.4.
So the economics students’ mean and standard deviation of attention spans are
37.1 minutes and 17.4 minutes, respectively.
These statistics represent the main features of the distributions shown in the
density histograms of Question 2. The economics students have the higher mean
and variability due to the group which maintains interest throughout the class.
B.1. Worked examples
11. Calculate the mean and standard deviation of the following groups of numbers
(treat these as samples).
(a) 5, 3, 4, 7, 10, 7 and 1.
(b) 12, 10, 11, 14, 17, 14 and 8.
(c) 12, 8, 10, 16, 22, 16 and 4.
(d) 25, 9, 16, 49, 100, 49 and 1.
Comment on any relationships between the statistics for the groups.
Solution:
The sample means and standard deviations are:
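The table of results is straightforward to reproduce; as a sketch (the four datasets are those listed in the question):

```python
from statistics import mean, stdev

groups = {
    "a": [5, 3, 4, 7, 10, 7, 1],
    "b": [12, 10, 11, 14, 17, 14, 8],   # group (a) with 7 added to each value
    "c": [12, 8, 10, 16, 22, 16, 4],    # group (a) doubled, then 2 added
    "d": [25, 9, 16, 49, 100, 49, 1],   # group (a) squared
}
for name, data in groups.items():
    print(name, round(mean(data), 2), round(stdev(data), 2))
```

The expected comment is presumably that adding a constant (group (b)) shifts the mean but leaves the standard deviation unchanged, multiplying by a constant (group (c)) scales both, while squaring (group (d)) preserves neither.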
12. Consider again the wage rate data in Question 1. Calculate the range, interquartile
range and standard deviation of these data.
Solution:
The range is £12.50 − £11.80 = £0.70. For the quartiles we have:
Q1 = (x(5) + x(6))/2 = (11.90 + 11.90)/2 = £11.90
and:
Q3 = (x(15) + x(16))/2 = (12.20 + 12.20)/2 = £12.20.
Therefore, the interquartile range is £12.20 − £11.90 = £0.30. To compute the
(sample) variance, compute the following:
Σ xᵢ² = 2,914.50

and:

Sxx = Σ xᵢ² − (Σ xᵢ)²/n = 2,914.50 − (241.4)²/20 = 0.802

hence:

s² = Sxx/(n − 1) = 0.802/19 = 0.042211.
Hence the (sample) standard deviation is √0.042211 = £0.21. Note that two decimal places are sufficient here.
Tip: Working out the quartiles is easier if the sample size is a multiple of 4. If we have 4k items, listed in increasing order as x(1), x(2), . . . , x(4k), then:

Q1 = (x(k) + x(k+1))/2, Q2 = (x(2k) + x(2k+1))/2 and Q3 = (x(3k) + x(3k+1))/2.
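For instance, the tip applies to the 52 exam marks from the stem-and-leaf question earlier in this appendix (4k = 52, so k = 13); a short sketch:

```python
def quartiles_4k(data):
    """Quartiles via the tip above, for a sample whose size is a multiple of 4."""
    x = sorted(data)
    k = len(x) // 4
    assert len(x) == 4 * k, "sample size must be a multiple of 4"
    # x[k-1] is the k-th ordered value x_(k), since Python indexes from 0
    q1 = (x[k - 1] + x[k]) / 2
    q2 = (x[2 * k - 1] + x[2 * k]) / 2
    q3 = (x[3 * k - 1] + x[3 * k]) / 2
    return q1, q2, q3

# the 52 values from the earlier stem-and-leaf question
marks = [30, 60, 67, 63, 69, 54, 68, 60, 62, 83, 66, 70, 68,
         61, 74, 94, 87, 66, 69, 66, 62, 78, 90, 98, 93, 73,
         70, 68, 47, 40, 51, 56, 56, 58, 57, 47, 71, 76, 80,
         79, 77, 77, 73, 64, 67, 59, 46, 54, 53, 49, 58, 62]
print(quartiles_4k(marks))  # (57.5, 66.0, 73.5)
```

This reproduces the quartiles and median found earlier for those data.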
13. The table below summarises the distribution of salaries for a sample of 50
employees.
Salary (£000s) [5, 10) [10, 15) [15, 20) [20, 25)
Number of employees 8 14 21 7
(a) Draw a density histogram for these data.
(b) Calculate the mean, standard deviation and median.
(c) What is the modal class for this sample?
Solution:
(a) We have, using midpoints xk for the calculations:

Class interval   Width, wk   Frequency, fk   Relative frequency, rk = fk/n   Density, dk = rk/wk   fk xk    fk xk²
[5, 10)              5            8                    0.16                         0.032           60.0      450.00
[10, 15)             5           14                    0.28                         0.056          175.0    2,187.50
[15, 20)             5           21                    0.42                         0.084          367.5    6,431.25
[20, 25)             5            7                    0.14                         0.028          157.5    3,543.75
Total                            50                                                                760.0   12,612.50
For the median, since n = 50 we seek the 25.5th ordered value, which must be
located within the [15, 20) class interval. Since we do not have the raw data,
we use the interpolation approach. Hence:
m = endpoint of previous bin + (bin width × number of remaining observations)/(bin frequency) = 15 + (5 × (25.5 − 22))/21 = 15.83, i.e. £15,830.
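The remaining grouped-data calculations for part (b) can be verified with a short script (class midpoints 7.5, 12.5, 17.5 and 22.5 are assumed; with the n − 1 divisor the standard deviation is about 4.65, or about 4.61 if you divide by n instead):

```python
from math import sqrt

mids = [7.5, 12.5, 17.5, 22.5]                         # class midpoints (assumed)
freqs = [8, 14, 21, 7]
n = sum(freqs)                                         # 50 employees
sum_fx = sum(f * x for f, x in zip(freqs, mids))       # 760.0
sum_fx2 = sum(f * x**2 for f, x in zip(freqs, mids))   # 12,612.50
mean = sum_fx / n                                      # 15.2, i.e. £15,200
s = sqrt((sum_fx2 - n * mean**2) / (n - 1))            # sample standard deviation
median = 15 + 5 * (25.5 - 22) / 21                     # interpolation in [15, 20)
print(mean, round(s, 2), round(median, 2))
```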
(c) The modal class is [15, 20) (where the modal class is the class interval with the
greatest frequency).
Household size, xk 1 2 3 4 5
Frequency, fk 8 30 11 7 8
and:

Σ fk xk² = 8 × 1² + 30 × 2² + · · · + 8 × 5² = 539.

So:

x̄ = (Σ fk xk)/(Σ fk) = 169/64 = 2.64

and:

s² = (Σ fk xk²)/(Σ fk) − [(Σ fk xk)/(Σ fk)]² = 539/64 − (169/64)² = 1.45.
Therefore, the standard deviation is s = √1.45 = 1.20.
15. James set a class test for his students. The test was marked out of 10, and the
students’ marks are summarised below. (Note that no student obtained full marks,
and every student scored at least 3.)
Mark (out of 10) 3 4 5 6 7 8 9
Number of students 1 1 6 5 11 3 1
Solution:
Class interval   Width, wk   Frequency, fk   Relative frequency, rk = fk/n   Density, dk = rk/wk
[3, 4) 1 1 0.0357 0.0357
[4, 5) 1 1 0.0357 0.0357
[5, 6) 1 6 0.2143 0.2143
[6, 7) 1 5 0.1786 0.1786
[7, 8) 1 11 0.3929 0.3929
[8, 9) 1 3 0.1071 0.1071
[9, 10) 1 1 0.0357 0.0357
(b) x̄ = 177/28 = 6.32 marks. There are 28 students. The 14th lowest score was 7
and the 15th lowest was also 7. Therefore, the median mark also equals 7
marks.
B.2. Practice problems
The lower quartile equals 5 and the upper quartile equals 7, so the
interquartile range = 7 − 5 = 2. Note that, perhaps surprisingly, the upper
quartile equals the middle quartile, which is, of course, the median. This sort
of thing can happen with discrete data!
2. Think about why and when you would use each of the following.
(a) A density histogram.
(b) A stem-and-leaf diagram.
When would you not do so?
3. Find the mean of the number of hours of television watched per week by 10
students, with the following observations:
4. Say whether the following statement is true or false and briefly give your reason(s).
‘The mean of a dataset is always greater than the median.’
5. If n = 4, x1 = 1, x2 = 4, x3 = 5 and x4 = 6, find:
(1/3) Σᵢ₌₂⁴ xᵢ.
7. If x1 = 4, x2 = 2, x3 = 2, x4 = 5 and x5 = 6, calculate:
(a) the mode
(b) the mean.
8. Calculate the mean, median and mode of the prices (in £) of spreadsheet packages.
Also, calculate the range, variance, standard deviation and interquartile range of
the spreadsheet prices shown below. Check against the answers given after the data.
Name      Price   Price − Mean   (Price − Mean)²
Brand A 52 −82.54 6,812.60
Brand B 64 −70.54 4,975.67
Brand C 65 −69.54 4,835.60
Brand D 82 −52.54 2,760.29
Brand E 105 −29.54 875.52
Brand F 107 −27.54 758.37
Brand G 110 −24.54 602.14
Brand H 115 −19.54 381.75
Brand I 155 20.46 418.67
Brand J 195 60.46 3,655.60
Brand K 195 60.46 3,655.60
Brand L 239 104.46 10,912.21
Brand M 265 130.46 17,020.21
You should be able to show that the arithmetic mean is 134.54, the median is 110,
and the mode is 195.
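These answers, and the remaining summary statistics, can be checked with Python's statistics module; a sketch:

```python
from statistics import mean, median, mode, stdev

prices = [52, 64, 65, 82, 105, 107, 110, 115, 155, 195, 195, 239, 265]
print(round(mean(prices), 2))      # 134.54
print(median(prices))              # 110
print(mode(prices))                # 195
print(max(prices) - min(prices))   # range: 213
print(round(stdev(prices), 2))     # sample standard deviation
```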
9. Work out s2 for a sample of nine observations of the number of minutes students
took to complete a statistics problem.
2, 4, 5, 6, 6, 7, 8, 11 and 20.
10. State whether the following statement is true or false and briefly give your
reason(s). ‘Three quarters of the observations in a dataset are less than the lower
quartile.’
11. The data below show the number of daily telephone calls received by an office
supplies company over a period of 25 working days.
219 541 58 7 13
476 418 177 175 455
258 312 164 314 336
121 77 183 133 78
291 138 244 36 48
(a) Construct a stem-and-leaf diagram for these data and use this to find the
median of the data.
(b) Find the first and third quartiles of the data.
(c) Would you expect the mean to be similar to the median? Explain.
(d) Comment on your figures.
B.3. Solutions to Practice problems
4. The statement is false. The mean is only greater than the median when the dataset
is positively skewed. If the distribution is symmetric, then the mean and median
are equal. If the distribution is negatively skewed, then the mean is less than the
median.
5. We have:
(1/3) Σᵢ₌₂⁴ xᵢ = (1/3)(x2 + x3 + x4) = (4 + 5 + 6)/3 = 15/3 = 5.
The first observation, x1 , could be an outlier due to its value, 1, being ‘far’ from
the other observed values. Given the small sample size of n = 4, including this
potentially extreme observation might give a misleading estimate of the true
population mean.
6. (a) Since n = 3, the median is the midpoint of the ordered dataset, which is:
7. (a) The mode is the value which occurs most frequently. Here the mode is 2.
(b) The mean is:
(1/n) Σᵢ₌₁ⁿ xᵢ = (x1 + x2 + x3 + x4 + x5)/5 = (4 + 2 + 2 + 5 + 6)/5 = 19/5 = 3.8.
8. There are 13 observations so the mean is the sum of them (1,749 as given) divided
by 13. This comes to £134.54.
The numbers have been arranged in order of size, so the median is the
((13 + 1)/2)th observation, that is the 7th (ordered) observation. This is £110.
The mode of the data, as given, is £195 which occurs twice (the other figures only
occur once). However, if we round to the nearest 10 (counting the 5s downwards)
the prices become (in £s) 50, 60, 60, 80, 100, 110, 110, 110, 150, 190, 190, 240, 260
and the mode is then £110 which occurs three times (£60 and £190 occur twice –
less often).
The range of the data is 265 − 52 = 213, so a price difference of £213 between the
most expensive and cheapest brands. The variance, standard deviation and
interquartile range are provided in the table, so check through your working and
make sure you can determine these.
Therefore:
s² = (1/(n − 1)) Σᵢ (xᵢ − 7.67)²
   = (1/8) × ((2 − 7.67)² + (4 − 7.67)² + (5 − 7.67)² + (6 − 7.67)² + (6 − 7.67)² + (7 − 7.67)² + (8 − 7.67)² + (11 − 7.67)² + (20 − 7.67)²)
   = (1/8) × (32.15 + 13.47 + 7.13 + 2.79 + 2.79 + 0.45 + 0.11 + 11.09 + 152.03)
   = 222.01/8
   = 27.75.
You can see this is very hard work, and it is quite remarkable that having rounded
figures to the nearest two decimal places before squaring the (xi − x̄) terms we still
get the same answer as that using the short-cut formula.
The short-cut formula gives us:

s² = (1/(n − 1)) × (Σ xᵢ² − n x̄²) = (1/8) × (751 − 9 × (7.67)²) = 27.75.
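Both routes can be checked numerically; the statistics module carries out the definitional calculation in exact arithmetic, so no rounding is needed:

```python
from statistics import mean, variance

minutes = [2, 4, 5, 6, 6, 7, 8, 11, 20]
xbar = mean(minutes)                      # 69/9 = 7.666...
# short-cut form: (sum of squares - n * xbar^2) / (n - 1)
s2 = (sum(x**2 for x in minutes) - len(minutes) * xbar**2) / (len(minutes) - 1)
print(round(s2, 2), variance(minutes))    # both 27.75
```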
10. The statement is false. By definition, the lower quartile separates the bottom 25%
of observations in a dataset from the top 75% of observations.
(b) For the quartiles, noting that different quartile calculation methods are
acceptable, possible (interpolated) solutions are:
Q1 = x(n/4) = x(6.25) = x(6) + (x(7) − x(6))/4 = 77 + (78 − 77)/4 = 77.25

and:

Q3 = x(3n/4) = x(18.75) = x(18) + 3(x(19) − x(18))/4 = 291 + 3(312 − 291)/4 = 306.75.
Appendix C
Probability theory
C.1 Worked examples
1. Calculate the probability that, when two fair dice are rolled, the sum of the
upturned faces will be:
(a) an odd number
(b) less than 9
(c) exactly 12
(d) exactly 4.
Solution:
The following table shows all the possibilities [as: (Score on first die, Score on
second die)]:
The total number of possible outcomes, N , is 36. Note all outcomes are equally
likely.
(a) The number of favourable points = 18, so P (odd total) = 18/36 = 1/2 = 0.50.
(b) The number of favourable points = 26, so P (less than 9) = 26/36 = 13/18
= 0.7222.
(c) The number of favourable points = 1, so P (exactly 12) = 1/36 = 0.0278.
(d) The number of favourable points = 3, so P (exactly 4) = 3/36 = 1/12 = 0.0833.
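All four probabilities follow from enumerating the 36 equally likely outcomes; a sketch:

```python
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # all 36 equally likely outcomes
sums = [a + b for a, b in rolls]
print(sum(s % 2 == 1 for s in sums) / 36)      # (a) P(odd total)   = 18/36
print(sum(s < 9 for s in sums) / 36)           # (b) P(less than 9) = 26/36
print(sum(s == 12 for s in sums) / 36)         # (c) P(exactly 12)  = 1/36
print(sum(s == 4 for s in sums) / 36)          # (d) P(exactly 4)   = 3/36
```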
2. A simple gambling game is played as follows. The gambler throws three fair coins
together with one fair die and wins £1 if the number of heads obtained in the
throw of the coins is greater than or equal to the score on the die. Calculate the
probability that the gambler wins.
Solution:
The total number of equally likely outcomes is:
N = 2 × 2 × 2 × 6 = 48.
We can simply add the numbers of cases, so there are n = 12 favourable outcomes.
Hence the required probability is 12/48 = 1/4 = 0.25.
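The count n = 12 can be verified by enumerating all 48 outcomes directly:

```python
from itertools import product

coin_outcomes = list(product("HT", repeat=3))   # 8 equally likely coin outcomes
wins = 0
for coins in coin_outcomes:
    heads = coins.count("H")
    for die in range(1, 7):                     # 6 equally likely die faces
        if heads >= die:                        # gambler wins £1
            wins += 1
total = len(coin_outcomes) * 6                  # 48 equally likely outcomes
print(wins, wins / total)  # 12 0.25
```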
3. Three fair dice are thrown and you are told that at least one of the upturned faces
shows a 4. Using this fact, determine the (conditional) probabilities of:
(a) exactly one 4
(b) exactly two 4s.
Solution:
We first identify N , the total number of equally likely outcomes when at least one
die shows a 4, i.e. we restrict our attention to the ‘smaller’ sample space, rather
than the ‘full’ sample space of 63 = 216 outcomes. If the first die shows a 4, then
the other two can be anything, so the ‘pattern’ is 4XY . There are six choices for
each of X and Y , and these choices are independent of each other. Hence there are
6 × 6 = 36 outcomes in which the first die shows a 4.
If the first die does not show a 4, then there are 11 possible combinations of values
for the other two dice. To see why, one method is to list the possible pairs of values:
where, of course, the outcome ‘44’ can only be shown in one of the two lists. This
‘listing’ approach works because the number of possibilities for the two dice is small
enough such that we can list all of those in which we are interested (unlike with
three dice).
Another, slightly different, method is to calculate the numbers in each row
(without listing them all). For the pattern 4X there are six choices of X, hence six
cases, and for the pattern [not 4]4 there are five choices for [not 4] and hence five
cases, giving eleven possibilities altogether. As mentioned above, anything except
short lists should be avoided, so it is useful to know lots of alternative ways for
working out the number of relevant outcomes.
Hence, the total number of equally likely outcomes with at least one 4 is: N = 36 + (5 × 11) = 91.
We now identify the number of favourable outcomes, n, for two or more 4s.
If the first die shows a 4, then there has to be at least one 4 on the other two, so
there are 11 ways for the other two dice. If the first die does not show a 4, then
there are 5 choices for what it might show, while both the other two dice must
show 4s. Clearly, the two 4s can only happen in 1 way. So:
n = 11 + (5 × 1) = 16
and exactly one of these cases, (4, 4, 4), represents more than two 4s, so that fifteen
represent exactly two 4s.
Hence the required probabilities are as follows.
(a) We have:

P(exactly one 4) = (91 − 16)/91 = 75/91 = 0.8242.

(b) We have:

P(exactly two 4s) = 15/91 = 0.1648.
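The counts N = 91, 75 and 15 can be confirmed by brute-force enumeration of the restricted sample space:

```python
from itertools import product

# keep only the outcomes in which at least one die shows a 4
triples = [t for t in product(range(1, 7), repeat=3) if 4 in t]
N = len(triples)                                    # 91
one_four = sum(t.count(4) == 1 for t in triples)    # 75
two_fours = sum(t.count(4) == 2 for t in triples)   # 15
print(N, one_four / N, two_fours / N)
```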
(Venn diagram: events A and B, with the regions labelled a, b, d, e, f and g as used below.)
(a) With the same notation from the diagram, the area for Ac is everything
outside the area for A, so P (Ac ) = 1 − P (A).
P (A ∪ B) = a + e + f + b + (d + g)
= (a + d + g + e) + (d + b + f + g) − (d + g)
= P (A) + P (B) − P (A ∩ B).
(c) If A implies B, then this means that the region for A is completely inside the
region for B, i.e. that the intersection of A and B is the whole of A. Hence the
area of the shape for A is no larger than the area of the shape for B, i.e.
P (A) ≤ P (B).
(d) It is clear from the diagram that:
area(A ∪ B ∪ C) = a + b + c + d + e + f + g
P (A) = a + d + e + g P (A ∩ B) = d + g
P (B) = b + d + f + g P (A ∩ C) = e + g
P (C) = c + e + f + g P (B ∩ C) = f + g
P (A ∩ B ∩ C) = g.
So P (A) + P (B) + P (C) − P (A ∩ B) − P (A ∩ C) − P (B ∩ C) + P (A ∩ B ∩ C)
equals:
(a + d + e + g) + (b + d + f + g) + (c + e + f + g)
− ((f + g) + (e + g) + (d + g)) + g
=a+b+c+d+e+f +g
= area(A ∪ B ∪ C)
= P (A ∪ B ∪ C).
So:

P(at least one subject) = (20 + 15 + 10 − 9 − 5 − 6 + 3)/35 = 28/35 = 0.80.
10. Of all the candles produced by a company, 0.01% do not have wicks (the core piece
of string). A retailer buys 10,000 candles from the company.
(a) What is the probability that all the candles have wicks?
(b) What is the probability that at least one candle will not have a wick?
Solution:
Let X be the number of candles without a wick.
(a) We have:
P (X = 0) = (0.9999)10,000 = 0.3679.
(b) We have:
P (X ≥ 1) = 1 − P (X = 0) = 1 − 0.3679 = 0.6321.
GGGG
GGGB GGBG GBGG BGGG
GGBB GBGB BGGB GBBG BGBG BBGG
BBBG BBGB BGBB GBBB
BBBB
(c) A pair of fair dice is thrown and the faces are equal.
(d) Five cards are drawn at random from a deck of cards and all are the same suit.
(e) Four independent tosses of a fair coin result in at least two heads. What is the
probability that all four tosses are heads?
Solution:
(a) By independence, we have (0.50)⁵ = 0.03125.
(b) By independence (noting we only care about the first and last outcomes), we
have (0.50)2 = 0.25.
(c) Of the 36 equally likely outcomes, 6 have equal faces, hence the probability is
6/36 = 1/6 = 0.1667.
(d) Note this is sampling without replacement; the first card drawn can be any card, after which the remaining four must match its suit. The probability is:

1 × (12/51) × (11/50) × (10/49) × (9/48) = 0.00198.
(e) The sample space has 11 points, all equally likely, and one is the event of
interest, so the probability is 1/11 = 0.0909.
13. A company is concerned about interruptions to email. It was noticed that problems
occurred on 15% of workdays. To see how bad the situation is, calculate the
probabilities of an interruption during a five-day working week:
(a) on Monday and again on Tuesday
(b) for the first time on Thursday
(c) every day
(d) at least once during the week.
Solution:
(a) This is the probability of the event occurring on Monday and on Tuesday.
Each has probability 0.15 and the outcome is independent of day of the week,
so calculate:
(0.15)² = 0.0225.
(b) We require the probability of the event not occurring on the first three days of
the week, but occurring on Thursday. So the result will be:
(1 − 0.15)³ × 0.15 = 0.092.
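Parts (c) and (d) follow the same pattern ((c) is (0.15)⁵ and (d) is 1 − (0.85)⁵); a sketch covering all four parts:

```python
p = 0.15                               # P(interruption on any given workday)
mon_and_tue = p**2                     # (a) Monday and again on Tuesday
first_on_thu = (1 - p)**3 * p          # (b) first interruption on Thursday
every_day = p**5                       # (c) an interruption every day
at_least_once = 1 - (1 - p)**5         # (d) at least once during the week
print(round(mon_and_tue, 4), round(first_on_thu, 3),
      every_day, round(at_least_once, 4))
```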
A = both show heads = {HH} and B = 10p shows heads = {HT, HH}
and use conditional probability, i.e. the result that P (A | B) = P (A ∩ B)/P (B).
To do this we note that in this case the event A ∩ B is the same as the event A
(which is important to spot, when it happens). Hence:
P(HH | 10p shows heads) = P(A ∩ B)/P(B) = P(A)/P(B) = P(HH)/P(HT or HH) = 0.25/0.5 = 0.50.
can assume that one throw’s outcome does not affect the others. That is, the
outcomes of the different throws are independent, which (always!) means we
multiply the probabilities. Combining these two thoughts, there are:
N = 2 × 2 × 2 × 2 × 2 × 2 = 26 = 64
items (‘outcomes’) in the full, six-throw sample space, each with probability:
(1/2) × (1/2) × (1/2) × (1/2) × (1/2) × (1/2) = (1/2)⁶ = 1/64.
(a) It is (just about) possible to list accurately the 20 cases in which there are
exactly three heads (and hence exactly three tails), identifying the four cases
(top row) in which the three Hs are adjacent.
To make sure that you count everything, but just once, it is essential to
organise the listed items according to some pattern.
Hence:

P(exactly 3 Hs) = n/N = 20/64 = 5/16 = 0.3125.
(b) If there are exactly three heads on consecutive throws, then it is possible to
specify which of these different outcomes we are dealing with by identifying
how many (if any) T s there are before the sequence of Hs. (The remaining T s
must be after the sequence of Hs.) The possible values (out of 3) are 0, 1, 2
and 3, so there are four cases.
Hence:

P(3 consecutive Hs | exactly 3 Hs) = n/N = 4/20 = 1/5 = 0.20.
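Both counts can be confirmed by enumerating the 64 sequences:

```python
from itertools import product

seqs = list(product("HT", repeat=6))                  # 64 equally likely sequences
three_h = [s for s in seqs if s.count("H") == 3]      # exactly three heads
# with exactly three heads in total, "HHH" appearing means they are adjacent
consec = [s for s in three_h if "HHH" in "".join(s)]
print(len(three_h), len(consec), len(consec) / len(three_h))
```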
16. Let K be the event of drawing a ‘king’ from a well-shuffled deck of playing cards.
Let D be the event of drawing a ‘diamond’ from the pack. Determine:
(a) P(K)        (f) P(K | D)
(b) P(D)        (g) P(D | K)
(c) P(Kᶜ)       (h) P(D ∪ Kᶜ)
(d) P(K ∩ D)    (i) P(Dᶜ ∩ K)
(e) P(K ∪ D)    (j) P((Dᶜ ∩ K) | (D ∪ K)).
Are the events D and K independent, mutually exclusive, neither or both?
Solution:
(a) We have P(K) = 4/52 = 1/13.

(b) We have P(D) = 13/52 = 1/4.

(c) We have P(Kᶜ) = 1 − 1/13 = 12/13.

(d) We have P(K ∩ D) = 1/52.

(e) We have P(K ∪ D) = 16/52 = 4/13.

(f) We have P(K | D) = 1/13.

(g) We have P(D | K) = 1/4.

(h) We have P(D ∪ Kᶜ) = 49/52.

(i) We have P(Dᶜ ∩ K) = 3/52.

(j) We have P((Dᶜ ∩ K) | (D ∪ K)) = 3/16.
The events are independent since P (K ∩ D) = P (K) P (D), but they are not
mutually exclusive (consider the King of Diamonds).
17. At a local school, 90% of the students took test A, and 15% of the students took
both test A and test B. Based on the information provided, which of the following
calculations are not possible, and why? What can you say based on the data?
(a) P (B | A).
(b) P (A | B).
(c) P (A ∪ B).
If you knew that everyone who took test B also took test A, how would that
change your answers?
Solution:
(a) Possible:

P(B | A) = P(A ∩ B)/P(A) = 0.15/0.90 = 0.167.
(b) Not possible, because P(B) is unknown and we would need it to calculate P(A | B) = P(A ∩ B)/P(B) = 0.15/P(B).

(c) Not possible, because P(A ∪ B) = P(A) + P(B) − P(A ∩ B) also requires the unknown P(B).

If everyone who took test B also took test A, then B is a subset of A, so P(A ∪ B) = P(A) = 0.90 and P(A | B) = 1, while P(B | A) is unchanged.
18. Given two events, A and B, state why each of the following is not possible. Use
formulae or equations to illustrate your answer.
(a) P (A) = −0.46.
(b) P (A) = 0.26 and P (Ac ) = 0.62.
(c) P (A ∩ B) = 0.92 and P (A ∪ B) = 0.42.
(d) P (A ∩ B) = P (A) P (B) and P (B) > P (B | A).
Solution:
(a) Not possible because a probability cannot be negative since 0 ≤ P (A) ≤ 1.
(b) Not possible because P(A) + P(Aᶜ) = 1. Here 0.26 + 0.62 = 0.88 ≠ 1.
(c) Not possible because A ∩ B is a subset of A ∪ B, hence we have
P (A ∩ B) ≤ P (A ∪ B). Here 0.92 > 0.42.
(d) Not possible because two events cannot both be independent (as implied by
the condition P (A ∩ B) = P (A) P (B)) and dependent (as implied by the
condition P (B) > P (B | A)) at the same time.
P (A | B) = 0.20 P (B | C) = 0 P (C | A) = 0.25.
i. C implies A.
ii. B and C are mutually exclusive.
iii. A ∪ B ∪ C = S.
(b) What is P(B | A)?
(c) What is P (B | Ac )?
Solution:
(a) i. The necessary condition for C implies A is that P (A | C) = 1. Using
Bayes’ formula, we see that:
P (A ∪ B ∪ C) = P (A ∪ B)
= P (A) + P (B) − P (A ∩ B)
= 0.40 + 0.50 − P (A | B) P (B)
= 0.40 + 0.50 − 0.20 × 0.50
= 0.80.
20. In an audit Bill analyses 60% of the audit items and George analyses 40%. Bill’s
error rate is 5% and George’s error rate is 3%. Suppose an item is sampled at
random.
(a) What is the probability that it is in error (i.e. audited incorrectly)?
(b) If the chosen item is incorrect what is the probability that Bill is to blame?
Solution:
Let B = Bill audits item, G = George audits item, and E = incorrect audit. Hence
P (B) = 0.60, P (G) = 0.40, P (E | B) = 0.05 and P (E | G) = 0.03.
(a) Using the total probability formula:
P (E) = P (E | B) P (B) + P (E | G) P (G) = (0.05 × 0.60) + (0.03 × 0.40) = 0.042.
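Part (b) is a direct application of Bayes' formula, P(B | E) = P(E | B) P(B)/P(E); numerically:

```python
p_b, p_g = 0.60, 0.40                 # who audits the item
e_given_b, e_given_g = 0.05, 0.03     # error rates for Bill and George
p_e = e_given_b * p_b + e_given_g * p_g   # (a) total probability formula
p_b_given_e = e_given_b * p_b / p_e       # (b) Bayes' formula
print(round(p_e, 3), round(p_b_given_e, 4))
```

So if the sampled item is incorrect, the probability that Bill audited it is about 0.714.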
21. Two fair coins are tossed. You are told that ‘at least one is a head’. What is the
probability that both are heads?
Solution:
At first sight it may appear that this should be equal to 1/2, but this is wrong! The
correct solution may be achieved in several ways; here we shall use a conditional
probability approach.
Let HH denote the event ‘both coins show heads’, and let A denote the event ‘at
least one coin shows a head’. Then:
P(HH | A) = P(A ∩ HH)/P(A) = P(HH)/P(A) = (1/4)/(3/4) = 1/3.
Note that A ∩ HH = HH, i.e. ‘at least one head’ and ‘both show heads’ only
corresponds to ‘both show heads’, HH.
22. The probability of a horse winning a race is 0.30 if it is dry and 0.50 if it is wet.
The weather forecast gives the chance of rain as 40%.
(a) Find the probability that the horse wins.
(b) If you are told that the horse lost the race, what is the probability that the
weather was dry on the day of the race? (Assuming you cannot remember!)
Solution:
Let A = ‘horse wins’, Ac = ‘horse loses’, D = ‘dry’ and Dc = ‘wet’. We have:
P (A | D) = 0.30, P (A | Dc ) = 0.50, P (D) = 0.60 and P (Dc ) = 0.40.
(a) We have:
P (A) = P (A | D) P (D) + P (A | Dc ) P (Dc ) = 0.30 × 0.60 + 0.50 × 0.40 = 0.38.
Hence the horse has a probability of 0.38 of winning the race.
(b) From (a), we can determine that there is a 62% chance that the horse loses the
race, i.e. P (Ac ) = 0.62. Hence:
P(D | Aᶜ) = P(Aᶜ | D) P(D)/P(Aᶜ) = (0.70 × 0.60)/0.62 = 0.6774.
That is, the probability it is dry, given that the horse loses, is 0.6774, or about
68%.
(b) If wine is ordered, what is the probability that the person ordering is well-dressed?

(c) If wine is not ordered, what is the probability that the person ordering is poorly-dressed?
Solution:
The following notation is used:

• W = well-dressed
• C = casually-dressed
• P = poorly-dressed
• O = wine ordered.

(a) We have:

P(O) = P(O | W) P(W) + P(O | C) P(C) + P(O | P) P(P) = (0.70 × 0.50) + (0.50 × 0.40) + (0.30 × 0.10) = 0.58.

(b) We have:

P(W | O) = P(O | W) P(W)/P(O) = (0.70 × 0.50)/0.58 ≈ 0.60.
(c) We require:
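The quantity required in (c) is P(P | Oᶜ) = P(Oᶜ | P) P(P)/P(Oᶜ) = (0.70 × 0.10)/0.42 ≈ 0.17. A sketch covering all three parts:

```python
priors = {"W": 0.50, "C": 0.40, "P": 0.10}        # dress categories
p_wine = {"W": 0.70, "C": 0.50, "P": 0.30}        # P(O | category)
p_o = sum(priors[g] * p_wine[g] for g in priors)  # (a) total probability: 0.58
w_given_o = priors["W"] * p_wine["W"] / p_o       # (b) Bayes' formula
p_given_not_o = priors["P"] * (1 - p_wine["P"]) / (1 - p_o)   # (c)
print(round(p_o, 2), round(w_given_o, 4), round(p_given_not_o, 4))
```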
24. In a large lecture, 60% of students self-identify as female and 40% self-identify as
male. Records show that 15% of female students and 20% of male students are
registered as part-time students.
(a) If a student is chosen at random from the lecture, what is the probability that
the student studies part-time?
(b) If a randomly chosen student studies part-time, what is the probability that
the student is male?
Solution:
Let P T = ‘part time’, F = ‘female’ and M = ‘male’.
(a) We have:

P(PT) = P(PT | F) P(F) + P(PT | M) P(M) = (0.15 × 0.60) + (0.20 × 0.40) = 0.17.
(b) We have:

P(M | PT) = P(PT | M) P(M)/P(PT) = (0.20 × 0.40)/0.17 = 0.4706.
25. 20% of men show early signs of losing their hair. 2% of men carry a gene that is
related to hair loss. 80% of men who carry the gene experience early hair loss.
(a) What is the probability that a man carries the gene and experiences early hair
loss?
(b) What is the probability that a man carries the gene, given that he experiences
early hair loss?
Solution:
Using obvious notation, P (H) = 0.20, P (G) = 0.02 and P (H | G) = 0.80.
(a) We require:

P(G ∩ H) = P(H | G) P(G) = 0.80 × 0.02 = 0.016.
(b) We require:

P(G | H) = P(G ∩ H)/P(H) = 0.016/0.20 = 0.08.
26. James is a salesman for a company and sells two products, A and B. He visits three
different customers each day. For each customer, the probability that James sells
product A is 1/3 and the probability is 1/4 that he sells product B. The sale of
product A is independent of the sale of product B during any visit, and the results
of the three visits are mutually independent. Calculate the probability that James
will:
(a) sell both products, A and B, on the first visit
(b) sell only one product during the first visit
(c) make no sales of product A during the day
(d) make at least one sale of product B during the day.
Solution:
Let A = ‘product A sold’ and B = ‘product B sold’.
(a) We have:

P(A ∩ B) = P(A) P(B) = (1/3) × (1/4) = 1/12.
(b) We have:

P(A ∩ Bᶜ) + P(Aᶜ ∩ B) = (1/3) × (3/4) + (2/3) × (1/4) = 5/12.
(c) Since P(Aᶜ) = 2/3, then:

P(no A sales all day) = (2/3)³ = 8/27.
(d) We have:

P(at least 1 B sale) = 1 − P(no B sales) = 1 − (3/4)³ = 37/64.
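Exact arithmetic with fractions confirms all four answers:

```python
from fractions import Fraction

pA, pB = Fraction(1, 3), Fraction(1, 4)
both_first_visit = pA * pB                        # (a) 1/12
exactly_one = pA * (1 - pB) + (1 - pA) * pB       # (b) 5/12
no_A_all_day = (1 - pA) ** 3                      # (c) 8/27
at_least_one_B = 1 - (1 - pB) ** 3                # (d) 37/64
print(both_first_visit, exactly_one, no_A_all_day, at_least_one_B)
```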
28. Hard!
There are 3 identical urns.
• The first urn contains 7 red balls and 3 white balls.
• The second urn contains 5 red balls and 5 white balls.
• The third urn contains 4 red balls and 8 white balls.
One of the urns is selected at random (i.e. each urn has a 1-in-3 chance of being
selected). Balls are then drawn from the selected urn without replacement.
• The first ball is red.
• The second ball is white.
• The third ball is red.
• The fourth ball is red.
At each stage of sampling, calculate the probabilities of the selected urn being Urn
I, Urn II or Urn III.
Solution:
The following table shows the relevant calculations:
Note that the ‘Sum’ row is calculated using the total probability formula, and the
urn probabilities are computed using Bayes’ formula.
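The table itself is not reproduced here, but the sequential updating is easy to carry out with exact fractions: after each draw, the posterior probabilities become the next stage's priors, and (under each urn hypothesis) the drawn ball is removed before the next draw.

```python
from fractions import Fraction

urns = {"I": [7, 3], "II": [5, 5], "III": [4, 8]}   # [red, white] balls remaining
post = {u: Fraction(1, 3) for u in urns}            # prior: each urn equally likely
draws = ["R", "W", "R", "R"]                        # observed sequence

for d in draws:
    idx = 0 if d == "R" else 1
    # likelihood of this draw under each urn hypothesis
    like = {u: Fraction(urns[u][idx], sum(urns[u])) for u in urns}
    norm = sum(post[u] * like[u] for u in urns)     # total probability formula
    post = {u: post[u] * like[u] / norm for u in urns}   # Bayes' formula
    for u in urns:
        urns[u][idx] -= 1                           # sampling without replacement
    print(d, {u: float(post[u]) for u in urns})
```

After the first (red) draw the posterior probabilities are 21/46, 15/46 and 10/46 for Urns I, II and III respectively; by the fourth draw Urn I is by far the most likely.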
Determine:
(a) Fᶜ
(b) Eᶜ ∩ Fᶜ
(c) (E ∪ F)ᶜ
(d) Eᶜ ∩ F.
2. Draw the appropriate Venn diagram to show each of the following in connection
with Question 1:
(a) E ∪ F = {3, 4, 5, 6}
(b) E ∩ F = {4}
(c) Eᶜ = {1, 2, 5, 6}.
What are the probabilities associated with a delivery chosen at random for each of
the following?
(a) A delivery is early.
(b) A delivery is from Smith.
(c) A delivery is from Jones and late.
4. There are three sites a company may move to: A, B and C. We are told that P (A)
(the probability of a move to A) is 1/2, and P (B) = 1/3. What is P (C)?
5. Two events A and B are independent with P (A) = 1/3 and P (B) = 1/4. What is
P (A ∩ B)?
6. Say whether the following statement is true or false and briefly give your reason(s).
‘If two events are independent, then they must be mutually exclusive.’
C.2 Practice problems
9. A coffee machine may be defective because it dispenses the wrong amount of coffee,
event C, and/or it dispenses the wrong amount of sugar, event S. The probabilities
of these defects are:
P (C) = 0.05, P (S) = 0.04 and P (C ∩ S) = 0.01.
10. A company gets 60% of its supplies from manufacturer A, and the remainder from
manufacturer Z. The quality of the parts delivered is given below:
11. A company has a security system comprising of four electronic devices (A, B, C
and D) which operate independently. Each device has a probability of 0.10 of
failure. The four electronic devices are arranged such that the whole system
operates if at least one of A or B functions and at least one of C or D functions.
Show that the probability that the whole system functions properly is 0.9801.
(Use set theory and the laws of probability, or a probability tree.)
12. A student can enter a course either as a beginner (73% of all students) or as a
transferring student (27% of all students). It is found that 62% of beginners
eventually graduate, and that 78% of transferring students eventually graduate.
(a) Find the probability that a randomly chosen student:
i. is a beginner who will eventually graduate
ii. will eventually graduate
iii. is either a beginner or will eventually graduate, or both.
(b) Are the events ‘eventually graduates’ and ‘enters as a transferring student’
statistically independent?
(c) If a student eventually graduates, what is the probability that the student
entered as a transferring student?
(d) If two entering students are chosen at random, what is the probability that not
only do they enter in the same way but that they also both graduate or both
fail?
C.3 Solutions to Practice problems
1. We have:
S = {1, 2, 3, 4, 5, 6}, E = {3, 4} and F = {4, 5, 6}.
3. (a) Of the total number of equally likely outcomes, 300, there are 30 which are
early. Hence required probability is 30/300 = 0.10.
(b) Again, of the total number of equally likely outcomes, 300, there are 150 from
Smith. Hence required probability is 150/300 = 0.50.
(c) Now, of the 300, there are only 10 which are late and from Jones. Hence the
probability is 10/300 = 1/30.
8. (a) The additive law, for any two events A and B, is:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
For example, consider the following Venn diagram:
(Venn diagram: region x is A only, region z is A ∩ B, and region y is B only.)
(b) The multiplicative law, for any two independent events A and B, is:
P (A ∩ B) = P (A) P (B).
(b) Having no defects is the complement of the event of having at least one defect,
hence:
P (no defects) = 1 − P (C ∪ S) = 1 − 0.08 = 0.92.
10. Note that the percentages of good and bad parts total 100.
(a) The probability that a randomly chosen part comes from A is 0.6 (60%), the
probability that one of A’s parts is bad is 0.03 (3%), so the probability that a
randomly chosen part comes from A and is bad is 0.6 × 0.03 = 0.018.
(b) Add all the outcomes, which gives:
(0.6×0.97)+(0.6×0.03)+(0.4×0.93)+(0.4×0.07) = 0.582+0.018+0.372+0.028 = 1.
(c) The probability of receiving a bad part is the probability of receiving a bad part either from A or from Z, which is: 0.018 + 0.028 = 0.046.
Probabilities
Device Not fail Fail
A 0.9 0.1
B 0.9 0.1
C 0.9 0.1
D 0.9 0.1
The system fails if both A and B fail (or more) or both C and D fail (or more).
To work out the probability that the system works properly, first work out the
probability it will fail, P (F ). The probability it will work is P (F c ).
The system fails if any of the following occur:
• A, B, C and D all fail = (0.1)4 = 0.0001
• ABC, ABD, ACD or BCD fail = (0.1)3 × 0.9 × 4 = 0.0036
• A and B fail and C & D are okay = 0.1 × 0.1 × 0.9 × 0.9 = 0.0081
• C and D fail and A & B are okay = 0.1 × 0.1 × 0.9 × 0.9 = 0.0081.
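Summing these gives P(F) = 0.0001 + 0.0036 + 0.0081 + 0.0081 = 0.0199, hence P(Fᶜ) = 0.9801, as required. Equivalently, enumerate all 16 fail/not-fail combinations:

```python
from itertools import product

P_FAIL = 0.1
works = 0.0
for a, b, c, d in product([True, False], repeat=4):   # True means the device fails
    prob = 1.0
    for fails in (a, b, c, d):
        prob *= P_FAIL if fails else 1 - P_FAIL       # devices are independent
    if (not a or not b) and (not c or not d):         # system functions
        works += prob
print(round(works, 4))  # 0.9801
```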
(c) We have:

P(T | G) = P(T ∩ G)/P(G) = P(G | T) P(T)/P(G) = 0.2106/0.6632 = 0.3176.
and:
P (Gc ∩ B) = P (Gc | B) P (B) = (1 − 0.62) × 0.73 = 0.2774.
Two students being chosen at random can be considered as independent
events, hence the required probability is:
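The final step is not shown above; since the two students are independent, the probability that both fall in the same (entry mode, outcome) cell is the sum of the squared cell probabilities. A sketch:

```python
pB, pT = 0.73, 0.27                  # beginner / transferring student
gB, gT = 0.62, 0.78                  # graduation rates for each entry mode
cells = [pB * gB, pB * (1 - gB), pT * gT, pT * (1 - gT)]   # joint probabilities
same_both = sum(p**2 for p in cells)     # both students land in the same cell
print(round(same_both, 4))
```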
Appendix D
Random variables, the normal and
sampling distributions
X=x 1 2 3 4 5 6 Total
P (X = x) 1/6 1/6 1/6 1/6 1/6 1/6 1
x P (X = x) 1/6 2/6 3/6 4/6 5/6 6/6 21/6 = 3.50
x2 P (X = x) 1/6 4/6 9/6 16/6 25/6 36/6 91/6
X=x 8 9 10 11 12 Total
P (X = x) 5/36 4/36 3/36 2/36 1/36 1
x P (X = x) 40/36 36/36 30/36 22/36 12/36 252/36
x2 P (X = x) 320/36 324/36 300/36 242/36 144/36 1,974/36
(Figure: plot of the probabilities P(X = x) against the value of the sum, x = 2, 3, . . . , 12.)
Y =y 0 1 2 3 4 5 Total
P (Y = y) 6/36 10/36 8/36 6/36 4/36 2/36 1
y P (Y = y) 0 10/36 16/36 18/36 16/36 10/36 70/36
y 2 P (Y = y) 0 10/36 32/36 54/36 64/36 50/36 210/36
This yields a mean of E(Y) = 70/36 = 1.94, while the variance is:

E(Y²) − µ² = 210/36 − (1.94)² = 2.05.
D.1. Worked examples
(Figure: plot of the probabilities P(Y = y) against y = 0, 1, . . . , 5.)
X = x       8      9      10     11     12
P (X = x)   0.16   0.18   0.20   0.22   0.24

Hence:

Var(X) = 106 − (10.20)² = 1.96.
Solution:
The expected value is:

E(X) = Σ x P (X = x) = (500 × 0.01) + (100 × 0.03) + · · · + (0 × 0.35) = £14.00.
(b) We have:
(c) We have:
(d) We have:
P (0.81 < Z < 1.94) = P (Z < 1.94) − P (Z < 0.81) = 0.9738 − 0.7910 = 0.1828.
(e) We have:
(f) We have:
P (Z > −1.28) = P (Z < 1.28) = 0.8997.
(g) We have:
Solution:
Since X ∼ N (10, 4), we use the transformation Z = (X − µ)/σ with the values
µ = 10 and σ = √4 = 2.
(a) i. We have:

P (X > 13.4) = P (Z > (13.4 − 10)/2) = P (Z > 1.70) = 1 − P (Z ≤ 1.70) = 1 − 0.9554 = 0.0446.
ii. We have:

P (8 < X < 9) = P ((8 − 10)/2 < Z < (9 − 10)/2)
             = P (−1 < Z < −0.50)
             = P (Z < −0.50) − P (Z < −1)
             = (1 − P (Z ≤ 0.50)) − (1 − P (Z ≤ 1))
             = (1 − 0.6915) − (1 − 0.8413)
             = 0.1498.
(b) We want to find the value a such that P (10 − a < X < 10 + a) = 0.95, that is:

0.95 = P (((10 − a) − 10)/2 < Z < ((10 + a) − 10)/2)
     = P (−a/2 < Z < a/2)
     = P (Z < a/2) − P (Z < −a/2)
     = 1 − 2 × P (Z < −a/2).

This is the same as 2 × P (Z > a/2) = 0.05, i.e. P (Z > a/2) = 0.025, or
P (Z ≤ a/2) = 0.975. Therefore, from Table 4 of the New Cambridge Statistical
Tables, a/2 = 1.96, and so a = 3.92.
(c) We want to find the value b such that P (10 − b < X < 10 + b) = 0.99. Similar
reasoning shows that P (Z ≤ b/2) = 0.995. Therefore, from Table 4, b/2 = 2.58
(approximately), so that b = 5.16.
(d) We want k such that P (Z > k) = 0.01, or P (Z ≤ k) = 0.99. From Table 4,
k = 2.33 (approximately).
(e) We want x such that P (Z < x) = 0.05. This means that x < 0 and
P (Z > |x|) = 0.05, or P (Z ≤ |x|) = 0.95, so, from Table 4, |x| = 1.645 (by
interpolating between Φ(1.64) and Φ(1.65)) and hence x = −1.645.
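These table look-ups can be cross-checked numerically. The sketch below uses Python's standard library (`statistics.NormalDist`); the variable names are illustrative, and tiny discrepancies from the tabulated answers arise only from rounding z-values to two decimal places.

```python
from statistics import NormalDist

X = NormalDist(mu=10, sigma=2)   # X ~ N(10, 4), so sigma = sqrt(4) = 2
Z = NormalDist()                 # standard normal

# (a) i. P(X > 13.4)
p1 = 1 - X.cdf(13.4)

# (a) ii. P(8 < X < 9)
p2 = X.cdf(9) - X.cdf(8)

# (b) the half-width a with P(10 - a < X < 10 + a) = 0.95
a = 2 * Z.inv_cdf(0.975)
```

The exact values agree with the table-based answers 0.0446, 0.1498 and 3.92 to the quoted precision.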
7. Your company requires a special type of light bulb which is available from only two
suppliers. Supplier A’s light bulbs have a mean lifetime of 2,000 hours with a
standard deviation of 180 hours. Supplier B’s light bulbs have a mean lifetime of
1,850 hours with a standard deviation of 100 hours. The distribution of the
lifetimes of each type of light bulb is normal. Your company requires that the
lifetime of a light bulb be not less than 1,500 hours. All other things being equal,
which type of bulb should you buy, and why?
Solution:
Let A and B be the random variables representing the lifetimes (in hours) of light
bulbs from supplier A and supplier B, respectively. We are told that:

A ∼ N (2,000, (180)²)   and   B ∼ N (1,850, (100)²).

Since the relevant criterion is that light bulbs last at least 1,500 hours, the
company should choose the supplier whose light bulbs have a greater probability of
doing so. We find that:

P (A > 1,500) = P (Z > (1,500 − 2,000)/180) = P (Z > −2.78) = P (Z < 2.78) = 0.99728

and:

P (B > 1,500) = P (Z > (1,500 − 1,850)/100) = P (Z > −3.50) = P (Z < 3.50) = 0.9998.
Therefore, the company should buy light bulbs from supplier B, since they have a
greater probability of lasting the required time.
Note it is good practice to define notation and any units of measurement and to
state the distributions of the random variables. Note also that here it is not essential
to compute the probability values in order to determine what the company should
do, since −2.78 > −3.50 implies that P (Z > −2.78) < P (Z > −3.50).
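The comparison can be reproduced with a short numerical check; a sketch using the standard library (variable names are illustrative):

```python
from statistics import NormalDist

A = NormalDist(2000, 180)    # lifetime (hours) of supplier A's bulbs
B = NormalDist(1850, 100)    # lifetime (hours) of supplier B's bulbs

pA = 1 - A.cdf(1500)         # P(A > 1,500)
pB = 1 - B.cdf(1500)         # P(B > 1,500)
best = "B" if pB > pA else "A"
```

As in the worked solution, supplier B wins because P (B > 1,500) > P (A > 1,500).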
8. A normal distribution has a mean of 40. If 10% of the distribution falls between the
values of 50 and 60, what is the standard deviation of the distribution?
Solution:
Let X ∼ N (40, σ²). We seek σ, and know that:

P (50 ≤ X ≤ 60) = P ((50 − 40)/σ ≤ Z ≤ (60 − 40)/σ) = P (10/σ ≤ Z ≤ 20/σ) = 0.10.
Hence we know that one z-value (i.e. 20/σ) is twice the other (i.e. 10/σ), and their
corresponding probabilities differ by 0.10. We also know that the z-values are
positive since both represent numbers larger than the mean of 40. Now we need to
use Table 4 of the New Cambridge Statistical Tables to find two such z-values.
Looking at Table 4 we find the values to be, approximately, 1.25 and 2.50.
Therefore, σ = 8.
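One way to confirm the table-based answer is to scan candidate values of σ for the one giving P (50 ≤ X ≤ 60) closest to 0.10; a minimal sketch (the scan grid is an arbitrary choice):

```python
from statistics import NormalDist

Z = NormalDist()

def prob_between(sigma):
    # P(50 <= X <= 60) for X ~ N(40, sigma^2)
    return Z.cdf(20 / sigma) - Z.cdf(10 / sigma)

# scan candidate standard deviations for the one closest to probability 0.10
sigma = min((s / 100 for s in range(100, 2000)),
            key=lambda s: abs(prob_between(s) - 0.10))
```

The scan lands very close to 8; the small offset from exactly 8 reflects the rounding of the tabulated z-values 1.25 and 2.50.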
9. The life, in hours, of a light bulb is normally distributed with a mean of 200 hours.
If a consumer requires at least 90% of the light bulbs to have lives exceeding 150
hours, what is the largest value that the standard deviation can have?
Solution:
Let X be the random variable representing the lifetime of a light bulb (in hours),
so that for some value σ we have X ∼ N (200, σ 2 ). We want P (X > 150) = 0.90,
such that:
P (X > 150) = P (Z > (150 − 200)/σ) = P (Z > −50/σ) = 0.90.
Note that this is the same as P (Z < 50/σ) = 0.90, so 50/σ = 1.28, giving
σ = 39.06.
10. The random variable X has a normal distribution with mean µ and variance σ², i.e.
X ∼ N (µ, σ²). It is known that P (X ≤ 66) = 0.0359 and P (X ≥ 81) = 0.1151.
Solution:
(a) The sketch below shows the probabilities with P (X ≤ 66) shaded blue and
P (X ≥ 81) shaded red.
(b) We have X ∼ N (µ, σ²), where µ and σ² are unknown. Using Table 4 of the
New Cambridge Statistical Tables, we find that P (Z ≤ −1.80) = 0.0359 and
P (Z ≥ 1.20) = 0.1151. Therefore:

P (X ≤ 66) = P (Z ≤ (66 − µ)/σ) = P (Z ≤ −1.80) = 0.0359

and:

P (X ≥ 81) = P (Z ≥ (81 − µ)/σ) = P (Z ≥ 1.20) = 0.1151.

Hence:

(66 − µ)/σ = −1.80   and   (81 − µ)/σ = 1.20.
Rearranging, we obtain a pair of simultaneous equations which can be solved
to find µ and σ. Specifically:

66 − µ = −1.8σ   and   81 − µ = 1.2σ.

Subtracting the first from the second gives 15 = 3σ, and hence σ = 5. For
completeness, 81 − µ = 6, so µ = 75. Therefore, X ∼ N (75, 5²).
(c) Given X ∼ N (75, 5²), we have:

P (69 ≤ X ≤ 83) = P ((69 − 75)/5 ≤ Z ≤ (83 − 75)/5) = P (−1.20 ≤ Z ≤ 1.60)
               = Φ(1.60) − (1 − Φ(1.20))
               = 0.9452 − (1 − 0.8849)
               = 0.8301.
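The simultaneous equations and the final probability can be checked numerically; a sketch using the standard library, where the z-values are recovered from the stated tail probabilities rather than taken from tables:

```python
from statistics import NormalDist

Z = NormalDist()
z_low = Z.inv_cdf(0.0359)        # about -1.80
z_high = Z.inv_cdf(1 - 0.1151)   # about  1.20

# solve 66 - mu = z_low * sigma and 81 - mu = z_high * sigma
sigma = (81 - 66) / (z_high - z_low)
mu = 66 - z_low * sigma

X = NormalDist(mu, sigma)
p = X.cdf(83) - X.cdf(69)        # P(69 <= X <= 83)
```

This recovers µ = 75 and σ = 5 (to table accuracy) and confirms the probability 0.8301.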
11. The number of newspapers sold daily at a kiosk is normally distributed with a
mean of 350 and a standard deviation of 30.
(a) Find the probability that fewer than 300 newspapers are sold on a particular
day.
(b) How many newspapers should the newsagent stock each day such that the
probability of running out on any particular day is 10%?
Solution:
Let X be the number of newspapers sold, hence X ∼ N (350, 900).
(a) We have:

P (X < 300) = P (Z < (300 − 350)/30) = P (Z < −1.67) = 1 − P (Z < 1.67) = 1 − 0.9525 = 0.0475.
(b) Let s be the required stock, then we require P (X > s) = 0.10. Hence:

P (Z > (s − 350)/30) = 0.10  ⇒  (s − 350)/30 ≥ 1.28  ⇒  s ≥ 350 + 1.28 × 30 = 388.4.
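Both parts can be verified with the inverse normal function; a sketch (small differences from the table answers come from rounding z to two decimal places):

```python
from math import ceil
from statistics import NormalDist

X = NormalDist(350, 30)          # daily newspaper sales

p_under_300 = X.cdf(300)         # part (a): P(X < 300)

# part (b): smallest stock s with P(X > s) = 0.10, i.e. the 90th percentile
needed = ceil(X.inv_cdf(0.90))   # round up to a whole newspaper
```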
12. Consider the following set of data. Does it appear to approximately follow a normal
distribution? Justify your answer.
45 31 37 55 54 56
48 54 52 55 52 51
49 46 62 38 45 48
47 46 40 61 50 58
46 35 36 59 50 48
39 48 51 52 43 45
Solution:
To see whether this set of data approximates a normal distribution, we need to
analyse it. Using a calculator we calculate the mean to be µ = 48.1 and the
standard deviation to be σ = 7.3 (assuming population data).
For ± one standard deviation, i.e. µ ± σ, the interval is (40.8, 55.4) which contains
24 observations, representing 24/36 = 67% of the data.
For ± two standard deviations, i.e. µ ± 2σ, the interval is (33.5, 62.7) which
contains 35 observations, representing 35/36 = 97% of the data.
These percentages match very closely to what we expect for a normal distribution.
We could also construct a histogram of the data, as shown below. This appears to
confirm that the data do indeed approximate a normal distribution.
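The counts quoted above can be reproduced directly; a sketch using the standard library (`pstdev` computes the population standard deviation, matching the solution's assumption):

```python
from statistics import fmean, pstdev

data = [45, 31, 37, 55, 54, 56, 48, 54, 52, 55, 52, 51,
        49, 46, 62, 38, 45, 48, 47, 46, 40, 61, 50, 58,
        46, 35, 36, 59, 50, 48, 39, 48, 51, 52, 43, 45]

mu = fmean(data)       # about 48.1
sigma = pstdev(data)   # about 7.3

within_1sd = sum(mu - sigma <= x <= mu + sigma for x in data)
within_2sd = sum(mu - 2 * sigma <= x <= mu + 2 * sigma for x in data)
```

This confirms 24 of the 36 observations (67%) lie within one standard deviation of the mean and 35 (97%) within two.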
Unit     A    B    C    D
Value    3    6    9    12
Solution:
(a) The population mean and variance are:

µ = (Σ xᵢ)/N = (3 + 6 + 9 + 12)/4 = 7.50

and:

σ² = (Σ xᵢ²)/N − µ² = (9 + 36 + 81 + 144)/4 − (7.50)² = 11.25.
(b) The sampling distribution of the sample mean for samples of size n = 2 is:
Sample   Values   X̄ = x̄   P (X̄ = x̄)
AB       3, 6     4.5      1/6
AC       3, 9     6        1/6
AD       3, 12    7.5      1/6
BC       6, 9     7.5      1/6
BD       6, 12    9        1/6
CD       9, 12    10.5     1/6
(c) The mean of the sampling distribution is:

E(X̄) = (4.5 + 6 + 7.5 + 7.5 + 9 + 10.5)/6 = 45/6 = 7.50.
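Enumerating the samples by hand generalises naturally to code; a sketch with `itertools.combinations` (each of the six samples is equally likely):

```python
from itertools import combinations
from statistics import fmean

population = {"A": 3, "B": 6, "C": 9, "D": 12}

# every sample of size 2 without replacement has probability 1/6
sample_means = [fmean(pair) for pair in combinations(population.values(), 2)]

mean_of_means = fmean(sample_means)   # equals the population mean, 7.50
```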
14. A random variable, X, can take the values 1, 2 and 3, each with equal probability.
List all possible samples of size two which may be chosen when order matters,
without replacement, from this population, and hence construct the sampling
distribution of the sample mean, X̄.
Solution:
Each possible sample has an equal probability of occurrence of 1/6. The sampling
distribution of X̄ is:
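The sampling distribution referred to above can be enumerated directly (the table itself is not reproduced in this extract); a sketch, using exact fractions to avoid rounding:

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

# ordered samples of size 2, without replacement, from {1, 2, 3}
samples = list(permutations([1, 2, 3], 2))   # 6 equally likely samples

counts = Counter(Fraction(a + b, 2) for a, b in samples)
probabilities = {float(xbar): Fraction(c, len(samples))
                 for xbar, c in counts.items()}
```

Each of the possible sample means 1.5, 2 and 2.5 occurs with probability 2/6 = 1/3.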
15. Discuss the differences or similarities between a sampling distribution of size 5 and
a single (simple) random sample of size 5.
Solution:
If 5 members are selected from a population such that every possible set of 5
population members has the same probability of being selected, then the sample is
a simple random sample. The sampling distribution of the sample mean for samples
of size 5 is the distribution of the sample mean calculated over every possible
sample of size 5 from the population. The similarity is that both consider every
possible set of 5 population members.
the sampling distribution of the sample mean as a random quantity over repeated
double tosses.
Solution:
Each possible sample has an equal probability of occurrence of 1/16. The sampling
distribution of X̄ is:
Relative frequency, P (X̄ = x̄) 1/16 1/8 3/16 1/4 3/16 1/8 1/16
17. The weights of a large group of animals have mean 8.2 kg and standard deviation
2.2 kg. What is the probability that a random selection of 80 animals from the
group will have a mean weight between 8.3 kg and 8.4 kg? State any assumptions
you make.
Solution:
We are not told that the population is normal, but n is ‘large’ so we can apply the
central limit theorem. The sampling distribution of X̄ is, approximately:
X̄ ∼ N (µ, σ²/n) = N (8.2, (2.2)²/80).
Hence, using Table 4 of the New Cambridge Statistical Tables:
P (8.3 ≤ X̄ ≤ 8.4) = P ((8.3 − 8.2)/(2.2/√80) ≤ Z ≤ (8.4 − 8.2)/(2.2/√80))
                 = P (0.41 ≤ Z ≤ 0.81)
                 = P (Z ≤ 0.81) − P (Z ≤ 0.41)
                 = 0.7910 − 0.6591
                 = 0.1319.
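The central limit theorem calculation can be checked in one line; a sketch (the exact answer differs slightly from 0.1319 because the tables round z to two decimal places):

```python
from math import sqrt
from statistics import NormalDist

# approximate sampling distribution of the mean weight of n = 80 animals
Xbar = NormalDist(8.2, 2.2 / sqrt(80))

p = Xbar.cdf(8.4) - Xbar.cdf(8.3)   # P(8.3 <= X_bar <= 8.4)
```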
18. A random sample of 25 audits is to be taken from a company’s total audits, and
the average value of these audits is to be calculated.
(a) Explain what is meant by the sampling distribution of this average and discuss
its relationship to the population mean.
(b) Is it reasonable to assume that this sampling distribution is normally
distributed?
(c) If the population of all audits has a mean of £54 and a standard deviation of
£10, find the probability that:
X̄ ∼ N (µ, σ²/n) = N (54, 100/25).
i. We have:

P (X̄ > 60) = P (Z > (60 − 54)/√(100/25)) = P (Z > 3) = 0.00135.
5. The manufacturer of a new brand of lithium battery claims that the mean life of a
battery is 3,800 hours with a standard deviation of 250 hours. Assume the lifespans
of lithium batteries follow a normal distribution.
(a) What percentage of batteries will last for more than 3,500 hours?
(b) What percentage of batteries will last for more than 4,000 hours?
(c) If 700 batteries are supplied, how many should last between 3,500 and 4,000
hours?
6. The following six observations give the time taken, in seconds, to complete a
100-metre sprint by all six individuals competing in a race. Note this is population
data.
Individual Time
A 15
B 14
C 10
D 12
E 20
F 15
D.3. Solutions to Practice problems
(a) Find the population mean, µ, and the population standard deviation, σ, of the
sprint times.
(b) Calculate the sample mean for each possible sample of:
i. two individuals
ii. three individuals
iii. four individuals.
(c) Work out the mean for each set of sample means (it must come to µ) and
compare the standard deviations of the sample means about µ.
This may take some time, but, after you have done it, you should have a
clearer idea about sampling distributions!
[Figure: standard normal density with the area 0.683 within one standard deviation of the mean shaded grey.]
(a) If we are looking for 0.68 (approximately) as the area under the curve one
standard deviation either side of µ, we need the grey shaded area in the
diagram above. Half of this area must be 0.68/2 = 0.34 and hence the whole
area to the left of µ + σ must be 0.34 + 0.50 = 0.84.
Looking at Table 4, for Φ(z) = 0.84, we see that z is between 0.99 (where
Φ(z) = 0.8389) and 1.00 (where Φ(z) = 0.8413), i.e. approximately correct.
(b) Similarly, look up Φ(z) of 0.95/2 + 0.50 = 0.475 + 0.50 = 0.975, which gives us
a z of exactly 1.96.
(c) Similarly, look up Φ(z) of 0.99/2 + 0.50 = 0.495 + 0.50 = 0.995, which gives z
between 2.57 and 2.58. Therefore, the third example is more approximate!
and:

P (B < 45) = P ((B − 55)/6 < (45 − 55)/6) = P (Z < −1.67).
Clearly, P (Z < −1.67) < P (Z < −1) (if you do not see this immediately, shade
these regions on a sketch of the standard normal distribution) and hence schools of
type A would have a higher proportion of students with marks below 45.
For the case of three individuals, the mean of all the sample means is:

µ = (x̄₁ + x̄₂ + · · · + x̄₂₀)/20 = 286.67/20 = 14.33.

For the case of four individuals, the mean of all the sample means is:

µ = (x̄₁ + x̄₂ + · · · + x̄₁₅)/15 = 215/15 = 14.33.
The standard deviations of X̄ are 1.9551, 1.3824 and 0.9776 for samples of size
2, 3 and 4, respectively. This confirms that the accuracy of our population
mean estimator improves as we increase our sample size because we increase
the amount of information about the population in the sample.
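Rather than listing every sample by hand, the whole exercise can be automated; a sketch that enumerates all samples of each size and reproduces the standard deviations quoted above:

```python
from itertools import combinations
from statistics import fmean, pstdev

times = [15, 14, 10, 12, 20, 15]   # population of six sprint times

mu = fmean(times)                  # 14.33 to 2 dp

sds = {}
for n in (2, 3, 4):
    means = [fmean(s) for s in combinations(times, n)]
    # the mean of the sample means always equals the population mean
    assert abs(fmean(means) - mu) < 1e-9
    sds[n] = pstdev(means)         # spread of X_bar about mu
```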
Appendix E
Interval estimation
x̄ ± z0.025 × s/√n = 320.41 ± 1.96 × 40.6/√50 = 320.41 ± 11.25  ⇒  (£309.16, £331.66).
Note that because n is large we have used the standard normal approximation.
It is more accurate to use a t distribution on 49 degrees of freedom. Using
Table 10 of the New Cambridge Statistical Tables, the nearest available value is
t0.025, 50 = 2.009. This gives a 95% confidence interval of:

x̄ ± t0.025, 49 × s/√n = 320.41 ± 2.009 × 40.6/√50 = 320.41 ± 11.54  ⇒  (£308.87, £331.95)

so not much of a difference.
To obtain a 95% confidence interval for the total value of the stock, 9,875µ,
multiply the interval endpoints by 9,875. This gives (to the nearest £10,000):
(£3,050,000, £3,280,000).
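The arithmetic behind the interval and the scaled-up total can be checked directly; a sketch using the normal approximation (variable names are illustrative):

```python
from math import sqrt
from statistics import NormalDist

n, xbar, s = 50, 320.41, 40.6
z = NormalDist().inv_cdf(0.975)          # 1.96 to 2 dp

half_width = z * s / sqrt(n)             # about 11.25
ci = (xbar - half_width, xbar + half_width)

# scale up to the total value of all 9,875 items of stock
ci_total = (9875 * ci[0], 9875 * ci[1])
```

Rounded to the nearest £10,000, the total-value interval matches (£3,050,000, £3,280,000).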
Solution:
In this question we are estimating a proportion with n = 100. Let π be the
proportion of the party’s supporters in the population. With p = 35/100 = 0.35
and n = 100, a 95% confidence interval for π is calculated as:
p ± z0.025 × √(p(1 − p)/n)  ⇒  0.35 ± 1.96 × √(0.35 × 0.65/100)  ⇒  (0.257, 0.443).
4. A claimant of Extra Sensory Perception (ESP) agrees to be tested for this ability.
Blindfolded, he claims to be able to identify more randomly chosen cards than
would be expected by pure guesswork.
An experiment is conducted in which 200 playing cards are drawn at random, and
with replacement, from a deck of cards, and the claimant is asked to name their
suits (hearts, diamonds, spades or clubs).
Of the 200 cards he identifies 60 correctly. Compute a 95% confidence interval for
his true probability of identifying a suit correctly. Is this evidence of ESP?
E.1. Worked examples
Solution:
We have the sample proportion p = 60/200 = 0.30 and an estimated standard error
of:

√(0.30 × 0.70/200) = 0.0324.
A 95% confidence interval for the true probability of the correct identification of a
suit is:

p ± z0.025 × √(p(1 − p)/n)  ⇒  0.30 ± 1.96 × 0.0324  ⇒  (0.236, 0.364).
As 0.25 (pure guesswork) is within this interval, then the claimant’s performance is
not convincing!
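The interval and the conclusion can be reproduced numerically; a sketch (the pure-guesswork benchmark 0.25 is one correct suit in four):

```python
from math import sqrt
from statistics import NormalDist

n, correct = 200, 60
p = correct / n                          # 0.30
se = sqrt(p * (1 - p) / n)               # about 0.0324
z = NormalDist().inv_cdf(0.975)          # 1.96 to 2 dp

ci = (p - z * se, p + z * se)
pure_guesswork_plausible = ci[0] <= 0.25 <= ci[1]
```

Since 0.25 lies inside the interval, the data are consistent with pure guesswork.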
5. 400 college students, chosen at random, are interviewed and it is found that 240 use
the refectory.
(a) Use these data to compute a:
i. 95% confidence interval
ii. 99% confidence interval
for the proportion of students who use the refectory.
(b) The college has 12,000 students. The college catering officer claims that the
refectory is used by at least 9,000 students and that the survey has yielded a
low figure due to sampling variability. Is this claim reasonable?
Solution:
ii. Similarly, a 99% confidence interval for the population proportion is:
(b) A 99% confidence interval for the total number of students using the refectory
is obtained by multiplying the interval endpoints by the total number of
students, i.e. 12,000. This gives (6,442, 7,959).
The catering officer’s claim of 9,000 is incompatible with these data as it falls
well above the 99% confidence interval upper endpoint. We conclude that the
actual number of users is well below 9,000.
6. A simple random sample of 100 workers had weekly salaries with a mean of £315
and a standard deviation of £20.
(a) Calculate a 90% confidence interval for the mean weekly salary of all workers
in the factory.
(b) How many more workers should be sampled if it is required that the estimate
is to be within £3 of the true average (again, with 90% confidence)?
Note this means a tolerance of £3 – equivalent to a confidence interval width
of £6.
Solution:
(a) We have n = 100, x̄ = 315 and s = 20. The estimated standard error is:

s/√n = 20/√100 = 2.

Hence a 90% confidence interval for the true mean is:

x̄ ± t0.05, n−1 × s/√n  ⇒  315 ± 1.645 × 2  ⇒  (311.71, 318.29)

where we use the approximation t0.05, 99 ≈ z0.05 = 1.645 since n is large.
(b) For the tolerance to be 3, we require:

3 ≥ 1.645 × 20/√n.
Solving this gives n ≥ 120.27, so we round up to get n = 121. Hence we need
to take a further sample of 21 workers.
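The sample-size calculation is a one-liner once the inequality is rearranged; a sketch:

```python
from math import ceil

s, tolerance, z = 20, 3, 1.645   # z for 90% confidence, from tables

# need z * s / sqrt(n) <= tolerance, i.e. n >= (z * s / tolerance)^2
n_required = ceil((z * s / tolerance) ** 2)
extra = n_required - 100         # workers beyond the original sample of 100
```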
Solving gives n = 507. Therefore, we must sample 457 more children and 467
more adults.
8. Two market research companies each take random samples to assess the public’s
attitude toward a new government policy. If n represents the sample size and r the
number of people against, the results of these independent surveys are as follows:
n r
Company 1 400 160
Company 2 900 324
Solution:
Here we are estimating the difference between two proportions (if the two
companies’ results are compatible, then there should be no difference between the
proportions). The formula for a 95% confidence interval of the difference is:
p1 − p2 ± z0.025 × √(p1(1 − p1)/n1 + p2(1 − p2)/n2).
The estimates of the proportions are p1 = 160/400 = 0.40 and p2 = 324/900 = 0.36,
so a 95% confidence interval is:
0.40 − 0.36 ± 1.96 × √(0.40 × 0.60/400 + 0.36 × 0.64/900)  ⇒  (−0.017, 0.097).
Since zero is in this interval it is plausible that the two proportions are the same,
so the two companies' surveys are compatible, which suggests both samples are
representative.
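The two-proportion interval can be verified directly; a sketch (the layout mirrors the formula above):

```python
from math import sqrt

n1, r1 = 400, 160
n2, r2 = 900, 324
p1, p2 = r1 / n1, r2 / n2        # 0.40 and 0.36

se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci = (p1 - p2 - 1.96 * se, p1 - p2 + 1.96 * se)
zero_plausible = ci[0] <= 0 <= ci[1]
```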
9. A sample of 954 adults in early 1987 found that 23% of them held shares.
(a) Given a UK adult population of 41 million, and assuming a proper random
sample was taken, compute a 95% confidence interval estimate for the number
of shareholders in the UK in 1987 (following liberalisation of financial markets
in the UK).
(b) A ‘similar’ survey the previous year (prior to financial liberalisation) had
found a total of 7 million shareholders. Assuming ‘similar’ means the same
sample size, find a 95% confidence interval estimate of the increase in
shareholders between the two years.
Solution:
(a) Let π be the proportion of shareholders in the population in 1987. Start by
estimating π. We are estimating a proportion and n is large, so a 95%
confidence interval for π is:
p ± z0.025 × √(p(1 − p)/n)  ⇒  0.23 ± 1.96 × √(0.23 × 0.77/954)  ⇒  (0.203, 0.257).
Therefore, a 95% confidence interval for the number (rather than the
proportion) of shareholders in the UK in 1987 is obtained by multiplying the
above interval endpoints by 41 million, resulting in:
9,430,000 ± 1,107,000 ⇒ (8,323,000, 10,537,000)
Therefore, we estimate there were about 9.43 million shareholders in the UK in
1987, with a margin of error of approximately 1.1 million.
(b) Let us start by finding a 95% confidence interval for the difference in the two
proportions. We use the formula:

p1 − p2 ± z0.025 × √(p1(1 − p1)/n1 + p2(1 − p2)/n2).
The estimates of the proportions π1 and π2 are 0.230 and 7/41 = 0.171,
respectively. We know n1 = 954, and although n2 is unknown we can assume it
is approximately equal to 954 (as the previous year’s survey was ‘similar’), so a
95% confidence interval is:
0.230 − 0.171 ± 1.96 × √(0.230 × 0.770/954 + 0.171 × 0.829/954) = 0.059 ± 0.036
giving (0.023, 0.094). Multiply the interval endpoints by 41 million and we get
a confidence interval of:
2,419,000 ± 1,476,000 ⇒ (943,000, 3,895,000).
We estimate that the number of shareholders has increased by about 2.4
million during the period covered by the surveys.
There is quite a large margin of error of approximately 1.5 million, especially
when compared with a point estimate (i.e. interval midpoint) of 2.4 million.
However, it seems financial liberalisation increased the number of shareholders.
Solution:
As both sample sizes are ‘large’ there is no need to use a pooled estimator of the
variance as we would expect s1² ≈ σ1² and s2² ≈ σ2². A 95% confidence interval for the
difference in means is:

x̄1 − x̄2 ± z0.025 × √(s1²/n1 + s2²/n2)  ⇒  559 − 503 ± 1.96 × √((21)²/60 + (29)²/45)  ⇒  (46.0, 66.0).
Zero is well below the lower endpoint, so there is evidence that the advertising
campaign has increased sales.
11. A survey is conducted on time spent recording (in hours per track) for 25 music
industry recording studios, classified as successful or unsuccessful according to their
recent chart performances. The relevant statistics resulting from this study are:
Sample size Sample mean Sample standard deviation
Successful studios 12 9.2 1.4
Unsuccessful studios 13 6.8 1.9
(a) Compute a 98% confidence interval for the difference between population mean
recording times between successful and unsuccessful studios.
(b) On the basis of this confidence interval, do you consider this to be sufficient
evidence of a true difference in mean recording times between the different
types of studios? Justify your answer.
Solution:
(a) Both samples are small (combined sample size < 30), so it will be necessary to
use a pooled variance by assuming a common value of σ² and estimating it
using the pooled variance estimator:

sp² = ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2) = (11 × (1.4)² + 12 × (1.9)²)/23 = 2.8209.

This variance is our best estimate of σ², and applies (by assumption) to both
populations. If the true population means µ1 and µ2 are estimated by x̄1 and
x̄2 (the sample means), respectively, then the central limit theorem allows us
to use the following approximations:
Var(X̄1) ≈ sp²/n1   and   Var(X̄2) ≈ sp²/n2
with point estimates of 2.8209/12 and 2.8209/13, provided that we use the t
distribution.
The point estimates are 9.2 for µ1 and 6.8 for µ2 . We estimate the difference
between the means by:
x̄1 − x̄2 = 9.2 − 6.8 = 2.4.
Since the random samples are independent, we add the variances, hence:
sp²/n1 + sp²/n2 = 2.8209/12 + 2.8209/13 = 2.8209 × (1/12 + 1/13)

on 23 degrees of freedom.
Therefore, a 98% confidence interval for the difference between the means is,
noting here t0.01, 23 = 2.500 for 98% confidence:
2.4 ± 2.500 × √(2.8209 × (1/12 + 1/13))  ⇒  2.4 ± 2.500 × 0.6724
resulting in the confidence interval (0.7190, 4.0810).
(b) Since zero is not in this interval, it follows (with 98% confidence) that the true
difference is greater than zero, so that more time (on average) is spent
recording at the successful studios.
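The pooled-variance interval can be reproduced step by step; a sketch (the t-value 2.500 is taken from the tables, as in the solution, since the standard library has no t quantile function):

```python
from math import sqrt

n1, x1, s1 = 12, 9.2, 1.4        # successful studios
n2, x2, s2 = 13, 6.8, 1.9        # unsuccessful studios

# pooled variance estimate on n1 + n2 - 2 = 23 degrees of freedom
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

t = 2.500                        # t_{0.01, 23} from tables, for 98% confidence
half_width = t * sqrt(sp2 * (1 / n1 + 1 / n2))
ci = (x1 - x2 - half_width, x1 - x2 + half_width)
```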
12. Two advertising companies each give quotations for nine different campaigns. Their
quotations (in £000s) are shown in the following table. Compute a 95% confidence
interval for the true difference between average quotations. Can you deduce from
this interval if one company is more expensive on average than the other?
Company 1 2 3 4 5 6 7 8 9
A 39 24 36 42 45 30 38 32 39
B 46 26 32 39 51 34 37 41 44
Solution:
The data are paired observations for n = 9 campaigns. First compute the
differences:

7, 2, −4, −3, 6, 4, −1, 9 and 5.

The sample mean of these differences is 2.78 and the sample standard deviation is
4.58. (Make sure you know how to do these calculations!) Therefore, a 95%
confidence interval for the true difference, µd , is:

x̄d ± t0.025, n−1 × s/√n  ⇒  2.78 ± 2.306 × 4.58/√9  ⇒  (−0.74, 6.30).
Since zero is in this confidence interval we cannot conclude that one company is
more expensive than the other on average. It is possible that the differences in
quotations are due to random variation.
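The paired-differences calculation can be checked end to end; a sketch (`stdev` is the sample standard deviation, and the t-value 2.306 is taken from the tables):

```python
from math import sqrt
from statistics import fmean, stdev

a = [39, 24, 36, 42, 45, 30, 38, 32, 39]
b = [46, 26, 32, 39, 51, 34, 37, 41, 44]
d = [y - x for x, y in zip(a, b)]    # differences B - A

n = len(d)
xbar_d = fmean(d)                    # 2.78 to 2 dp
s_d = stdev(d)                       # 4.58 to 2 dp

t = 2.306                            # t_{0.025, 8} from tables
half_width = t * s_d / sqrt(n)
ci = (xbar_d - half_width, xbar_d + half_width)
```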
E.2. Practice problems
2. National figures for a blood test result have been collected and the population
standard deviation is 1.2. You select a random sample of 100 observations and find
a sample mean of 25 units. Construct a 95% confidence interval for the population
mean.
4. Look at Table 10 of the New Cambridge Statistical Tables. Note that different
probability tails are given for ν = 1, 2 etc. Now consider a 95% confidence interval
for µ when n = 21 (i.e. for 20 degrees of freedom, as ν is the same as n − 1 in this
case). You can see the t-value is t0.025, 20 = 2.086. However, when ν → ∞, the
t-value converges to 1.96 – exactly the same as z0.025 for the standard normal
distribution. Although you can see that t-values are given for quite large degrees of
freedom, we generally assume that the standard normal distribution can be used
instead of Student’s t if the degrees of freedom are greater than 30 (some people
say 40, others say 50).
6. A random sample of 200 students is observed. 30 of them say they ‘really enjoy’
studying Statistics.
(a) Calculate the proportion of students in the sample who say they ‘really enjoy’
studying Statistics and then construct a 95% confidence interval for the
population proportion of students who ‘really enjoy’ studying Statistics.
You now take a further random sample, in another institution. This time there are
100 students, and 18 say they ‘really enjoy’ studying Statistics.
(b) Construct a 95% confidence interval for the population proportion of students
who ‘really enjoy’ studying Statistics based on this second sample.
(c) Construct a 95% confidence interval for the difference between the two
proportions.
1. False. When calculated from the same data set, the only difference is the change in
confidence level. There is a trade-off whereby higher confidence (which is good,
other things equal) leads to a larger multiplier coefficient (a z-value or t-value)
leading to a greater margin of error, and hence a wider confidence interval (which is
bad, other things equal). Therefore a 95% confidence interval would be wider, since
95% confidence is greater than 90% confidence.
2. Since σ is known, a 95% confidence interval for the population mean is:

x̄ ± z0.025 × σ/√n  ⇒  25 ± 1.96 × 1.2/√100  ⇒  (24.7648, 25.2352).
5. The sample proportions are pA = 220/300 = 0.733 and pB = 200/250 = 0.8. For a
95% confidence interval, we use the z-value of 1.96. A 95% confidence interval is:
0.8 − 0.733 ± 1.96 × √(0.733 × (1 − 0.733)/300 + 0.8 × (1 − 0.8)/250)  ⇒  (−0.0035, 0.1375).
E.3. Solutions to Practice problems
6. (a) We are given n = 200 and p = 30/200 = 0.15. So the estimated standard error
is:
E.S.E.(p) = √(0.15 × 0.85/200) = √0.0006375 = 0.0252.
A 95% confidence interval is:
(b) The estimated standard error for the second sample is:

E.S.E.(p) = √(0.18 × 0.82/100) = √0.0015 = 0.0384
which is larger than the one in the first sample owing to the relatively smaller
sample size.
This gives us a 95% confidence interval of:
or between 12.98% and 31.02% – not a very useful confidence interval! Such a
wide interval arises because the sample proportions differ greatly, meaning the
proportions of students enjoying Statistics in these two different institutions
seem to be very different.
Appendix F
Hypothesis testing principles
Solution:
In hypothesis testing, the binary decision can result in one of two possible
outcomes based on the analysis of sample data:
• Reject the null hypothesis: This decision indicates that there is enough
evidence in the sample data to conclude that the null hypothesis is not true. It
suggests that there is an effect, a difference, or a change in the population
parameter, as suggested by the alternative hypothesis.
• Fail to reject the null hypothesis: This decision implies that there is insufficient
evidence in the sample data to reject the null hypothesis. It does not
necessarily mean that the null hypothesis is proven true; rather, it suggests
that the available evidence is not strong enough to support a rejection.
The goal of hypothesis testing is to make a decision about the null hypothesis
based on the observed sample data while controlling the risk of making a Type I
error (incorrectly rejecting a true null hypothesis).
Solution:
When testing H0 : µ = µ0 , the alternative hypothesis for:
(a) a two-tailed test is H1 : µ ≠ µ0
(b) an upper-tailed test is H1 : µ > µ0
(c) a lower-tailed test is H1 : µ < µ0 .
3. (a) Briefly define Type I and Type II errors in the context of the statistical test of
a hypothesis.
(b) What is the general effect on the probabilities of each type of these errors
happening if the sample size is increased?
Solution:
(a) A Type I error is rejecting H0 when it is true. A Type II error is failing to
reject a false H0 . We can express these as conditional probabilities as follows:

α = P (Type I error) = P (Reject H0 | H0 is true)

and:

β = P (Type II error) = P (Not reject H0 | H1 is true).
(b) Increasing the sample size decreases the probabilities of making both types of
error because there is greater precision in the estimation of parameters.
4. (a) Explain what is meant by the statement: ‘The test is significant at the 5%
significance level’.
(b) How should you interpret a test which is significant at the 10% significance
level, but not at the 5% significance level?
Solution:
(a) ‘The test is significant at the 5% significance level' means there is a less than
5% chance of getting data as extreme as actually observed if the null
hypothesis was true. This implies that the data are inconsistent with the null
hypothesis, which we reject, i.e. the test is ‘moderately significant’ with
‘moderate evidence’ to justify rejecting H0 .
(b) A test which is significant at the 10% significance level, but not at the 5%
significance level is often interpreted as meaning there is some doubt about the
null hypothesis, but not enough to reject it with confidence, i.e. the test is
‘weakly significant’ with ‘weak evidence’ to justify rejecting H0 .
5. You are interested in researching the average daily sunlight exposure in a certain
region. Sunlight exposure in the region is modelled as a normal distribution with a
mean of µ and a known variance of σ 2 = 36. A random sample of n = 30 days is
taken, yielding a sample mean of 8 hours of sunlight per day, i.e. x̄ = 8 hours.
Independently of the data, three experts provide their opinions about the average
daily sunlight exposure in the region as follows:
• Meteorologist A claims the population mean sunlight exposure is µ = 7.5
hours.
• Climate Scientist B claims the population mean sunlight exposure is µ = 7.2
hours.
• Environmental Scientist C claims the population mean sunlight exposure is
µ = 6.8 hours.
Based on the data evidence, which expert’s claim do you find the most convincing?
Solution:
The sampling distribution of the sample mean is:
X̄ ∼ N (µ, σ²/n) = N (µ, 36/30) = N (µ, 1.2).
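One reasonable way to compare the three claims is to standardise the observed sample mean against each claimed value of µ and prefer the claim with the smallest absolute z-value; a sketch under that assumption:

```python
from math import sqrt
from statistics import NormalDist

xbar = 8
se = sqrt(36 / 30)               # standard error of X_bar, sqrt(1.2)

claims = {"A": 7.5, "B": 7.2, "C": 6.8}

# standardised distance of the observed mean from each claimed mu;
# the smallest |z| indicates the claim most consistent with the data
z = {expert: (xbar - mu) / se for expert, mu in claims.items()}
most_convincing = min(z, key=lambda e: abs(z[e]))
```

Under this criterion Meteorologist A's claim is the most convincing, since x̄ = 8 lies closest to µ = 7.5 in standard error units.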
                      Decision made
                      H0 not rejected    H0 rejected
True state   H0 true
of nature    H1 true

                      Decision made
                      H0 not rejected    H0 rejected
True state   H0 true
of nature    H1 true
F.3. Solutions to Practice problems
                      Decision made
                      H0 not rejected          H0 rejected
True state   H0 true  Correct decision         Type I error
of nature    H1 true  Type II error            Correct decision

                      Decision made
                      H0 not rejected          H0 rejected
True state   H0 true  1 − α                    P (Type I error) = α
of nature    H1 true  P (Type II error) = β    Power = 1 − β
4. (a) A test which is significant at the 1% significance level is highly significant. This
indicates that it is very unlikely to have obtained the sample we did if the null
hypothesis was actually true. Therefore, we would be very confident in
rejecting the null hypothesis with strong evidence to justify doing so.
(b) A test which is significant at the 10% significance level, but not at the 5%
significance level, is weakly significant. This indicates mild (at best) support
for rejecting the null hypothesis, with only weak evidence. The test outcome is
less ‘conclusive’ than in part (a).
5. We have the significance level decision tree:

[Figure: significance level decision tree — starting at the 5% significance level; on the 'Not reject H0' branch, choose the 10% level.]
Appendix G
Hypothesis testing of means and
proportions
(a) Test whether the mean voltage of the whole batch is 12 volts using two
appropriate significance levels.
(b) Test whether the mean batch voltage is less than 12 volts using two
appropriate significance levels.
(c) Which test do you think is the more appropriate? Briefly explain why.
Solution:
(a) We are to test H0 : µ = 12 vs. H1 : µ ≠ 12. The key points here are that n is
small and that σ² is unknown. We can use a t test, and this is valid provided
the data are normally distributed. The test statistic value is:

t = (x̄ − 12)/(s/√n) = (12.7 − 12)/(0.858/√7) = 2.16.
• If you suspected before collecting the data that the mean voltage was less
than 12 volts, the test in part (b) would be appropriate.
• If you had no prior reason to believe that the mean was less than 12 volts,
you would use the test in part (a).
• General rule: decide on whether it is a one-tailed or two-tailed test before
collecting the data!
2. A salesperson claims that they can sell at least 100 items per day. Over a period of
11 days their resulting sales were as shown below. Do the data support their claim?
94, 105, 100, 90, 96, 101, 92, 97, 93, 100 and 91.
Solution:
Here we look for evidence to refute the claim. If µ is the mean number of sales per
day, then we wish to test H0 : µ = 100 vs. H1 : µ < 100.
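The calculation can be sketched in Python using the sales figures above (the lower-tail 5% critical value −1.812 for the t distribution with 10 degrees of freedom is taken from tables):

```python
import math
import statistics

sales = [94, 105, 100, 90, 96, 101, 92, 97, 93, 100, 91]
n = len(sales)

xbar = statistics.mean(sales)    # about 96.27
s = statistics.stdev(sales)      # sample standard deviation

# One-sample t test of H0: mu = 100 against H1: mu < 100.
t = (xbar - 100) / (s / math.sqrt(n))   # about -2.59

reject_at_5pct = t < -1.812   # lower-tail 5% critical value, t with 10 df
```

Since t ≈ −2.59 < −1.812, the data cast significant doubt on the salesperson's claim at the 5% level.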
G.1. Worked examples
At the 1% significance level, we compare this to the critical value of z0.01 = 2.326
(from the bottom row of Table 10) and we see that it is significant at the 1%
significance level since 3.33 > 2.326. The test is highly significant and there is
strong evidence that more than 25% of households use ‘Snolite’. It appears that the
advertising campaign has indeed been successful.
0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1,
1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1.
(a) State the null and alternative hypotheses testing whether the true proportion
of toothpaste brands with the ADA seal verifying effective decay prevention is
less than 0.50.
(b) Assuming that the sample was random, from a much larger population,
conduct a test between the two hypotheses.
Solution:
(a) Appropriate hypotheses are H0 : π = 0.50 vs. H1 : π < 0.50.
(b) The data show that there are r = 21 ADA seals among n = 46 items in the
sample. Let π be the true proportion. If H0 is true then the sample proportion is:

P ∼ N(π, π(1 − π)/n) = N(0.50, (0.50 × 0.50)/46) = N(0.50, 0.005435)

and the test statistic value is:

z = (21/46 − 0.50)/√0.005435 = −0.59.
This is clearly not going to be significant at any plausible significance level
given that the p-value is P(Z ≤ −0.59) = 0.2776.
Therefore, we cannot reject the null hypothesis. The test is not statistically
significant, and there is insufficient evidence that less than half the brands
have the ADA seal.
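The z value and p-value can be reproduced computationally, writing the standard normal CDF via the error function (a sketch, not part of the original solution):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

r, n, pi0 = 21, 46, 0.50
p_hat = r / n

# Standard error computed under H0, matching the worked example.
se = math.sqrt(pi0 * (1 - pi0) / n)
z = (p_hat - pi0) / se          # about -0.59

p_value = norm_cdf(z)           # lower-tailed p-value, about 0.28
```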
Solution:
We suspect the newspaper is being ‘sensationalist’ such that we seek evidence that
the true percentage of workers who fear losing their job is actually less than 30%.
The null and alternative hypotheses are:

H0 : π = 0.30 vs. H1 : π < 0.30.

Under H0, the test statistic is:

Z = (P − π0)/√(π0(1 − π0)/n) = (P − 0.30)/√((0.30 × 0.70)/n) ∼ N(0, 1).
The sample proportion is p = 78/300 = 0.26 and the test statistic value is −1.51.
For α = 0.05, the lower-tailed test critical value is z0.95 = −1.645. Since
−1.51 > −1.645, we do not reject H0 at the 5% significance level. Moving to
α = 0.10, the critical value is now z0.90 = −1.282. Since −1.51 < −1.282, we reject
H0 at the 10% significance level. The test is weakly significant and we conclude that
there is weak evidence that fewer than 30% of workers were worried about losing
their jobs, suggesting there is only a marginal reason to doubt the newspaper
report’s claim.
Solution:
Compare this to the critical value of z0.90 = −1.282 (from the bottom row of
Table 10) and you will see that it is not significant at even the 10%
significance level since −0.20 > −1.282. Hence the test is not statistically
significant, i.e. there is insufficient evidence that fewer than 40% of households
in Los Angeles County were protected by earthquake insurance.
using Table 4. Hence we would not reject H0 for any α < 0.4207, remembering
that the p-value is the smallest significance level such that we reject the null
hypothesis.
P (Z ≤ −0.976) = 0.1645.
8. A market research company has conducted a survey of adults in two large towns,
either side of an international border, in order to judge attitudes towards a
controversial internationally-broadcast celebrity television programme.
The following table shows some of the information obtained by the survey:
Town A Town Z
Sample size 80 80
Sample number approving of the programme 44 41
(a) Conduct a formal hypothesis test, at two appropriate significance levels, of the
claim that the population proportions approving of the programme in the two
towns are equal.
(b) Would your conclusion be the same if, in both towns, the sample sizes had
been 1,500 (with the same sample proportions of approvals and the same
pooled sample proportion)?
Solution:
(a) We test:
H0 : πA = πZ vs. H1 : πA ≠ πZ.
Note there is no a priori reason to suppose which town would have a greater
proportion approving of the programme in the event that the proportions are
not equal, hence we conduct a two-tailed test.
The sample proportions are pA = 44/80 = 0.55 and pZ = 41/80 = 0.5125, and
the estimate of the common proportion under H0 is:
p = (44 + 41)/(80 + 80) = 0.53125.

We use the test statistic:

Z = (PA − PZ)/√(P(1 − P)(1/nA + 1/nZ)) ∼ N(0, 1)

approximately under H0. The test statistic value is:

(0.55 − 0.5125)/√(0.53125 × (1 − 0.53125) × (1/80 + 1/80)) = 0.48.
At the 5% significance level, the critical values are ±1.96. Since 0.48 < 1.96 we
do not reject H0 . Following the significance level decision tree, we now test at
the 10% significance level, for which the critical values are ±1.645. Since
0.48 < 1.645, again we do not reject H0 . Hence the test is not statistically
significant, i.e. there is insufficient evidence of a difference in the population
proportions approving of the programme.
For reference, the p-value for this two-tailed test is 2 × P(Z > 0.48) = 0.6312.

(b) With sample sizes of 1,500, the same sample proportions and pooled sample
proportion give the test statistic value:

(0.55 − 0.5125)/√(0.53125 × (1 − 0.53125) × (1/1500 + 1/1500)) = 2.06.

Since 1.96 < 2.06, we reject H0 at the 5% significance level. Now testing at the
1% significance level we have critical values of ±2.576. Since 2.06 < 2.576, we
do not reject H0 and conclude that the test is moderately significant indicating
moderate evidence of a difference in the population proportions approving of
the programme.
The larger sample sizes have increased the power of the test, i.e. they have
increased our ability to reject H0 for the same observed difference in the sample
proportions. Since 0.01 < 0.0394 < 0.05, the test is moderately significant, as before.
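Both sample sizes can be handled by one small helper function (a sketch; the function name and structure are illustrative, not from the text):

```python
import math

def two_prop_z(p1, n1, p2, n2):
    """Pooled two-sample z test statistic for H0: pi1 = pi2."""
    p_pool = (n1 * p1 + n2 * p2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Town A vs. Town Z sample proportions from the worked example.
z_small = two_prop_z(44 / 80, 80, 41 / 80, 80)     # about 0.48
z_large = two_prop_z(0.55, 1500, 0.5125, 1500)     # about 2.06
```

The same difference in sample proportions moves from insignificant to moderately significant purely because of the larger sample sizes.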
where p is the pooled sample proportion, i.e. p = (50 + 16)/(250 + 320) = 0.116.
This test statistic value is, at 5.55, obviously very extreme and hence is highly
significant (since z0.01 = 2.326 < 5.55), so there is strong evidence that
anticorrosives reduce lead levels.
10. Two companies supplying a television repair service are compared by their repair
times (in days). Random samples of recent repair times for these companies gave
the following statistics:
Solution:
(a) H0 : µA = µB vs. H1 : µA ≠ µB, where we use a two-tailed test since there is no
indication of which company would have a faster mean repair time. The test
statistic value is:

z = (11.9 − 10.8)/√(7.3/44 + 6.3/52) = 2.06.
For a two-tailed test, this is significant at the 5% significance level (since
z0.025 = 1.96 < 2.06), but it is not at the 1% significance level (since
2.06 < 2.326 = z0.01 ). We reject H0 and conclude that the test is moderately
significant with moderate evidence that the companies differ in terms of their
mean repair times.
(b) The p-value for this two-tailed test is 2 × P (Z > 2.06) = 0.0394.
(c) For small samples, we should use a pooled estimate of the population standard
deviation:

s = √(((9 − 1) × 7.3 + (17 − 1) × 6.2)/((9 − 1) + (17 − 1))) = 2.5626

on 24 degrees of freedom. Hence the test statistic value in this case is:

t = (11.9 − 10.8)/(2.5626 × √(1/9 + 1/17)) = 1.04.
This should be compared with critical values from the t24 distribution (5%:
1.711 and 10%: 1.318) and is clearly not significant, even at the 10%
significance level. With the smaller samples we fail to detect the difference.
Comparing the two test statistic calculations shows that the different results
flow from differences in the estimated standard errors, hence ultimately (and
unsurprisingly) from the differences in the sample sizes used in the two
situations.
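The pooled calculation in part (c) can be reproduced directly from the summary statistics (a sketch using the figures quoted above):

```python
import math

# Summary statistics from part (c): sample variances 7.3 and 6.2,
# sample sizes 9 and 17, sample means 11.9 and 10.8.
n1, var1, mean1 = 9, 7.3, 11.9
n2, var2, mean2 = 17, 6.2, 10.8

# Pooled estimate of the common variance, on n1 + n2 - 2 = 24 df.
sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
sp = math.sqrt(sp2)                                       # about 2.5626

t = (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))   # about 1.04
```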
11. A study was conducted to determine the number of hours spent on Facebook by
university and high school students. For this reason, a questionnaire was
administered to a random sample of 14 university and 11 high school students and
the hours per day spent on Facebook were recorded.
G.2. Practice problems
Solution:
(a) Let µ1 denote the mean hours per day spent on Facebook for university
students and let µ2 denote the mean hours per day spent on Facebook for high
school students. We test:
H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2.

Since the sample sizes are small, we assume σ1² = σ2² and use the test statistic:

(X̄1 − X̄2)/√(Sp²(1/n1 + 1/n2)) ∼ tn1+n2−2
3. You have been asked to compare the percentages of people in two groups with
n1 = 16 and n2 = 24 who are in favour of a new plan. You decide to make a pooled
estimate of the proportion and make a test. Which test would you use?
4. Two different methods of determination of the percentage fat content in meat are
available. Both methods are used on portions of the same meat sample. Is there
any evidence to suggest that one method gives a higher reading than the other?
G.3. Solutions to practice problems
p-value of the test is reduced when σ 2 is given. Therefore, the risk of rejecting H0 is
also reduced.
2. Here the manufacturer is claiming that 90% of the population will be relieved over
8 hours, i.e. π = 0.90. There is a random sample of n = 200 and 160 gained relief,
giving a sample proportion of p = 160/200 = 0.80.
Given the manufacturer’s claim, we would be concerned if we found significant
evidence that fewer than 90% were relieved (more than 90% would not be a
problem!). So we have a one-tailed test with:
H0 : π = 0.90 vs. H1 : π < 0.90.
We use the population value π to work out the standard error, and so the test
statistic value is:
z = (0.80 − 0.90)/√((0.90 × 0.10)/200) = −0.10/0.0212 = −4.717.
This goes beyond Table 4 of the New Cambridge Statistical Tables, so this result is
highly significant, since the p-value is nearly zero. Alternatively, you could look at
the bottom row of Table 10 which provides the 5% significance level critical value
of −1.645 for a lower-tailed test and the 1% significance level critical value of
−2.326. This confirms that the result is highly significant. So we reject H0 and the
manufacturer's claim is not legitimate. Based on the observed sample, we have
(very) strong evidence that the proportion of the population given relief is
significantly less than the 90% claimed.
3. We would test:
H0 : π1 = π2 vs. H1 : π1 ≠ π2.
Under H0 , the test statistic would be:
(P1 − P2)/√(P(1 − P)(1/16 + 1/24)) ∼ N(0, 1)

where P is the pooled proportion estimator:

P = (n1P1 + n2P2)/(n1 + n2) = (16P1 + 24P2)/40.
Since the test statistic follows (approximately) the standard normal distribution,
there are no degrees of freedom to consider. For a two-tailed test, the 5% critical
values would be ±1.96.
4. Since the same meat is being put through two different tests, we are clearly dealing
with paired samples. We wish to test:
H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2
but for paired samples we do our calculations using the differenced data, so we
might reformulate this presentation of the hypotheses in the form:
H0 : µ1 − µ2 = 0 vs. H1 : µ1 − µ2 ≠ 0.
If we list the differences from the original data table then we get:
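The original differenced-data table is not reproduced here, but the paired procedure itself can be sketched with hypothetical differences (method 1 reading minus method 2 reading — these values are made up for illustration, not the original data):

```python
import math
import statistics

# Hypothetical fat-content differences, one per meat portion (NOT the
# original data from the question).
d = [0.3, -0.1, 0.2, 0.4, 0.0, 0.2]
n = len(d)

d_bar = statistics.mean(d)
s_d = statistics.stdev(d)

# Paired t statistic on n - 1 = 5 degrees of freedom.
t = d_bar / (s_d / math.sqrt(n))

# Two-tailed 5% critical value for t with 5 df is 2.571 (from tables),
# so these illustrative differences would not be significant at 5%.
significant_at_5pct = abs(t) > 2.571
```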
Appendix H
Contingency tables and the
chi-squared test
2. An analyst of the retail trade uses as analytical tools the concepts of ‘Footfall’ (the
daily number of customers per unit sales area of a shop) and ‘Ticket price’ (the
average sale price of an item in the shop’s offer). Shops are classified as offering low,
medium or high price items and, during any sales period, as having low, medium or
high footfall. During the January sales, the analyst studies a sample of shops and
obtains the following frequency data for the nine possible combined classifications:
Low price Medium price High price
Low footfall 22 43 16
Medium footfall 37 126 25
High footfall 45 75 23
Conduct a suitable test for association between Ticket classification and Footfall
level, and report on your findings.
Solution:
We test:
H0 : No association between ticket price and footfall level
vs.
H1 : Association between ticket price and footfall level.
We next compute the expected values for each cell using:
(row i total × column j total) / (total number of observations)
which gives:
                                  Ticket price
Footfall                     Low      Medium    High     Total
         O1·                 22       43        16       81
Low      E1·                 20.45    47.97     12.58    81
         (O1· − E1·)²/E1·    0.12     0.52      0.93
         O2·                 37       126       25       188
Medium   E2·                 47.46    111.34    29.20    188
         (O2· − E2·)²/E2·    2.30     1.93      0.61
         O3·                 45       75        23       143
High     E3·                 36.10    84.69     22.21    143
         (O3· − E3·)²/E3·    2.20     1.11      0.03
Total                        104      244       64       412

The χ² test statistic value is Σi,j (Oij − Eij)²/Eij = 9.75 on
(3 − 1)(3 − 1) = 4 degrees of freedom. At the 5% significance level the
critical value is 9.488, so we reject H0; at the 1% significance level the
critical value is 13.277, so we do not reject H0. The test is moderately
significant, with moderate evidence of an association between ticket price
and footfall level.
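The expected-frequency formula and the χ² sum can be reproduced programmatically. (Summing the two-decimal cell values in the table gives 9.75; the unrounded statistic is about 9.73.) A sketch:

```python
observed = [
    [22, 43, 16],    # low footfall
    [37, 126, 25],   # medium footfall
    [45, 75, 23],    # high footfall
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# E_ij = (row i total x column j total) / grand total
expected = [[r * c / grand for c in col_totals] for r in row_totals]

chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(3)
    for j in range(3)
)
```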
3. The table below shows the relationship between gender and party identification in
a US state.
Party identification
Gender Democrat Independent Republican Total
Male 108 72 198 378
Female 216 59 142 417
Total 324 131 340 795
H.1. Worked examples
Test for an association between gender and party identification at two appropriate
significance levels and comment on your results.
Solution:
We test:
H0 : There is no association between gender and party identification
vs.
H1 : There is an association between gender and party identification.
The χ² test statistic value is Σi,j (Oij − Eij)²/Eij = 44.71. The number of degrees
of freedom is (2 − 1)(3 − 1) = 2, for which the 1% critical value is 9.210. Since
44.71 > 9.210, the test is highly significant, with strong evidence of an
association between gender and party identification.
4. The following table shows the numbers of car accidents in an urban area over a
period of time. These are classified by severity and by type of vehicle. Carry out a
test for association on these data and draw conclusions.
Severity of accident
Minor Medium Major
Saloon 29 39 16
Van 15 24 12
Sports car 7 20 12
Solution:
We test:
H0 : There is no association between type of vehicle and severity of accident
vs.
H1 : There is an association between type of vehicle and severity of accident.
5. In a random sample of 200 children in a school it was found that 171 had been
inoculated against the common cold before the winter. The table below shows the
numbers observed to have suffered from colds over the winter season.
Test for evidence of an association between colds and inoculation, and draw
conclusions.
Solution:
We test:
H0 : There is no association between being inoculated and cold prevention.
vs.
H1 : There is an association between being inoculated and cold prevention.
We construct the following contingency table:
                           Inoculated   Not inoculated   Total
         O1j               40           5                45
No cold  E1j               38.475       6.525            45
         (O1j − E1j)²/E1j  0.060        0.356
         O2j               131          24               155
Cold     E2j               132.525      22.475           155
         (O2j − E2j)²/E2j  0.018        0.103
Total                      171          29               200

The χ² test statistic value is 0.060 + 0.356 + 0.018 + 0.103 = 0.537 on
(2 − 1)(2 − 1) = 1 degree of freedom. Since 0.537 < 3.841, the 5% critical
value, the test is not statistically significant and there is insufficient
evidence of an association between inoculation and catching a cold.
Examining bodies
A B C D
Pass 233 737 358 176
Refer 16 68 29 20
Fail 73 167 136 64
Solution:
We test:
H0 : There is no association between examining body and examination result
vs.
H1 : There is an association between examining body and examination result.
The χ² test statistic value is Σi,j (Oij − Eij)²/Eij = 21.21. The number of degrees
of freedom is (3 − 1)(4 − 1) = 6. Since 21.21 > 16.81, the 1% critical value, the
test is highly significant, with strong evidence of an association between
examining body and examination result.
7. Many people believe that when a horse races, it has a better chance of winning if
its starting line-up position is closer to the rail on the inside of the track. The
starting position of 1 is closest to the inside rail, followed by position 2, and so on.
The table below lists the numbers of wins of horses in the different starting
positions. Do the data support the claim that the probabilities of winning in the
different starting positions are not all the same?
Starting position 1 2 3 4 5 6 7 8
Number of wins 29 19 18 25 17 10 15 11
Solution:
We test whether the data follow a discrete uniform distribution of 8 categories. Let
pi = P (X = i), for i = 1, 2, . . . , 8.
We test the null hypothesis H0 : pi = 1/8, for i = 1, 2, . . . , 8.
Note n = 29 + 19 + 18 + 25 + 17 + 10 + 15 + 11 = 144. The expected frequencies are
Ei = 144/8 = 18, for all i = 1, 2, . . . , 8.
Starting position 1 2 3 4 5 6 7 8 Total
Oi 29 19 18 25 17 10 15 11 144
Ei 18 18 18 18 18 18 18 18 144
Oi − Ei 11 1 0 7 −1 −8 −3 −7 0
(Oi − Ei )2 /Ei 6.72 0.06 0 2.72 0.06 3.56 0.50 2.72 16.34
Under H0, Σi (Oi − Ei)²/Ei ∼ χ²7. At the 5% significance level, the critical value is
14.067. Since 16.34 > 14.067, we reject the null hypothesis. Turning to the 1%
significance level, the critical value is 18.475. Since 16.34 < 18.475, we cannot reject
the null hypothesis, hence we conclude that the test is moderately significant, with
moderate evidence to support the claim that the chances of winning in the different
starting positions are not all the same.
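The goodness-of-fit statistic is quick to verify in code. (The 16.34 in the table comes from summing cell values rounded to two decimals; the exact value is 294/18 ≈ 16.33.) A sketch:

```python
wins = [29, 19, 18, 25, 17, 10, 15, 11]
n = sum(wins)                 # 144 races in total
expected = n / len(wins)      # 18 wins per position under H0

# Chi-squared goodness-of-fit statistic against a uniform distribution.
chi2 = sum((o - expected) ** 2 / expected for o in wins)

reject_at_5pct = chi2 > 14.067   # chi-squared(7) critical value at 5%
```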
H.2. Practice problems
2. You have carried out a χ2 test on a 3 × 4 contingency table to study whether there
is evidence of an association between advertising type and level of sales of a
product. You have four types of advertising (A, B, C and D) and three levels of
sales (low, medium and high).
Your calculated χ2 value is 13.50. Giving degrees of freedom and an appropriate
significance level, set out your hypotheses. What would you say about the result?
3. In a survey conducted to decide where to locate a factory, samples from five towns
were examined to see the numbers of skilled and unskilled workers. The data were
as follows.
Number of Number of
Area skilled workers unskilled workers
A 80 184
B 58 147
C 114 276
D 55 196
E 83 229
(a) Does the population proportion of skilled workers vary with the area?
(b) Test:
H0 : πD = πothers vs. H1 : πD < πothers.
Think about your results and what you would explain to your management
team who had seen the chi-squared results and want, for other reasons, to site
their factory at area D.
4. The table below shows a contingency table for a sample of 1,104 randomly-selected
adults from three types of environment (City, Town and Rural) who have been
classified into two groups by level of exercise (high and low). Test the hypothesis
that there is no association between level of exercise and type of environment and
draw conclusions.
Level of exercise
Environment High Low
City 221 256
Town 230 118
Rural 159 120
5. You have been given the number of births in a country for each of the four seasons
of the year and are asked whether births vary over the year. What would you need
to know in order to carry out a chi-squared test of the hypothesis that births are
spread uniformly between the four seasons? Outline the steps in your work.
2. We test:
H0 : No association between advertising type and level of sales
vs.
H1 : There is an association between advertising type and level of sales.
The degrees of freedom are (r − 1)(c − 1) = (3 − 1)(4 − 1) = 6. Using a 5%
significance level, the critical value is 12.59. Since 12.59 < 13.50, we reject the null
hypothesis. Moving to the 1% significance level, the critical value is 16.81, hence we
do not reject the null hypothesis and conclude that the test is moderately significant
with moderate evidence of an association between advertising type and level of sales.
3. (a) We test:
H0 : There is no association between proportion of skilled workers and
area.
H1 : There is an association between proportion of skilled workers and
area.
We get a contingency table which should look like:
H.3. Solutions to Practice problems
The χ² test statistic value is Σi,j (Oij − Eij)²/Eij = 5.88. The number of
degrees of freedom is (5 − 1)(2 − 1) = 4, so we compare the test statistic value
with the χ²4 distribution, using Table 8.
The test is not significant at the 5% significance level (the critical value is
9.488). It is also not significant at the 10% significance level (the critical value
is 7.779) so the test is not statistically significant and we have found
insufficient evidence of an association. That is, the proportion of skilled
workers does not appear to vary with area.
(b) We have an upper-tailed z test:
The 5% critical value is 1.645 and the 1% critical value is 2.326. So our
calculated test statistic value of 2.1613 lies between the two, meaning we reject
the null hypothesis at the 5% significance level only. Hence the test is
moderately significant.
We can tell the management team that there is moderate evidence of a lower
proportion of skilled workers in area D compared with the others. However, it
is not significant at the 1% significance level and so it would be sensible to
check whether there are other factors of interest to them before deciding
against the area.
4. We test:
H0 : No association between level of exercise and type of environment
H1 : There is an association between level of exercise and type of environment.
Determining row and column totals, we obtain the expected frequencies as shown
in the following table, along with the calculated χ2 values for each cell.
Level of exercise
Environment High Low Total
Observed 221 256 477
City Expected 263.56 213.44
(O1· − E1· )2 /E1· 6.87 8.49
Observed 230 118 348
Town Expected 192.28 155.72
(O2· − E2· )2 /E2· 7.40 9.14
Observed 159 120 279
Rural Expected 154.16 124.84
(O3· − E3· )2 /E3· 0.15 0.19
Total 610 494 1,104
The test statistic value is:
6.87 + 8.49 + 7.40 + 9.14 + 0.15 + 0.19 = 32.24.
There are (r − 1)(c − 1) = (3 − 1)(2 − 1) = 2 degrees of freedom, and the 5%
critical value is 5.991 using Table 8. Since 5.991 < 32.24, we reject H0 . Testing at
the 1% significance level, the critical value is 9.210. Again, we reject H0 and
conclude that the test is highly significant. There is strong evidence of an
association between level of exercise and type of environment. Looking at the large
contributors to the test statistic value, we see that those living in cities tend not to
exercise much, while those in towns tend to exercise more.
5. We are told that we are supplied with the observed frequencies, i.e. the number of
births in a country for each of the four seasons. To test whether births are evenly
distributed between the four seasons, we would need to calculate the expected
number of births for each season. We would need to know the exact number of days
in each season to be able to calculate the expected frequencies, and hence perform
a goodness-of-fit test of an even distribution. Remember the seasons are of slightly
different lengths reflecting the variable number of days in each month. We would
then test:
H0 : There is a uniform distribution of live births throughout the year.
H1 : There is a non-uniform distribution of live births throughout the year.
Denoting the observed and expected frequencies for season i as Oi and Ei ,
respectively, the test statistic is:
Σi=1,...,4 (Oi − Ei)²/Ei ∼ χ²3.
At a 5% significance level, the critical value would be 7.815 and so we would reject
the null hypothesis if the test statistic value exceeded 7.815. We would then test at
a suitable second significance level to determine the strength of the test result.
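The steps above can be sketched in code with hypothetical birth counts (the counts below are invented for illustration; the season lengths use a non-leap year):

```python
# Hypothetical season birth counts (NOT real data), ordered
# winter, spring, summer, autumn.
births = [24000, 26000, 25500, 24500]
days = [90, 92, 92, 91]   # 31+31+28, 31+30+31, 30+31+31, 30+31+30

n = sum(births)
# Under H0, expected births are proportional to the number of days,
# which is why the exact season lengths are needed.
expected = [n * d / sum(days) for d in days]

chi2 = sum((o - e) ** 2 / e for o, e in zip(births, expected))

# Compare with the chi-squared(3) critical value 7.815 at the 5% level.
reject_at_5pct = chi2 > 7.815
```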
Appendix I
Sampling and experimental design
3. The simplest random sampling technique is simple random sampling. Give two
reasons why it may be desirable to use a sampling design which is more
sophisticated than simple random sampling.
Solution:
Possible reasons include the following.
• We can estimate parameters with greater precision (stratified sampling).
• It may be more cost effective (cluster sampling).
4. A corporation wants to estimate the total number of worker-hours lost for a given
month because of accidents among its employees. Each employee is classified into
one of three categories – labourer, technician and administrator. Which sampling
method do you think would be preferable here? Justify your choice.
Solution:
We can expect the three groups to be internally homogeneous with respect to
accident rates, while differing from each other – for example, labourers probably
have the highest rates. Therefore, stratified
sampling is preferable.
5. Describe how stratified sampling is performed and explain how it differs from quota
sampling.
Solution:
In stratified sampling, the population is divided into strata, natural groupings
within the population, and a simple random sample is taken from each stratum.
Stratified sampling differs from quota sampling in the following ways.
• Stratified sampling is random sampling, whereas quota sampling is
non-random sampling.
• In stratified sampling a sampling frame is required, whereas in quota sampling
pre-chosen frequencies in each category are sought.
Solution:
Similarity: In both one-stage and two-stage cluster sampling, only some clusters are
sampled (at random).
Difference: In one-stage every unit within each selected cluster is sampled, while in
two-stage a random sample is taken from each selected cluster.
7. In the context of sampling, explain the difference between item non-response and
unit non-response.
Solution:
Item non-response occurs when a sampled member fails to respond to a specific
question in a survey. Unit non-response occurs when no information is collected
from a sample member at all.
I.1. Worked examples
Solution:
(a) There may be people whom the data collection procedure does not reach, who
refuse to respond, or who are unable to respond. As an example, in telephone
surveys, people may not be at home or may not have time to do the survey. In
a mail survey, the survey form may not reach the addressee.
(b) The non-respondents may be different from respondents, leading to biased
inferences. Additionally, the sample size is effectively reduced, leading to a loss
of precision in estimation.
(c) Numerous calls, concentrating on evenings and weekends. Interviewers with
flexible schedules who can make appointments with interviewees at a time
convenient for them. Cooperation can be enlisted by sending an advance letter,
ensuring respondents feel their participation is important, and having
well-trained interviewers. A clear layout of the survey form, easy response
tasks, reminders and telephone follow-up calls also help.
9. A research group has designed a survey and finds the costs are greater than the
available budget. Two possible methods of saving money are a sample size
reduction or spending less on interviewers (for example, by providing less
interviewer training or taking on less-experienced interviewers). Discuss the
advantages and disadvantages of these two approaches.
Solution:
A smaller sample size leads to less precision in terms of estimation. Reducing
interviewer training leads to response bias. We can statistically assess the reduction
in precision, but the response bias is much harder to quantify, making it hard to
draw conclusions from the survey. Therefore, the group should probably go for a
smaller sample size, unless this would give results that are too imprecise for their
purposes.
10. Readers of the magazine Popular Science were asked to phone in (on a premium
rate number) their responses to the following question: ‘Should the United States
build more fossil fuel-generating plants or the new so-called safe nuclear generators
to meet the energy crisis?’. Of the total call-ins, 86% chose the nuclear option.
Discuss the way the poll was conducted, the question wording, and whether or not
you think the results are a good estimate of the prevailing mood in the country.
Solution:
Both selection bias and response bias may have occurred. Due to the phone-in
method, respondents may well be very different from the general population
(selection bias). Response bias is possible due to the question wording – for
example ‘so-called safe nuclear generators’ calls safety into question. It is probably
hard to say much about the prevailing mood in the country based on the survey.
11. A company producing handheld electronic devices (tablets, mobile phones etc.)
wants to understand how people of different ages rate its products. For this reason,
the company’s management has decided to use a survey of its customers and has
asked you to devise an appropriate random sampling scheme. Outline the key
components of your sampling scheme.
Solution:
Note that there is no single ‘right’ answer. A possible set of ‘ingredients’ of a good
answer is given below.
• Propose stratified sampling since customers of all ages are to be surveyed.
• Sampling frame could be the company’s customer database.
• Take a simple random sample from each stratum.
• An obvious stratification factor is age group, another could be gender.
• Contact method: mail, telephone or email (likely to have all details on the
database).
• Minimise non-response through a suitable incentive, such as a discount off the
next purchase.
12. Explain the difference between an experimental design and an observational study.
Solution:
An experimental design involves manipulating an independent variable to observe
its effects on a dependent variable, establishing cause-and-effect relationships. In
contrast, an observational study observes and analyses existing conditions,
variables, or behaviours without intervention. While experimental designs allow for
greater control, observational studies are valuable for exploring real-world
phenomena, but causation is not easily inferred.
2. Discuss the statistical problems you might expect to have in each of the following
situations.
(a) Conducting a census of the population.
(b) Setting up a study of single-parent families.
(c) Establishing future demand for post-compulsory education.
3. Think of three quota controls you might use to make a quota sample of shoppers in
order to ask them about the average amount of money they spend on shopping per
week. Two controls should be easy for your interviewer to identify. How can you
help them with the third control?
4. You have been asked to design a random sample in order to study the way school
children learn in your country. Explain the clusters you might choose, and why.
5. Which method of contact might you use for your questionnaire in each of the
following circumstances?
(a) A random sample of school children about their favourite lessons.
(b) A random sample of households about their expenditure on non-essential
items.
(c) A quota sample of shoppers about shopping expenditure.
(d) A random sample of bank employees about how good their computing facilities
are.
(e) A random sample of the general population about whether they liked
yesterday’s television programmes.
6. You are carrying out a random sample survey of leisure patterns for a holiday
company, and have to decide whether to use interviews at people’s homes and
workplaces, postal (mail) questionnaires, or a telephone survey. Explain which
method of contact you would use, and why.
7. Your government is assessing whether it should change the speed limit on its
motorways or main roads. Several countries in your immediate area have lowered
their limit recently by 10 miles per hour.
What control factors might you use in order to examine the likely effect on road
accidents of a change in your country?
8. List some advantages and disadvantages of a longitudinal survey. Describe how you
would design such a survey if you were aiming to study the changes in people’s use
of health services over a 20-year period. State the target group, survey design, and
frequency of contact. You should give examples of a few of the questions you might
ask.
I. Sampling and experimental design
responsible for the lists of schools or companies and why and when they would be
updated. Is there a subscription to pay annually, for example, in order to be a
member of a school or company association, or is there a government regulation
which means that schools or companies must be registered?
The question of target group also affects the usefulness of your sampling frame. If
you want to know about foreign nationals resident in your country, then the
electoral register is hardly appropriate, nor will it help you if you are interested
primarily in contacting people under voting age.
The list of new buildings will help you get in touch with people living in particular
areas so you may find you stratify by the socio-economic characteristics of the place
rather than the people.
Schools and companies are fine as a contact for people for many purposes, but the
fact that they have been approached through a particular organisation may affect
responses to particular questions. Would you tell interviewers who contacted you in
school time and had got hold of you through your school that you hate school, or
even about what you do after school? Would you reply to a postal questionnaire
sent via your company if you were asked how often you took unnecessary sick leave?
The more you think about all this, the more difficult it can seem! So think of three
lists and then list the ways you could find them useful as a basis for sampling and
the problems you might have in particular circumstances.
2. (a) If the population is very large, then this increases the chances of making
non-sampling errors, which reduce the quality of the collected data. For
example, large amounts of data collected may result in more data-entry errors
leading to inaccurate data and hence inaccurate conclusions drawn from any
statistical analysis of the data.
(b) An adequate sampling frame of single-parent families is unlikely to be available
for use. This would make random sampling either impossible, or very
time-consuming if a sampling frame had to be created.
(c) Those who would potentially undertake post-compulsory education in the
future will be ‘young’ today. Data protection laws may prevent us from
accessing lists of school-age children. In addition, young children may not
realise or appreciate the merits of post-compulsory education so their current
opinions on the subject may differ from their future decisions about whether to
continue into further education and higher education.
3. The two quota controls which are relatively easy for your interviewer to identify are
gender and age group. They are also useful controls to use as they help you gain a
representative picture of shopping patterns. We would expect people to buy
different things according to whether they are female (particular perfumes, for
example) or male (special car accessories) or by age group (out-of-season holidays
for older people, or pop concert tickets for younger people) to give trivial examples.
If we are wrong, and older people are into pop concerts or lots of females want to
make their cars sportier, then we would find this out if we have made sure we have
sufficient numbers in each of the controls.
The question of a third control is more tricky. We would like to know about
people’s preferences if they spend more or less money when they shop, or according
to their income group, or depending on how many people they are shopping for
(just themselves, or their family, or perhaps their elderly neighbours). Of course,
the interviewer could ask people how much they spent on shopping last week, or
what their family income is, or how many people they are shopping for. People
might reply to the last of these questions, but may well be unhappy about the
other two – so you get a lot of refusals, and lose your sample!
Even if they do reply, the interviewer will then have to discard some of the
interviews they have started. If everyone interviewed so far is shopping for
themselves, for example, and the quota has been filled, the interviewer will have to
ignore the person they have stopped and go to look for someone else to ask whether
they have shopped for others!
If the aim is to interview people who have bought a lot of things at that time, then
the interviewer could wait for people to leave the store concerned, and stop people
who have bought a lot that day, a medium amount, or a small amount, judging by
how full their shopping trolley is! Or do the same if people are accompanied by
children or not on their shopping expedition. An alternative is to interview at more
than one shopping area and deliberately go to some shops in high-income areas and
some in low-income areas, knowing that most of the people you interview will be in
an income category which matches their surroundings.
4. Here an obvious cluster will be the school. This is because a main influence on how
children learn will be the school they are at – you could give yourself an enormous
amount of unnecessary extra work and expense if you choose a sample of 100
children, say, from 100 different schools – you would in fact have to find out about
each of their schools individually. So it makes sense to sample clusters of schools
themselves. For a similar reason, it would be a good idea to cluster children
according to the study group (class) to which they belong. The only problem with
this is that if you have too few clusters, it may be difficult to separate the effect of
outside influences (the effect of a particular teacher, for example) as opposed to the
actual teaching methods used.
5. There are no single ‘right’ answers to these – you may disagree with the
suggestions, but if you do make sure you can explain why you think what you do! In
addition, the explanation must use the kinds of arguments given in this chapter.
(a) This slightly depends on the age of the children concerned. The younger the
child, the more difficult it will be to elicit a clear written reply. Telephone is
probably not an option (annoying to parents). On the whole, interviews at
school make most sense – though we hope the children’s replies would not be
too influenced by their environment.
(b) The items are non-essential – we need clear records – probably a diary kept at
home (if people will agree to do this) would be best. Random digit dialling
might work, but you would need to catch people while they still remembered
their shopping. Interviewing outside the store is not an option – we have been
told this is a random survey so we will need a list of addresses rather than a
quota approach to people as they shop.
6. Arguments could be made for and against any of the suggested contact methods.
However, on balance, a postal (mail) questionnaire might be most preferable.
Respondents could then return the survey form at their own convenience and
would not be rushed to respond as would be the case with an interview or
telephone survey. Also, this would be the least expensive and least time-consuming
method to employ. An obvious disadvantage is the likely high non-response rate. A
suitable incentive could be used to help reduce non-response.
7. Possible control factors might be the:
• number of vehicles per capita of population (because the effect of speed limits
will be different the more cars there are being driven)
• number of vehicles per mile (or kilometre) of made-up road (for the same
reason)
• age distribution of the driving population (younger drivers are more likely to
be involved in high-speed accidents; the very old have a high incidence of
low-speed, low-fatality accidents)
• ‘mix’ of road types (on the whole, motorways tend to have fewer (but more
serious) accidents compared with minor roads).
These are a few controls you might like to use. You would not use them all at once
– the data would become too detailed for sensible statistical analysis and some of
the figures you want might not be available. You may think of more – check if the
figures exist for your country!
Appendix J
Correlation and linear regression
Salespeople (x) 210 209 219 225 232 221 220 233 200 215 205 227
Sales (y) 206 200 204 215 222 216 210 218 201 212 204 212
This indicates a strong, positive linear relationship between salespeople and sales.
The more salespeople employed, the higher the sales.
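The strength of this relationship can be verified directly from the data in the table. A minimal sketch computing the sample correlation coefficient (the numerical value is computed here, not quoted from the guide):

```python
from math import sqrt

# Data from the table: number of salespeople (x) and sales (y)
x = [210, 209, 219, 225, 232, 221, 220, 233, 200, 215, 205, 227]
y = [206, 200, 204, 215, 222, 216, 210, 218, 201, 212, 204, 212]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Corrected sums of squares and cross-products
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
s_yy = sum(yi ** 2 for yi in y) - n * y_bar ** 2

r = s_xy / sqrt(s_xx * s_yy)
print(round(r, 4))
```

A value of r close to +1 confirms the strong, positive linear relationship described above.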
2. State whether the following statements are true or false, explaining your answers.
(a) The sample correlation coefficient between x and y is the same as the sample
correlation coefficient between y and x.
(b) If the slope is negative in a regression equation, then the sample correlation
coefficient between x and y would be negative too.
(c) If two variables have a sample correlation coefficient of −1 they are not related.
(d) A large sample correlation coefficient means the regression line will have a
steep slope.
Solution:
(a) True. The sample correlation coefficient is:
r = Sxy / √(Sxx Syy)
which is unchanged if x and y are swapped, since Sxy = Syx.
3. Write down the simple linear regression model, explaining each term in the model.
Solution:
The simple linear regression model is:
y = β0 + β1 x + ε
where:
• y is the dependent (or response) variable
• x is the independent (or explanatory) variable
• β0 is the y-intercept
• β1 is the slope of the line
• ε is a random error term.
4. List the assumptions which we make when using the simple linear regression model
to explain changes in y by changes in x (i.e. the ‘regression of y on x’).
Solution:
We have the following four assumptions.
• A linear relationship between the variables of the form y = β0 + β1 x + ε.
• The existence of three model parameters: the linear equation parameters β0
and β1 , and the error term variance, σ 2 .
J.1. Worked examples
• Var(εi ) = σ 2 for all i = 1, 2, . . . , n, i.e. the error term variance is constant and
does not depend on the independent variable.
• The εi s are independent and N (0, σ 2 ) distributed for all i = 1, 2, . . . , n.
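These assumptions can be illustrated by simulation. A hypothetical sketch (the parameter values β0 = 5, β1 = 2 and σ = 1 are invented purely for illustration): generate data satisfying the model, then recover the parameters by least squares:

```python
import random

random.seed(42)

# Hypothetical true parameter values, chosen for illustration only
beta0, beta1, sigma = 5.0, 2.0, 1.0
n = 200

x = [random.uniform(0, 10) for _ in range(n)]
# Independent N(0, sigma^2) errors, as the model assumes
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]

# Least squares estimates of the slope and intercept
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

print(b0, b1)  # estimates close to the true values 5 and 2
```

With the errors independent, zero-mean and of constant variance, the estimates land close to the true parameters; violating those assumptions (for example, letting the error variance grow with x) is what undermines the inference methods built on this model.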
5. The table below shows the cost of fire damage for ten fires together with the
corresponding distances of the fires to the nearest fire station.
Distance in miles (x) 4.9 4.5 6.3 3.2 5.0 5.7 4.0 4.3 2.5 5.2
Cost in £000s (y) 31.1 31.1 43.1 22.1 36.2 35.8 25.9 28.0 22.9 33.5
Observation 1 2 3 4 5 6 7 8 9 10
xi 1.1 3.9 2.8 3.2 2.9 4.4 3.4 4.9 2.3 3.8
yi 6.4 17.0 12.8 14.4 13.1 18.7 15.1 20.6 11.0 16.6
Calculate the estimates of β0 and β1 for the regression of y on x based on the above
sample data.
Solution:
We have:
x̄ = 32.7/10 = 3.27 and ȳ = 145.7/10 = 14.57.
x (income) y (expenditure on essentials)
1,000 871
2,000 1,300
3,000 1,760
4,000 2,326
5,000 2,950
Σ xᵢ² = 55,000,000 and Σ yᵢ² = 19,659,017 (sums over i = 1, . . . , 5).
(b) The percentage of income which households spend on essential items is:
(y/x) × 100%
which is approximated by:
((β̂0 + β̂1 x) / x) × 100%.
This percentage is decreasing with increasing x so that poorer households
spend a larger proportion of their income on essentials.
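The decreasing percentage can be verified by fitting the least squares line to the five data points in the table. A sketch (the numerical estimates are computed here, not quoted from the guide):

```python
x = [1000, 2000, 3000, 4000, 5000]   # household income
y = [871, 1300, 1760, 2326, 2950]    # expenditure on essentials

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares estimates of slope and intercept
b1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / \
     (sum(xi ** 2 for xi in x) - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar

# Fitted percentage of income spent on essentials: (b0 + b1*x)/x * 100%
pct = [(b0 + b1 * xi) / xi * 100 for xi in x]
print(b0, b1)
print([round(p, 1) for p in pct])  # percentages fall as income rises
```

Because the fitted intercept is positive, the ratio (β̂0 + β̂1x)/x falls as x grows, which is exactly the pattern claimed above.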
8. The following table shows the number of computers (in 000s), x, produced by a
company each month and the corresponding monthly costs (in £000s), y, for
running its computer maintenance department.
(a) Calculate the sample correlation coefficient for computers and maintenance
costs.
(b) Find the best-fitting straight line relating y and x.
(c) Plot the points on a scatter diagram and draw the line of best fit.
(d) Comment on your results. How would you check on the strength of the
relationship you have found?
Solution:
(a) We have:
x̄ = 75.5/10 = 7.55 and ȳ = 1,080/10 = 108.
The sample correlation coefficient is:
r = (Σ xᵢyᵢ − nx̄ȳ) / √((Σ xᵢ² − nx̄²)(Σ yᵢ² − nȳ²))
[Scatter diagram of maintenance costs (in £000s) against computers produced (in 000s), with the line of best fit drawn.]
Note that to plot the line you need to compute any two points on the line. For
example, for x = 7 and x = 8 the y-coordinates are determined, respectively,
as:
37.415 + 9.349 × 7 ≈ 103 and 37.415 + 9.349 × 8 ≈ 112
giving points (7, 103) and (8, 112). The line of best fit should be drawn passing
through these two points.
(d) The sample correlation coefficient is close to 1, hence there is a strong, positive
linear relationship between computers and maintenance costs. More computers
means higher monthly maintenance costs. For each additional 1,000
computers, maintenance costs increase by £9,349.
(a) Calculate the sample correlation coefficient and comment on its value.
(b) Determine the line of best fit of y on x.
(c) Plot the points on a scatter diagram and draw the line of best fit.
(d) If you were told a student achieved a mark of 68 in Paper I, what would you
predict for the student’s mark in Paper II? Would you trust this prediction?
Solution:
(a) We have:
x̄ = 856/13 = 65.846 and ȳ = 1,041/13 = 80.077.
Therefore:
r = (Σ xᵢyᵢ − nx̄ȳ) / √((Σ xᵢ² − nx̄²)(Σ yᵢ² − nȳ²))
[Scatter diagram of Paper II examination mark against Paper I examination mark, with the line of best fit drawn.]
Note that to plot the line you need to compute any two points on the line. For
example, for x = 50 and x = 70 the y-coordinates are determined, respectively,
as:
26.5310 + 0.8132 × 50 ≈ 67 and 26.5310 + 0.8132 × 70 ≈ 83
giving points (50, 67) and (70, 83). The line of best fit should be drawn passing
through these two points.
(d) We have:
yb = 26.5310 + 0.8132 × 68 = 81.8286.
Since examination marks are integers, we round this to 82.
The Paper I mark of 68 is within the range of our x data, hence this is
interpolation so we trust the accuracy of the prediction.
10. The following table, published in USA Today, lists divorce rates and mobility rates
for different regions of the USA. The divorce rate is measured as the annual
number of divorces per 1,000 population, and the mobility rate is the percentage of
people living in a different household from five years before.
Region Mobility rate (x variable) Divorce rate (y variable)
New England 41 4.0
Middle Atlantic 37 3.4
East North Central 44 5.1
West North Central 46 4.6
South Atlantic 47 5.6
East South Central 44 6.0
West South Central 50 6.5
Mountain 57 7.6
Pacific 56 5.9
Summary statistics for the dataset (all sums over i = 1, . . . , 9) are:
Σ xᵢ = 422, Σ xᵢ² = 20,132, Σ yᵢ = 48.7, Σ yᵢ² = 276.91 and Σ xᵢyᵢ = 2,341.6.
(a) Calculate the sample correlation coefficient and comment on its value.
(b) Calculate the regression equation.
(c) Plot the points on a scatter diagram and draw the line of best fit.
(d) Compute the expected divorce rate if the mobility rate is 40.
(e) Why might it be reasonable to use the divorce rate as the y variable?
Solution:
(a) The sample correlation coefficient is:
r = (Σ xᵢyᵢ − nx̄ȳ) / √((Σ xᵢ² − nx̄²)(Σ yᵢ² − nȳ²))
which suggests a strong, positive linear relationship between divorce rate and
mobility rate (as seen from the scatter diagram in part (c)).
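The quantities needed in parts (a), (b) and (d) can be reproduced directly from the summary statistics given in the question. A sketch (the numerical values are computed here):

```python
from math import sqrt

# Summary statistics from the question (n = 9 regions)
n = 9
sum_x, sum_x2 = 422, 20132
sum_y, sum_y2 = 48.7, 276.91
sum_xy = 2341.6

x_bar, y_bar = sum_x / n, sum_y / n
s_xy = sum_xy - n * x_bar * y_bar
s_xx = sum_x2 - n * x_bar ** 2
s_yy = sum_y2 - n * y_bar ** 2

r = s_xy / sqrt(s_xx * s_yy)   # sample correlation coefficient
b1 = s_xy / s_xx               # slope estimate
b0 = y_bar - b1 * x_bar        # intercept estimate
print(round(r, 4), round(b1, 4), round(b0, 4))
```

The slope and intercept agree with the line ŷ = −2.4893 + 0.1685x used in parts (c) and (d), and r is close to +1, matching the "strong, positive" description.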
(c) [Scatter diagram of divorce rate (per 1,000 population) against mobility rate, with the line of best fit drawn.]
Note that to plot the line you need to compute any two points on the line. For
example, for x = 45 and x = 50 the y-coordinates are determined, respectively,
as:
−2.4893 + 0.1685 × 45 ≈ 5 and − 2.4893 + 0.1685 × 50 ≈ 6
giving points (45, 5) and (50, 6). The line of best fit should be drawn passing
through these two points.
(d) For x = 40, the expected divorce rate is −2.4893 + 0.1685 × 40 = 4.25 per
1,000 population.
(e) The use of the divorce rate as the dependent variable is reasonable due to the
likely disruptive effect moving home may have on relationships (any reasonable
argument would be acceptable here).
2. Think of an example where you feel the correlation is clearly spurious (that is,
there is a correlation, but no causal connection) and explain how it might arise.
Also, think of a ‘clear’ correlation and the circumstances in which you might accept
causality.
3. Work out β̂0 and β̂1 in Example 10.7 using advertising costs as the dependent
variable, and sales as the independent variable. Now predict advertising costs when
sales are £460,000.
Make sure you understand how and why your results are different from Example
10.7!
4. Try to think of a likely linear relationship between x and y which would probably
work over some of the data, but then break down like that in the anthropologist
case in Example 10.8. This should make sure you understand the difference
between interpolation and extrapolation.
5. The following data were recorded during an investigation into the effect of fertiliser
in g/m2 , x, on crop yields in kg/ha, y.
Crop yields (kg/ha) 160 168 176 179 183 186 189 186 184
Fertiliser (g/m2 ) 0 1 2 3 4 5 6 7 8
Here are some useful summary statistics (all sums over i = 1, . . . , 9):
Σ xᵢ = 36, Σ yᵢ = 1,611, Σ xᵢyᵢ = 6,627, Σ xᵢ² = 204 and Σ yᵢ² = 289,099.
(a) Plot the data and comment on the appropriateness of using the simple linear
regression model.
(b) Calculate a least squares regression line for the data.
(c) Predict the crop yield for 3.5 g/m2 of fertiliser.
(d) Would you feel confident predicting a crop yield for 10 g/m2 of fertiliser?
Briefly justify your answer.
342
J.3. Solutions to Practice problems
2. Here you should think about two things which have risen or fallen over time
together, but have no obvious connection. Examples might be the number of
examination successes and the number of films shown per year in a town.
A ‘clear’ correlation might be recovery rates from a particular illness in relation to
the amount of medicine given. You might accept this correlation as perhaps being
causal ‘other things equal’ if everyone who got the disease were given medicine as
soon as it was diagnosed and also if recovery began only after the medicine was
administered.
3. We now have:
β̂1 = (12 × 191,325 − 410 × 5,445) / (12 × 2,512,925 − (5,445)²) = 0.1251
and:
β̂0 = (410 − 0.1251 × 5,445) / 12 = −22.5975.
So the regression equation, if we decide that advertising costs depend on sales, is:
ŷ = −22.5975 + 0.1251x.
We are assuming that as sales rise, the company concerned spends more on
advertising. When sales are £460,000 we get predicted advertising costs of:
ŷ = −22.5975 + 0.1251 × 460 = 34.9485
i.e. £34,948.50.
Note that the xs and the ys were given in thousands, so be careful over the units of
measurement!
5. (a) We have:
[Scatter diagram of crop yields (kg/ha) against fertiliser (g/m²).] The points rise roughly linearly over most of the range, so a simple linear regression model seems reasonable, although the yield appears to level off, and even dip, at the highest fertiliser levels.
(d) No, we would not feel confident since this would be clear extrapolation. The x
data values do not exceed 8, and the scatter diagram suggests that fertiliser
values above 7 g/m2 may actually have a negative effect on crop yield.
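For parts (b) and (c), the least squares line can be obtained from the summary statistics given with the question. A sketch (the numerical estimates below are computed here, not quoted from the printed solution):

```python
# Summary statistics from the question (n = 9)
n = 9
sum_x, sum_y = 36, 1611
sum_xy, sum_x2 = 6627, 204

x_bar, y_bar = sum_x / n, sum_y / n   # 4.0 and 179.0

# Least squares estimates for the regression of yield (y) on fertiliser (x)
b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar

y_hat = b0 + b1 * 3.5   # predicted crop yield for 3.5 g/m^2 of fertiliser
print(b0, b1, y_hat)
```

The prediction for 3.5 g/m² is interpolation, since 3.5 lies well within the observed range of x values, unlike the 10 g/m² case discussed in part (d).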
Appendix K
Examination formula sheet
Sample size determination for estimating a population mean:
n ≥ (zα/2)² σ² / e²
Sample size determination for estimating a population proportion:
n ≥ (zα/2)² p(1 − p) / e²
z test of hypothesis for a single mean (σ known):
Z = (X̄ − µ0) / (σ/√n)
t test of hypothesis for a single mean (σ unknown):
T = (X̄ − µ0) / (S/√n)
z test of hypothesis for a single proportion:
Z ≈ (P − π0) / √(π0(1 − π0)/n)
z test for the difference between two means (variances known):
Z = (X̄1 − X̄2 − (µ1 − µ2)) / √(σ1²/n1 + σ2²/n2)
t test for the difference between two means (variances unknown):
T = (X̄1 − X̄2 − (µ1 − µ2)) / √(Sp²(1/n1 + 1/n2))
Confidence interval endpoints for the difference between two means:
x̄1 − x̄2 ± tα/2, n1+n2−2 × √(sp²(1/n1 + 1/n2))
Pooled variance estimator when assuming equal variances:
Sp² = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2)
t test for the difference in means in paired samples:
T = (X̄d − µd) / (Sd/√n)
Confidence interval endpoints for the difference in means in paired samples:
x̄d ± tα/2, n−1 × sd/√n
z test for the difference between two proportions:
Z = (P1 − P2 − (π1 − π2)) / √(P(1 − P)(1/n1 + 1/n2))
Appendix L
Sample examination paper
Candidates should answer THREE questions: all parts of Section A (50 marks in total)
and TWO questions from Section B (25 marks each).
Section A
Answer all parts of question 1 (50 marks in total).
(6 marks)
(b) Classify each one of the following variables as either measurable (continuous)
or categorical. If a variable is categorical, further classify it as either nominal
or ordinal. Justify your answer. (No marks will be awarded without a
justification.)
i. Types of musical instrument.
ii. Interest rates set by a central bank.
iii. Finishing position of athletes in a sprint race.
(6 marks)
(c) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a simple true/false answer.)
i. If A and B are mutually exclusive events with P (A) > 0 and P (B) > 0,
then P (A | B) = P (A).
ii. The probability that a normal random variable is less than one standard
deviation from its mean is 95%.
iii. If a 90% confidence interval for π is (0.412, 0.428), then this means that
there is a 90% probability that 0.412 < π < 0.428.
iv. A hypothesis test which is not significant at the 10% significance level can
be significant at the 1% significance level.
v. If the value of β̂1 in a simple linear regression is −0.1, then the variables x
and y must have a weak, negative sample correlation coefficient.
(10 marks)
(d) In your own words, define the term non-response bias and provide a real-world
example.
(4 marks)
(e) A machine fills bottles with water. The amount that the machine delivers is
normally distributed, with a mean of 1,000 cm3 and a variance of σ 2 . 10% of
filled bottles are known to have less than 995 cm3 of water.
i. Calculate σ to three decimal places.
(3 marks)
ii. A bottle overflows if it is filled with more than 1,010 cm3 of water.
Calculate the probability that a bottle overflows, given that the machine
fills it with at least 1,005 cm3 of water. Express your answer in terms of Φ,
the cumulative distribution function of the standard normal distribution.
(4 marks)
iii. A random sample of 20 bottles is measured, and the mean of this sample
is required. Calculate the probability that this sample mean is less than
1,001 cm3 to four decimal places.
(4 marks)
(f) The probability distribution of a random variable X is given below.
X=x −5 −1 1 5
P (X = x) 3c 2c 2c 3c
(3 marks)
(g) You wish to estimate a population mean, µ, and know to use the following
formula to determine the sample size:
n ≥ (zα/2)² σ² / e².
(6 marks)
Section B
Answer two out of the three questions from this section (25 marks each).
i. Based on the data in the table, and without conducting any significance
test, would you say there is an association between public opinion on the
new policy and the city of residence? Provide a brief justification for your
answer.
ii. Calculate the chi-squared statistic for the hypothesis of independence
between public opinion on the new policy and the city of residence, and
test that hypothesis at two appropriate levels. What do you conclude?
(13 marks)
(b) You work for a market research company and your boss has asked you to carry
out a random sample survey for a mobile phone company to identify whether a
recently launched mobile phone is attractive to younger people. Limited time
and money resources are available at your disposal. You are being asked to
prepare a brief summary containing the items below in no more than three
sentences for each of them.
i. Choose an appropriate probability sampling scheme. Provide a brief
justification for your answer.
ii. Describe the sampling frame and the method of contact you will use.
Briefly explain the reasons for your choices.
iii. Provide an example in which response bias may occur. State an action
that you would take to address this issue.
iv. State the main research question of the survey. Identify the variables
associated with this question.
(12 marks)
iii. Give a 98% confidence interval for the mean hours in the office of women.
(12 marks)
4. (a) The data below are the exam marks of 30 students at a particular course.
42 44 45 45 47
47 48 52 53 54
55 55 56 56 57
58 59 60 62 63
63 64 64 65 66
66 67 73 95 98
i. Find the mean, the median and the interquartile range of the data above.
It is given that the sum of the data is 1,779.
ii. Carefully construct, draw and label a boxplot of these data.
iii. Comment on the data, given the shape of the boxplot and the measures
which you have calculated.
(12 marks)
(b) i. A doctor is conducting an experiment to test whether a new treatment for
a disease is effective. In this context, a treatment is considered effective if
it is successful with a probability of more than 0.50. The treatment was
applied to 40 randomly sampled patients and it was successful for 26 of
them. You are asked to use an appropriate hypothesis test to determine
whether the treatment is effective in general. State the test hypotheses,
and specify your test statistic and its distribution under the null
hypothesis. Comment on your findings.
ii. A second experiment followed where a placebo pill was given to another
group of 30 randomly sampled patients. A placebo pill contains no
medication and is prescribed so that the patient will expect to get well. In
some situations, this expectation is enough for the patient to recover. This
effect, also known as the placebo effect, occurred in the second experiment
where 16 patients recovered. You are asked to consider an appropriate
hypothesis test to incorporate this new evidence with the previous data
and reassess the effectiveness of the new treatment.
(13 marks)
[END OF PAPER]
Appendix M
Sample examination paper –
Solutions
1. (a) i. We have:
Σ_{i=1}^{2} 1/xᵢ = −1/5 + 1/4 = 1/20 = 0.05.
ii. We have:
Σ_{i=1}^{3} |√(xᵢ yᵢ)| = |√(−45 × −5)| + |√(25 × 4)| + |√(3 × 3)| = 15 + 10 + 3 = 28.
iii. We have:
−x₁ − Σ_{i=2}^{3} xᵢ² yᵢ² = −(−45) − (25² × 4²) − (3² × 3²) = 45 − 10,000 − 81 = −10,036.
i=2
ii. Measurable. Interest rates can be measured in percentage points (or basis
points) to several decimal places.
iii. Categorical, ordinal. Finishing position is in rank order, i.e. 1st, 2nd, 3rd
etc.
ii. False. The probability that a normal random variable is less than one
standard deviation from its mean is ≈ 68%.
iii. False. 90% of the time the computed confidence interval covers π. As π is
unknown, it either does or does not fall in the interval (0.412, 0.428).
iv. False. If a hypothesis test is not significant at the 10% significance level,
then it cannot be significant at any lower percentage significance level.
Alternatively, this could be illustrated with a suitable diagram or with
p-values.
v. False. While β̂1 = −0.1 < 0 means the correlation is negative, we
cannot infer the strength of the correlation from the slope alone.
Hence:
−5/σ = −1.282 ⇒ σ = 3.900 cm³.
In the above, it is important to at least get the correct probability
expression, then identify the correct z-value and finally solve for σ.
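The z-value −1.282 used above is the lower 10th percentile point of the standard normal distribution, read from statistical tables. As a check, it can also be computed directly (a sketch; the small discrepancy in σ reflects table rounding):

```python
from statistics import NormalDist

# P(X < 995) = 0.10 with X ~ N(1000, sigma^2), so (995 - 1000)/sigma = z_0.10
z = NormalDist().inv_cdf(0.10)        # lower 10th percentile of N(0, 1)
sigma = (995 - 1000) / z
print(round(z, 3), round(sigma, 3))   # table value -1.282 gives sigma = 3.900
```

Using the full-precision percentile gives σ ≈ 3.902 rather than 3.900; either is acceptable given the precision of the tables.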
ii. We require:
(f) i. We require:
Σ p(x) = 3c + 2c + 2c + 3c = 1 ⇒ c = 0.10.
ii. We have:
E(X) = Σ x p(x) = (−5 × 0.30) + (−1 × 0.20) + (1 × 0.20) + (5 × 0.30) = 0.
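As a quick numerical check of part (f), using exact fractions:

```python
from fractions import Fraction

# P(X = x) = 3c, 2c, 2c, 3c for x = -5, -1, 1, 5; probabilities must sum to 1
weights = {-5: 3, -1: 2, 1: 2, 5: 3}
c = Fraction(1, sum(weights.values()))          # 10c = 1, so c = 1/10
expectation = sum(x * w * c for x, w in weights.items())
print(c, expectation)                           # c = 1/10 and E(X) = 0
```

The expectation is zero because the distribution is symmetric about zero: each value x has the same probability as −x.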
(g) i. zα/2 would depend on the level of confidence we required. A default choice
would be 95%, i.e. zα/2 = 1.96, but a case could be made for other
confidence levels.
ii. σ 2 could be an assumed value of the population variance. Alternatively,
use a pilot study and use the sample variance, s2 , to estimate σ 2 .
iii. e would be our desired tolerance on estimation error, whose value would
depend on our requirements. Any sensible value would be acceptable.
The test statistic is:
χ² = Σᵢ,ⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ
which gives a value of 5.273. This is a 2 × 3 contingency table so the
degrees of freedom are (2 − 1)(3 − 1) = 2.
For α = 0.05, the critical value is 5.991, hence we do not reject H0 . For
α = 0.10, the critical value is 4.605, hence we reject H0 . There is weak
evidence of an association between public opinion on the new policy and
the city of residence.
(b) General instructions: we are asked for accuracy and random (probability)
sampling, so this implies using some kind of a list.
i. Stratified random sampling is appropriate here due to the accuracy
requirement.
ii. An example answer is given below:
• Use list provided by the mobile phone company to identify those who
bought the recently launched mobile phone.
• List could be postal address, phone or email.
• Stratify by age group or by gender of buyer.
• Explanation as to which you would prefer. For example, email is fast
if all have it but there may be no response.
iii. If anonymity was not guaranteed, then response bias may occur. Ensuring
anonymity is conveyed to respondents should mitigate this.
iv. Main research question: How does the new mobile phone compare with
previous models?
Associated variables: mobile phone model, and a measure of consumer
preference.
3. (a) i. A reasonable scatter diagram is shown below:
[Scatter diagram of cigarette consumption against alcohol consumption.]
ii. The summary statistics can be substituted into the formula for the sample
correlation coefficient to obtain the value 0.9260. The higher the alcohol
consumption, the higher the cigarette consumption. The fact that the
value is close to 1, suggests that this is a strong, positive linear
relationship.
iii. The regression line can be written by the equation:
• For α = 0.05, critical values are ±1.96 (±2.00 if the t60 distribution is
used).
• Decision: reject H0 .
• Choose smaller α, say α = 0.01, hence critical values are ±2.576 (or
±2.660), hence reject H0 .
• The test is highly significant, with strong evidence of a difference in
the mean hours in the office between males and females.
iii. The two main things to note here are the positive (right) skewness and
the fact that mean > median.
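The summary measures in part i. can be checked numerically; note that quartile conventions differ between textbooks, so the interquartile range may vary slightly with the method used. A sketch:

```python
from statistics import mean, median, quantiles

marks = [42, 44, 45, 45, 47, 47, 48, 52, 53, 54,
         55, 55, 56, 56, 57, 58, 59, 60, 62, 63,
         63, 64, 64, 65, 66, 66, 67, 73, 95, 98]

print(mean(marks))    # 59.3 (the given sum of 1,779 over 30 values)
print(median(marks))  # 57.5 (average of the 15th and 16th ordered values)

# One common quartile convention; other conventions give slightly different values
q1, q2, q3 = quantiles(marks, n=4, method='inclusive')
print(q3 - q1)        # interquartile range under this convention
```

The mean exceeding the median is the numerical counterpart of the right skewness noted in part iii., driven by the two outlying marks of 95 and 98.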
(b) This is a standard exercise on a one-sided hypothesis test for a single
proportion (part i.) and the difference between two proportions (part ii.).
The working of the exercise is given below.
i. Let πT denote the true probability for the new treatment to work. We can
use the following test:
• H0 : πT = 0.50 vs. H1 : πT > 0.50.
• Standard error: √(0.50 × (1 − 0.50)/40) = 0.0791.
• Test statistic value: 1.897.
• For α = 0.05, the critical value is 1.645.
• Decision: reject H0 .
• The test is moderately significant, with moderate evidence that this
treatment is better than doing nothing.
ii. Let πP denote the true probability for the patient to recover with the
placebo.
• H0 : πT = πP vs. H1 : πT > πP . For reference, the test statistic is:
(pT − pP) / s.e.(pT − pP) ∼ N(0, 1).
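This statistic can be evaluated with the data from both experiments (26 successes out of 40 with the treatment, 16 out of 30 with the placebo). A sketch using the pooled-proportion standard error from the formula sheet (the numerical result is computed here, not quoted from the guide):

```python
from math import sqrt

# Sample sizes and successes: treatment and placebo groups
n_t, s_t = 40, 26
n_p, s_p = 30, 16
p_t, p_p = s_t / n_t, s_p / n_p

# Pooled proportion under H0: pi_T = pi_P
pooled = (s_t + s_p) / (n_t + n_p)
se = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_p))

z = (p_t - p_p) / se
print(round(z, 3))  # compare with the critical value 1.645 at the 5% level
```

Under this pooled calculation z ≈ 0.99 < 1.645, so H0 would not be rejected at the 5% significance level: once the placebo effect is taken into account, the evidence that the treatment outperforms the placebo is weak.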
Dennis V. Lindley, William F. Scott, New Cambridge Statistical Tables, (1995) © Cambridge University Press, reproduced with permission.