
Undergraduate study in Economics,

Management, Finance and the Social Sciences

Statistics 1

J.S. Abdey

ST104a
2024

This subject guide is for a 100-level course offered as part of the University of London’s
undergraduate study in Economics, Management, Finance and the Social
Sciences. This is equivalent to Level 4 within the Framework for Higher Education
Qualifications in England, Wales and Northern Ireland (FHEQ).
For more information see: london.ac.uk
This guide was prepared for the University of London by:
James S. Abdey, BA (Hons), MSc, PGCertHE, PhD, Department of Statistics, London
School of Economics and Political Science.
This is one of a series of subject guides published by the University. We regret that
due to pressure of work the author is unable to enter into any correspondence
relating to, or arising from, the guide. If you have any comments on this subject
guide, please communicate these through the discussion forum on the virtual
learning environment.

University of London
Publications Office
Stewart House
32 Russell Square
London WC1B 5DN
United Kingdom
london.ac.uk

Published by: University of London


© University of London 2024
The University of London asserts copyright over all material in this subject guide
except where otherwise indicated. All rights reserved. No part of this work may
be reproduced in any form, or by any means, without permission in writing from
the publisher. We make every effort to respect copyright. If you think we have
inadvertently used your copyright material, please let us know.
Contents

0 Preface 1
0.1 Route map to the subject guide . . . . . . . . . . . . . . . . . . . . . . . 1
0.2 Introduction to the subject area . . . . . . . . . . . . . . . . . . . . . . . 1
0.3 Syllabus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
0.4 Aims and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
0.5 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
0.6 Employability outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
0.7 Overview of learning resources . . . . . . . . . . . . . . . . . . . . . . . . 3
0.7.1 The subject guide . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
0.7.2 Mathematical background . . . . . . . . . . . . . . . . . . . . . . 4
0.7.3 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
0.7.4 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
0.7.5 Online study resources . . . . . . . . . . . . . . . . . . . . . . . . 5
0.7.6 The VLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
0.7.7 Making use of the Online Library . . . . . . . . . . . . . . . . . . 7
0.8 Examination advice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1 Mathematics primer and the role of statistics in the research process 9


1.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Recommended reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Arithmetic operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Squares and square roots . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.7 Fractions and percentages . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.8 Some further notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.8.1 Absolute value . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.8.2 Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.9 Summation operator, Σ . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.10 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14


1.11 The graph of a linear function . . . . . . . . . . . . . . . . . . . . . . . . 15


1.12 The role of statistics in the research process . . . . . . . . . . . . . . . . 17
1.13 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.14 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.15 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 20
1.16 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 21

2 Data visualisation and descriptive statistics 23


2.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Recommended reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Types of variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5.1 Categorical variables . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6 Data visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.1 Presentational traps . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.2 Dot plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6.3 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6.4 Stem-and-leaf diagram . . . . . . . . . . . . . . . . . . . . . . . . 29
2.7 Measures of location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.7.1 Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.7.2 Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7.3 Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.8 Measures of dispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.8.1 Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.8.2 Boxplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.8.3 Variance and standard deviation . . . . . . . . . . . . . . . . . . . 37
2.9 Test your understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.10 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.11 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.12 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 46
2.13 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 47

3 Probability theory 51
3.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


3.3 Recommended reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


3.4 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5 The concept of probability . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.6 Relative frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.7 ‘Randomness’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.8 Properties of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.8.1 Notational vocabulary . . . . . . . . . . . . . . . . . . . . . . . . 57
3.8.2 Venn diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.8.3 The additive law . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.8.4 The multiplicative law . . . . . . . . . . . . . . . . . . . . . . . . 60
3.9 Conditional probability and Bayes’ formula . . . . . . . . . . . . . . . . . 61
3.9.1 Bayes’ formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.9.2 Total probability formula . . . . . . . . . . . . . . . . . . . . . . . 62
3.9.3 Independent events (revisited) . . . . . . . . . . . . . . . . . . . . 65
3.10 Probability trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.11 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.12 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.13 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 67
3.14 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 68

4 Random variables, the normal and sampling distributions 69


4.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3 Recommended reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.6 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . 72
4.7 Expectation of a discrete random variable . . . . . . . . . . . . . . . . . 73
4.8 Functions of a random variable . . . . . . . . . . . . . . . . . . . . . . . 75
4.9 Variance of a discrete random variable . . . . . . . . . . . . . . . . . . . 77
4.10 The normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.10.1 Standard normal statistical tables . . . . . . . . . . . . . . . . . . 80
4.10.2 The general normal distribution . . . . . . . . . . . . . . . . . . . 83
4.11 Sampling distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.12 Sampling distribution of X̄ . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.13 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88


4.14 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 88


4.15 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 88
4.16 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 89

5 Interval estimation 91
5.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3 Recommended reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.1 Principle of confidence intervals . . . . . . . . . . . . . . . . . . . 92
5.5 Interval estimation for a population mean . . . . . . . . . . . . . . . . . 94
5.5.1 Variance known (σ² known) . . . . . . . . . . . . . . . . . . . . . 94
5.5.2 Variance unknown (σ² unknown) . . . . . . . . . . . . . . . . . . 95
5.5.3 Student’s t distribution . . . . . . . . . . . . . . . . . . . . . . . . 96
5.5.4 Confidence interval for a single mean (σ² known) . . . . . . . . . 98
5.5.5 Confidence interval for a single mean (σ² unknown) . . . . . . . . 99
5.6 Confidence interval for a single proportion . . . . . . . . . . . . . . . . . 100
5.7 Sample size determination . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.8 Estimation of differences between parameters of two populations . . . . . 103
5.9 Difference between two population means . . . . . . . . . . . . . . . . . . 104
5.9.1 Unpaired samples: variances known . . . . . . . . . . . . . . . . . 104
5.9.2 Unpaired samples: variances unknown and unequal . . . . . . . . 106
5.9.3 Unpaired samples: variances unknown and equal . . . . . . . . . . 107
5.9.4 Paired (dependent) samples . . . . . . . . . . . . . . . . . . . . . 109
5.10 Difference between two population proportions . . . . . . . . . . . . . . . 111
5.11 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.12 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.13 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 114
5.14 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 114

6 Hypothesis testing principles 117


6.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.3 Recommended reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.4 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.5 Types of error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119


6.6 Significance level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121


6.7 Critical values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.7.1 Rejection region for two-tailed tests . . . . . . . . . . . . . . . . . 123
6.7.2 Rejection region for upper-tailed tests . . . . . . . . . . . . . . . . 124
6.7.3 Rejection region for lower-tailed tests . . . . . . . . . . . . . . . . 124
6.8 P-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.8.1 Interpretation of p-values . . . . . . . . . . . . . . . . . . . . . . . 129
6.8.2 Statistical significance versus practical significance . . . . . . . . . 131
6.9 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.10 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.11 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 132
6.12 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 132

7 Hypothesis testing of means and proportions 135


7.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.3 Recommended reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.4 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.5 Testing a population mean claim . . . . . . . . . . . . . . . . . . . . . . 137
7.6 Hypothesis test for a single mean (σ² known) . . . . . . . . . . . . . 139
7.7 Hypothesis test for a single mean (σ² unknown) . . . . . . . . . . . . 140
7.8 Hypothesis test for a single proportion . . . . . . . . . . . . . . . . . . . 143
7.9 Hypothesis testing of differences between parameters of two populations . 145
7.10 Difference between two population means . . . . . . . . . . . . . . . . . . 146
7.10.1 Unpaired samples: variances known . . . . . . . . . . . . . . . . . 146
7.10.2 Unpaired samples: variances unknown and unequal . . . . . . . . 147
7.10.3 Unpaired samples: variances unknown and equal . . . . . . . . . . 149
7.10.4 Paired (dependent) samples . . . . . . . . . . . . . . . . . . . . . 151
7.11 Difference between two population proportions . . . . . . . . . . . . . . . 153
7.12 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.13 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.14 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 156
7.15 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 157

8 Contingency tables and the chi-squared test 159


8.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159


8.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159


8.3 Recommended reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.4 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.5 Association versus correlation . . . . . . . . . . . . . . . . . . . . . . . . 160
8.6 Tests for association . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.6.1 Contingency tables . . . . . . . . . . . . . . . . . . . . . . . . . . 161
8.6.2 Expected frequencies . . . . . . . . . . . . . . . . . . . . . . . . . 161
8.6.3 Test statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.6.4 The chi-squared, χ², distribution . . . . . . . . . . . . . . . . . . 163
8.6.5 Degrees of freedom . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.6.6 Performing the test . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.7 Goodness-of-fit tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.7.1 Observed and expected frequencies . . . . . . . . . . . . . . . . . 166
8.7.2 The goodness-of-fit test . . . . . . . . . . . . . . . . . . . . . . . . 167
8.8 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.9 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.10 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 169
8.11 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 170

9 Sampling and experimental design 173


9.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
9.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
9.3 Recommended reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
9.4 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
9.5 Motivation for sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
9.6 Types of sampling techniques . . . . . . . . . . . . . . . . . . . . . . . . 175
9.6.1 Non-random sampling . . . . . . . . . . . . . . . . . . . . . . . . 175
9.6.2 Random sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.7 Sources of error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
9.8 Non-response bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
9.9 Method of contact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
9.10 Experimental design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
9.11 Observational studies and designed experiments . . . . . . . . . . . . . . 188
9.11.1 Observational study . . . . . . . . . . . . . . . . . . . . . . . . . 188
9.12 Longitudinal surveys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
9.13 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190


9.14 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 190


9.15 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 191
9.16 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 192

10 Correlation and linear regression 195


10.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
10.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
10.3 Recommended reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
10.4 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
10.5 Scatter diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
10.6 Causal and non-causal relationships . . . . . . . . . . . . . . . . . . . . . 198
10.7 Correlation coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
10.7.1 Spearman rank correlation . . . . . . . . . . . . . . . . . . . . . . 201
10.8 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
10.8.1 The simple linear regression model . . . . . . . . . . . . . . . . . 203
10.8.2 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . 204
10.8.3 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
10.8.4 Points to watch about linear regression . . . . . . . . . . . . . . . 206
10.9 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
10.10 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 208
10.11 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . 208
10.12 Solutions to Sample examination questions . . . . . . . . . . . . . . . . 209

A Mathematics primer and the role of statistics in the research process 211
A.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
A.2 Practice problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
A.3 Solutions to Practice problems . . . . . . . . . . . . . . . . . . . . . . . . 216

B Data visualisation and descriptive statistics 221


B.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
B.2 Practice problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
B.3 Solutions to Practice problems . . . . . . . . . . . . . . . . . . . . . . . . 235

C Probability theory 239


C.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
C.2 Practice problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
C.3 Solutions to Practice problems . . . . . . . . . . . . . . . . . . . . . . . . 258


D Random variables, the normal and sampling distributions 263


D.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
D.2 Practice problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
D.3 Solutions to Practice problems . . . . . . . . . . . . . . . . . . . . . . . . 277

E Interval estimation 283


E.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
E.2 Practice problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
E.3 Solutions to Practice problems . . . . . . . . . . . . . . . . . . . . . . . . 292

F Hypothesis testing principles 295


F.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
F.2 Practice problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
F.3 Solutions to Practice problems . . . . . . . . . . . . . . . . . . . . . . . . 298

G Hypothesis testing of means and proportions 301


G.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
G.2 Practice problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
G.3 Solutions to Practice problems . . . . . . . . . . . . . . . . . . . . . . 310

H Contingency tables and the chi-squared test 313


H.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
H.2 Practice problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
H.3 Solutions to Practice problems . . . . . . . . . . . . . . . . . . . . . . . . 320

I Sampling and experimental design 323


I.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
I.2 Practice problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
I.3 Solutions to Practice problems . . . . . . . . . . . . . . . . . . . . . . . . 327

J Correlation and linear regression 333


J.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
J.2 Practice problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
J.3 Solutions to Practice problems . . . . . . . . . . . . . . . . . . . . . . . . 343

K Examination formula sheet 345


L Sample examination paper 347

M Sample examination paper – Solutions 353


Chapter 0
Preface

0.1 Route map to the subject guide


This subject guide provides you with a framework for covering the syllabus of the
ST104A Statistics 1 course and directs you to additional resources such as readings
and the virtual learning environment (VLE). The material in this half course is
necessary as preparation for other courses you may study later on as part of your degree.

You may choose to take ST104B Statistics 2 so that you can study the concepts
introduced here in greater depth. Natural continuations of this half course and
ST104B Statistics 2 are the advanced half courses ST2133 Advanced
statistics: distribution theory and ST2134 Advanced statistics: statistical
inference.
Two applied statistics courses for which this half course is a prerequisite are
ST2187 Business analytics, applied modelling and prediction and ST3188
Statistical methods for market research.
You may wish to develop your skills in economic statistics by taking EC2020
Elements of econometrics, which requires ST104B Statistics 2 as well.

The chapters are not a series of self-contained topics; rather, they build on each other
sequentially. As such, you are strongly advised to follow the subject guide in chapter
order. There is little point in rushing past material which you have only partially
understood in order to reach the final chapter. Once you have completed your work on
all of the chapters, you will be ready for examination revision. A good place to start is
the sample examination paper which you will find at the end of the subject guide.
Colour has been included in places to emphasise important items. Formulae in the main
body of chapters are in blue – these exclude formulae used in examples. Key terms and
concepts when introduced are shown in magenta. References to other courses and half
courses are shown in purple (such as above). Terms in italics are shown in purple for
emphasis. References to chapters, sections, figures and tables are shown in teal.

0.2 Introduction to the subject area


Welcome to the wonderful world of statistics! This discipline has unparalleled
applicability in a wide range of areas such as finance, business, management, economics
and other fields in the social sciences. ST104A Statistics 1 provides you with the
opportunity to understand the fundamentals and gain the vital quantitative skills and
powers of analysis that are highly sought-after by employers in many sectors.


Statistics forms a core component of our programmes. All of the courses mentioned
above require an understanding of the concepts and techniques introduced in this
course. You will develop analytical skills on this course that will help you with your
future studies and in the world of work.

0.3 Syllabus
The up-to-date course syllabus for ST104A Statistics 1 can be found in the course
information sheet, which is available on the course VLE page.

0.4 Aims and objectives


The emphasis of this half course is on the application of statistical methods in
management, economics and the social sciences. We will focus on the interpretation of
tables and results, as well as the appropriate way to approach statistical problems. Note
that this course is at an elementary mathematical level. We will introduce ideas of
probability, inference and multivariate analysis which we will further develop in the half
course ST104B Statistics 2.

0.5 Learning outcomes


At the end of this half course, and having completed the essential reading and activities,
you should:

be familiar with the key ideas of statistics that are accessible to a student with a
moderate mathematical competence

be able to routinely apply a variety of methods for explaining, summarising and
presenting data and interpreting results clearly using appropriate diagrams, titles
and labels when required

be able to summarise the ideas of randomness and variability, and the way in which
these link to probability theory to allow the systematic and logical collection of
statistical techniques of great practical importance in many applied areas

have a grounding in probability theory and some grasp of the most common
statistical methods

be able to perform inference to test the significance of common measures such as
means and proportions and conduct chi-squared tests of contingency tables

be able to use correlation analysis and simple linear regression and know when it is
appropriate to do so.


0.6 Employability outcomes


Below are the three most relevant skill outcomes for students undertaking this course,
which can be conveyed to prospective employers:

1. complex problem-solving

2. decision making

3. communication.

0.7 Overview of learning resources

0.7.1 The subject guide


The subject guide is a self-contained resource, i.e. the content provided here is sufficient
to prepare for the examination. We will discuss in detail all examinable topics with
numerous worked examples and practice problems. It is essential to study extensively
using the subject guide in order to perform well in the final examination. You do not
need to buy a textbook, although you may want to read about the same topics written
by different authors. See the suggested ‘Further reading’ below.
The subject guide provides a range of activities that will enable you to test your
understanding of the basic ideas and concepts. We want to encourage you to try the
exercises that you encounter throughout the material before working through the
solutions. With statistics, the motto is ‘practise, practise, practise…’. It is the best way
to learn the material and prepare for examinations. The course is rigorous and
demanding, but the skills you will be developing will be rewarding and well recognised
by future employers.
A suggested approach for studying ST104A Statistics 1 is to split the material into
10 weeks as follows:

Week Chapter
1 Chapter 1: Mathematics primer and the role of statistics in the research process
2 Chapter 2: Data visualisation and descriptive statistics
3 Chapter 3: Probability theory
4 Chapter 4: Random variables, the normal and sampling distributions
5 Chapter 5: Interval estimation of means and proportions
6 Chapter 6: Hypothesis testing principles
7 Chapter 7: Hypothesis testing of means and proportions
8 Chapter 8: Contingency tables and the chi-squared test
9 Chapter 9: Sampling and experimental design
10 Chapter 10: Correlation and linear regression


We recommend the following procedure:

1. Read the introductory comments.

2. Study the chapter content and practice problems.

3. Go through the learning outcomes carefully.

4. Refer back to this subject guide, or to supplementary texts, to improve your
understanding until you are able to work through the problems confidently.

The last step is the most important. It is easy to think that you have understood the
material after reading it, but working through problems is the crucial test of
understanding. Problem-solving should take up most of your study time.
To prepare for the examination, you will only need to read the material in the subject
guide, but it may be helpful from time to time to look at the suggested ‘Further
reading’ below.

0.7.2 Mathematical background

To study and understand statistics you will need to be familiar with some simple
abstract mathematical concepts and apply common sense to see how to use these ideas
in real-life applications. The concepts needed for probability and statistical inference are
impossible to absorb by just reading them in a book – although you may find you need
to do this more than once! You need to read, then think, then try some problems, then
read and think some more. This process should be repeated until you find the problems
easy to do.
You will also need to use high-school arithmetic and understand some basic algebraic
ideas. These ideas are very important. Starting with them should help you feel
comfortable with this half course from the outset and they are therefore introduced to
you in Chapter 1.

Calculators

A calculator may be used when answering questions on the examination paper for
ST104A Statistics 1. It must comply in all respects with the specification given in
the Regulations. You should also refer to the admission notice you will receive when
entering the examination and the ‘Notice on permitted materials’.

0.7.3 Essential reading

This subject guide is ‘self-contained’ meaning that this is the only resource which is
essential reading for ST104A Statistics 1. Throughout the subject guide there are
many worked examples and sample examination questions replicating resources
typically provided in statistical textbooks.


Statistical tables

In the examination you will be provided with relevant extracts of:

Lindley, D.V. and W.F. Scott New Cambridge Statistical Tables. (Cambridge:
Cambridge University Press, 1995) 2nd edition [ISBN 9780521484855].

The relevant extracts can be found at the end of this subject guide, and are the same as
those distributed for use in the examination. It is advisable that you become familiar
with them, rather than those at the end of a textbook which may differ in presentation.

0.7.4 Further reading

As mentioned above, this subject guide is sufficient for study of ST104A Statistics 1.
Of course, you are free to read around the subject area in any text, paper or online
resource. You should support your learning by reading as widely as possible and by
thinking about how these principles apply in the real world. To help you read
extensively, you have free access to the virtual learning environment (VLE) and
University of London Online Library (see below).
Numerous titles are available covering the topics frequently covered in foundation
statistics courses such as ST104A Statistics 1. Due to the inevitable heterogeneity
among students taking this half course, some may find one author’s style easier to
understand than another’s.
That said, the recommended textbook for this course is:

Abdey, J. Business Analytics: Applied Modelling and Prediction. (London: SAGE
Publications, 2023) 1st edition [ISBN 9781529774092].

This textbook shows many real-world business applications of all statistical methods
covered in ST104A Statistics 1. The textbook is also useful for ST2187 Business
analytics, applied modelling and prediction and ST3188 Statistical methods
for market research so if you study either of these courses you can benefit from a
single textbook!

0.7.5 Online study resources

In addition to the subject guide and the Essential reading, it is crucial that you take
advantage of the study resources that are available online for this course, including the
VLE and the Online Library.
You can access the VLE, the Online Library and your University of London email
account via the Student Portal at: http://my.london.ac.uk
You should have received your login details for the Student Portal with your official
offer, which was emailed to the address that you gave on your application form. You
have probably already logged into the Student Portal in order to register! As soon as
you registered, you will automatically have been granted access to the VLE, Online
Library and your fully functional University of London email account.


If you have forgotten these login details, please click on the ‘Forgot Password’ link on
the login page.

0.7.6 The VLE

The VLE, which complements this subject guide, has been designed to enhance your
learning experience, providing additional support and a sense of community. It forms an
important part of your study experience with the University of London and you should
access it regularly.
The VLE provides a range of resources for EMFSS courses:

Course materials: Subject guides and other course materials available for
download. In some courses, the content of the subject guide is transferred into the
VLE and additional resources and activities are integrated with the text.

Readings: Direct links, wherever possible, to essential readings in the Online
Library, including journal articles and ebooks.

Video content: Including introductions to courses and topics within courses,
interviews, lessons and debates.

Screencasts: Videos of PowerPoint presentations, animated podcasts and
on-screen worked examples.

External material: Links out to carefully selected third-party resources.

Self-test activities: Multiple-choice, numerical and algebraic quizzes to check
your understanding.

Collaborative activities: Work with fellow students to build a body of
knowledge.

Discussion forums: A space where you can share your thoughts and questions
with fellow students. Many forums will be supported by a ‘course moderator’, a
subject expert employed by LSE to facilitate the discussion and clarify difficult
topics.

Past examination papers: We provide up to three years of past examinations
alongside Examiners’ commentaries that provide guidance on how to approach the
questions.

Study skills: Expert advice on getting started with your studies, preparing for
examinations and developing your digital literacy skills.

Some of these resources are available for certain courses only, but we are expanding our
provision all the time and you should check the VLE regularly for updates.


0.7.7 Making use of the Online Library

The Online Library (https://onlinelibrary.london.ac.uk) contains a huge array of journal
articles and other resources to help you read widely and extensively.
To access the majority of resources via the Online Library you will either need to use
your University of London Student Portal login details, or you will be required to
register and use an Athens login.
The easiest way to locate relevant content and journal articles in the Online Library is
to use the Summon search engine.
If you are having trouble finding an article listed in a reading list, try removing any
punctuation from the title, such as single quotation marks, question marks and colons.
For further advice, please use the online help pages
(https://onlinelibrary.london.ac.uk/resources/summon) or contact the Online Library
team using the ‘Chat with us’ function.

0.8 Examination advice

Important: The information and advice given here are based on the examination
structure used at the time this subject guide was written. Please note that subject
guides may be used for several years. Because of this we strongly advise you to always
check both the current Programme regulations for relevant information about the
examination, and the VLE where you should be advised of any forthcoming changes.
You should also carefully check the rubric/instructions on the paper you actually sit
and follow those instructions.
The examination is by a two-hour unseen question paper. No books may be taken into
the examination, but the use of calculators is permitted, and statistical tables and a
formula sheet are provided (the formula sheet can be found at the end of the subject
guide).
Section A, worth 50 marks, is compulsory with several short questions covering a wide
range of the syllabus. In Section B two out of three longer questions must be answered,
also worth 50 marks in total. You may use your calculator whenever you feel it is
appropriate, always remembering that the examiners can give marks only for what
appears on the examination script. Therefore, it is important to always show your
working.
In terms of the examination, as always, it is important to manage your time carefully
and not to dwell on one question for too long – move on and focus on solving the easier
questions, coming back to harder ones later.
Remember, it is important to check the VLE for:

up-to-date information on examination and assessment arrangements for this course

where available, past examination papers and Examiners’ commentaries for the
course which give advice on how each question might best be answered.

Chapter 1
Mathematics primer and the role of
statistics in the research process

1.1 Synopsis of chapter

This chapter outlines the essential mathematical building blocks which you will need to
work with in this half course. Most of these will likely be revision to you, but some new
material may be introduced. There is also a general introduction to some of the
statistical ideas which you will be learning about in this half course. It should enable
you to link different parts of the syllabus and see their relevance to each other and also
to the other courses you are studying.

1.2 Learning outcomes

By the end of this chapter, and having completed the recommended reading and
activities, you should be able to:

manipulate arithmetic and algebraic expressions using simple rules

recall and use common signs: square, square root, ‘greater than’, ‘less than’ and
absolute value

demonstrate use of the summation operator and work with the ‘i’, or index, of x

draw the straight line for a linear function

explain the role of statistics in the research process.

1.3 Recommended reading

Abdey, J. Business Analytics: Applied Modelling and Prediction. (London: SAGE
Publications, 2023) 1st edition [ISBN 9781529774092] Chapter 1.

1.4 Introduction
This opening chapter introduces some basic concepts and mathematical tools upon
which the rest of the half course is built. Before proceeding to the rest of the subject

guide, it is essential that you have a solid understanding of these fundamental concepts
and tools.
You should be a confident user of the basic mathematical operations (addition,
subtraction, multiplication and division) and be able to use these operations on a
calculator. The content of this chapter is expected to be a ‘refresher’ of the elementary
algebraic and arithmetic rules from schooldays. Some material featured in this chapter
may be new to you, such as the summation operator and graphs of linear functions. If
so, you should master these new ideas before progressing.
Finally, remember that although it is unlikely that an examination question would test
you on the topics in this chapter alone, the material covered here may well be an
important part of the answer!

1.5 Arithmetic operations

We begin with elementary arithmetic operations which will be used when working with
data in ST104A Statistics 1. Often students understand the statistical concepts but
fail to manage a problem because they cannot do the required arithmetic. Although this
is not primarily an arithmetic paper, many calculations will be used, so it is vital to
ensure that you are comfortable with the examples and activities in this subject guide.
The acronym to remember is BODMAS, which tells us the correct order (that is, the
priority) in which mathematical operations are performed:

Brackets

Order (i.e. powers, square roots etc.)

Division

Multiplication

Addition

Subtraction.

You should also know that:

the sum of a and b means a + b

the difference between a and b means either a − b or b − a

the product of a and b means a × b = a · b

the quotient of a and b means a divided by b, i.e. a/b.

Example 1.1 What is (35 ÷ 7 + 2) − (4² − 8 × 3)?

BODMAS tells us to work out the brackets first. Here there are two sets of brackets,
so let us deal with them one at a time.

First bracket: 35 ÷ 7 + 2
• do division first: 35 ÷ 7 + 2 = 5 + 2
• then perform the addition: 5 + 2 = 7.

Second bracket: 4² − 8 × 3
• do order first: 4² − 8 × 3 = 16 − 8 × 3
• next do multiplication: 16 − 8 × 3 = 16 − 24
• then perform the subtraction: 16 − 24 = −8.

Now that the problem has been simplified, we complete the calculation with the final
subtraction: 7 − (−8) = 7 + 8 = 15. Note that the two negatives become positive!
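Python’s operator precedence follows the same BODMAS rules, so calculations like Example 1.1 can be checked directly. This is an illustrative sketch only; no programming is required for this half course.

```python
# Brackets first, then Order (powers), then Division/Multiplication,
# then Addition/Subtraction -- Python applies the same precedence.
first_bracket = 35 / 7 + 2       # division before addition: 5 + 2 = 7
second_bracket = 4**2 - 8 * 3    # power, then multiplication: 16 - 24 = -8
result = first_bracket - second_bracket
print(result)                    # 7 - (-8) = 15.0
```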

1.6 Squares and square roots

The power is the number of times a quantity is to be multiplied by itself. For example,
3⁴ = 3 × 3 × 3 × 3 = 81. Any number raised to the power 2 is called ‘squared’, hence x²
is ‘x squared’, which is simply x × x.
Remember that squared values, such as x², are always non-negative. This is important,
for example, when we compute the quantity s² in Chapter 2 which involves squared
terms, so a negative answer should ring alarm bells telling us a mistake has been made!

It might be helpful to think of the square root of x (denoted √x) as the reverse of
the square, such that √x × √x = x. Note that positive real numbers have two square
roots: ±√81 = ±9, although the positive square root will typically be used in ST104A
Statistics 1. In practice, the main problems you will encounter involve taking square
roots of numbers with decimal places. Be careful that you understand that 0.9 is the
square root of 0.81, and that 0.3 is the square root of 0.09 (and not 0.9!). Of course, in
the examination you can perform such calculations on your calculator, but it always
helps to have an idea of what the answer should be as a feasibility check of your answer!
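These facts are easy to verify numerically. A quick Python sketch, again purely illustrative:

```python
import math

# the square root reverses the square: sqrt(x) * sqrt(x) = x
root = math.sqrt(0.81)
print(math.isclose(root, 0.9))             # True: 0.9 is the square root of 0.81
print(math.isclose(math.sqrt(0.09), 0.3))  # True: 0.3 (and not 0.9!)

# squared values are always non-negative, whatever the sign of x
x = -3.7
print(x**2 >= 0)                           # True
```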

1.7 Fractions and percentages

A fraction is part of a whole and can be expressed as either:

a common fraction: for example, 1/2 or 3/8

a decimal fraction: for example, 0.50 or 0.375.

In the common fraction, the top number is the numerator and the bottom number is
the denominator. In practice, decimal fractions are more commonly used.

When multiplying fractions together, just multiply all the numerators together to
obtain the new numerator, and do the same with the denominators. For example:

4/9 × 1/3 × 2/5 = (4 × 1 × 2)/(9 × 3 × 5) = 8/135.

Percentages give an alternative way of representing fractions by relating a particular
quantity to the whole in parts per hundred. For example, 60% is 60 parts per 100,
which, as a common fraction, is simply 60/100.
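Python’s standard library can represent common fractions exactly, which is handy for checking hand calculations such as the one above. An illustrative sketch:

```python
from fractions import Fraction

# multiply the numerators together, and the denominators together
product = Fraction(4, 9) * Fraction(1, 3) * Fraction(2, 5)
print(product)               # 8/135

# a percentage is parts per hundred: 60% as a common fraction
sixty_percent = Fraction(60, 100)
print(sixty_percent)         # 3/5, automatically simplified
print(float(sixty_percent))  # 0.6 as a decimal fraction
```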

1.8 Some further notation

1.8.1 Absolute value

One useful sign in statistics is | | which denotes the absolute value. This is the
numerical value of a real number regardless of its sign (positive or negative). The
absolute value of x, sometimes referred to as the modulus of x, or ‘mod x’, is |x|. So
|7.1| = |−7.1| = 7.1.
Statisticians sometimes want to indicate that they only want to use the positive value of
a number. For example, let the distance between town X and town Y be 5 miles.
Suppose someone walks from X to Y – a distance of 5 miles. A mathematician would
write this as +5 miles. Later, after shopping, the person returns to X and the
mathematician would record them as walking −5 miles (taking into account the direction
of travel). In this way the mathematician can show that the person ended up where they
started. We, however, may be more interested in the fact that the person has had some
exercise that day! So, we need notation to indicate this. The absolute value enables us
to take only the positive values of our variables. The distance, d, from Y to X may well
be expressed mathematically as −5 miles, but you will probably be interested in the
absolute amount, so |−d| = d.
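In code, the built-in abs() function in Python plays the role of | |. A minimal sketch of the walking example:

```python
distance_out = 5      # X to Y: +5 miles
distance_back = -5    # Y to X: -5 miles, accounting for direction of travel

# the mathematician's net displacement: back where we started
print(distance_out + distance_back)            # 0

# the total exercise taken: use absolute values instead
print(abs(distance_out) + abs(distance_back))  # 10

assert abs(7.1) == abs(-7.1) == 7.1
```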

1.8.2 Inequalities

An inequality is a mathematical statement that one quantity is greater or less than
another:

x > y means ‘x is greater than y’

x ≥ y means ‘x is greater than or equal to y’

x < y means ‘x is less than y’

x ≤ y means ‘x is less than or equal to y’

x ≈ y means ‘x is approximately equal to y’.

1.9 Summation operator, Σ

The summation operator, Σ, is likely to be new to many of you. It is widely used in
statistics and you will come across it frequently in ST104A Statistics 1, so make sure
you are comfortable using it before proceeding further!
Statistics involves data analysis, so to use statistical methods we need data! Individual
observations are typically represented using a subscript notation. For example, the
heights of n people1 would be represented by x1 , x2 , . . . , xn , where the subscript denotes
the order in which the heights are observed (x1 represents the height of the first observed
person, x2 the height of the second observed person etc.). Hence xi represents the height
of the ith individual and, in order to list them all, the subscript i must take all integer
values from 1 to n, inclusive. So, the whole set of observations is {xi : i = 1, 2, . . . , n}
which can be read as ‘a set of observations xi such that i goes from 1 to n’.
Summation operator, Σ

The sum of a set of n observations, that is x1 + x2 + · · · + xn, may be written as:

Σ_{i=1}^{n} xi                                                               (1.1)

where Σ is the summation operator, which can be read as ‘the sum of’. Therefore,
Σ_{i=1}^{n} xi is read as ‘the sum of xi, for i equals 1 to n’.

We see that the summation is said to be over i, where i is the index of summation
and the range of i, in (1.1), is from 1 to n. The lower bound of the range is the value of
i written underneath Σ, and the upper bound is written above it. Note that the lower
bound can be any integer (positive, negative or zero), such that the summation is over
all values of the index of summation in step increments of size one from the lower
bound to the upper bound, inclusive.
As stated above, Σ appears frequently in statistics. For example, in Chapter 2 you will
meet descriptive statistics including the arithmetic mean of observations which is
defined as:

x̄ = (1/n) Σ_{i=1}^{n} xi.

Rather than write out Σ_{i=1}^{n} xi in full, when all the xi s are summed we sometimes
write short-cuts, such as Σ_1^n xi, or (when the range of summation is obvious) just
Σ xi. Note that the resulting sum does not involve i in any form. Hence the sum is
unaffected by (or invariant to) the choice of letter used for the index of summation.
(Throughout this half course, n will denote a sample size.) Hence, for

example, the following summations are all equal:

Σ_{i=1}^{n} xi = Σ_{j=1}^{n} xj = Σ_{k=1}^{n} xk

since each represents x1 + x2 + · · · + xn. Sometimes the way that xi depends on i is
known. For example, if xi = i, we have:

Σ_{i=1}^{3} xi = Σ_{i=1}^{3} i = 1 + 2 + 3 = 6.

However, do not always assume that xi = i!

Example 1.2 If {xi : i = 1, 2, . . . , n} is a set of observations, we might observe
x1 = 4, x2 = 5, x3 = 1, x4 = −2 and x5 = 9. Therefore:

Σ_{i=1}^{4} xi² = 4² + 5² + 1² + (−2)² = 46

and:

Σ_{i=4}^{5} xi(xi − 2) = Σ_{i=4}^{5} (xi² − 2xi) = ((−2)² − 2 × (−2)) + (9² − 2 × 9) = 71

remembering to use BODMAS in the second example.
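Sums such as those in Example 1.2 can be written very compactly in code. A Python sketch (note that Python lists index from 0, so the observation xi sits at position i − 1):

```python
x = [4, 5, 1, -2, 9]   # the observations x_1, ..., x_5

# sum of x_i squared, for i = 1 to 4
sum_of_squares = sum(x[i - 1]**2 for i in range(1, 5))
print(sum_of_squares)   # 4^2 + 5^2 + 1^2 + (-2)^2 = 46

# sum of x_i(x_i - 2), for i = 4 to 5
sum_product = sum(x[i - 1] * (x[i - 1] - 2) for i in range(4, 6))
print(sum_product)      # ((-2)^2 - 2*(-2)) + (9^2 - 2*9) = 71
```

Note that range(1, 5) runs from 1 up to and including 4, mirroring the bounds of the summation.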

1.10 Graphs
In Chapter 2 you will spend some time learning how to present data in graphical form,
and also in the representation of the normal distribution in Chapter 4. You should make
sure you have understood the following material. If you are taking MT105A
Mathematics 1, you will need to use these ideas there as well.
When a variable y depends on another variable x, we can represent the relationship
mathematically using functions. In general we write this as y = f (x), where f is the
rule which allows us to determine the value of y when we input the value of x. Graphs
are diagrammatic representations of such relationships, using coordinates and axes. The
graph of a function y = f (x) is the set of all points in the plane of the form (x, f (x)).
Sketches of graphs can be very useful. To sketch a graph, we begin with the x-axis and
y-axis as shown in Figure 1.1.
We then plot all points of the form (x, f (x)). Therefore, at x units from the origin (the
point where the axes cross), we plot a point whose height above the x-axis (that is,
whose y coordinate) is f (x), as shown in Figure 1.2.
Joining all points together of the form (x, f (x)) results in a curve (or sometimes a
straight line), which is called the graph of f (x). A typical curve might look like that
shown in Figure 1.3.


Figure 1.1: Graph axes.

Figure 1.2: Example of a plotted coordinate.

However, you should not imagine that the correct way to sketch a graph is to plot a few
points of the form (x, f (x)) and join them up – this approach rarely works well in
practice and more sophisticated techniques are needed. There are two function types
which you need to know about for this half course:

linear functions (i.e. the graph of a straight line, see below)

normal functions (which we shall meet frequently in later chapters).

1.11 The graph of a linear function


Linear functions are those of the form:

f (x) = a + bx

and their graphs are straight lines which are characterised by a gradient (or slope), b,
and a y-intercept (where x = 0) at the point (0, a).
A sketch of the function y = 3 + 2x is provided in Figure 1.4, and the function
y = 2 − x is shown in Figure 1.5.


Figure 1.3: The graph of a generic function, y = f (x).

Figure 1.4: A sketch of the linear function y = 3 + 2x.

Figure 1.5: A sketch of the linear function y = 2 − x.
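The intercept and gradient of a linear function are easy to confirm numerically. A Python sketch for y = 3 + 2x (the function name f is for illustration only):

```python
def f(x, a=3, b=2):
    """The linear function y = a + b*x, here with intercept a = 3 and gradient b = 2."""
    return a + b * x

print(f(0))         # y-intercept: the line crosses the y-axis at (0, 3)
print(f(-1.5))      # the line crosses the x-axis at x = -1.5, so y = 0.0
# the gradient b is the change in y per unit change in x
print(f(1) - f(0))  # 2
```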

1.12 The role of statistics in the research process
Before we get into details, let us begin with the ‘big picture’. First, some definitions.

Research: trying to answer questions about the world in a systematic (scientific)
way.

Empirical research: doing research by first collecting relevant information (data)
about the world.

Research may be about almost any topic: physics, biology, medicine, economics, history,
literature etc. Most of our examples will be from the social sciences: economics,
management, finance, sociology, political science, psychology etc. Research in this sense
is not just what universities do. Governments, businesses, and all of us as individuals do
it too. Statistics is used in essentially the same way for all of these.

Example 1.3 It all starts with a question.

Can labour regulation hinder economic performance?

Understanding the gender pay gap: what has competition got to do with it?

Children and online risk: powerless victims or resourceful participants?

Refugee protection as a collective action problem: is the European Union (EU)
shirking its responsibilities?

Do directors perform for pay?

Heeding the push from below: how do social movements persuade the rich to
listen to the poor?

Does devolution lead to regional inequalities in welfare activity?

The childhood origins of adult socio-economic disadvantage: do cohort and
gender matter?

Parental care as unpaid family labour: how do spouses share?

Key stages of the empirical research process

We can think of the empirical research process as having five key stages.

1. Formulating the research question.

2. Research design: deciding what kinds of data to collect, how and from where.

3. Collecting the data.

4. Analysis of the data to answer the research question.

5. Reporting the answer and how it was obtained.

The main job of statistics is the analysis of data, although it also informs other stages
of the research process. Statistics are used when the data are quantitative, i.e. in the
form of numbers.
Statistical analysis of quantitative data has the following features.

It can cope with large volumes of data, in which case the first task is to provide an
understandable summary of the data. This is the job of descriptive statistics.

It can deal with situations where the observed data are regarded as only a part (a
sample) from all the data which could have been obtained (the population). There
is then uncertainty in the conclusions. Measuring this uncertainty is the job of
statistical inference.

We continue with an example of how statistics can be used to help answer a research
question.

Example 1.4 CCTV, crime and fear of crime.

Our research question is what is the effect of closed-circuit television (CCTV)
surveillance on:

the number of recorded crimes?

the fear of crime felt by individuals?

We illustrate this using part of the following study.

Gill, M. and A. Spriggs ‘Assessing the impact of CCTV’, Study 292, Home
Office Research, 2005.

The research design of the study comprised the following.

Target area: a housing estate in northern England.

Control area: a second, comparable housing estate.

Intervention: CCTV cameras installed in the target area but not in the
control area.

Comparison of measures of crime and the fear of crime in the target and
control areas in the 12 months before and 12 months after the intervention.

The data and data collection were as follows.

Level of crime: the number of crimes recorded by the police, in the 12 months
before and 12 months after the intervention.

Fear of crime: a survey of residents of the areas.

• Respondents: random samples of residents in each of the areas.

• In each area, one sample before the intervention date and one about 12
months after.
• Sample sizes:
Before After
Target area 172 168
Control area 215 242

• Question considered here: ‘In general, how much, if at all, do you worry
that you or other people in your household will be victims of crime?’ (from
1 = ‘all the time’ to 5 = ‘never’).
Statistical analysis of the data.

% of respondents who worry ‘sometimes’, ‘often’ or ‘all the time’:

         Target                          Control
  [a]      [b]                   [c]      [d]                          Confidence
  Before   After    Change       Before   After    Change      RES     interval
  26       23       −3           53       46       −7          0.98    (0.55, 1.74)

It is possible to calculate various statistics. For example, the Relative Effect Size,
RES = ([d]/[c])/([b]/[a]) = 0.98, is a summary measure which compares the
changes in the two areas.
RES < 1, which means that the observed change in the reported fear of crime
has been a bit less good in the target area.
However, there is uncertainty because of sampling: only 168 and 242 individuals
were actually interviewed at each time in each area, respectively.
The confidence interval for RES includes 1, which means that changes in the
self-reported fear of crime in the two areas are ‘not statistically significantly
different’ from each other.

The number of (any kind of) recorded crimes:

         Target area                     Control area
  [a]      [b]                   [c]      [d]                          Confidence
  Before   After    Change       Before   After    Change      RES     interval
  112      101      −11          73       88       15          1.34    (0.79, 1.89)

Now the RES > 1, which means that the observed change in the number of
crimes has been worse in the control area than in the target area.
However, the numbers of crimes in each area are fairly small, which means that
these estimates of the changes in crime rates are fairly uncertain.
The confidence interval for RES again includes 1, which means that the changes
in crime rates in the two areas are not statistically significantly different from
each other.

In summary, this study did not support the claim that the introduction of CCTV
reduces crime or the fear of crime.
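To make the arithmetic concrete, the two RES values quoted above can be reproduced from the tables. A Python sketch (the function name and argument order are my own, not from the study):

```python
def relative_effect_size(before_target, after_target, before_control, after_control):
    # RES = ([d]/[c]) / ([b]/[a]): the control area's change ratio
    # relative to the target area's change ratio
    return (after_control / before_control) / (after_target / before_target)

# fear of crime (% who worry at least 'sometimes')
fear_res = relative_effect_size(26, 23, 53, 46)
print(round(fear_res, 2))    # 0.98

# number of recorded crimes
crime_res = relative_effect_size(112, 101, 73, 88)
print(round(crime_res, 2))   # 1.34
```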

If you want to read more about research of this question, see Welsh, B.C. and
D.P. Farrington ‘Effects of closed circuit television surveillance on crime’,
Campbell Systematic Reviews 17 2008, pp. 1–73.

Many of the statistical terms and concepts mentioned above have not been explained yet
– that is what the rest of the course is for! However, it serves as an interesting example
of how statistics can be employed in the social sciences to investigate research questions.

1.13 Overview of chapter

Much of this material should be familiar to you, but some may be new. Although it is
only a language or set of rules to help you deal with statistics, without it you will not be
able to make sense of the following chapters. Before you continue, make sure you have
completed all the worked examples in Appendix A, and understood what you have done.

1.14 Key terms and concepts

Absolute value            Numerator
BODMAS                    Percentage
Denominator               Power
Descriptive statistics    Product
Difference                Quantitative
Empirical                 Quotient
Fraction                  Research
Graph                     Square root
Index of summation        Statistical inference
Inequality                Sum
Linear function           Summation operator
Modulus

1.15 Sample examination questions

1. Suppose that x1 = −0.2, x2 = 2.5, x3 = −3.7, x4 = 0.8, x5 = 7.4, and y1 = −0.2,
y2 = 8.0, y3 = 3.9, y4 = −2.0, y5 = 0. Calculate the following quantities:

   (a) Σ_{i=3}^{5} xi²

   (b) Σ_{i=1}^{2} 1/(xi yi)

   (c) y4³ + Σ_{i=4}^{5} yi²/xi.

2. Suppose that y1 = −2, y2 = −5, y3 = 1, y4 = 16, y5 = 10, and z1 = 8, z2 = −5,
z3 = 6, z4 = 4, z5 = 10. Calculate the following quantities:

   (a) Σ_{i=1}^{3} zi²

   (b) Σ_{i=4}^{5} √(yi zi)

   (c) z4² + Σ_{i=1}^{3} 1/yi.

1.16 Solutions to Sample examination questions

1. (a) We have:

       Σ_{i=3}^{5} xi² = (−3.7)² + (0.8)² + (7.4)² = 13.69 + 0.64 + 54.76 = 69.09.

   (b) We have:

       Σ_{i=1}^{2} 1/(xi yi) = 1/((−0.2) × (−0.2)) + 1/(2.5 × 8.0) = 25 + 0.05 = 25.05.

   (c) We have:

       y4³ + Σ_{i=4}^{5} yi²/xi = (−2.0)³ + ((−2.0)²/0.8 + 0²/7.4) = −8 + 5 = −3.

2. (a) We have:

       Σ_{i=1}^{3} zi² = 8² + (−5)² + 6² = 64 + 25 + 36 = 125.

   (b) We have:

       Σ_{i=4}^{5} √(yi zi) = √(16 × 4) + √(10 × 10) = 8 + 10 = 18.

   (c) We have:

       z4² + Σ_{i=1}^{3} 1/yi = 4² + (−1/2 − 1/5 + 1) = 16 + 0.3 = 16.3.
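If you wish, such answers can be sanity-checked in code. A Python sketch covering both questions (lists index from 0, so y[i − 1] is yi):

```python
import math

# Question 1 data
x = [-0.2, 2.5, -3.7, 0.8, 7.4]
y1 = [-0.2, 8.0, 3.9, -2.0, 0.0]

q1a = sum(x[i - 1]**2 for i in range(3, 6))
q1b = sum(1 / (x[i - 1] * y1[i - 1]) for i in range(1, 3))
q1c = y1[3]**3 + sum(y1[i - 1]**2 / x[i - 1] for i in range(4, 6))

# Question 2 data
y2 = [-2, -5, 1, 16, 10]
z = [8, -5, 6, 4, 10]

q2a = sum(z[i - 1]**2 for i in range(1, 4))
q2b = sum(math.sqrt(y2[i - 1] * z[i - 1]) for i in range(4, 6))
q2c = z[3]**2 + sum(1 / y2[i - 1] for i in range(1, 4))

print(q1a, q1b, q1c)   # approximately 69.09, 25.05, -3
print(q2a, q2b, q2c)   # 125, 18.0, approximately 16.3
```

Small rounding differences in the final decimal places are a normal feature of floating-point arithmetic.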

Chapter 2
Data visualisation and descriptive
statistics

2.1 Synopsis of chapter

This chapter contains two separate but related themes, both to do with the
understanding of data. First, we look at graphical representations for data which allow
us to see their most important characteristics. Second, we calculate simple numbers,
such as the mean or standard deviation, which will summarise those characteristics. In
summary, you should be able to use appropriate diagrams and measures in order to
explain and clarify data which you have collected or which are presented to you.

2.2 Learning outcomes

After completing this chapter, and having completed the essential reading and
activities, you should be able to:

draw and interpret density histograms, stem-and-leaf diagrams and boxplots

incorporate labels and titles correctly in your diagrams and state the units which
you have used

calculate the following: arithmetic mean, median, mode, standard deviation,
variance, quartiles, range and interquartile range

explain the use and limitations of the above quantities.

2.3 Recommended reading

Abdey, J. Business Analytics: Applied Modelling and Prediction. (London: SAGE
Publications, 2023) 1st edition [ISBN 9781529774092] Chapter 2.

2.4 Introduction
Both themes considered in this chapter (data visualisation and descriptive statistics)
could be applied to population data, but in most cases (namely here) they are applied to
a sample. The notation would change slightly if a population was being represented.


Most visual representations are very tedious to construct in practice without the aid of
a computer. However, you will understand much more if you try a few by hand (as is
commonly asked in examinations). You should also be aware that spreadsheets do not
always use correct terminology when discussing and labelling graphs. It is important,
once again, to go over this material slowly and make sure you have mastered the basic
statistical definitions introduced here before you proceed to more theoretical ideas.

2.5 Types of variable


Data1 are obtained on any desired variable. A variable is something which, well, varies!
For quantitative variables, i.e. numerical variables, these can be classified into two types.

Types of quantitative variable

Discrete variables: These have outcomes you can count. Examples include the
number of passengers on a flight and the number of telephone calls received each
day in a call centre. Observed values for these will be 0, 1, 2, . . . (i.e. non-negative
integers).

Continuous variables: These have outcomes you can measure. Examples include
height, weight and time, all of which can be measured to several decimal places,
and typically have units of measurement (such as metres, kilograms and hours).

Many of the problems for which people use statistics to help them understand and make
decisions involve types of variables which can be measured. When we are dealing with a
continuous variable – for which there is a generally recognised method of determining
its value – we can also call it a measurable variable. The numbers which we then
obtain come ready-equipped with an ordered relation, i.e. we can always tell if two
measurements are equal (to the available accuracy) or if one is greater or less than the
other.
Of course, before we do any sort of data analysis, we need to collect data. Chapter 9
will discuss a range of different techniques which can be employed to obtain a sample.
For now, we just consider some simple examples of situations where data might be
collected, such as a:

• pre-election opinion poll asking 1,000 people about their voting intentions

• market research survey asking adults how many hours of television they watch per week

• census interviewer asking parents how many of their children are receiving full-time education (note that a census is the total enumeration of a population, hence this would not be a sample!).
1 Note that the word ‘data’ is plural, but is very often used as if it was singular. You will probably see both forms used when reading widely.


2.5.1 Categorical variables


Qualitative data, often referred to as categorical variables, represent characteristics or
qualities that can be divided into distinct groups or categories. Unlike quantitative data,
which (recall) are numerical, qualitative data are non-numeric and describe attributes or
qualities. Categorical variables can take on different categories or groups, and they are
often used to classify items into specific classes or labels based on shared characteristics.
A polling organisation might be asked to determine whether, say, the political
preferences of voters were in some way linked to their highest level of education – for
example, do graduates tend to be supporters of Party XYZ? In consumer research,
market research companies might be hired to determine whether users were satisfied
with the service they obtained from a business (such as a restaurant) or a department
of local or central government (housing departments being one important example). For
qualitative variables, these can be classified into two types.

Types of qualitative variable

Nominal variables: These have categories with no inherent order or ranking. Examples include colours (such as red, blue, green etc.) and types of fruit (such as apple, banana, orange etc.).

Ordinal variables: These have categories with a meaningful order or ranking but the intervals between them are not consistent. Examples include highest educational level achieved (such as high school, undergraduate, postgraduate) and degree classification (such as first class, upper second class, lower second class etc.).

Example 2.1 Consider the following.

(a) The total number of graduates (in a sample).

(b) The total number of Party XYZ supporters (in a sample).

(c) The number of graduates who support Party XYZ.

(d) The number of Party XYZ supporters who are graduates.

(e) Satisfaction levels of diners at a restaurant.

In cases (a) and (b) we are doing simple counts, within a sample, of a single category
– graduates and Party XYZ supporters, respectively – while in cases (c) and (d) we
are looking at some kind of cross-tabulation between two categorical variables – a
scenario which will be considered in Chapter 8.
There is no obvious and generally recognised way of putting political preferences in
order (in the way that we can certainly say that 1 < 2). It is similarly impossible to
rank (as the technical term has it) many other categories of interest: in combatting
discrimination against people, for instance, organisations might want to look at the
effects of gender, religion, nationality, sexual orientation, disability etc. but the


whole point of combatting discrimination is that different levels of each category
cannot be ranked. Hence these are examples of nominal variables.
In case (e), by contrast, there is a clear ranking: the restaurant would be pleased if
there were lots of people who expressed themselves as being ‘very satisfied’, rather
than merely ‘satisfied’, let alone ‘dissatisfied’ or ‘very dissatisfied’! Hence this is an
ordinal variable.

2.6 Data visualisation


Datasets consist of potentially vast amounts of data. Hedge funds, for example, have
access to very large databases of historical price information on a range of financial
assets, such as so-called ‘tick data’ – very high-frequency intra-day data. Of course, the
human brain cannot easily make sense of such large quantities of numbers when
presented with them on a screen. However, the human brain can cope with visual
representations of data. By producing various plots, we can instantly ‘eyeball’ to get a
bird’s-eye view of the dataset. So, at a glance, we can quickly get a feel for the data and
determine whether there are any interesting features, relationships etc. which could
then be examined in greater depth. In modelling, for example, we often make
distributional assumptions, and a suitable variable plot allows us to easily check the
feasibility of a particular distribution by eye. To summarise, plots are a great medium
for communicating the salient features of a dataset to a wide audience.
The main representations we use in ST104A Statistics 1 are histograms,
stem-and-leaf diagrams and boxplots. We will also use scatterplots to visualise the
relationship, if any, between two measurable variables (covered in Chapter 10).
Note that there are many other representations available from software packages like
Tableau, in particular pie charts and standard bar charts which are appropriate when
dealing with categorical data, although these will not be considered further in this half
course. If interested, you are recommended to study ST2187 Business analytics,
applied modelling and prediction.

2.6.1 Presentational traps

Before we see our first graphical representation you should be aware when reading
articles in newspapers, magazines and even within academic journals, that it is easy to
mislead the reader by careless or poorly-defined diagrams. As such, presenting data
effectively with diagrams requires careful planning.

A good diagram:
• provides a clear summary of the data
• is a fair and honest representation
• highlights underlying patterns
• allows the extraction of a lot of information quickly.


A bad diagram:
• confuses the viewer
• misleads (either accidentally or intentionally).
Advertisers and politicians are notorious for ‘spinning’ data to portray a particular
narrative for their own objectives!

2.6.2 Dot plot


The simplicity of a dot plot makes it an ideal starting point to think about the concept
of a sample distribution. For small datasets, this type of plot is very effective for
seeing the data’s underlying distribution. We use the following procedure.

1. Obtain the range of the dataset (the values spanned by the data), and draw a
horizontal line to accommodate this range.

2. Place dots (hence the name ‘dot plot’!) corresponding to the values above the line,
resulting in the empirical distribution.

Example 2.2 Hourly wage rates (in £) for clerical assistants:

12.20 11.50 11.80 11.60 12.10 11.80 11.60 11.70 11.50
11.60 11.90 11.70 11.60 12.10 11.70 11.80 11.90 12.00

        •
        •     •     •
  •     •     •     •     •           •
  •     •     •     •     •     •     •     •
11.50 11.60 11.70 11.80 11.90 12.00 12.10 12.20

Instantly, some interesting features emerge from the dot plot which are not
immediately obvious from the raw data. For example, most clerical assistants earn
less than £12 per hour and nobody (in the sample) earns more than £12.20 per hour.
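A text dot plot like the one above can also be built programmatically. The following Python sketch is our own illustration (the function name dot_plot is an assumption, not from the subject guide); it stacks one dot per occurrence of each distinct value:

```python
from collections import Counter

def dot_plot(values):
    """Return a text dot plot: one column of dots per distinct value."""
    counts = Counter(values)
    xs = sorted(counts)
    height = max(counts.values())
    lines = []
    for level in range(height, 0, -1):  # draw from the top row down
        row = "".join("  •  " if counts[x] >= level else "     " for x in xs)
        lines.append(row.rstrip())
    lines.append("".join(f"{x:^5}" for x in xs))  # axis labels
    return "\n".join(lines)

# Hourly wage rates from Example 2.2
wages = [12.20, 11.50, 11.80, 11.60, 12.10, 11.80, 11.60, 11.70, 11.50,
         11.60, 11.90, 11.70, 11.60, 12.10, 11.70, 11.80, 11.90, 12.00]
print(dot_plot(wages))
```

Each column's height is simply the frequency of that value, so the plot is the empirical distribution of the sample.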

2.6.3 Histogram
Histograms are excellent diagrams to use when we want to visualise the frequency
distribution of discrete or continuous variables. Our focus will be on how to construct a
density histogram.
Data are first organised into a table which arranges the data into class intervals (also
called bins) – disjoint subdivisions of the total range of values which the variable
takes. Let K denote the number of class intervals. These K class intervals should be
mutually exclusive (meaning they do not overlap, such that each observation belongs to
at most one class interval) and collectively exhaustive (meaning that each observation
belongs to at least one class interval).


Recall that our objective is to represent the distribution of the data. As such, when
choosing K, too many class intervals will dilute the distribution, while too few will
concentrate it (using technical jargon, will tend to degenerate the distribution). Either
2 way, the pattern of the distribution will be lost – defeating the purpose of the
histogram. As a guide, K = 6 or 7 should be sufficient, but remember to always exercise
common sense!
To each class interval, the corresponding frequency is determined, i.e. the number of
observations of the variable which fall within each class interval. Let fk denote the
frequency of class interval k, and let wk denote the width of class interval k, for
k = 1, 2, . . . , K.
The relative frequency of class interval k is rk = fk/n, where n = Σ_{k=1}^{K} fk is the sample
size, i.e. the sum of all the class interval frequencies.
The density of class interval k is dk = rk /wk , and it is this density which is plotted on
the y-axis (the vertical axis). It is preferable to construct density histograms only if
each class interval has the same width.

Example 2.3 Consider the weekly production output of a factory over a 50-week
period (you can choose what the manufactured good is!). Note that this is a discrete
variable since the output will take integer values, i.e. something which we can count.
The data are (in ascending order for convenience):

350 354 354 358 358 359 360 360 362 362
363 364 365 365 365 368 371 372 372 379
381 382 383 385 392 393 395 396 396 398
402 404 406 410 420 437 438 441 444 445
450 451 453 454 456 458 459 460 467 469

We construct the following table, noting that a square bracket ‘[’ includes the class
interval endpoint, while a round bracket ‘)’ excludes the class interval endpoint.

                   Width,   Frequency,   Rel. freq.,    Density,       Cumulative
Class interval       wk         fk       rk = fk/n    dk = rk/wk      frequency
[340, 360)           20          6          0.12         0.006             6
[360, 380)           20         14          0.28         0.014            20
[380, 400)           20         10          0.20         0.010            30
[400, 420)           20          4          0.08         0.004            34
[420, 440)           20          3          0.06         0.003            37
[440, 460)           20         10          0.20         0.010            47
[460, 480)           20          3          0.06         0.003            50

Note that here we have K = 7 class intervals each of width 20, i.e. wk = 20 for
k = 1, 2, . . . , 7. From the raw data, check to see how each of the frequencies, fk , has
been obtained. For example, f1 = 6 represents the first six observations (350, 354,
354, 358, 358 and 359).


We have n = 50, hence the relative frequencies are rk = fk/50 for k = 1, 2, . . . , 7. For
example, r1 = f1/n = 6/50 = 0.12. The density values can then be calculated. For
example, d1 = r1/w1 = 0.12/20 = 0.006.
The table above includes an additional column of ‘Cumulative frequency’, which is
obtained by simply determining the running total of the class frequencies (for
example, the cumulative frequency up to the second class interval is 6 + 14 = 20).
Note the final column is not required to construct a density histogram, although the
computation of cumulative frequencies may be useful when determining medians and
quartiles (to be discussed later in this chapter).
To construct the histogram, adjacent bars are drawn over the respective class
intervals such that the histogram has a total area of one. The histogram for the
above example is shown in Figure 2.1.

Figure 2.1: Density histogram of weekly production output for Example 2.3.
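The frequency, relative frequency, density and cumulative frequency columns of the table can be computed directly from the raw data. A short Python sketch (our own illustration; the variable names are assumptions):

```python
# Weekly production output over 50 weeks, from Example 2.3
data = [350, 354, 354, 358, 358, 359, 360, 360, 362, 362,
        363, 364, 365, 365, 365, 368, 371, 372, 372, 379,
        381, 382, 383, 385, 392, 393, 395, 396, 396, 398,
        402, 404, 406, 410, 420, 437, 438, 441, 444, 445,
        450, 451, 453, 454, 456, 458, 459, 460, 467, 469]

edges = list(range(340, 481, 20))   # endpoints of the class intervals [340, 360), ...
n = len(data)
rows, cumulative = [], 0
for lo, hi in zip(edges[:-1], edges[1:]):
    f = sum(lo <= x < hi for x in data)   # frequency f_k
    r = f / n                             # relative frequency r_k = f_k / n
    d = r / (hi - lo)                     # density d_k = r_k / w_k
    cumulative += f
    rows.append((lo, hi, f, r, d, cumulative))

for lo, hi, f, r, d, c in rows:
    print(f"[{lo}, {hi})  f={f:2d}  r={r:.2f}  d={d:.4f}  cum={c}")
```

Because densities are relative frequencies divided by class widths, the plotted bars have total area one, which is the defining property of a density histogram.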

2.6.4 Stem-and-leaf diagram


A stem-and-leaf diagram uses the raw data. As the name suggests, it is formed using
a ‘stem’ and corresponding ‘leaves’. The choice of the stem involves determining a
major component of an observed value, such as the ‘10s’ unit if the order of magnitude
of the observations were 15, 25, 35 etc., or the integer part if the data are of the order of
magnitude 1.5, 2.5, 3.5 etc. The remainder of the observed value plays the role of the ‘leaf’.
Applied to the weekly production dataset, we obtain the stem-and-leaf diagram shown
below in Example 2.4.


Example 2.4 Continuing with Example 2.3, the stem-and-leaf diagram is:

Stem-and-leaf diagram of weekly production output

Stem (Tens) Leaves (Units)


35 044889
36 0022345558
37 1229
38 1235
39 235668
40 246
41 0
42 0
43 78
44 145
45 0134689
46 079

Note the informative title and labels for the stems and leaves.

For the stem-and-leaf diagram in Example 2.4, note the following points.

• The stems are formed of the ‘10s’ part of the observations.

• Leaves are vertically aligned, hence rotating the stem-and-leaf diagram 90 degrees anti-clockwise reproduces the shape of the data’s distribution, similar to what would be revealed with a density histogram.

• The leaves are placed in ascending order within the stems, so it is a good idea to sort the raw data into ascending order first of all (fortunately the raw data in Example 2.3 were already arranged in ascending order, but for other datasets this may not be the case).

• Unlike the histogram, the actual data values are preserved. This is advantageous if we want to calculate various descriptive statistics later on.
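Splitting each observation into a stem and a leaf is easy to automate. A minimal Python sketch (our own illustration, assuming ‘10s’ stems as in Example 2.4):

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Group each value into a stem (value // stem_unit) and a leaf (remainder)."""
    stems = defaultdict(list)
    for v in sorted(values):              # sort so leaves appear in ascending order
        stems[v // stem_unit].append(v % stem_unit)
    return {s: "".join(str(leaf) for leaf in leaves)
            for s, leaves in sorted(stems.items())}

# Weekly production output from Example 2.3
output = [350, 354, 354, 358, 358, 359, 360, 360, 362, 362,
          363, 364, 365, 365, 365, 368, 371, 372, 372, 379,
          381, 382, 383, 385, 392, 393, 395, 396, 396, 398,
          402, 404, 406, 410, 420, 437, 438, 441, 444, 445,
          450, 451, 453, 454, 456, 458, 459, 460, 467, 469]

for stem, leaves in stem_and_leaf(output).items():
    print(f"{stem:3d} | {leaves}")
```

The printed rows match the diagram in Example 2.4, e.g. stem 35 with leaves 044889.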

So far we have considered how to summarise a dataset visually. This methodology is
appropriate to get a visual feel for the distribution of the dataset. In practice, we would
also like to summarise things numerically. There are two key properties of a dataset
which will be of particular interest.

Key properties of a dataset

• Measures of location – a central point about which the data tend (also known as measures of central tendency).

• Measures of dispersion – a measure of the variability of the data, i.e. how spread out the data are about the central point (also known as measures of spread).


2.7 Measures of location


The mean, median and mode are the three principal measures of location. In general,
these will not all give the same numerical value for a given dataset/distribution.2 These
three measures (and, later, measures of dispersion) will now be introduced using the
following small sample dataset:

32, 28, 67, 39, 19, 48, 32, 44, 37 and 24. (2.1)

2.7.1 Mean
The mean is the preferred measure of location/central tendency, and is simply the
‘average’ of the data. It will be frequently applied in various statistical inference
techniques in later chapters.

(Sample) mean
Using the summation operator, Σ, which remember is just a form of ‘notational
shorthand’, we define the sample mean, x̄, as:

x̄ = (1/n) Σ_{i=1}^{n} xi = (x1 + x2 + · · · + xn)/n.

To note, the notation x̄ will be used to denote an observed sample mean for a sample
dataset, while µ will denote its population counterpart, i.e. the population mean.

Example 2.5 For the dataset in (2.1) above:

x̄ = (1/10) Σ_{i=1}^{10} xi = (32 + 28 + · · · + 24)/10 = 370/10 = 37.

Of course, it is possible to encounter datasets in frequency form, that is each data value
is given with the corresponding frequency of observations for that value, fk , for
k = 1, 2, . . . , K, where there are K different variable values. In such a situation, use the
formula:

x̄ = (Σ_{k=1}^{K} fk xk) / (Σ_{k=1}^{K} fk).    (2.2)

Note that this preserves the idea of ‘adding up all the observations and dividing by the
total number of observations’. This is an example of a weighted mean, where the weights
are the relative frequencies (as seen in the construction of density histograms).
2 These three measures can be the same in special cases, such as the normal distribution (introduced in Chapter 4) which is symmetric about the mean (and so mean = median) and achieves a maximum at this point, i.e. mean = median = mode.


If the data are given in grouped-frequency form, such as that shown in the table in
Example 2.3, then the individual data values are unknown3 – all we know is the class
interval in which each observation lies. The sensible solution is to use the midpoint of
2 the interval as a proxy for each observation recorded as belonging within that class
interval. Hence you still use the grouped-frequency mean formula (2.2), but each xi
value will be substituted with the appropriate class interval midpoint.

Example 2.6 Using the weekly production data in Example 2.3, the interval
midpoints are: 350, 370, 390, 410, 430, 450 and 470, respectively. These will act as
the data values for the respective class intervals. The mean is then calculated as:

x̄ = (Σ_{k=1}^{7} fk xk) / (Σ_{k=1}^{7} fk) = ((6 × 350) + (14 × 370) + · · · + (3 × 470)) / (6 + 14 + · · · + 3) = 400.4.

Compared to the true mean of the raw data (which is 399.72), we see that using the
midpoints as proxies gives a mean very close to the true sample mean value. Note
the mean is not rounded up or down since it is an arithmetic result.
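The grouped-frequency mean in (2.2) is a one-line computation once the midpoints and frequencies are listed. A Python sketch (our own illustration; variable names are assumptions):

```python
# Class interval midpoints and frequencies from Example 2.3
midpoints   = [350, 370, 390, 410, 430, 450, 470]
frequencies = [  6,  14,  10,   4,   3,  10,   3]

n = sum(frequencies)   # total number of observations
# Weighted mean: sum of f_k * x_k divided by sum of f_k, as in formula (2.2)
grouped_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n
print(grouped_mean)   # close to the true raw-data mean of 399.72
```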

A drawback with the mean is its sensitivity to outliers, i.e. extreme observations. For
example, suppose we record the net worth of 10 randomly chosen people. If Elon Musk
(one of the world’s richest people at time of writing), say, was included, his substantial
net worth would pull the mean upward considerably! By increasing the sample size n,
the effect of his inclusion, although diluted, would still be non-negligible, assuming we
were not just sampling from the population of billionaires!

2.7.2 Median
The (sample) median, m, is the middle value of the ordered dataset, where observations
are arranged in ascending order. By definition, 50 per cent of the observations are
greater than or equal to the median, and 50 per cent are less than or equal to the
median.

(Sample) median

Arrange the n numbers in ascending order, x(1) , x(2) , . . . , x(n) , (known as the order
statistics, such that x(1) is the first order statistic, i.e. the smallest observed value,
and x(n) is the nth order statistic, i.e. the largest observed value), then the sample
median, m, depends on whether the sample size is odd or even. If:

• n is odd, then there is an explicit middle value, so m = x((n+1)/2)

• n is even, then there is no explicit middle value, so take the average of the values either side of the ‘midpoint’, hence m = (x(n/2) + x(n/2+1))/2.

3 Of course, we do have the raw data for the weekly production output and so we could work out the exact sample mean, but here suppose we did not have access to the raw data, instead we were just given the table of class interval frequencies as shown in Example 2.3.


Example 2.7 For the dataset in (2.1), the ordered observations are:

19, 24, 28, 32, 32, 37, 39, 44, 48 and 67.

Here n = 10, i.e. there is an even number of observations, so we compute the average
of the fifth and sixth ordered observations, that is:

m = (x(n/2) + x(n/2+1))/2 = (x(5) + x(6))/2 = (32 + 37)/2 = 34.5.

If we only had data in grouped-frequency form (as in Example 2.3), then we can make
use of the cumulative frequencies. Since n = 50, the median is the 25.5th ordered
observation which must lie in the [380, 400) class interval because once we exhaust the
ordered data up to the [360, 380) class interval we have only accounted for the smallest
20 observations, while once the [380, 400) class interval is exhausted we have accounted
for the smallest 30 observations, meaning the median must lie in this class interval.
Assuming the raw data are not accessible, we could use the midpoint (i.e. 390) as
denoting the median. Alternatively, we could use an interpolation method which uses
the following ‘general’ formula for grouped data, once you have identified the class
which includes the median (such as [380, 400) above):
endpoint of previous bin + (bin width × number of remaining observations) / (bin frequency).

Example 2.8 Returning to the weekly production output data from Example 2.3,
the median would be:
380 + (20 × (25.5 − 20))/10 = 391.
For comparison, using the raw data, x(25) = 392 and x(26) = 393, gives the ‘true’
sample median of 392.5.
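The interpolation formula translates directly into a small function. A Python sketch (our own illustration; the function name and parameter names are assumptions):

```python
def grouped_median(lower, width, freq, cum_before, n):
    """Interpolated median for grouped-frequency data.

    lower:      lower endpoint of the class interval containing the median
    width:      width of that class interval
    freq:       frequency of that class interval
    cum_before: cumulative frequency up to (not including) that class interval
    """
    position = (n + 1) / 2   # e.g. the 25.5th ordered observation when n = 50
    # endpoint of previous bin + width * remaining observations / bin frequency
    return lower + width * (position - cum_before) / freq

# Weekly production data of Example 2.8: median lies in [380, 400)
print(grouped_median(lower=380, width=20, freq=10, cum_before=20, n=50))  # 391.0
```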

Although an advantage of the median is that it is not influenced by outliers (Elon
Musk’s net worth would be x(n) and so would not affect the median), in practice it is of
limited use in formal statistical inference.
For symmetric data, the mean and median are always equal. Therefore, this is a simple
way to verify whether a dataset is symmetric. Asymmetric distributions are skewed,
where skewness measures the departure from symmetry. Although you will not be
expected to compute the coefficient of skewness (its numerical value), you need to be
familiar with the two types of skewness.

Skewness

• When mean > median, this indicates a positively-skewed distribution (also referred to as ‘right-skewed’).

• When mean < median, this indicates a negatively-skewed distribution (also referred to as ‘left-skewed’).


Figure 2.2: Different types of skewed distributions (one panel showing a positively-skewed distribution, the other a negatively-skewed distribution).

Graphically, skewness can be determined by identifying where the long ‘tail’ of the
distribution lies. If the long tail is heading toward +∞ (positive infinity) on the x-axis
(i.e. on the right-hand side), then this indicates a positively-skewed (right-skewed)
distribution. Similarly, if the long tail is heading toward −∞ (negative infinity) on the
x-axis (i.e. on the left-hand side) then this indicates a negatively-skewed (left-skewed)
distribution, as illustrated in Figure 2.2.

Example 2.9 The hourly wage rates used in Example 2.2 are skewed to the right,
due to the influence of the relatively large values 12.00, 12.10, 12.10 and 12.20. The
effect of these (similar to Elon Musk’s effect mentioned above, albeit far less extreme
here) is to ‘drag’ or ‘pull’ the mean upward, hence mean > median.

Example 2.10 For the weekly production output data in Example 2.3, we have
calculated the mean and median to be 399.72 and 392.50, respectively. Since the
mean is greater than the median, the data form a positively-skewed distribution, as
confirmed by the histogram in Figure 2.1.
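The mean-versus-median comparison used in Examples 2.9 and 2.10 can be packaged as a quick check. A Python sketch (our own illustration; the function name is an assumption, and this rule of thumb is only an indicator of skewness, not a formal test):

```python
def skew_direction(values):
    """Classify skewness by comparing the sample mean with the sample median."""
    xs = sorted(values)
    n = len(xs)
    mean = sum(xs) / n
    # median: middle value if n odd, average of the two middle values if n even
    median = xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2
    if mean > median:
        return "positively skewed (right)"
    if mean < median:
        return "negatively skewed (left)"
    return "symmetric"

# Hourly wage rates from Example 2.2: mean > median, as noted in Example 2.9
wages = [12.20, 11.50, 11.80, 11.60, 12.10, 11.80, 11.60, 11.70, 11.50,
         11.60, 11.90, 11.70, 11.60, 12.10, 11.70, 11.80, 11.90, 12.00]
print(skew_direction(wages))
```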

2.7.3 Mode
Our final measure of location is the mode.

(Sample) mode

The (sample) mode is the most frequently-occurring value in a (sample) dataset.

It is perfectly possible to encounter a multimodal distribution where several data values


are tied in terms of their frequency of occurrence.

Example 2.11 The modal value of the dataset in (2.1) is 32, since it occurs twice
while the other values only occur once each.

Example 2.12 For the weekly production output data in Example 2.3, looking at
the stem-and-leaf diagram in Example 2.4, we can quickly see that 365 is the modal
value (the three consecutive 5s opposite the second stem stand out). If just given
grouped frequency data, then instead of reporting a modal value we can determine
the modal class interval, which is [360, 380) with 14 observations. (The fact that this
includes 365 here is a coincidence – the modal class interval and modal value are not
equivalent.)
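Counting occurrences also handles the multimodal case mentioned above, where several values tie for the highest frequency. A Python sketch (our own illustration; the function name is an assumption):

```python
from collections import Counter

def modes(values):
    """Return all most frequently occurring values (handles multimodal data)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

sample = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]   # dataset (2.1)
print(modes(sample))   # [32], as in Example 2.11
```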

2.8 Measures of dispersion


The dispersion (or spread) of a dataset is very important when drawing conclusions
from it. Hence it is essential to have a useful measure of this property, and several
candidates exist, which are introduced below. As expected, there are advantages and
disadvantages to each.

2.8.1 Range
Our first measure of spread is the range.

Range

The range is the largest value minus the smallest value, that is:

range = x(n) − x(1) .

Example 2.13 For the dataset in (2.1), the range is:

x(n) − x(1) = 67 − 19 = 48.

Clearly, the range is very sensitive to extreme observations since (when they occur) they
are going to be the smallest and/or largest observations (x(1) and/or x(n) , respectively),
and so this measure is of limited appeal. If we were confident that no outliers were
present (or decided to remove any outliers), then the range would better represent the
true spread of the data.
However, the range motivates our consideration of the interquartile range (IQR) instead.
The IQR is the difference between the upper (third) quartile, Q3, and the lower (first)
quartile, Q1. The upper quartile divides ordered data into the bottom 75% and the top
25%, while the lower quartile divides ordered data into the bottom 25% and the top

75%. Unsurprisingly the median, given our earlier definition, is the middle (second)
quartile, i.e. m = Q2 . By discarding the top 25% and bottom 25% of observations,
respectively, we restrict attention solely to the central 50% of observations.
Interquartile range

The interquartile range (IQR) is defined as:

IQR = Q3 − Q1

where Q3 and Q1 are the third (upper) and first (lower) quartiles, respectively.

Example 2.14 Continuing with the dataset in (2.1), computation of the quartiles
can be problematic since, for example, for the lower quartile we require the value
such that the smallest 2.5 observations are below it and the largest 7.5 observations
are above it. A suggested approach (motivated by the median calculation when n is
even) is to use:
Q1 = (x(2) + x(3))/2 = (24 + 28)/2 = 26.

Similarly:

Q3 = (x(7) + x(8))/2 = (39 + 44)/2 = 41.5.

Hence IQR = Q3 − Q1 = 41.5 − 26 = 15.5. Contrast this with the range of 48
(derived in Example 2.13) which is much larger due to the effects of x(1) and x(n).

There are many different methodologies for computing quartiles, and conventions vary
from country to country, from textbook to textbook, and even from software package to
software package! Any reasonable approach is perfectly acceptable in the examination.
For example, interpolation methods, as demonstrated previously for the case of the
median, are valid. The approach shown in Example 2.14 is the simplest, and so it is
recommended.
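The simple averaging rule of Example 2.14 can be written out explicitly for the n = 10 case. A Python sketch (our own illustration; as the text notes, many quartile conventions exist, so this function mirrors the recommended method for this dataset rather than a general-purpose definition):

```python
def quartiles(values):
    """Lower and upper quartiles via the averaging rule of Example 2.14 (n = 10)."""
    xs = sorted(values)
    q1 = (xs[1] + xs[2]) / 2   # average of x_(2) and x_(3)
    q3 = (xs[6] + xs[7]) / 2   # average of x_(7) and x_(8)
    return q1, q3

sample = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]   # dataset (2.1)
q1, q3 = quartiles(sample)
print(q1, q3, q3 - q1)   # 26.0 41.5 15.5
```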

2.8.2 Boxplot
At this point, it is useful to introduce another graphical method, the boxplot, also
known as a box-and-whisker plot, no prizes for guessing why!
In a boxplot, the middle horizontal line is the median and the upper and lower ends of
the box are the upper and lower quartiles, respectively. The whiskers extend from the
box to the most extreme data points within 1.5 times the IQR from the quartiles. Any
data points beyond the whiskers are considered outliers and are plotted individually.
Sometimes we distinguish between outliers and extreme outliers, with the latter plotted
using a different symbol. An example of a (generic) boxplot is shown in Figure 2.3.
If you are presented with a boxplot, then it is easy to obtain all of the following: the
median, quartiles, IQR, range and skewness. Recall that skewness (the departure from
symmetry) is characterised by a long tail, attributable to outliers, which are readily
apparent from a boxplot.


Figure 2.3: An example of a boxplot (not to scale). Reading from top to bottom, the plot marks: values more than 3 box-lengths from Q3, plotted as ‘x’ (extreme outliers); values more than 1.5 box-lengths from Q3, plotted as ‘o’ (outliers); the largest observed value that is not an outlier (the end of the upper whisker); the box formed by Q3, Q2 (= median) and Q1, within which 50% of cases have values; the smallest observed value that is not an outlier (the end of the lower whisker); and, below the box, outliers and extreme outliers more than 1.5 and 3 box-lengths from Q1, respectively.
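The outlier ‘fences’ implied by Figure 2.3 are simple functions of the quartiles. A Python sketch (our own illustration; the function name is an assumption, using the 1.5 × IQR and 3 × IQR rules described above):

```python
def outlier_fences(q1, q3):
    """Inner (1.5 * IQR) and outer (3 * IQR) fences used to flag outliers."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # beyond these: outliers
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # beyond these: extreme outliers
    return inner, outer

# Quartiles of dataset (2.1) from Example 2.14
inner, outer = outlier_fences(26.0, 41.5)
print(inner, outer)
```

For dataset (2.1), the upper inner fence is 64.75, so the observation 67 would be flagged as an outlier (but not an extreme outlier, since it is within the outer fence of 88).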

Example 2.15 From the boxplot shown in Figure 2.4, it can be seen that the
median, Q2 , is around 74, Q1 is about 63, and Q3 is approximately 77. The many
outliers provide a useful indicator that this is a negatively-skewed distribution as the
long tail covers lower values of the variable. Note also that Q3 − Q2 < Q2 − Q1 ,
which tends to indicate negative skewness.

2.8.3 Variance and standard deviation

The variance and standard deviation are much better and more useful statistics for
representing the dispersion of a dataset. You need to be familiar with their definitions
and methods of calculation for a sample of data values x1 , x2 , . . . , xn .
Begin by computing the so-called ‘corrected sum of squares’, Sxx , the sum of the
squared deviations of each data value from the (sample) mean, where:

Sxx = Σ_{i=1}^{n} (xi − x̄)² = Σ_{i=1}^{n} xi² − nx̄².    (2.3)


Figure 2.4: A boxplot showing a negatively-skewed distribution.

Recall from earlier that x̄ = Σ_{i=1}^{n} xi / n. To see why (2.3) holds:

Sxx = Σ_{i=1}^{n} (xi − x̄)²
    = Σ_{i=1}^{n} (xi² − 2x̄xi + x̄²)    (expansion of quadratic)
    = Σ_{i=1}^{n} xi² − Σ_{i=1}^{n} 2x̄xi + Σ_{i=1}^{n} x̄²    (separating into three summations)
    = Σ_{i=1}^{n} xi² − 2x̄ Σ_{i=1}^{n} xi + nx̄²    (noting that x̄² is a constant added n times)
    = Σ_{i=1}^{n} xi² − 2nx̄² + nx̄²    (substituting Σ_{i=1}^{n} xi = nx̄)
    = Σ_{i=1}^{n} xi² − nx̄²    (simplifying)

which uses the fact that x̄ = Σ_{i=1}^{n} xi / n, and so Σ_{i=1}^{n} xi = nx̄.
We now define the sample variance.

Sample variance

The sample variance, s², is defined as:

s² = Sxx/(n − 1) = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)² = (1/(n − 1)) [Σ_{i=1}^{n} xi² − nx̄²].


Note the divisor used to compute s² is n − 1, not n. Do not worry about why (this is
covered in ST104B Statistics 2) just remember to divide by n − 1 when computing a
sample variance.4 To obtain the sample standard deviation, s, we just take the
(positive) square root of the sample variance, s².
Sample standard deviation

The sample standard deviation, s, is:

s = √(s²) = √(Sxx/(n − 1)).

Example 2.16 Using the dataset in (2.1), x̄ = 37, so:

Sxx = (32 − 37)² + (28 − 37)² + · · · + (24 − 37)² = 25 + 81 + · · · + 169 = 1,698.

Hence s = √(1,698/(10 − 1)) = 13.74.

Note that, given Σ_{i=1}^{n} xi² = 15,388, we could have calculated Sxx using the other
expression:

Sxx = Σ_{i=1}^{n} xi² − nx̄² = 15,388 − 10 × (37)² = 1,698.

So this alternative method is much quicker to calculate Sxx.
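Both forms of Sxx in (2.3) give the same answer, which is easy to confirm numerically. A Python sketch (our own illustration; the function name is an assumption):

```python
def sample_variance(values):
    """Sample variance with the n - 1 divisor, via the corrected sum of squares."""
    n = len(values)
    mean = sum(values) / n
    sxx = sum((x - mean) ** 2 for x in values)             # definitional form
    sxx_alt = sum(x ** 2 for x in values) - n * mean ** 2  # shortcut form
    assert abs(sxx - sxx_alt) < 1e-9                       # the two forms agree
    return sxx / (n - 1)

sample = [32, 28, 67, 39, 19, 48, 32, 44, 37, 24]   # dataset (2.1)
s2 = sample_variance(sample)
print(s2, s2 ** 0.5)   # variance 1,698/9, standard deviation approximately 13.74
```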

When data are given in grouped-frequency form, the sample variance is calculated as
follows.

Sample variance for grouped-frequency data

For grouped-frequency data with K classes, to compute the sample variance we use
the formula:

s² = (Σ_{k=1}^{K} fk (xk − x̄)²) / (Σ_{k=1}^{K} fk) = (Σ_{k=1}^{K} fk xk²) / (Σ_{k=1}^{K} fk) − [(Σ_{k=1}^{K} fk xk) / (Σ_{k=1}^{K} fk)]².

Recall that the last bracketed squared term is simply the mean formula for grouped data
shown in (2.2). Note that for grouped-frequency data we can ignore the ‘divide by n − 1’
rule, since we would expect n to be very large in such cases, such that n − 1 ≈ n and so
dividing by n or n − 1 makes negligible difference in practice, noting that Σ_{k=1}^{K} fk = n.

4 In contrast for population data, the population variance is σ² = Σ_{i=1}^{N} (xi − µ)²/N, i.e. we use the N divisor here, where N denotes the population size while n denotes the sample size. Also, note the use of µ (the population mean) instead of x̄ (the sample mean).


Example 2.17 A stockbroker is interested in the level of trading activity on a
particular stock exchange. They have collected the following data, which are weekly
average volumes (in millions), over a 29-week period. This is an example of time
series data. Note that this variable is treated as if it was discrete, but because the
numbers are so large the variable can be treated as continuous.

172.5 154.6 163.5


161.9 151.6 172.6
172.3 132.4 168.3
181.3 144.0 155.3
169.1 133.6 143.4
155.0 149.0 140.6
148.6 135.8 125.1
159.8 139.9 171.3
161.6 164.4 167.0
153.8 175.6

To construct a density histogram we first decide on the number of class intervals, K,


which is a subjective decision. The objective is to convey information in a useful
way. In this case the data lie between (roughly) 120 and 190 million shares/week, so
class intervals of width 10 million will give K = 7 classes.
With almost 30 observations this choice is probably adequate; more observations
might support more class intervals (with widths of 5 million, say); fewer observations
would, perhaps, need a larger class interval of width 20 million.
Therefore, the class intervals are defined like this:

120 ≤ volume < 130, 130 ≤ volume < 140 etc.

or, alternatively, [120, 130), [130, 140) etc. We now proceed to determine the density
values to plot (and cumulative frequencies, for later). We construct the following
table:

Class interval   Interval width, wk   Frequency, fk   Relative frequency, rk = fk/n   Density, dk = rk/wk   Midpoint, xk   fk xk    fk xk²
[120, 130)       10                   1               0.0345                          0.00345               125            125      15,625
[130, 140)       10                   4               0.1379                          0.01379               135            540      72,900
[140, 150)       10                   5               0.1724                          0.01724               145            725      105,125
[150, 160)       10                   6               0.2069                          0.02069               155            930      144,150
[160, 170)       10                   7               0.2414                          0.02414               165            1,155    190,575
[170, 180)       10                   5               0.1724                          0.01724               175            875      153,125
[180, 190)       10                   1               0.0345                          0.00345               185            185      34,225
Total                                 29                                                                                   4,535    715,725

The density histogram is as shown in Figure 2.5.


We now use the grouped-frequency data to compute particular descriptive statistics,
specifically the mean, variance and standard deviation.

Using the grouped-frequency data, the sample mean is:

x̄ = (Σ_{k=1}^{7} fk xk) / (Σ_{k=1}^{7} fk) = 4,535/29 = 156.4

and the sample variance is:

s² = (Σ_{k=1}^{7} fk xk²) / (Σ_{k=1}^{7} fk) − ((Σ_{k=1}^{7} fk xk) / (Σ_{k=1}^{7} fk))² = 715,725/29 − (156.4)² = 219.2

giving a standard deviation of √219.2 = 14.8. For comparison, the ungrouped mean,
variance and standard deviation are 156.0, 217.0 and 14.7, respectively (compute
these yourself to verify!).

Note the units for the mean and standard deviation are ‘millions of shares/week’,
while the units for the variance are the square of these, i.e. ‘(millions of
shares/week)²’. This is one obvious reason why we often work with the standard
deviation rather than the variance: it is in the original (and more meaningful) units.

Figure 2.5: Density histogram of trading volume data for Example 2.17.
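As a numerical check on Example 2.17, the grouped-frequency mean, variance and standard deviation can be reproduced in a few lines of Python. Like the calculation above, this sketch squares the mean after rounding it to one decimal place (156.4), which is why the variance matches the quoted 219.2:

```python
midpoints = [125, 135, 145, 155, 165, 175, 185]
freqs     = [  1,   4,   5,   6,   7,   5,   1]

n = sum(freqs)                                              # 29 weeks
sum_fx  = sum(f * x for f, x in zip(freqs, midpoints))      # 4,535
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))  # 715,725

mean = sum_fx / n                        # 156.4 to one decimal place
# As in the guide, the mean is rounded to one decimal place before squaring:
var = sum_fx2 / n - round(mean, 1) ** 2  # 219.2
sd = var ** 0.5                          # 14.8 to one decimal place

print(round(mean, 1), round(var, 1), round(sd, 1))
```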


2.9 Test your understanding

Let us now consider an extended example bringing together many of the issues
considered in this chapter.
At a time of economic growth but political uncertainty, a random sample of n = 40
economists (from the population of all economists) produces the following forecasts for
the growth rate of an economy in the next year:

1.3 3.8 4.1 2.6 2.4 2.2 3.4 5.1 1.8 2.7
3.1 2.3 3.7 2.5 4.1 4.7 2.2 1.9 3.6 2.8
4.3 3.1 4.2 4.6 3.4 3.9 2.9 1.9 3.3 8.2
5.4 3.3 4.5 5.2 3.1 2.5 3.3 3.4 4.4 5.2

(a) Draw a density histogram for these data.


(b) Construct a stem-and-leaf diagram for these data.
(c) Comment on the shape of the sample distribution.
(d) Determine the median of the data using the stem-and-leaf diagram in (b).
(e) Produce a boxplot for these data.
(f) Using the following summary statistics, calculate the sample mean and sample
standard deviation of the growth rate forecasts:
Sum of data = Σ_{i=1}^{40} xi = 140.4 and Sum of squares of data = Σ_{i=1}^{40} xi² = 557.26.

(g) Comment on the relative values of the mean and median.


(h) What percentage of the data fall within one sample standard deviation of the
sample mean? And what percentage fall within two sample standard deviations of
the sample mean?

Solution:

(a) It would be sensible to have class interval widths of 1 unit, which conveniently
makes the density values the same as the relative frequencies! We construct the
following table and plot the density histogram.

Interval Relative
width, Frequency, frequency, Density,
Class interval wk fk rk = fk /n dk = rk /wk
[1.0, 2.0) 1 4 0.100 0.100
[2.0, 3.0) 1 10 0.250 0.250
[3.0, 4.0) 1 13 0.325 0.325
[4.0, 5.0) 1 8 0.200 0.200
[5.0, 6.0) 1 4 0.100 0.100
[6.0, 7.0) 1 0 0.000 0.000
[7.0, 8.0) 1 0 0.000 0.000
[8.0, 9.0) 1 1 0.025 0.025
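The frequency and density columns in this table can be generated programmatically; a minimal Python sketch (the class width is 1, so each density equals the relative frequency):

```python
forecasts = [1.3, 3.8, 4.1, 2.6, 2.4, 2.2, 3.4, 5.1, 1.8, 2.7,
             3.1, 2.3, 3.7, 2.5, 4.1, 4.7, 2.2, 1.9, 3.6, 2.8,
             4.3, 3.1, 4.2, 4.6, 3.4, 3.9, 2.9, 1.9, 3.3, 8.2,
             5.4, 3.3, 4.5, 5.2, 3.1, 2.5, 3.3, 3.4, 4.4, 5.2]

n, width = len(forecasts), 1.0
classes = [(lo, lo + 1) for lo in range(1, 9)]              # [1,2), ..., [8,9)
freqs = [sum(lo <= x < hi for x in forecasts) for lo, hi in classes]
densities = [f / n / width for f in freqs]                  # relative freq / width

for (lo, hi), f, d in zip(classes, freqs, densities):
    print(f"[{lo}, {hi})  frequency={f:2d}  density={d:.3f}")
```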


(b) A stem-and-leaf diagram for the data is:

Stem-and-leaf diagram of economic growth forecasts

Stem (%) Leaves (0.1%)


1 3899
2 2234556789
3 1113334446789
4 11234567
5 1224
6
7
8 2

Note that we still show the ‘6’ and ‘7’ stems even though they have no
corresponding leaves. If we omitted these stems (so that the ‘8’ stem is immediately
below the ‘5’ stem) then this would distort the true shape of the sample
distribution, which would be misleading.
(c) The density histogram and stem-and-leaf diagram show that the data are
positively-skewed (skewed to the right), due to the outlier forecast of 8.2%.
Note if you are ever asked to comment on the shape of a distribution, consider:
• Is the distribution (roughly) symmetric?
• Is the distribution bimodal?
• Is the distribution skewed (an elongated tail in one direction)? If so, what is
the direction of the skewness?
• Are there any outliers?


(d) There are n = 40 observations, so the median is the average of the 20th and 21st
ordered observations. Using the stem-and-leaf diagram in part (b), we see that
x(20) = 3.3 and x(21) = 3.4. Therefore, the median is (3.3 + 3.4)/2 = 3.35%.
(e) Since Q2 is the median, which is 3.35, we now need the first and third quartiles, Q1
and Q3 , respectively. There are several methods for determining the quartiles, and
any reasonable approach would be acceptable in an examination. For simplicity,
here we will use the following since n is divisible by 4:
Q1 = x(n/4) = x(10) = 2.5% and Q3 = x(3n/4) = x(30) = 4.2%.
Hence the interquartile range (IQR) is Q3 − Q1 = 4.2 − 2.5 = 1.7%. Therefore, the
whisker limits must satisfy:
max(x(1) , Q1 − 1.5 × IQR) and min(x(n) , Q3 + 1.5 × IQR)
which is:
max(1.3, −0.05) = 1.30 and min(8.2, 6.75) = 6.75.
We see that there is just a single observation which lies outside the interval
[1.30, 6.75], which is x(40) = 8.2% and hence this is plotted individually in the
boxplot. Since this is less than Q3 + 3 × IQR = 4.2 + 3 × 1.7 = 9.3%, then this
observation is an outlier, rather than an extreme outlier.
The boxplot is (a horizontal orientation is also fine):

Note that the upper whisker terminates at 5.4, which is the most extreme data
point within 1.5 times the IQR above Q3 , i.e. the maximum value no larger than
6.75% as easily seen from the stem-and-leaf diagram in part (b). The lower whisker
terminates at x(1) = 1.3%, since the minimum value of the dataset is within 1.5
times the IQR below Q1 .
It is important to note that boxplot conventions may vary, and some software or
implementations might use slightly different methods for calculating whiskers.
Additionally, different multipliers (other than 1.5) might be used in practice
depending on the desired sensitivity to outliers.
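The quartiles, fences and whisker ends above can be computed directly; a minimal Python sketch using the same simple convention Q1 = x(n/4) and Q3 = x(3n/4) (other quartile conventions give slightly different values):

```python
forecasts = [1.3, 3.8, 4.1, 2.6, 2.4, 2.2, 3.4, 5.1, 1.8, 2.7,
             3.1, 2.3, 3.7, 2.5, 4.1, 4.7, 2.2, 1.9, 3.6, 2.8,
             4.3, 3.1, 4.2, 4.6, 3.4, 3.9, 2.9, 1.9, 3.3, 8.2,
             5.4, 3.3, 4.5, 5.2, 3.1, 2.5, 3.3, 3.4, 4.4, 5.2]

data = sorted(forecasts)
n = len(data)                                        # 40
q1, q3 = data[n // 4 - 1], data[3 * n // 4 - 1]      # x(10) = 2.5, x(30) = 4.2
q2 = (data[n // 2 - 1] + data[n // 2]) / 2           # median = 3.35
iqr = q3 - q1                                        # 1.7
lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # -0.05 and 6.75
whisker_low = min(x for x in data if x >= lo_fence)    # 1.3
whisker_high = max(x for x in data if x <= hi_fence)   # 5.4
outliers = [x for x in data if x < lo_fence or x > hi_fence]  # [8.2]
```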


(f) We have sample data, not population data, hence the (sample) mean is denoted by
x̄ and the (sample) standard deviation is denoted by s. We have:

x̄ = (1/n) Σ_{i=1}^{n} xi = 140.4/40 = 3.51%

and:

s² = (1/(n − 1)) (Σ_{i=1}^{n} xi² − n x̄²) = (557.26 − 40 × (3.51)²)/39 = 1.6527.

Therefore, the standard deviation is s = √1.6527 = 1.29%.

(g) In (c) it was concluded that the density histogram and stem-and-leaf diagram of
the data were positively-skewed, and this is consistent with the mean being larger
than the median. It is possible to quantify skewness, although this is beyond the
scope of the syllabus.

(h) We calculate:

x̄ − s = 3.51 − 1.29 = 2.22 and x̄ + s = 3.51 + 1.29 = 4.80

also:

x̄ − 2 × s = 3.51 − 2 × 1.29 = 0.93 and x̄ + 2 × s = 3.51 + 2 × 1.29 = 6.09.

Now we use the stem-and-leaf diagram to see that 29 observations are between 2.22
and 4.80 (i.e. the interval [2.22, 4.80]), and 39 observations are between 0.93 and
6.09 (i.e. the interval [0.93, 6.09]). So the proportion (or percentage) of the data in
each interval, respectively, is:
29/40 = 0.725 = 72.5% and 39/40 = 0.975 = 97.5%.
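Parts (f) and (h) can be verified together from the summary statistics and the raw forecasts; a minimal Python sketch (the interval endpoints use the mean and standard deviation rounded to two decimal places, 3.51 and 1.29, as in the solution):

```python
sum_x, sum_x2, n = 140.4, 557.26, 40
xbar = sum_x / n                                  # 3.51
s = ((sum_x2 - n * xbar ** 2) / (n - 1)) ** 0.5   # 1.29 to 2 decimal places

forecasts = [1.3, 3.8, 4.1, 2.6, 2.4, 2.2, 3.4, 5.1, 1.8, 2.7,
             3.1, 2.3, 3.7, 2.5, 4.1, 4.7, 2.2, 1.9, 3.6, 2.8,
             4.3, 3.1, 4.2, 4.6, 3.4, 3.9, 2.9, 1.9, 3.3, 8.2,
             5.4, 3.3, 4.5, 5.2, 3.1, 2.5, 3.3, 3.4, 4.4, 5.2]

m, sd = round(xbar, 2), round(s, 2)               # 3.51 and 1.29, as in (h)
within_1sd = sum(m - sd <= x <= m + sd for x in forecasts)        # 29
within_2sd = sum(m - 2*sd <= x <= m + 2*sd for x in forecasts)    # 39
print(within_1sd / n, within_2sd / n)             # 0.725 0.975
```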

Some general points to note are the following.

• Many ‘bell-shaped’ distributions we meet – that is, distributions which look a bit
like the normal distribution (introduced in Chapter 4) – have the property that
68% of the data lie within approximately one standard deviation of the mean, and
95% of the data lie within approximately two standard deviations of the mean. The
percentages in (h) are fairly similar to these.

• The exercise illustrates the importance of (at least) one more decimal place than in
the original data. If we had 3.5% and 1.3% for the mean and standard deviation,
respectively, the ‘boundaries’ for the interval with one standard deviation would
have been 3.5 ± 1.3 ⇒ [2.2, 4.8]. Since 2.2 is a data value which appears twice, we
would have had to worry about which side of the ‘boundary’ to allocate these.
(This type of issue can still happen with the extra decimal place, but much less
frequently.)

• When constructing a histogram, it is possible to ‘lose’ a pattern in the data – for


example, an approximate bell shape – through two common errors:


• too few class intervals (which is the same as too wide class intervals)
• too many class intervals (which is the same as too narrow class intervals).
For example, with too many class intervals, you mainly get 0, 1 or 2 items per class
interval, so any (true) peak is hidden by the subdivisions which you have used.
• The best number of (equal-sized) class intervals depends on the sample size. For
large samples, many class intervals will not lose the pattern, while for small
samples they will. However, with the datasets which tend to crop up in ST104A
Statistics 1, somewhere between 6 and 10 class intervals are likely to work well.

2.10 Overview of chapter


In statistical analysis, there are usually simply too many numbers to make sense of just
by staring at them. Data visualisation and descriptive statistics attempt to summarise
key features of the data to make them understandable and easy to communicate. The
main function of diagrams is to bring out interesting features of a dataset visually by
displaying its distribution, i.e. summarising the whole sample distribution of a variable.
Descriptive statistics allow us to summarise one feature of the sample distribution in a
single number. In this chapter we have worked with measures of central tendency,
measures of dispersion and skewness.

2.11 Key terms and concepts


Boxplot
Categorical variable
Continuous
Density histogram
Discrete
Dot plot
Interquartile range (IQR)
Mean (of sample)
Measurable variable
Measures of dispersion
Measures of location
Median (of sample)
Mode (of sample)
Nominal
Ordinal
Range
Sample distribution
Skewness
Standard deviation (of sample)
Stem-and-leaf diagram
Variable
Variance (of sample)

2.12 Sample examination questions


1. Classify each one of the following variables as either measurable (continuous) or
categorical. If a variable is categorical, further classify it as nominal or ordinal.
Justify your answer.
(a) Gross domestic product (GDP) of a country.
(b) Five possible responses to a customer satisfaction survey ranging from ‘very
satisfied’ to ‘very dissatisfied’.
(c) A person’s name.


2. The data below contain measurements of the low-density lipoproteins, also known
as the ‘bad’ cholesterol, in the blood of 30 patients. Data are measured in
milligrams per decilitre (mg/dL).
95 96 96 98 99
99 101 101 102 102
103 104 104 107 107
111 112 113 113 114
115 117 121 123 124
127 129 131 135 143

(a) Construct a density histogram of the data.


(b) Find the mean (given that the sum of the data is 3,342), the median and the
standard deviation (given that the sum of the squared data, Σ xi², is 377,076).
(c) Comment on the data given the shape of the histogram.

3. The average daily intakes of calories, measured in kcals, for a random sample of 12
athletes were:

1,808, 1,936, 1,957, 2,004, 2,009, 2,101, 2,147, 2,154, 2,200, 2,231, 2,500, 3,061.

(a) Construct a boxplot of the data. (The boxplot does not need to be exactly to
scale, but values of box properties and whiskers should be clearly labelled.)
(b) Based on the shape of the boxplot you have drawn, describe the distribution of
the data.
(c) Name two other types of graphical displays which would be suitable to
represent the data. Briefly explain your choices.

2.13 Solutions to Sample examination questions


1. A general tip for identifying measurable and categorical variables is to think of the
possible values they can take. If these are finite and represent specific entities the
variable is categorical. Otherwise, if these consist of numbers corresponding to
measurements, the data are continuous and the variable is measurable. Such
variables may also have measurement units or can be measured to various decimal
places.
(a) Measurable, because GDP can be measured in $bn or $tn to several decimal
places.
(b) Each satisfaction level corresponds to a category. The level of satisfaction is in
a ranked order – for example, in terms of the list items provided. Therefore,
this is a categorical ordinal variable.
(c) Each name (James, Jane etc.) is a category. Also, there is no natural ordering
between the names – for example, we cannot really say that ‘James is higher
than Jane’. Therefore, this is a categorical nominal variable.


2. (a) We have:

Interval Relative
width, Frequency, frequency, Density,
Class interval wk fk rk = fk /n dk = rk /wk
[90, 100) 10 6 0.200 0.0200
[100, 110) 10 9 0.300 0.0300
[110, 120) 10 7 0.233 0.0233
[120, 130) 10 5 0.167 0.0167
[130, 140) 10 2 0.067 0.0067
[140, 150) 10 1 0.033 0.0033

(b) We have:

x̄ = 3,342/30 = 111.4 mg/dL

also median = 109 mg/dL, and the standard deviation is:

s = √((1/29) × (377,076 − 30 × (111.4)²)) = 12.83 mg/dL.

(c) The data exhibit positive skewness, as shown by the mean being greater than
the median.
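The figures in part (b) can be checked directly from the raw data; a minimal Python sketch:

```python
ldl = [95, 96, 96, 98, 99, 99, 101, 101, 102, 102,
       103, 104, 104, 107, 107, 111, 112, 113, 113, 114,
       115, 117, 121, 123, 124, 127, 129, 131, 135, 143]

n = len(ldl)                                   # 30 patients
mean = sum(ldl) / n                            # 111.4 mg/dL
mid = sorted(ldl)
median = (mid[n // 2 - 1] + mid[n // 2]) / 2   # 109 mg/dL (average of 15th, 16th)
s = ((sum(x ** 2 for x in ldl) - n * mean ** 2) / (n - 1)) ** 0.5  # 12.83 mg/dL
```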


3. (a) Depending on quartile calculation methods, there may be slight variations in
the computed values of Q1 and Q3. However, the boxplot should look (very)
similar to:

Note that no label of the x-axis is necessary and that the plot can be
transposed.
(b) Based on the shape of the boxplot above, we can see that the distribution of
the data is positively skewed, equivalently skewed to the right, due to the
presence of the outlier of 3,061 kcals.
(c) A density histogram, stem-and-leaf diagram or a dot plot are other types of
suitable graphical displays. The reason is that the variable is measurable and
these graphs are suitable for displaying the distribution of such variables.

Chapter 3
Probability theory
3.1 Synopsis of chapter
The world is full of unpredictability. Will a country’s economic growth increase or
decrease next year? Will artificial intelligence replace the majority of human jobs in a
particular sector in the next decade? What will be the cost of borrowing next month?
These are some instances of uncertainty. While we can anticipate potential outcomes
which could happen (like economic growth increasing, decreasing, or no change), we do
not know with certainty in advance what will happen. Probability allows us to model
uncertainty and in this chapter we explore probability theory.
In other courses, particularly in ST104B Statistics 2 and ST2187 Business
analytics, applied modelling and prediction, you will make full use of probability
in both theory and in decision trees, and highlight the ways in which such information
can be used. We will look at probability at quite a superficial level in this half course.
Even so, you may find that, although the concepts of probability introduced are simple,
their application in particular circumstances may be very difficult.

3.2 Learning outcomes


After completing this chapter, and having completed the essential reading and
activities, you should be able to:

apply the ideas and notation used for sets in simple examples

recall the basic axioms of probability and apply them

draw and use appropriate Venn diagrams

recall some common probability results

distinguish between the ideas of conditional probability and independence.

3.3 Recommended reading

Abdey, J. Business Analytics: Applied Modelling and Prediction. (London: SAGE


Publications, 2023) 1st edition [ISBN 9781529771092] Chapter 4.


3.4 Introduction
Chance is what makes life worth living – if everything was known in advance, imagine
the disappointment! If we had perfect information about the future, as well as the
present and the past, there would be no need to consider the concepts of probability.
However, it is usually the case that uncertainty cannot be eliminated and hence its
presence should be recognised and attempts made to quantify it.
Probability theory is used to determine how likely various events are to occur, such
as:

the probability of a specific market trend occurring during a financial analysis

the likelihood of selecting a candidate with specific skills during the hiring process

when analysing financial data, the probability of observing a particular pattern in


stock prices

the possibility of a project meeting its deadlines and milestones based on historical
data.

3.5 The concept of probability


Probability forms the bedrock on which statistical methods seen later in the course are
based. This chapter will be devoted to understanding this important concept, its
properties and applications.
One can view probability as a quantifiable measure of one’s degree of belief in a
particular event or set (they mean the same thing) of interest. To motivate the use of
the terms ‘event’ and ‘set’, we begin by introducing the concept of an experiment. An
experiment can take many forms, but to keep things simple let us consider two basic
examples.

1. The toss of a (fair) coin.

2. The roll of a (fair) die.

Sample space

We define the sample space, S, as the set of all possible outcomes of an experiment.

Example 3.1 For our two examples, we have the following.

1. Coin toss: S = {H, T }, where H and T denote ‘heads’ and ‘tails’, respectively,
and are called the elements or members of the sample space.

2. Die score: S = {1, 2, 3, 4, 5, 6}.


So the coin toss sample space has two elementary outcomes, H and T , while the score
on a die has six elementary outcomes. These individual elementary outcomes are
themselves events, but we may wish to consider slightly more exciting events of interest.
For example, for the die score, we may be interested in the event of obtaining an even
score, or a score greater than 4 etc. Hence we proceed to define an event.

Event
3
An event is a collection of elementary outcomes from the sample space S of an
experiment which is a subset of S.

Typically, we can denote events by letters for brevity. For example, A = ‘an even score’,
and B = ‘a score greater than 4’. Hence A = {2, 4, 6} and B = {5, 6}.
The universal convention is that we define probability to lie on a scale from 0 to 1
inclusive. We could, of course, multiply by 100% to express a probability as a
percentage. This means that the probability of any event A is denoted P (A) and is a
real number somewhere in the unit interval, i.e. we always have that:
0 ≤ P (A) ≤ 1.
Note the following.

If A is an impossible event, then P (A) = 0.


If A is a certain event, then P (A) = 1.
For events A and B, if P (A) < P (B), then A is less likely to occur than B.

So, we have a probability scale from 0 to 1 on which we are able to rank events, as
evident from the P (A) < P (B) result above. However, we need to consider how best to
quantify these probabilities. Let us begin with the experiments where each elementary
outcome is equally likely, hence our (fair) coin toss and (fair) die score fulfil this
criterion (conveniently).

Determining event probabilities for equally likely elementary outcomes

For an experiment with equally likely elementary outcomes:

let N be the total number of equally likely elementary outcomes

let n be the number of these elementary outcomes which are favourable to our
event of interest, A.

Then:

P (A) = n/N.

Example 3.2 We continue with Example 3.1.

1. For the coin toss, if A is the event ‘heads’, then N = 2 (H and T ) and n = 1
(H). So, for a fair coin, P (A) = 1/2 = 0.50.


2. For the die score, if A is the event ‘an even score’, then N = 6 (1, 2, 3, 4, 5 and
6) and n = 3 (2, 4 and 6). So, for a fair die, P (A) = 3/6 = 1/2 = 0.5.
Finally, if B is the event ‘score greater than 4’, then N = 6 (as before) and
n = 2 (5 and 6). Hence P (B) = 2/6 = 1/3.

Example 3.3 Consider a business scenario where a company has a workforce of


200 employees, and 70 of them have specific skills required for a special project. In
this case, let us randomly select one employee, and the event of interest is choosing
an employee with the required skills. Here, our values are n = 70 (the number of
employees with the required skills) and N = 200 (the total number of employees).
The probability of selecting an employee with the necessary skills can be expressed
as:

P (skillful employee) = 70/200 = 0.35.

3.6 Relative frequency


So far, we have only considered equally likely experimental outcomes. Clearly, to apply
probabilistic concepts more widely we require a more general interpretation, known as
the relative frequency interpretation.

Relative frequency approach to probability

Suppose the event A associated with some experiment either does or does not occur.
Also, suppose we conduct this experiment independently F times. (‘Independent’ is an
important term, discussed later.) Suppose that, following these repeated experiments,
the event A occurs f times.
The ‘frequentist’ approach to probability would regard:

P (A) = f/F

as F → ∞.

Example 3.4 For a coin toss with event A = {H}, if the coin is fair we would
expect that repeatedly tossing the coin F times would result in approximately
f = F/2 heads, hence P (A) = (F/2)/F = 1/2. Of course, this approach is not
confined to fair coins!

Example 3.5 Consider an analyst investigating the probability of a specific stock


reaching a certain price level, denoted as P (A). The analyst randomly examines the
historical daily closing prices of 10,000 stock market days and finds that on 2,023 of
those days, the stock reached the target price.


To estimate the probability using the relative frequency approach, the analyst
divides the number of favourable outcomes (days when the stock reached the target
price) by the total number of outcomes (total days examined). Calculating, we have:

P (A) = 2,023/10,000 = 0.2023 ≈ 0.20.

So, the estimated probability of the stock reaching the specified price level is
approximately 0.20, or 20%.

Example 3.6 Consider a business management scenario where you are


investigating the probability of successfully implementing a new project management
methodology, denoted as P (A). You conduct the implementation experiment 500
times and observe successful project completions 85 times.
To estimate the probability using the relative frequency approach, you divide the
number of successful outcomes (completed projects) by the total number of trials
(implementation attempts). Calculating, we have:
P (A) = 85/500 = 0.17.
The estimated probability of successfully implementing the new project management
methodology is 0.17, or 17%.
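The frequentist idea, that the relative frequency f/F settles down as F grows, can be illustrated by simulation; a minimal Python sketch (the seed is arbitrary, chosen only for reproducibility):

```python
import random

random.seed(42)  # arbitrary seed, for reproducibility only

def relative_frequency(trials):
    """Estimate P(heads) for a fair coin as f/F over `trials` tosses."""
    heads = sum(random.random() < 0.5 for _ in range(trials))
    return heads / trials

# The estimate settles towards 0.5 as the number of tosses F grows.
for F in (10, 1_000, 100_000):
    print(F, relative_frequency(F))
```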

Intuitively, this is an appealing interpretation and is extremely useful when we come to


its use in statistical inference later on in the course.

3.7 ‘Randomness’
Statistical inference is concerned with the drawing of conclusions from data which are
subject to randomness, perhaps due to the sampling procedure, perhaps due to
observational errors, perhaps for some other reason.
Let us stop and think why, when we repeat an experiment under apparently identical
conditions, we get different results.
The answer is that although the conditions may be as identical as we are able to control
them to be, there will inevitably be a large number of uncontrollable (and frequently
unknown) variables which we do not measure and which have a cumulative effect on the
result of the sample or experiment. For example, weather conditions may affect the
outcomes of biological or other ‘field’ experiments.
Therefore, the cumulative effect is to cause variation in our results. It is this variation
which we term randomness and, although we never fully know the true generating
mechanism for our data, we can take the random component into account via the
concept of probability, which is, of course, why probability plays such an important role
in data analysis.


3.8 Properties of probability


We begin this section by presenting three simple, self-evident truths known as axioms
which list the basic properties we require of event probabilities.

Axioms of probability

1. For any event A, P (A) ≥ 0.

2. For the sample space S, P (S) = 1.

3. If {Ai}, i = 1, 2, . . . , n, are mutually exclusive events, then the probability of
   their ‘union’ is the sum of their respective probabilities, i.e. we have:

   P (A1 ∪ A2 ∪ · · · ∪ An) = P (A1) + P (A2) + · · · + P (An).

The first two axioms should not be surprising. The third may appear quite difficult.
Events are called mutually exclusive when they cannot both occur simultaneously.

Example 3.7 When rolling a die once, the event A = ‘obtain an even score’ and
the event B = ‘obtain an odd score’ are mutually exclusive.

Example 3.8 A market researcher is analysing consumer choices between two


substitute products, Product A and Product B. The events A = ‘consumer chooses
Product A’ and B = ‘consumer chooses Product B’ are mutually exclusive because a
consumer cannot choose both products simultaneously since they are substitutes.

Example 3.9 Suppose you are managing a manufacturing process involving two
alternative production methods, Method A and Method B, and are interested in
whether Method A or Method B is more effective for a specific task. If the
production methods are such that a particular task can only be completed using one
method at a time, then the events A = ‘Method A is employed’ and B = ‘Method B
is employed’ are mutually exclusive.

All the above examples highlight the concept of mutually exclusive events, where the
occurrence of one event prevents the occurrence of the other event.
Extending this, a collection of events is pairwise mutually exclusive if no two events can
occur simultaneously. For instance the three events A, B and C are pairwise mutually
exclusive if A and B cannot occur together and B and C cannot occur together and A
and C cannot occur together. Another way of putting this is that a collection of events
is pairwise mutually exclusive if at most one of them can occur.
Related to this is the concept of a collection of events being collectively exhaustive.
This means at least one of them must occur, i.e. all possible experimental outcomes are
included among the collection of events.


3.8.1 Notational vocabulary


Axiom 3 above introduced a new symbol, ∪. For the remainder of this chapter, various
symbols connecting sets will be used as a form of notational shorthand. It is important
to be familiar with these symbols, hence two versions are provided – one informal, and
one formal.

Symbol   Informal version   Formal version    Example

∪        or                 union             A ∪ B = ‘A union B’
∩        and                intersect         A ∩ B = ‘A intersect B’
c        not                complement of     Ac = ‘complement of A’
|        given              conditional on    A | B = ‘A conditional on B’

Also, do make sure that you distinguish between a set and the probability of a set. This
distinction is important. A set, remember, is a collection of elementary outcomes from
S, whereas a probability (from axioms 1 and 2) is a number on the unit interval, [0, 1].
For example, A = ‘an even die score’, while P (A) = 0.50, for a fair die.

3.8.2 Venn diagrams


A helpful geometric technique which can often be used is to represent the sample space
elements in a Venn diagram.
Imagine we roll a die twice and record the total score. Hence our sample space will be:
S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}.
Suppose we are interested in the following three events:

A = ‘an even total’, such that A = {2, 4, 6, 8, 10, 12}


B = ‘a total strictly less than 8’, such that B = {2, 3, 4, 5, 6, 7}
C = ‘a total greater than 4 but less than 10’, such that C = {5, 6, 7, 8, 9}.

Having defined these events, it is therefore possible to insert every element in the
sample space S into a Venn diagram, as shown in Figure 3.1.
The box represents S, so every possible outcome of the experiment (the total score
when a die is rolled twice) appears within the box. Three (overlapping) circles are
drawn representing the events A, B and C. Each element of S is then inserted into the
appropriate area. For example, the area where the three circles all intersect represents
the event A ∩ B ∩ C into which we place the element ‘6’, since this is the only member
of S which satisfies all three events A, B and C.

Example 3.10 Using Figure 3.1, we can determine the following sets:

A ∩ B = {2, 4, 6}
A ∩ C = {6, 8}
A ∩ B ∩ C = {6}
(A ∪ B ∪ C)c = {11}
A ∩ B ∩ Cc = {2, 4}
Ac ∩ B = {3, 5, 7}
(A ∪ C)c ∩ B = {3}
A | C = {6, 8}.
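Python’s built-in set type mirrors this notation directly, so the sets in Example 3.10 can be verified mechanically (writing the complement Ac as S − A relative to the sample space):

```python
S = set(range(2, 13))        # totals from two dice: {2, 3, ..., 12}
A = {2, 4, 6, 8, 10, 12}     # an even total
B = {2, 3, 4, 5, 6, 7}       # a total strictly less than 8
C = {5, 6, 7, 8, 9}          # a total greater than 4 but less than 10

complement = lambda E: S - E  # E^c, the complement relative to S

assert A & B == {2, 4, 6}
assert A & C == {6, 8}
assert A & B & C == {6}
assert complement(A | B | C) == {11}
assert A & B & complement(C) == {2, 4}
assert complement(A) & B == {3, 5, 7}
assert complement(A | C) & B == {3}
```

Here `&` is intersection and `|` is union, matching ∩ and ∪ respectively.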



Figure 3.1: Venn diagram for pre-defined sets A, B and C recording the total score when
a die is rolled twice.

3.8.3 The additive law


We now introduce our first probability law.

The additive law

Let A and B be any two events. The additive law states that:

P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

So P (A ∪ B) is the probability that at least one of A and B occurs, and P (A ∩ B)


is the probability that both A and B occur.

Example 3.11 We can think about this using a Venn diagram. The total area of
the Venn diagram in Figure 3.2 is assumed to be 1, so area represents probability.
Event A is composed of all points in the left-hand circle, and event B is composed of
all points in the right-hand circle. Hence:
P (A) = area x + area z and P (B) = area y + area z

P (A ∩ B) = area z and P (A ∪ B) = area x + area y + area z


and:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
= (area x + area z) + (area y + area z) − (area z)
= area x + area y + area z.
Therefore, to compute P (A ∪ B) we need to subtract P (A ∩ B) otherwise that
region would have been counted twice.



Figure 3.2: Venn diagram illustrating the additive law.

Example 3.12 Consider an industrial situation in which a machine component can


be defective in two ways such that:

P (defective in first way) = P (D1 ) = 0.01

P (defective in second way) = P (D2 ) = 0.05

P (defective in both ways) = P (D1 ∩ D2 ) = 0.001.

Therefore, it follows that the probability that the component is defective is:

P (D1 ∪ D2 ) = P (D1 ) + P (D2 ) − P (D1 ∩ D2 )


= 0.01 + 0.05 − 0.001
= 0.059.
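The additive law can also be confirmed by brute-force enumeration. The sketch below counts outcomes for two fair dice, with A = ‘an even total’ and B = ‘a total strictly less than 8’ (working in counts avoids floating-point comparisons of probabilities):

```python
from itertools import product

# All 36 equally likely outcomes when two fair dice are rolled.
outcomes = list(product(range(1, 7), repeat=2))

A = {o for o in outcomes if sum(o) % 2 == 0}   # A: an even total
B = {o for o in outcomes if sum(o) < 8}        # B: a total strictly less than 8

# Additive law in counts: |A ∪ B| = |A| + |B| − |A ∩ B|, with P(E) = |E|/36.
assert len(A | B) == len(A) + len(B) - len(A & B)

print(len(A) / 36, len(B) / 36, len(A | B) / 36)
```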

Additive law – special case 1

If A and B are mutually exclusive events, i.e. they cannot occur simultaneously, then
P (A ∩ B) = 0. Hence:

P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
          = P (A) + P (B) − 0
          = P (A) + P (B).

Such events can be depicted by two non-overlapping sets in a Venn diagram as shown
in Figure 3.3. Now revisit axiom 3, to see this result generalised for n mutually
exclusive events.



Figure 3.3: Venn diagram illustrating two mutually exclusive events.

Additive law – special case 2

The probability of an event A not occurring, i.e. the complement, Aᶜ, is:

P (Aᶜ) = 1 − P (A).

3.8.4 The multiplicative law

The multiplicative law is concerned with the probability of two events occurring at
the same time – specifically when the two events have the special property of
independence. An informal definition of independence is that two events are said to
be independent if one has no influence on the other.

The multiplicative law (for independent events)

Formally, events A and B are independent if the probability of their intersection is


the product of their individual probabilities, i.e. we have:

P (A ∩ B) = P (A) P (B).

Example 3.13 Consider rolling two fair dice. The score on one die has no influence
on the score on the other die. Hence the respective scores are independent events,
and so:
P (two sixes) = 1/6 × 1/6 = 1/36.

Note the multiplicative (or product) law does not hold for dependent events, which is
the subject of conditional probability, discussed shortly. Also, take a moment to
ensure you are comfortable with the terms ‘mutually exclusive’ and ‘independent’.
These are not the same thing, so do not get these terms confused!


3.9 Conditional probability and Bayes’ formula


We have just introduced the concept of independent events – one event has no influence
on another. Clearly, there are going to be many situations where independence does not
in fact hold, i.e. the occurrence of one event has a ‘knock-on’ effect on the probability of
another event occurring.
Example 3.14 For a single roll of a fair die, let the event A be ‘roll a 6’, and the
event B be ‘an even number’. The following probabilities are obvious:
P (A) = 1/6, P (B) = 1/2 and P (A | B) = 1/3.
So, we see that the probability of a 6 changes from 1/6 to 1/3 once we are given the
information that ‘an even number’ has occurred. Similarly, the probability of an even
number changes from 1/2 to 1, conditional on a 6 occurring, i.e. P (B | A) = 1.

Example 3.15 In order to understand and develop formulae for conditional


probability, consider the following simple example, representing the classification by
sex and subject (where A, B, C and D are defined below) of 144 college students.

Subject Female Male Total


A: Maths 4 14 18
B: Economics 17 41 58
C: Science 4 25 29
D: Arts 28 11 39
Total 53 91 144

Let F = ‘Female’ and M = ‘Male’ (obviously!), then we have P (A) = 18/144,


P (F ) = 53/144 and P (A ∩ F ) = 4/144. Note that P (A ∩ F ) ≠ P (A) P (F ), hence A
and F are not independent events.
From the table we have the following probabilities:

P (A | F ) = 4/53 (≠ P (A))

P (F | A) = 4/18 (≠ P (F )).

The correct relationship of these conditional probabilities to the original


unconditional probabilities is:

P (A | F ) = (4/144)/(53/144) = 4/53 = P (A ∩ F )/P (F ).

Similarly:

P (F | A) = (4/144)/(18/144) = 4/18 = P (A ∩ F )/P (A).   (Note P (A ∩ F ) = P (F ∩ A).)


Note also another important relationship involving conditional probability is the ‘total
probability formula’ (discussed in greater depth shortly). This expresses an
unconditional probability in terms of other, conditional probabilities.

Example 3.16 Continuing with Example 3.15:

   
P (A) = 18/144 = (4/53 × 53/144) + (14/91 × 91/144) = P (A | F ) P (F ) + P (A | M ) P (M ).

3.9.1 Bayes’ formula

Conditional probability

For any two events A and B, we define conditional probabilities as:

P (A | B) = P (A ∩ B)/P (B) and P (B | A) = P (A ∩ B)/P (A)   (3.1)

provided P (A) > 0 and P (B) > 0.


In words: ‘the probability of one event, given a second event, is equal to the
probability of both, divided by the probability of the second (conditioning) event’.

This is the simplest form of Bayes’ formula, and this can be expressed in other ways.
Rearranging (3.1), we obtain:

P (A ∩ B) = P (A | B) P (B) = P (B | A) P (A)

from which we can derive Bayes’ formula.

Bayes’ formula

The simplest form of Bayes’ formula is:

P (A | B) = P (B | A) P (A)/P (B).

3.9.2 Total probability formula


The simplest case of the total probability formula involves calculating the probability of
an event A from information about its two conditional probabilities with respect to some
other event B and its complement, Bᶜ, together with knowledge of P (B). Note that:

B and Bᶜ are mutually exclusive, and B and Bᶜ are collectively exhaustive.


Fulfilment of these criteria (being mutually exclusive and collectively exhaustive) allows
us to view B and Bᶜ as a partition of the sample space.

The (simple form of the) total probability formula

The total probability formula is:

P (A) = P (A | B) P (B) + P (A | Bᶜ) P (Bᶜ).
In words: ‘the probability of an event is equal to its conditional probability on a
second event times the probability of the second event, plus its probability conditional
on the second event not occurring times the probability of that non-occurrence’.

There is a more general form of the total probability formula. Let B1 , B2 , . . . , Bn


partition the sample space S into n pairwise mutually exclusive (at most one can occur)
and collectively exhaustive (at least one of them must occur) events. For example, for
n = 4, see Figure 3.4.


Figure 3.4: An example of a partitioned sample space.

Figure 3.5 superimposes an event A.


Figure 3.5: The event A within a partitioned sample space.


Extending the simple form of the total probability formula, we obtain:


P (A) = Σᵢ₌₁ⁿ P (A | Bi ) P (Bi ) = P (A | B1 ) P (B1 ) + P (A | B2 ) P (B2 ) + · · · + P (A | Bn ) P (Bn ).


Recall that P (B | A) = P (A ∩ B)/P (A) = P (A | B) P (B)/P (A), so assuming we have
the partition B and Bᶜ, then:

P (B | A) = P (A | B) P (B) / (P (A | B) P (B) + P (A | Bᶜ) P (Bᶜ)).
A more general partition gives us a more complete form of Bayes’ formula.

General form of Bayes’ formula

For a general partition of the sample space S into B1 , B2 , . . . , Bn , and for some event
A, then:
P (Bk | A) = P (A | Bk ) P (Bk ) / Σᵢ₌₁ⁿ P (A | Bi ) P (Bi ).

Example 3.17 Suppose that 1 in 10,000 people (0.01%) has a particular disease. A
diagnostic test for the disease has 99% sensitivity – if a person has the disease, the
test will give a positive result with a probability of 0.99. The test has 99% specificity
– if a person does not have the disease, the test will give a negative result with a
probability of 0.99.
Let B denote the presence of the disease, and Bᶜ denote no disease. Let A denote a
positive test result. We want to calculate P (A).
The probabilities we need are P (B) = 0.0001, P (Bᶜ) = 0.9999, P (A | B) = 0.99 and
P (A | Bᶜ) = 0.01, and hence:

P (A) = P (A | B) P (B) + P (A | Bᶜ) P (Bᶜ)
= 0.99 × 0.0001 + 0.01 × 0.9999
= 0.010098.

We want to calculate P (B | A), i.e. the probability that a person has the disease,
given that the person has received a positive test result.
The probabilities we need are:

P (B) = 0.0001, P (Bᶜ) = 0.9999, P (A | B) = 0.99 and P (A | Bᶜ) = 0.01.

Hence:

P (B | A) = P (A | B) P (B) / (P (A | B) P (B) + P (A | Bᶜ) P (Bᶜ)) = (0.99 × 0.0001)/0.010098 ≈ 0.0098.


Why is this so small? The reason is because most people do not have the disease and
the test has a small, but non-zero, false positive rate P (A | Bᶜ). Therefore, most
positive test results are actually false positives.
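The whole calculation in this example (the total probability formula followed by Bayes' formula) can be reproduced in a few lines. A Python sketch, using the sensitivity and specificity values above (the function name `posterior` is ours, not standard notation):

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) by Bayes' formula.

    prior = P(B), sensitivity = P(A | B), specificity = P(no A | no B).
    """
    # total probability formula for P(A), a positive test result
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

print(round(posterior(0.0001, 0.99, 0.99), 4))  # 0.0098
```

Changing the prior shows how strongly the answer depends on how rare the disease is.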

3.9.3 Independent events (revisited)


The terms ‘dependent’ and ‘independent’ reflect the fact that the probability of an
event is changed when another event is known to occur only if there is some dependence
between the events. If there is such a dependence, then P (A | B) ≠ P (A).
It follows from this that two events, A and B, are independent if and only if:

P (A | B) = P (A).

Recall, from the multiplicative law in Section 3.8.4, that under independence
P (A ∩ B) = P (A) P (B). Substituting this into our conditional probability formula gives
the required result:

P (A | B) = P (A ∩ B)/P (B) = P (A) P (B)/P (B) = P (A)

provided P (B) > 0. Hence if A and B are independent, knowledge of B, i.e. ‘ | B’, is of
no value to us in determining the probability of A occurring.

3.10 Probability trees


Probability problems can often be represented in several ways. We have already seen
how Venn diagrams provide a convenient way to visualise the members of various
combinations of events.
Here we introduce a probability tree, also referred to as a tree diagram. This is best
explained by way of an example.

Example 3.18 The London Special Electronics (LSE) company is investigating a


fault in its manufacturing plant. The tests done so far prove that the fault is at one
of three locations: A, B or C. Taking all the tests and other evidence into account,
the company assesses the chances of the fault being at each site as:

Suspect site A B C
Probability of it being site of fault 0.5 0.2 0.3

The next test they will do is expected to improve the identification of the correct
site, but (like most tests) it is not entirely accurate.

If the fault is at A, then there is a 70 per cent chance of a correct identification;


i.e. given that A is the site of the fault then the probability that the test says
that it is at A is 0.70, and in this case the probabilities of either possible error
are equal.


If the fault is at B, then there is a 60 per cent chance of a correct identification,


and in this case too the probabilities of either possible error are equal.

If the fault is at C, then there is an 80 per cent chance of a correct identification,


and in this case too the probabilities of either possible error are equal.

Draw a probability tree for this problem, and use it to answer the following:
(a) What is the probability that the new test will (rightly or wrongly) identify A as
the site of the fault?

(b) If the new test does identify A as the site of the fault, find:
i. the company’s revised probability that C is the site of the fault
ii. the company’s revised probability that B is not the site of the fault.
Let A, B and C stand for the events: ‘fault is at A’, ‘fault is at B’ and ‘fault is at C’,
respectively. Also, let a, b and c stand for the events: ‘the test says the fault is at A’,
‘the test says the fault is at B’ and ‘the test says the fault is at C’, respectively. The
probability tree is shown in Figure 3.6.

(a) The probability P (a) is the sum of the three values against ‘branches’ which
include the event a, that is:

0.35 + 0.04 + 0.03 = 0.42.

(b) i. The conditional probability P (C | a) is the value for the C ∩ a branch


divided by P (a), that is:
0.03/(0.35 + 0.04 + 0.03) = 1/14 = 0.071.

ii. The conditional probability P (Bᶜ | a) is the sum of the values for the A ∩ a
and C ∩ a branches divided by P (a), that is:
(0.35 + 0.03)/0.42 = 0.905.
Alternatively:

P (Bᶜ | a) = 1 − (value of (B ∩ a) branch)/P (a) = 1 − 0.04/0.42 = 0.905.
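The tree calculations in parts (a) and (b) can be checked numerically. A Python sketch (the dictionaries simply encode the prior and conditional probabilities stated above):

```python
prior = {'A': 0.5, 'B': 0.2, 'C': 0.3}   # P(fault is at each site)

# P(test says a/b/c | true site): correct identification with the stated
# probability, the two possible errors being equally likely
says = {'A': {'a': 0.70, 'b': 0.15, 'c': 0.15},
        'B': {'a': 0.20, 'b': 0.60, 'c': 0.20},
        'C': {'a': 0.10, 'b': 0.10, 'c': 0.80}}

p_a = sum(prior[s] * says[s]['a'] for s in prior)      # total probability formula
p_C_given_a = prior['C'] * says['C']['a'] / p_a        # Bayes' formula
p_notB_given_a = 1 - prior['B'] * says['B']['a'] / p_a

print(round(p_a, 2), round(p_C_given_a, 3), round(p_notB_given_a, 3))
# 0.42 0.071 0.905
```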

3.11 Overview of chapter


This chapter has introduced the idea of probability, and defined the key terms. You
have also seen the three axioms of probability, and various probability laws derived from
these. Conditional probability computed using Bayes’ formula is very important for
allowing us to update probabilities in light of new information.

The branches of the tree, with their probabilities, are as follows.

True site of fault    Test says fault at    Final probability
A (0.5)               a (0.7)               0.5 × 0.7 = 0.35
A (0.5)               b (0.15)              0.5 × 0.15 = 0.075
A (0.5)               c (0.15)              0.5 × 0.15 = 0.075
B (0.2)               a (0.2)               0.2 × 0.2 = 0.04
B (0.2)               b (0.6)               0.2 × 0.6 = 0.12
B (0.2)               c (0.2)               0.2 × 0.2 = 0.04
C (0.3)               a (0.1)               0.3 × 0.1 = 0.03
C (0.3)               b (0.1)               0.3 × 0.1 = 0.03
C (0.3)               c (0.8)               0.3 × 0.8 = 0.24

Figure 3.6: Probability tree for Example 3.18.

3.12 Key terms and concepts


Additive law
Axioms
Bayes' formula
Collectively exhaustive
Complement
Conditional probability
Element
Event
Experiment
Independence
Member
Multiplicative law
Mutually exclusive
Partition
Probability theory
Probability tree
Relative frequency
Sample space
Set
Subset
Total probability formula
Tree diagram
Venn diagram

3.13 Sample examination questions


1. If P (B) = 0.05, P (A | B) = 0.70 and P (A | Bᶜ) = 0.30, find P (B | A).

2. An engine encounters a standard environment with a probability of 0.95, and a


severe environment with a probability of 0.05. In a normal environment the
probability of failure is 0.02, whereas in the severe environment this probability is
0.50.


(a) What is the probability of failure?


(b) Given that failure has occurred, what is the probability that the environment
encountered was severe?

3. Suppose there are two boxes; the first box contains three green balls and one red
ball, whereas the second box contains two green balls and two red balls. Suppose a
box is chosen at random and then a ball is drawn randomly from this box.
(a) What is the probability that the ball drawn is green?
(b) If the ball drawn was green, what is the probability that the first box was
chosen?

3.14 Solutions to Sample examination questions


1. The solution of this exercise requires the following steps. Note, however, that these
steps can be performed in a different order.
We have that:
P (Bᶜ) = 1 − P (B) = 1 − 0.05 = 0.95
and:
P (A ∩ B) = P (A | B) P (B) = 0.70 × 0.05 = 0.035.
Applying the total probability formula, we have:

P (A) = P (A | B) P (B) + P (A | Bᶜ) P (Bᶜ) = 0.70 × 0.05 + 0.30 × 0.95 = 0.32.

Applying Bayes’ formula, we have:


P (B | A) = P (A ∩ B)/P (A) = 0.035/0.32 = 0.1094.

2. (a) Applying the total probability formula, we have:

P (F ) = P (F | N ) P (N ) + P (F | S) P (S) = 0.02 × 0.95 + 0.50 × 0.05 = 0.044.

(b) Applying Bayes’ formula, we have:


P (S | F ) = P (F | S) P (S)/P (F ) = 0.025/0.044 = 25/44 = 0.5682.

3. (a) Let B1 and B2 denote boxes 1 and 2, respectively. Let G denote a green ball
and R denote a red ball. Applying the total probability formula, we have:
   
P (G) = P (G | B1 ) P (B1 ) + P (G | B2 ) P (B2 ) = (3/4 × 1/2) + (1/2 × 1/2) = 5/8.

(b) Applying Bayes’ formula, we have:


P (B1 | G) = P (G | B1 ) P (B1 )/P (G) = (3/4 × 1/2)/(5/8) = 3/5 = 0.60.
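All three questions follow the same two-step pattern (the total probability formula, then Bayes' formula), so the answers can be verified with one helper function. A Python sketch (`bayes` is our own name for the helper):

```python
def bayes(p_b, p_a_given_b, p_a_given_bc):
    """Return (P(A), P(B | A)) for the two-event partition {B, B complement}."""
    p_a = p_a_given_b * p_b + p_a_given_bc * (1 - p_b)  # total probability formula
    return p_a, p_a_given_b * p_b / p_a                 # Bayes' formula

for p_b, pab, pabc in [(0.05, 0.70, 0.30),   # question 1
                       (0.05, 0.50, 0.02),   # question 2 (B = severe environment)
                       (0.50, 3/4, 1/2)]:    # question 3 (B = first box)
    p_a, post = bayes(p_b, pab, pabc)
    print(round(p_a, 4), round(post, 4))
# 0.32 0.1094
# 0.044 0.5682
# 0.625 0.6
```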

Chapter 4
Random variables, the normal and
sampling distributions

4.1 Synopsis of chapter



We proceed to distinguish between discrete and continuous random variables, and


thereafter see how to calculate the expected value and variance for simple discrete
distributions. The normal distribution is presented, and we see how to determine
probabilities of events for normally-distributed variables. The concept of a sampling
distribution is presented, including the special case of the central limit theorem.

4.2 Learning outcomes


After completing this chapter, you should be able to:

define a random variable and distinguish it from the values which it takes

summarise basic discrete probability distributions with expected values and


variances of discrete random variables

compute probabilities as areas under the curve for a normal distribution

state and apply sampling distributions of the sample mean, including the central
limit theorem.

4.3 Recommended reading


Abdey, J. Business Analytics: Applied Modelling and Prediction. (London: SAGE
Publications, 2023) 1st edition [ISBN 9781529771092] Chapters 5 and 7.

4.4 Introduction
A random variable is a variable which contains the outcomes of a chance experiment.
An alternative view is that a random variable is a description of all possible outcomes of
an experiment together with the probabilities of each outcome occurring. These, and
other possible definitions, are somewhat abstract, so we illustrate with some examples.


Example 4.1 Consider the outcomes when two fair dice are rolled. We can read off
the various possibilities for the pair of scores observed from the sample space as
follows in the form (first score, second score):

(1, 1) (2, 1) (3, 1) (4, 1) (5, 1) (6, 1)


(1, 2) (2, 2) (3, 2) (4, 2) (5, 2) (6, 2)
(1, 3) (2, 3) (3, 3) (4, 3) (5, 3) (6, 3)
(1, 4) (2, 4) (3, 4) (4, 4) (5, 4) (6, 4)
(1, 5) (2, 5) (3, 5) (4, 5) (5, 5) (6, 5)
(1, 6) (2, 6) (3, 6) (4, 6) (5, 6) (6, 6)

Suppose we define the random variable X to be the sum of the two scores. For
example, if we observe (1, 1), then the sum is 1 + 1 = 2. We could write the possible
outcomes along with their respective probabilities, pX (x) (where lower case x is a
specific value of the random variable X), in a table depicting the probability
distribution of the random variable X as follows.

Sum of two fair dice, X     2     3     4     5     6     7     8     9    10    11    12
Probability of outcome   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Note that each of the 36 pairs of scores is equally likely, hence why P (X = 2) = 1/36
(since (1, 1) is the only one of the 36 possible outcomes for the sum to equal 2),
P (X = 3) = 2/36 (resulting from the two possible outcomes (1, 2) and (2, 1)), and so
on.
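The distribution above can be built by enumerating the 36 equally likely pairs. A Python sketch using only the standard library (`Fraction` keeps the probabilities exact):

```python
from collections import Counter
from fractions import Fraction

# enumerate the 36 equally likely (first score, second score) pairs
counts = Counter(i + j for i in range(1, 7) for j in range(1, 7))
pmf = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

print(pmf[2], pmf[7], pmf[12])  # 1/36 1/6 1/36
print(sum(pmf.values()))        # 1  (the probabilities sum to one)
```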

Example 4.2 We determine the sample space when two fair coins are tossed (each
resulting in heads, H, or tails, T ), and the associated random variable, X, which
counts the number of tails. The sample space is:

S = {HH, HT, TH, TT}.

X takes the form:

Number of tails, X 0 1 2
Probability of outcome 0.25 0.50 0.25

So, X is a random variable taking values 0, 1 and 2 with probabilities 0.25, 0.50 and
0.25, respectively.

Example 4.3 Suppose you are assessing the success or failure of a specific project
management approach. The two possible outcomes are project success (Success) and
project failure (Failure). Let X be a random variable representing the project
outcome in a randomly selected project. The probabilities are as follows:

P (X = Success) = 0.6 and P (X = Failure) = 0.4.


The probability distribution for this random variable would be:

Project outcome, X Success Failure


Probability of outcome 0.6 0.4

Example 4.4 An analyst is analysing the distribution of investment preferences


among a group of investors, represented by the random variable X. The investment
preferences can be categorised into four types: Stock (S), Bonds (B), Mixed Portfolio
(MP), or Other (O). Assume that the probabilities of each investment preference are
as follows:
P (X = S) = 0.40, P (X = B) = 0.25, P (X = MP) = 0.15 and P (X = O) = 0.20.

The probability distribution for this random variable would be:

Investment preference, X S B MP O
Probability of outcome 0.40 0.25 0.15 0.20

4.5 Discrete random variables


Since different types of random variables will be analysed differently, it is necessary to
distinguish between discrete and continuous random variables. A random variable is
discrete if its set of possible values consists of isolated points on the number line.
Often, but not necessarily always, these will be non-negative integers. The number of
such points may be finite or infinite.
Examples 4.1 to 4.4 are all discrete random variables (and finite). Note that the sum of
all the probabilities must be 1, since one of the possible (mutually exclusive) outcomes
must occur. Examples 4.1 and 4.2 have numeric values of X, while Examples 4.3 and 4.4
have qualitative outcomes (the project outcome and investment preference, respectively).
Other examples of discrete random variables include:

the number of successful product launches out of a series of attempts

the number of defects in a batch of manufactured items

the number of customer complaints received in a given week

the number of financial transactions processed within a specified timeframe

the number of customers arriving at a service centre during a certain hour

the number of errors in a financial report.

In each of the examples, the discrete random variable is numeric (the number of . . .).
This will be the case for most discrete random variable examples. For some of these,
they are clearly finite (such as the number of product launches). Others are finite in


practice, although there may be no theoretical upper limit, so it may be convenient to


specify an infinite number of outcomes.
We can describe a discrete distribution (i.e. the random variable and its associated
probabilities) in different ways.

With a density histogram or bar chart (rarely).

Listed in a table, such as in Example 4.1 (sometimes).

With a formula (most common).

A discrete distribution would usually be described by a formula (statisticians use the


term ‘probability mass function (pmf )’).

4.6 Continuous random variables


A random variable is continuous if it takes on values at every point over a continuous
interval on the number line. Loosely, things are measured rather than counted.
Examples of continuous random variables include:

the revenue generated by a business in a specific time period

the stock price of a company at a particular moment

the time taken for a customer service representative to resolve a query

the interest rate in a financial market

the duration of a project completion time

the energy consumption of a manufacturing process per unit of output.

The probability properties of a continuous random variable, X, can be described by a


non-negative function, f (x), which is defined over the relevant interval S. f (x) is called
the probability density function (pdf ). A pdf itself does not represent a probability
(hence f (x) is not bounded by 1), instead it is a density of probability at a point x,
with probability itself corresponding to the area under the (graph of the) function f (x).
A complete study of continuous distributions requires a knowledge of calculus, which is
beyond the scope of this course (study ST104B Statistics 2 if interested).

Example 4.5 If we wanted P (1 < X < 3), say, then we would compute the area
under the curve defined by f (x) and above the x-axis interval (1, 3). This is
illustrated in Figure 4.1 for an arbitrary pdf.
In this way the pdf will give us the probabilities associated with any interval of
interest. To calculate this area requires the mathematical technique of integration,
which is very important in the theory of continuous random variables because of its
role in determining areas. However, we will not require integration in this course
(again, study ST104B Statistics 2 if interested).


Figure 4.1: For an arbitrary pdf, P (1 < X < 3) is shown as the area under the pdf and
above the x-axis interval (1, 3).

The following properties for Example 4.5 should be readily apparent:

P (1 < X < 3) = the area under f (x) above the x-axis interval (1, 3)

the total area under the curve is 1, since this represents the probability of X taking
any possible value.

4.7 Expectation of a discrete random variable


We now concentrate on discrete random variables. Recall, from Example 4.2, the
distribution of the number of tails, X, when two fair coins are tossed:

Number of tails, X 0 1 2
Probability of outcome 0.25 0.50 0.25

Suppose the experiment is repeated a large number of times, say n = 4,000,000. We


would expect to observe the outcome of 0 approximately 1 million times (25% of the
time), the outcome of 1 approximately 2 million times (50% of the time), and the
outcome of 2 approximately 1 million times (25% of the time). This is due to our earlier
relative frequency interpretation of probability. If we observed this, the mean value of X
would be:
sum of measurements/n = ((0 × 1,000,000) + (1 × 2,000,000) + (2 × 1,000,000))/4,000,000
= (0 × 1,000,000)/4,000,000 + (1 × 2,000,000)/4,000,000 + (2 × 1,000,000)/4,000,000
= 0 × 1/4 + 1 × 1/2 + 2 × 1/4
= 0 × 0.25 + 1 × 0.50 + 2 × 0.25
= 1.
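This long-run argument can be illustrated by simulating repeated tosses of the two coins. A Python sketch (100,000 repetitions rather than 4,000,000, which is already enough for the sample mean to settle near 1):

```python
import random

random.seed(2024)          # fixed seed for reproducibility
n = 100_000                # number of repetitions of the experiment

# each repetition: toss two fair coins and count the tails (0, 1 or 2)
total = sum(random.randint(0, 1) + random.randint(0, 1) for _ in range(n))

print(round(total / n, 2))  # close to the expected value of 1
```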

73
4. Random variables, the normal and sampling distributions

In the penultimate line, we can see that the first term is 0 × pX (0), the second term is
1 × pX (1) and the third term is 2 × pX (2).
Therefore, the mean value of X is:

Σₓ₌₀² x pX (x).

Of course, this is no coincidence.

Expectation of a discrete random variable

We define the mean, or expected value, or ‘expectation’ of a discrete random
variable X to be:

E(X) = Σₓ x pX (x)   (4.1)
where we sum over all the x values which are taken by the random variable X.
We often write:
E(X) = µ
where the Greek letter, µ, is widely used to represent the population mean.
In Excel, this can be computed using the SUMPRODUCT function.

We can think of E(X) as the long-run average when the experiment is carried out a
large number of times.

Example 4.6 Suppose I buy a lottery ticket for £1. I can win £500 with a
probability of 0.001 or £100 with a probability of 0.003. What is my expected profit?
We begin by defining the random variable X to be my profit (in pounds). Its
distribution is:

Profit, X −1 99 499
Probability of outcome 0.996 0.003 0.001

Using (4.1), we get:

E(X) = (−1 × 0.996) + (99 × 0.003) + (499 × 0.001) = −0.2.

So I expect to make a loss of £0.20 (which will go to funding the prize money or,
possibly, charity).
Using Excel:
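The same SUMPRODUCT-style calculation can be written in Python (a sketch; `zip` pairs each profit with its probability):

```python
profits = [-1, 99, 499]          # possible profits in pounds
probs = [0.996, 0.003, 0.001]    # their probabilities

expected_profit = sum(x * p for x, p in zip(profits, probs))
print(round(expected_profit, 2))  # -0.2, an expected loss of £0.20
```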


Example 4.7 Consider an economic scenario where you are analysing the
distribution of successful investment outcomes based on two investment decisions of
a group of investors. The random variable X represents the number of investments
(out of two) resulting in profitable returns. Suppose the probabilities are
pX (0) = 0.25, pX (1) = 0.50, and pX (2) = 0.25. Using (4.1), the expected number of
profitable investments is:
E(X) = Σₓ₌₀² x pX (x) = (0 × 0.25) + (1 × 0.50) + (2 × 0.25) = 1.

4.8 Functions of a random variable


We have seen that a random variable may be specified by the set of values it takes
together with the associated probabilities of each value.
For example, suppose an unbiased (fair) die is rolled. Let X denote the score obtained.
We can represent the outcomes using the table:

Score, X 1 2 3 4 5 6
Probability of outcome 1/6 1/6 1/6 1/6 1/6 1/6

Suppose we were interested in the random variables X1 = 1/X, X2 = X² or the random
variable X3 where:

X3 = 0 for x = 1, 2 or 3;  X3 = 1 for x = 4 or 5;  X3 = 2 for x = 6.

These take the values derived from the function given, and the associated probabilities
are derived from those of X. Therefore, from the distribution of X we can derive, for
example, the distribution of X1 = 1/X.

Inverse of score, X1 = 1/X 1 1/2 1/3 1/4 1/5 1/6


Probability of outcome 1/6 1/6 1/6 1/6 1/6 1/6

Similarly, for X2 = X² we obtain:

Square of score, X2 = X² 1 4 9 16 25 36
Probability of outcome 1/6 1/6 1/6 1/6 1/6 1/6

And, finally, for X3 (as previously defined):

Random variable, X3 0 1 2
Probability of outcome 1/2 1/3 1/6


Just as we defined:

E(X) = Σₓ x pX (x)

we can define the expectation of a function of a discrete random variable.

Expectation of a function of a discrete random variable

For a discrete random variable X where g(X) is the function of X being considered,
the expectation of this function of X is given by:

E(g(X)) = Σₓ g(x) pX (x)

where we sum over all the x values which are taken by the random variable X.

Example 4.8 For the random variables X1 , X2 and X3 defined above, we have the
following.
For X1, its expectation is:

E(X1 ) = E(1/X) = Σₓ₌₁⁶ (1/x) pX (x) = (1 × 1/6) + (1/2 × 1/6) + (1/3 × 1/6) + (1/4 × 1/6) + (1/5 × 1/6) + (1/6 × 1/6) = 49/120 = 0.4083.

For X2, its expectation is:

E(X2 ) = E(X²) = Σₓ₌₁⁶ x² pX (x) = (1² × 1/6) + (2² × 1/6) + (3² × 1/6) + (4² × 1/6) + (5² × 1/6) + (6² × 1/6) = 91/6 = 15.17.

For X3, its expectation is:

E(X3 ) = Σₓ₌₀² x pX (x) = (0 × 1/2) + (1 × 1/3) + (2 × 1/6) = 2/3.

Using Excel:
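Each of these expectations is a sum of g(x) pX (x) terms, so they can be reproduced directly. A Python sketch (`Fraction` keeps the arithmetic exact):

```python
from fractions import Fraction

p = Fraction(1, 6)                                # fair die: P(X = x) = 1/6
scores = range(1, 7)

e_x1 = sum(Fraction(1, x) * p for x in scores)    # E(1/X)
e_x2 = sum(x**2 * p for x in scores)              # E(X^2)
# X3 takes the values 0, 1, 2 with probabilities 1/2, 1/3, 1/6
e_x3 = 0 * Fraction(1, 2) + 1 * Fraction(1, 3) + 2 * Fraction(1, 6)

print(e_x1, e_x2, e_x3)  # 49/120 91/6 2/3
```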


An immediate application of expectations of functions of discrete random variables is


the calculation of the variance of a discrete random variable.

4.9 Variance of a discrete random variable


Just as we needed the idea of measures of dispersion to describe a set of data in
Chapter 2, so we need to define the variance of a discrete random variable. The
definition is similar. If X is a random variable, we define the variance by:

Var(X) = Σₓ (x − µ)² pX (x).

Recall that E(X) = µ, and the summation is taken over all the x values which are taken
by the random variable X. We often write:

Var(X) = σ²

and just as we could find a sample variance in two ways, we can rewrite this as:

Var(X) = Σₓ x² pX (x) − µ².   (4.2)

So there are two equivalent versions we can use. The latter is often easier in practice.
We could write (4.2) more succinctly as follows.

Variance of a discrete random variable

The variance of a discrete random variable X is:

Var(X) = E(X²) − (E(X))².

The (positive) square root of the variance is the standard deviation.

Example 4.9 For the two investment decisions in Example 4.7, we saw that
E(X) = µ = 1. Therefore, the variance is (using the first method):
σ² = Σₓ₌₀² (x − µ)² pX (x) = (0 − 1)² × 0.25 + (1 − 1)² × 0.50 + (2 − 1)² × 0.25 = 0.25 + 0 + 0.25 = 0.50

or (using the second method):

σ² = Σₓ₌₀² x² pX (x) − µ² = (0² × 0.25 + 1² × 0.50 + 2² × 0.25) − 1² = (0 + 0.50 + 1) − 1 = 0.50

giving a standard deviation of √0.50 = 0.7071.
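Both versions of the variance formula can be checked against each other. A Python sketch for this distribution:

```python
xs = [0, 1, 2]
probs = [0.25, 0.50, 0.25]

mu = sum(x * p for x, p in zip(xs, probs))                # E(X)
var1 = sum((x - mu)**2 * p for x, p in zip(xs, probs))    # definition of Var(X)
var2 = sum(x**2 * p for x, p in zip(xs, probs)) - mu**2   # E(X^2) - mu^2
sd = var1 ** 0.5                                          # standard deviation

print(mu, var1, var2, round(sd, 4))  # 1.0 0.5 0.5 0.7071
```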


Example 4.10 Economists are studying the impact of economic recessions on a


country’s GDP. The random variable X represents the number of quarters with
negative GDP growth in a given year. The probability distribution provided reflects
historical data on the occurrence of quarters with negative GDP growth.

Number of quarters, X 0 1 2 3 4
Probability of outcome 0.25 0.30 0.25 0.15 0.05

(a) What are the expectation and variance of X?


We have:

E(X) = Σₓ₌₀⁴ x pX (x) = (0 × 0.25) + (1 × 0.30) + (2 × 0.25) + (3 × 0.15) + (4 × 0.05) = 1.45

and:

σ² = Var(X) = Σₓ₌₀⁴ (x − µ)² pX (x)
   = (0 − 1.45)² × 0.25 + (1 − 1.45)² × 0.30 + (2 − 1.45)² × 0.25 + (3 − 1.45)² × 0.15 + (4 − 1.45)² × 0.05
   = 1.3475.

(b) What is the probability that the number of quarters with negative GDP growth
exceeds:
i. µ + 2σ?
ii. µ + 3σ?
We have that the standard deviation is σ = √Var(X) = √1.3475 = 1.16.
i. Hence P (X > µ + 2σ) is:

P (X > 1.45 + 2 × 1.16) = P (X > 3.77) = P (X = 4) = 0.05.

ii. Hence P (X > µ + 3σ) is:

P (X > 1.45 + 3 × 1.16) = P (X > 4.93) = 0.

It is important to distinguish a frequency distribution from a probability distribution.

A frequency distribution uses data – it counts the number of observations satisfying


some criterion, or taking some value, in the dataset.
A probability distribution is based on theory, or some assumed property – it gives
the probability of an observation satisfying a criterion or taking some value, based
on theory or assumptions.

The two are related, but are not the same!


4.10 The normal distribution


The normal distribution is by far the most important probability distribution in
statistics. This is for three broad reasons.

Many variables have distributions which are approximately normal – for example,
weights of humans and animals.

The normal distribution has extremely convenient mathematical properties which


make it a useful default choice of distribution in many contexts.
Even when a variable is not itself even approximately normally distributed,
functions of several observations of the variable (‘sampling distributions’) are often
approximately normal due to the central limit theorem (CLT). Because of this, the
normal distribution has a crucial role in statistical inference. This will be discussed
later in the chapter.

Normal distribution

A random variable X has a normal distribution with mean µ and variance σ²,
denoted X ∼ N (µ, σ²), when its probability density function is:

f (x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))   for −∞ < x < ∞   (4.3)

where π is the mathematical constant (i.e. π = 3.14159 . . .), and µ and σ² are
parameters, with −∞ < µ < ∞ and σ² > 0.

The mean (i.e. expected value) of X is:

E(X) = µ

and the variance of X is:

Var(X) = σ²
hence the standard deviation is σ.

The normal distribution is the so-called ‘bell curve’. The two parameters affect it as
follows.

The mean, µ, determines the location (or central tendency) of the curve.

The variance, σ 2 , determines the dispersion (or spread) of the curve.

For example, in Figure 4.2, N (0, 1) and N (5, 1) have the same dispersion but different
locations – the N (5, 1) curve is identical to the N (0, 1) curve, but shifted 5 units to the
right, while N (0, 1) and N (0, 9) have the same location but different dispersions – the
N (0, 9) curve is centred at the same value as the N (0, 1) curve, but spread out more
widely.

4. Random variables, the normal and sampling distributions

[Figure: density curves of N(0, 1), N(5, 1) and N(0, 9).]

Figure 4.2: Three examples of normal distributions.

The mean can also be inferred from the observation that the normal distribution is
symmetric about µ. This also implies that the median of the normal distribution is µ. Moreover, since the density reaches its maximum at µ, the mean and median are also equal to the mode. Hence for any normal distribution:

mean = median = mode.

The most important normal distribution is the special case when µ = 0 and σ 2 = 1. We
call this the standard normal distribution, denoted by Z, i.e. Z ∼ N (0, 1). A
standardised variable has a zero mean and a variance of one (hence also a standard deviation of one, since √1 = 1). Tabulated probabilities which appear in statistical
tables are for the standard normal distribution.

4.10.1 Standard normal statistical tables


We now discuss the determination of normal probabilities (i.e. areas under the curve)
using standard normal statistical tables. Extracts from the New Cambridge Statistical
Tables will be provided in the examination. Here we focus on Table 4, which can be
found at the end of this subject guide.

Standard normal probabilities

Table 4 of the New Cambridge Statistical Tables lists ‘lower-tail’ probabilities, which
can be represented as:
P (Z ≤ z) = Φ(z) for z ≥ 0
using the conventional Z notation for a standard normal random variable.1

1
Although Z is the conventional letter used to denote a standard normal random variable, Table 4
uses (somewhat confusingly) ‘x’ to denote ‘z’.


A cumulative probability is the probability of being less than or equal to some particular
value. Note the cumulative probability for the Z distribution, P (Z ≤ z), is often
denoted Φ(z). We now consider some examples of working out probabilities from
Z ∼ N (0, 1).

Example 4.11 If Z ∼ N (0, 1), what is P (Z > 1.20)?


When computing probabilities, it can be helpful to draw a quick sketch to visualise
the specific area of probability which we are after.
So, for P (Z > 1.20), we require the upper-tail probability shaded in red in Figure
4.3. Since Table 4 gives us lower-tail probabilities, if we look up the value z = 1.20 in
the table we get P(Z ≤ 1.20) = 0.8849. The total area under a normal curve is 1, so the required probability is simply 1 − Φ(1.20) = 1 − 0.8849 = 0.1151.

[Figure: standard normal density curve, fZ(z), with the area to the right of z = 1.20 shaded.]

Figure 4.3: Standard normal distribution with shaded area depicting P(Z > 1.20).
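Table look-ups like this can be cross-checked by computer. As an illustration only (in the examination you would use Table 4), Python's standard library provides `statistics.NormalDist`:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# P(Z > 1.20) = 1 - Phi(1.20); Table 4 gives Phi(1.20) = 0.8849
p = 1 - Z.cdf(1.20)
print(round(p, 4))  # 0.1151
```

Here `Z.cdf(z)` plays the role of Φ(z), the lower-tail probability.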

Example 4.12 If Z ∼ N (0, 1), what is P (−1.24 < Z < 1.86)?


Again, begin by producing a sketch.
The probability we require is the sum of the blue and red areas in Figure 4.4. Using
Table 4, which note only covers z ≥ 0, we proceed as follows.
The red area is given by:

P (0 ≤ Z ≤ 1.86) = P (Z ≤ 1.86) − P (Z ≤ 0)
= Φ(1.86) − Φ(0.00)
= 0.9686 − 0.50
= 0.4686.

81
4. Random variables, the normal and sampling distributions

The blue area is given by:

P (−1.24 ≤ Z ≤ 0) = P (Z ≤ 0) − P (Z ≤ −1.24)
= Φ(0.00) − Φ(−1.24)
= Φ(0.00) − (1 − Φ(1.24))
= 0.50 − (1 − 0.8925)
= 0.3925.

Note P(Z ≤ −1.24) = P(Z ≥ 1.24) = 1 − Φ(1.24), by symmetry of Z about µ = 0.


So, although Table 4 does not give probabilities for negative z values, we can exploit
the symmetry of the (standard) normal distribution about zero.
Hence P (−1.24 < Z < 1.86) = 0.4686 + 0.3925 = 0.8611.
Alternatively:

P (−1.24 < Z < 1.86) = Φ(1.86) − Φ(−1.24)


= Φ(1.86) − (1 − Φ(1.24))
= 0.9686 − (1 − 0.8925)
= 0.8611.

[Figure: standard normal density curve, fZ(z), with the area between z = −1.24 and z = 1.86 shaded.]

Figure 4.4: Standard normal distribution depicting P(−1.24 < Z < 1.86) as the shaded areas.
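The same illustrative check works here. Note that `NormalDist.cdf` accepts negative z-values directly, so the symmetry argument needed when working from the tables is handled automatically:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

# Phi(1.86) - Phi(-1.24); the cdf handles negative arguments directly
p = Z.cdf(1.86) - Z.cdf(-1.24)
print(round(p, 4))  # 0.8611
```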


4.10.2 The general normal distribution


There exists an infinite number of normal distributions due to the infinite pairs of
parameter values of µ and σ 2 , since −∞ < µ < ∞ and σ 2 > 0. The good news is that
Table 4 of the New Cambridge Statistical Tables can be used to determine probabilities
for any normal random variable X, such that X ∼ N (µ, σ 2 ).
To do so, we need a little bit of magic, known as standardisation. This is a special
transformation which converts X ∼ N (µ, σ 2 ) into Z ∼ N (0, 1).

The transformation formula for standardisation

If X ∼ N(µ, σ²), then the transformation:

    Z = (X − µ)/σ

creates a standard normal random variable, i.e. Z ∼ N(0, 1). So to standardise X we subtract its mean and divide by its standard deviation.

Example 4.13 Suppose X ∼ N (5, 4). What is P (5.8 < X < 7.0)?
We have:

    P(5.8 < X < 7.0) = P((5.8 − µ)/σ < (X − µ)/σ < (7.0 − µ)/σ)
                     = P((5.8 − 5)/√4 < (X − 5)/√4 < (7.0 − 5)/√4)
                     = P(0.40 < Z < 1.00)
                     = P(Z ≤ 1.00) − P(Z ≤ 0.40)
                     = 0.8413 − 0.6554   (from Table 4)
                     = 0.1859.
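Standardisation is exactly what a computer performs implicitly. As an illustrative cross-check (not the course method), `NormalDist` can be given µ and σ directly; note it is parameterised by the standard deviation σ = 2, not the variance σ² = 4:

```python
from statistics import NormalDist

X = NormalDist(mu=5, sigma=2)  # sigma is the standard deviation, i.e. sqrt(4)

p = X.cdf(7.0) - X.cdf(5.8)
print(round(p, 4))  # 0.1859
```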

4.11 Sampling distributions


Probability distributions often have associated parameters, such as the mean, µ, and the
variance, σ 2 , for the normal distribution. By convention, Greek letters are used to
denote population parameters, whose values in practice are typically unknown.
The next few chapters cover statistical inference, whereby we infer the values of unknown population parameters based on sample data. As an example,
suppose we wanted to investigate the average height of an adult population. It is
reasonable to assume that height is (approximately) a normally-distributed random
variable with a mean µ and a variance σ 2 . What are the exact values of these


parameters? To know these values precisely would require data on the heights of the
entire adult population, which for a large population is just not feasible!
Population sizes, denoted by N , are typically very large and clearly no-one has the
time, money or patience to undertake such a marathon data collection exercise. Instead
we opt to collect a sample (i.e. a subset of the population) of size n.2 Having collected
our sample, we then estimate the unknown population parameters based on the known
(i.e. observed) sample data. Specifically, we estimate population quantities based on
their respective sample counterparts.
A statistic (singular noun) is simply a known function of data. A sample statistic is
calculated from sample data. At this point, be aware of the following distinction
between an estimator and an estimate.

Estimator versus estimate

An estimator is a statistic (which is a random variable) describing how to obtain


an estimate (which is a real number) of a population parameter.

Example 4.14 The sample mean X̄ = Σ_{i=1}^{n} X_i/n is an estimator of the population
mean, µ. If we had drawn from the population the sample data:

x1 = 4, x2 = 8, x3 = 2 and x4 = 6

then the (point) estimate of µ would be x̄ = (4 + 8 + 2 + 6)/4 = 20/4 = 5.


Notice the notation – the estimator is written as ‘capital’ X̄ and is a random
variable, while the estimate is written as ‘lower case’ x̄ as it is computed for a
specific observed random sample, hence the value x̄ is fixed (i.e. constant) for that
particular sample. Had the observed sample instead been:

x1 = 2, x2 = 9, x3 = 1 and x4 = 4

(drawn from the same population), then the estimator of µ would still be
X̄ = Σ_{i=1}^{n} X_i/n, but the estimate would now be x̄ = (2 + 9 + 1 + 4)/4 = 16/4 = 4.

Example 4.14 demonstrates that the values of sample statistics vary from sample to
sample due to the random nature of the sampling process. Hence estimators are random
variables with corresponding probability distributions, known as sampling distributions.
We proceed to study these sampling distributions.

Sampling distribution

A sampling distribution is the probability distribution of an estimator.

2
If n = N , and we sample without replacement, then we have obtained a complete enumeration of the
population, i.e. a census. Most of the time n ≪ N, where '≪' means 'much less than'.


Before we proceed, let us take a moment to review some population quantities and their
respective sample counterparts, as shown in Table 4.1.

Population quantity Sample counterpart


Probability distribution Histogram
(Population) mean, µ (Sample) mean, x̄
(Population) variance, σ 2 (Sample) variance, s2
(Population) standard deviation, σ (Sample) standard deviation, s
(Population) proportion, π (Sample) proportion, p

Table 4.1: Population quantities and their sample counterparts.



When the values of parameters are unknown, we must estimate them using sample data.
If the sample is (approximately) representative of the population from which it is
drawn, then the sample characteristics should be (approximately) equal to their
corresponding population characteristics. Hence a density histogram of the sample data
should look very similar to the corresponding population distribution. For example, if
X ∼ N (µ, σ 2 ), then a histogram of sample data drawn from this distribution should be
(approximately) bell-shaped. Similarly, common descriptive statistics should be
(approximately) equal to the corresponding population quantities. For example,

x̄ ≈ E(X) and s2 ≈ Var(X)

i.e. the sample mean should be ‘close to’ the expected value of X, and the sample
variance should be ‘close to’ the variance of X.
The precision (or quality) of point estimates such as x̄ and s2 will depend on the sample
size n, and in principle on the population size N , if finite. In practice if N is very large
relative to n (i.e. n ≪ N), then we can use approximations which are more than
adequate for practical purposes, but would only be completely accurate if the
population truly was infinite. In what follows, we assume N is large enough to be
treated as infinite.

4.12 Sampling distribution of X̄


Let us consider a large population (large enough to be treated as if it was infinite)
which is normally distributed, i.e. N (µ, σ 2 ). Suppose we take an initial random sample
of size n and obtain the sample mean, x̄1 . Next we take a second, independent sample
(from the same initial population) and compute x̄2 . Continue taking new, independent
samples, resulting in a series of sample means: x̄1 , x̄2 , x̄3 , . . ..
These values will, of course, vary from sample to sample, as we saw in Example 4.14
with x̄1 = 5 and x̄2 = 4. Hence if we constructed a density histogram of these values we
would see the empirical sampling distribution of X̄. Fortunately, however, we are often
able to determine the exact, theoretical form of sampling distributions without having
to resort to such empirical simulations.


When stating the sampling distribution of the sample mean, X̄, we distinguish between
sampling from populations which have a normal distribution and populations which
have a non-normal distribution.

Sampling distribution of X̄ when sampling from normal populations

When sampling from N(µ, σ²), the sampling distribution of the sample mean is:

    X̄ ∼ N(µ, σ²/n).

The central limit theorem applies to determine the sampling distribution of X̄ when
sampling from non-normal distributions.

Sampling distribution of X̄ when sampling from a non-normal population

The central limit theorem (CLT) says that for a random sample from (nearly)
any non-normal distribution with mean µ and variance σ 2 , then:

    X̄ ∼ N(µ, σ²/n)

approximately, when n is sufficiently large. We can then say that X̄ is asymptotically normally distributed with mean µ and variance σ²/n.
‘Nearly’ because the CLT requires that the variance of the population distribution
is finite. If it is not, the CLT does not hold, but such distributions are not common.

So the difference between sampling from normal and non-normal populations is that X̄
is exactly normally distributed in the former case, but only approximately in the latter
case. The approximation is reasonable when n is at least 30, as a rule-of-thumb.
Since this is an asymptotic approximation (i.e. as n → ∞), the larger n is, the better the normal approximation.
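The CLT can be seen empirically by simulation. The sketch below is illustrative only (the choice of the exponential distribution, the sample size and the number of replications are all arbitrary): it repeatedly draws samples of size n = 50 from a decidedly non-normal distribution, the exponential with µ = 1 and σ² = 1, and confirms that the resulting sample means have mean ≈ µ and variance ≈ σ²/n:

```python
import random
from statistics import mean, pvariance

random.seed(2024)
n, reps = 50, 5000

# 5,000 independent samples of size n = 50 from Exponential(1): mu = 1, sigma^2 = 1
sample_means = [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

print(round(mean(sample_means), 3))       # close to mu = 1
print(round(pvariance(sample_means), 4))  # close to sigma^2/n = 1/50 = 0.02
```

A density histogram of `sample_means` would look approximately bell-shaped, even though the parent exponential distribution is heavily right-skewed.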
Notice that Var(X̄) = σ 2 /n depends on the sample size n. It is easy to see that as n
increases (i.e. when we take a larger sample), the variance of X̄ gets smaller. That is,
sample means are less variable than single values (individual observations) from the
population. Indeed, as n → ∞ (as n tends to infinity), Var(X̄) → 0, and so the sample
mean tends to the true value µ (the value we are trying to estimate). Hence the larger
the sample size, the greater the accuracy in estimation (a good thing), but the greater
the total data collection cost (a bad thing). As happens so often in life we face a
trade-off, here between accuracy and cost.
Up to now, we have referred to the square root of the variance of a random variable as
the standard deviation. In sampling theory, the square root of the variance of an
estimator is called the standard error.


Standard error

The standard deviation of an estimator is called its standard error.


For example, Var(X̄) = σ²/n, hence the standard error of X̄, denoted S.E.(X̄), is:

    S.E.(X̄) = √Var(X̄) = √(σ²/n) = σ/√n.

Note the standard error decreases as n increases.

To summarise, if X̄ is the sample mean of a random sample of n values from N(µ, σ²), then X̄:

is normally distributed

has mean µ

has variance σ²/n and standard error σ/√n.

If X̄ is the sample mean of a random sample of n values from a non-normal distribution, then X̄:

is asymptotically normally distributed (but often n ≥ 30 is sufficient in practice)

has mean µ

has variance σ²/n and standard error σ/√n.

We can use standardisation to compute probabilities involving X̄, but we must remember to divide by σ/√n rather than σ, since the variance of X̄ is σ²/n. Note that σ/√n is known as the standard error of X̄.

Example 4.15 A random sample of 16 values is taken from a normal distribution with µ = 25 and σ = 3. What is P(X̄ > 26)?

Since the population is normally distributed, the sampling distribution of X̄ is exact, i.e. we have:

    X̄ ∼ N(µ, σ²/n) = N(25, 9/16).

If Z is the standard normal random variable, then:

    P(X̄ > 26) = P(Z > (26 − 25)/(3/√16)) = P(Z > 1.33) = 1 − Φ(1.33) = 1 − 0.9082 = 0.0918

using Table 4 of the New Cambridge Statistical Tables.
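As an illustrative cross-check, the calculation can be reproduced by applying `NormalDist` to the sampling distribution of X̄ (standard error 3/√16 = 0.75). The exact answer, 0.0912, differs slightly from the table answer 0.0918 because the table method rounds z = 1.3̇ down to 1.33:

```python
from statistics import NormalDist

# Sampling distribution of the sample mean: N(25, 9/16), i.e. standard error 3/4
xbar = NormalDist(mu=25, sigma=3 / 16 ** 0.5)

p = 1 - xbar.cdf(26)
print(round(p, 4))  # 0.0912 (Table 4 gives 0.0918 after rounding z to 1.33)
```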


4.13 Overview of chapter


This chapter has introduced the concept of random variables, and distinguished
between discrete random variables (used for variables which can be counted) and
continuous random variables (used for variables which can be measured). Expectations
and variances of discrete random variables were presented. The normal distribution has
also been introduced, as well as the concept of sampling distributions of statistics. A
sampling distribution is the probability distribution of an estimator of an unknown
parameter. Often, we are interested in estimating the population mean, µ, and the
sampling distribution of X̄ was stated when sampling from a normally-distributed
population, and the central limit theorem was discussed for cases when sampling from a
non-normal population.

4.14 Key terms and concepts


Central limit theorem Random variable
Continuous Sample counterpart
Discrete Sample space
Estimate Sampling distribution
Estimator Standard deviation
Expected value (mean) Standard error
Normal distribution Standard normal distribution
Population quantity Standardisation
Probability density function Statistic
Probability distribution Statistical inference
Probability mass function Variance

4.15 Sample examination questions


1. The random variable X takes the values −1, 1 and 2 according to the following
probability distribution:

x −1 1 2
pX (x) 0.20 k 4k

(a) Determine the constant k and, hence, write down the probability distribution
of X.
(b) Find E(X) (the expected value of X).
(c) Find Var(X) (the variance of X).

2. The scores on a verbal reasoning test are normally distributed with a population
mean of µ = 100 and a population standard deviation of σ = 10.
(a) What is the probability that a randomly chosen person scores at least 105?


(b) A simple random sample of size n = 20 is selected. What is the probability that the sample mean will be between 97 and 104? (You may use the nearest values provided in the statistical tables.)

3. The weights of a large population of animals have a mean of 7.3 kg and a standard
deviation of 1.9 kg.
(a) Assuming that the weights are normally distributed, what is the probability
that a random selection of 40 animals from this population will have a mean
weight between 7.0 kg and 7.4 kg?
(b) A researcher stated that the probability you calculated is approximately
correct even if the distribution of the weights is not normal. Do you agree?
Justify your answer.

4.16 Solutions to Sample examination questions


1. (a) We must have Σ_i p(x_i) = 0.20 + 5k = 1, hence k = 0.16. Therefore, the
probability distribution is:
x −1 1 2
pX (x) 0.20 0.16 0.64
(b) We have:

    E(X) = Σ_i x_i p(x_i) = (−1) × 0.20 + 1 × 0.16 + 2 × 0.64 = 1.24.

(c) We have:

    E(X²) = Σ_i x_i² p(x_i) = (−1)² × 0.20 + 1² × 0.16 + 2² × 0.64 = 2.92

hence:

    Var(X) = E(X²) − (E(X))² = 2.92 − (1.24)² = 1.3824.

An alternative method to find the variance is through the formula Σ_i (x_i − µ)² p(x_i), where µ = E(X) was found in part (b).

2. The first part just requires knowledge of the fact that X is a normal random
variable with mean µ = 100 and variance σ 2 = 100. However, for the second part of
the question it is important to note that X̄, the sample mean, is also a normal
random variable with mean µ and variance σ 2 /n. Direct application of this fact
then yields that:
    X̄ ∼ N(µ, σ²/n) = N(100, 5).
For both parts, the basic property of normal random variables for this question is
that if X ∼ N (µ, σ 2 ), then:
    Z = (X − µ)/σ ∼ N(0, 1).


Note also that:


• P (Z < a) = P (Z ≤ a) = Φ(a)
• P (Z > a) = P (Z ≥ a) = 1 − P (Z ≤ a) = 1 − P (Z < a) = 1 − Φ(a)
• P (a < Z < b) = P (a ≤ Z < b) = P (a < Z ≤ b) = P (a ≤ Z ≤ b) = Φ(b) − Φ(a).
The above is all you need to find the requested probabilities.
(a) We have X ∼ N (100, 100), hence:
 
    P(X ≥ 105) = P(Z ≥ (105 − 100)/10) = P(Z ≥ 0.50) = 1 − Φ(0.50) = 1 − 0.6915 = 0.3085

using Table 4 of the New Cambridge Statistical Tables.
(b) We have X̄ ∼ N (100, 5), hence:
 
    P(97 ≤ X̄ ≤ 104) = P((97 − 100)/√5 ≤ Z ≤ (104 − 100)/√5) = P(−1.34 ≤ Z ≤ 1.79)
                    = Φ(1.79) − (1 − Φ(1.34))
                    = 0.9633 − (1 − 0.9099)
                    = 0.8732.

3. (a) Since X̄ ∼ N(7.3, (1.9)²/40), we have:

       P(7.0 ≤ X̄ ≤ 7.4) = P((7.0 − 7.3)/(1.9/√40) ≤ Z ≤ (7.4 − 7.3)/(1.9/√40))
                        = P(−1.00 ≤ Z ≤ 0.33)
                        = Φ(0.33) − (1 − Φ(1.00))
                        = 0.6293 − (1 − 0.8413)
                        = 0.4706

using Table 4 of the New Cambridge Statistical Tables.


(b) The researcher’s statement is justified since we can apply the central limit
theorem given the large sample size n.
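As an illustrative cross-check of part (a) (not the course method, which uses Table 4), the probability can be computed directly from the sampling distribution. The exact value, 0.4714, differs slightly from 0.4706 because the table method rounds the z-values to −1.00 and 0.33:

```python
from statistics import NormalDist

# Sampling distribution of the mean weight: N(7.3, (1.9)^2/40)
xbar = NormalDist(mu=7.3, sigma=1.9 / 40 ** 0.5)

p = xbar.cdf(7.4) - xbar.cdf(7.0)
print(round(p, 4))  # 0.4714
```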

Chapter 5
Interval estimation

5.1 Synopsis of chapter


Parameters are attributes of a population, such as its mean and its variance. Often we
do not know the values of parameters, so we have to estimate them using a random
sample. A ‘random’ sample ensures that there is no bias in the selection, however there
is no certainty that it is a representative sample. This means that any estimate of a
parameter is uncertain and so we take a ‘point’ estimate (a number) and convert it into
an ‘interval’ estimate (an interval), known as a confidence interval. This chapter looks
at interval estimation based on one and two samples, focusing on estimating population
means and proportions.

5.2 Learning outcomes


After completing this chapter, and having completed the essential reading and
activities, you should be able to:

explain the concept of a confidence interval

construct and interpret confidence intervals for a population mean and proportion

perform sample size determination

state the sampling distributions of the difference between two sample means and
two sample proportions

construct and interpret confidence intervals for the difference between two
population means and two population proportions.

5.3 Recommended reading


Abdey, J. Business Analytics: Applied Modelling and Prediction. (London: SAGE
Publications, 2023) 1st edition [ISBN 9781529774092] Chapter 9.

5.4 Introduction
Researchers often find themselves facing decisions related to populations, which can
consist of various members, such as a group of consumers in a marketing study or a set


of manufactured goods on a production line. The types of information needed for these
decisions can take the form of statistical parameters like a population mean (denoted
µ) – for example, ‘What is the average number of consumers per week?’ – or a
population proportion (denoted π) – for example, ‘What proportion of manufactured
goods is defective?’. The decisions stemming from collected data may range from
adjusting prices, to modifying the production process for optimal quality.
In most cases it is impossible to gather information about the whole population due to
time and resource constraints. Consequently, researchers must rely on collecting data
from a representative sample and then make inferences about the broader
population based on the sample. We consider different types of sampling techniques in
Chapter 9, but for now we note that a random sample is free of selection bias with
the expectation (but no guarantee) of a representative sample. We would expect the
representativeness to improve as the sample size increases.
Now, let’s delve into the world of statistical inference and explore concepts like
confidence intervals and sample size determination. How do these elements influence the
accuracy of estimating population means and proportions? And how can researchers
effectively utilise this information in their research?
Note that inferring information about a parent (or theoretical) population
using observations from a (random) sample is the primary concern of
statistical inference.

5.4.1 Principle of confidence intervals


A point estimate is our ‘best guess’ of an unknown population parameter based on an
observed random sample of size n. Due to the random nature of sample data, we do not
expect to estimate the parameter exactly (unless we are very lucky!), i.e. an estimation
error (the difference between an estimate and the true parameter value) is almost
certain to occur. Hence there is some uncertainty in our estimation process, so it would
be advisable to communicate the level of uncertainty (or imprecision) in conjunction
with the point estimate.
Standard errors (the square root of the variance, i.e. the standard deviation, of a
descriptive statistic) act as a measure of estimation (im)precision, and these are used in
the construction of the confidence intervals covered in this chapter. Informally, you can
think of (most) confidence intervals as representing our ‘best guess plus/minus a bit’,
where the magnitude of this ‘bit’ depends on the:

confidence level

sample size

amount of variation in the population (and hence in the sample).

This method of expressing the accuracy of an estimate is easily understood and requires
no statistical sophistication for interpretation.
More formally, an x% confidence interval covers the unknown parameter with x%
probability over repeated samples. A visual illustration is provided in Figure 5.1. The
red and blue lines each represent a confidence interval (each confidence interval is


obtained from a different sample). In total there are 10 lines (8 blue, 2 red) reflecting 10
independent random samples drawn from the same population. In this example, 80% of
the time (8 out of the 10 confidence intervals) happen to cover µ (whose true value is
indicated by the green arrow). If this 80% figure was the long-run percentage (i.e. over
many repeated samples), then such confidence intervals would have an 80% coverage
probability. Hence 80% of the time we would obtain a confidence interval for µ which
covers (or spans, or includes) µ. In practice, though, we may only have one sample,
hence one confidence interval. With respect to Figure 5.1 there is a 20% risk that it is a
‘red’ interval. If it was the left red confidence interval, this would lead us to think µ is
smaller than it actually is; if it was the right red confidence interval, this would lead us
to think µ is larger than it actually is.

Figure 5.1: Coverage (in blue) and non-coverage (in red) of µ for several confidence
intervals (one confidence interval per sample).

It is important to distinguish point estimation (using sample data to obtain a numerical


estimate of an unknown population parameter) from a confidence interval (an interval
estimate of the parameter whose width indicates how reliable the point estimate is).
At this point, make sure you are clear about the distinction between statistics (such as
descriptive statistics like the sample mean, x̄, as covered in Chapter 2) and parameters
(population characteristics, like the population mean, µ, introduced in Chapter 4).
Clearly, a very wide confidence interval would show that our estimate was not very
reliable, as we could not be sure that its value was close to the true parameter value,
whereas a narrow confidence interval would correspond to a more reliable estimate. The
degree of confidence which we have in our confidence interval can be expressed
numerically. An ideal situation is a narrow interval with a high coverage probability
(typically greater than the 80% achieved in Figure 5.1). With these points in mind we
now show how such intervals can be computed from sample data in some basic
situations.
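The coverage idea behind Figure 5.1 is easy to demonstrate by simulation. The sketch below is an illustration only (the population values µ = 10, σ = 2 and the sample size n = 25 are arbitrary choices): it uses the known-variance 95% interval x̄ ± 1.96σ/√n, derived in the next section, and counts how often the interval covers µ over many repeated samples:

```python
import random
from statistics import mean

random.seed(1)
mu, sigma, n = 10.0, 2.0, 25
half_width = 1.96 * sigma / n ** 0.5  # 95% margin of error, variance known

trials = 10_000
# The interval covers mu exactly when |x_bar - mu| is within the half-width
covered = sum(
    1
    for _ in range(trials)
    if abs(mean(random.gauss(mu, sigma) for _ in range(n)) - mu) < half_width
)
print(covered / trials)  # close to the nominal coverage probability 0.95
```

Roughly 5% of the simulated intervals are 'red' intervals in the sense of Figure 5.1: they miss µ entirely.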


5.5 Interval estimation for a population mean

5.5.1 Variance known (σ 2 known)


Suppose a random sample of size n is drawn from N (µ, σ 2 ), i.e. we assume the variable
of interest has a normal distribution in the population, with σ 2 known. We use the
sample mean, X̄, to estimate µ such that its sampling distribution is:
    X̄ ∼ N(µ, σ²/n).
We require a confidence interval defined by a pair of values (the endpoints of the
confidence interval) such that the probability the interval covers µ is high (the concept
of a coverage probability was shown in Figure 5.1).
Since X̄ is normally distributed it follows that, upon standardising X̄, we have:

    Z = (X̄ − µ)/(σ/√n) ∼ N(0, 1).    (5.1)
Therefore, assuming a 95% coverage probability:
 
    P(−1.96 < (X̄ − µ)/(σ/√n) < 1.96) = 0.95.
This means that the central 95% of the N (0, 1) curve is between −1.96 and 1.96. This
can be verified in Excel using =NORM.S.DIST(1.96,1)-NORM.S.DIST(-1.96,1), or by
using Table 4 of the New Cambridge Statistical Tables where Φ(1.96) = 0.9750.

Excel function: NORM.S.DIST

=NORM.S.DIST(z, cumulative) returns the standard normal cumulative


distribution (has a mean of zero and a standard deviation of one), where:

z is the value for which you want the distribution

cumulative is a logical value: for the cumulative distribution function, use 1 (or
TRUE); for the probability density function, use 0 (or FALSE).


Since σ/√n > 0 (a standard error is always strictly positive), then:

    0.95 = P(−1.96 < (X̄ − µ)/(σ/√n) < 1.96)              (from above)
         = P(−1.96 × σ/√n < X̄ − µ < 1.96 × σ/√n)         (multiply through by σ/√n)
         = P(−1.96 × σ/√n < µ − X̄ < 1.96 × σ/√n)         (multiply through by −1)
         = P(X̄ − 1.96 × σ/√n < µ < X̄ + 1.96 × σ/√n).     (et voilà!)


Note that when we multiply by −1 to go from the second to the third line, the
inequality sign is reversed.

Endpoints for a 95% confidence interval for µ (variance known)

When sampling from a normal distribution, a 95% confidence interval for µ has endpoints X̄ ± 1.96 × σ/√n. Hence the reported confidence interval would be:

    (x̄ − 1.96 × σ/√n, x̄ + 1.96 × σ/√n).

This is a simple, but very important, result. As we shall see, it can be applied to give
confidence intervals in many different situations such as for the estimation of a mean, a
proportion, as well as a difference between means and a difference between proportions
(covered later in this chapter).
The above derivation was for a 95% confidence interval, i.e. with a 95% coverage
probability, which is a generally accepted confidence requirement. Of course, it is
possible to have different levels of confidence, say 90% or 99% (the 80% demonstrated
in Figure 5.1 is much less common). Fortunately, we can use the same argument as
above. However, a different multiplier coefficient drawn from the standard normal
distribution is required (i.e. not 1.96). For convenience, key values are given below,
where zα denotes the z-value which cuts off 100α% probability in the upper tail of the
standard normal distribution.

For 90% confidence, use the multiplier z0.05 = 1.645.

For 95% confidence, use the multiplier z0.025 = 1.96.

For 99% confidence, use the multiplier z0.005 = 2.576.
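These multipliers are simply quantiles of the standard normal distribution: for 100(1 − α)% confidence the multiplier is z_{α/2} = Φ⁻¹(1 − α/2). The following illustrative sketch recovers them with `NormalDist.inv_cdf` and then computes a 95% interval for made-up data (the values x̄ = 50, σ = 8 and n = 64 are hypothetical, chosen only for the example):

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

# Recover the multipliers: z_{alpha/2} = inverse cdf evaluated at 1 - alpha/2
for conf in (0.90, 0.95, 0.99):
    print(conf, round(Z.inv_cdf(1 - (1 - conf) / 2), 3))  # 1.645, 1.96, 2.576

# 95% confidence interval with hypothetical x_bar = 50, sigma = 8, n = 64
x_bar, sigma, n = 50.0, 8.0, 64
half_width = 1.96 * sigma / n ** 0.5  # 1.96 * 8/8 = 1.96
print(round(x_bar - half_width, 2), round(x_bar + half_width, 2))  # 48.04 51.96
```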

Unfortunately, the method used so far in this section is limited by the assumption that
σ 2 is known. This means, in effect, that we need to know the true population variance,
but we do not know the population mean – since this is what we are trying to estimate.
This seems implausible. Why would you know σ 2 , but not know µ?
In such cases it will be necessary to estimate the standard error from the data. This
requires a modification both of the approach and, in particular, of the general formula
for the endpoints of a confidence interval.

5.5.2 Variance unknown (σ 2 unknown)


Here we consider the case where the standard error of X̄ is estimated, such that by using the sample standard deviation, S, we use S/√n to estimate the true standard error of σ/√n. It may be tempting to think, given (5.1), that simply substituting σ/√n with S/√n also results in:

    (X̄ − µ)/(S/√n) ∼ N(0, 1)


i.e. that the distribution remains the standard normal distribution, but because of the
additional sampling variability of the estimated standard error, this new, transformed
function of the data will have a more dispersed distribution than the standard normal –
the Student’s t distribution. Indeed we have that:
    (X̄ − µ)/(S/√n) ∼ tn−1

indicating a t distribution with n − 1 ‘degrees of freedom’.

5.5.3 Student’s t distribution


‘Student’ was the pen name of William S. Gosset (1876–1937) who is credited with
developing this distribution. Gosset used a pen name as he was forced to publish his
discovery anonymously because his employer did not want the discovery to be public.
You should be familiar with its generic shape, which resembles the standard normal (i.e.
bell-shaped and symmetric about 0) but with ‘fatter’ tails. We get different versions of
the t distribution for different degrees of freedom, ν, the parameter of the
distribution. Graphical examples of the t distribution for various degrees of freedom are
given in Figure 5.2, but note that as ν → ∞ (in words, ‘as the degrees of freedom tend
to infinity’), we approach the standard normal distribution; that is, tν → N (0, 1) as
ν → ∞ (in words, ‘the Student’s t distribution tends to the standard normal
distribution as the degrees of freedom tend to infinity’).

Figure 5.2: Student’s t distribution for various degrees of freedom as indicated.

For our purposes, we will use this distribution whenever we are performing statistical
inference for population means when population variances are unknown, and hence are


estimated from the data. The correct degrees of freedom will depend on the degrees of
freedom used to estimate the variance.
Assuming a 95% coverage probability, for given degrees of freedom ν = n − 1 we can
find t0.025, n−1 such that:

P (−t0.025, n−1 < (X̄ − µ)/(S/√n) < t0.025, n−1 ) = 0.95

where t0.025, n−1 cuts off 2.5% probability in the upper tail of the t distribution with
n − 1 degrees of freedom. On rearranging the inequality within the brackets we get:

P (X̄ − t0.025, n−1 × S/√n < µ < X̄ + t0.025, n−1 × S/√n) = 0.95.

Endpoints for a 95% confidence interval for µ (variance unknown)



A 95% confidence interval for µ has endpoints X̄ ± t0.025, n−1 × S/√n, leading to a
reported confidence interval of the form:

(x̄ − t0.025, n−1 × s/√n, x̄ + t0.025, n−1 × s/√n)

where t0.025, n−1 is the t-value which cuts off 2.5% probability in the upper tail of the
t distribution with n − 1 degrees of freedom.

The t-values can be found manually in Table 10 of the New Cambridge Statistical
Tables, and they can also be obtained in Excel using the T.INV function.

Excel function: T.INV

=T.INV(probability, deg freedom) returns the inverse of the left-tailed probability
of the Student's t distribution, where:

probability is the probability associated with the Student's t distribution

deg freedom is the number of degrees of freedom with which to characterise the
distribution.

Example 5.1 Suppose T ∼ t16 , i.e. T follows a t distribution with 16 degrees of
freedom. Using Table 10, we have:
P (T ≥ 1.337) = 0.10 and P (T ≥ 2.583) = 0.01.
Using Excel, =T.INV(0.9,16) and =T.INV(0.99,16) return 1.337 and 2.583,
respectively.
For lower-tail probabilities, we use the fact that the t distribution is symmetric
about zero. We have:
P (T ≤ −1.746) = 0.05 and P (T ≤ −2.120) = 0.025.
Using Excel, =T.INV(0.05,16) and =T.INV(0.025,16) return −1.746 and −2.120,
respectively.
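These look-ups need not be done in Excel. As a sketch, assuming the third-party scipy library is available, its `t.ppf` inverse CDF plays the same role as T.INV (variable names here are ours):

```python
from scipy.stats import t

# Degrees of freedom as in Example 5.1
nu = 16

# Upper-tail cut-offs: t.ppf is the inverse CDF, the analogue of Excel's T.INV
upper = [t.ppf(0.90, nu), t.ppf(0.99, nu)]   # ~1.337 and ~2.583
# Lower-tail values follow by symmetry of the t distribution about zero
lower = [t.ppf(0.05, nu), t.ppf(0.025, nu)]  # ~-1.746 and ~-2.120

print([round(v, 3) for v in upper])  # [1.337, 2.583]
print([round(v, 3) for v in lower])  # [-1.746, -2.12]
```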


Important note

In the following applications, whenever the t distribution is used and we have large
sample size(s) (hence large degrees of freedom), it is acceptable to use standard
normal values as approximations due to the tendency of the t distribution to the
standard normal distribution as the degrees of freedom approach infinity, i.e. we
have that:
tν → N (0, 1) as ν → ∞.
What constitutes a ‘large’ sample size is rather subjective. However, as a rule of
thumb, treat anything over 30 as ‘large’.

5.5.4 Confidence interval for a single mean (σ 2 known)


Given observed sample values x1 , x2 , . . . , xn , the point estimate of µ is
x̄ = ∑ xi /n (summing over i = 1, . . . , n).
Assuming the (population) variance, σ², is known, the standard error of x̄ is σ/√n (and
hence also known).

Confidence interval endpoints for a single mean (σ 2 known)

In such instances, we use the standard normal distribution when constructing a
100(1 − α)% confidence interval with endpoints:

x̄ ± zα/2 × σ/√n ⇒ (x̄ − zα/2 × σ/√n, x̄ + zα/2 × σ/√n)

where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the
standard normal distribution to ensure a 100(1 − α)% confidence interval.
For example, for α = 0.05, we have a 100(1 − 0.05)% = 95% confidence interval, and
we require the z-value which cuts off α/2 = 0.025, i.e. 2.5% probability in the upper
tail of the standard normal distribution, which is 1.96, and can be obtained using
=NORM.S.INV(0.975) or from the bottom row of Table 10.

Example 5.2 In product development, we often find ourselves engaged in the
process of making precise measurements.
Measurements of the diameter of a random sample of 200 ball bearings produced by
a machine gave a sample mean of x̄ = 0.824. The population standard deviation, σ,
is 0.042.
We compute 95% and 99% confidence intervals for the true mean value of the
diameter of the ball bearings.
Since we are told that σ = 0.042, a 95% confidence interval, where α = 0.05, has
endpoints:
x̄ ± 1.96 × σ/√n = 0.824 ± 1.96 × 0.042/√200 = 0.824 ± 0.006


where z0.025 = 1.96 is the z-value which cuts off 100α/2% = 2.5% probability in the
upper tail of the standard normal distribution obtained using =NORM.S.INV(0.975),
or Table 10 (using the bottom row, since tν → N (0, 1) as ν → ∞). In other words,
the interval is (0.818, 0.830) which covers the true mean with a probability of 95%.
To compute a 99% confidence interval (where α = 0.01), since σ is known we require
zα/2 = z0.005 , that is the z-value which cuts off 0.5% probability in the upper tail of
the standard normal distribution.
We have that z0.005 = 2.576, obtained using =NORM.S.INV(0.995) (or Table 10), so a
99% confidence interval has endpoints:
x̄ ± 2.576 × σ/√n = 0.824 ± 2.576 × 0.042/√200 = 0.824 ± 0.008.
In other words, the interval is (0.816, 0.832). Note the higher level of confidence
has resulted in a wider confidence interval. This is as expected since, other things
equal, the 'price' of the benefit of a higher confidence level is the cost of a wider
confidence interval.
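The arithmetic of Example 5.2 can be checked with a short Python sketch using only the standard library; `NormalDist().inv_cdf` plays the role of NORM.S.INV, and the variable names are ours:

```python
from math import sqrt
from statistics import NormalDist

# Ball-bearing data from Example 5.2
xbar, sigma, n = 0.824, 0.042, 200

# z-values: inv_cdf is the inverse standard normal CDF, like Excel's NORM.S.INV
z95 = NormalDist().inv_cdf(0.975)  # ~1.96
z99 = NormalDist().inv_cdf(0.995)  # ~2.576

se = sigma / sqrt(n)               # (known) standard error of the sample mean
ci95 = (xbar - z95 * se, xbar + z95 * se)
ci99 = (xbar - z99 * se, xbar + z99 * se)

print(tuple(round(v, 3) for v in ci95))  # (0.818, 0.83)
print(tuple(round(v, 3) for v in ci99))  # (0.816, 0.832)
```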

5.5.5 Confidence interval for a single mean (σ 2 unknown)


In practice it is unusual for σ 2 to be known. (Why would we know σ 2 , but not µ?)
However, we can estimate σ² with the sample variance, i.e. use the descriptive statistic:

S² = ∑ (Xi − X̄)²/(n − 1), summing over i = 1, . . . , n.

Confidence interval endpoints for a single mean (σ 2 unknown)

In such instances, we use the t distribution when constructing a 100(1 − α)%
confidence interval with endpoints:

x̄ ± tα/2, n−1 × s/√n ⇒ (x̄ − tα/2, n−1 × s/√n, x̄ + tα/2, n−1 × s/√n)

where tα/2, n−1 is the t-value which cuts off 100α/2% probability in the upper tail of
the t distribution with n − 1 degrees of freedom, obtained from T.INV, alternatively
using Table 10.

Example 5.3 A researcher carries out a sampling exercise in order to estimate the
average height of a species of tree. A random sample of 12 trees gives the following
descriptive statistics (in feet):
x̄ = 41.625 and s = 7.840.
We seek a 95% confidence interval for µ, the mean height of all trees of this species.
The estimated standard error of the sample mean is:

s/√n = 7.840/√12 = 2.2632


on n − 1 = 11 degrees of freedom. Hence a 95% confidence interval for µ has
endpoints:
41.625 ± 2.201 × 2.2632
i.e. the confidence interval (in feet) is:

(36.64, 46.61).

Two important points are the following.

Make sure you see where 2.201 comes from. It is t0.025, 11 , i.e. the t-value above
which lies 2.5% probability for a Student’s t distribution with 11 degrees of
freedom. In Excel, this is obtained using =T.INV(0.975,11). This can also be
found in Table 10.

Make sure you report confidence intervals in the form (36.64, 46.61). That is,
you must compute the actual endpoints and report these as an interval, as that
is what a confidence interval is! Note the lower endpoint should be given first.
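Example 5.3's computation can be reproduced with a standard-library Python sketch; the t-value 2.201 is taken from Table 10 (or =T.INV(0.975,11)), as in the example, and the variable names are ours:

```python
from math import sqrt

# Tree-height summary statistics from Example 5.3
xbar, s, n = 41.625, 7.840, 12

# t-value cutting off 2.5% upper-tail probability on n - 1 = 11 degrees of
# freedom, taken from Table 10 (or =T.INV(0.975,11))
t_val = 2.201

ese = s / sqrt(n)                      # estimated standard error, ~2.2632
half = t_val * ese
ci = (xbar - half, xbar + half)
print(tuple(round(v, 2) for v in ci))  # (36.64, 46.61)
```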

5.6 Confidence interval for a single proportion


We often want to estimate a population proportion, π, such as the proportion of
satisfied customers. To proceed, we use the sample proportion, P , as our estimator of π.
We now note the sampling distribution of the sample proportion.

Sampling distribution of the sample proportion

The sample proportion, P , is the estimator of the population proportion, π, such
that:

P ∼ N (π, π(1 − π)/n)

approximately, as n → ∞, by the central limit theorem.

This result is a consequence of the central limit theorem applied to the proportion of
successes for a ‘binomial’ distribution, or equivalently the parameter of a ‘Bernoulli’
distribution (both introduced in ST104B Statistics 2).
The variance of the sample proportion is:

Var(P ) = π(1 − π)/n

and so the standard error of the sample proportion (the square root) is:

S.E.(P ) = √(π(1 − π)/n).
Unfortunately, this depends on π, precisely what we are trying to estimate, hence the
true standard error is unknown, so must itself be estimated. As π is unknown, the best


we can do is replace it with our point estimate of it (the sample proportion p = r/n,
where there are r ‘successes’ in the sample of size n), hence the estimated standard error
is:

E.S.E.(p) = √(p(1 − p)/n) = √((r/n)(1 − r/n)/n).

Confidence interval endpoints for a single proportion

An approximate 100(1 − α)% confidence interval for a single proportion has endpoints:

p ± zα/2 × √(p(1 − p)/n) ⇒ (p − zα/2 × √(p(1 − p)/n), p + zα/2 × √(p(1 − p)/n))

where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the
standard normal distribution, obtained using NORM.S.INV. Such z-values can also be
obtained from the bottom row of Table 10.

Note that although we are estimating a variance, for proportions we do not use the t
distribution for the following two reasons.

The standard error has not been estimated by S².

The sample size n has to be large for the central limit theorem normal
approximation to hold, and so the standard normal distribution is appropriate in
this case.

Example 5.4 A survey is conducted by a bank to estimate the proportion of its
customers who would be interested in using a proposed new mobile telephone
banking service. If we denote the population proportion of customers who are
interested by π, and it is found that 68 out of 150 sampled customers were
interested, then we would estimate π by p = 68/150 = 0.453. Hence a 95%
confidence interval for π has endpoints:
0.453 ± 1.96 × √(0.453 × (1 − 0.453)/150) ⇒ (0.37, 0.53).
Note, for a 95% confidence interval, α = 0.05 and so we use z0.025 = 1.96 in the
computation of this 95% confidence interval.
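A standard-library Python check of Example 5.4 (variable names are ours):

```python
from math import sqrt
from statistics import NormalDist

# Bank survey data from Example 5.4: r interested out of n sampled
r, n = 68, 150
p = r / n                              # ~0.453

z = NormalDist().inv_cdf(0.975)        # ~1.96, the analogue of NORM.S.INV(0.975)
ese = sqrt(p * (1 - p) / n)            # estimated standard error of P
ci = (p - z * ese, p + z * ese)
print(tuple(round(v, 2) for v in ci))  # (0.37, 0.53)
```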

5.7 Sample size determination


The question ‘How large a sample do I need?’ is a common one when sampling. The
answer is: ‘it depends!’ Specifically, it
depends on the quality of inference which the researcher requires from the data. In the
estimation context, this can be expressed in terms of the accuracy of estimation. If the


researcher requires that there should be a 95% chance that the estimation error should
be no larger than e units (we refer to e as the tolerance on the estimation error), then
this is equivalent to having a 95% confidence interval of width 2e. Note here e
represents the half-width of the confidence interval since the point estimate is, by
construction, at the centre of the confidence interval.

Sample size determination for a single mean

To estimate µ to within e units with 100(1 − α)% confidence, we require a sample of
size:

n ≥ (zα/2 )² σ²/e²     (5.2)
where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the
standard normal distribution, obtained using NORM.S.INV. Such z-values can also be
obtained from the bottom row of Table 10.
Example 5.5 A random sample of 50 households is taken from a large population
of households in an area of a city. The sample mean and standard deviation of
weekly expenditure on beverages are £18 and £4, respectively. How many more
observations are required to estimate µ to within 1 unit with 99% confidence?
Here, e = 1, and we can assume σ = 4 since the initial sample with n = 50 is ‘large’
and hence we would expect s ≈ σ (i.e. we assume s = σ as a simplifying
assumption). For 99% confidence, we use z0.005 = 2.576. Hence, using (5.2), we have:

n ≥ ((2.576)² × 4²)/1² = 106.17.
Remembering that n must be an integer, the smallest n satisfying this is 107. (Note
that we round up, otherwise had we rounded down it would lead to less precision.)
So, 57 more observations are required.
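The sample-size formula (5.2) translates directly into a few lines of standard-library Python; here it reproduces Example 5.5 (names are ours):

```python
from math import ceil
from statistics import NormalDist

# Example 5.5: estimate mu to within e = 1 unit with 99% confidence,
# taking sigma ~ s = 4 from the initial sample of 50 households
e, sigma, n_initial = 1, 4, 50

z = NormalDist().inv_cdf(0.995)          # ~2.576 for 99% confidence
n_required = ceil((z * sigma / e) ** 2)  # (5.2), rounded up to an integer

print(n_required)              # 107
print(n_required - n_initial)  # 57 more observations
```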

Example 5.6 The reaction time of a patient to a certain stimulus is known to have
a standard deviation of 0.05 seconds. How large a sample of measurements must a
psychologist take in order to be 95% confident and 99% confident, respectively, that
the error in the estimate of the mean reaction time will not exceed 0.01 seconds?
For 95% confidence, we use z0.025 = 1.96. So, using (5.2), n is to be chosen such that:
n ≥ ((1.96)² × (0.05)²)/(0.01)².
Hence we find that n ≥ 96.04. Since n must be an integer, 97 observations are
required to achieve an error of 0.01 or less with 95% confidence.
For 99% confidence, we use z0.005 = 2.576. So, using (5.2), n is to be chosen such that:
n ≥ ((2.576)² × (0.05)²)/(0.01)².
Hence we find that n ≥ 165.89. Since n must be an integer, 166 observations are
required to achieve an error of 0.01 or less with 99% confidence.


Note that a higher level of confidence requires a larger sample size as more
information (sample data) is required to achieve a higher level of confidence for a
given tolerance, e.

Sample size determination for a single proportion

To estimate π to within e units with 100(1 − α)% confidence, we require a sample of
size:

n ≥ (zα/2 )² p(1 − p)/e²     (5.3)
where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the
standard normal distribution, obtained using NORM.S.INV. Such z-values can also be
obtained from the bottom row of Table 10.

In (5.3), p should be an approximate value of π, perhaps obtained from a pilot study, or
alternatively we make an assumption of this value based on judgement and/or
experience. If a pilot study is not feasible and a value cannot be assumed, then set
p = 0.50 in (5.3) as a ‘conservative’ choice, as this value gives the maximum possible
standard error (this can be shown using calculus, but the proof is beyond the scope of
this course).

Example 5.7 A pilot study estimates a proportion to be 0.50. If we wish to be
95% confident of estimating the true population proportion with an error no greater
than 0.03, how large a sample is required?
Here e = 0.03, and we have an initial estimate of p = 0.50. For 95% confidence, we
use z0.025 = 1.96. Hence, using (5.3), we have:

n ≥ ((1.96)² × 0.50 × (1 − 0.50))/(0.03)² = 1,067.11.

So, rounding up, we require a sample size of 1,068.
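Likewise, (5.3) and Example 5.7 in standard-library Python (names are ours):

```python
from math import ceil
from statistics import NormalDist

# Example 5.7: pilot estimate p = 0.50, tolerance e = 0.03, 95% confidence
p, e = 0.50, 0.03

z = NormalDist().inv_cdf(0.975)               # ~1.96 for 95% confidence
n_required = ceil(z**2 * p * (1 - p) / e**2)  # (5.3), rounded up
print(n_required)  # 1068
```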

5.8 Estimation of differences between parameters of two populations

Estimating the difference between two population means and two population
proportions is important for scientific enquiry, serving as a fundamental tool for
researchers. This is driven by our desire to comprehend and quantify variations not just
within populations, but also between populations.
The scientific community relies on the estimation of differences between means and
proportions to assess the significance of observed effects or relationships (the concept of
statistical significance will be introduced in Chapter 6). Whether examining the
effectiveness of a marketing campaign, comparing customer satisfaction levels among
distinct market segments, or investigating environmental changes, researchers need a
systematic way to gauge the magnitude and uncertainty of these differences. Confidence


intervals provide this essential framework, offering a range of plausible values for the
true difference and thereby aiding researchers in drawing robust conclusions.
Furthermore, confidence intervals offer a bridge between sample data and population
parameters, allowing researchers to communicate the precision of their findings to a
wider audience. This transparency enhances the credibility of research results and
fosters a deeper understanding of the underlying uncertainty in scientific measurements.
We proceed to look at how to construct confidence intervals in two contexts:

the difference between two population means, µ1 − µ2


the difference between two population proportions, π1 − π2 .

5.9 Difference between two population means


In this section we consider the difference between two population means of two normal
distributions, i.e. µ1 − µ2 , where the subscripts ‘1’ and ‘2’ distinguish the two
populations, i.e. groups 1 and 2, respectively. There are four cases to study:

when the variances of the two populations are known


when the variances of the two populations are unknown and assumed to be unequal
when the variances of the two populations are unknown and assumed to be equal
the case of paired datasets.

5.9.1 Unpaired samples: variances known


Suppose that we have random samples of size n1 and n2 from two normal populations,
such that:
X1 ∼ N (µ1 , σ1²) and X2 ∼ N (µ2 , σ2²).
Drawing on material in Chapter 4, the sampling distributions of the respective sample
means X̄1 and X̄2 are hence:
X̄1 ∼ N (µ1 , σ1²/n1 ) and X̄2 ∼ N (µ2 , σ2²/n2 ).
Of interest is the difference in the population means, given by:
µ1 − µ2 .
Since both µ1 and µ2 are unknown, the difference µ1 − µ2 is also unknown, so needs to
be estimated. Intuitively, we estimate this using the difference in the sample means, i.e.
X̄1 − X̄2 .
For independent random samples drawn from two separate populations, X̄1 and X̄2 are
hence independent random variables. Therefore, the sampling distribution of their
difference is:
X̄1 − X̄2 ∼ N (µ1 − µ2 , σ1²/n1 + σ2²/n2 )     (5.4)


which arises due to X̄1 − X̄2 being a linear combination of two independent normal
random variables, such that:
E(X̄1 − X̄2 ) = E(X̄1 ) − E(X̄2 ) = µ1 − µ2
and, due to independence of X̄1 and X̄2 , we have that (note we add the variances):
Var(X̄1 − X̄2 ) = Var(X̄1 ) + Var(X̄2 ) = σ1²/n1 + σ2²/n2 .
The result in (5.4) follows since a linear combination of independent normal random
variables also has a normal distribution, and we have just derived its mean and variance.
Recall that the standard error is the (positive) square root of the variance of a statistic,
hence the standard error of X̄1 − X̄2 for the case of known variances σ1² and σ2² is:

S.E.(X̄1 − X̄2 ) = √Var(X̄1 − X̄2 ) = √(σ1²/n1 + σ2²/n2 ).
Noting the ‘template’ for confidence intervals seen so far in this chapter of:
point estimate ± confidence coefficient × standard error
we can now state the confidence interval endpoints for the difference between two
population means with known variances.

Confidence interval endpoints for the difference between two means

If the population variances σ1² and σ2² are known, a 100(1 − α)% confidence interval
for µ1 − µ2 has endpoints:

x̄1 − x̄2 ± zα/2 × √(σ1²/n1 + σ2²/n2 )     (5.5)

where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the
standard normal distribution, obtained using NORM.S.INV, alternatively using the
bottom row of Table 10.

Example 5.8 Two companies supplying a similar service are compared for their
reaction times (in days) to complaints. Company 1 does not offer an online reporting
portal for complaints. In a sample of n1 = 12 complaints, x̄1 = 8.5 days and it is
known that the population standard deviation is σ1 = 3.6 days (hence a known
variance of σ1² = (3.6)² = 12.96 days²).
Company 2 does offer an online reporting portal for complaints. In a sample of
n2 = 10 complaints, x̄2 = 4.8 days and it is known that the population standard
deviation is σ2 = 2.1 days (hence a known variance of σ2² = (2.1)² = 4.41 days²).
Using (5.5), we compute a 95% confidence interval for µ1 − µ2 , given by:

8.5 − 4.8 ± 1.96 × √((3.6)²/12 + (2.1)²/10) ⇒ (1.28, 6.12).


Hence we are 95% confident that µ1 − µ2 lies between 1.28 days and 6.12 days.
Since this interval excludes zero (both endpoints are positive), this suggests that
µ1 > µ2 , i.e. this indicates how Company 1 has a slower reaction time to complaints
on average, relative to Company 2 (slower since x̄1 > x̄2 , which suggests that
µ1 > µ2 ). The presence of the online reporting portal seems to speed up the average
reaction time to complaints.
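The interval in Example 5.8 can be checked with a standard-library Python sketch (variable names are ours):

```python
from math import sqrt
from statistics import NormalDist

# Complaint reaction-time data from Example 5.8 (variances known)
xbar1, sigma1, n1 = 8.5, 3.6, 12
xbar2, sigma2, n2 = 4.8, 2.1, 10

z = NormalDist().inv_cdf(0.975)             # ~1.96 for 95% confidence
se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)  # standard error of X1bar - X2bar
diff = xbar1 - xbar2
ci = (diff - z * se, diff + z * se)
print(tuple(round(v, 2) for v in ci))       # (1.28, 6.12)
```

Since both endpoints are positive, the interval excludes zero, matching the example's conclusion that µ1 > µ2 is suggested.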

5.9.2 Unpaired samples: variances unknown and unequal


We have the same set-up as above, with the same sampling distribution for X̄1 − X̄2 in
(5.4), but now the population variances σ1² and σ2² are unknown. Assuming large sample
sizes, say greater than 30, we can replace these unknown parameters with the respective
sample variances s1² and s2² and continue to use standard normal values when
determining confidence interval endpoints. The justification is that since the sample
sizes are large, we would expect reasonably accurate estimates of the population
variances, such that s1² ≈ σ1² and s2² ≈ σ2².

Confidence interval endpoints for the difference between two means

If the population variances σ1² and σ2² are unknown, provided sample sizes n1 and n2
are large (greater than 30), an approximate 100(1 − α)% confidence interval for
µ1 − µ2 has endpoints:

x̄1 − x̄2 ± zα/2 × √(s1²/n1 + s2²/n2 )     (5.6)

where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the
standard normal distribution, obtained using NORM.S.INV, or alternatively using the
bottom row of Table 10.

Example 5.9 Continuing Example 5.8, we now assume that the population
variances are unknown. Suppose random samples of complaint reaction times for
these two companies produced the following (in days):

             Sample size   Sample mean   Sample std. dev.
Company 1         45           8.5             3.6
Company 2         35           4.8             2.1

Since we have large sample sizes, n1 = 45 > 30 and n2 = 35 > 30, we can use (5.6) to
derive an approximate 95% confidence interval for µ1 − µ2 . We have:
8.5 − 4.8 ± 1.96 × √((3.6)²/45 + (2.1)²/35) ⇒ (2.44, 4.96).
Hence we are 95% confident that µ1 − µ2 lies between 2.44 days and 4.96 days.
Again, this excludes zero indicating that µ1 > µ2 .


Note that the only difference in values with respect to Example 5.8 are the larger
sample sizes (the means and standard deviations are the same values). If we compare
the widths of these two intervals we have:

6.12 − 1.28 = 4.84 days and 4.96 − 2.44 = 2.52 days.

Unsurprisingly, having larger sample sizes produces more accurate estimates of µ1
and µ2 , resulting in a shorter confidence interval, since 2.52 < 4.84.
In particular, note that the different sample sizes is the only reason for the difference
in confidence interval widths, since we have controlled for the confidence level (95%
in both cases), the sample means difference (8.5 − 4.8), and the standard deviation
values used (3.6 and 2.1, respectively).

5.9.3 Unpaired samples: variances unknown and equal


In some circumstances we may be able to justify the assumption that the two
populations being sampled are of equal variability. In which case, suppose we have
random samples of size n1 and n2 from two normal populations, N (µ1 , σ 2 ) and
N (µ2 , σ 2 ), i.e. the populations have a common variance, σ 2 , which is unknown. (If
the population variances were known, then we would know their true values. Hence no
assumptions would be necessary and we could use (5.5).) Therefore, the sampling
distributions of X̄1 and X̄2 are now:

X̄1 ∼ N (µ1 , σ²/n1 ) and X̄2 ∼ N (µ2 , σ²/n2 ).

Again, we use X̄1 − X̄2 to estimate µ1 − µ2 . Since we have independent (unpaired)
samples, X̄1 and X̄2 are independent, hence the sampling distribution of their difference
is:

X̄1 − X̄2 ∼ N (µ1 − µ2 , σ²(1/n1 + 1/n2 )).

The problem is that this common variance σ² is unknown, so needs to be estimated.
Should we use the first sample variance, S1², or the second sample variance, S2², to
estimate σ²? Answer: use both, by pooling the two sample variances, since both contain
useful information about σ².

Pooled variance estimator

The pooled variance estimator, where S1² and S2² are sample variances from
samples of size n1 and n2 , respectively, is:

Sp² = ((n1 − 1)S1² + (n2 − 1)S2²)/(n1 + n2 − 2)     (5.7)

on n1 + n2 − 2 degrees of freedom, where the subscript 'p' denotes 'pooled'.

Hence Sp² is the weighted average of the sample variances S1² and S2², where the


weights are:

(n1 − 1)/(n1 + n2 − 2) and (n2 − 1)/(n1 + n2 − 2)
respectively. So if n1 = n2 , then we give the sample variances equal weight. Intuitively,
this should make sense. As the sample size increases, a sample variance provides a more
accurate estimate of σ². Hence if n1 ≠ n2 , the sample variance calculated from the
larger sample is more reliable, and so it is given greater weight in the pooled variance
estimator. Of course, if n1 = n2 , then the variances are equally reliable, hence they are
given equal weight.

Confidence interval endpoints for the difference between two means

If the population variances σ1² and σ2² are unknown but assumed equal, a 100(1−α)%
confidence interval for µ1 − µ2 has endpoints:

x̄1 − x̄2 ± tα/2, n1 +n2 −2 × √(sp² (1/n1 + 1/n2 ))     (5.8)

where sp² is the estimate from the pooled variance estimator (5.7), and where
tα/2, n1 +n2 −2 is the t-value which cuts off 100α/2% probability in the upper tail of
the Student's t distribution with n1 + n2 − 2 degrees of freedom, obtained using
T.INV, alternatively using Table 10.

An obvious problem is how to decide whether to assume the unknown variances are
equal or unequal. Consider the following points.

If σ1² = σ2² , then we would expect approximately equal sample variances, i.e.
s1² ≈ s2² , since both sample variances would be estimating the same (common)
variance. If the sample variances are very different, then this would suggest σ1² ≠ σ2².

If we are sampling from two ‘similar’ populations (for example, similar species of
animals) then an assumption of equal variability in these ‘similar’ populations
would be reasonable.

Example 5.10 Extending Examples 5.8 and 5.9, suppose random samples of
complaint reaction times for these two companies produced the following (in days):

             Sample size   Sample mean   Sample std. dev.
Company 1         12           8.5             3.6
Company 2         10           4.8             2.1

Since we do not have large sample sizes, n1 = 12 < 30 and n2 = 10 < 30, we cannot
use (5.6), so instead we assume that σ1² = σ2² (a reasonable assumption for these two
populations) and so use (5.7) to estimate the common variance σ², and then (5.8) to
calculate the confidence interval.

108
5.9. Difference between two population means

Using (5.7), we have an estimate of the common variance of:

sp² = ((12 − 1) × (3.6)² + (10 − 1) × (2.1)²)/(12 + 10 − 2) = 9.1125
on n1 + n2 − 2 = 12 + 10 − 2 = 20 degrees of freedom. Hence a 95% confidence
interval for µ1 − µ2 , using (5.8), is:
8.5 − 4.8 ± 2.086 × √(9.1125 × (1/12 + 1/10)) ⇒ (1.01, 6.39)

where t0.025, 20 = 2.086 (we have estimated the common variance, so we use the t
distribution, here with 20 degrees of freedom, obtained in Excel using
=T.INV(0.975,20)), or using Table 10.
Hence we are 95% confident that µ1 − µ2 lies between 1.01 days and 6.39 days.
Again, this excludes zero indicating that µ1 > µ2 .
Note that the confidence interval width here is 6.39 − 1.01 = 5.38 days, which is
wider than the confidence intervals computed in Examples 5.8 and 5.9. This is due
to two factors:

the use of a t-value of 2.086, which is larger than the z-value of 1.96, since the t
distribution has fatter tails than the standard normal distribution, and hence a
t-value confidence coefficient is always larger than any z-value confidence
coefficient (for any column in Table 10, the t-values are greater than the
z-values in the bottom row, recalling that t∞ = N (0, 1))

with respect to Example 5.9 the sample sizes are smaller (but they are the same
as in Example 5.8).
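A standard-library Python sketch of the pooled calculation in Example 5.10 (names are ours; the guide reports (1.01, 6.39) after intermediate rounding, and the unrounded endpoints below lie within 0.01 of those values):

```python
from math import sqrt

# Small-sample reaction-time data from Example 5.10 (common variance assumed)
xbar1, s1, n1 = 8.5, 3.6, 12
xbar2, s2, n2 = 4.8, 2.1, 10

# Pooled variance estimate (5.7) on n1 + n2 - 2 = 20 degrees of freedom
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
print(round(sp2, 4))  # 9.1125

# t-value t_{0.025, 20} from Table 10 (or =T.INV(0.975,20))
t_val = 2.086
se = sqrt(sp2 * (1 / n1 + 1 / n2))
diff = xbar1 - xbar2
ci = (diff - t_val * se, diff + t_val * se)
print(tuple(round(v, 2) for v in ci))  # (1.0, 6.4)
```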

5.9.4 Paired (dependent) samples


Paired-sample methods are used in special cases when the two samples are not
statistically independent. For our purposes, such paired data are likely to involve
observations on the same individuals in two different states – such as ‘before’ and ‘after’
some intervening event.
A paired-sample experimental design is advantageous since it allows researchers to
determine whether or not significant changes have occurred as a result of the
intervening event free of bias from other factors, since these have been controlled for by
observing the same individuals.
A necessary, but not sufficient, condition for the presence of paired sample data is that
n1 = n2 , in order to have ‘pairs’ of data values. Common sense needs to be exercised to
determine whether or not we have paired data. An example of such a dataset would be
observations of the same individuals at two different points in time, typically ‘before’
and ‘after’ some event, such as comparing the effectiveness of a drug by measuring the
same group of patients before and after treatment.


This scenario is easy to analyse as the paired data can simply be reduced to a ‘one
sample’ analysis by working with differenced data. That is, suppose two samples
generated sample values x1 , x2 , . . . , xn and y1 , y2 , . . . , yn , respectively (note the same
number of observations, n, in each sample). We compute the differences, di for
i = 1, 2, . . . , n, using:
d1 = x1 − y1 , d2 = x2 − y2 , ..., dn = xn − yn .
Of interest is the population mean difference, µd , where:
µd = µX − µY
which is estimated using the sample mean of the differenced data, x̄d , equivalently the
difference in the sample means, such that:
x̄d = x̄ − ȳ.
Using the differenced data we then compute a confidence interval for µd using the
technique in Section 5.5.5.

Example 5.11 The table below shows the before and after weights (in pounds) of 8
adults after trying an experimental diet. We determine a 95% confidence interval for
the mean weight loss due to the experimental diet. Based on this, we can then judge
whether we are convinced that the experimental diet reduces weight, on average.

Before After
127 122
130 120
114 116
139 132
150 144
147 138
167 155
153 152

The differences (calculated as ‘Before − After’) are:

5, 10, −2, 7, 6, 9, 12 and 1.

For example:

d1 = x1 − y1 = 127 − 122 = 5, d2 = x2 − y2 = 130 − 120 = 10 etc.

We have n = 8 pairs of observations, and the sample mean and sample standard
deviation of the differenced data are, respectively:

x̄d = 6 and sd = 4.66

on n − 1 = 7 degrees of freedom. For 95% confidence we use t0.025, 7 = 2.365,
obtained in Excel using =T.INV(0.975,7) (or Table 10).

110
5.10. Difference between two population proportions

So a 95% confidence interval for the mean difference in weight before and after the
experimental diet is x̄d ± t0.025, n−1 × sd /√n, that is:

6 ± 2.365 × 4.66/√8 ⇒ (2.1, 9.9).
Hence we are 95% confident that the average weight loss due to the experimental
diet lies between 2.1 pounds and 9.9 pounds.
Since zero is not included in this confidence interval, we conclude that the
experimental diet does appear to reduce weight, i.e. the average weight loss appears
to be positive.
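The whole paired analysis of Example 5.11 reduces to a few lines of standard-library Python once the data are differenced (names are ours):

```python
from math import sqrt
from statistics import mean, stdev

# Before/after weights (pounds) from Example 5.11
before = [127, 130, 114, 139, 150, 147, 167, 153]
after = [122, 120, 116, 132, 144, 138, 155, 152]

d = [b - a for b, a in zip(before, after)]  # differenced data, 'Before - After'
n = len(d)
dbar, sd = mean(d), stdev(d)                # 6 and ~4.66

# t-value t_{0.025, 7} from Table 10 (or =T.INV(0.975,7))
t_val = 2.365
half = t_val * sd / sqrt(n)
ci = (dbar - half, dbar + half)
print(tuple(round(v, 1) for v in ci))  # (2.1, 9.9)
```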

We conclude the chapter by considering confidence intervals for the difference between
two population proportions.
5.10 Difference between two population proportions
As with the comparison of two population means, estimating differences between
population proportions in research allows for meaningful comparisons. For example, we
may wish to estimate the difference between population proportions to assess the
effectiveness of a marketing campaign aimed at increasing brand awareness.
The correct approach to the comparison of two population proportions, π1 and π2 , is via
the difference between the population proportions, i.e. π1 − π2 . As seen in Section 5.6,
the sample proportions P1 and P2 are, by the central limit theorem for large sample
sizes n1 and n2 , respectively, (approximately) normally distributed as:
   
P1 ∼ N (π1 , π1 (1 − π1 )/n1 ) and P2 ∼ N (π2 , π2 (1 − π2 )/n2 ).

When independent random samples are drawn from two separate populations, then
these distributions are statistically independent. Therefore, the difference between P1
and P2 is also (approximately) normally distributed such that:
 
P1 − P2 ∼ N(π1 − π2, π1(1 − π1)/n1 + π2(1 − π2)/n2)

which arises due to P1 − P2 being a linear combination of two independent normal


random variables, such that:

E(P1 − P2 ) = E(P1 ) − E(P2 ) = π1 − π2

and, due to independence of P1 and P2 , we have that (note we add the variances):

Var(P1 − P2) = Var(P1) + Var(P2) = π1(1 − π1)/n1 + π2(1 − π2)/n2.
Recall that a linear combination of independent normal random variables also has a
normal distribution, and we have just derived its mean and variance.


The standard error is the (positive) square root of the variance of a statistic, hence the
standard error of P1 − P2 is:
S.E.(P1 − P2) = √Var(P1 − P2) = √(π1(1 − π1)/n1 + π2(1 − π2)/n2).
We see that both Var(P1 − P2) and S.E.(P1 − P2) depend on the unknown parameters
π1 and π2 . So we must resort to the estimated standard error:
E.S.E.(P1 − P2) = √(P1(1 − P1)/n1 + P2(1 − P2)/n2).
We can now state the confidence interval endpoints for the difference between two
population proportions.

Confidence interval endpoints for the difference between two proportions


With point estimates of π1 and π2 given by p1 = r1 /n1 and p2 = r2 /n2 (based on
r1 and r2 favourable cases), respectively, an approximate 100(1 − α)% confidence
interval for the difference between two population proportions has endpoints:
p1 − p2 ± zα/2 × √(p1(1 − p1)/n1 + p2(1 − p2)/n2)     (5.9)

where zα/2 is the z-value which cuts off 100α/2% probability in the upper tail of the
standard normal distribution, obtained using NORM.S.INV, alternatively using the
bottom row of Table 10.
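Formula (5.9) translates directly into code. The following is a minimal sketch (not part of the guide); the function name `two_prop_ci` is illustrative, and the usage line borrows the survey counts from Example 5.12 below with z = 1.96 for 95% confidence:

```python
import math

def two_prop_ci(r1, n1, r2, n2, z):
    """Endpoints of (5.9): p1 - p2 +/- z * sqrt(p1(1-p1)/n1 + p2(1-p2)/n2)."""
    p1, p2 = r1 / n1, r2 / n2
    ese = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # E.S.E.
    return (p1 - p2) - z * ese, (p1 - p2) + z * ese

# 95% confidence interval for the counts in Example 5.12:
lo, hi = two_prop_ci(65, 120, 68, 150, 1.96)
print(round(lo, 3), round(hi, 3))  # approximately (-0.031, 0.208)
```

Changing only the z-value (for example, 1.645 for 90% confidence) reproduces the other intervals in this section.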

Example 5.12 We use (5.9) to calculate 95% and 90% confidence intervals for the
difference between the population proportions of the general public who are aware of
a particular commercial product before and after an advertising campaign. Two
surveys were conducted and the results of the two random samples were:

                        Sample size    Number aware
    After campaign          120             65
    Before campaign         150             68

Let population 1 be ‘after campaign’, and population 2 be ‘before campaign’. We


have:
p1 = r1/n1 = 65/120 = 0.5417   and   p2 = r2/n2 = 68/150 = 0.4533.
Hence our point estimate of the difference is:
‘after’ − ‘before’ = p1 − p2 = 0.5417 − 0.4533 = 0.0884.
So a 95% confidence interval for the difference between population proportions,
π1 − π2 , has endpoints:
0.0884 ± 1.96 × √(0.5417 × (1 − 0.5417)/120 + 0.4533 × (1 − 0.4533)/150)


which can be expressed as (−0.031, 0.208).


A 90% confidence interval has endpoints:
0.0884 ± 1.645 × √(0.5417 × (1 − 0.5417)/120 + 0.4533 × (1 − 0.4533)/150)

which can be expressed as (−0.012, 0.189).
Note that both confidence intervals include zero (since they both have a negative
lower endpoint and a positive upper endpoint). This suggests there is no significant
difference between the proportions of the general public aware of the commercial
product before and after the advertising campaign. Hence we suspect that the
impact of the advertising campaign is not (statistically) significant. This idea has
close parallels with hypothesis testing, introduced in Chapter 6.

5.11 Overview of chapter

This chapter has introduced the concept of parameter estimation, focusing on means
and proportions for one and two populations. As the values of parameters are often
unknown, we draw a random sample from the population and estimate the parameter
using an appropriate statistic (the sample mean, x̄, for µ; the sample proportion, p, for
π). While we hope a random sample is representative of the population, there is no
guarantee. This is why we convert a point estimate into an interval estimate, known as
a confidence interval. The width of a confidence interval is affected by the confidence
level (i.e. the coverage probability, often set at 95%), the sample size (larger samples
produce more accurate estimates), and the amount of variation in the
population/sample (the more heterogeneous, i.e. diverse, the population, the more
difficult it is to capture this variation in a random sample). Matters of sample size
determination were considered, and we saw how to construct and interpret confidence
intervals for means and proportions.

5.12 Key terms and concepts

Common variance                   Population mean
Confidence interval               Population proportion
Coverage probability              Random sample
Degrees of freedom                Representative sample
Difference                        Sample proportion
Differenced data                  Sample size determination
Endpoints                         Student’s t distribution
Estimation error                  Tolerance
Point estimate                    Weighted average
Pooled variance estimator


5.13 Sample examination questions


1. You are told that a 99% confidence interval for a single population proportion is
(0.3676, 0.5324).
(a) What was the sample proportion that led to this confidence interval?
(b) What was the size of the sample used?

2. In a random sample of 120 large retailers, 85 used linear regression as a method of


forecasting sales. In an independent random sample of 163 small retailers, 78 used
linear regression as a method of forecasting sales.
(a) Find a 98% confidence interval for the difference between the two population
proportions.
(b) What do you conclude from your confidence interval?
(c) Repeat (a) but for a 94% confidence interval.

3. A random sample of 21 students is chosen from students at higher education


establishments in a particular area of a country, and it is found that their mean
height is 165 centimetres with a sample variance of 81.
(a) Assuming that the distribution of the heights of the students may be regarded
as normally distributed, calculate a 98% confidence interval for the mean
height of students.
(b) You are asked to obtain a 98% confidence interval for the mean height of width
3 centimetres. What sample size would be needed in order to achieve that
degree of accuracy?
(c) Suppose that a sample of 15 had been obtained from a single student hostel, in
which there were a large number of students (still with a distribution of
heights which was well-approximated by the normal distribution). The mean
height is found to be 160 centimetres. Calculate a 99% confidence interval for
the mean height of students in the hostel. How do their heights compare with
the group in part (a)?

5.14 Solutions to Sample examination questions


1. (a) The sample proportion, p, must be in the centre of the interval
(0.3676, 0.5324). Adding the two endpoints and dividing by 2 gives:

p = (0.3676 + 0.5324)/2 = 0.45.

(b) The (estimated) standard error when estimating a single proportion is:
√(p(1 − p)/n) = √(0.45 × 0.55)/√n = 0.4975/√n.


Since this is a 100(1 − α)% = 99% confidence interval, then α = 0.01, so the
confidence coefficient is zα/2 = z0.005 = 2.576. Therefore, to determine n we
need to solve:
2.576 × 0.4975/√n = 0.5324 − 0.45 = 0.0824   ⇒   n = 241.89.
The correct sample size is n = 242.
Note that in questions regarding sample size determination remember to round
up when the solution is not an integer.
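The back-solving for n can be verified numerically. A minimal sketch (not part of the guide), rearranging z × √(p(1 − p))/√n = half-width for n and rounding up:

```python
import math

p = 0.45                    # sample proportion recovered in part (a)
half_width = 0.5324 - p     # distance from centre to upper endpoint = 0.0824
z = 2.576                   # z_{0.005}, for 99% confidence

# Rearranging z * sqrt(p * (1 - p) / n) = half_width for n:
n_exact = (z * math.sqrt(p * (1 - p)) / half_width) ** 2
n = math.ceil(n_exact)      # round up to the nearest integer, as in the text
print(round(n_exact, 2), n)  # 241.89 and 242
```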
2. (a) Let p1 , n1 refer to the proportion of large retailers using regression and to the
total number of large retailers, respectively. Similarly, denote by p2 and n2 the
corresponding quantities for small retailers. We have p1 = 85/120 = 0.7083,
n1 = 120, p2 = 78/163 = 0.4785 and n2 = 163.
The estimated standard error is:
E.S.E.(p1 − p2) = √(p1(1 − p1)/n1 + p2(1 − p2)/n2) = 0.0570.

The correct z-value is z0.01 = 2.326, leading to the lower and upper bounds of
0.0971 and 0.3625, respectively. Presented as an interval this is (0.0971, 0.3625).
(b) We are 98% confident that the difference between the two population
proportions is between 0.0971 and 0.3625. The interval excludes zero,
suggesting there is a difference between the true proportions at the 2%
significance level.
(c) For a 94% confidence interval, the correct confidence coefficient is z0.03 = 1.88.
The sample proportions and standard error are unaffected, hence the new
interval is (0.1226, 0.3370).
3. (a) The confidence interval formula is:
x̄ ± tα/2, n−1 × s/√n.
The degrees of freedom are 21 − 1 = 20, and the correct t value from tables is
t0.01, 20 = 2.528. The computed confidence interval is:
165 ± 2.528 × 9/√21  ⇒  (160.04, 169.96).

(b) We seek n such that 2.528 × 9/√n ≤ 1.5 (for a confidence interval of width 3).
Solving gives n ≥ (2.528 × 9/1.5)² = 230.07, hence n = 231 (remembering to
round up to the nearest integer).
(c) The confidence interval formula remains as:
x̄ ± tα/2, n−1 × s/√n.
The correct t value is t0.005,14 = 2.977. The computed interval is:
160 ± 2.977 × 9/√15  ⇒  (153.08, 166.92).


Note the confidence intervals are not directly comparable since the one in part
(a) is for 98% confidence, while the one in part (c) is for 99% confidence.
Other things equal, a 99% confidence interval is wider than a 98% one. Also,
the sample sizes are different and, other things equal, a smaller sample size
leads to a wider confidence interval. Although there is some overlap of the
computed confidence intervals (suggesting that the mean heights are
potentially the same), a formal hypothesis test should be performed.

Chapter 6
Hypothesis testing principles

6.1 Synopsis of chapter


In this chapter we introduce a branch of statistical inference known as hypothesis
testing. Typically, we choose between two statements about some feature of a
population, such as the value of one or more parameters. An appropriate statistical test
is used to determine to what extent the data evidence (i.e. the sample) is consistent
with a ‘null hypothesis’. When the data are deemed sufficiently inconsistent with the
null hypothesis we call the test statistically significant. We focus here on the main
principles of hypothesis testing, before covering common statistical tests of means and
proportions in Chapter 7. 6

6.2 Learning outcomes


After completing this chapter, and having completed the essential reading and
activities, you should be able to:

distinguish between a null hypothesis and an alternative hypothesis

define and explain the types of errors which can be made in hypothesis testing

explain a significance level and describe the different types of statistical significance

define a p-value and explain how it is used to decide whether or not to reject the
null hypothesis

explain the influence of effect size and sample size on a hypothesis test.

6.3 Recommended reading


Abdey, J. Business Analytics: Applied Modelling and Prediction. (London: SAGE
Publications, 2023) 1st edition [ISBN 9781529774092] Chapter 10.

6.4 Introduction
In hypothesis testing, our objective is to choose between two opposite statements about
the population, where these statements are known as hypotheses. By convention these
are denoted by H0 and H1 , such that:


H0 is called the null hypothesis

H1 is called the alternative hypothesis.

Our binary decision is whether to:

‘reject H0 ’ or ‘not reject H0 ’

where the decision is data-driven.

Example 6.1 Qualitative examples of hypotheses are:

Null hypothesis, H0                            Alternative hypothesis, H1

A modified process does not produce a          A modified process does produce a
different yield than the standard process.     different yield than the standard process.
A person is not gifted with                    A person is gifted with
Extra Sensory Perception.                      Extra Sensory Perception.
The average level of lead in the blood         The average level of lead in the blood
of people in a particular environment          of people in a particular environment
is no more than 10 µg/dL.                      is more than 10 µg/dL.

From Example 6.1 we see that we use the null hypothesis, H0 , to represent ‘no
difference’, ‘no effect’, ‘no increase’ etc., while the alternative hypothesis, H1 , represents
‘a difference’, ‘an effect’, ‘an increase’ etc.
Many statistical procedures can be represented as statements about the values of
population parameters such as the mean, µ, or variance, σ 2 . The first step in any
hypothesis testing problem is to ‘translate’ the real problem into its technical
counterpart. For example, hypotheses can be ‘translated’ into technical forms similar to:

‘We observe a random sample of n observations, denoted x1 , x2 , . . . , xn , from


N (µ, σ 2 ) and we wish to test:

H0 : µ = µ0 vs. H1 : µ > µ0

where µ0 is some specified constant value.’


Performing the ‘translation’ correctly can itself be complicated, and so this requires
careful thought. Specifically, it is the form of H1 which needs consideration. The null
hypothesis, H0 , will always denote the parameter value with equality, i.e. ‘=’,1 such as:

H 0 : µ = µ0 .

In contrast the alternative hypothesis, H1, will take one of three forms, i.e. using ‘≠’, ‘<’
or ‘>’, that is:

H1 : µ ≠ µ0   or   H1 : µ < µ0   or   H1 : µ > µ0 .
1. Such a null hypothesis is called a simple null hypothesis. It is possible to have a composite null
hypothesis, such as H0 : µ ≥ µ0 , which allows for more than one parameter value, although we will only
focus on simple forms of H0 in this course.


Note that only one of these forms will be used per test. To determine which form to use
will require careful consideration of the wording of the problem.
The form H1 : µ ≠ µ0 is an example of a two-tailed test (or two-sided test) and we use
this form with problems worded such as ‘test the hypothesis that µ is zero’. Here, there
is no indication of the value of µ if it is not zero, that is, do we assume µ > 0 or µ < 0
in such cases? We cannot be sure, so we take the safe option of a two-sided test, i.e. we
specify H1 : µ ≠ 0.
In contrast, had the problem been phrased as ‘test whether µ is greater than zero’, then
unambiguously we would opt for H1 : µ > 0. This is an example of an upper-tailed
test (or one-sided test). Similarly, ‘test whether µ is less than zero’ leads to H1 : µ < 0
which is a lower-tailed test (also a one-sided test).
Later, in Chapter 7, when testing for differences between two population means or two
population proportions, you need to look out for comparative phrases indicating if one
population parameter value should exceed the other (for example, testing whether
group A is on average faster/older/taller than group B). Practising problems will make
you proficient in correctly specifying your hypotheses.

6.5 Types of error
In any hypothesis test there are two types of inferential decision error which could be
committed. Clearly, we would like to reduce the probabilities of these errors as much as
possible. These two types of error are called Type I error and Type II error.

Type I and Type II errors

Type I error: rejecting H0 when it is true, which is a false positive. We denote


the probability of this type of error by α.

Type II error: failing to reject H0 when it is false, which is a false negative.


We denote the probability of this type of error by β.

Both errors are undesirable and, depending on the context of the hypothesis test, it
could be argued that either one is worse than the other. (For example, which is worse, a
medical test incorrectly concluding a healthy person has an illness, or incorrectly
concluding that an ill person is healthy?) However, on balance, a Type I error is usually
considered to be more problematic. The possible decision space in hypothesis testing
can be presented as shown in Table 6.1.
                                     Decision made
                          H0 not rejected       H0 rejected
True state    H0 true     Correct decision      Type I error
of nature     H1 true     Type II error         Correct decision

Table 6.1: Decision space in hypothesis testing.

For example, if H0 was being ‘innocent’ and H1 was being ‘guilty’, a Type I error would


be finding an innocent person guilty (bad for them), while a Type II error would be
finding a guilty person innocent (bad for the victim/society/justice, but admittedly
good for them!).
The complement of a Type II error probability, that is 1 − β, is called the power of the
test – the probability that the test will reject a false null hypothesis. Hence power
measures the ability of the test to reject a false H0 , and so we seek the most powerful
test for any testing situation. Hence by seeking the ‘best’ test, we mean the best in the
‘most powerful’ sense.

Example 6.2 During the Covid-19 pandemic, there was a choice of tests (in
particular the rapid antigen tests and PCR tests) such that which test would you
use? The ‘best’ test, of course. What is ‘best’ though? It could be in terms of speed
and/or convenience (such as the rapid antigen tests), or in terms of Covid detection
accuracy (such as the PCR tests). In terms of power alone, we would opt for PCR
tests over rapid antigen tests.
In statistical testing, we would choose the best test in terms of power alone, but we
should be mindful that on occasions we may accept use of a less powerful test for the
sake of expediency (analogous to the use of rapid antigen tests).

We control α (i.e. we choose the value of α by setting a significance level, discussed


shortly), but we do not control test power (as we do not control β). However, we can
increase test power by increasing the sample size, n. A larger sample size will inevitably
improve the accuracy of our hypothesis test decision-making. However, there is a
trade-off in that a larger n means greater data collection costs.

Example 6.3 In the rush to develop Covid-19 vaccines (who wants to be locked
down forever?), health authorities in each country would need to licence and approve
each vaccine for use. In matters of life and death, to obtain approval would require
very ‘powerful’ evidence to justify rapid rollouts. As such, each candidate vaccine
would have to undergo a series of clinical trials, with the number of test patients
increasing at each stage. For example:

Phase 1: n1 ≈ 20–100, checking for safety of the vaccine on a small scale, while
monitoring for any side-effects.

Phase 2: n2 ≈ 500–1,000, checking for effectiveness of the vaccine, and


appropriate size/number of doses (conditional on Phase 1 being successful).

Phase 3: n3 ≈ 10,000–50,000, vaccine tested on heterogeneous subjects (gender,


age, ethnicities etc.) to ‘prove’ it should be approved for use in the general
population (conditional on Phase 2 being successful).

While different countries may prescribe different numerical values for each of n1 , n2
and n3 (perhaps due to different clinical opinions on these requirements), note that:

n1 < n2 ≪ n3

where recall that ≪ means ‘much less than’.


These concepts can be summarised as conditional probabilities, as shown in Table 6.2.

                                     Decision made
                          H0 not rejected           H0 rejected
True state    H0 true     1 − α                     P(Type I error) = α
of nature     H1 true     P(Type II error) = β      Power = 1 − β

Table 6.2: Conditional probabilities in hypothesis testing.

We have:

P (H0 not rejected | H0 is true) = 1 − α and P (H0 rejected | H0 is true) = α

such that, by design:

P (H0 not rejected | H0 is true) + P (H0 rejected | H0 is true) = 1

and:

P (H0 not rejected | H1 is true) = β and P (H0 rejected | H1 is true) = 1 − β


such that:

P (H0 not rejected | H1 is true) + P (H0 rejected | H1 is true) = 1

Other things being equal, if you decrease α you increase β and vice-versa. Hence there
is a trade-off. However, treating Type I errors as being more serious, this is why we
control the value of α through the significance level and then we seek the most powerful
test to minimise β, equivalently to maximise 1 − β.
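The trade-off between α and power can be made concrete with a small simulation. This sketch (not from the guide) repeatedly applies an upper-tailed z test of H0: µ = 0 at the 5% significance level to simulated samples; the assumed true mean of 0.5 under H1 is an arbitrary illustrative effect size:

```python
import math
import random

random.seed(1)
n, reps = 25, 10_000
z_crit = 1.645  # 5% significance level, upper-tailed

def reject_h0(mu):
    """Draw a sample of size n from N(mu, 1) and apply the z test."""
    x_bar = sum(random.gauss(mu, 1) for _ in range(n)) / n
    # Test statistic: x_bar divided by its standard error, 1/sqrt(n).
    return x_bar / (1 / math.sqrt(n)) > z_crit

# H0 true (mu = 0): the proportion of (incorrect) rejections estimates alpha.
alpha_hat = sum(reject_h0(0.0) for _ in range(reps)) / reps
# H1 true (mu = 0.5): the proportion of rejections estimates power = 1 - beta.
power_hat = sum(reject_h0(0.5) for _ in range(reps)) / reps
print(alpha_hat, power_hat)  # alpha_hat near 0.05; power well above alpha
```

Re-running with a smaller α (a larger z_crit) shrinks alpha_hat but also shrinks power_hat, while increasing n raises power for any fixed α, matching the discussion above.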

6.6 Significance level


Since we control for the probability of a Type I error, i.e. α, what value should this be?
Well, in general we test at the 100α% significance level, for an α in the interval [0, 1].
The default choice is α = 0.05, i.e. we test ‘at the 5% significance level’. This means
that, on average, 1 in every 20 true null hypotheses are incorrectly rejected. Of course,
this value of α is subjective, and a different significance level may be chosen. The
severity of a Type I error in the context of a specific hypothesis test might, for example,
justify a more conservative or liberal choice for α. Could you tolerate a 1-in-20 error
rate when H0 is true? Would 1-in-10 (a 10% significance level) be acceptable? Perhaps
1-in-100 (a 1% significance level)? Ultimately, it is a judgement call.
In fact, noting our look at confidence intervals in the context of estimation in Chapter
5, we could view the significance level in testing as the complement of the confidence
level in estimation (strictly speaking, this would apply to two-tailed hypothesis tests).
For example:

a 90% confidence level in estimation has parallels with a 10% significance level in
testing (α = 0.10)


a 95% confidence level in estimation has parallels with a 5% significance level in


testing (α = 0.05)

a 99% confidence level in estimation has parallels with a 1% significance level in


testing (α = 0.01).

The most common significance levels are 10%, 5% and 1%, such that:

rejecting H0 at the 10% significance level reflects a weakly significant result, with
weak evidence

rejecting H0 at the 5% significance level reflects a moderately significant result, with


moderate evidence

rejecting H0 at the 1% significance level reflects a highly significant result, with


strong evidence.

A sensible strategy to follow is to initially test at the 5% significance level and then test
either at the 1% or 10% significance level, depending on whether or not you have
rejected at the 5% significance level. A decision tree depiction of the procedure to
follow is presented in Figure 6.1.

[Figure: significance level decision tree]
Start at the 5% level:
    if H0 is rejected, move to the 1% level:
        reject H0      →  test result is ‘highly significant’
        not reject H0  →  test result is ‘moderately significant’
    if H0 is not rejected, move to the 10% level:
        reject H0      →  test result is ‘weakly significant’
        not reject H0  →  test result is ‘not significant’.
Figure 6.1: Significance level decision tree.

As indicated in Figure 6.1, it is possible to state whether a test result is highly


significant, moderately significant, weakly significant or not significant, and hence we can
convey a measure of the ‘strength’ of any statistical significance.
The more serious a Type I error is, the smaller we may set α to minimise the risk of a
Type I error occurring, i.e. reduce the risk of incorrectly rejecting a true null hypothesis.


Once this goalpost has been set, how do we actually use α to decide whether or not to
reject H0 ? For that, we can use either critical values or p-values.

6.7 Critical values


Each statistical test employs a test statistic. A selection of common test statistics will
be introduced in Chapter 7, however each test statistic is a random variable with a
probability distribution. For all tests in Chapter 7 this distribution will either be
standard normal, i.e. N (0, 1) for ‘z tests’, or a Student’s t distribution for ‘t tests’.
Critical values are specific points on the scale of a test statistic’s distribution that
define the boundary between the rejection region where the null hypothesis is
rejected, and the region where it is not rejected. These critical values are derived from
the chosen significance level, α, which represents the probability of making a Type I
error (incorrectly rejecting a true null hypothesis). Commonly used levels of significance
include 0.05, 0.01, and 0.10. The test statistic value from the sample data is compared to
the critical value(s).

Decision rule using critical values


When testing at the 100α% significance level, for α in the interval [0, 1], then if the
test statistic value:
falls in the rejection region (beyond the critical value), then reject H0
does not fall in the rejection region, then do not reject H0.

The critical values are drawn from the distribution of the test statistic used for testing
H0 . The use of critical values ensures a systematic and objective approach to hypothesis
testing, providing a clear decision rule based on statistical evidence.

6.7.1 Rejection region for two-tailed tests


Suppose we test:
H0 : µ = µ0 vs. H1 : µ ≠ µ0
i.e. a two-tailed test. Under certain circumstances (to be seen in Chapter 7), this will be
conducted using a z-test, hence critical values will be drawn from N (0, 1). For a
two-tailed test, both tails of the test statistic’s distribution will form the rejection
region, which has a total area of α. It is logical to split this α equally between the two
tails of the distribution, hence the critical values define the boundaries which cut off
α/2 probability in each tail. Using the bottom row of Table 10 of the New Cambridge
Statistical Tables and noting symmetry of the standard normal distribution about zero:

α = 0.10 ⇒ zα/2 = z0.05 = 1.645, giving critical values of ± 1.645


α = 0.05 ⇒ zα/2 = z0.025 = 1.96, giving critical values of ± 1.96
α = 0.01 ⇒ zα/2 = z0.005 = 2.576, giving critical values of ± 2.576


and these rejection regions are shown in Figure 6.2.

Figure 6.2: Rejection regions shown in red for a two-tailed z test at the 10% (left), 5%
(centre) and 1% (right) significance levels.

6.7.2 Rejection region for upper-tailed tests


Suppose we test:
H0 : µ = µ0 vs. H1 : µ > µ0

i.e. an upper-tailed test. Suppose a z-test is conducted, hence critical values will be
drawn from N (0, 1). For an upper-tailed test, only the right tail of the test statistic’s
distribution will form the rejection region, which has a total area of α. Using the bottom
row of Table 10 of the New Cambridge Statistical Tables, the critical values are for:

α = 0.10 ⇒ zα = z0.10 = 1.282, giving a critical value of 1.282


α = 0.05 ⇒ zα = z0.05 = 1.645, giving a critical value of 1.645
α = 0.01 ⇒ zα = z0.01 = 2.326, giving a critical value of 2.326

and these rejection regions are shown in Figure 6.3.

6.7.3 Rejection region for lower-tailed tests


Suppose we test:
H0 : µ = µ0 vs. H1 : µ < µ0

i.e. a lower-tailed test. Suppose a z-test is conducted, hence critical values will be drawn
from N (0, 1). For a lower-tailed test, only the left tail of the test statistic’s distribution
will form the rejection region, which has a total area of α. Using the bottom row of
Table 10 of the New Cambridge Statistical Tables, and noting symmetry of the standard


Figure 6.3: Rejection regions shown in red for an upper-tailed z test at the 10% (left),
5% (centre) and 1% (right) significance levels.

normal distribution about zero the critical values are for:


α = 0.10 ⇒ zα = z0.90 = −1.282, giving a critical value of − 1.282
α = 0.05 ⇒ zα = z0.95 = −1.645, giving a critical value of − 1.645
α = 0.01 ⇒ zα = z0.99 = −2.326, giving a critical value of − 2.326

and these rejection regions are shown in Figure 6.4.

Figure 6.4: Rejection regions shown in red for a lower-tailed z test at the 10% (left), 5%
(centre) and 1% (right) significance levels.
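The critical values quoted in Sections 6.7.1 to 6.7.3 can be reproduced with the standard normal inverse cdf. Here Python's standard-library `statistics.NormalDist` plays the role of Excel's NORM.S.INV (a sketch, not part of the guide):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution, N(0, 1)

# Two-tailed: cut off alpha/2 probability in each tail (critical values are +/-).
two_tailed = {a: z.inv_cdf(1 - a / 2) for a in (0.10, 0.05, 0.01)}
# Upper-tailed: cut off alpha probability in the right tail only.
upper = {a: z.inv_cdf(1 - a) for a in (0.10, 0.05, 0.01)}
# Lower-tailed: cut off alpha probability in the left tail only.
lower = {a: z.inv_cdf(a) for a in (0.10, 0.05, 0.01)}

for a in (0.10, 0.05, 0.01):
    print(f"alpha={a}: two-tailed ±{two_tailed[a]:.3f}, "
          f"upper {upper[a]:.3f}, lower {lower[a]:.3f}")
```

For t tests, the same idea applies but with the inverse cdf of a Student's t distribution on the appropriate degrees of freedom (for example, Excel's T.INV), which is not in the Python standard library.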

In each of Figures 6.2 to 6.4 we see that as α decreases, the size of the rejection region
also decreases (since α is the size of the rejection region!). For t tests, the critical values
would be obtained from a Student’s t distribution with the appropriate degrees of


freedom (rather than from N (0, 1) for z tests). In all cases, if the test statistic value falls
in the rejection region, then we reject H0 as per the decision rule using critical values.
If we consider again the significance level decision tree in Figure 6.1, we can appreciate
why after rejecting H0 at the 5% significance level, say, we would not then test at the
10% significance level, since the rejection region when α = 0.05 is a subset of the
rejection region when α = 0.10, hence given the decision rule using critical values, the
test statistic value must also fall in the larger rejection region. As such, there is no
added value from moving to a 10% significance level having rejected H0 at the 5%
significance level. However, it is worthwhile to move to the 1% significance level as the
rejection region when α = 0.01 is a subset of that for α = 0.05, so there is a possibility
the test statistic value does not fall in this smaller rejection region.
We now consider an alternative, but equivalent, approach to hypothesis testing, which
uses p-values instead of critical values.

6.8 P -values

We introduce p-values, which provide an alternative way for deciding whether or not to
reject H0 .

Definition

A p-value is the probability of the event that the ‘test statistic’ (a known function
of our data) takes the observed value or more extreme (i.e. more unlikely) values
under H0 . It is a measure of the discrepancy between the null hypothesis, H0 , and
the data evidence.

• A ‘small’ p-value indicates that the data are inconsistent with H0 .

• A ‘large’ p-value indicates that the data are consistent with H0 .

So, p-values may be seen as a measure of how compatible our data are with the null
hypothesis, such that as the p-value gets closer to 1 then the data evidence becomes
more compatible with H0 (i.e. H0 seems more credible), while as the p-value gets closer
to 0 then the data evidence becomes less compatible with H0 (i.e. H0 seems more
incredible).

Example 6.4 Suppose that in a scientific experiment researchers are investigating


the effectiveness of a new drug in treating a specific medical condition. The null and
alternative hypotheses for this experiment can be defined as follows:

H0 : The new drug has no significant effect on the medical condition; it is not
effective in treating it.

H1 : The new drug is effective in treating the medical condition; it has a


significant positive effect.

Now, let us consider two scenarios in this scientific context: (a) strong evidence for


the effectiveness of the drug, and (b) weak or inconclusive evidence for the
effectiveness of the drug.

(a) In this scenario, the researchers conduct a well-designed clinical trial. The
results clearly show that patients who received the new drug experienced
significant improvements in their condition. The data are so compelling that
they provide strong evidence against the null hypothesis, H0 . The researchers
present robust evidence, including data and expert opinions from medical
professionals, supporting the effectiveness of the drug. This is similar to having
a very small p-value in hypothesis testing, indicating that the observed data are
highly inconsistent with H0 . Consequently, the researchers conclude that the
new drug is effective beyond a reasonable doubt and should be considered for
approval.

(b) In this scenario, the experimental results are inconclusive. While there might be
some indications that the new drug could be effective, the evidence is not strong
enough to justify claiming its effectiveness with confidence. The researchers’
statistical analysis shows that the p-value is relatively large, suggesting that the
data do not provide strong support for the alternative hypothesis, H1.
Additionally, there may be counterarguments presented by experts in the field,
casting doubt on the effectiveness of the drug. As a result, the researchers are
unable to confidently conclude that the new drug is effective beyond a
reasonable doubt, similar to a situation where the p-value is not sufficiently
small. Further research and evidence may be needed to make a definitive
decision about the drug’s effectiveness.

Example 6.5 Suppose we are interested in researching the mean body mass index
(BMI) of a certain population of infants. Suppose that BMI in the population is
modelled as N (µ, 25), i.e. we assume a normal distribution with mean µ and a
known variance of σ 2 = 25. A random sample of n = 25 individuals is taken, yielding
a sample mean of 17, i.e. x̄ = 17.
Independently of the data, three experts give their own opinions as follows.

Dr A claims the population mean BMI is µ = 16.

Ms B claims the population mean BMI is µ = 15.

Mr C claims the population mean BMI is µ = 14.

How can we assess these experts' contradictory statements? (At most one expert can be correct; indeed, all three could be incorrect.)
Here, the sampling distribution of the sample mean is:

X̄ ∼ N(µ, σ²/n) = N(µ, 25/25) = N(µ, 1)

since σ² = 25 and n = 25. We assess the statements of the three experts based on this sampling distribution, with a standard error of σ/√n = √1 = 1.


If Dr A's claim is correct, then X̄ ∼ N(16, 1). The observed value x̄ = 17 is one standard error away from µ (since x̄ − µ = 17 − 16 = 1, and σ/√n = 1), and may be
regarded as a typical observation from the distribution (that is, nothing out of the
ordinary). Hence there is little inconsistency between the claim and the data
evidence. This one standard error difference between the claim that µ = 16 and the
estimate of µ given by x̄ = 17 can reasonably be attributed to sampling error. This
is shown in Figure 6.5.
If Ms B’s claim is correct, then X̄ ∼ N (15, 1). The observed value x̄ = 17 begins to
look a bit ‘extreme’, as it is two standard errors away from µ (since 17 − 15 = 2).
Hence there is some inconsistency between the claim and the data evidence (we are
somewhat surprised to obtain a point estimate of µ as far away as two standard
errors from the claimed value). This two standard error difference between the claim
that µ = 15 and the estimate of µ given by x̄ = 17 is less likely to be attributable to
sampling error, casting some doubt on Ms B’s claim. This is shown in Figure 6.6.
If Mr C’s claim is correct, X̄ ∼ N (14, 1). The observed value x̄ = 17 is very extreme,
as it is three standard errors away from µ (since 17 − 14 = 3). Hence there is strong
inconsistency between the claim and the data evidence. This three standard error
difference between the claim that µ = 14 and the estimate of µ given by x̄ = 17 is
very unlikely to be attributable to sampling error, casting significant doubt on Mr
C’s claim. This is shown in Figure 6.7.
We can actually calculate the conditional probability of observing a sample mean at
least one, two and three standard errors, respectively, beyond the claims of each
expert. This simply involves standardisation of X̄ conditioning
√ on the claimed value
of µ in each case (that is, subtracting µ and dividing by σ/ n = 1).

Under H0: µ = 16 (Dr A's claim), the probability of being at least one standard error above or below 16 is:

P(|X̄ − 16| ≥ 1) = P(X̄ ≤ 15) + P(X̄ ≥ 17)
                = P(Z ≤ (15 − 16)/1) + P(Z ≥ (17 − 16)/1)
                = P(Z ≤ −1) + P(Z ≥ 1)
                = 0.3173
using =NORM.S.DIST(-1,1)+(1-NORM.S.DIST(1,1)), or Table 4 of the New
Cambridge Statistical Tables.
Under H0: µ = 15 (Ms B's claim), the probability of being at least two standard errors above or below 15 is:

P(|X̄ − 15| ≥ 2) = P(X̄ ≤ 13) + P(X̄ ≥ 17)
                = P(Z ≤ (13 − 15)/1) + P(Z ≥ (17 − 15)/1)
                = P(Z ≤ −2) + P(Z ≥ 2)
                = 0.0455


using =NORM.S.DIST(-2,1)+(1-NORM.S.DIST(2,1)), or Table 4.

Under H0: µ = 14 (Mr C's claim), the probability of being at least three standard errors above or below 14 is:

P(|X̄ − 14| ≥ 3) = P(X̄ ≤ 11) + P(X̄ ≥ 17)
                = P(Z ≤ (11 − 14)/1) + P(Z ≥ (17 − 14)/1)
                = P(Z ≤ −3) + P(Z ≥ 3)
                = 0.0027

using =NORM.S.DIST(-3,1)+(1-NORM.S.DIST(3,1)), or Table 4.
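These three Excel calculations can also be reproduced in Python (an illustrative alternative only; Python is not required for this course, and the variable names below are ours). The standard library's statistics.NormalDist plays the role of NORM.S.DIST:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution, Z ~ N(0, 1)

# For X-bar ~ N(mu0, 1), P(|X-bar - mu0| >= k) equals 2 * P(Z >= k)
for mu0, k in [(16, 1), (15, 2), (14, 3)]:
    p_value = 2 * (1 - Z.cdf(k))
    print(f"H0: mu = {mu0}, p-value = {p_value:.4f}")
```

Running this prints the three p-values 0.3173, 0.0455 and 0.0027 in turn.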

These conditional probabilities are the p-values for the respective sets of hypotheses:

H0: µ = µ0  vs.  H1: µ ≠ µ0

where µ0 is the expert's claimed value of µ. Note that:

0.0027 < 0.0455 < 0.3173
0.0027 < 0.0455 < 0.3173

such that the greater the difference (the greater the incompatibility) between the
data evidence and the claim (in the null hypothesis), the smaller the p-value.
In summary, of the three claims the one we would be most willing to reject would be
the claim that µ = 14, because if the hypothesis µ = 14 is true, the probability of
observing x̄ = 17, or more extreme values (i.e. x̄ ≤ 11 or x̄ ≥ 17), would be as small
as 0.0027. We are comfortable with this decision, as such a small probability event
would be very unlikely to occur in a single experiment.
On the other hand, we would be far less comfortable rejecting the claim that µ = 16,
because if the hypothesis µ = 16 is true, the probability of observing x̄ = 17, or more
extreme values (i.e. x̄ ≤ 15 or x̄ ≥ 17) is much larger at 0.3173. However, this does
not imply that Dr A’s claim is necessarily true.
It is important to remember that:

not reject H0 ≠ accept H0.

A statistical test is incapable of 'accepting' a hypothesis, due to the possibility of an inferential decision error!

6.8.1 Interpretation of p-values


In practice the statistical analysis of data is performed by computers using statistical
software packages. Some simple common hypothesis tests may even be performed in
Excel. Regardless of the specific hypothesis being tested, the execution of a hypothesis
test by a computer returns a p-value. Fortunately, regardless of the test being
conducted, there is a universal decision rule for p-values.


Figure 6.5: Where x̄ = 17 falls if Dr A's claim that µ = 16 is correct, in Example 6.5.

Figure 6.6: Where x̄ = 17 falls if Ms B's claim that µ = 15 is correct, in Example 6.5.

We have explained that we control for the probability of a Type I error through our
choice of significance level, α, where α is a value in the interval [0, 1]. Since p-values are also probabilities (that is what the 'p' stands for), we simply compare p-values with our chosen benchmark significance level, α.
We now present the p-value decision rule.

Decision rule using p-values

When testing at the 100α% significance level, for α in the interval [0, 1], then:

if the p-value ≤ α, then reject H0
if the p-value > α, then do not reject H0.
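The rule is mechanical, so it can be expressed as a tiny helper function (a Python sketch for illustration only; the function name is ours, not part of the subject guide's Excel toolkit):

```python
def p_value_decision(p_value, alpha=0.05):
    """Universal p-value decision rule at the 100*alpha% significance level."""
    if not (0 <= p_value <= 1 and 0 <= alpha <= 1):
        raise ValueError("probabilities must lie in [0, 1]")
    return "reject H0" if p_value <= alpha else "do not reject H0"

print(p_value_decision(0.0027))  # reject H0
print(p_value_decision(0.3173))  # do not reject H0
```

Note the boundary case: a p-value exactly equal to α leads to rejection of H0.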


Figure 6.7: Where x̄ = 17 falls if Mr C's claim that µ = 14 is correct, in Example 6.5.

The p-value decision rule is shown in Figure 6.8 for α = 0.05.



Figure 6.8: Using the p-value decision rule with a 5% significance level.

6.8.2 Statistical significance versus practical significance


Statistically significant results are those which produce sufficiently small p-values. In
other words, statistically significant results are those which provide strong evidence
against H0 in favour of H1 . Such results are not necessarily significant in terms of being
of practical importance. They might be significant only in the statistical sense. This is
the case when your response to the results of the study is ‘who cares?’ or ‘so what?’.
With large sample sizes, there is always a possibility of statistical significance but not practical significance. Conversely, with smaller samples, results may not be statistically significant even if they represent the truth about the population(s).

6.9 Overview of chapter


This chapter has introduced hypothesis testing, which involves a binary decision of
whether or not to reject a null hypothesis. Type I and Type II errors are always a
possibility and we have considered the conditional probabilities of these occurring. We
control for the probability of a Type I error by using a significance level. The decision of


whether to reject H0 can be performed using either critical values or p-values. By construction, the same reject/not reject H0 decision is reached regardless of approach.

6.10 Key terms and concepts


Alternative hypothesis
Binary decision
Critical value
Decision space
Decision tree
Lower-tailed test
Null hypothesis
p-value
Power
Rejection region
Significance level
Test statistic
Two-tailed test
Type I error
Type II error
Upper-tailed test

6.11 Sample examination questions


1. The power of a statistical test is the probability that the null hypothesis is true. Is
this statement true or false? Justify your answer.

2. Define what a p-value is, and explain how it is used in hypothesis testing.

3. Compare and contrast the two general approaches to hypothesis testing: (a) the
‘critical value approach’, and (b) the ‘p-value approach’.

6.12 Solutions to Sample examination questions


1. The statement is false. The power of a statistical test is the probability that the
test will correctly reject a false null hypothesis. In other words, it is the probability
of correctly detecting a true effect or difference when it exists.
Mathematically, the power of a test is influenced by factors such as the sample size
and the significance level (α). A higher power indicates a better ability of the test
to detect a true alternative hypothesis.
The correct interpretation is that the power of a test is the probability of rejecting
a false null hypothesis, not the probability that the null hypothesis itself is true.

2. A p-value is the probability of obtaining the test statistic value, or a more extreme
value, conditional on the null hypothesis being true.
A ‘small’ p-value indicates that the data are inconsistent with H0 , while a ‘large’
p-value indicates that the data are consistent with H0 .
When testing at the 100α% significance level, for α in the interval [0, 1], then:

if the p-value ≤ α, then reject H0
if the p-value > α, then do not reject H0.


3. (a) The critical value approach does not require a p-value to be calculated, but
does require critical value(s) to be obtained for a given significance level, α.
(b) The p-value approach does not require critical value(s) to be obtained, but does require the p-value to be calculated for direct comparison with a given significance level, α.

Chapter 7
Hypothesis testing of means and
proportions

7.1 Synopsis of chapter


Having introduced the principles of hypothesis testing in Chapter 6 we are now in a
position to proceed with performing some common statistical tests of parameters. In
this chapter we cover hypothesis tests for a single population mean, µ, a single
population proportion, π, and tests of the equality of two population means and
proportions. Each hypothesis being tested requires an appropriate test statistic which
will follow a probability distribution (in this chapter either the standard normal
distribution or a Student’s t distribution). Critical values and/or p-values are obtained
using the test statistic distribution. Use of the relevant decision rule then allows us to
draw a conclusion about the extent of statistical significance of the test results.
7.2 Learning outcomes
After completing this chapter, and having completed the essential reading and
activities, you should be able to:

list the stages of performing a hypothesis test

perform hypothesis tests of a single population mean and a single population proportion

perform hypothesis tests for the difference between two population means and two
population proportions

derive critical values from statistical tables

compute p-values using Excel formulae

summarise the findings of statistical tests.

7.3 Recommended reading


Abdey, J. Business Analytics: Applied Modelling and Prediction. (London: SAGE
Publications, 2023) 1st edition [ISBN 9781529774092] Chapter 10.


7.4 Introduction
We begin by listing the necessary steps to perform a statistical test. Once introduced,
we simply apply this ‘recipe’ to different testing scenarios.

1. Define the hypotheses. This requires specification of the null hypothesis, H0, and the alternative hypothesis, H1. When testing a parameter, such as a population mean or proportion, H1 will be constructed for one of the following tests:
two-tailed test
upper-tailed test
lower-tailed test.
2. State the test statistic and its distribution. Each hypothesis being tested
requires a test statistic, which is a known function of the data. It is a formula
used to calculate the test statistic value in Step 3. Each test statistic follows a
probability distribution, from which critical values are obtained or the p-value is
calculated. The statistical tests covered in this chapter will all follow either a
standard normal distribution, N (0, 1), or a Student’s t distribution.
3. Compute the test statistic value. Using the sample data, we calculate the test
statistic value using the test statistic formula.

4. Determine the critical value(s) or the p-value of the test. As defined in
Chapter 6, critical values are specific points on the scale of a test statistic’s
distribution that define the boundary between the rejection region where the null
hypothesis is rejected, and the region where it is not rejected. The p-value is the
probability of the event that the test statistic takes the observed value or more
extreme values under H0 . It is a measure of the discrepancy between the null
hypothesis, H0 , and the data evidence.

Calculating p-values

Let the test statistic be X (a random variable), and the test statistic value be x.
For a test statistic distribution which is symmetric about zero such as N (0, 1) and
a Student’s t distribution, then the p-value is an area under the curve of the test
statistic distribution. The calculation of the p-value then depends on the form of the
alternative hypothesis, H1 , as follows.

Form of H1           p-value calculation
Two-tailed test      2 × P(X ≥ |x|)
Lower-tailed test    P(X ≤ x)
Upper-tailed test    P(X ≥ x)

5. Decide whether or not to reject H0. Apply the critical value or p-value decision rule using your chosen significance level.


When testing at the 100α% significance level, for α in the interval [0, 1], then if the test statistic value:

falls in the rejection region (beyond the critical value), then reject H0
does not fall in the rejection region, then do not reject H0.

When testing at the 100α% significance level, for α in the interval [0, 1], then:

if the p-value ≤ α, then reject H0
if the p-value > α, then do not reject H0.

The most common significance levels are 10%, 5% and 1%, such that:
rejecting H0 at the 10% significance level reflects a weakly significant result,
with weak evidence
rejecting H0 at the 5% significance level reflects a moderately significant result,
with moderate evidence
rejecting H0 at the 1% significance level reflects a highly significant result, with
strong evidence.
6. Draw conclusions. It is always important to draw conclusions in the context of
the original hypotheses. This is an important final step which guides us to make a
better decision about the original research hypothesis, and final conclusions should
be drawn in terms of the problem.
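For a test statistic distributed as N(0, 1), the p-value calculations in Step 4 can be sketched as follows (a Python illustration using only the standard library; the function name is ours). For a Student's t statistic the same logic applies with the t distribution's CDF, which the Python standard library does not provide:

```python
from statistics import NormalDist

def z_p_value(z, tail):
    """p-value for a z test statistic value z, by form of H1."""
    Phi = NormalDist().cdf  # standard normal CDF, P(Z <= z)
    if tail == "two-tailed":
        return 2 * (1 - Phi(abs(z)))
    if tail == "lower":
        return Phi(z)
    if tail == "upper":
        return 1 - Phi(z)
    raise ValueError("tail must be 'two-tailed', 'lower' or 'upper'")

print(round(z_p_value(-2.5, "two-tailed"), 4))  # 0.0124
print(round(z_p_value(-3.0, "lower"), 5))       # 0.00135
```

Each branch matches one row of the table in Step 4, with X taken to be a standard normal test statistic.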

7.5 Testing a population mean claim


Suppose we are studying the average monthly expenses incurred by project teams in a
company over time. We are interested in determining whether the mean monthly
expenses are significantly lower than the assumed value of 500 (in £000s). This
lower-tailed hypothesis test is relevant in business management when managers want to
assess whether cost-cutting measures or improved efficiency have led to a decrease in the
monthly expenses of the project team, which can positively impact the department’s
financial performance and contribute to overall cost savings.
Let X denote the monthly expenses (in £000s). We assume that the monthly expenses
follow a normal distribution as an approximate model such that:

X ∼ N(µ, σ²)

and we wish to test:

H0 : µ = 500 (in £000s) vs. H1 : µ < 500 (in £000s).

To assess whether the mean monthly expenses have decreased from the assumed value
of 500, sample data will be required. Suppose a random sample of n = 100 is taken, and
let us assume that σ = 10 (in £000s). From Chapter 4, we know that:
X̄ ∼ N(µ, σ²/n) = N(µ, (10)²/100) = N(µ, 1).


Further, suppose that the sample mean in our random sample of n = 100 is x̄ = 497 (in
£000s). Clearly, we see that:
x̄ = 497 ≠ 500 = µ
where 500 is the claimed value of µ being tested in H0 .
The question is whether we judge the difference between x̄ = 497 and the claim µ = 500
to be:

(a) small, and hence attributable to sampling error (so we think H0 is true)

(b) large, and hence classified as statistically significant (so we think H1 is true).

Adopting the p-value approach to testing, the p-value will allow us to choose between
explanations (a) and (b). We proceed by standardising X̄ such that:

Z = (X̄ − µ)/(σ/√n) ∼ N(0, 1)
acts as our test statistic.
Using our sample data, we now obtain the test statistic value:
z = (x̄ − µ)/(σ/√n) = (497 − 500)/(10/√100) = −3.
The p-value is the probability of our test statistic value or a more extreme value
conditional on H0 . Noting that H1 : µ < 500, ‘more extreme’ here means a z-value
≤ −3. This can be expressed as:

p-value = P (Z ≤ −3) = 0.00135.

Note this value can easily be obtained using Excel (or, in the examination, Table 4 of
the New Cambridge Statistical Tables) as:

=NORM.S.DIST(-3,1)

where the function NORM.S.DIST(z,1) returns the cumulative probability P(Z ≤ z) for Z ∼ N(0, 1). It is shown as the red-shaded area in Figure 7.1.
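The whole calculation for this section's example can also be reproduced outside Excel. The following Python sketch (illustrative only, with the sample values above; variable names are ours) computes the test statistic value and the lower-tailed p-value:

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 500    # claimed population mean under H0 (in £000s)
sigma = 10    # known population standard deviation (in £000s)
n = 100       # sample size
x_bar = 497   # observed sample mean (in £000s)

z = (x_bar - mu_0) / (sigma / sqrt(n))  # test statistic value
p_value = NormalDist().cdf(z)           # lower-tailed: P(Z <= z)

print(z)                  # -3.0
print(round(p_value, 5))  # 0.00135
```

NormalDist().cdf(z) here corresponds directly to Excel's =NORM.S.DIST(z,1).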
We now apply the p-value decision rule. Since 0.00135 < 0.05 we reject H0 and conclude
that the result is ‘statistically significant’ at the 5% significance level (and also, of
course, at the 1% significance level, since also 0.00135 < 0.01). Hence this is a highly
significant result, with strong evidence that µ < 500. The mean monthly expenses
incurred by project teams is significantly lower than the assumed value of 500 (in
£000s). Finally, recall the hypothesis testing decision space in Table 6.1. As we have
rejected H0 this means one of two things:

either we have correctly rejected H0

or we have committed a Type I error.

Although the p-value is very small, indicating it is highly unlikely that this is a Type I
error, unfortunately we cannot be certain which outcome has actually occurred!


Figure 7.1: The standard normal distribution, indicating the p-value of 0.00135 in red.
Note the right-hand plot is a zoomed-in version of the left-hand tail of the left-hand plot.

We now show that the same conclusion is reached had we used the critical value
approach instead. At the 5% significance level, the critical value for this lower-tailed z
test is −1.645 (shown in Figure 6.4). Applying the critical value decision rule, since
−3 < −1.645 the test statistic value falls in the rejection region hence we reject H0.
Moving to the 1% significance level (as per the significance level decision tree in Figure
6.1), the new critical value is −2.326 (also shown in Figure 6.4), hence we again reject
H0 and have a highly significant result with the same conclusion as the p-value approach.
Note that if the p-value is less than 0.01 (here it is 0.00135) then this must mean that
the test statistic value falls in the rejection region at the 1% significance level (since the
rejection region area is α). Hence using critical values or p-values will always result in
the same conclusion!

7.6 Hypothesis test for a single mean (σ² known)


We consider the test of a single population mean when the population variance, σ², is known. To test H0: µ = µ0 when sampling from N(µ, σ²) we use the following test statistic.

z test for a single mean (σ² known)

In this case, the test statistic is:

X̄ − µ0
Z= √ ∼ N (0, 1). (7.1)
σ/ n

Hence critical values and p-values are obtained from the standard normal
distribution, i.e. using the bottom row of Table 10 and Table 4, respectively.


Example 7.1 The mean lifetime of 100 components in a sample is 1,570 hours and
their standard deviation is known to be 120 hours. Let µ be the mean lifetime of all
the components produced. Is it likely the sample comes from a population whose
mean is 1,600 hours?
We perform a two-tailed test since we are testing whether or not µ is 1,600 hours.
(Common sense might lead us to perform a lower-tailed test since 1,570 < 1,600,
suggesting that if µ is not 1,600 hours, then it is likely to be less than 1,600 hours.
However, since this is framed as a two-tailed test, a justification for performing a
lower-tailed test would be required, should you decide to opt for a lower-tailed test.
Indeed, in principle the alternative hypothesis should be determined before data are
collected, to avoid the data biasing our choice of alternative hypothesis!)
Hence we test:
H0: µ = 1,600 vs. H1: µ ≠ 1,600.
Since σ (and hence σ²) is known, we use (7.1) to calculate the test statistic value, which is:

z = (x̄ − µ0)/(σ/√n) = (1,570 − 1,600)/(120/√100) = −2.50.
For this two-tailed test the p-value is:

p-value = 2 × P(Z ≥ |−2.50|) = 2 × P(Z ≥ 2.50)
        = 2 × 0.0062
        = 0.0124

computed using =2*NORM.S.DIST(-2.5,1) (or using Table 4). It is shown as the red-shaded area in Figure 7.2.
Since 0.01 < 0.0124 < 0.05, using the p-value decision rule we conclude that the test
is significant at the 5% significance level, but not at the 1% significance level (we
reject H0 at the 5% significance level, but do not reject H0 at the 1% significance
level). Hence the test is moderately significant with moderate evidence to suggest
that the mean lifetime, µ, is not equal to 1,600 hours.
Using critical values, at the 5% significance level, the critical values for this
two-tailed z test are ±1.96 (shown in Figure 6.2). Since −2.50 < −1.96 the test
statistic value falls in the rejection region hence we reject H0 . Moving to the 1%
significance level (as per Figure 6.1), the new critical values are ±2.576 (also shown
in Figure 6.2), hence we do not reject H0 since −2.50 > −2.576 and so the test
statistic value does not fall in the rejection region. We have a moderately significant
result with the same conclusion as the p-value approach.

7.7 Hypothesis test for a single mean (σ² unknown)


Here the only difference from the previous case is that σ² is unknown, in which case it is estimated by the sample variance, S². To test H0: µ = µ0 when sampling from N(µ, σ²) we use the following test statistic.


Figure 7.2: The standard normal distribution, indicating the p-value of 0.0124 in red.

t test for a single mean (σ² unknown)

In this case, the test statistic is:

T = (X̄ − µ0)/(S/√n) ∼ t_{n−1}.  (7.2)

Hence critical values and p-values are obtained from the Student’s t distribution with
n − 1 degrees of freedom, i.e. using Table 10 or the T.DIST function in Excel, respectively.

Excel function: T.DIST

=T.DIST(x, deg_freedom, cumulative) returns the left-tailed Student's t distribution, where:

x is the numeric value at which to evaluate the distribution

deg_freedom is an integer indicating the number of degrees of freedom

cumulative is a logical value: for the cumulative distribution function, use 1 or TRUE; for the probability density function, use 0 or FALSE.

The determination of the p-value will depend on whether the test is a lower-tailed,
upper-tailed or two-tailed test. In the list below, suppose the test statistic value is x with
df degrees of freedom.

If H1 : µ < µ0 , the p-value would be =T.DIST(x,df,1).

If H1 : µ > µ0 , the p-value would be =1-T.DIST(x,df,1).


If H1: µ ≠ µ0, the p-value would be =2-2*T.DIST(ABS(x),df,1), where the ABS function returns the absolute value of x.

Example 7.2 A study on the impact of comprehensive planning on financial performance reported that the average annual return on investment for banks was
10.2%, and suggested that banks who exercised comprehensive planning would do
better than this. A random sample of 26 such banks gave the following returns on
investment. Do these data support the claim?

10.00, 11.90, 9.90, 10.09, 10.31, 9.96, 10.34,

10.30, 10.50, 10.23, 10.72, 11.54, 10.81, 10.15,

9.04, 11.55, 10.81, 8.69, 10.74, 10.31, 10.76,

10.92, 11.26, 11.21, 10.20, 10.76.

We test:
H0 : µ = 10.2 vs. H1 : µ > 10.2.
Note that this is an upper-tailed test as this is the region of interest (we hypothesise
that banks exercising comprehensive planning perform better than a 10.2% annual
return).
The summary statistics are n = 26, x̄ = 10.5 and s = 0.714. Hence, using (7.2), the
test statistic value is:
t = (x̄ − µ0)/(s/√n) = (10.5 − 10.2)/(0.714/√26) = 2.14.
For this upper-tailed test the p-value is:

p-value = P (T > 2.14) = 0.0212

computed using =1-T.DIST(2.14,25,1). It is shown as the red-shaded area in Figure 7.3.
We see that 0.01 < 0.0212 < 0.05 hence the test is significant at the 5% significance
level, but not at the 1% significance level. Note that we can view the p-value as the
smallest value of α such that we would reject the null hypothesis. We conclude that
the test is moderately significant with moderate evidence against the null hypothesis
and that banks exercising comprehensive planning perform better than a 10.2%
annual return, on average.
Using critical values, at the 5% significance level, the critical value for this
upper-tailed t test is 1.708 (using Table 10 for a t25 distribution). Since 2.14 > 1.708
the test statistic value falls in the rejection region hence we reject H0 . Moving to the
1% significance level (as per Figure 6.1), the new critical value is 2.485 (from Table
10), hence we do not reject H0 since 2.14 < 2.485 and so the test statistic value does
not fall in the rejection region. We have a moderately significant result with the
same conclusion as the p-value approach.
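The summary statistics and test statistic value in Example 7.2 can be checked from the raw data with a Python sketch (illustrative only; the p-value itself requires the Student's t CDF, e.g. Excel's T.DIST, which the Python standard library does not provide):

```python
from math import sqrt
from statistics import mean, stdev

# Annual returns on investment (%) for the 26 sampled banks in Example 7.2
returns = [10.00, 11.90, 9.90, 10.09, 10.31, 9.96, 10.34,
           10.30, 10.50, 10.23, 10.72, 11.54, 10.81, 10.15,
           9.04, 11.55, 10.81, 8.69, 10.74, 10.31, 10.76,
           10.92, 11.26, 11.21, 10.20, 10.76]

n = len(returns)
x_bar = mean(returns)  # sample mean
s = stdev(returns)     # sample standard deviation (n - 1 divisor)

t = (x_bar - 10.2) / (s / sqrt(n))  # test statistic value, df = n - 1 = 25

print(n, round(x_bar, 1), round(s, 3), round(t, 2))  # 26 10.5 0.714 2.14
```

Note stdev uses the n − 1 divisor, matching the sample standard deviation s used in (7.2).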


Figure 7.3: The Student’s t distribution with 25 degrees of freedom, indicating the p-value
of 0.0212 in red.

7.8 Hypothesis test for a single proportion


We now consider the hypothesis test for a single proportion. Recall from Chapter 5 that
by the central limit theorem, the (approximate) sampling distribution of the sample
proportion, P, is:

P ∼ N(π, π(1 − π)/n)
as n → ∞. When testing:
H0: π = π0

where π0 is the claimed value of the population proportion under H0, the standard error is:

√(π0(1 − π0)/n)
that is, we set π = π0 when conditioning on the null hypothesis, leading to the following
test statistic, achieved by standardising the sample proportion, P .

z test for a single proportion

In this case, the test statistic is:

Z = (P − π0)/√(π0(1 − π0)/n) ∼ N(0, 1)  (approximately, for large n).  (7.3)

Validity requires nπ0 > 5 and n(1 − π0 ) > 5. Hence critical values and p-values are
obtained from the standard normal distribution, i.e. using the bottom row of Table
10 and Table 4, respectively.


Using Excel, suppose the test statistic value is x.

If H1 : π < π0 , the p-value would be =NORM.S.DIST(x,1).

If H1 : π > π0 , the p-value would be =1-NORM.S.DIST(x,1).

If H1: π ≠ π0, the p-value would be =2-2*NORM.S.DIST(ABS(x),1).

Example 7.3 To illustrate this, let us extend Example 5.4 where we consider a
survey conducted to estimate the proportion of bank customers who would be
interested in using a proposed new mobile telephone banking service. If we denote
the population proportion of customers who are interested by π, and it is found that
68 out of a random sample of 150 sampled customers were interested, then we would
estimate π with p = 68/150 = 0.453.
Suppose that other surveys have shown that 40% of the population of customers are
interested and it is proposed to test whether or not the above survey agrees with
this figure, i.e. we conduct a two-tailed test. Hence:

H0: π = 0.40 vs. H1: π ≠ 0.40.

The test statistic value is, using (7.3):

z = (p − π0)/√(π0(1 − π0)/n) = (0.453 − 0.40)/√(0.40 × 0.60/150) = 1.325.

For this two-tailed test the p-value is:

p-value = 2 × P(Z ≥ |1.325|) = 0.1852

computed using =2-2*NORM.S.DIST(ABS(1.325),1). It is shown as the total red-shaded area in Figure 7.4.
Since 0.1852 > 0.10, the p-value exceeds the 10% significance level, hence using the
p-value decision rule we do not reject H0 – the test result is not statistically
significant. There is insufficient evidence to justify rejecting H0 .
We conclude that the data are consistent with the null hypothesis and that the
proportion of customers interested in the population may well be 40%.
Note that if we had an a priori reason (meaning before we collected any data) to
believe that the proportion of customers interested would be greater than 40%
(without being influenced by the sample proportion being greater than 0.40), we
would conduct an upper-tailed test of:

H0 : π = 0.40 vs. H1 : π > 0.40.

The test statistic value would remain at 1.325, but now the p-value would be halved,
since =1-NORM.S.DIST(1.325,1) returns 0.0926, and since 0.0926 < 0.10 we would
now reject H0 at the 10% significance level. The test is now weakly significant, with
weak evidence that more than 40% of the population of customers is interested. Note
the p-value of 0.0926 is the red-shaded area in the right tail only in Figure 7.4.


While this is a different decision to the two-tailed test, it is not inconsistent since:

not reject H0 ≠ accept H0.

The power of the test has increased with the adoption of an upper-tailed H1 because
the effect of halving the two-tailed p-value makes it easier to reject H0 by increasing
the chance that the p-value falls below the chosen significance level.
Using critical values, for a two-tailed test at the 5% significance level, the critical
values are ±1.96 (shown in Figure 6.2). Since 1.325 < 1.96 the test statistic value
does not fall in the rejection region hence we do not reject H0 . Moving to the 10%
significance level (as per Figure 6.1), the new critical values are ±1.645 (also shown
in Figure 6.2), hence again we do not reject H0 since 1.325 < 1.645 and so the test
statistic value does not fall in the rejection region. The test is not statistically
significant, as per above.

Figure 7.4: The standard normal distribution, indicating the p-value of 0.1852 in red.
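Example 7.3's calculations can be sketched in Python as well (illustrative only; p_hat below is the rounded sample proportion 0.453 used in the example, and the variable names are ours):

```python
from math import sqrt
from statistics import NormalDist

pi_0 = 0.40    # claimed population proportion under H0
n = 150
p_hat = 0.453  # sample proportion 68/150, rounded as in Example 7.3

se = sqrt(pi_0 * (1 - pi_0) / n)  # standard error under H0
z = (p_hat - pi_0) / se           # test statistic value

Phi = NormalDist().cdf
p_two_tailed = 2 * (1 - Phi(abs(z)))  # H1: pi != 0.40
p_upper = 1 - Phi(z)                  # H1: pi > 0.40

print(round(z, 3))             # 1.325
print(round(p_two_tailed, 4))  # 0.1852
print(round(p_upper, 4))       # 0.0926
```

The halving of the p-value when moving from the two-tailed to the upper-tailed alternative is visible directly in the last two lines.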

7.9 Hypothesis testing of differences between parameters of two populations
Testing for differences between two population means or proportions is of great
interest to researchers. In experiments, comparing means allows researchers to assess the
impact of different conditions or treatments. For instance, in medicine, it is important
to compare the mean effectiveness of two drugs in a clinical trial to determine which one
is more effective in treating a specific condition. Similarly, in environmental science,
comparing means can help ascertain whether there is a significant difference in
pollutant levels before and after implementing a pollution control measure.
Comparing proportions is important for studying categorical data, such as the level of
brand awareness after a marketing campaign or the success rate of a new manufacturing


process. In marketing, researchers can compare proportions to understand the effectiveness of a campaign to increase brand awareness, while in quality control,
proportions help assess the reliability of a production method.
Researchers formulate hypotheses about the effects of variables and use statistical tests
to determine whether the observed differences are statistically significant or merely due
to chance. This rigour ensures that scientific conclusions are grounded in evidence and
can be replicated, fostering the credibility and reliability of research.

7.10 Difference between two population means


In this section we consider the difference between two population means of two normal
distributions, i.e. µ1 − µ2 , where the subscripts ‘1’ and ‘2’ distinguish the two
populations, i.e. groups 1 and 2, respectively. As in Chapter 5, there are four cases to
study:

when the variances of the two populations are known


when the variances of the two populations are unknown and assumed to be unequal
when the variances of the two populations are unknown and assumed to be equal
the case of paired datasets.
7.10.1 Unpaired samples: variances known
Suppose we have two random samples of size n1 and n2, respectively, drawn from
N(µ1, σ1²) and N(µ2, σ2²), where σ1² and σ2² are known. Testing for the equality of means
gives the null hypothesis:

H0: µ1 = µ2 or, in terms of their difference, H0: µ1 − µ2 = 0.
The sampling distribution of X̄1 − X̄2 is:

X̄1 − X̄2 ∼ N(µ1 − µ2, σ1²/n1 + σ2²/n2)
which, when standardised, gives the test statistic.

z test for the difference between two means (variances known)

In this case, the test statistic is:

Z = (X̄1 − X̄2 − (µ1 − µ2)) / √(σ1²/n1 + σ2²/n2) ∼ N(0, 1). (7.4)

Hence critical values and p-values are obtained from the standard normal
distribution, i.e. using the bottom row of Table 10 and Table 4, respectively.
Note if testing for the equality of means, then µ1 − µ2 = 0 under H0 . Hence, in (7.4),
we set the term (µ1 − µ2 ) equal to 0.


Example 7.4 Suppose we are interested in researching the average response time
of two different customer support teams, 1 and 2, in a company. Managers want to
determine if there is a statistically significant difference in the mean response time of
these two support teams.
We test:
H0: µ1 = µ2 vs. H1: µ1 ≠ µ2.
We assume that the data for each team follow a normal distribution, and that the
population variances are known to be:

σ1² = 400 minutes² and σ2² = 425 minutes²

respectively.
Managers collected a random sample of 30 customer support interactions from each
team (hence n1 = n2 = 30) and measured their response times, with sample means
of x̄1 = 150 minutes and x̄2 = 165 minutes.
Using (7.4), the test statistic value is:
z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2) = (150 − 165) / √(400/30 + 425/30) = −2.86.

For this two-tailed test the p-value is:


p-value = 2 × P (Z > | − 2.86|) = 0.0042

computed using =2*NORM.S.DIST(-2.86,1) (or Table 4). It is shown as the


red-shaded area in Figure 7.5.
Hence we reject H0 at both the 5% and 1% significance levels and conclude that the
test is highly significant, with strong evidence of a difference in the mean response
times of the two customer support teams.
Using critical values, for a two-tailed test at the 5% significance level, the critical
values are ±1.96 (shown in Figure 6.2). Since −2.86 < −1.96 the test statistic value
falls in the rejection region hence we reject H0 .
Moving to the 1% significance level (as per Figure 6.1), the new critical values are
±2.576 (also shown in Figure 6.2), hence again we reject H0 since −2.86 < −2.576
and so the test statistic value again falls in the rejection region. The test is highly
significant, as per above.
Since x̄1 < x̄2 , it is likely that µ1 < µ2 (rather than merely concluding that they are
not equal).
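If you wish to verify such calculations programmatically, the test statistic and p-value of Example 7.4 can be reproduced in Python using only the standard library. This is an illustrative sketch, offered as an alternative to the Excel function =2*NORM.S.DIST(-2.86,1) quoted above; it is not part of the examined material.

```python
from math import sqrt
from statistics import NormalDist

# Example 7.4: response times of two support teams (variances known)
n1, n2 = 30, 30
xbar1, xbar2 = 150, 165
var1, var2 = 400, 425  # known population variances, in minutes^2

# z test statistic of (7.4), with mu1 - mu2 = 0 under H0
z = (xbar1 - xbar2) / sqrt(var1 / n1 + var2 / n2)

# Two-tailed p-value from the standard normal distribution
p_value = 2 * NormalDist().cdf(-abs(z))
print(round(z, 2), round(p_value, 4))  # -2.86 0.0042
```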

7.10.2 Unpaired samples: variances unknown and unequal


We have the same set-up as above, with the same sampling distribution of X̄1 − X̄2, but
now the population variances σ1² and σ2² are unknown. Assuming large sample sizes, we
can replace these unknown parameters with the sample variance estimators S1² and S2²,
respectively, to obtain the test statistic.


Figure 7.5: The standard normal distribution, indicating the p-value of 0.0042 in red.

z test for the difference between two means (variances unknown)

If the population variances σ1² and σ2² are unknown, provided sample sizes n1 and n2
are large (greater than 30):

Z = (X̄1 − X̄2 − (µ1 − µ2)) / √(S1²/n1 + S2²/n2) ∼ N(0, 1) (approximately, for large n1 and n2). (7.5)
Hence critical values and p-values are obtained from the standard normal
distribution, i.e. using the bottom row of Table 10 and Table 4, respectively.
Note if testing for the equality of means, then µ1 − µ2 = 0 under H0 . Hence, in (7.5),
we set the term (µ1 − µ2 ) equal to 0.

Example 7.5 Extending Example 7.4, suppose the population variances are now
unknown. Using the samples, suppose the sample variances are s1² = 550 minutes²
and s2² = 575 minutes².
We still test:
H0: µ1 = µ2 vs. H1: µ1 ≠ µ2.
Using (7.5), the test statistic value is:
z = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2) = (150 − 165) / √(550/30 + 575/30) = −2.45.

For this two-tailed test the p-value is:

p-value = 2 × P (Z > | − 2.45|) = 0.0143

computed using =2*NORM.S.DIST(-2.45,1) (or Table 4). It is shown as the


red-shaded area in Figure 7.6.


Hence we reject H0 only at the 5% significance level (not at the 1% significance level,
since 0.0143 > 0.01) and conclude that the test is moderately significant, with
moderate evidence of a difference in the mean response times of the two customer
support teams. Again, since x̄1 < x̄2 , it is likely that µ1 < µ2 (rather than merely
concluding that they are not equal).
Using critical values, for a two-tailed test at the 5% significance level, the critical
values are ±1.96 (shown in Figure 6.2). Since −2.45 < −1.96 the test statistic value
falls in the rejection region hence we reject H0 . Moving to the 1% significance level
(as per Figure 6.1), the new critical values are ±2.576 (also shown in Figure 6.2),
hence we do not reject H0 since −2.45 > −2.576 and so the test statistic value does
not fall in the rejection region. The test is moderately significant, as per above.
Note that the (absolute) test statistic value is smaller here relative to that in
Example 7.4 due to the sample variances being larger than the (known) population
variances. Given the inverse relationship between the test statistic value and the
p-value, the p-value in this example is inevitably larger (0.0143 vs. 0.0042), and now
only significant at the 5% significance level.

Figure 7.6: The standard normal distribution, indicating the p-value of 0.0143 in red.
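As an illustrative sketch, the same calculation can be wrapped in a small reusable Python function. Applying it to the Example 7.5 figures shows how the larger (estimated) variances reduce the absolute test statistic value and inflate the p-value relative to Example 7.4.

```python
from math import sqrt
from statistics import NormalDist

def z_two_means(xbar1, xbar2, v1, v2, n1, n2):
    """Two-sample z statistic of (7.5) under H0: mu1 = mu2, and its
    two-tailed p-value; v1 and v2 are the (known or, for large
    samples, estimated) variances."""
    z = (xbar1 - xbar2) / sqrt(v1 / n1 + v2 / n2)
    return z, 2 * NormalDist().cdf(-abs(z))

# Example 7.5: sample variances replace the known population variances
z, p = z_two_means(150, 165, 550, 575, 30, 30)
print(round(z, 2), round(p, 4))  # -2.45 0.0143
```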

7.10.3 Unpaired samples: variances unknown and equal

Although still unknown, if we assume the population variances are equal to some
common variance, i.e. that σ1² = σ2² = σ², then we only have one (common) unknown
variance to estimate. As with confidence intervals in Section 5.9.3, we utilise the
pooled variance estimator, given by:

Sp² = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2).


t test for the difference between two means (variances unknown)

If the population variances σ1² and σ2² are unknown but assumed equal:

T = (X̄1 − X̄2 − (µ1 − µ2)) / √(Sp²(1/n1 + 1/n2)) ∼ tn1+n2−2 (7.6)

where Sp² is the pooled variance estimator.

Hence critical values and p-values are obtained from the Student’s t distribution
with n1 + n2 − 2 degrees of freedom, i.e. using Table 10 or T.DIST function in Excel,
respectively.

Note if testing for the equality of means, then µ1 − µ2 = 0 under H0 . Hence, in (7.6),
we set the term (µ1 − µ2 ) equal to 0.

Example 7.6 To illustrate this, let us extend Example 5.10 and consider the
complaint reaction times of two similar companies. Random samples gave the
following statistics (in days):

              Sample size   Sample mean   Sample std. dev.
Company 1          12            8.5              3.6
Company 2          10            4.8              2.1

We want to test for a difference between mean reaction times, i.e. we test:

H0: µ1 = µ2 vs. H1: µ1 ≠ µ2.

Because the two companies are ‘similar’, it is reasonable to assume (although it is


only an assumption!) that the two population variances are equal. Under this
assumption, the estimate of the common variance is:
sp² = ((12 − 1) × (3.6)² + (10 − 1) × (2.1)²) / (12 + 10 − 2) = 9.1125.
Therefore, the test statistic value, using (7.6), is:
t = (x̄1 − x̄2) / √(sp²(1/n1 + 1/n2)) = (8.5 − 4.8) / √(9.1125 × (1/12 + 1/10)) = 2.87.

There are 12 + 10 − 2 = 20 degrees of freedom, hence we obtain the p-value from the
t20 distribution.
The p-value is returned by 2-2*T.DIST(ABS(2.87),20,1) which gives 0.0095. It is
shown as the red-shaded area in Figure 7.7. Since 0.0095 < 0.01 (just!), we reject H0
at the 1% significance level. On this basis, we have a highly significant result, and we
conclude that there is strong evidence that the mean reaction times of the two
companies are different. Indeed, it appears that Company 1 is slower, on average,
than Company 2, since x̄1 > x̄2 .


Using critical values, at the 5% significance level, the critical values for this
two-tailed t test are ±2.086 (using Table 10 for a t20 distribution). Since
2.87 > 2.086 the test statistic value falls in the rejection region hence we reject H0 .
Moving to the 1% significance level (as per Figure 6.1), the new critical values are
±2.845 (from Table 10), hence we reject H0 since 2.87 > 2.845 (just!) and so the test
statistic value again falls in the rejection region. We have a highly significant result
with the same conclusion as the p-value approach.

Figure 7.7: The Student’s t distribution with 20 degrees of freedom, indicating the p-value
of 0.0095 in red.
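A quick numerical check of the pooled-variance calculation (an illustrative Python sketch using only the standard library; the critical values 2.086 and 2.845 are the t20 values from Table 10 quoted above, and small differences from the reported 2.87 reflect intermediate rounding):

```python
from math import sqrt

# Example 7.6 summary statistics (complaint reaction times, in days)
n1, xbar1, s1 = 12, 8.5, 3.6
n2, xbar2, s2 = 10, 4.8, 2.1

# Pooled variance estimate, then the t statistic of (7.6)
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t = (xbar1 - xbar2) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Compare with the t_20 critical values 2.086 (5%) and 2.845 (1%)
print(round(sp2, 4), round(t, 2), t > 2.086, t > 2.845)  # 9.1125 2.86 True True
```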

7.10.4 Paired (dependent) samples


Recall from Section 5.9.4 that for paired (dependent) samples we work with
differenced data to reduce matters to a one-sample analysis. As before, we compute
the differenced data as:
d1 = x1 − y1 , d2 = x2 − y2 , ..., dn = xn − yn
reducing the two-sample problem to a one-sample problem. The test is then analogous
to the hypothesis test of a single mean with σ 2 unknown, covered in Section 7.7,
performed on the differenced data.

t test for the difference in means in paired samples

Using the sample mean and sample standard deviation of differenced data:

T = (X̄d − µd) / (Sd/√n) ∼ tn−1. (7.7)

Hence critical values and p-values are obtained from the Student’s t distribution with
n−1 degrees of freedom, i.e. using Table 10 or T.DIST function in Excel, respectively.

151
7. Hypothesis testing of means and proportions

Example 7.7 Extending Example 5.11, the table below shows the before and after
weights (in pounds) of 8 adults after trying an experimental diet. We test whether
there is evidence that the diet is effective.

Before After
127 122
130 120
114 116
139 132
150 144
147 138
167 155
153 152

We want to test:
H0 : µ1 = µ2 vs. H1 : µ1 > µ2
which is equivalent to testing:

H0 : µd = 0 vs. H1 : µd > 0

where we choose a one-tailed test because we are looking for a reduction (if there is
any change, we would expect weight loss from a diet!) and we define µd = µ1 − µ2
since we anticipate that this way round the values will (more likely) be positive.
The differences (calculated as ‘Before − After’) are:

5, 10, −2, 7, 6, 9, 12 and 1.

For example:

d1 = x1 − y1 = 127 − 122 = 5, d2 = x2 − y2 = 130 − 120 = 10 etc.

Hence n = 8, x̄d = 6 and sd = 4.66 on n − 1 = 7 degrees of freedom. Using (7.7), the


test statistic value is:
t = (x̄d − µd) / (sd/√n) = (6 − 0) / (4.66/√8) = 3.64.
The p-value is returned by 1-T.DIST(3.64,7,1) which gives 0.0041. It is shown in
Figure 7.8. Since 0.0041 < 0.01, we reject H0 at the 1% significance level. On this
basis, we have a highly significant result, and we conclude that there is strong
evidence that the experimental diet reduces weight, on average.
Using critical values, at the 5% significance level, the critical value for this
upper-tailed t test is 1.895 (using Table 10 for a t7 distribution). Since 3.64 > 1.895
the test statistic value falls in the rejection region hence we reject H0 . Moving to the
1% significance level (as per Figure 6.1), the new critical value is 2.998 (from Table
10), hence again we reject H0 since 3.64 > 2.998 and so the test statistic value again
falls in the rejection region. We have a highly significant result with the same
conclusion as the p-value approach.


Figure 7.8: The Student’s t distribution with 7 degrees of freedom, indicating the p-value
of 0.0041 in red.
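The differencing and the resulting one-sample t statistic can be checked as follows (an illustrative Python sketch; statistics.stdev computes the sample standard deviation with the n − 1 divisor, as required here):

```python
from math import sqrt
from statistics import mean, stdev  # stdev uses the n - 1 divisor

before = [127, 130, 114, 139, 150, 147, 167, 153]
after = [122, 120, 116, 132, 144, 138, 155, 152]

# Differenced data reduce the problem to a one-sample t test, as in (7.7)
d = [b - a for b, a in zip(before, after)]
t = mean(d) / (stdev(d) / sqrt(len(d)))
print(d, round(t, 2))  # [5, 10, -2, 7, 6, 9, 12, 1] 3.64
```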

7.11 Difference between two population proportions


As with confidence intervals in Section 5.10, the correct approach for the comparison of
two population proportions, π1 and π2 , is to consider the difference between them, i.e.
π1 − π2. When testing for equal proportions (i.e. a zero difference), the null hypothesis is:
H0 : π1 − π2 = 0 or equivalently H0 : π1 = π2 .

We derive the test statistic by standardising the (approximate, by the central limit
theorem) sampling distribution of the difference between two independent sample
proportions, P1 − P2 , given by:
 
P1 − P2 ∼ N(π1 − π2, π1(1 − π1)/n1 + π2(1 − π2)/n2)

leading to the proposed test statistic (via standardisation) of:

(P1 − P2 − (π1 − π2)) / √(π1(1 − π1)/n1 + π2(1 − π2)/n2) ∼ N(0, 1)

approximately, for large n1 and n2 . However, when evaluating this test statistic, which
values do we use for π1 and π2 ? In the test of a single proportion, we had H0 : π = π0 ,
where π0 is the tested (known) value.
When comparing two proportions, under H0 no value is given for π1 and π2 , only that
they are equal, that is:
π1 = π2 = π

where π is the common proportion whose value, of course, is still unknown! Hence we
need to estimate π from the sample data using the pooled proportion estimator.


Pooled proportion estimator

If R1 and R2 represent the number of ‘favourable’ responses from two independent
random samples with sample sizes of n1 and n2, respectively, then the pooled
proportion estimator is:

P = (R1 + R2) / (n1 + n2). (7.8)

This leads to the following revised test statistic.

z test for the difference between two proportions

In this case, the test statistic is:

Z = (P1 − P2 − (π1 − π2)) / √(P(1 − P)(1/n1 + 1/n2)) ∼ N(0, 1) (7.9)

(approximately, for large n1 and n2).

Hence critical values and p-values are obtained from the standard normal
distribution, i.e. using the bottom row of Table 10 and Table 4, respectively.

Example 7.8 To illustrate this, let us extend Example 5.12 by testing for a
difference between the population proportions of the general public who are aware of
a particular commercial product before and after an advertising campaign. Two
surveys were conducted and the results of the two random samples were:

Sample size Number aware


After campaign 120 65
Before campaign 150 68

If π1 and π2 are the true population proportions for ‘after’ and ‘before’ the
campaign, respectively, then we wish to test:

H0 : π1 = π2 vs. H1 : π1 > π2 .

Note that we use an upper-tailed test on the assumption that the advertising
campaign would increase the proportion aware – an example of the importance of
using common sense in determining the alternative hypothesis!
For the sample proportions, we have:
p1 = r1/n1 = 65/120 = 0.5417 and p2 = r2/n2 = 68/150 = 0.4533.
On the assumption that H0 is true, we estimate the common proportion, π, using
(7.8), to be:
(65 + 68) / (120 + 150) = 0.4926.


So our test statistic value, using (7.9), is:


z = (0.5417 − 0.4533) / √(0.4926 × (1 − 0.4926) × (1/120 + 1/150)) = 1.44.

The p-value for this upper-tailed test is:

P (Z > 1.44) = 0.0749

using =1-NORM.S.DIST(1.44,1) (or Table 4). It is shown as the red-shaded area in


Figure 7.9. Since 0.0749 > 0.05, we are unable to reject H0 for any significance level
α < 0.0749, such as at the 5% significance level. However, the test is significant at
the 10% significance level (because 0.0749 < 0.10). Hence the test is weakly
significant, as there is weak evidence that the advertising campaign has been
effective.
Using critical values, for an upper-tailed test at the 5% significance level, the critical
value is 1.645 (shown in Figure 6.3). Since 1.44 < 1.645 the test statistic value does
not fall in the rejection region hence we do not reject H0 . Moving to the 10%
significance level (as per Figure 6.1), the new critical value is 1.282 (also shown in
Figure 6.3), hence we now reject H0 since 1.44 > 1.282 and so the test statistic value
does fall in the rejection region. The test is weakly significant, as per above.

Figure 7.9: The standard normal distribution, indicating the p-value of 0.0749 in red.
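The pooled-proportion calculation can be verified numerically (an illustrative Python sketch using only the standard library; the p-value differs slightly from the 0.0749 reported above because the text rounds z to 1.44 before consulting Table 4):

```python
from math import sqrt
from statistics import NormalDist

# Example 7.8: brand awareness after (1) and before (2) the campaign
r1, n1 = 65, 120
r2, n2 = 68, 150
p1, p2 = r1 / n1, r2 / n2

# Pooled proportion estimate of (7.8) under H0: pi1 = pi2
p = (r1 + r2) / (n1 + n2)

# z test statistic of (7.9), then the upper-tailed p-value
z = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
p_value = 1 - NormalDist().cdf(z)
print(round(z, 2), round(p_value, 4))
```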

7.12 Overview of chapter


This chapter has outlined the main steps in conducting a statistical test. Tests for a
single population mean (for the cases of known and unknown variances) and a single
population proportion have been presented, along with the statistical tests for testing
the difference (or equality) of two population means and two population proportions,
respectively. In each example, the critical value and p-value approaches to testing have
been used, and we have seen that they always lead to the same conclusions.


7.13 Key terms and concepts


Common variance
Critical value
Decision rule
Difference
Differenced data
p-value
Pooled proportion estimator
Pooled variance estimator
Significance level
Rejection region
t test
Test statistic
Test statistic value
z test

7.14 Sample examination questions


1. An experiment is conducted to determine whether intensive tutoring (covering a
great deal of material in a fixed amount of time) is more effective than standard
tutoring (covering less material in the same amount of time) for a particular
course. Two randomly chosen groups of students were tutored separately and their
examination mark on the course was recorded. The data are summarised in the
table below:

                     Sample size   Mean examination mark   Sample standard deviation
Intensive tutoring        22                65.33                    6.61
Standard tutoring         25                61.58                    5.37

(a) Use an appropriate hypothesis test to determine whether there is a difference


between the mean examination marks between the two tutoring groups. State
clearly the hypotheses, the test statistic and its distribution under the null
hypothesis, and carry out the test at two appropriate significance levels.
Comment on your findings.
(b) State clearly any assumptions you made in part (a).

2. (a) A pharmaceutical company is conducting an experiment to test whether a new


type of pain reliever is effective. In this context, a treatment is considered
effective if it is successful with a probability of more than 0.50. The pain
reliever was given to 30 patients and it reduced the pain for 20 of them. You
are asked to use an appropriate hypothesis test to determine whether the pain
reliever is effective. State the test hypotheses, and specify your test statistic
and its distribution under the null hypothesis. Comment on your findings.
(b) A second experiment followed where a placebo pill was given to another group
of 40 patients. A placebo pill contains no medication and is prescribed so that
the patient will expect to get well. In some situations, this expectation is
enough for the patient to recover. This effect, also known as the placebo effect,
occurred to some extent in the second experiment where the pain was reduced
for 21 of the patients. You are asked to consider an appropriate hypothesis test
to incorporate this new evidence with the previous data and reassess the
effectiveness of the pain reliever.


3. A paired-difference experiment involved n = 121 adults. For each adult a


characteristic was measured under two distinct conditions and the difference in
characteristic values was recorded. The sample mean of the differences was 1.195,
whereas their sample standard deviation was 10.2. The researchers reported a t
statistic value of 1.289 when testing whether the means of the two conditions are
the same.
(a) Show how the researchers obtained the t statistic value of 1.289.
(b) Calculate the p-value of the test and use the p-value to test the hypothesis of
equal means. Use a 5% significance level.

7.15 Solutions to Sample examination questions

1. (a) Let µ1 denote the mean examination mark for the intensive tutoring group and
µ2 denote the mean examination mark for the standard tutoring group. The
wording ‘whether there is a difference between the mean examination marks’
implies a two-tailed test, hence the hypotheses can be written as:

H0: µ1 = µ2 vs. H1: µ1 ≠ µ2.

The test statistic formulae, depending on whether or not a pooled variance is
used, are:

(x̄1 − x̄2) / √(sp²(1/n1 + 1/n2)) or (x̄1 − x̄2) / √(s1²/n1 + s2²/n2).

If equal variances are assumed, the test statistic value is 2.1449. If equal
variances are not assumed, the test statistic value is 2.1164.
The variances are unknown but the sample size is large enough, so the standard
normal distribution can be used. The t40 distribution (the nearest degrees of
freedom available in Table 10) is also correct and will be used in what follows.
The critical values at the 5% significance level are ±2.021, hence we reject the
null hypothesis. Moving to the 1% significance level, the critical values are
±2.704, so we do not reject H0 . We conclude that the test is moderately
significant such that there is moderate evidence of a difference between the two
tutoring groups.
(b) The assumptions for part (a) relate to the following.
• Assumption about whether variances are equal, i.e. whether σ1² = σ2² or
σ1² ≠ σ2².
• Assumption about whether n1 + n2 is ‘large’ so that the normality
assumption is satisfied.
• Assumption about the samples being independent.
• Assumption about whether a normal or t distribution is used.
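The two test statistic values quoted in part (a) can be checked numerically (an illustrative Python sketch, not part of the required written solution):

```python
from math import sqrt

n1, xbar1, s1 = 22, 65.33, 6.61  # intensive tutoring
n2, xbar2, s2 = 25, 61.58, 5.37  # standard tutoring
diff = xbar1 - xbar2

# Pooled-variance version (equal variances assumed)
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t_pooled = diff / sqrt(sp2 * (1 / n1 + 1 / n2))

# Unpooled version (equal variances not assumed)
t_unpooled = diff / sqrt(s1**2 / n1 + s2**2 / n2)
print(round(t_pooled, 4), round(t_unpooled, 4))  # 2.1449 2.1164
```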


2. (a) Let πT denote the true probability for the new treatment to work. We test:
H0 : πT = 0.50 vs. H1 : πT > 0.50.
Under H0 , the test statistic is:
(p − 0.50) / √(0.50 × (1 − 0.50)/n) ∼ N(0, 1)
approximately due to the central limit theorem, since here n = 30 is (just
about) large enough. Hence the test statistic value is:
(20/30 − 0.50) / 0.0913 = 1.83.
For α = 0.05, the critical value is 1.645. Since 1.83 > 1.645, we reject H0 at the
5% significance level. Moving to the 1% significance level, the critical value is
2.326, so we do not reject H0 . The test is moderately significant, with moderate
evidence that the treatment is effective.
(b) Let πP denote the true probability for the patient to recover with the placebo.
We test:
H0 : πT = πP vs. H1 : πT > πP .
For reference, the test statistic is:
(pT − pP) / √(P(1 − P)(1/n1 + 1/n2)) ∼ N(0, 1)

approximately, due to the central limit theorem, where pT and pP are the sample
proportions and P is the pooled proportion estimator.
The standard error is:

S.E.(pT − pP) = √((41/70) × (29/70) × (1/40 + 1/30)) = 0.119

and so the test statistic value is:

(20/30 − 21/40) / 0.119 = 1.191.
For α = 0.05, the critical value is 1.645. Since 1.191 < 1.645, we do not reject
H0 at the 5% significance level. Moving to the 10% significance level, the
critical value is 1.282, so again we do not reject H0 . The test is not statistically
significant. There is insufficient evidence of higher effectiveness.
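A numerical check of both parts (an illustrative Python sketch):

```python
from math import sqrt

# (a) One-sample test of H0: pi = 0.50 vs. H1: pi > 0.50
n, r = 30, 20
z_a = (r / n - 0.50) / sqrt(0.50 * (1 - 0.50) / n)

# (b) Two-sample test against the placebo group, using the pooled proportion
n_t, r_t = 30, 20  # pain reliever
n_p, r_p = 40, 21  # placebo
pool = (r_t + r_p) / (n_t + n_p)
z_b = (r_t / n_t - r_p / n_p) / sqrt(pool * (1 - pool) * (1 / n_t + 1 / n_p))
print(round(z_a, 2), round(z_b, 3))  # 1.83 1.191
```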

3. (a) The test statistic value is:


x̄d / (sd/√n) = 1.195 / (10.2/√121) = 1.289.
(b) In this case one can use the tn−1 = t120 distribution, since the standard
deviation is unknown and estimated by sd . Nevertheless the assumption of a
standard normal distribution is also justified given the large sample size n.
Combining the above, noting that this is a two-tailed test we get that the
p-value is:
2 × P (T ≥ 1.289) = 2 × 0.10 = 0.20
where T ∼ t120 , using Table 10 of the New Cambridge Statistical Tables. Since
the p-value is 0.20 > 0.05, the test is not significant at the 5% significance
level, hence there is no evidence of a difference.
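The arithmetic in part (a) can be confirmed directly (an illustrative Python sketch):

```python
from math import sqrt

# Paired-difference summary statistics from the question
n, xbar_d, s_d = 121, 1.195, 10.2
t = xbar_d / (s_d / sqrt(n))
print(round(t, 3))  # 1.289
```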

Chapter 8
Contingency tables and the
chi-squared test

8.1 Synopsis of chapter


Every variable has a type, known as a level of measurement. When investigating
possible relationships between variables, the number and type of variables being
considered affects which statistical technique can be used to detect any evidence of a
relationship. In this chapter, our focus will be on researching whether there is a
statistical association between two categorical variables, which employs contingency
tables. We also, briefly, consider goodness-of-fit tests for assessing whether sample data
appear to be a good fit to a specified probability distribution.

8.2 Learning outcomes


After completing this chapter, and having completed the essential reading and
activities, you should be able to:
set up the null and alternative hypotheses appropriate for a contingency table

compute the degrees of freedom, expected frequencies and appropriate critical


values of a chi-squared test for a contingency table

summarise the limitations of a chi-squared test

be able to work with a one-row or one-column contingency table as above.

8.3 Recommended reading


Abdey, J. Business Analytics: Applied Modelling and Prediction. (London: SAGE
Publications, 2023) 1st edition [ISBN 9781529774092] Chapter 11.

8.4 Introduction
In Chapter 7 we focused on testing the value of a population parameter of interest, such
as a mean, µ, or a proportion, π (and differences between means and proportions). Being
able to perform such statistical tests is of particular use when undertaking research.


Here we shall look at two additional testing procedures, one which deals with testing for
association between two categorical variables (introduced in Chapter 2), and a second
which considers the shape of the distribution from which the sample data were drawn.

8.5 Association versus correlation


We frequently encounter questions such as ‘Does the implementation of a specific
software feature tend to be associated with a higher success rate of a marketing
campaign?’ or ‘Are employees with a particular certification more inclined to exhibit
strong leadership skills?’ If either of these questions yields an affirmative response, it
implies that there exists a statistical association between the two categories
(alternatively referred to as ‘attributes’ or ‘factors’). To investigate any such
associations, we will need frequency data but, in this course, we will not be concerned
with the strength of any such association, rather we restrict ourselves to testing for its
existence and commenting on the nature of any association which we discover.
If we are asking whether two measurable variables (see Chapter 2) are related to each
other, the technical way of describing the case when they are is to say that the two
variables are correlated. It is possible to measure the strength of a correlation based on
sample data, and also to study to what extent changes in a variable ‘explain’ changes in
another – a topic known as linear regression. These topics will be discussed in Chapter
10.
Note that if one variable is measurable and the other variable is categorical, our
objective remains to detect the presence of an association rather than computing a
correlation. In such scenarios, we would transform the measurable variable into a
categorical one by creating distinct levels or groups. For instance, in a customer study,
one might categorise age by creating age groups. It is worth noting that an ‘age group’
in this context would be considered an ordinal variable, as it allows us to rank the age
groups in a specific order, typically from the youngest to the oldest.

8.6 Tests for association


This statistical test is used to test the null hypothesis that two factors (or attributes)
are not associated, against the alternative hypothesis that they are associated. Each
data unit we sample has one level (or ‘type’ or ‘variety’) of each factor.

Example 8.1 Suppose that we are sampling people, and that one factor of interest
is hair colour (black, blonde, brown etc.) while another factor of interest is eye
colour (blue, brown, green etc.). In this example, each sampled person has one level
of each factor. We wish to test whether or not these factors are associated.
Therefore, we have the following.

H0: There is no association between hair colour and eye colour.¹

H1 : There is an association between hair colour and eye colour.


So, under H0 , the distribution of eye colour is the same regardless of hair colour,
whereas if H1 is true it may be attributable to blonde-haired people having a
(significantly) higher proportion of blue eyes, say.
The association might also depend on the sex of the person, and that would be a
third factor which was associated with (i.e. interacted with) both of the other
factors; however, we will not consider interactions in this course.
The main way of analysing these questions is by using a contingency table, discussed
in the next section.

8.6.1 Contingency tables

In a contingency table, also known as a cross-tabulation, the data are in the form
of frequencies (counts), where the observations are organised in cross-tabulated
categories. We sample a certain number of units (people, perhaps) and classify them
according to the two factors of interest.

Example 8.2 In three areas of a city, a record has been kept of the numbers of
burglaries, robberies and car thefts which take place in a year. The total number of
offences was 150, and they were divided into the various categories as shown in the
following contingency table:

Area    Burglary   Robbery   Car theft   Total
A          30          19          6        55
B          12          23         14        49
C           8          18         20        46
Total      50          60         40       150

The cell frequencies are known as observed frequencies and show how the data
are spread across the different combinations of factor levels. The first step in any
analysis is to complete the row and column totals (as already done in this table).

8.6.2 Expected frequencies

We proceed by computing a corresponding set of expected frequencies, conditional


on the null hypothesis of no association between the factors, i.e. that the factors are
independent.
Now suppose that you are only given the row and column totals for the frequencies. If
the factors were assumed to be independent, consider how you would calculate the
expected frequencies. Recall from Chapter 3 that if A and B are two independent
events, then P (A ∩ B) = P (A) P (B). We now apply this idea.

¹ When conducting tests for association, the null hypothesis can be expressed either as ‘There is no
association between categorical variables X and Y ’, or as ‘Categorical variables X and Y are independent’.
The corresponding alternative hypothesis would then replace ‘no association’ or ‘independent’ with ‘an
association’ or ‘not independent (dependent)’, respectively.


Example 8.3 For the data in Example 8.2, if a record was selected at random from
the 150 records:

P (a crime being a burglary) = 50/150

P (a crime being in area A) = 55/150.

Hence, under H0 , we have:


P(a crime being a burglary in area A) = (50/150) × (55/150)

and so the expected number of burglaries in area A is:

150 × (50/150) × (55/150).

So the expected frequency is obtained by multiplying the product of the ‘marginal’


probabilities by n, the total number of observations. This can be generalised as follows.

Expected frequencies in contingency tables

The expected frequency, Eij , for the cell in row i and column j of a contingency
table with r rows and c columns, is:
Eij = (row i total × column j total) / (total number of observations)
where i = 1, 2, . . . , r and j = 1, 2, . . . , c.

Example 8.4 The completed expected frequency table for the data in Example 8.2
is (rounding to two decimal places, which is recommended):

Area    Burglary   Robbery   Car theft   Total
A         18.33      22.00      14.67       55
B         16.33      19.60      13.07       49
C         15.33      18.40      12.27       46
Total     50         60         40         150

Make sure you can replicate these expected frequencies using your own calculator.
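The expected frequency formula can also be applied to the whole of the Example 8.2 table at once (an illustrative Python sketch using only the standard library):

```python
# Observed frequencies from Example 8.2
# (rows: areas A, B, C; columns: burglary, robbery, car theft)
O = [[30, 19, 6],
     [12, 23, 14],
     [8, 18, 20]]

n = sum(map(sum, O))
row_totals = [sum(row) for row in O]
col_totals = [sum(col) for col in zip(*O)]

# E_ij = (row i total * column j total) / n, rounded to two decimal places
E = [[round(r * c / n, 2) for c in col_totals] for r in row_totals]
for row in E:
    print(row)  # matches the expected frequency table above
```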

8.6.3 Test statistic


To motivate our choice of test statistic, if H0 is true then we would expect to observe
small differences between the observed and expected frequencies, while large differences
would suggest that H1 is true. This is because the expected frequencies have been
calculated conditional on the null hypothesis of independence. Hence, if H0 is actually


true, then what we actually observe (the observed frequencies) should be


(approximately) equal to what we expect to observe (the expected frequencies).

χ2 test of association

Let the contingency table have r rows and c columns, then formally the test statistic
used for tests of association is:
∑_{i=1}^{r} ∑_{j=1}^{c} (Oij − Eij)²/Eij ∼ χ²_{(r−1)(c−1)} .   (8.1)

Hence critical values are obtained from the χ2 distribution with (r − 1)(c − 1) degrees
of freedom in Table 8 of the New Cambridge Statistical Tables, and p-values are
obtained in Excel using:

=CHISQ.DIST.RT(test statistic value, degrees of freedom)

Notice the ‘double summation’ here just means summing over all rows and all columns.
This test statistic follows an (approximate) chi-squared distribution with (r − 1)(c − 1)
degrees of freedom, where r and c denote the number of rows and columns, respectively,
in the contingency table. The approximation is reasonable provided all the expected
frequencies are at least 5.
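As an aside (not part of the subject guide), when the degrees of freedom are an even number the chi-squared upper-tail probability has a simple closed form, so the value returned by =CHISQ.DIST.RT can be checked by hand. The following Python sketch implements it:

```python
import math

def chi2_sf(x, k):
    """Upper-tail probability P(X > x) for a chi-squared distribution
    with an even number of degrees of freedom k (closed form via the
    Erlang survival function); mirrors Excel's CHISQ.DIST.RT for even k."""
    assert k > 0 and k % 2 == 0, "closed form shown here only for even k"
    half = x / 2
    return math.exp(-half) * sum(half**j / math.factorial(j)
                                 for j in range(k // 2))

# 9.488 and 5.991 are the 5% critical values with 4 and 2 degrees of
# freedom (Table 8), so the upper-tail probability at each is 0.05.
print(round(chi2_sf(9.488, 4), 3))  # 0.05
print(round(chi2_sf(5.991, 2), 3))  # 0.05
```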

8.6.4 The chi-squared, χ2 , distribution


A chi-squared distribution is only defined over positive values. The precise shape of
the distribution is dependent on the degrees of freedom, which is a parameter of the
distribution. Figure 8.1 illustrates the chi-squared distribution for a selection of degrees
of freedom.
Although the shape of the distribution does change quite significantly for different
degrees of freedom, note the distribution is always positively skewed.

8.6.5 Degrees of freedom


The general expression for the number of degrees of freedom in tests for association
is:
(number of cells) − (number of times data are used to calculate the Eij s).

For an r × c contingency table, we begin with rc cells. We lose one degree of freedom for
needing to use the total number of observations to compute the expected frequencies.
However, we also use the row and column totals in these calculations, but we only need
r − 1 row totals and c − 1 column totals, as the final one in each case can be deduced
using the total number of observations. Hence we only lose r − 1 degrees of freedom for
the row totals. Similarly, we only lose c − 1 degrees of freedom for the column totals.
Hence the overall degrees of freedom are:

k = rc − (r − 1) − (c − 1) − 1 = (r − 1)(c − 1).


[Figure 8.1 here: two panels of chi-squared density curves, for k = 1, 2, 4 and 6
degrees of freedom (left) and for k = 10, 20, 30 and 40 degrees of freedom (right).]

Figure 8.1: The chi-squared distribution for various degrees of freedom, k.

8.6.6 Performing the test


As usual, we choose a significance level, α, at which to conduct the test. However, are
we performing a one-tailed test or a two-tailed test? To determine this, we need to
consider what sort of test statistic value would be considered extreme under H0 .
As seen in Figure 8.1, the chi-squared distribution only takes positive values. The
squared term in the numerator of the test statistic in (8.1) ensures the test statistic
value will be positive (the Eij s in the denominator are clearly positive too). If H0 is
true, then the observed and expected frequencies should be quite similar, since the
expected frequencies are computed conditional on the null hypothesis of independence
(equivalently, no association).
This means that the |Oij − Eij |s should be quite small for all cells. In contrast, if H0 is
not true, then we would expect comparatively large values for the |Oij − Eij |s due to
large differences between the two sets of frequencies. Therefore, upon squaring the
|Oij − Eij |s, sufficiently large test statistic values suggest we should reject H0 . Hence
tests of association are always upper-tailed tests.

Example 8.5 Using the data in Example 8.2, we proceed with the hypothesis test.
We test:
H0 : No association between area and type of crime
versus:
H1 : Association between area and type of crime.
Note it is advisable to present your calculations as an extended contingency table as
shown below, where the three rows in each cell correspond to the observed
frequencies, the expected frequencies and the test statistic contributors, respectively.

Area   Burglary            Robbery   Car theft   Total

       O1·                    30        19          6      55
A      E1·                 18.33     22.00      14.67      55
       (O1· − E1·)²/E1·     7.42      0.41       5.12
       O2·                    12        23         14      49
B      E2·                 16.33     19.60      13.07      49
       (O2· − E2·)²/E2·     1.15      0.59       0.07
       O3·                     8        18         20      46
C      E3·                 15.33     18.40      12.27      46
       (O3· − E3·)²/E3·     3.51      0.01       4.88
Total                         50        60         40     150

Using (8.1), we obtain a test statistic value of:

∑_{i=1}^{3} ∑_{j=1}^{3} (Oij − Eij)²/Eij = 7.42 + 0.41 + · · · + 4.88 = 23.15.

Since r = c = 3, we have (r − 1)(c − 1) = (3 − 1)(3 − 1) = 4 degrees of freedom.
Using Excel, the p-value is obtained using:

=CHISQ.DIST.RT(23.15,4)

which returns a p-value of 0.000118. Using the p-value decision rule, since:

0.000118 < 0.01 < 0.05

we can reject H0 at the 1% significance level (and, of course, at the 5% significance
level). Therefore, the test is highly significant and we conclude that there is strong
evidence of an association between area and type of crime.
Using critical values, at the 5% significance level, the critical value for this
upper-tailed chi-squared test is 9.488 (using Table 8 with ν = 4). Since 23.15 > 9.488
the test statistic value falls in the rejection region hence we reject H0 . Moving to the
1% significance level (as per Figure 6.1), the new critical value is 13.28 (again using
Table 8), hence we also reject H0 since 23.15 > 13.28 and so the test statistic value
again falls in the rejection region. We have a highly significant result with the same
conclusion as the p-value approach.
Looking again at the contingency table, comparing the observed and expected
frequencies, the interpretation of this association becomes clear – burglary is the
main problem in area A (an observed frequency of 30 versus an expected frequency
of only 18.33) whereas car theft is a problem in area C (an observed frequency of 20
versus an expected frequency of only 12.27).
Note how we can make such inferences by looking at the cells with large test statistic
contributors, which are a consequence of large differences between the observed and
expected frequencies.
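The whole calculation can be automated. The following Python sketch (not part of the subject guide) reproduces the test of association at full precision; note that rounding the cell contributions to two decimal places before summing can shift the total slightly:

```python
# Sketch: test of association for the crime data, computed without
# rounding intermediate values.
observed = [[30, 19, 6],    # area A: burglary, robbery, car theft
            [12, 23, 14],   # area B
            [8, 18, 20]]    # area C

n = sum(map(sum, observed))                        # 150
row_totals = [sum(row) for row in observed]        # [55, 49, 46]
col_totals = [sum(col) for col in zip(*observed)]  # [50, 60, 40]

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n      # expected frequency
        chi_sq += (o - e) ** 2 / e

print(round(chi_sq, 2))  # 23.15, on (3-1)(3-1) = 4 degrees of freedom
# 23.15 > 13.28 (the 1% critical value from Table 8), so H0 is rejected.
```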

The conclusions in Example 8.5 are fairly obvious, given the small dimensions of the
contingency table. However, for data involving more factors, and more factor levels, this
type of analysis can be very insightful.


Cells which make a large contribution to the test statistic value (i.e. which have large
values of (Oij − Eij )2 /Eij ) should be studied carefully when determining the nature of
an association. This is because, in cases where H0 has been rejected, rather than simply
concluding that there is an association between two categorical variables, it is helpful to
describe the nature of the association.

8.7 Goodness-of-fit tests


In addition to tests of association, the chi-squared distribution is often used more
generally in so-called ‘goodness-of-fit’ tests. We may, for example, wish to answer
hypotheses such as:

‘Is it reasonable to assume the data follow a particular distribution?’

This justifies the name of a goodness-of-fit test, since we are testing whether or not a
particular probability distribution provides an adequate fit to the observed data. The
null hypothesis will assert that a specific hypothetical population distribution is the
true one. The alternative hypothesis is that this specific distribution is not the true one.
Goodness-of-fit tests can be performed for a variety of probability distributions.
However, there is a special case which we shall consider here when you are only dealing
with one variable. This is when we wish to test whether the sample data are drawn
from a discrete uniform distribution, i.e. that each characteristic is equally likely.

Example 8.6 Suppose we wish to determine whether a given die is fair.


If the die is fair, then the values of the faces (1, 2, 3, 4, 5 and 6) are all equally
likely, so we have the following hypotheses.

H0 : Score is uniformly distributed vs. H1 : Score is not uniformly distributed.

8.7.1 Observed and expected frequencies


As with tests of association, the goodness-of-fit test involves both observed and
expected frequencies. In all goodness-of-fit tests, the sample data must be expressed in
the form of observed frequencies associated with certain classifications of the data.
Assume there are k classifications. Therefore, the observed frequencies can be denoted
by Oi , for i = 1, 2, . . . , k.

Example 8.7 Extending Example 8.6, for a die the obvious classifications would be
the six faces.
If the die is thrown n times, then our observed frequency data would be the number
of times each face appeared. Here k = 6.

Recall that in hypothesis testing we always assume that the null hypothesis, H0 , is true.
In order to conduct a goodness-of-fit test, the expected frequencies are computed
conditional on the probability distribution expressed in H0 . The test statistic will then
involve a comparison of the observed and expected frequencies. In broad terms, if H0 is
true, then we would expect small differences between these two sets of frequencies, while
large differences would be inconsistent with H0 .
We now consider how to compute the expected frequencies for discrete uniform
probability distributions.

Expected frequencies in goodness-of-fit tests

For discrete uniform probability distributions, the expected frequencies are computed
as:
Ei = n × 1/k   for i = 1, 2, . . . , k
where n denotes the sample size and 1/k is the uniform (i.e. equal, same) probability
for each characteristic.

Expected frequencies should not be rounded, just as we do not round sample means,
say. Note that the final expected frequency (for the kth category) can easily be
computed using the formula:
Ek = n − ∑_{i=1}^{k−1} Ei .

This is because we have a constraint that the sum of the observed and expected
frequencies must be equal,2 i.e. we have:

∑_{i=1}^{k} Oi = ∑_{i=1}^{k} Ei

which results in a loss of one degree of freedom (discussed below).

8.7.2 The goodness-of-fit test


Having computed the expected frequencies, we now need to formally test H0 and so we
need a test statistic.

Goodness-of-fit test statistic

For a discrete uniform distribution with k categories, observed frequencies Oi and


expected frequencies Ei , the test statistic is:
∑_{i=1}^{k} (Oi − Ei)²/Ei ∼ χ²_{k−1}   (8.2)

approximately, under H0 .

2 Recall the motivation for computing expected frequencies in the first place. Assuming H0 is true, we
want to know how a random sample of size n is expected to be distributed across the k categories.


Note that this test statistic does not have a true χ²_{k−1} distribution under H0 , rather it is
only an approximating distribution. An important point to note is that this
approximation is only good enough provided all the expected frequencies are at least 5.
(In cases where one or more expected frequencies are less than 5, we merge categories
with neighbouring ones until the condition is satisfied.)
As seen in (8.2), there are k − 1 degrees of freedom when testing a discrete uniform
distribution. k is the number of categories (after merging), and we lose one degree of
freedom due to the constraint that:
∑_{i=1}^{k} Oi = ∑_{i=1}^{k} Ei .

As with tests of association, goodness-of-fit tests are upper-tailed tests as, under H0 , we
would expect to see small differences between the observed and expected frequencies, as
the expected frequencies are computed conditional on H0 . Hence large test statistic
values are considered extreme under H0 , since these arise due to large differences
between the observed and expected frequencies.

Example 8.8 A confectionery company is trying out different wrappers for a
chocolate bar – its original, A, and two new ones, B and C. It puts the bars on
display in a supermarket and looks to see how many of each wrapper type have been
sold in the first hour, with the following results.

Wrapper type           A    B    C   Total
Observed frequencies   8   10   15    33

Is there a difference between wrapper types in the consumer choices made?


To answer this we need to test:

H0 : There is no difference in preference for the wrapper types

versus:
H1 : There is a difference in preference for the wrapper types.
Note that under H0 , this is the same as testing for the suitability of the discrete
uniform distribution, i.e. that each wrapper type is equally likely.
How do we work out the expected frequencies? Well, for equal preferences, across
three wrapper types (hence k = 3), the expected frequencies will be:
Ei = 33 × 1/3 = 11   for i = 1, 2 and 3.
Applying (8.2), the test statistic value is:
∑_{i=1}^{3} (Oi − Ei)²/Ei = (8 − 11)²/11 + (10 − 11)²/11 + (15 − 11)²/11 = 2.364.

The degrees of freedom will be k − 1 = 3 − 1 = 2. Using Excel, the p-value is
obtained using:
=CHISQ.DIST.RT(2.364,2)


which returns a p-value of 0.3067. Using the p-value decision rule, since:

0.3067 > 0.10

we cannot reject H0 , even at the 10% significance level. Hence the test is not
significant. Therefore, there is insufficient evidence of a difference in preference for
the wrapper types.
Using critical values, at the 5% significance level, the critical value for this
upper-tailed chi-squared test is 5.991 (using Table 8 with ν = 2). Since 2.364 < 5.991
the test statistic value does not fall in the rejection region hence we do not reject H0 .
Moving to the 10% significance level (as per Figure 6.1), the new critical value is
4.605 (again using Table 8), hence we also do not reject H0 since 2.364 < 4.605 and
so the test statistic value again does not fall in the rejection region. We have an
insignificant result with the same conclusion as the p-value approach.
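The arithmetic in Example 8.8 is easily verified in code. The following Python sketch (not part of the subject guide) runs the goodness-of-fit calculation for the wrapper data:

```python
# Sketch: goodness-of-fit test for a discrete uniform distribution
# applied to the wrapper data of Example 8.8.
observed = [8, 10, 15]        # wrapper types A, B, C
n, k = sum(observed), len(observed)
expected = [n / k] * k        # 33 * 1/3 = 11 for each category

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 3))  # 2.364, on k - 1 = 2 degrees of freedom
# 2.364 < 4.605 (the 10% critical value from Table 8), so H0 is not rejected.
```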

8.8 Overview of chapter


In this chapter we have seen how to test whether there is any evidence of a statistical
association between two categorical variables, and how to conduct goodness-of-fit tests
for the discrete uniform distribution. You should regard this chapter as an opportunity
to revise your work on hypothesis testing in Chapter 7.

8.9 Key terms and concepts


Association
Chi-squared distribution
Contingency table
Cross-tabulation
Degrees of freedom
Discrete uniform distribution
Expected frequency
Goodness-of-fit test
Observed frequency
Test statistic

8.10 Sample examination questions


1. State whether the following statements are true or false. Give a brief explanation.
(a) A negative chi-squared value shows that there is little association between the
variables tested.
(b) Similar observed and expected frequencies indicate strong evidence of an
association in an r × c contingency table.

2. An experiment was conducted to examine whether age, in particular being over 30
or not, has any effect on preferences for a digital or an analogue watch. Specifically,
129 randomly-selected people were asked what watch they prefer and their
responses are summarised in the table below:


                           Analogue watch   Undecided   Digital watch

30 years old or younger          10             17             37
Over 30 years old                31             22             12

(a) Based on the data in the table, and without conducting any significance test,
would you say there is an association between age group and watch preference?
Provide a brief justification for your answer.
(b) Is there any evidence of an association between age group and watch
preference?

3. Set out the null and alternative hypotheses, degrees of freedom, expected
frequencies, and 10%, 5% and 1% critical values for the following problem. The
following figures give the number of births by season in a town.

Season Number of births


Spring 100
Summer 200
Autumn 250
Winter 180

The number of days per season in this country are 93 (Spring), 80 (Summer), 100
(Autumn) and 92 (Winter). Is there any evidence that births vary over the year?
Hint: You would expect, if the births are uniformly distributed over the year, that
the number of births would be proportional to the number of days per season. So
work out your expected frequencies by taking the number of days per season
divided by the number of days in the year and multiplying by the total number of
births over the year.

8.11 Solutions to Sample examination questions


1. (a) False. Chi-squared values cannot be negative.
(b) False. The expected frequencies are determined under the null hypothesis of no
association, hence similar observed and expected frequencies provide little to
no evidence of an association.

2. (a) There are some differences between younger and older people regarding watch
preference. More specifically, 16% of younger people prefer an analogue watch
compared to 48% for people over 30. Hence there seems to be an association
between age and watch preference, although this needs to be investigated
further.
(b) Set out the null hypothesis that there is no association between age and watch
preference against the alternative, that there is an association. Be careful to
get these the correct way round!
H0 : No association between age group and watch preference.
H1 : Association between age group and watch preference.


Work out the expected values to obtain the table below:

                           Analogue watch   Undecided   Digital watch

30 years old or younger        20.34           19.35          24.31
Over 30 years old              20.66           19.65          24.69

The test statistic formula is:


∑_{i=1}^{r} ∑_{j=1}^{c} (Oij − Eij)²/Eij

which gives a test statistic value of 24.146 (make sure you can replicate this).
This is a 2 × 3 contingency table so the degrees of freedom are
(2 − 1) × (3 − 1) = 2.
For α = 0.05, the critical value is 5.991, hence reject H0 since 24.146 > 5.991.
For α = 0.01, the critical value is 9.210, hence reject H0 since 24.146 > 9.210.
We conclude that the test is highly significant, with strong evidence of an
association between age group and watch preference.
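The value 24.146 can be replicated programmatically. The following Python sketch (not part of the subject guide) computes the test statistic for the watch-preference data:

```python
# Sketch: test of association for the 2 x 3 watch-preference table.
observed = [[10, 17, 37],   # 30 years old or younger
            [31, 22, 12]]   # over 30 years old

n = sum(map(sum, observed))                        # 129 respondents
row_totals = [sum(row) for row in observed]        # [64, 65]
col_totals = [sum(col) for col in zip(*observed)]  # [41, 39, 49]

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n      # expected frequency
        chi_sq += (o - e) ** 2 / e

print(round(chi_sq, 3))  # 24.146, on (2-1)(3-1) = 2 degrees of freedom
```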

3. We perform a goodness-of-fit test with:


H0 : There is a uniform distribution of births throughout the year by season.
H1 : There is a non-uniform distribution of births throughout the year by
season.
How do we calculate the expected values? You should not expect equal numbers of
births in each season unless they had exactly the same number of days. The clue
lies in the fact that you are given the number of days per season, so logically you
would expect the number of births to be (out of the 730 given):

Season    Expected number of births

Spring    730 × 93/365 = 186
Summer    730 × 80/365 = 160
Autumn    730 × 100/365 = 200
Winter    730 × 92/365 = 184

The test statistic is:


χ² = ∑_{i=1}^{4} (Oi − Ei)²/Ei

giving a test statistic value of:

χ² = (100 − 186)²/186 + (200 − 160)²/160 + (250 − 200)²/200 + (180 − 184)²/184
   = 39.76 + 10.00 + 12.50 + 0.09
   = 62.35.

The degrees of freedom are the number of categories minus 1 so, as there are four
seasons given, we have 4 − 1 = 3 degrees of freedom. Looking at Table 8 of the New
Cambridge Statistical Tables, the critical values at the 10%, 5% and 1% significance
levels are 6.251, 7.815 and 11.34, respectively. Since 62.35 > 11.34 we reject H0 at
the 1% significance level (and hence also at the 10% and 5% levels). The test is
highly significant and we conclude that there is (very!) strong evidence that the
number of births varies over the year by season.
If we look again at the observed and expected values we see:

Season Observed Expected


Spring 100 186
Summer 200 160
Autumn 250 200
Winter 180 184

Only the observed winter births are anywhere near the expected value – both
summer and autumn show higher births than expected, while spring is much lower.
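The whole solution to question 3 can be checked in code. The following Python sketch (not part of the subject guide) computes the expected births proportional to days per season and the resulting test statistic:

```python
# Sketch: goodness-of-fit test with expected births proportional to the
# number of days in each season.
observed = {"Spring": 100, "Summer": 200, "Autumn": 250, "Winter": 180}
days = {"Spring": 93, "Summer": 80, "Autumn": 100, "Winter": 92}
n = sum(observed.values())   # 730 births in total
year = sum(days.values())    # 365 days in the year

expected = {s: n * d / year for s, d in days.items()}
chi_sq = sum((observed[s] - expected[s]) ** 2 / expected[s]
             for s in observed)

print({s: round(e) for s, e in expected.items()})
print(round(chi_sq, 2))  # 62.35, on 4 - 1 = 3 degrees of freedom
```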

Chapter 9
Sampling and experimental design

9.1 Synopsis of chapter


We take a pause on statistical calculations by turning our attention to data collection.
Specifically, we explore different sampling techniques frequently used in social science
and business surveys. We consider the choices when designing a survey, how reliable
results of surveys are, and the relative merits and limitations of different sampling
techniques. We also address the key ideas of causation and experimental design.

9.2 Learning outcomes


After completing this chapter, and having completed the Essential reading and
activities, you should be able to:

• define non-random and random sampling techniques and propose real-world
applications of each

• describe the factors which contribute to errors in surveys

• discuss the various methods of contact which may be used in a survey and the
related implications

• distinguish between an experiment in the natural sciences and the observations
possible in the social sciences

• explain the merits and limitations of a longitudinal survey compared with a
cross-sectional survey.

9.3 Recommended reading


Abdey, J. Business Analytics: Applied Modelling and Prediction. (London: SAGE
Publications, 2023) 1st edition [ISBN 9781529774092] Chapter 7.

9.4 Introduction
We proceed to describe the main stages of a survey and the sources of error at each
stage. This part of the course is the foundation for your future work in applied social
science, including business and management. There is not much point in learning the
various statistical techniques we have introduced you to in the rest of the subject guide
unless you understand the sources and limitations of the data you are using. This is
important to academics and policymakers alike!

9.5 Motivation for sampling


A target population represents the collection of units (people, firms, objects etc.) in
which we are interested. In the absence of time and budget constraints we would
conduct a census, that is a total enumeration of the population. Examples include the
population census in the UK and in other countries around the world. Its advantage is
that there is no sampling error because all population units are observed, and so there
is no estimation of population parameters – we can, in principle, determine their true
values exactly.
Of course, in practice, we do need to take into account time and budget constraints.
Due to the large size, N , of most populations, an obvious disadvantage with a census is
cost. Hence a census is not feasible in practice because there is only a limited amount of
information which is cost-effective to collect. Also, non-sampling errors may occur. For
example, if we have to resort to using cheaper (hence less reliable) interviewers, then
they may erroneously record data, misunderstand a respondent etc. It is usually
preferable to draw a sample, that is a certain number of population members are
selected and studied. The selected members are known as the elementary sampling
units.
Sample surveys (hereafter just ‘surveys’) are how new data are collected on a
population and tend to be based on samples rather than a census. Often, surveys are
conducted under circumstances which cannot be fully controlled. Since the sample size,
n, can be very large, the design and implementation of surveys requires considerable
planning and teamwork. (Recall sample size determination was covered in Chapter 5.)
Example 9.1 Examples of sample surveys include:

demography: studies of births, deaths, families etc.

government: studies of crime, education, employment, housing, health etc.

market research: studies of consumer preferences, attitudes etc.

political science: studies of voting intention.

Selected respondents may be contacted using a variety of methods such as face-to-face
interviews, telephone, postal or, increasingly, online questionnaires. Sampling errors will
occur (since not all population units are observed, so we resort to estimating population
parameter values, which means there will be some uncertainty attached to our point
estimates), but because of the smaller numbers involved (since typically n ≪ N )
resources can be used to ensure high-quality interviews or to check completed
questionnaires. Therefore, non-sampling errors should be smaller and consequently
researchers can ask more questions.


A census is difficult to administer, time-consuming, expensive and does not guarantee
completely accurate results due to non-sampling errors. In contrast, a sample is
relatively easier, faster and cheaper, although it introduces sampling errors while
non-sampling errors can still occur. Unsurprisingly, we would like sampling errors and
non-sampling errors to be minimised!
Before proceeding, it is worth spending a moment thinking about the classification of
data.

Primary data and secondary data

Primary data are new data collected by researchers for a particular purpose.
Secondary data refer to existing data which have already been collected by others
or for another purpose.

Here we focus on the collection of primary data using various sampling techniques.

9.6 Types of sampling techniques


We seek a sample which is representative of the target population to ensure our
estimated parameter values are good approximations of the true parameter values. A
particular difficulty is how to select a sample which yields a ‘fair’ representation.
Fortunately, there are many different sampling techniques (reviewed below) from which
to choose. As expected, these all have relative advantages and disadvantages, hence
each will be suitable in particular situations. Knowledge of these techniques is
frequently examined in this course, so it is strongly advised that you become very
familiar with this material.

Classification of sampling techniques

We divide sampling techniques into two groups:


non-random sampling (also known as non-probability sampling)

random sampling (also known as probability sampling).

9.6.1 Non-random sampling


Non-probability samples are characterised by the following properties.

Some population units have no chance (i.e. a zero probability) of being selected.
Units which can be selected have an unknown (non-zero) probability of selection.
Sampling errors cannot be quantified.
They are used in the absence of a sampling frame, which is a list of all the
individual units in a population, serving as the basis for selecting a sample.


We proceed to consider three types of non-random sampling.

Convenience sampling

In convenience sampling researchers select participants based on their accessibility
and convenience (hence the name!). Such a sample is unlikely to represent the entire
population and is susceptible to potential selection bias which limits the generalisability
of results to the population, but is often chosen for practical reasons, such as when
there are limited time and resources.

Example 9.2 Consider a convenience sample of people in a shopping mall on a
school day. Such an approach is unlikely to yield a representative sample because
children and working professionals are unlikely to be in the shopping mall at the
time.

Judgemental sampling

In judgemental sampling researchers use their expertise or judgement (hence the
name!) to select participants or elements for inclusion in a study. Instead of randomly
selecting samples, the researcher deliberately chooses specific individuals, cases or data
points based on their knowledge of the population and the research objectives. This
approach is often employed when researchers believe that certain individuals or
elements possess unique characteristics or insights that are crucial to the study’s
objectives. While judgmental sampling allows for a targeted and purposeful selection of
samples, it may introduce selection bias if the researcher’s judgement is subjective or
influenced by personal opinions. Despite its limitations, judgemental sampling can be
valuable in situations where expertise is essential for identifying relevant participants or
when resources are limited.

Example 9.3 In market research, a company may use judgemental sampling to
select key industry experts for in-depth interviews on a new product. The
researchers handpick individuals believed to provide valuable insights, leveraging
their expertise to inform strategic decisions.

Quota sampling

In quota sampling researchers divide the population into subgroups based on certain
characteristics, such as age, gender, or socio-economic status, known as quota
controls. The researcher then sets quotas (hence the name!) for each subgroup and
selects participants non-randomly until each quota is filled. This method aims to ensure
representation of key characteristics within the sample and requires the distribution of
these characteristics in the population to be (approximately) known in order to
replicate it in the sample. While quota sampling provides a structured approach, it may
introduce selection bias if the quotas are not well-defined or if the selection within each
quota is not random. Quota sampling is useful in the absence of a sampling frame, since
non-probability sampling techniques do not require a sampling frame. Another reason
for conducting a quota sample, instead of a random sample, might be speed. We may be
in a hurry and not want to spend time organising interviewers for a random sample – it
is much quicker to set target numbers (quotas) to interview.

Example 9.4 In a market survey for a new tech product, a company employs
quota sampling to ensure diverse participant representation. The population is
categorised by age groups and income levels. Quotas are set for each category, and
researchers purposefully select participants to meet these quotas. This method helps
capture insights from various demographic segments, guiding the company’s
marketing strategy for a more comprehensive market understanding.
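The quota-setting step described in Example 9.4 is simple arithmetic. The following Python sketch (not part of the subject guide; the age groups and proportions are invented for illustration) turns assumed known population proportions into quota targets:

```python
# Hypothetical sketch: quota targets chosen so the sample replicates
# the (assumed known) population distribution of a quota control.
population_props = {"18-29": 0.25, "30-49": 0.40, "50+": 0.35}  # invented
n = 200  # planned sample size

quotas = {group: round(n * p) for group, p in population_props.items()}
print(quotas)  # {'18-29': 50, '30-49': 80, '50+': 70}
```

Interviewers would then fill each quota non-randomly, which is exactly why selection bias can creep in.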

Quota sampling is cheap, but it may be systematically biased by the choice of
interviewee made by the interviewer, and their willingness to reply. For instance,
interviewers might avoid choosing anyone who looks threatening, or too busy, or too
strange! Quota sampling also does not allow us to measure the sampling error – a
consequence of it being a non-probability sampling technique.
Each of the three non-random sampling techniques (convenience, judgemental and
quota) has an appeal; however, in all cases we have no real guarantee that we have
achieved an adequately representative sample – we might do by chance, of course, but
this would be highly unlikely! For example, were the people we interviewed only those
working in local offices? Were the young adults all students? Etc.
Since we do not know the probability that an individual will be selected for the survey,
the basic rules of statistical inference which we have been learning (i.e. to construct
confidence intervals and perform hypothesis testing) do not apply. Specifically, standard
errors (a key ingredient in such inferential procedures) are not measurable. However, in
the absence of a sampling frame then we have to resort to non-probability sampling
techniques.

Example 9.5 You would likely use a quota sample in the following situations.

When speed is important. Clearly, reaching a target of a certain number (quota)
of people on a given day is likely to be quicker than if specific persons or
households had to be contacted (as determined by a random sample). Typical
quota controls for the interviewer to meet are:
• age group
• gender
• socio-economic group, or social class.
Note the more quota controls the interviewer is given, the longer it will take to
complete the required number of interviews and hence it will take longer to
complete the study. (Imagine the time you would take locating the last male for
your sample aged 35–44, married with teenage children and a full-time job!)

When you need to reduce cost. Clearly, time-saving is an important element in
cost-saving.

No available sampling frame covering the target population. If you think
obtaining a sampling frame is likely to be very complicated, then a sensible
targeting of the population by taking a quota sample might be helpful. You
might, for example, wish to contact drivers of buses over a set of bus routes.
There are a lot of bus companies involved, and some of them will not let you
have their list of employees for data protection reasons, say. In such
circumstances carrying out a quota sample at different times of the day would
be feasible.

When accuracy is not important. You may not need to have an answer to your
research question(s) to a high, and known, level of accuracy (which is only
possible using a random sample); rather you may merely require a rough idea
about a subject. Perhaps you only need to know if people, on the whole, like
your new flavour of ice cream in order to judge whether or not there is likely to
be sufficient demand to justify full-scale production and distribution. In this
case, asking a reasonably representative group of people (a quota) would be
perfectly adequate for your needs.
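Although this course requires no programming, the mechanics of filling a quota are easy to sketch. In the hypothetical Python fragment below (the people, groups and quota sizes are all invented), respondents are taken as they arrive, with no random selection, until each quota control is met.

```python
# Hypothetical stream of people encountered by an interviewer, tagged by
# one quota control (age group). People are taken as they come.
arrivals = [
    ("p1", "18-34"), ("p2", "35-54"), ("p3", "18-34"), ("p4", "55+"),
    ("p5", "18-34"), ("p6", "35-54"), ("p7", "55+"), ("p8", "18-34"),
]

quota = {"18-34": 2, "35-54": 1, "55+": 1}   # interviews required per group
interviewed = {group: [] for group in quota}  # interviews completed so far

for person, group in arrivals:
    if len(interviewed[group]) < quota[group]:
        interviewed[group].append(person)
    if all(len(done) == quota[g] for g, done in interviewed.items()):
        break  # every quota is filled: stop interviewing

print(interviewed)  # → {'18-34': ['p1', 'p3'], '35-54': ['p2'], '55+': ['p4']}
```

Notice that later arrivals in an already-full group (p5, p8) are simply turned away. This is precisely why each individual's probability of selection is unknown.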

Although there may be several reasons to justify the use of a quota sample, you should
be aware of the problem caused by the omission of non-respondents. Because you only
count the individuals who reply (unlike in random sampling, where your estimate has to
allow for bias through non-response), the omission of non-respondents can lead to
seriously misleading results. Listing non-response as it occurs in
quota samples is therefore regarded as good practice.

9.6.2 Random sampling

Random sampling means that every population unit has a known (not necessarily
equal), non-zero probability of being selected in the sample. In all cases selection is
performed through some form of randomisation. For example, a pseudo-random number
generator can be used to generate a sequence of synthetic ‘random-like’ numbers.
Relative to non-random sampling methods, random sampling can be expensive and
time-consuming, and also requires a sampling frame. We aim to minimise both the
(random) sampling error and the systematic sampling bias. Since the probability of
selection is known, standard errors can be computed which allows confidence intervals
to be determined and hypothesis tests to be performed.
We consider five types of probability sampling.

Simple random sampling

Simple random sampling is a random sampling technique in which each member of a
population has a known, equal and non-zero chance of being selected.
The process involves assigning a unique identifier to each element, and then using a
random method, such as a pseudo-random number generator to select samples. This
ensures that every possible sample of a given size has an equal chance of being chosen,
removing selection bias and allowing for generalisability to the entire population.


Example 9.6 In a customer satisfaction survey, a company employs simple random
sampling to obtain feedback. Each customer in the database is assigned a unique
identification number. Using a pseudo-random number generator, the company
selects a sample of customers to participate in the survey. This ensures that every
customer has an equal chance of being included, providing a representative snapshot
of overall customer sentiment for informed business decisions.
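The selection step in Example 9.6 can be sketched in a few lines of Python; the customer IDs, the seed and the sample size below are all hypothetical.

```python
import random

customer_ids = list(range(1, 501))  # hypothetical database of 500 customers

random.seed(42)  # fixed seed only so the sketch is reproducible

# Simple random sample of 20 customers, drawn without replacement:
# every subset of 20 IDs is equally likely to be the chosen sample.
sample = random.sample(customer_ids, k=20)

print(sorted(sample))
```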

Systematic sampling

Systematic sampling is a random sampling technique where researchers select every
ith item from a list or sequence after randomly choosing a starting point, where i is
called the sampling interval such that i = N/n (with N the population size and n the
desired sample size). This systematic (hence the name!)
approach ensures an equal probability of selection for each element, maintaining
simplicity and efficiency. However, if there is an underlying pattern in the sequence, it
may introduce bias. Systematic sampling is often used when a complete list of the
population is available and easy to organise.

Example 9.7 In a product quality control process, a manufacturing company
employs systematic sampling to ensure consistent standards. The production line
produces 100 units per hour, and every 10th unit is systematically selected for
thorough inspection. This systematic approach provides a representative sample
across the production cycle, allowing the company to identify and address any
quality issues efficiently, maintaining overall product quality.
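The 1-in-10 selection in Example 9.7 can be sketched as follows (the unit labels are hypothetical); the interval is i = N/n and only the starting point is random.

```python
import random

units = list(range(100))  # one hour's production, units labelled 0..99

N, n = len(units), 10     # population size N and desired sample size n
i = N // n                # sampling interval i = N/n = 10

random.seed(1)                # seeded only for reproducibility
start = random.randrange(i)   # random starting point within the first interval

sample = units[start::i]      # every i-th unit thereafter
print(sample)
```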

Stratified random sampling

Stratified sampling is a random sampling method where the population is divided
into distinct (i.e. mutually exclusive and collectively exhaustive) subgroups known as
strata based on certain characteristics known as stratification factors. Simple random
samples are then independently drawn from each stratum. This ensures representation
from each subgroup, allowing researchers to analyse and compare specific groups within
the population. Stratified sampling is particularly useful when there are notable
variations across the population characteristics that may impact the study results.
Elements within a stratum tend to be homogeneous while the strata tend to be
heterogeneous.

Example 9.8 In a market research study, a tech company employs stratified
sampling to ensure a comprehensive understanding of customer preferences. The
population is stratified based on age groups (18–24, 25–34, 35–44, etc.). Simple
random samples are then taken from each age group, ensuring representation from
all segments. This approach allows the company to analyse the preferences and
behaviours of different age demographics, informing targeted marketing strategies
for their diverse customer base and tailoring product features to specific age-related
preferences.
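Example 9.8 amounts to drawing an independent simple random sample within each stratum. A minimal Python sketch follows (the customer labels and stratum sizes are invented, and the 10% allocation is just one possible proportional choice):

```python
import random

random.seed(7)  # seeded only so the sketch is reproducible

# Customers grouped by the stratification factor (age group).
strata = {
    "18-24": [f"c{j}" for j in range(40)],        # 40 customers
    "25-34": [f"c{j}" for j in range(40, 100)],   # 60 customers
    "35-44": [f"c{j}" for j in range(100, 150)],  # 50 customers
}

# Independent simple random sample from each stratum: here 10% of each
# stratum, i.e. proportional allocation.
sample = {
    name: random.sample(members, k=len(members) // 10)
    for name, members in strata.items()
}

print({name: len(drawn) for name, drawn in sample.items()})
# → {'18-24': 4, '25-34': 6, '35-44': 5}
```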


Cluster sampling

Cluster sampling is a random sampling method where the population is divided into
clusters (hence the name!), and entire clusters are randomly selected for inclusion in the
study. Unlike other sampling techniques, cluster sampling involves selecting groups
(clusters) rather than individual elements. Once the clusters are chosen:

in a one-stage cluster sample all members within the selected clusters are included
in the sample
in a two-stage cluster sample a simple random sample is drawn from each selected
cluster (suitable when cluster sizes are large).

This method is often more practical and cost-effective when studying large and
geographically dispersed populations, as it reduces the need for extensive travel or data
collection across the entire population.
Ideally, each cluster is as variable as the overall population, i.e. heterogeneity within
clusters is permitted (and likely), while there should be homogeneity between clusters.

Example 9.9 In a market analysis, a multinational company uses cluster sampling
to study consumer preferences across different regions. The population is divided into
geographical clusters (such as cities or regions) and a random selection of clusters is
chosen. Researchers then survey all consumers within the selected clusters. This
method allows the company to obtain a representative sample without the need to
survey every individual consumer, making the study more efficient and cost-effective
while still capturing diverse perspectives from various geographic locations.
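A one-stage cluster sample like Example 9.9 randomises over clusters, not individuals. In the hypothetical sketch below (city names and consumer labels invented), two whole cities are drawn and everyone in them is surveyed:

```python
import random

random.seed(3)  # seeded only so the sketch is reproducible

# Population divided into geographical clusters.
clusters = {
    "CityA": ["a1", "a2", "a3"],
    "CityB": ["b1", "b2"],
    "CityC": ["c1", "c2", "c3", "c4"],
    "CityD": ["d1", "d2", "d3"],
}

chosen = random.sample(list(clusters), k=2)  # randomly pick whole clusters

# One-stage: every consumer in each selected cluster enters the sample.
sample = [person for city in chosen for person in clusters[city]]

print(chosen, sample)
```

A two-stage version would instead draw `random.sample(clusters[city], k=...)` within each chosen city.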

Note that, from a technical perspective, stratified sampling can be thought of as an
extreme form of two-stage cluster sampling where at the first stage all clusters in the
population are selected. In addition, one-stage cluster sampling is at the opposite end of
this spectrum. We can summarise as follows.

Similarities and differences between stratified and cluster sampling

Stratified sampling: all strata chosen, some units randomly selected from each
stratum (stratum = singular form of strata).

One-stage cluster sampling: some clusters chosen, all units selected in each
sampled cluster.

Two-stage cluster sampling: some clusters chosen, some units selected in each
sampled cluster.

Multistage sampling

Multistage sampling is a complex random sampling technique which involves multiple
stages of sampling (hence the name!). It typically combines elements of different
sampling methods such as cluster, stratified, and simple random sampling. In the first
stage, clusters are randomly selected, then in subsequent stages, additional levels of
sampling occur, which may involve further random selection of subgroups or individual
elements. This method is often employed when it is impractical or too costly to survey
an entire population directly, providing a compromise between accuracy and efficiency.
During the first stage, large compound units are sampled, known as primary units.
During the second stage, smaller units, known as secondary units, are sampled from the
primary units. From here, additional sampling stages of this type may be performed, as
required, until we finally sample the basic units.

Example 9.10 In a nationwide employee satisfaction survey, a large corporation
uses multistage sampling. Initially, they randomly select specific branches (clusters)
across the country. In the second stage, within each chosen branch, they conduct
stratified sampling by selecting different departments. Finally, in the third stage,
they employ simple random sampling to select individual employees for the survey.
This multistage approach allows the company to gather comprehensive feedback from
diverse branches and departments while maintaining efficiency and cost-effectiveness.
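The three stages in Example 9.10 can be sketched as nested draws. All names and sizes below are invented, and for brevity the sketch takes one employee per department (stage 2 treats departments as strata, so every department in a chosen branch is sampled).

```python
import random

random.seed(11)  # seeded only so the sketch is reproducible

# Branches (primary units) contain departments (secondary units),
# which contain individual employees (basic units).
branches = {
    "North": {"Sales": ["n1", "n2", "n3"], "IT": ["n4", "n5"]},
    "South": {"Sales": ["s1", "s2"], "IT": ["s3", "s4", "s5"]},
    "East":  {"Sales": ["e1", "e2", "e3"], "IT": ["e4", "e5"]},
}

stage1 = random.sample(list(branches), k=2)  # stage 1: randomly pick 2 branches

sample = []
for branch in stage1:
    # stage 2: departments act as strata within the chosen branch;
    # stage 3: simple random sample of 1 employee from each department.
    for members in branches[branch].values():
        sample.extend(random.sample(members, k=1))

print(stage1, sample)
```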

9.7 Sources of error


We distinguish between two types of error in sampling design, not to be confused with
Type I and Type II errors in hypothesis testing!

Sampling error

Sampling error refers to the difference between the characteristics of a sample and the
entire population from which the sample was drawn. It is an inherent part of the
sampling process and occurs because a sample is only a subset of the larger population.
The goal of sampling is to minimise this error, but it cannot be entirely eliminated.
Sampling error can lead to discrepancies between sample statistics and population
parameters, affecting the generalisability of study findings to the broader population.

Example 9.11 In a market research study, a company aims to estimate the
average income of its customer base. Due to budget constraints, the company
surveys a random sample of customers instead of the entire population. The
calculated average income from the sample may differ slightly from the true average
income of all customers due to sampling error. The company acknowledges this
discrepancy and interprets the results with an understanding of the inherent
variability introduced by the sampling process, reporting a confidence interval as
part of the estimation process.

Non-sampling error

Non-sampling error refers to errors in research which are not related to the act of
sampling itself but can still affect the accuracy of study results.


Selection bias occurs when the sample chosen is not representative of the entire
population, leading to a systematic difference between the characteristics of the sample
and the population. This bias arises when certain factors influence the selection process,
causing a non-random and non-representative sample. Selection bias can impact the
generalisability of study findings and may result in inaccurate conclusions if the selected
sample does not adequately reflect the characteristics of the broader population.

Example 9.12 In a job satisfaction survey conducted online, individuals without
internet access are excluded, creating a selection bias as the sample is not
representative of the entire workforce.

Response bias occurs when there is a systematic pattern of inaccurate responses from
participants, leading to distorted or unreliable data.

Example 9.13 In a customer feedback survey, respondents might provide more
positive feedback if they know the company is conducting the survey, creating a
response bias as their answers may not accurately reflect their true opinions.

Response error refers to any deviation between the true value of the variable being
measured and the response obtained from a participant. This can result from factors
such as sampling error, measurement error, or processing errors during data collection.
The main sources of response error include:

role of the interviewer due to the characteristics and/or opinions of the interviewer,
asking leading questions and the incorrect recording of responses
role of the respondent who may lack knowledge, forget information or be reluctant
to give the correct answer due to the sensitivity of the subject matter.

Example 9.14 In a customer satisfaction survey conducted by a retail company, a
response error may occur if the survey questions are unclear or ambiguous. For
instance, if the survey asks customers to rate the ‘timeliness of service’, some
respondents might interpret this as the speed of checkout, while others may think it
refers to the delivery of online orders. If the survey lacks specificity or if there is
confusion about the meaning of terms, it could lead to response errors, with
respondents providing varied and potentially inaccurate ratings.

Addressing and minimising non-sampling errors is crucial for improving the reliability
and validity of research findings. Such kinds of error can be controlled or allowed for
more effectively by a pilot survey, also known as a pilot study or pretest. This is a
small-scale research initiative conducted before the main study to test and refine the
research design, survey instruments, and procedures. It involves collecting data from a
small sample that is representative of the target population to identify potential issues,
assess the feasibility of the study, and make necessary adjustments. The pilot survey is
a crucial step in the research process, serving several purposes.

Pilot surveys help evaluate the effectiveness of questionnaires, interviews, or other
data collection tools. This ensures that the instruments are clear, relevant, and
yield reliable responses.

By conducting a pilot survey, researchers can identify and address any logistical,
methodological, or practical challenges that may arise during the full-scale study.

Pilot surveys provide insights into the appropriateness of the chosen sampling
strategy. Researchers can assess whether the selected sampling method is feasible
and whether the sample is representative.

Researchers can test data analysis procedures on pilot data to ensure they are
suitable for the main study. This includes refining coding schemes, statistical tests,
and other analytical approaches.

Pilot surveys help gauge the feasibility of the entire research process, including
data collection, analysis, and interpretation. This allows researchers to make
adjustments to enhance the overall feasibility of the study.

9.8 Non-response bias


Non-response bias occurs in survey research when the individuals who choose not to
participate in a study differ systematically from those who do participate. This type of
bias can lead to inaccurate or unrepresentative findings, as the characteristics of the
non-respondents may significantly differ from the characteristics of the overall target
population.

Classification of non-response

We can classify non-response as follows.

Item non-response occurs when a sampled member fails to respond to a
question in the questionnaire.

Unit non-response occurs when no information is collected from a sample
member at all.

Several factors contribute to non-response bias, such as the following.

Non-respondents may have different demographic characteristics (such as age,
gender, income, education) compared to respondents, leading to an
unrepresentative sample.

Individuals with specific attitudes or opinions may be more or less likely to respond
to a survey, creating a bias in the collected data.

People with busy schedules or time constraints may be less likely to participate,
potentially biasing the sample towards individuals with more time available.

If the survey addresses a sensitive or controversial topic, individuals with strong
opinions may be more inclined to participate, while those who are more neutral or
reserved might opt out, causing bias.


To mitigate non-response bias, researchers can employ strategies such as using follow-up
reminders, offering incentives, and analysing available demographic information of
non-respondents to understand potential biases. However, complete elimination of
non-response bias is challenging, and researchers should be transparent about the
limitations and potential biases in their findings.

9.9 Method of contact


A further point you should think about when assessing how to carry out a survey is the
method of contact. The most common methods of contact are face-to-face interviews,
telephone interviews and online/postal/mail (so-called ‘self-completion’) interviews. In
most countries you can assume the following.

An interviewer-administered face-to-face questionnaire will be the most expensive
to carry out, but can yield high-quality results.

Telephone surveys depend very much on whether your target population is on the
telephone (and how good the telephone system is).

Self-completion questionnaires can have a low response rate.

We now explore some of the advantages and disadvantages of various contact methods.

Face-to-face interviews

Advantages:

Allow for in-depth and detailed responses, enabling interviewers to gather rich
qualitative data that might be challenging to obtain through other survey methods.

Interviewers can clarify questions in real-time, ensuring respondents fully
understand the survey items. This helps reduce misunderstandings and enhances
the accuracy of responses.

Allow for the observation of non-verbal cues, such as body language and facial
expressions, providing additional insights into respondents’ feelings and attitudes.

Can result in higher response rates compared to other methods, as the personal
interaction can build trust and rapport with respondents.

Well-suited for complex surveys that require a skilled interviewer to guide
respondents through intricate questions or scenarios.

Disadvantages:

Can be expensive and time-consuming, involving travel, training, and the need for
skilled interviewers. This may limit the feasibility for large-scale surveys.

The presence of an interviewer may introduce bias, as respondents might modify
their answers based on social desirability or their perception of the interviewer’s
expectations.

May be impractical for surveys covering a large geographic area or a dispersed
population, as it is challenging to reach all respondents in person.

Some respondents may feel uncomfortable sharing personal or sensitive information
face-to-face, potentially leading to social desirability bias or underreporting of
certain behaviours.

The format may limit the anonymity of responses, potentially affecting the
willingness of respondents to share candid feedback, particularly in situations
where privacy is a concern.

Telephone interviews

Advantages:

These are generally more cost-effective than face-to-face interviews, as they
eliminate travel expenses for interviewers and allow for a broader geographic reach.

They enable rapid data collection, making them suitable for surveys that require
timely responses or when efficiency is a priority.

They offer a structured and standardised approach to data collection, ensuring that
each respondent receives the same set of questions in a consistent way.

Respondents may feel a greater sense of anonymity during a telephone interview,
potentially leading to more candid responses, especially on sensitive topics.

They allow for random digit dialling, providing a more random and representative
sample compared to convenience sampling in some other survey methods.

Disadvantages:

Such surveys often face challenges related to low response rates, as many
individuals may screen calls or refuse to participate, leading to potential
non-response bias.

Complex or lengthy survey questions may be less suitable for telephone interviews,
as respondents might find it difficult to engage in detailed discussions over the
phone.

Telephone interviews lack the visual cues available in face-to-face interactions,
making it challenging for interviewers to interpret non-verbal communication or
gestures.

Interviewers have no control over the respondent’s environment during a telephone
interview, which may introduce distractions or result in less focused responses.

Some populations, such as those without access to a phone or with specific phone
preferences (for example, mobile-only users), may be underrepresented in telephone
surveys, impacting the sample’s representativeness.

Self-completion interviews

Advantages:

Such surveys, particularly those conducted online, are often cost-effective as they
eliminate the need for interviewers and associated expenses.

Respondents can complete self-administered surveys at their own convenience,
which may result in higher participation rates, especially among individuals with
busy schedules.

Self-completion surveys provide respondents with a greater sense of anonymity,
encouraging more honest and candid responses, particularly on sensitive topics.

Online surveys, in particular, can reach a large and diverse audience, making them
suitable for studies involving widespread or international populations.

The absence of an interviewer minimises the potential for interviewer bias, ensuring
that respondents interpret and answer questions independently.

Disadvantages:

Respondents may misinterpret or have questions about certain survey items, and
self-completion surveys offer limited opportunities for clarification compared to
interviewer-administered methods.

Online surveys may exclude individuals with limited internet access, contributing
to a digital divide and potentially introducing bias against certain demographic
groups.

Self-completion surveys, especially those conducted through mail, may experience
low response rates, affecting the representativeness of the sample and potentially
introducing non-response bias.

Researchers have limited control over the environment in which respondents
complete the survey, potentially leading to distractions or incomplete responses.

Without an interviewer present, there is less opportunity to motivate respondents
or ensure they engage thoughtfully with the survey, potentially affecting the quality
of responses.

We see that the choice of contact method involves a series of trade-offs such that often
there is no single ‘right’ answer to deciding which is the best approach. Ultimately, a
value judgement is often required.


9.10 Experimental design


Before we look at the ideas of correlation and linear regression more formally in
Chapter 10, it is important to take stock of the limitations of assuming causation
when working in social science research.
To establish genuine causation we ideally need to conduct an experiment.
Experimental design involves planning and conducting experiments to test hypotheses
and draw valid conclusions about the relationships between variables. We now consider
the basic principles of experimental design.
First, we clearly identify and define the independent variable (the variable being
manipulated) and the dependent variable (the variable being measured as an outcome).
We then randomly assign participants or subjects to different experimental conditions to
control for potential confounding variables and ensure that the groups are comparable
at the start of the experiment.
We include a control group which does not receive the experimental treatment. The
control group helps establish a baseline against which the effects of the experimental
treatment can be compared by way of the treatment group.
Next, there is a systematic manipulation of the independent variable to observe its
effect on the dependent variable. This manipulation is what distinguishes experimental
designs from observational studies.

Example 9.15 Consider determining the effectiveness of an advertising campaign
for your new chocolate bars with improved wrappers. Your company measures the
sales of the old chocolate bars, product A, in two areas, X and Y, before introducing
the new chocolate bars, product B, and running a four-month advertising campaign.
Imagine your marketing manager’s joy when sales of product B are much higher
than for product A in area X. Clearly, the changes have worked! However, oh dear,
product B is achieving a lower level of sales than product A in area Y. How can this
be? Well, on closer investigation, we find that our main rival has withdrawn their
product from area X and concentrated their sales focus on area Y (where they have
realised that a larger number of their target population lives).
So, your success with product B in area X is not necessarily related to your
advertising campaign, rather it may be entirely due to your rival’s actions. Clearly,
whatever measures you use, there is a problem with the measurement of your
results. ‘Other things’ are no longer equal, since the behaviour of the rival changed
while you were conducting your experiment.

As seen in Example 9.15, it is important to control for confounding factors, that is,
factors which are correlated with the observed variables (such as the rival’s actions).
Failure to properly control for such factors may lead us to treat a false positive as a
genuine causal relationship! Let’s consider a further example.

Example 9.16 Suppose a company is testing the effectiveness of a new training
programme aimed at improving employee productivity. The company implements
the training programme and observes an increase in productivity among the
participants. However, there might be confounding factors that could affect the
results.
The company notices an increase in productivity among employees who participated
in the training programme. However, it turns out that these employees were also
given a pay rise around the same time as the training. The increase in productivity
could be attributed to the salary increase rather than the training programme. In
this case, the pay rise is a confounding factor because it is associated with both the
independent variable (the training programme) and the dependent variable (the
productivity). It creates a potential alternative explanation for the observed increase
in productivity, making it challenging to attribute the changes solely to the training
programme.
Without considering the confounding factor of the pay rise, the company may
incorrectly conclude that the training programme was the primary driver of
increased productivity. The true effect of the training programme is confounded by
the simultaneous influence of the pay rise, leading to a potential misinterpretation of
the results.
To address this confounding factor, the company should carefully analyse the data,
control for variables like salary increases, or use statistical techniques like regression
analysis to separate the effects of the training programme from other potential
influences. By doing so, the company can obtain a more accurate understanding of
the true impact of the training programme on employee productivity.

9.11 Observational studies and designed experiments


We are often limited in the social sciences because we are unable to perform
experiments either for ethical or practical reasons. Imagine, for example, that you need
to assess the likely effect on tooth decay of adding fluoride to the water supply in a
town. There is no possibility of being allowed to experiment, as you are afraid that
fluoride might have harmful side effects. However, you know that fluoride occurs
naturally in some communities. What can you do, as a statistician?

9.11.1 Observational study

In an observational study data are collected on units (not necessarily people) without
any intervention. Researchers do their best not to influence the observations in any way.
A sample survey is a good example of such a study, where data are collected in the form
of questionnaire responses. As discussed previously, every effort is made to ensure
response bias is minimised (if not completely eliminated).

Example 9.17 To assess the likely effect on tooth decay of adding fluoride to the
water supply, you can look at the data for your non-fluoridated water population
and compare it with one of the communities with naturally-occurring fluoride in
their water and measure tooth decay in both populations, but be careful! A lot of
other things may be different. Are the following the same for both communities?


Number of dentists per person.

Number of sweet shops per person.

Eating habits.

Age distribution.

Think of other relevant attributes which may differ between the two communities. If
you can match in this way (i.e. find two communities which share the same
characteristics and only differ in the fluoride concentration of their water supply),
your results may have some significance. However, finding such matching
communities may be easier said than done in practice!

So, to credibly establish a causal link in an observational study, all other relevant
factors need to be adequately controlled for, such that any change between observation
periods can be explained by only one variable.

9.12 Longitudinal surveys


Policymakers use longitudinal surveys over a long period of time to look at the
development of childhood diseases, educational development and unemployment. There
are many long-term studies in these areas. Some longitudinal medical studies of rare
diseases have been carried out at an international level over long periods.
One such, very well-known, study which is readily available is the UK National Child
Development Survey. This began with a sample of about 5,000 children born in 1958. It
is still going on! It was initially set up to look at the connections between childhood
health and development and nutrition by social groups. The figures produced in the
first few years were so useful that it was extended to study educational development
and work experience.
You should note the advantages and disadvantages of using longitudinal surveys.
Advantages:

Longitudinal surveys allow researchers to observe changes over time, providing
insights into trends, patterns, and developments that cannot be captured in
cross-sectional studies.

Such surveys enable the study of individual-level changes, helping researchers
understand how individuals evolve over time in response to various factors.

They are well-suited for studying dynamic processes, such as learning trajectories,
career progression, or the development of health conditions, as researchers can
capture changes at multiple points.

Longitudinal data are suitable for event history analysis, allowing researchers to
study the occurrence and timing of specific events or transitions in the lives of
participants.

Longitudinal studies help control for cohort effects, where individuals from different
birth cohorts may exhibit different characteristics or behaviours due to shared
experiences.

Disadvantages:

Longitudinal surveys are often resource-intensive in terms of time, cost and effort.
Tracking participants over an extended period requires sustained funding and
logistical support.

Over time, participants may drop out or become unavailable, leading to attrition.
Loss of participants can compromise the representativeness of the sample and affect
the generalisability of findings.

The extended duration of longitudinal surveys may introduce time-dependent bias,
where changes in measurement methods, societal norms, or participant
characteristics impact the consistency of data collected over time.

Longitudinal studies may encounter ethical concerns related to maintaining
participant confidentiality and privacy, especially as personal circumstances change
over the study period.

Changes in external factors (intervening variables) may occur between
measurement points, making it challenging to attribute observed changes solely to
the variables of interest.

Despite the disadvantages, such studies are widely regarded as being the best way of
studying change over time.

9.13 Overview of chapter


This chapter has described the main sampling techniques (non-random and random) of
a survey and the possible sources of error, along with mitigation strategies, where
appropriate. Different methods of contact have also been discussed, with their relative
advantages and disadvantages. A discussion of experimental design followed, as well as
the challenges of establishing causation in the social sciences, for which observational
studies are often the norm.

9.14 Key terms and concepts


Causation
Census
Cluster sampling
Control group
Convenience sampling
Elementary sampling units
Experiment
Item non-response
Judgemental sampling
Longitudinal survey
Method of contact
Multistage sampling
Non-random sampling
Non-response bias
Non-sampling error
Observational study
Pilot survey
Primary data
Quota controls
Quota sampling
Random sampling
Response bias
Response error
Sample survey
Sampling error
Sampling frame
Sampling interval
Secondary data
Selection bias
Simple random sampling
Stratified sampling
Systematic sampling
Target population
Treatment group
Unit non-response

9.15 Sample examination questions

1. (a) State one advantage and one disadvantage of quota sampling.


(b) State one advantage and one disadvantage of telephone interviews.

2. You work for a market research company and your manager has asked you to carry
out a random sample survey for a laptop company to identify whether a new laptop
model is attractive to consumers. The main concern is to produce results of high
accuracy. You are being asked to prepare a brief summary containing the items
below.
(a) Choose an appropriate probability sampling scheme. Provide a brief
justification for your answer.
(b) Describe the sampling frame and the method of contact you will use. Briefly
explain the reasons for your choices.
(c) Provide an example in which non-response bias may occur. State an action
which you would take to address this issue.
(d) State the main research question of the survey. Identify the variables
associated with this question.

3. You have been asked to design a stratified random sample survey from the
employees of a certain large company to examine whether job satisfaction of
employees varies between different job types.
(a) Discuss how you will choose your sampling frame. Also discuss any
limitation(s) of your choice.
(b) Propose two relevant stratification factors. Justify your choice.
(c) Provide an action to reduce response bias and explain why you think this
would be successful.
(d) Briefly discuss the statistical methodology you would use to analyse the
collected data.


9.16 Solutions to Sample examination questions


1. Note that for these types of questions there is no single ‘right’ answer. Below are
some suggested ‘good’ answers for each part.
(a) Possible advantages: useful in the absence of a sampling frame, speed, cost.
Possible disadvantages: systematically biased due to interviewer; no guarantee
of representativeness.
(b) Possible advantages: easy to achieve a large number of interviews; easy to
check on quality of interviewers.
Possible disadvantages: not everyone has a telephone so the sample can be
biased; cannot usually show samples; although telephone directories exist for
landline numbers, what about mobile telephone numbers? Also, young people
are more likely to use mobile telephones rather than landline telephones, so are
more likely to be excluded.

2. One of the main things to avoid in this part is to write ‘essays’ without any
structure. This question asks for specific things and each one of them requires only
one or two lines of response. If you are unsure of what these things are, do not
write lengthy answers. This is a waste of your valuable examination time. If you
can identify what is being asked, keep in mind that the answer should not be long.
Note also that in most cases there is no single ‘right’ answer to the question. Some
suggested answers are given below.
(a) We are asked for accuracy and random (probability) sampling, so a reasonable
option is the use of stratified random sampling which is known to produce
results of high accuracy. An example of a sampling scheme could be a
stratified sample of those customers who bought this laptop recently.
(b) The question requires:
◦ a description of a sampling frame
◦ a justification of its choice
◦ mentioning a (sensible) contact method.
Use a list provided by retailers to identify people who bought this laptop
model recently. The list could include the postal address, telephone or email
address of purchasers. Stratification could be made by gender of buyer. Finally,
an explanation should be provided as to which contact method you would
prefer – for example, email is fast but there may be a lot of non-response.
(c) The question requires an example of non-response bias and an action
suggested to address this issue.
For example, selected respondents may simply ignore the survey. Offering
(possibly financial) incentives could help mitigate non-response.
(d) A suggested answer for the question is ‘How satisfied are you with your new
laptop model?’. In terms of variables one could mention ‘satisfaction’ and
possible demographic attributes of respondents such as age or gender.


3. (a) An indicative answer here would be to use an email list. A limitation with this
choice is that this list may not contain all current employees (new starters may
not have their email account activated, and recent leavers may not yet have
their email account deactivated).
(b) Examples of stratification factors are income level, gender, age group etc., as
we suspect job satisfaction may differ across these attributes. In order for
stratified sampling to be effective, within strata the members should be
homogeneous.
(c) Employees may be reluctant to express negative opinions (in case this
negatively impacts their career). By ensuring anonymity of responses, honest
answers are more likely to be expressed.
(d) Examples here are appropriate graphs (boxplots, density histograms etc.),
confidence intervals and hypothesis tests of job satisfaction across different job
types.

Chapter 10
Correlation and linear regression

10.1 Synopsis of chapter


This final chapter of the subject guide takes us back to the ideas of the formula for a
straight line, which you worked on back in Chapter 1. Here, we are going to use basic
mathematics to understand the relationship between two measurable variables. If you
are taking, or likely to take, ST104B Statistics 2, and later EC2020 Elements of
econometrics, you should ensure that you cover all the examples here very carefully as
they prepare you for work on those courses. If you do not intend to take your statistics
any further, you need, at the very least as a social scientist, to be able to assess whether
relationships shown by data are what they appear to be! Be careful not to confuse the
techniques used here on measurable variables with those you worked with in Chapter 8
on categorical variables!

10.2 Learning outcomes


After completing this chapter, and having completed the Essential reading and
activities, you should be able to:

draw and label a scatter diagram

calculate r

explain the meaning of a particular value and the general limitations of r


calculate β0 and β1 for the line of best fit on a scatter diagram

explain the relationship between β1 and r

summarise the problems caused by extrapolation.

10.3 Recommended reading

Abdey, J. Business Analytics: Applied Modelling and Prediction. (London: SAGE


Publications, 2023) 1st edition [ISBN 9781529774092] Chapter 13.


10.4 Introduction
In Chapter 8, you were introduced to the idea of testing for evidence of an association
between different attributes of two categorical variables using the chi-squared
distribution. We did this by looking at the number of individuals falling into a category,
or experiencing a particular contingency.
Correlation and linear regression enable us to see the connection between the actual
dimensions of two or more measurable variables. The work we will do in this chapter
will only involve looking at two variables at a time, but you should be aware that
statisticians use these theories and similar formulae to look at the relationship between
many variables, so-called ‘multivariate’ analysis.
When we use these terms we are concerned with using models for prediction and
decision making. So, how do we model the relationship between two variables? We are
going to look at:

correlation – which measures the strength of a linear relationship

regression – which is a way of representing that linear relationship.

It is important you understand what these two terms have in common, but also the
differences between them.

10.5 Scatter diagrams


Suppose we have some data in paired form: (xi , yi ), for i = 1, 2, . . . , n.

Example 10.1 An example of paired data is the following which represents the
number of people unemployed and the corresponding monthly reported crime figures
for twelve areas of a city.

Unemployed, x            2,614   1,160   1,055   1,199   2,157   2,305
Number of offences, y    6,200   4,610   5,336   5,411   5,808   6,004

Unemployed, x            1,687   1,287   1,869   2,283   1,162   1,201
Number of offences, y    5,420   5,588   5,719   6,336   5,103   5,268

When dealing with paired data, the first action is to construct a scatter diagram (also
called a scatterplot) of the data, and visually inspect it for any apparent relationship
between the two measurable variables.
Figure 10.1 shows such a scatter diagram for these data, which gives an impression of a
positive, linear relationship, i.e. it can be seen that x (the number unemployed) and y
(the number of offences) increase together, roughly in a straight line, but subject to a
certain amount of scatter. So, the relationship between x and y is not exactly linear –
the points do not lie exactly on a straight line.


Figure 10.1: Scatter diagram of ‘Number of offences’ against ‘Unemployment’.

Data showing a general ‘upward shape’ like this are said to be positively correlated, and
we shall see how to quantify this correlation. Other possible scatter patterns are shown
in Figure 10.2.
The left-hand plot shows data which have a negative correlation, i.e. y decreases as x
increases, and vice versa. The right-hand plot shows uncorrelated data, i.e. no clear
relationship between x and y.
Note that correlation assesses the strength of the linear relationship between two
measurable variables. Hence uncorrelated data, in general, just means an absence of
linearity. It is perfectly possible that uncorrelated data are related, just not linearly –
for example, x and y may exhibit a quadratic relationship.
Example 10.2 Below is a list of variables, along with their expected correlation.

Variables Expected correlation


Height and weight Positive

Rainfall and sunshine hours Negative

Ice cream sales and sun cream sales Positive

Hours of study and examination mark Positive

Car’s petrol consumption and goals scored Zero


Figure 10.2: Scatter diagrams – negatively correlated variables (left) and uncorrelated
variables (right).

10.6 Causal and non-causal relationships


When two variables are correlated, an interesting question which arises is whether the
correlation indicates a causal relationship. In Example 10.2, it is natural to assume that
hours of study and examination marks are positively correlated, due to more study
resulting in better examination performance, i.e. a higher mark. In this case, the
relationship is plausibly causal.
For the ice cream and sun cream sales, however, the relationship is not causal. It is not
the selling of ice cream which causes sun cream sales to rise, nor vice versa. Rather both
sales respond to warm weather and so these sales are seasonal, with both rising and
falling together in response to other variables such as temperature and sunshine hours.
It should be clear from this that care needs to be taken in interpreting correlated
relationships. Remember correlation does not imply causality!

Example 10.3 Let us consider a few more examples. In each case we observe
strong correlations.

(a) ‘Ice cream sales’ and ‘drowning incidents’.


(b) ‘Divorce rate’ and ‘margarine consumption’.
(c) ‘Size of student population’ and ‘number of juvenile offenders by local area in a
country’.
Would you seriously think there was a causal connection in these cases? Let us look
at them in a little more detail.

(a) The correlation is likely due to the fact that both variables are influenced by a
third variable – temperature. Warmer weather leads to increased ice cream sales
and more people swimming, thereby increasing the risk of drowning.


(b) This correlation is likely coincidental. Changes in divorce rates and margarine
consumption are influenced by various social and economic factors, but there is
no direct causal relationship between the two.

(c) The more young people there are, the more juvenile offenders, scholarship
winners, and students there are likely to be. Connecting these two figures is
pretty meaningless.

10.7 Correlation coefficient


The strength of a linear relationship between two random variables is given by the
correlation coefficient. For random variables X and Y , the population correlation
coefficient, denoted ρ, is defined as:

E((X − E(X))(Y − E(Y ))) E(XY ) − E(X) E(Y )


ρ= p = p
Var(X)Var(Y ) Var(X) Var(Y )

and sometimes this is referred to as the ‘product-moment correlation coefficient’.


Technically, ρ can only be determined if we have perfect knowledge of the ‘bivariate
density function’ of X and Y (this concept is beyond the scope of this course, but is
discussed in ST104B Statistics 2). In practice, we will wish to estimate ρ, using the
sample correlation coefficient, from a set of random sample observations of X and Y ,
i.e. using sample paired data (xi , yi ), for i = 1, 2, . . . , n.

Sample correlation coefficient

The sample correlation coefficient, denoted r, is:

    r = Sxy / √(Sxx Syy) = (Σ_{i=1}^{n} xi yi − n x̄ ȳ) / √[(Σ_{i=1}^{n} xi² − n x̄²)(Σ_{i=1}^{n} yi² − n ȳ²)]     (10.1)

such that −1 ≤ r ≤ 1, and where the:

corrected sum of squares of the x data is: Sxx = Σ_{i=1}^{n} xi² − n x̄²

corrected sum of squares of the y data is: Syy = Σ_{i=1}^{n} yi² − n ȳ²

corrected sum of cross-products is: Sxy = Σ_{i=1}^{n} xi yi − n x̄ ȳ.

It is quite common in examination questions to be given certain summary statistics (for
example, Σ xi, Σ xi², Σ yi, Σ yi² and Σ xi yi) to save you time from computing such
quantities directly using raw data. Hence it may be easier for you to remember the


expressions for Sxx , Syy and Sxy (the ‘corrected sum of squares’ for x and y, and
corresponding cross-products, respectively), and how they combine to calculate r.
The sample correlation coefficient measures how closely the points in a scatter diagram
lie around a straight line, and the sign of r tells us the direction of this line, i.e.
upward-sloping or downward-sloping, for positive and negative r, respectively. It does
not tell us the gradient of the line – this is what we will determine in linear regression.

Example 10.4 For the dataset in Example 10.1, we have n = 12, x̄ = 1,665 and
ȳ = 5,567, and so:

Sxx = 3,431,759, Syy = 2,584,497 and Sxy = 2,563,066.

Using (10.1), we have:


    r = 2,563,066 / √(3,431,759 × 2,584,497) = 0.8606.

This is a strong, positive correlation. We also note that the value of r agrees with
the scatter diagram shown in Figure 10.1, i.e. positive.
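The arithmetic of Example 10.4 is easy to verify numerically. Below is an illustrative Python sketch (Python is not part of this course; note also that the guide's figures use rounded means, so the exact value may differ in later decimal places):

```python
# Data from Example 10.1: unemployment and monthly reported crime
# figures for twelve areas of a city.
x = [2614, 1160, 1055, 1199, 2157, 2305, 1687, 1287, 1869, 2283, 1162, 1201]
y = [6200, 4610, 5336, 5411, 5808, 6004, 5420, 5588, 5719, 6336, 5103, 5268]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Corrected sums of squares and cross-products: Sxx, Syy and Sxy.
Sxx = sum(xi**2 for xi in x) - n * x_bar**2
Syy = sum(yi**2 for yi in y) - n * y_bar**2
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar

# Sample correlation coefficient, equation (10.1).
r = Sxy / (Sxx * Syy) ** 0.5
print(round(r, 4))  # close to the guide's 0.8606
```

The same structure works for any paired dataset; only the two lists change.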

Properties of the sample correlation coefficient

The sample correlation coefficient, r, has the following properties. It:

is independent of the scale of measurement

is independent of the origin of measurement

is symmetric; that is, the correlation of x and y is the same as the correlation of
y and x

can only take values between ±1, i.e. −1 ≤ r ≤ 1, or |r| ≤ 1, i.e. sample
correlation coefficients always have an absolute value less than or equal to 1.
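The first three properties can be checked numerically: r is unchanged by any transformation x → a + bx with b > 0, and is symmetric in x and y. A small illustrative sketch in Python (the data below are invented purely for the demonstration):

```python
def sample_r(x, y):
    # Sample correlation coefficient: r = Sxy / sqrt(Sxx * Syy).
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    Sxy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
    Sxx = sum(a * a for a in x) - n * x_bar**2
    Syy = sum(b * b for b in y) - n * y_bar**2
    return Sxy / (Sxx * Syy) ** 0.5

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # invented data for illustration
y = [2.1, 3.9, 6.2, 8.1, 9.8]

r1 = sample_r(x, y)
r2 = sample_r([100 + 2.5 * xi for xi in x], y)  # new origin and scale for x
r3 = sample_r(y, x)                             # swap the roles of x and y

# All three agree (up to floating-point rounding), and |r| <= 1.
print(round(r1, 4), abs(r1 - r2) < 1e-9, abs(r1 - r3) < 1e-9)
```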

Having defined the correlation coefficient, it is important to remember the following
when interpreting r.

r ≈ 1 indicates a strong, positive linear relationship between x and y.

r ≈ 0.6 indicates a moderate, positive linear relationship between x and y.

r ≈ 0.2 indicates a weak, positive linear relationship between x and y.

r ≈ 0 indicates that x and y are not linearly related, i.e. the variables are
uncorrelated.

r ≈ −0.2 indicates a weak, negative linear relationship between x and y.

r ≈ −0.6 indicates a moderate, negative linear relationship between x and y.

r ≈ −1 indicates a strong, negative linear relationship between x and y.


Example 10.5 To reiterate, correlation assesses the strength of linear relationships
between variables only. r ≈ 0 does not imply that x and y are independent, since
the variables could have a non-linear relationship. For example, if:

y = x(1 − x) for 0 ≤ x ≤ 1

then the correlation is zero (as they are not linearly related), but, clearly, there is a
well-defined relationship between the two variables, so they are certainly not
independent. Figure 10.3 demonstrates this point for simulated sample data, where
we see a clear relationship between x and y, but it is clearly not a linear
relationship.1 Data of this kind would have a sample correlation near zero (here,
r = 0.15).

Figure 10.3: Scatter diagram showing a non-linear relationship.
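Example 10.5's point can be checked directly. Sampling the exact parabola y = x(1 − x) symmetrically over [0, 1] (rather than the noisy data of Figure 10.3) gives a sample correlation of essentially zero, despite y being completely determined by x. An illustrative Python sketch:

```python
n = 101
x = [i / (n - 1) for i in range(n)]   # evenly spaced over [0, 1]
y = [xi * (1 - xi) for xi in x]       # y = x(1 - x): deterministic, non-linear

x_bar = sum(x) / n
y_bar = sum(y) / n
Sxy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
Sxx = sum(a * a for a in x) - n * x_bar**2
Syy = sum(b * b for b in y) - n * y_bar**2
r = Sxy / (Sxx * Syy) ** 0.5

# r is (numerically) zero even though y is a perfect function of x.
print(abs(r) < 1e-8)
```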



10.7.1 Spearman rank correlation

We saw in Chapter 2 that we can use the median and interquartile range as measures of
location and dispersion, respectively, instead of the mean and standard deviation (or
variance), when dealing with datasets that may have skewed distributions or outliers.
Similarly, we may calculate the Spearman rank correlation, rs , instead of r.
To compute rs we rank the xi and yi values in ascending order. Of course, it may be
that we only have the ranks, in which case we would have to use this method.

1. In fact, these data are scattered around a parabola with (approximate) equation y = 2(x − 15)(85 − x).


Spearman rank correlation

If there are no tied rankings of xi and yi , the Spearman rank correlation is:
    rs = 1 − (6 Σ_{i=1}^{n} di²) / (n(n² − 1))     (10.2)

where the di values are the differences in the ranks between each xi and yi.

As with other order statistics, such as the median and quartiles, it is helpful to use the
Spearman rank correlation if you are worried about the effect of extreme observations
(outliers) in your sample. The limits for rs are the same as for r, that is −1 ≤ rs ≤ 1, i.e.
|rs| ≤ 1.

Example 10.6 An aptitude test has been designed to examine a prospective
salesperson's ability to sell.
Ten current staff sit the test. Instead of putting achieved scores in the computer, a
research assistant ranks the individuals in ascending order in terms of the test as
well as productivity. The ranked data are:

Staff member A B C D E F G H I J
Rank order in test 2 3 5 1 4 9 10 6 7 8
Rank order in productivity 1 2 3 4 5 6 7 8 9 10

Does it look as if the test is a good predictor of sales ability?


We first compute the differences in the ranks:

Staff member A B C D E F G H I J
di 1 1 2 −3 −1 3 3 −2 −2 −2
d2i 1 1 4 9 1 9 9 4 4 4

and so:

    Σ_{i=1}^{10} di² = 46.

Hence, using (10.2), we have:


    rs = 1 − (6 × 46) / (10 × (10² − 1)) = 0.7212

which is quite strong, indicating that the test is a reasonably good predictor of sales
ability.
Note we are implying a causal connection here, i.e. that greater performance in the
aptitude test results in greater sales ability.
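The calculation in Example 10.6 can be reproduced from equation (10.2) in a few lines. An illustrative Python sketch:

```python
# Ranks from Example 10.6 for staff members A to J.
test_rank = [2, 3, 5, 1, 4, 9, 10, 6, 7, 8]
prod_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

n = len(test_rank)
d = [t - p for t, p in zip(test_rank, prod_rank)]  # differences in ranks
sum_d2 = sum(di**2 for di in d)                    # sum of squared differences

# Spearman rank correlation, equation (10.2) (valid when there are no ties).
r_s = 1 - 6 * sum_d2 / (n * (n**2 - 1))
print(sum_d2, round(r_s, 4))  # 46 and 0.7212, matching the example
```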


10.8 Linear regression


We can only hope to review the fundamentals of what is a very large (and important)
topic in statistical analysis (some courses, such as EC2020 Elements of
econometrics, focus exclusively on regression analysis).
Here, we concentrate on the simple linear regression model (also known as the
bivariate regression model). This will allow calculations to be performed on a calculator,
unlike multiple linear regression (with multiple independent variables) which typically
requires the help of statistical computer packages due to the complexity and sheer
number of calculations involved.
In the simple linear regression model, we have two variables y and x.

y is the dependent variable (or response variable), i.e. that which we are trying
to explain.
x is the independent variable (or explanatory variable), i.e. the variable we think
influences y.

Multiple linear regression is just a natural extension of this set-up, but with more than
one independent variable (covered in EC2020 Elements of econometrics).
There can be a number of reasons for wanting to establish a mathematical relationship
between a dependent variable and an independent variable, for example:

to find and interpret unknown parameters in a known relationship


to understand the reason for such a relationship – is it causal?
to predict or forecast y for specific values of the independent variable.

10.8.1 The simple linear regression model


We assume that there is a true (population) linear relationship between a dependent
variable, y, and an independent variable, x, of the approximate form:

y = β0 + β1 x
where β0 and β1 are fixed, but unknown, population parameters. Our objective is to
estimate β0 and β1 using (paired) sample data (xi , yi ), for i = 1, 2, . . . , n.
Note the use of the word approximate. Particularly in the social sciences, we would not
expect a perfect linear relationship between the two variables. Therefore, we modify this
basic model to:
y = β0 + β1 x + ε
where ε is some random disturbance from the initial ‘approximate’ line. In other words,
each y observation almost lies on the line, but ‘jumps’ off the line according to the
random variable ε. This disturbance is often referred to as the error term.
For each pair of observations (xi , yi ), for i = 1, 2, . . . , n, we can write this as:

yi = β0 + β1 xi + εi.


The random error terms ε1 , ε2 , . . . , εn corresponding to the n data points are assumed
to be independent and identically normally distributed, with zero mean and constant
(but unknown) variance, σ 2 . That is:

εi ∼ N (0, σ 2 ) for i = 1, 2, . . . , n.

This completes the model specification.

Specification of the simple linear regression model

To summarise, we list the assumptions of the simple linear regression model.

A linear relationship between the variables of the form y = β0 + β1 x + ε.

The existence of three model parameters: the linear equation parameters β0 and
β1 , and the error term variance, σ 2 .

Var(εi ) = σ 2 for all i = 1, 2, . . . , n, i.e. the error term variance is constant and
does not depend on the independent variable.

The εi s are independent and N (0, σ 2 ) distributed for all i = 1, 2, . . . , n.

You may feel that some of these assumptions are particularly strong and restrictive. For
example, why should the error term variance be constant across all observations?
Indeed, your scepticism serves you well! In a more comprehensive discussion of linear
regression, such as in EC2020 Elements of econometrics, model assumptions would
be properly tested to assess their validity. Given the limited scope of linear regression in
this course, sadly we are too time-constrained to consider such tests in detail. However,
do be aware that with any form of modelling, a thorough critique of model assumptions
is essential. Analysis based on false assumptions leads to invalid results, a bad thing!
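One way to build intuition for these assumptions is to simulate data which satisfy them exactly and confirm that least squares (introduced in the next section) approximately recovers the true parameters. The sketch below is illustrative only; the 'true' values β0 = 2, β1 = 3 and σ = 1 are arbitrary choices for the demonstration:

```python
import random

random.seed(42)

beta0, beta1, sigma = 2.0, 3.0, 1.0  # arbitrary 'true' parameter values
n = 1000
x = [i / n for i in range(n)]
# y_i = beta0 + beta1 * x_i + eps_i, with eps_i ~ N(0, sigma^2), independent.
y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]

x_bar = sum(x) / n
y_bar = sum(y) / n
Sxy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
Sxx = sum(a * a for a in x) - n * x_bar**2

b1 = Sxy / Sxx           # least squares slope estimate
b0 = y_bar - b1 * x_bar  # least squares intercept estimate
print(round(b0, 2), round(b1, 2))  # should be close to (2, 3)
```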

10.8.2 Parameter estimation

As mentioned above, our principal objective is to estimate β0 and β1, that is the


y-intercept and slope of the true line. To fit a line to some data, we need a criterion for
establishing which straight line is in some sense ‘best’. The criterion used is to minimise
the sum of the squared distances between the observed values of yi and the values
predicted by the model. This ‘least squares’ estimation technique is outlined in ST104B
Statistics 2.
The estimated least squares regression line is written as:

    ŷ = β̂0 + β̂1 x

where β̂0 and β̂1 denote our estimates of β0 and β1, respectively. The derivation of the
formulae for β̂0 and β̂1 is not required for this course, although you do need to know
how to calculate point estimates of β0 and β1.


Simple linear regression line estimates

We estimate β0 and β1 with β̂0 and β̂1, where:

    β̂1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)² = (Σ_{i=1}^{n} xi yi − n x̄ ȳ) / (Σ_{i=1}^{n} xi² − n x̄²) = Sxy / Sxx     (10.3)

and:

    β̂0 = ȳ − β̂1 x̄.     (10.4)

Note that in practice you need to compute the value β̂1 first, since this is needed to
calculate β̂0.

10.8.3 Prediction
Having estimated the regression line, an important application of it is for prediction.
That is, for a given value of the independent variable, we can use it in the estimated
regression line to obtain a prediction of y.

Predicting y for a given value of x

For a given value of the independent variable, x0, the predicted value of the dependent
variable, ŷ, is:

    ŷ = β̂0 + β̂1 x0.

Remember to attach the appropriate units to the prediction (i.e. the units of
measurement of the original y data). Also, ensure the value you use for x0 is correct –
for example, if the original x data is in 000s, then a prediction of y when the
independent variable is 10,000, say, would mean x0 = 10, and not 10,000!

Example 10.7 A study was made by a retailer to determine the relationship
between weekly advertising expenditure and sales (both in thousands of pounds).
We calculate the equation of a regression line to predict weekly sales from
advertising, and then use this to predict weekly sales when advertising costs are
£35,000. The data are:
Advertising costs (in £000s) 40 20 25 20 30 50
Sales (in £000s) 385 400 395 365 475 440

Advertising costs (in £000s) 40 20 50 40 25 50


Sales (in £000s) 490 420 560 525 480 510
Summary statistics, representing sales as yi and advertising costs as xi , for
i = 1, 2, . . . , 12, are:
    Σ xi = 410,  Σ xi² = 15,650,  Σ yi = 5,445,  Σ yi² = 2,512,925,  Σ xi yi = 191,325.


The parameter estimates are:

    β̂1 = (Σ_{i=1}^{n} xi yi − n x̄ ȳ) / (Σ_{i=1}^{n} xi² − n x̄²) = (191,325 − 12 × (410/12) × (5,445/12)) / (15,650 − 12 × (410/12)²) = 3.221

and:

    β̂0 = ȳ − β̂1 x̄ = 5,445/12 − 3.221 × 410/12 = 343.7.

Hence the estimated regression line is:

    ŷ = 343.7 + 3.221x.

The predicted sales for £35,000 worth of advertising is:

    ŷ = 343.7 + 3.221 × 35 = 456.4

which is £456,400. Note that since the advertising costs were given in £000s, we used
x0 = 35, and then converted the predicted sales into pounds.
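The whole of Example 10.7 can be reproduced from the summary statistics alone, using equations (10.3) and (10.4). An illustrative Python sketch:

```python
# Summary statistics from Example 10.7 (advertising costs x and sales y, £000s).
n = 12
sum_x, sum_x2 = 410, 15_650
sum_y = 5_445
sum_y2 = 2_512_925   # given, but not needed for the line estimates
sum_xy = 191_325

x_bar = sum_x / n
y_bar = sum_y / n

# Equation (10.3): slope estimate from corrected sums.
b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar**2)
# Equation (10.4): intercept estimate (needs b1 first).
b0 = y_bar - b1 * x_bar

# Predicted sales for £35,000 of advertising: x is in £000s, so x0 = 35.
y_hat = b0 + b1 * 35
print(round(b1, 3), round(b0, 1), round(y_hat, 1))  # 3.221, 343.7, 456.4
```

This is exactly the kind of summary-statistics question the examination favours, so it is worth being fluent in these two formulas.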

10.8.4 Points to watch about linear regression

Non-linear relationships

We have only seen here how to use a straight line to model the best fit. So, we could be
missing some quite important non-linear relationships, particularly if we were working
in the natural sciences (recall Figure 10.3).

Which is the dependent variable?

Note it is essential to correctly establish which is the dependent variable, y. In Example
10.7, we would have a different line if we had taken advertising costs as y and sales as x!
So remember to exercise your common sense – we would expect sales to react to
advertising campaigns rather than vice versa.

Extrapolation

In Example 10.7, we used our line of best fit to predict the value of y for a given value
of x, i.e. advertising expenditure of £35,000. Such predictions are only reliable if we are
dealing with x0 values which lie within the range of available x data, known as
interpolation. If we use the estimated regression line for prediction using x0 values
which lie outside the range of the available x data, then this is known as
extrapolation, for which any predictions should be viewed with caution.
For Example 10.7, it may not be immediately obvious that the relationship between
advertising expenditure and sales could change, but a moment’s thought should
convince you that, were you to quadruple advertising expenditure, you would be


unlikely to get a nearly 4 × 3.221 ≈ 13-fold increase in sales! Basic economics would
suggest some diminishing marginal returns to advertising expenditure!
Sometimes it is very easy to see that the relationship must change. For instance,
consider Example 10.8, which shows an anthropologist’s data on years of education of a
mother and the number of children she has, based on a Pacific island.

Example 10.8 Figures from an anthropologist show a negative linear relationship
between the number of years of education, x, of a mother and the number of live
births she has, y. The estimated regression line is:

    ŷ = 8 − 0.6x

based on data of women with between 5 and 8 years of education who had 0 to 8 live
births. This looks sensible. We predict ŷ = 8 − 0.6 × 5 = 5 live births for those with
5 years of education, and ŷ = 8 − 0.6 × 10 = 2 live births for those with 10 years of
education.

This is all very convincing, but say a woman on the island went to university and
completed a doctorate, and so had 15 years of education. She clearly cannot have
ŷ = 8 − 0.6 × 15 = −1 children! Also, if someone missed school entirely, is she really
likely to have ŷ = 8 − 0.6 × 0 = 8 children? We have no way of knowing. The
relationship shown by the existing data will probably not hold beyond the x data
range of 5 to 8 years of education. So, exercise caution when extrapolating!
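The danger is mechanical: the fitted line will happily produce a 'prediction' for any x0, sensible or not. A short illustrative sketch of Example 10.8 in Python:

```python
def predict(x0):
    # Estimated regression line from Example 10.8: y-hat = 8 - 0.6 * x.
    return 8 - 0.6 * x0

# Within (or near) the observed x range of 5 to 8 years: plausible.
print(predict(5))   # 5.0 live births

# Extrapolation: the formula is blind to the data range.
print(predict(15))  # -1.0 live births: impossible!
print(predict(0))   # 8.0 live births: unverifiable from these data
```

The model never warns you; checking x0 against the range of the original data is the analyst's job.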

An important reminder about examination questions on correlation and regression

As already mentioned, examiners frequently give you the following summary statistics:

    Σ_{i=1}^{n} xi,   Σ_{i=1}^{n} xi²,   Σ_{i=1}^{n} yi,   Σ_{i=1}^{n} yi²   and   Σ_{i=1}^{n} xi yi

in order to save you time. If you do not know how to take advantage of these, you will
waste valuable time which you really need for the rest of the question. Note that if you
use your calculator, show no working and get the answer wrong, you are unlikely to get
any credit.

10.9 Overview of chapter


This chapter has introduced the concept of correlation between two measurable
variables, noting that correlation does not imply causality! However, if we could
plausibly imply that a variable y would depend on another variable, x, then we could
construct a simple linear regression model. We have set out the specification of such a
regression model, including its assumptions. Estimates of the regression line parameters
and prediction of y for a given value of x were presented, along with a discussion of
interpolation versus extrapolation.


10.10 Key terms and concepts


Dependent variable
Error term
Extrapolation
Independent variable
Interpolation
Prediction
Sample correlation coefficient
Scatter diagram (plot)
Simple linear regression
Spearman rank correlation

10.11 Sample examination questions


1. Seven students in a class received the following examination and project marks in a
subject:

Examination mark 50 80 70 40 30 75 95
Project mark 75 60 55 40 50 80 65

You want to know if students who had relatively high project marks in the subject
also excel in examinations.
(a) Calculate the Spearman rank correlation.
(b) Based on your answer to part (a), do you think students who have the highest
project marks are also likely to score well in examinations? Briefly justify your
view.

2. An area manager in a department store wants to study the relationship between


the number of workers on duty, x, and the value of merchandise lost to shoplifters,
y, in $. To do so, the manager assigned a different number of workers for each of 10
weeks. The results were as follows:

Week #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
x 9 11 12 13 15 18 16 14 12 10
y 420 350 360 300 225 200 230 280 315 410

The summary statistics for these data are:

Sum of x data: 130 Sum of the squares of x data: 1,760


Sum of y data: 3,090 Sum of the squares of y data: 1,007,750
Sum of the products of x and y data: 38,305

(a) Draw a scatter diagram of these data. Carefully label the diagram.
(b) Calculate the sample correlation coefficient. Interpret its value.
(c) Calculate and report the least squares line of y on x. Draw the line on the
scatter diagram.
(d) Based on the regression model above, what will be the predicted loss from
shoplifting when there are 17 workers on duty? Would you trust this value?
Justify your answer.


10.12 Solutions to Sample examination questions


1. (a) We have:
Examination mark 50 80 70 40 30 75 95
Project mark 75 60 55 40 50 80 65
Rank of examination mark 5 2 4 6 7 3 1
Rank of project mark 2 4 5 7 6 1 3
Difference in ranks, dᵢ 3 −2 −1 −1 1 2 −2
dᵢ² 9 4 1 1 1 4 4
The Spearman rank correlation is:

    rₛ = 1 − (6 ∑_{i=1}^{n} dᵢ²) / (n(n² − 1)) = 1 − (6 × 24)/(7 × 48) = 0.5714.

(b) There is some correlation, but it is not strong. It looks as if there is a positive
connection between project marks and examination marks.
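The rank calculation above is easy to check by machine. A minimal Python sketch, written for this small, tie-free dataset (with tied marks you would need average ranks instead):

```python
# Spearman rank correlation for the examination and project marks above.
# This assumes no tied marks (true here); ties would require average ranks.
exam = [50, 80, 70, 40, 30, 75, 95]
proj = [75, 60, 55, 40, 50, 80, 65]

def ranks(values):
    # Rank 1 for the largest value, rank n for the smallest.
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

n = len(exam)
d = [a - b for a, b in zip(ranks(exam), ranks(proj))]
sum_d2 = sum(di ** 2 for di in d)            # 24
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(round(r_s, 4))                         # 0.5714
```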
2. (a) A scatter diagram is:
[Scatter diagram titled ‘Stolen merchandise vs number of workers’: number of workers on duty (9 to 18) on the x-axis against value of merchandise in $s lost to shoplifters (roughly 200 to 420) on the y-axis, showing a clear downward-sloping pattern.]

(b) The summary statistics can be substituted into the formula for the sample
correlation coefficient to obtain the value r = −0.9688. An interpretation of
this value is the following – the data suggest that the higher the number of
workers, the lower the loss from shoplifters. The fact that the value is very
close to −1 suggests that this is a strong, negative linear relationship.
(c) The regression line can be written either as:

        ŷ = β̂₀ + β̂₁x   or   y = β̂₀ + β̂₁x + ε.

    The formula for β̂₁ is:

        β̂₁ = (∑ xᵢyᵢ − nx̄ȳ) / (∑ xᵢ² − nx̄²)

    and by substituting the summary statistics we get β̂₁ = −26.64.

    The formula for β̂₀ is β̂₀ = ȳ − β̂₁x̄, and we get β̂₀ = 655.36.

    Hence the regression line can be written as:

        ŷ = 655.36 − 26.64x   or   y = 655.36 − 26.64x + ε

which is drawn on the scatter diagram above.


(d) In this case one can note in the scatter diagram that the points seem to be
‘scattered’ around a straight line. Hence a linear regression model does seem to
be a good model here. Since 17 is within the x data range (since 9 ≤ xi ≤ 18)
this is interpolation. According to the model, the expected loss from
shoplifting for 17 workers on duty is:

655.36 − 26.64 × 17 ≈ $202.48.
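As a check, parts (b) and (c) can be reproduced from the five summary statistics alone. A minimal Python sketch:

```python
# Correlation and least squares estimates from the summary statistics of question 2.
import math

n = 10
sum_x, sum_x2 = 130, 1_760
sum_y, sum_y2 = 3_090, 1_007_750
sum_xy = 38_305

xbar, ybar = sum_x / n, sum_y / n
Sxx = sum_x2 - n * xbar ** 2        # 70
Syy = sum_y2 - n * ybar ** 2        # 52,940
Sxy = sum_xy - n * xbar * ybar      # -1,865

r = Sxy / math.sqrt(Sxx * Syy)      # about -0.9688
b1 = Sxy / Sxx                      # slope, about -26.64
b0 = ybar - b1 * xbar               # intercept, about 655.36
print(round(r, 4), round(b1, 2), round(b0, 2))
```

Note that with unrounded coefficients the prediction for 17 workers is about $202.43; the $202.48 above comes from using the rounded values 655.36 and 26.64.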


Appendix A
Mathematics primer and the role of
statistics in the research process

A.1 Worked examples

1. A dataset contains the observations 1, 1, 1, 2, 4, 8, 9 (so here, n = 7). Find:

   (a) ∑_{i=1}^{n} 2xᵢ

   (b) ∑_{i=1}^{n} xᵢ²

   (c) ∑_{i=1}^{n} (xᵢ − 2)

   (d) ∑_{i=1}^{n} (xᵢ − 2)²

   (e) (∑_{i=1}^{n} xᵢ)²

   (f) ∑_{i=1}^{n} 2.

Solution:

We have that x₁ = 1, x₂ = 1, x₃ = 1, x₄ = 2, x₅ = 4, x₆ = 8 and x₇ = 9.

(a) We have:

    ∑_{i=1}^{n} 2xᵢ = 2x₁ + 2x₂ + 2x₃ + 2x₄ + 2x₅ + 2x₆ + 2x₇
                    = 2 × (x₁ + x₂ + x₃ + x₄ + x₅ + x₆ + x₇)
                    = 2 × (1 + 1 + 1 + 2 + 4 + 8 + 9)
                    = 52.

(b) We have:

    ∑_{i=1}^{n} xᵢ² = x₁² + x₂² + x₃² + x₄² + x₅² + x₆² + x₇²
                    = 1² + 1² + 1² + 2² + 4² + 8² + 9²
                    = 168.

(c) We have:

    ∑_{i=1}^{n} (xᵢ − 2) = (x₁ − 2) + (x₂ − 2) + (x₃ − 2) + (x₄ − 2) + (x₅ − 2) + (x₆ − 2) + (x₇ − 2)
                         = (1 − 2) + (1 − 2) + (1 − 2) + (2 − 2) + (4 − 2) + (8 − 2) + (9 − 2)
                         = 12.

(d) We have, extending (c):

    ∑_{i=1}^{n} (xᵢ − 2)² = (1 − 2)² + (1 − 2)² + (1 − 2)² + (2 − 2)² + (4 − 2)² + (8 − 2)² + (9 − 2)²
                          = 92.

(e) We have:

    (∑_{i=1}^{n} xᵢ)² = (x₁ + x₂ + x₃ + x₄ + x₅ + x₆ + x₇)²
                      = (1 + 1 + 1 + 2 + 4 + 8 + 9)²
                      = (26)²
                      = 676.

Note that:

    (∑_{i=1}^{n} xᵢ)² ≠ ∑_{i=1}^{n} xᵢ².

(f) We have:

    ∑_{i=1}^{n} 2 = 2 + 2 + 2 + 2 + 2 + 2 + 2 = 14.
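All six parts can be verified directly in Python, since each expression is just a finite sum:

```python
# Check of worked example 1: summation rules on the dataset 1, 1, 1, 2, 4, 8, 9.
x = [1, 1, 1, 2, 4, 8, 9]
n = len(x)

a = sum(2 * xi for xi in x)            # 52
b = sum(xi ** 2 for xi in x)           # 168
c = sum(xi - 2 for xi in x)            # 12
d = sum((xi - 2) ** 2 for xi in x)     # 92
e = sum(x) ** 2                        # 676  (square of the sum)
f = sum(2 for _ in x)                  # 14   (adding the constant 2, n times)
print(a, b, c, d, e, f)
```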

2. Suppose that x1 = 7, x2 = 3, x3 = 1, x4 = 0, x5 = −6, and y1 = −3, y2 = 5,
y3 = −8, y4 = 9, y5 = 1. Calculate the following quantities:
   (a) ∑_{i=2}^{4} 2yᵢ

   (b) ∑_{i=1}^{3} 4(xᵢ − 1)

   (c) y₁² + ∑_{i=3}^{5} (xᵢ² + 2yᵢ²).

Solution:
(a) We have:

    ∑_{i=2}^{4} 2yᵢ = 2 × (y₂ + y₃ + y₄) = 2 × (5 − 8 + 9) = 12.

(b) We have:

    ∑_{i=1}^{3} 4(xᵢ − 1) = 4 × ∑_{i=1}^{3} (xᵢ − 1)
                          = 4 × ((x₁ − 1) + (x₂ − 1) + (x₃ − 1))
                          = 4 × ((7 − 1) + (3 − 1) + (1 − 1))
                          = 4 × (6 + 2 + 0)
                          = 32.

(c) We have:

    y₁² + ∑_{i=3}^{5} (xᵢ² + 2yᵢ²) = y₁² + (x₃² + 2y₃²) + (x₄² + 2y₄²) + (x₅² + 2y₅²)
                                   = (−3)² + (1² + 2 × (−8)²) + (0² + 2 × 9²) + ((−6)² + 2 × 1²)
                                   = 9 + 129 + 162 + 38
                                   = 338.
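The limits on each sum matter here; a quick Python check of all three parts (remembering that Python lists are 0-indexed, so xᵢ is x[i − 1]):

```python
# Check of worked example 2: sums with non-trivial limits.
x = [7, 3, 1, 0, -6]          # x1..x5
y = [-3, 5, -8, 9, 1]         # y1..y5

# range(2, 5) gives i = 2, 3, 4, matching the summation limits.
a = sum(2 * y[i - 1] for i in range(2, 5))               # 12
b = sum(4 * (x[i - 1] - 1) for i in range(1, 4))         # 32
c = y[0] ** 2 + sum(x[i - 1] ** 2 + 2 * y[i - 1] ** 2
                    for i in range(3, 6))                # 338
print(a, b, c)
```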

3. Suppose that x1 = −0.5, x2 = 2.5, x3 = −2.8, x4 = 0.4, x5 = 6.1, and y1 = −0.5,


y2 = 4.0, y3 = 4.6, y4 = −2.0, y5 = 0. Calculate the following quantities:
   (a) ∑_{i=3}^{5} xᵢ²

   (b) ∑_{i=1}^{2} 1/(xᵢyᵢ)

   (c) y₄³ + ∑_{i=4}^{5} yᵢ²/xᵢ.
Solution:
(a) We have:

    ∑_{i=3}^{5} xᵢ² = x₃² + x₄² + x₅² = (−2.8)² + (0.4)² + (6.1)² = 7.84 + 0.16 + 37.21 = 45.21.

(b) We have:

    ∑_{i=1}^{2} 1/(xᵢyᵢ) = 1/(x₁y₁) + 1/(x₂y₂) = 1/((−0.5) × (−0.5)) + 1/(2.5 × 4.0) = 4 + 0.1 = 4.1.

(c) We have:

    y₄³ + ∑_{i=4}^{5} yᵢ²/xᵢ = y₄³ + y₄²/x₄ + y₅²/x₅ = (−2.0)³ + (−2.0)²/0.4 + 0²/6.1 = −8 + 10 + 0 = 2.

4. Explain why:

    ∑_{i=1}^{n} xᵢ² ≠ (∑_{i=1}^{n} xᵢ)²   and   ∑_{i=1}^{n} xᵢyᵢ ≠ (∑_{i=1}^{n} xᵢ)(∑_{i=1}^{n} yᵢ).

Solution:
Writing out the full summations, we obtain:
    ∑_{i=1}^{n} xᵢ² = x₁² + x₂² + · · · + xₙ² ≠ (x₁ + x₂ + · · · + xₙ)² = (∑_{i=1}^{n} xᵢ)².

Therefore, the ‘sum of squares’ is, in general, not equal to the ‘square of the sum’
because the expansion of the quadratic gives:

    (x₁ + x₂ + · · · + xₙ)² = x₁² + x₂² + · · · + xₙ² + 2x₁x₂ + 2x₁x₃ + · · ·

so, unless all the cross-product terms sum to zero, the two expressions are not the
same. Hence, in general, the two expressions are different. Similarly:
    ∑_{i=1}^{n} xᵢyᵢ = x₁y₁ + x₂y₂ + · · · + xₙyₙ ≠ (x₁ + x₂ + · · · + xₙ)(y₁ + y₂ + · · · + yₙ) = (∑_{i=1}^{n} xᵢ)(∑_{i=1}^{n} yᵢ).

Therefore, the ‘sum of the products’ is, in general, not equal to the ‘product of the
sums’ because the expansion of the products gives:

    (x₁ + x₂ + · · · + xₙ)(y₁ + y₂ + · · · + yₙ) = x₁y₁ + x₂y₂ + · · · + xₙyₙ + x₁y₂ + x₁y₃ + · · · .

Note the first case is a special case of the second if we set yᵢ = xᵢ.
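A small numerical example makes the point immediately:

```python
# The 'sum of squares' is not the 'square of the sum', and the 'sum of
# products' is not the 'product of the sums': a quick counterexample.
x = [1, 2, 3]
y = [4, 5, 6]

sum_of_squares = sum(xi ** 2 for xi in x)                # 14
square_of_sum = sum(x) ** 2                              # 36
sum_of_products = sum(xi * yi for xi, yi in zip(x, y))   # 32
product_of_sums = sum(x) * sum(y)                        # 90
print(sum_of_squares, square_of_sum, sum_of_products, product_of_sums)
```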

A.2 Practice problems

1. Work out the following:


(a) (2 + 4) × (3 + 7)
(b) 1/3 of 12 − 4 ÷ 2
(c) (1 + 4)/5 × (100 − 98).

2. Work out the following (use a calculator where necessary):

   (a) √16

   (b) (0.07)²

   (c) √0.49.

3. (a) What is 98% of 200?


(b) Express 17/25 as a percentage.
(c) What is 25% of 98/144?

4. Provide the absolute values for:


(a) |−8|
(b) |15 − 9|.

5. (a) For which of 2, 3, 7 and 9 is x > 3?


(b) For which of 2, 3, 7 and 9 is x < 3?
(c) For which of 2, 3, 7 and 9 is x ≤ 3?
(d) For which of 2, 3, 7 and 9 is x² ≥ 49?

6. Suppose x₁ = 4, x₂ = 1 and x₃ = 2. For these figures, compute ∑_{i=1}^{2} xᵢ³.

7. If n = 4, x₁ = 2, x₂ = 3, x₃ = 5 and x₄ = 7, compute:

   (a) ∑_{i=1}^{3} xᵢ

   (b) (1/n) ∑_{i=1}^{4} xᵢ².

8. Given x₁ = 3, x₂ = 1, x₃ = 4, x₄ = 6 and x₅ = 8, find:

   (a) ∑_{i=1}^{5} xᵢ

   (b) ∑_{i=3}^{4} xᵢ².

   Given also that p₁ = 1/4, p₂ = 1/8, p₃ = 1/8, p₄ = 1/3 and p₅ = 1/6, find:

   (c) ∑_{i=1}^{5} pᵢxᵢ

   (d) ∑_{i=3}^{5} pᵢxᵢ².

9. Sketch the following linear functions:


(a) y = 3 + x
(b) y = −2 + 3x.
You are going to need equations like these for all the material on regression in
Chapter 10.

A.3 Solutions to Practice problems


1. (a) Rules: Brackets take precedence so this is:

(6) × (10) = 60.

(b) Here the ‘of’ and ÷ take precedence and this is:

    1/3 of 12 (= 4) minus 4 ÷ 2 (= 2)

i.e. 4 minus 2, i.e. +2.

(c) Brackets take precedence so we have:

    (5/5) × 2

which is 1 × 2, which is 2.

2. (a) 16 is 4 × 4 and −4 × −4, so the square root of 16 is ±4.

(b) (0.07)² = 0.07 × 0.07 = 0.0049. Be careful with the decimal points!

(c) 0.49 is 0.7 × 0.7 and −0.7 × −0.7, so the square root of 0.49 is ±0.7.
Be sure that you understand the rules for placing decimal points. When you work
on interval estimation or hypothesis tests of proportions you will need this.

3. (a) The answer is (98/100) × 200 which is 98 × 2, or 196.
(b) To get a percentage, multiply the given fraction by 100, i.e. we obtain:

17 17
× 100 = × 4 = 17 × 4 = 68%.
25 1
So 17/25 can be written as 68%.
(c) 25% of 98/144 is another way of writing (25/100) × (98/144) which is
(1/4) × (49/72) or 49/288 which is 0.1701 to four decimal places.

4. (a) This is, simply put, 8.


(b) This is the same as |6|, which is 6.

5. (a) 7 and 9 are greater than 3.


(b) 2 is less than 3.
(c) 2 and 3 are less than or equal to 3.
(d) If x² ≥ 49, then x ≥ 7. This is the case when x is 7 or 9.

6. Substituting in the values, we obtain:

    ∑_{i=1}^{2} xᵢ³ = x₁³ + x₂³ = 4³ + 1³ = 64 + 1 = 65.

7. (a) Substituting in the values, we obtain:

        ∑_{i=1}^{3} xᵢ = x₁ + x₂ + x₃ = 2 + 3 + 5 = 10.

   (b) Substituting in the values, we obtain:

        (1/n) ∑_{i=1}^{4} xᵢ² = (1/4) × (x₁² + x₂² + x₃² + x₄²) = (1/4) × (2² + 3² + 5² + 7²) = (1/4) × (4 + 9 + 25 + 49) = 21.75.

8. (a) We have:

        ∑_{i=1}^{5} xᵢ = x₁ + x₂ + x₃ + x₄ + x₅ = 3 + 1 + 4 + 6 + 8 = 22.

   (b) We have:

        ∑_{i=3}^{4} xᵢ² = x₃² + x₄² = 4² + 6² = 16 + 36 = 52.

(c) We have:

    ∑_{i=1}^{5} pᵢxᵢ = p₁x₁ + p₂x₂ + p₃x₃ + p₄x₄ + p₅x₅
                     = (1/4) × 3 + (1/8) × 1 + (1/8) × 4 + (1/3) × 6 + (1/6) × 8
                     = 3/4 + 1/8 + 1/2 + 2 + 4/3
                     = 4 17/24

or 4.7083, if you prefer to work in decimals.

(d) We have:

    ∑_{i=3}^{5} pᵢxᵢ² = p₃x₃² + p₄x₄² + p₅x₅²
                      = (1/8) × 4² + (1/3) × 6² + (1/6) × 8²
                      = (1/8) × 16 + (1/3) × 36 + (1/6) × 64
                      = 2 + 12 + 10 2/3
                      = 24 2/3

or 24.6667, if you prefer to work in decimals.

Check this carefully if you could not do it yourself first time. Note that i = 3 and
i = 5 mark the beginning and end, respectively, of the summation, as did i = 3 and
i = 4 in (b).
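If you want to confirm the fractions exactly rather than in decimals, Python's fractions module avoids rounding altogether. A minimal sketch of parts (c) and (d):

```python
# Check of practice problem 8 using exact rational arithmetic.
from fractions import Fraction as F

x = [3, 1, 4, 6, 8]
p = [F(1, 4), F(1, 8), F(1, 8), F(1, 3), F(1, 6)]

c = sum(pi * xi for pi, xi in zip(p, x))                  # 113/24 = 4 17/24
d = sum(pi * xi ** 2 for pi, xi in zip(p[2:], x[2:]))     # 74/3 = 24 2/3
print(c, d)
```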

9. (a) Here the straight line graph has a slope of 1 and the line cuts the y-axis at y = 3 (when x is zero, y is 3).

    [Sketch: a straight line of slope 1 crossing the y-axis at y = 3 and the x-axis at x = −3.]

(b) Here the straight line still has a positive slope of 3 (y increases by 3 when x
increases by 1) but the line crosses the y-axis at a minus value (−2).

    [Sketch: a straight line of slope 3 crossing the x-axis at x = 2/3 and the y-axis at y = −2.]

Appendix B
Data visualisation and descriptive statistics

B.1 Worked examples

1. A survey has been conducted on the wage rates paid to clerical assistants in twenty
companies in your locality in order to make proposals for a review of wages in your
own firm. The results of this survey are as follows:

Hourly rate in £

12.50 11.80 12.10 11.90 12.40 12.10 11.90 12.00 11.80 11.90
12.20 12.00 11.90 12.40 12.00 11.90 12.00 12.10 12.20 12.30

Construct a dot plot of these data.

Solution:
A good idea is to first list the data in increasing order:

11.80 11.80 11.90 11.90 11.90 11.90 11.90 12.00 12.00 12.00
12.00 12.10 12.10 12.10 12.20 12.20 12.30 12.40 12.40 12.50

The dot plot of wage rates is:


          •
          •     •
          •     •     •
    •     •     •     •     •           •
    •     •     •     •     •     •     •     •
  11.80 11.90 12.00 12.10 12.20 12.30 12.40 12.50

2. The following two sets of data represent the lengths (in minutes) of students’
attention spans during a one-hour class. Construct density histograms for each of
these datasets and use these to comment on comparisons between the two
distributions.


Statistics class

01 43 16 28 27 25 26 25 22 26
47 40 14 36 23 32 15 31 19 25
21 07 28 49 31 22 24 26 14 45
38 48 36 22 29 12 32 11 34 42
55 27 06 23 42 21 58 23 35 13

Economics class

60 39 30 41 37 27 38 04 25 43
58 60 21 53 26 47 08 51 19 31
29 21 31 60 48 30 28 37 07 60
50 60 51 24 41 03 37 14 46 60
60 48 25 32 59 11 60 28 54 18
60 42 04 26 60 41 60 11 43 28

Solution:
The exact shape of the density histograms will depend on the class intervals you
have used. A sensible approach is to choose round numbers. Below the following
have been used:

Statistics (n = 50)

Class interval   Width wk   Frequency fk   Rel. freq. rk = fk/n   Density dk = rk/wk
[0, 10) 10 3 0.06 0.006
[10, 20) 10 8 0.16 0.016
[20, 30) 10 20 0.40 0.040
[30, 40) 10 9 0.18 0.018
[40, 50) 10 8 0.16 0.016
[50, 60) 10 2 0.04 0.004
[60, 70) 10 0 0.00 0.000

Economics (n = 60)

Class interval   Width wk   Frequency fk   Rel. freq. rk = fk/n   Density dk = rk/wk
[0, 10) 10 5 0.083 0.0083
[10, 20) 10 5 0.083 0.0083
[20, 30) 10 12 0.200 0.0200
[30, 40) 10 10 0.167 0.0167
[40, 50) 10 10 0.167 0.0167
[50, 60) 10 7 0.117 0.0117
[60, 70) 10 11 0.183 0.0183


For the statistics class, the density histogram of students’ attention spans is
approximately symmetric. Attention spans for the economics class students are
more variable and higher on average than those of statistics students, and they are
not symmetric due to a group of students who maintain interest throughout the
class. Note we have more data on economics students, which is perfectly
acceptable. Sample sizes of different groups may very well be different in practice.
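The density column in either table above is simply frequency divided by sample size and bin width. For equal-width bins this is a one-line computation; a minimal sketch for the statistics class:

```python
# Density = relative frequency / bin width = frequency / (n * width),
# here for the statistics-class bins [0,10), [10,20), ..., [60,70).
freqs = [3, 8, 20, 9, 8, 2, 0]
n = sum(freqs)       # 50
width = 10

densities = [f / (n * width) for f in freqs]
print([round(dk, 4) for dk in densities])   # [0.006, 0.016, 0.04, 0.018, 0.016, 0.004, 0.0]
```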

3. Sales of footwear in a store were recorded for 52 weeks and these are shown below.

Weekly sales of footwear

30 60 67 63 69 54 68 60 62 83 66 70 68
61 74 94 87 66 69 66 62 78 90 98 93 73
70 68 47 40 51 56 56 58 57 47 71 76 80
79 77 77 73 64 67 59 46 54 53 49 58 62

Construct a stem-and-leaf diagram of these data and comment on its shape.


Solution:

Stem-and-leaf diagram of weekly shoe sales

Stems in 10s Leaves in 1s


3 0
4 06779
5 1344667889
6 001222346667788899
7 00133467789
8 037
9 0348

We can see from the stem-and-leaf diagram that the data look approximately
symmetric, centred at about 66.
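A stem-and-leaf diagram like this can be generated mechanically: group each value by its tens digit (the stem) and record the units digit (the leaf) in sorted order. A minimal Python sketch:

```python
# Stem-and-leaf construction for the weekly shoe sales data
# (stems are the tens digit, leaves the units digit).
sales = [30, 60, 67, 63, 69, 54, 68, 60, 62, 83, 66, 70, 68,
         61, 74, 94, 87, 66, 69, 66, 62, 78, 90, 98, 93, 73,
         70, 68, 47, 40, 51, 56, 56, 58, 57, 47, 71, 76, 80,
         79, 77, 77, 73, 64, 67, 59, 46, 54, 53, 49, 58, 62]

stems = {}
for v in sorted(sales):
    stems.setdefault(v // 10, []).append(v % 10)

for stem in sorted(stems):
    print(stem, "".join(str(leaf) for leaf in stems[stem]))
```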


4. Calculate the mean, median and mode of the following sample values:

5, 8, 10, 7, 7, 4 and 12.


Solution:
We have ∑ xᵢ = 53 and n = 7, hence x̄ = 53/7 = 7.57.
For the median, arrange the observations into ascending order, giving:

4, 5, 7, 7, 8, 10 and 12.

Given there is an odd number of observations, since n = 7, then the median is:

x((n+1)/2) = x(4) = 7.

The mode is 7 since this occurs twice, with the other values only occurring once
each.
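Python's standard statistics module reproduces all three measures directly:

```python
# Mean, median and mode of the sample in question 4.
import statistics

data = [5, 8, 10, 7, 7, 4, 12]
print(round(statistics.mean(data), 2))   # 7.57
print(statistics.median(data))           # 7
print(statistics.mode(data))             # 7
```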

5. Consider again the wage rate data in Question 1. Calculate the mean, median and
mode for these data. Explain which measure should be used in deciding on a rate
to be used in your company.

Solution:
The mean is:

    x̄ = ∑_{k=1}^{8} fₖxₖ / ∑_{k=1}^{8} fₖ = (2 × 11.80 + 5 × 11.90 + · · · + 1 × 12.50)/20 = 241.4/20 = £12.07.

The median is given by:

    m = (x(10) + x(11))/2 = (12.00 + 12.00)/2 = £12.00.

The mode is £11.90 (5 repetitions).


The frequency distribution is positively skewed (i.e. has a long tail on the right)
and so we have (as is often the case with positively-skewed data) that:

mode < median < mean.

The mode is probably a poor measure of location in this case (the mode is often a
poor measure when we are dealing with measurable variables), so the median or
mean is preferred. Depending on the company’s negotiating strategy, it might
choose a figure slightly higher than the mean, say £12.20, since it is higher than the
average and beaten by fewer than one-third of rivals.


6. Hard!
   In a sample of n = 6 objects the mean of the data is 15 and the median is 11.
   Another observation is then added to the sample and this takes the value
   x₇ = 12.
(a) Calculate the mean of the seven observations.
(b) What can you conclude about the median of the seven observations?

Solution:
The ‘new’ mean is:

    x̄_new = (∑_{i=1}^{6} xᵢ + x₇)/(6 + 1) = (nx̄_original + x₇)/(6 + 1) = (6 × 15 + 12)/7 = 102/7 = 14.6.
Note the easy way to calculate the new sum of data, using the old mean.
There is not enough information in the question to calculate the new median
exactly, but it must be somewhere between 11 and 12 (inclusive), in part because
for the original data:
x(1) , x(2) , x(3) ≤ 11
and:
x(4) , x(5) , x(6) ≥ 11.
x7 = 12 fits somewhere in the second group, increasing the median.
• If x(3) < x7 < x(4) then the new median will be x7 = 12.
• If x(3) < x(4) < x7 then the new median will be x(4) , where 11 ≤ x(4) ≤ 12.
Other cases can also be worked out if we had enough information, or estimated by
a bit more algebra.

7. An airline has a baggage weight allowance of 20 kg per passenger on its flights. It


collected the following data on the weights of its passengers’ baggage. The data
were taken from a sample of 100 passengers on a Madrid–London flight. There were
about 300 passengers on the flight and the 100 passengers were chosen at random.

Weight of baggage (kg) Number of passengers


0≤x<5 21
5 ≤ x < 10 2
10 ≤ x < 15 10
15 ≤ x < 20 27
20 ≤ x < 25 26
25 ≤ x < 30 11
30 ≤ x < 35 3
35 ≤ x 0

(a) Display the data using an appropriate diagram.


(b) Calculate the mean and median of the data.


(c) The company uses the data to claim that ‘40% of airline passengers travel with
baggage over the weight allowance’. Explain whether or not you think this
claim is valid. (Think about how the data were collected!)
Solution:

(a) A density histogram is probably as good as anything here (though in some


cases like this there is room for debate about what is best). We have:

Class interval   Width wk   Frequency fk   Rel. freq. rk = fk/n   Density dk = rk/wk   Cum. freq.
[0, 5) 5 21 0.21 0.042 21
[5, 10) 5 2 0.02 0.004 23
[10, 15) 5 10 0.10 0.020 33
[15, 20) 5 27 0.27 0.054 60
[20, 25) 5 26 0.26 0.052 86
[25, 30) 5 11 0.11 0.022 97
[30, 35) 5 3 0.03 0.006 100

(b) The sample mean, using midpoints of the class intervals, is:

    x̄ = ∑_{k=1}^{7} fₖxₖ / ∑_{k=1}^{7} fₖ = ((21 × 2.5) + (2 × 7.5) + · · · + (3 × 32.5))/(21 + 2 + · · · + 3) = 1,650/100 = 16.5 kg.

For the median, since n = 100 we seek the 50.5th ordered value, which must be
located within the [15, 20) class interval (using the cumulative frequency
column in the table above). Since we do not have the raw data, we use the


interpolation approach. Hence:

    m = endpoint of previous bin + (bin width × number of remaining observations)/(bin frequency)
      = 15 + (5 × (50.5 − 33))/27
      = 18.24 kg.

(c) The claim is not valid. The data were taken from one route and only one flight
on that route. We cannot extrapolate to make conclusions about all flights.
There needs to be more variety in the sampling. For example, include flights to
many destinations, both domestic and international flights etc.
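Both the grouped mean (via midpoints) and the interpolated median can be computed programmatically. A minimal sketch for the baggage data:

```python
# Grouped mean and interpolated median for the baggage-weight data of question 7.
# Each bin is [lower, lower + 5) with the given frequency.
lowers = [0, 5, 10, 15, 20, 25, 30]
freqs = [21, 2, 10, 27, 26, 11, 3]
width = 5

n = sum(freqs)                                           # 100
mids = [lo + width / 2 for lo in lowers]
mean = sum(f * m for f, m in zip(freqs, mids)) / n       # 16.5

# Locate the bin containing the (n + 1)/2 = 50.5th ordered value.
target, cum = (n + 1) / 2, 0
for lo, f in zip(lowers, freqs):
    if cum + f >= target:
        median = lo + width * (target - cum) / f         # about 18.24
        break
    cum += f

print(mean, round(median, 2))
```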

8. Consider again the sales of footwear data in Question 3.

Weekly sales of footwear

30 60 67 63 69 54 68 60 62 83 66 70 68
61 74 94 87 66 69 66 62 78 90 98 93 73
70 68 47 40 51 56 56 58 57 47 71 76 80
79 77 77 73 64 67 59 46 54 53 49 58 62

(a) Use the stem-and-leaf diagram to find the median and the quartiles of the data.

(b) Calculate the interquartile range.

Solution:

(a) There are 52 values so the median is halfway between the 26th and 27th
ordered values. It is easy to read off from the graph that both the 26th and
27th ordered values equal 66, so the median equals 66.

(b) Again, we can read off from the graph that the lower and upper quartiles are
about 57 and 74, respectively. On this basis, the interquartile range is
74 − 57 = 17. However, there exist slightly different definitions of the quartiles
so you might have obtained a slightly different answer (but not too different!)
depending on the definition you have used.

A sensible method is:

    Q1 = (x(13) + x(14))/2 = (57 + 58)/2 = 57.5

and:

    Q3 = (x(39) + x(40))/2 = (73 + 74)/2 = 73.5.
Using this method, the interquartile range is 73.5 − 57.5 = 16.


9. Calculate the range, variance and standard deviation of the following sample values:

5, 8, 10, 7, 7, 4 and 12.


Solution:
The range is:

    x(n) − x(1) = 12 − 4 = 8

where x(1) is the minimum value in the dataset, and x(n) is the maximum value.

We have ∑ᵢ xᵢ = 53 and ∑ᵢ xᵢ² = 447, hence x̄ = 53/7 = 7.57 and the (sample) variance is:

    s² = (1/(n − 1)) × (∑ᵢ xᵢ² − nx̄²) = (1/6) × (447 − 7 × (7.57)²) = 7.64.

Hence the (sample) standard deviation is:

    s = √s² = √7.64 = 2.76.
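The same quantities are available from the statistics module; note the exact sample variance here is 7.62 (a value of 7.64 arises if x̄ is rounded to 7.57 before squaring):

```python
# Range, sample variance and standard deviation for 5, 8, 10, 7, 7, 4 and 12.
import statistics

data = [5, 8, 10, 7, 7, 4, 12]
rng = max(data) - min(data)          # 8
var = statistics.variance(data)      # sample variance (divisor n - 1)
sd = statistics.stdev(data)
print(rng, round(var, 2), round(sd, 2))   # 8 7.62 2.76
```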

10. Compare the means and standard deviations of the sample datasets shown in
Question 2 and comment. You may wish to use the following summary statistics.

Statistics class Economics class


Sum 1,395 2,225
Sum of squares 46,713 100,387

Solution:
For the statistics students:

    x̄ = (1/50) ∑_{i=1}^{50} xᵢ = 1,395/50 = 27.9   and   s = √((1/49) × (46,713 − (1,395)²/50)) = 12.6.

So the statistics students’ mean and standard deviation of attention spans are 27.9
minutes and 12.6 minutes, respectively. For the economics students:
    x̄ = (1/60) ∑_{i=1}^{60} xᵢ = 2,225/60 = 37.1   and   s = √((1/59) × (100,387 − (2,225)²/60)) = 17.4.

So the economics students’ mean and standard deviation of attention spans are
37.1 minutes and 17.4 minutes, respectively.
These statistics represent the main features of the distributions shown in the
density histograms of Question 2. The economics students have the higher mean
and variability due to the group which maintains interest throughout the class.


11. Calculate the mean and standard deviation of the following groups of numbers
(treat these as samples).
(a) 5, 3, 4, 7, 10, 7 and 1.
(b) 12, 10, 11, 14, 17, 14 and 8.
(c) 12, 8, 10, 16, 22, 16 and 4.
(d) 25, 9, 16, 49, 100, 49 and 1.
Comment on any relationships between the statistics for the groups.
Solution:
The sample means and standard deviations are:

(a) (b) (c) (d)


Mean 5.29 12.29 12.57 35.57
Standard deviation (s.d.) 2.98 2.98 5.97 33.93

Comments are the following.


• Dataset (b) is (a) + 7, hence mean(b) = mean(a) + 7 and s.d.(a) = s.d.(b).
• Dataset (c) is 2(a) + 2, hence mean(c) = 2 × mean(a) + 2 and
s.d.(c) = 2 × s.d.(a).
• Dataset (d) is the square of group (a).
These results are examples of transformations of variables.
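These transformation rules for groups (b) and (c) are easy to confirm numerically:

```python
# How linear transformations affect the mean and s.d. (b = a + 7, c = 2a + 2).
import statistics as st

a = [5, 3, 4, 7, 10, 7, 1]
b = [x + 7 for x in a]
c = [2 * x + 2 for x in a]

print(round(st.mean(a), 2), round(st.stdev(a), 2))   # 5.29 2.98
print(round(st.mean(b), 2), round(st.stdev(b), 2))   # 12.29 2.98
print(round(st.mean(c), 2), round(st.stdev(c), 2))   # 12.57 5.97
```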

12. Consider again the wage rate data in Question 1. Calculate the range, interquartile
range and standard deviation of these data.
Solution:
The range is £12.50 − £11.80 = £0.70. For the quartiles we have:

    Q1 = (x(5) + x(6))/2 = (11.90 + 11.90)/2 = £11.90

and:

    Q3 = (x(15) + x(16))/2 = (12.20 + 12.20)/2 = £12.20.
Therefore, the interquartile range is £12.20 − £11.90 = £0.30. To compute the (sample) variance, compute the following:

    ∑_{i=1}^{20} xᵢ² = 2,914.50

    Sxx = ∑_{i=1}^{20} xᵢ² − (1/n)(∑_{i=1}^{20} xᵢ)² = 2,914.50 − (1/20) × (241.4)² = 0.802

hence:

    s² = Sxx/(n − 1) = 0.802/19 = 0.042211.


Hence the (sample) standard deviation is √0.042211 = £0.21. Note that two decimal places are sufficient here.

Tip: Working out the quartiles is easier if the sample size is a multiple of 4. If we have 4k items, listed in increasing order as x(1), x(2), . . . , x(4k), then:

    Q1 = (x(k) + x(k+1))/2,   Q2 = (x(2k) + x(2k+1))/2   and   Q3 = (x(3k) + x(3k+1))/2.
13. The table below summarises the distribution of salaries for a sample of 50
employees.

Salary (£000s) [5, 10) [10, 15) [15, 20) [20, 25)
Number of employees 8 14 21 7
(a) Draw a density histogram for these data.
(b) Calculate the mean, standard deviation and median.
(c) What is the modal class for this sample?
Solution:
(a) We have, using midpoints for calculations:

Class interval   Width wk   Frequency fk   Rel. freq. rk = fk/n   Density dk = rk/wk   fkxk   fkxk²
[5, 10) 5 8 0.16 0.032 60.0 450.00
[10, 15) 5 14 0.28 0.056 175.0 2,187.50
[15, 20) 5 21 0.42 0.081 367.5 6,431.25
[20, 25) 5 7 0.14 0.028 157.5 3,543.75
Total 50 760.0 12,612.50


(b) The mean is x̄ = 760/50 = 15.2, i.e. £15,200.


The sample standard deviation is:

    s = √((1/49) × (12,612.50 − (760)²/50)) = 4.65, i.e. £4,650.

For the median, since n = 50 we seek the 25.5th ordered value, which must be located within the [15, 20) class interval. Since we do not have the raw data, we use the interpolation approach. Hence:

    m = endpoint of previous bin + (bin width × number of remaining observations)/(bin frequency)
      = 15 + (5 × (25.5 − 22))/21
      = 15.83, i.e. £15,830.

(c) The modal class is [15, 20) (where the modal class is the class interval with the
greatest frequency).

14. In a sample of 64 observations, the numbers of households containing 1, 2, 3, 4 and


5 persons were 8, 30, 11, 7 and 8, respectively. Calculate the mean and standard
deviation of household size.
Solution:
It might help to display the data in a table:

Household size, xk 1 2 3 4 5
Frequency, fk 8 30 11 7 8

The summary statistics are:

    ∑ fₖ = 8 + 30 + 11 + 7 + 8 = 64

    ∑ fₖxₖ = 1 × 8 + 2 × 30 + · · · + 5 × 8 = 169

and:

    ∑ fₖxₖ² = 1² × 8 + 2² × 30 + · · · + 5² × 8 = 539.
So:

    x̄ = ∑_{k=1}^{K} fₖxₖ / ∑_{k=1}^{K} fₖ = 169/64 = 2.64

and:

    s² = ∑_{k=1}^{K} fₖxₖ² / ∑_{k=1}^{K} fₖ − (∑_{k=1}^{K} fₖxₖ / ∑_{k=1}^{K} fₖ)² = 539/64 − (169/64)² = 1.45.

Therefore, the standard deviation is s = √1.45 = 1.20.
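The frequency-table formulas translate directly into code (note this solution divides by n rather than n − 1):

```python
# Mean and standard deviation from a frequency table (worked example 14).
import math

sizes = [1, 2, 3, 4, 5]
freqs = [8, 30, 11, 7, 8]

n = sum(freqs)                                        # 64
mean = sum(f * x for f, x in zip(freqs, sizes)) / n   # 169/64
var = sum(f * x ** 2 for f, x in zip(freqs, sizes)) / n - mean ** 2
print(round(mean, 2), round(math.sqrt(var), 2))       # 2.64 1.2
```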


15. James set a class test for his students. The test was marked out of 10, and the
students’ marks are summarised below. (Note that no student obtained full marks,
and every student scored at least 3.)
B
Mark (out of 10) 3 4 5 6 7 8 9
Number of students 1 1 6 5 11 3 1

(a) Display the data using an appropriate diagram.


(b) Calculate the mean and median marks.
(c) Calculate the standard deviation, the quartiles and the interquartile range.

Solution:

(a) A histogram would be acceptable for this dataset. We have:

Class interval   Width wk   Frequency fk   Rel. freq. rk = fk/n   Density dk = rk/wk
[3, 4) 1 1 0.0357 0.0357
[4, 5) 1 1 0.0357 0.0357
[5, 6) 1 6 0.2143 0.2143
[6, 7) 1 5 0.1786 0.1786
[7, 8) 1 11 0.3929 0.3929
[8, 9) 1 3 0.1071 0.1071
[9, 10) 1 1 0.0357 0.0357

(b) x̄ = 177/28 = 6.32 marks. There are 28 students. The 14th lowest score was 7
and the 15th lowest was also 7. Therefore, the median mark also equals 7
marks.


(c) We compute the sample standard deviation using:

    s = √((1/27) × ((1 × 3² + 1 × 4² + · · · + 1 × 9²) − (177)²/28)) = √1.78 = 1.33 marks.

The lower quartile equals 5 and the upper quartile equals 7, so the
interquartile range = 7 − 5 = 2. Note that, perhaps surprisingly, the upper
quartile equals the middle quartile, which is, of course, the median. This sort
of thing can happen with discrete data!

B.2 Practice problems


1. Identify and describe the variables in each of the following examples.
(a) Voting intentions in a poll of 1,000 people.
(b) The number of hours students watch television per week.
(c) The number of children per household receiving full-time education.

2. Think about why and when you would use each of the following.
(a) A density histogram.
(b) A stem-and-leaf diagram.
When would you not do so?

3. Find the mean of the number of hours of television watched per week by 10
students, with the following observations:

2, 2, 2, 5, 5, 10, 10, 10, 10 and 10.

4. Say whether the following statement is true or false and briefly give your reason(s).
‘The mean of a dataset is always greater than the median.’

5. If n = 4, x₁ = 1, x₂ = 4, x₃ = 5 and x₄ = 6, find:

    (1/3) ∑_{i=2}^{4} xᵢ.

Why might you use this figure to estimate the mean?

6. For x1 = 4, x2 = 1 and x3 = 2, calculate:


(a) the median
(b) the mean.

7. If x1 = 4, x2 = 2, x3 = 2, x4 = 5 and x5 = 6, calculate:
(a) the mode
(b) the mean.


8. Calculate the mean, median and mode of the prices (in £) of spreadsheet packages.
Also, calculate the range, variance, standard deviation and interquartile range of
the spreadsheet prices shown below. Check against the answers given after the data.
B
Name Price Price − Mean (Price − Mean)²
Brand A 52 −82.54 6,812.60
Brand B 64 −70.54 4,975.67
Brand C 65 −69.54 4,835.60
Brand D 82 −52.54 2,760.29
Brand E 105 −29.54 875.52
Brand F 107 −27.54 758.37
Brand G 110 −24.54 602.14
Brand H 115 −19.54 381.75
Brand I 155 20.46 418.67
Brand J 195 60.46 3,655.60
Brand K 195 60.46 3,655.60
Brand L 239 104.46 10,912.21
Brand M 265 130.46 17,020.21

Total 1,749 0.00 57,661.23

Variance = 4,805.10 Std. Dev. = 69.32 IQR = 113

You should be able to show that the arithmetic mean is 134.54, the median is 110,
and the mode is 195.

9. Work out s² for a sample of nine observations of the number of minutes students
   took to complete a statistics problem.

   2, 4, 5, 6, 6, 7, 8, 11 and 20.

10. State whether the following statement is true or false and briefly give your
reason(s). ‘Three quarters of the observations in a dataset are less than the lower
quartile.’

11. The data below show the number of daily telephone calls received by an office
supplies company over a period of 25 working days.

219 541 58 7 13
476 418 177 175 455
258 312 164 314 336
121 77 183 133 78
291 138 244 36 48
(a) Construct a stem-and-leaf diagram for these data and use this to find the
median of the data.
(b) Find the first and third quartiles of the data.
(c) Would you expect the mean to be similar to the median? Explain.
(d) Comment on your figures.


B.3 Solutions to Practice problems


1. (a) Imagine there are only three possible parties people can vote for: A, B and C.
Hence there will be three possible variables with nA , nB and nC being the
numbers voting for each (where nA + nB + nC = 1,000), or pA , pB and pC being
the proportions voting for each (where pA + pB + pC = 1). Whichever way you
look at it, these are discrete variables.
(b) The number of hours students watch television per week could be anything
from 0 to 24 × 7 = 168! If you only measure whole numbers, you would regard
these as integers, though with such large numbers the distribution will
approximate to a continuous one. It is more likely that you will be given more
detailed figures, such as 1.33 hours, for example, or 23.75 hours. In this case,
hours are clearly continuous variables.
(c) Only whole numbers are possible which are unlikely to be large, so the variable
is discrete.

2. (a) We use a density histogram when we want to compare the numbers of


observations or frequencies in categories which form part of a continuum, such
as numbers in the income groups:
◦ less than £200 per week
◦ £200 and less than £300 per week
◦ £300 and less than £500 per week
and so on. There must be no overlaps between the class intervals, i.e. they
must be mutually exclusive, they must also be collectively exhaustive.
(b) A stem-and-leaf diagram is very useful to work out before we draw a density
histogram. It is a way of organising data and understanding their structure.
You could not use:
• a density histogram if you are looking at named groups with no particular
relationship with each other
• a stem-and-leaf diagram if you are looking, for example, at proportions in
categories.

3. Adding the ten observations and dividing by 10 gives us:

    x̄ = (1/n) ∑_{i=1}^{n} xᵢ = (2 + 2 + 2 + 5 + 5 + 10 + 10 + 10 + 10 + 10)/10 = 66/10 = 6.6.

Using the frequency form of the formula gives us, again:

    x̄ = ∑_{k=1}^{K} fₖxₖ / ∑_{k=1}^{K} fₖ = ((3 × 2) + (2 × 5) + (5 × 10))/10 = (6 + 10 + 50)/10 = 6.6.


4. The statement is false. The mean is only greater than the median when the dataset
is positively skewed. If the distribution is symmetric, then the mean and median
are equal. If the distribution is negatively skewed, then the mean is less than the
median.

5. We have:

    (1/3) ∑_{i=2}^{4} xᵢ = (1/3) × (x₂ + x₃ + x₄) = (4 + 5 + 6)/3 = 15/3 = 5.
The first observation, x1 , could be an outlier due to its value, 1, being ‘far’ from
the other observed values. Given the small sample size of n = 4, including this
potentially extreme observation might give a misleading estimate of the true
population mean.

6. (a) Since n = 3, the median is the midpoint of the ordered dataset, which is:

x(1) = 1, x(2) = 2 and x(3) = 4

hence the median is x(2) = 2.


(b) The mean is:
(1/n) Σ_{i=1}^{n} xi = (4 + 1 + 2)/3 = 7/3 = 2.3̇.

7. (a) The mode is the value which occurs most frequently. Here the mode is 2.
(b) The mean is:
(1/n) Σ_{i=1}^{n} xi = (x1 + x2 + x3 + x4 + x5)/5 = (4 + 2 + 2 + 5 + 6)/5 = 19/5 = 3.8.

8. There are 13 observations so the mean is the sum of them (1,749 as given) divided
by 13. This comes to £134.54.
The numbers have been arranged in order of size, so the median is the
((13 + 1)/2)th observation, that is the 7th (ordered) observation. This is £110.
The mode of the data, as given, is £195 which occurs twice (the other figures only
occur once). However, if we round to the nearest 10 (counting the 5s downwards)
the prices become (in £s) 50, 60, 60, 80, 100, 110, 110, 110, 150, 190, 190, 240, 260
and the mode is then £110 which occurs three times (£60 and £190 occur twice –
less often).
The range of the data is 265 − 52 = 213, so a price difference of £213 between the
most expensive and cheapest brands. The variance, standard deviation and
interquartile range are provided in the table, so check through your working and
make sure you can determine these.

9. We first need to know the sample mean, which is:


x̄ = (1/n) Σ_{i=1}^{n} xi = (2 + 4 + 5 + 6 + 6 + 7 + 8 + 11 + 20)/9 = 69/9 = 7.67.

B.3. Solutions to Practice problems

Therefore:

s² = (1/(n − 1)) Σ_{i=1}^{n} (xi − 7.67)²
   = (1/8) ((2 − 7.67)² + (4 − 7.67)² + (5 − 7.67)² + (6 − 7.67)² + (6 − 7.67)²
     + (7 − 7.67)² + (8 − 7.67)² + (11 − 7.67)² + (20 − 7.67)²)
   = (1/8) (32.15 + 13.47 + 7.13 + 2.79 + 2.79 + 0.45 + 0.11 + 11.09 + 152.03)
   = 222.01/8
   = 27.75.

You can see this is very hard work, and it is quite remarkable that having rounded
figures to the nearest two decimal places before squaring the (xi − x̄) terms we still
get the same answer as that using the short-cut formula.
The short-cut formula gives us:
s² = (1/(n − 1)) (Σ_{i=1}^{n} xi² − n x̄²) = (1/8) (751 − 9 × (7.67)²) = 27.75.
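Both variance formulas are easy to verify numerically; this Python sketch (illustrative, not part of the guide) computes the definitional and short-cut versions using the exact sample mean rather than the rounded 7.67.

```python
# Question 9: sample variance by the definitional and short-cut formulas.
data = [2, 4, 5, 6, 6, 7, 8, 11, 20]
n = len(data)
xbar = sum(data) / n  # exact mean 69/9

# Definitional formula: s^2 = (1/(n-1)) * sum((x_i - xbar)^2)
s2_def = sum((x - xbar) ** 2 for x in data) / (n - 1)

# Short-cut formula: s^2 = (1/(n-1)) * (sum(x_i^2) - n * xbar^2)
s2_short = (sum(x ** 2 for x in data) - n * xbar ** 2) / (n - 1)
```

With the exact mean, both formulas give exactly 27.75, since Σ xi² = 751 and n x̄² = 529.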

10. The statement is false. By definition, the lower quartile separates the bottom 25%
of observations in a dataset from the top 75% of observations.

11. (a) We have:


Stem-and-leaf diagram of daily telephone calls

Stems in 100s Leaves in 10s and 1s


0 07 13 36 48 58 77 78
1 21 33 38 64 75 77 83
2 19 44 58 91
3 12 14 36
4 18 55 76
5 41
Since n = 25, the median is:

x((n+1)/2) = x(13) = 177.

(b) For the quartiles, noting that different quartile calculation methods are
acceptable, possible (interpolated) solutions are:
Q1 = x(n/4) = x(6.25) = x(6) + (x(7) − x(6))/4 = 77 + (78 − 77)/4 = 77.25

and:

Q3 = x(3n/4) = x(18.75) = x(18) + 3(x(19) − x(18))/4 = 291 + 3(312 − 291)/4 = 306.75.

(c) The stem-and-leaf diagram clearly shows a positively-skewed distribution,


hence we would expect the mean to be greater than the median. (In fact the
mean of the data is 210.88.)
(d) As discussed in part (c), the data are clearly skewed to the right. The third
quartile indicates that about 75% of the time no more than 300 telephone calls
are received. The company’s management could use this information when
considering staffing levels.
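The median and interpolated quartiles can be checked with a short Python sketch (illustrative, not part of the guide). The data values are read off the stem-and-leaf diagram, and the code uses the same quartile convention as above; other conventions give slightly different values.

```python
# Question 11: median and interpolated quartiles of the 25 daily call counts.
calls = sorted([7, 13, 36, 48, 58, 77, 78,
                121, 133, 138, 164, 175, 177, 183,
                219, 244, 258, 291,
                312, 314, 336,
                418, 455, 476,
                541])
n = len(calls)  # 25

median = calls[(n + 1) // 2 - 1]  # x_(13), 0-indexed list

def interpolate(pos):
    """Value at a (1-indexed) fractional position via linear interpolation."""
    lo = int(pos)
    frac = pos - lo
    return calls[lo - 1] + frac * (calls[lo] - calls[lo - 1])

q1 = interpolate(n / 4)       # position 6.25
q3 = interpolate(3 * n / 4)   # position 18.75
```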

Appendix C
Probability theory
C.1 Worked examples

1. Calculate the probability that, when two fair dice are rolled, the sum of the
upturned faces will be:
(a) an odd number
(b) less than 9
(c) exactly 12
(d) exactly 4.

Solution:
The following table shows all the possibilities [as: (Score on first die, Score on
second die)]:

(1, 1) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6)


(2, 1) (2, 2) (2, 3) (2, 4) (2, 5) (2, 6)
(3, 1) (3, 2) (3, 3) (3, 4) (3, 5) (3, 6)
(4, 1) (4, 2) (4, 3) (4, 4) (4, 5) (4, 6)
(5, 1) (5, 2) (5, 3) (5, 4) (5, 5) (5, 6)
(6, 1) (6, 2) (6, 3) (6, 4) (6, 5) (6, 6)

The total number of possible outcomes, N , is 36. Note all outcomes are equally
likely.
(a) The number of favourable points = 18, so P (odd total) = 18/36 = 1/2 = 0.50.
(b) The number of favourable points = 26, so P (less than 9) = 26/36 = 13/18
= 0.7222.
(c) The number of favourable points = 1, so P (exactly 12) = 1/36 = 0.0278.
(d) The number of favourable points = 3, so P (exactly 4) = 3/36 = 1/12 = 0.0833.
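The counts above can be confirmed by enumerating all 36 outcomes; the following Python sketch (not part of the guide) does exactly that.

```python
# Question 1: enumerate the sample space for two fair dice.
from itertools import product

sums = [a + b for a, b in product(range(1, 7), repeat=2)]
N = len(sums)  # 36 equally likely outcomes

p_odd = sum(s % 2 == 1 for s in sums) / N      # 18/36
p_less_9 = sum(s < 9 for s in sums) / N        # 26/36
p_twelve = sum(s == 12 for s in sums) / N      # 1/36
p_four = sum(s == 4 for s in sums) / N         # 3/36
```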

2. A simple gambling game is played as follows. The gambler throws three fair coins
together with one fair die and wins £1 if the number of heads obtained in the
throw of the coins is greater than or equal to the score on the die. Calculate the
probability that the gambler wins.


Solution:
The total number of equally likely outcomes is:

N = 2 × 2 × 2 × 6 = 48.

Here it is convenient to organise the calculations by numbers in descending order –


in this case, the value on the die. Clearly, we are only interested in cases for which
the highest die value is 3, 2 or 1.

Die shows 3: the only winning outcome is HHH ⇒ 1 case.

Die shows 2: the winning outcomes are HHH, HHT, HT H, T HH ⇒ 4 cases.

Die shows 1: the winning outcomes are


HHH, HHT, HT H, T HH, HT T, T HT, T T H ⇒ 7 cases.

We can simply add the numbers of cases, so there are n = 12 favourable outcomes.
Hence the required probability is 12/48 = 1/4 = 0.25.
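The 12 winning outcomes can be counted mechanically; this Python sketch (illustrative only) enumerates all 48 coin and die combinations.

```python
# Question 2: the gambler wins if the number of heads >= the die score.
from itertools import product

wins = total = 0
for coins in product("HT", repeat=3):      # 8 coin outcomes
    for die in range(1, 7):                # 6 die outcomes
        total += 1
        if coins.count("H") >= die:
            wins += 1
```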

3. Three fair dice are thrown and you are told that at least one of the upturned faces
shows a 4. Using this fact, determine the (conditional) probabilities of:
(a) exactly one 4
(b) exactly two 4s.
Solution:
We first identify N , the total number of equally likely outcomes when at least one
die shows a 4, i.e. we restrict our attention to the ‘smaller’ sample space, rather
than the ‘full’ sample space of 6³ = 216 outcomes. If the first die shows a 4, then
the other two can be anything, so the ‘pattern’ is 4XY . There are six choices for
each of X and Y , and these choices are independent of each other. Hence there are
6 × 6 = 36 outcomes in which the first die shows a 4.
If the first die does not show a 4, then there are 11 possible combinations of values
for the other two dice. To see why, one method is to list the possible pairs of values:

left-hand number equal to 4 : 46, 45, 44, 43, 42, 41


left-hand number not equal to 4 : 64, 54, 34, 24, 14

where, of course, the outcome ‘44’ can only be shown in one of the two lists. This
‘listing’ approach works because the number of possibilities for the two dice is small
enough such that we can list all of those in which we are interested (unlike with
three dice).
Another, slightly different, method is to calculate the numbers in each row
(without listing them all). For the pattern 4X there are six choices of X, hence six
cases, and for the pattern [not 4]4 there are five choices for [not 4] and hence five
cases, giving eleven possibilities altogether. As mentioned above, anything except
short lists should be avoided, so it is useful to know lots of alternative ways for
working out the number of relevant outcomes.


Combining these remarks, we have:

N = 36 + (5 × 11) = 91.

We now identify the number of favourable outcomes, n, for two or more 4s.
If the first die shows a 4, then there has to be at least one 4 on the other two, so
there are 11 ways for the other two dice. If the first die does not show a 4, then
there are 5 choices for what it might show, while both the other two dice must
show 4s. Clearly, the two 4s can only happen in 1 way. So:

n = 11 + (5 × 1) = 16

and exactly one of these cases, (4, 4, 4), represents more than two 4s, so that fifteen
represent exactly two 4s.
Hence the required probabilities are as follows.
(a) We have:

P (exactly one 4) = (91 − 16)/91 = 75/91 = 0.8242.

(b) We have:

P (exactly two 4s) = 15/91 = 0.1648.
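The restricted sample space argument can be double-checked by listing all 216 outcomes and keeping those with at least one 4 (a Python sketch, not part of the guide).

```python
# Question 3: condition on 'at least one 4' across three fair dice.
from itertools import product

conditioned = [d for d in product(range(1, 7), repeat=3) if 4 in d]
N = len(conditioned)                                     # 91 outcomes
exactly_one = sum(d.count(4) == 1 for d in conditioned)  # 75
exactly_two = sum(d.count(4) == 2 for d in conditioned)  # 15
```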

4. Use a Venn diagram, or otherwise, to ‘prove’ the following.


(a) P (Ac ) = 1 − P (A).
(b) P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
(c) If A implies B, then P (A) ≤ P (B).
(d) P (A ∪ B ∪ C) =
P (A) + P (B) + P (C) − P (A ∩ B) − P (A ∩ C) − P (B ∩ C) + P (A ∩ B ∩ C).
Solution:
These results can be seen very easily from a Venn diagram (although a
mathematician would not accept such a graphical approach as a ‘proof’, but we will
not worry about that here). The following is the Venn diagram for events A, B and
C. The lower case letters, that is a, b, . . . , g, stand for the areas (which represent the
probabilities) of the events for which the region surrounding each such letter stands.

[Venn diagram: three overlapping circles A, B and C, with regions labelled a (A only), b (B only), c (C only), d (A ∩ B only), e (A ∩ C only), f (B ∩ C only) and g (A ∩ B ∩ C).]

(a) With the same notation from the diagram, the area for Ac is everything
outside the area for A, so P (Ac ) = 1 − P (A).


(b) With the notation from the diagram:

P (A ∪ B) = a + e + f + b + (d + g)
= (a + d + g + e) + (d + b + f + g) − (d + g)
= P (A) + P (B) − P (A ∩ B).
(c) If A implies B, then this means that the region for A is completely inside the
region for B, i.e. that the intersection of A and B is the whole of A. Hence the
area of the shape for A is no larger than the area of the shape for B, i.e.
P (A) ≤ P (B).
(d) It is clear from the diagram that:
area(A ∪ B ∪ C) = a + b + c + d + e + f + g
P (A) = a + d + e + g
P (B) = b + d + f + g
P (C) = c + e + f + g
P (A ∩ B) = d + g
P (A ∩ C) = e + g
P (B ∩ C) = f + g
P (A ∩ B ∩ C) = g.
So P (A) + P (B) + P (C) − P (A ∩ B) − P (A ∩ C) − P (B ∩ C) + P (A ∩ B ∩ C)
equals:

(a + d + e + g) + (b + d + f + g) + (c + e + f + g)
− ((f + g) + (e + g) + (d + g)) + g
=a+b+c+d+e+f +g
= area(A ∪ B ∪ C)
= P (A ∪ B ∪ C).

5. In a class of 35 students, 20 have taken Economics, 15 have taken History, and 10


have taken Politics. Of these, 9 have taken both Economics and History, 5 have
taken both Economics and Politics, and 6 have taken both History and Politics. 3
students have taken all three subjects. What is the probability that a student
chosen at random from the class has taken at least one of these subjects?
Solution:
This is an application of the previous question with:

A = Economics, B = History and C = Politics.

So:

P (at least one subject) = (20 + 15 + 10 − 9 − 5 − 6 + 3)/35 = 28/35 = 0.80.
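The inclusion-exclusion arithmetic can be captured in a few lines (an illustrative Python sketch, not from the guide).

```python
# Question 5: inclusion-exclusion for three events.
from fractions import Fraction

nA, nB, nC = 20, 15, 10          # Economics, History, Politics
nAB, nAC, nBC, nABC = 9, 5, 6, 3

n_union = nA + nB + nC - nAB - nAC - nBC + nABC  # 28 students
p = Fraction(n_union, 35)                        # 28/35
```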

6. Two fair dice are rolled.


(a) What is the probability of two sixes?


(b) What is the probability of at least one six?
Solution:
(a) By independence:

P (two sixes) = 1/6 × 1/6 = 1/36.

(b) We have:

P (at least one 6) = P (6 on first die and/or 6 on second die)
= P (6 on first die) + P (6 on second die) − P (6 on both dice)
= 1/6 + 1/6 − 1/36
= 11/36
= 0.3056.

7. A student has an important job interview in the morning. To ensure he wakes up in


time, he sets two alarm clocks which ring with probabilities 0.97 and 0.99,
respectively. What is the probability that at least one of the alarm clocks will wake
him up?
Solution:
We are told that the alarm clocks are independent. Therefore:
P (at least one rings) = 1 − P (neither rings) = 1 − (0.03 × 0.01) = 0.9997.

8. A fair coin is tossed 10 times.


(a) What is the probability that the 10th toss results in a head?
(b) If the first 9 tosses give 4 heads and 5 tails, what is the probability that the
10th toss results in a head?
(c) If the first 9 tosses give 9 heads and no tails, what is the probability that the
10th toss results in a head?
Solution:
The answer is 0.5 in each case! The coin has no ‘memory’. This is because the
successive throws of a fair coin are independent.

9. The successful operation of three separate switches is needed to control a machine.


If the probability of failure of each switch is 0.10 and the failure of any switch is
independent of any other switch, what is the probability that the machine will
break down?
Solution:
The probability that the machine will not break down equals the probability that
all three switches work. This equals 0.90 × 0.90 × 0.90 = 0.729, by independence.
Hence the probability it will break down is 1 − 0.729 = 0.271.


10. Of all the candles produced by a company, 0.01% do not have wicks (the core piece
of string). A retailer buys 10,000 candles from the company.
(a) What is the probability that all the candles have wicks?
(b) What is the probability that at least one candle will not have a wick?
Solution:
C Let X be the number of candles without a wick.
(a) We have:
P (X = 0) = (0.9999)^10,000 = 0.3679.

(b) We have:
P (X ≥ 1) = 1 − P (X = 0) = 1 − 0.3679 = 0.6321.
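This is a classic case where the answer is close to e⁻¹; the sketch below (not part of the guide) checks the computation and the Poisson approximation.

```python
# Question 10: 10,000 candles, each lacking a wick with probability 0.0001.
import math

p_all_have_wicks = 0.9999 ** 10_000
p_at_least_one_missing = 1 - p_all_have_wicks

# With rate lambda = 10000 * 0.0001 = 1, the Poisson approximation is e^(-1).
poisson_approx = math.exp(-1)
```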

11. A family has four children.


(a) What is the probability of at least one boy among the children?
(b) What is the probability that all the children are of the same sex?
Solution:
For our sample space we can take (in order of age):

GGGG
GGGB GGBG GBGG BGGG
GGBB GBGB BGGB GBBG BGBG BBGG
BBBG BBGB BGBB GBBB
BBBB

So we have N = 2⁴ = 16 possible outcomes.


(a) For ‘at least one boy’ the number of favourable points is 15 (all excluding
GGGG), so:

P (at least one boy) = 15/16 = 0.9375.

An alternative (and easier) method is to note that:

P (all girls) = 1/2 × 1/2 × 1/2 × 1/2 = 1/16

so:

P (at least one boy) = 1 − P (all girls) = 1 − 1/16 = 15/16 = 0.9375.

(b) For ‘all of the same sex’ the number of favourable points is 2, so:

P (all of the same sex) = 2/16 = 1/8 = 0.125.

12. Calculate the probabilities of the following events.


(a) Five independent tosses of a fair coin yield the sequence HHT T H.
(b) Five independent tosses of a fair coin yield Hs on the first and last tosses.


(c) A pair of fair dice is thrown and the faces are equal.
(d) Five cards are drawn at random from a deck of cards and all are the same suit.
(e) Four independent tosses of a fair coin result in at least two heads. What is the
probability that all four tosses are heads?
Solution:
(a) By independence, we have (0.50)⁵ = 0.03125.
(b) By independence (noting we only care about the first and last outcomes), we
have (0.50)² = 0.25.
(c) Of the 36 equally likely outcomes, 6 have equal faces, hence the probability is
6/36 = 1/6 = 0.1667.
(d) Note this is sampling without replacement, and we can have any card as the
first card drawn, the probability is:
1 × 12/51 × 11/50 × 10/49 × 9/48 = 0.00198.
(e) The sample space has 11 points, all equally likely, and one is the event of
interest, so the probability is 1/11 = 0.0909.
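Two of these answers lend themselves to a quick check, one by exact fraction arithmetic and one by enumeration (a Python sketch, not from the guide).

```python
# Question 12, parts (d) and (e).
from fractions import Fraction
from itertools import product

# (d) Five cards of the same suit, drawn without replacement:
p_same_suit = (Fraction(12, 51) * Fraction(11, 50)
               * Fraction(10, 49) * Fraction(9, 48))

# (e) P(all heads | at least two heads) over four fair tosses:
at_least_two = [t for t in product("HT", repeat=4) if t.count("H") >= 2]
p_all_heads = Fraction(sum(t == ("H",) * 4 for t in at_least_two),
                       len(at_least_two))
```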

13. A company is concerned about interruptions to email. It was noticed that problems
occurred on 15% of workdays. To see how bad the situation is, calculate the
probabilities of an interruption during a five-day working week:
(a) on Monday and again on Tuesday
(b) for the first time on Thursday
(c) every day
(d) at least once during the week.
Solution:
(a) This is the probability of the event occurring on Monday and on Tuesday.
Each has probability 0.15 and the outcome is independent of day of the week,
so calculate:

(0.15)² = 0.0225.
(b) We require the probability of the event not occurring on the first three days of
the week, but occurring on Thursday. So the result will be:
(1 − 0.15)³ × 0.15 = 0.092.

(c) Similar to (a), but for each weekday. We need to find:


(0.15)⁵ = 0.000076.

(d) The probability of no occurrence in the week will be that of no occurrence on


any of the five days, and this is:
(0.85)⁵ = 0.444.
Therefore the probability of at least one occurrence will be:
1 − 0.444 = 0.556.


14. Two fair coins, a 10p and a 50p, are tossed.


(a) Find the probability that:
i. both coins show heads
ii. different faces show up
iii. at least one head shows.
(b) You are told that the 10p shows heads. What is the probability that both
coins show heads?
(c) You are told that at least one of the two coins shows heads. What is the
probability that both coins show heads?
Solution:
Let the sample space be S = {HH, HT, T H, T T } where, for example, ‘HT ’ means
‘heads on the 10p coin and tails on the 50p coin’. Each element is equally likely, so
each has a probability equal to 0.25.
(a) i. We have P (HH) = 0.25.
ii. We have P (different faces) = P (HT or T H) = P (HT ) + P (T H) = 0.50.
iii. We have P (at least 1 head) = P (HT or T H or HH) =
P (HT ) + P (T H) + P (HH) = 0.75.
(b) To find P (both show heads given that the 10p shows heads) we define:

A = both show heads = {HH} and B = 10p shows heads = {HT, HH}

and use conditional probability, i.e. the result that P (A | B) = P (A ∩ B)/P (B).
To do this we note that in this case the event A ∩ B is the same as the event A
(which is important to spot, when it happens). Hence:
P (HH | 10p shows heads) = P (A ∩ B)/P (B) = P (A)/P (B) = P (HH)/P (HT or HH) = 0.25/0.50 = 0.50.

(c) Here we want to find P (A | C) = P (A ∩ C)/P (C) where:

C = at least one head = {HH, HT, T H}.

Hence P (HH | at least one head) equals:

P (A ∩ C)/P (C) = P (A)/P (C) = P (HH)/P (HH or HT or TH) = 0.25/0.75 = 0.33.

15. A fair coin is thrown six times in succession.


(a) Find the probability that exactly three throws come up heads.
(b) Assuming that there are exactly three heads, what is the (conditional)
probability that they are on consecutive throws?
Solution:
There are two ‘fair’ choices for the outcome of each throw, hence there are two
equally likely outcomes (H and T , using obvious notation) of each. Moreover, we


can assume that one throw’s outcome does not affect the others. That is, the
outcomes of the different throws are independent, which (always!) means we
multiply the probabilities. Combining these two thoughts, there are:

N = 2 × 2 × 2 × 2 × 2 × 2 = 2⁶ = 64

items (‘outcomes’) in the full, six-throw sample space, each with probability:

1/2 × 1/2 × 1/2 × 1/2 × 1/2 × 1/2 = (1/2)⁶ = 1/64.

(a) It is (just about) possible to list accurately the 20 cases in which there are
exactly three heads (and hence exactly three tails), identifying the four cases
(top row) in which the three Hs are adjacent.

HHHT T T T HHHT T T T HHHT T T T HHH 3 Hs together


HHT T T H HHT T HT HHT HT T T HHT T H
T HHT T H T T HHT H HT T T HH HT T HHT 2 consecutive Hs
HT HHT T T HT T HH T HT T HH T T HT HH
HT HT HT HT HT T H HT T HT H T HT HT H All Hs separate

To make sure that you count everything, but just once, it is essential to
organise the listed items according to some pattern.

Hence:
P (exactly 3 Hs) = n/N = 20/64 = 5/16 = 0.3125.

(b) If there are exactly three heads on consecutive throws, then it is possible to
specify which of these different outcomes we are dealing with by identifying
how many (if any) T s there are before the sequence of Hs. (The remaining T s
must be after the sequence of Hs.) The possible values (out of 3) are 0, 1, 2
and 3, so there are four cases.

Hence:

P (3 consecutive Hs | exactly 3 Hs) = n/N = 4/20 = 1/5 = 0.20.
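Both parts can be verified by enumerating all 64 sequences (a Python sketch, not part of the guide).

```python
# Question 15: six tosses of a fair coin.
from itertools import product

tosses = list(product("HT", repeat=6))                  # 64 sequences
three_heads = [t for t in tosses if t.count("H") == 3]  # 20 sequences

# With exactly three heads, 'HHH' as a substring means they are consecutive.
consecutive = [t for t in three_heads if "HHH" in "".join(t)]

p_a = len(three_heads) / len(tosses)       # 20/64
p_b = len(consecutive) / len(three_heads)  # 4/20
```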


16. Let K be the event of drawing a ‘king’ from a well-shuffled deck of playing cards.
Let D be the event of drawing a ‘diamond’ from the pack. Determine:
(a) P (K)       (f) P (K | D)
(b) P (D)       (g) P (D | K)
(c) P (K c )    (h) P (D ∪ K c )
(d) P (K ∩ D)   (i) P (Dc ∩ K)
(e) P (K ∪ D)   (j) P ((Dc ∩ K) | (D ∪ K)).
Are the events D and K independent, mutually exclusive, neither or both?
Solution:
(a) We have P (K) = 4/52 = 1/13.
(b) We have P (D) = 13/52 = 1/4.
(c) We have P (K c ) = 1 − 1/13 = 12/13.
(d) We have P (K ∩ D) = 1/52.
(e) We have P (K ∪ D) = 16/52 = 4/13.
(f) We have P (K | D) = 1/13.
(g) We have P (D | K) = 1/4.
(h) We have P (D ∪ K c ) = 49/52.
(i) We have P (Dc ∩ K) = 3/52.
(j) We have P (Dc ∩ K | D ∪ K) = 3/16.
The events are independent since P (K ∩ D) = P (K) P (D), but they are not
mutually exclusive (consider the King of Diamonds).
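All ten probabilities follow from simple set operations on a constructed deck; the sketch below (illustrative, not part of the guide) verifies a few of them along with the independence claim.

```python
# Question 16: kings and diamonds in a 52-card deck.
from fractions import Fraction
from itertools import product

ranks = ["A"] + [str(r) for r in range(2, 11)] + ["J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = set(product(ranks, suits))  # 52 cards

K = {c for c in deck if c[0] == "K"}          # kings
D = {c for c in deck if c[1] == "diamonds"}   # diamonds

def P(event):
    return Fraction(len(event), len(deck))

independent = P(K & D) == P(K) * P(D)   # True
mutually_exclusive = len(K & D) == 0    # False: the King of Diamonds
```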


17. At a local school, 90% of the students took test A, and 15% of the students took
both test A and test B. Based on the information provided, which of the following
calculations are not possible, and why? What can you say based on the data?
(a) P (B | A).
(b) P (A | B).
(c) P (A ∪ B).
If you knew that everyone who took test B also took test A, how would that
change your answers?
Solution:
(a) Possible. We have:

P (B | A) = P (A ∩ B)/P (A) = 0.15/0.90 = 0.167.

(b) Not possible because P (B) is unknown and we would need to calculate:

P (A | B) = P (A ∩ B)/P (B) = 0.15/P (B).

(c) Not possible because P (A ∪ B) = P (A) + P (B) − P (A ∩ B) and, as discussed


in (b), we do not know P (B).
If we knew that everyone who took test B also took test A, then P (A | B) = 1 and
hence A ∪ B = A. Therefore P (A ∪ B) = P (A) = 0.90.

18. Given two events, A and B, state why each of the following is not possible. Use
formulae or equations to illustrate your answer.
(a) P (A) = −0.46.
(b) P (A) = 0.26 and P (Ac ) = 0.62.
(c) P (A ∩ B) = 0.92 and P (A ∪ B) = 0.42.
(d) P (A ∩ B) = P (A) P (B) and P (B) > P (B | A).
Solution:
(a) Not possible because a probability cannot be negative since 0 ≤ P (A) ≤ 1.
(b) Not possible because P (A) + P (Ac ) = 1. Here 0.26 + 0.62 = 0.88 ≠ 1.
(c) Not possible because A ∩ B is a subset of A ∪ B, hence we have
P (A ∩ B) ≤ P (A ∪ B). Here 0.92 > 0.42.
(d) Not possible because two events cannot both be independent (as implied by
the condition P (A ∩ B) = P (A) P (B)) and dependent (as implied by the
condition P (B) > P (B | A)) at the same time.

19. Events A, B and C have the following probabilities:

P (A) = 0.40 P (B) = 0.50 P (C) = 0.10

P (A | B) = 0.20 P (B | C) = 0 P (C | A) = 0.25.


(a) Are the following statements ‘True’ or ‘False’ ?

i. C implies A.
ii. B and C are mutually exclusive.
iii. A ∪ B ∪ C = S.
(b) What is P (B | A)?
(c) What is P (B | Ac )?
Solution:
(a) i. The necessary condition for C implies A is that P (A | C) = 1. Using
Bayes’ formula, we see that:

P (A | C) = P (C | A) P (A)/P (C) = (0.25 × 0.40)/0.10 = 1
so the statement is true.


ii. P (B | C) = 0, and using Bayes’ formula it follows that P (B ∩ C) = 0, so
we see that (ignoring continuous distributions) the two events are
mutually exclusive, so the statement is true.
iii. Since C implies A, it follows that:

P (A ∪ B ∪ C) = P (A ∪ B)
= P (A) + P (B) − P (A ∩ B)
= 0.40 + 0.50 − P (A | B) P (B)
= 0.40 + 0.50 − 0.20 × 0.50
= 0.80.

Since P (S) = 1, it follows that A ∪ B ∪ C ≠ S, so the statement is false.


(b) Applying Bayes’ formula, we have:

P (B | A) = P (B ∩ A)/P (A) = P (A | B) P (B)/P (A) = (0.20 × 0.50)/0.40 = 0.25.

(c) Applying Bayes’ formula, we have:

P (B | Ac ) = P (Ac | B) P (B)/P (Ac ) = (0.80 × 0.50)/0.60 = 2/3 = 0.6667.

20. In an audit Bill analyses 60% of the audit items and George analyses 40%. Bill’s
error rate is 5% and George’s error rate is 3%. Suppose an item is sampled at
random.
(a) What is the probability that it is in error (i.e. audited incorrectly)?
(b) If the chosen item is incorrect what is the probability that Bill is to blame?


Solution:
Let B = Bill audits item, G = George audits item, and E = incorrect audit. Hence
P (B) = 0.60, P (G) = 0.40, P (E | B) = 0.05 and P (E | G) = 0.03.
(a) Using the total probability formula:
P (E) = P (E | B) P (B) + P (E | G) P (G) = (0.05 × 0.60) + (0.03 × 0.40) = 0.042.

(b) Using the conditional probability formula:


P (B | E) = P (E | B) P (B)/P (E) = (0.05 × 0.60)/0.042 = 0.71.
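The same two steps, the total probability formula followed by Bayes' formula, translate directly into code (a sketch, not part of the guide).

```python
# Question 20: who is to blame for an incorrect audit?
p_b, p_g = 0.60, 0.40       # P(Bill), P(George)
p_e_b, p_e_g = 0.05, 0.03   # P(E | Bill), P(E | George)

# Total probability formula:
p_e = p_e_b * p_b + p_e_g * p_g     # 0.042

# Bayes' formula:
p_b_given_e = p_e_b * p_b / p_e     # about 0.71
```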

21. Two fair coins are tossed. You are told that ‘at least one is a head’. What is the
probability that both are heads?
Solution:
At first sight it may appear that this should be equal to 1/2, but this is wrong! The
correct solution may be achieved in several ways; here we shall use a conditional
probability approach.
Let HH denote the event ‘both coins show heads’, and let A denote the event ‘at
least one coin shows a head’. Then:
P (HH | A) = P (A ∩ HH)/P (A) = P (HH)/P (A) = (1/4)/(3/4) = 1/3.
Note that A ∩ HH = HH, i.e. ‘at least one head’ and ‘both show heads’ only
corresponds to ‘both show heads’, HH.

22. The probability of a horse winning a race is 0.30 if it is dry and 0.50 if it is wet.
The weather forecast gives the chance of rain as 40%.
(a) Find the probability that the horse wins.
(b) If you are told that the horse lost the race, what is the probability that the
weather was dry on the day of the race? (Assuming you cannot remember!)
Solution:
Let A = ‘horse wins’, Ac = ‘horse loses’, D = ‘dry’ and Dc = ‘wet’. We have:
P (A | D) = 0.30, P (A | Dc ) = 0.50, P (D) = 0.60 and P (Dc ) = 0.40.

(a) We have:
P (A) = P (A | D) P (D) + P (A | Dc ) P (Dc ) = 0.30 × 0.60 + 0.50 × 0.40 = 0.38.
Hence the horse has a probability of 0.38 of winning the race.
(b) From (a), we can determine that there is a 62% chance that the horse loses the
race, i.e. P (Ac ) = 0.62. Hence:
P (D | Ac ) = P (Ac | D) P (D)/P (Ac ) = (0.70 × 0.60)/0.62 = 0.6774.
P (A ) 0.62
That is, the probability it is dry, given that the horse loses, is 0.6774, or about
68%.


23. A restaurant manager classifies customers as well-dressed, casually-dressed or


poorly-dressed and finds that 50%, 40% and 10%, respectively, fall into these
categories. The manager found that wine was ordered by 70% of the well-dressed,
by 50% of the casually-dressed and by 30% of the poorly-dressed.
(a) What is the probability that a randomly chosen customer orders wine?

(b) If wine is ordered, what is the probability that the person ordering is
well-dressed?
(c) If wine is not ordered, what is the probability that the person ordering is
poorly-dressed?

Solution:
The following notation is used:
• W = well-dressed
• C = casually-dressed
• P = poorly dressed
• O = wine ordered.

(a) We have:

P (O) = P (O | W ) P (W ) + P (O | C) P (C) + P (O | P ) P (P )
= (0.70 × 0.50) + (0.50 × 0.40) + (0.30 × 0.10)
= 0.58.

(b) Using Bayes’ theorem:

P (W | O) = P (O | W ) P (W )/P (O) = (0.70 × 0.50)/0.58 ≈ 0.60.

(c) We require:

P (P | Oc ) = P (Oc | P ) P (P )/P (Oc ) = (0.70 × 0.10)/0.42 ≈ 0.17.

24. In a large lecture, 60% of students self-identify as female and 40% self-identify as
male. Records show that 15% of female students and 20% of male students are
registered as part-time students.
(a) If a student is chosen at random from the lecture, what is the probability that
the student studies part-time?
(b) If a randomly chosen student studies part-time, what is the probability that
the student is male?


Solution:
Let P T = ‘part time’, F = ‘female’ and M = ‘male’.
(a) We have:

P (P T ) = P (P T | F ) P (F )+P (P T | M ) P (M ) = (0.15×0.60)+(0.20×0.40) = 0.17.

(b) We have:
P (M | P T ) = P (P T | M ) P (M )/P (P T ) = (0.20 × 0.40)/0.17 = 0.4706.

25. 20% of men show early signs of losing their hair. 2% of men carry a gene that is
related to hair loss. 80% of men who carry the gene experience early hair loss.
(a) What is the probability that a man carries the gene and experiences early hair
loss?
(b) What is the probability that a man carries the gene, given that he experiences
early hair loss?
Solution:
Using obvious notation, P (H) = 0.20, P (G) = 0.02 and P (H | G) = 0.80.
(a) We require:

P (G ∩ H) = P (G) P (H | G) = 0.02 × 0.80 = 0.016.

(b) We require:
P (G | H) = P (G ∩ H)/P (H) = 0.016/0.20 = 0.08.

26. James is a salesman for a company and sells two products, A and B. He visits three
different customers each day. For each customer, the probability that James sells
product A is 1/3 and the probability is 1/4 that he sells product B. The sale of
product A is independent of the sale of product B during any visit, and the results
of the three visits are mutually independent. Calculate the probability that James
will:
(a) sell both products, A and B, on the first visit
(b) sell only one product during the first visit
(c) make no sales of product A during the day
(d) make at least one sale of product B during the day.
Solution:
Let A = ‘product A sold’ and B = ‘product B sold’.
(a) We have:
P (A ∩ B) = P (A) P (B) = 1/3 × 1/4 = 1/12.

(b) We have:

P (A ∩ B c ) + P (Ac ∩ B) = (1/3 × 3/4) + (2/3 × 1/4) = 5/12.

(c) Since P (Ac ) = 2/3, then:

P (no A sales all day) = (2/3)³ = 8/27.

(d) We have:

P (at least 1 B sale) = 1 − P (no B sales) = 1 − (3/4)³ = 37/64.

27. Tower Construction Company (‘Tower’) is determining whether it should submit a


bid for a new shopping centre.
In the past, Tower’s main competitor, Skyrise Construction Company (‘Skyrise’),
has submitted bids 80% of the time. If Skyrise does not bid on a job, the
probability that Tower will get the job is 0.60.
If Skyrise does submit a bid, the probability that Tower gets the job is 0.35.
(a) What is the probability that Tower will get the job?
(b) If Tower gets the job, what is the probability that Skyrise made a bid?
(c) If Tower did not get the job, what is the probability that Skyrise did not make
a bid?
Solution:
(a) We have:
P (Tower gets it) = (0.20 × 0.60) + (0.80 × 0.35) = 0.40.
(b) We have:

P (Skyrise bid | Tower gets job) = P (Tower gets job | Skyrise bid) P (Skyrise bid)/P (Tower gets job)
= (0.35 × 0.80)/0.40
= 0.70.
(c) We have:

P (Skyrise did not bid | Tower did not get job)
= P (Tower did not get job | Skyrise did not bid) P (Skyrise did not bid)/P (Tower did not get job)
= (0.40 × 0.20)/0.60
= 2/15
≈ 0.13.


28. Hard!
There are 3 identical urns.
• The first urn contains 7 red balls and 3 white balls.
• The second urn contains 5 red balls and 5 white balls.
• The third urn contains 4 red balls and 8 white balls.
One of the urns is selected at random (i.e. each urn has a 1-in-3 chance of being
selected). Balls are then drawn from the selected urn without replacement.
• The first ball is red.
• The second ball is white.
• The third ball is red.
• The fourth ball is red.

At each stage of sampling, calculate the probabilities of the selected urn being Urn
I, Urn II or Urn III.

Solution:
The following tables show the relevant calculations. The first table gives the mix
of balls remaining in each urn at the start and after each draw, together with the
probability of each draw coming from each urn; the second gives the updated
probabilities of each urn after each draw.

Mix of balls (after each round):

Urn    Start     Round 1    Round 2    Round 3    Round 4
I      7R, 3W    6R, 3W     6R, 2W     5R, 2W     4R, 2W
II     5R, 5W    4R, 5W     4R, 4W     3R, 4W     2R, 4W
III    4R, 8W    3R, 8W     3R, 7W     2R, 7W     1R, 7W

P (Ball) × P (Urn) at each round:

Urn    Round 1 (red)         Round 2 (white)          Round 3 (red)           Round 4 (red)
I      7/10 × 1/3 = 0.2333   3/9 × 0.4565 = 0.1522    6/8 × 0.3097 = 0.2323   5/7 × 0.4526 = 0.3232
II     5/10 × 1/3 = 0.1667   5/9 × 0.3262 = 0.1812    4/8 × 0.3687 = 0.1844   3/7 × 0.3593 = 0.1540
III    4/12 × 1/3 = 0.1111   8/11 × 0.2174 = 0.1581   3/10 × 0.3217 = 0.0965  2/9 × 0.1880 = 0.0418
Sum    0.5111                0.4915                   0.5132                  0.5190

P (Urn) after each round:

Urn    Start     Round 1                  Round 2                  Round 3                  Round 4
I      0.3333    0.2333/0.5111 = 0.4565   0.1522/0.4915 = 0.3097   0.2323/0.5132 = 0.4526   0.3232/0.5190 = 0.6227
II     0.3333    0.1667/0.5111 = 0.3262   0.1812/0.4915 = 0.3687   0.1844/0.5132 = 0.3593   0.1540/0.5190 = 0.2967
III    0.3333    0.1111/0.5111 = 0.2174   0.1581/0.4915 = 0.3217   0.0965/0.5132 = 0.1880   0.0418/0.5190 = 0.0805

Note that the ‘Sum’ row is calculated using the total probability formula, and the
urn probabilities are computed using Bayes’ formula.
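The repeated application of the total probability and Bayes' formulas is naturally written as a loop; the following Python sketch (not part of the guide) reproduces the final column, with tiny differences from the printed values arising only because the table rounds at each intermediate step.

```python
# Question 28: sequential Bayesian updating over the three urns.
urns = {"I": [7, 3], "II": [5, 5], "III": [4, 8]}  # [red, white] counts
posterior = {u: 1 / 3 for u in urns}               # equal prior for each urn

for colour in ["R", "W", "R", "R"]:                # the observed draws
    idx = 0 if colour == "R" else 1
    # likelihood of this colour under each urn's current contents
    like = {u: urns[u][idx] / sum(urns[u]) for u in urns}
    # total probability of the draw, then Bayes' update
    p_draw = sum(like[u] * posterior[u] for u in urns)
    posterior = {u: like[u] * posterior[u] / p_draw for u in urns}
    # the drawn ball is not replaced
    for u in urns:
        urns[u][idx] -= 1
```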


C.2 Practice problems


1. When throwing a die, suppose:

S = {1, 2, 3, 4, 5, 6}, E = {3, 4} and F = {4, 5, 6}.

Determine:
(a) F c
(b) E c ∩ F c
(c) (E ∪ F )c
(d) E c ∩ F .

2. Draw the appropriate Venn diagram to show each of the following in connection
with Question 1:
(a) E ∪ F = {3, 4, 5, 6}
(b) E ∩ F = {4}
(c) E c = {1, 2, 5, 6}.

3. Consider the following information.

Supplier Delivery time


Early On time Late Total
Jones 20 20 10 50
Smith 10 90 50 150
Robinson 0 10 90 100
Total 30 120 150 300

What are the probabilities associated with a delivery chosen at random for each of
the following?
(a) A delivery is early.
(b) A delivery is from Smith.
(c) A delivery is from Jones and late.

4. There are three sites a company may move to: A, B and C. We are told that P (A)
(the probability of a move to A) is 1/2, and P (B) = 1/3. What is P (C)?

5. Two events A and B are independent with P (A) = 1/3 and P (B) = 1/4. What is
P (A ∩ B)?

6. Say whether the following statement is true or false and briefly give your reason(s).
‘If two events are independent, then they must be mutually exclusive.’

7. If X can take values of 1, 2 and 4 with P (X = 1) = 0.30, P (X = 2) = 0.50 and


P (X = 4) = 0.20, calculate:
(a) P (X² < 4)
(b) P (X > 2 | X is an even number).


8. Write down and illustrate the use in probability of:


(a) the additive law
(b) the multiplicative rule.

9. A coffee machine may be defective because it dispenses the wrong amount of coffee,
event C, and/or it dispenses the wrong amount of sugar, event S. The probabilities
of these defects are:
P (C) = 0.05, P (S) = 0.04 and P (C ∩ S) = 0.01.

Determine the proportion of cups of coffee with:


(a) at least one defect
(b) no defects.

10. A company gets 60% of its supplies from manufacturer A, and the remainder from
manufacturer Z. The quality of the parts delivered is given below:

Manufacturer % Good parts % Bad parts


A 97 3
Z 93 7

(a) The probabilities of receiving good or bad parts can be represented by a probability tree. Show, for example, that the probability that a randomly chosen part comes from A and is bad is 0.018.
(b) Show that the sum of the probabilities of all outcomes is 1.
(c) The way the probability tree is used depends on the information required. For
example, show that the probability tree can be used to show that the
probability of receiving a bad part is 0.028 + 0.018 = 0.046.

11. A company has a security system comprising of four electronic devices (A, B, C
and D) which operate independently. Each device has a probability of 0.10 of
failure. The four electronic devices are arranged such that the whole system
operates if at least one of A or B functions and at least one of C or D functions.
Show that the probability that the whole system functions properly is 0.9801.
(Use set theory and the laws of probability, or a probability tree.)

12. A student can enter a course either as a beginner (73% of all students) or as a
transferring student (27% of all students). It is found that 62% of beginners
eventually graduate, and that 78% of transferring students eventually graduate.
(a) Find the probability that a randomly chosen student:
i. is a beginner who will eventually graduate
ii. will eventually graduate
iii. is either a beginner or will eventually graduate, or both.
(b) Are the events ‘eventually graduates’ and ‘enters as a transferring student’
statistically independent?


(c) If a student eventually graduates, what is the probability that the student
entered as a transferring student?
(d) If two entering students are chosen at random, what is the probability that not
only do they enter in the same way but that they also both graduate or both
fail?

C.3 Solutions to Practice problems
1. We have:
S = {1, 2, 3, 4, 5, 6}, E = {3, 4} and F = {4, 5, 6}.

(a) F c means the elements not in F , which are 1, 2 and 3.


(b) E c ∩ F c means the elements not in E and not in F . The elements not in E are 1, 2, 5 and 6; of these, only 1 and 2 are also not in F . So the answer is 1 and 2.
(c) (E ∪ F )c means elements which are neither in E nor F , so the elements in E
or F or both are 3, 4, 5 and 6. So, once again, the answer is 1 and 2.
(d) E c ∩ F are the elements not in E and in F . E c elements are 1, 2, 5 and 6 and
F elements are 4, 5, and 6, so the answer is 5 and 6.
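As an illustrative aside (not part of the original guide), these set operations can be checked directly with Python's built-in set type; S, E and F are taken straight from the question:

```python
# The sample space and events from Question 1, using Python's built-in set type.
S = {1, 2, 3, 4, 5, 6}
E = {3, 4}
F = {4, 5, 6}

F_c = S - F                      # (a) complement of F: {1, 2, 3}
Ec_and_Fc = (S - E) & (S - F)    # (b) not in E and not in F: {1, 2}
not_E_or_F = S - (E | F)         # (c) in neither E nor F: {1, 2}
Ec_and_F = (S - E) & F           # (d) not in E but in F: {5, 6}
```

Note that (b) and (c) produce the same set, illustrating De Morgan's law (E ∪ F)c = E c ∩ F c.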

2. The Venn diagram is:

(a) E ∪ F is shown by the total area enclosed by both circles.


(b) E ∩ F is the intersection of the two circles.
(c) E c is the area completely outside the circle which encloses 3 and 4.

3. (a) Of the total number of equally likely outcomes, 300, there are 30 which are early. Hence the required probability is 30/300 = 0.10.
(b) Again, of the total number of equally likely outcomes, 300, there are 150 from Smith. Hence the required probability is 150/300 = 0.50.
(c) Now, of the 300, there are only 10 which are late and from Jones. Hence the required probability is 10/300 = 1/30.
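The same probabilities can be read off the table programmatically. The sketch below (an illustrative addition, using exact fractions from Python's standard library) stores the delivery counts and divides by the overall total:

```python
from fractions import Fraction

# Delivery counts from the table in Question 3 (rows: supplier; columns: delivery time).
counts = {
    'Jones':    {'early': 20, 'on time': 20, 'late': 10},
    'Smith':    {'early': 10, 'on time': 90, 'late': 50},
    'Robinson': {'early': 0,  'on time': 10, 'late': 90},
}
total = sum(sum(row.values()) for row in counts.values())          # 300 deliveries

p_early = Fraction(sum(row['early'] for row in counts.values()), total)   # (a) 1/10
p_smith = Fraction(sum(counts['Smith'].values()), total)                  # (b) 1/2
p_jones_late = Fraction(counts['Jones']['late'], total)                   # (c) 1/30
```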


4. We have that P (A) + P (B) + P (C) = 1. Therefore:

P (C) = 1 − (1/2 + 1/3) = 1 − 5/6 = 1/6.

5. Since A and B are independent:

P (A ∩ B) = P (A) P (B) = 1/3 × 1/4 = 1/12.
6. False. If two events are independent, then the occurrence (or non-occurrence) of
one event has no impact on whether or not the other event occurs. Mutually
exclusive events cannot both occur at the same time. A simple counterexample to
show the statement is false would be the events A = ‘a coin landing heads up’, and
B = ‘rolling a 6 on a die’. The outcome of the coin toss has no effect on the die
outcome, and vice versa. However, it is possible to get heads and a 6, so the events
are not mutually exclusive.

7. (a) P (X² < 4) = P (X = 1) = 0.3, since X² < 4 is only satisfied when X = 1.

(b) We require the following conditional probability:

P (X > 2 | X is an even number) = P (X = 4) / (P (X = 2) + P (X = 4)) = 0.2 / (0.5 + 0.2) = 0.2857.
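A quick numerical cross-check of both parts (an illustrative addition; pmf simply encodes the probability function from the question):

```python
# The probability function from Question 7 and both required probabilities.
pmf = {1: 0.30, 2: 0.50, 4: 0.20}

p_sq_lt_4 = sum(p for x, p in pmf.items() if x**2 < 4)    # only x = 1 satisfies x^2 < 4
p_even = sum(p for x, p in pmf.items() if x % 2 == 0)     # P(X is even) = 0.70
p_cond = pmf[4] / p_even                                  # P(X > 2 | X even) ≈ 0.2857
```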

8. (a) The additive law, for any two events A and B, is:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
For example, consider a Venn diagram in which two overlapping circles represent A and B: let area x be the part of A outside B, area y the part of B outside A, and area z their intersection.

We have P (A) = area x + area z, P (B) = area y + area z, P (A ∩ B) = area z


and P (A ∪ B) = area x + area y + area z. Therefore:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
= (area x + area z) + (area y + area z) − (area z)
= area x + area y + area z.
Therefore, to compute P (A ∪ B), we need to subtract P (A ∩ B) otherwise that
region would have been counted twice.


(b) The multiplicative law, for any two independent events A and B, is:

P (A ∩ B) = P (A) P (B).

There is no clear way of representing this on a Venn diagram. However, this is


a very important result whenever events are independent, such as when
randomly sampling objects with replacement.
9. (a) From the given probabilities we obtain:

P (C ∪ S) = P (C) + P (S) − P (C ∩ S) = 0.05 + 0.04 − 0.01 = 0.08.

(b) Having no defects is the complement of the event of having at least one defect,
hence:
P (no defects) = 1 − P (C ∪ S) = 1 − 0.08 = 0.92.

10. Note that, for each manufacturer, the percentages of good and bad parts total 100.
(a) The probability that a randomly chosen part comes from A is 0.6 (60%), the
probability that one of A’s parts is bad is 0.03 (3%), so the probability that a
randomly chosen part comes from A and is bad is 0.6 × 0.03 = 0.018.
(b) Add all the outcomes, which gives:

(0.6×0.97)+(0.6×0.03)+(0.4×0.93)+(0.4×0.07) = 0.582+0.018+0.372+0.028 = 1.

(c) The probability of receiving a bad part is the probability of receiving a bad part either from A or from Z, which is:

(0.6 × 0.03) + (0.4 × 0.07) = 0.018 + 0.028 = 0.046.
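The probability tree can also be enumerated in a few lines of code (an illustrative addition; the dictionaries simply encode the two stages of the tree):

```python
# Stage 1: which manufacturer the part comes from; Stage 2: good or bad.
p_supplier = {'A': 0.6, 'Z': 0.4}
p_bad_given = {'A': 0.03, 'Z': 0.07}

outcomes = {}
for s, ps in p_supplier.items():
    outcomes[(s, 'bad')] = ps * p_bad_given[s]
    outcomes[(s, 'good')] = ps * (1 - p_bad_given[s])

p_A_and_bad = outcomes[('A', 'bad')]                       # (a) 0.018
total = sum(outcomes.values())                             # (b) all outcomes sum to 1
p_bad = outcomes[('A', 'bad')] + outcomes[('Z', 'bad')]    # (c) 0.046
```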

11. Here we have:

Probabilities
Device Not fail Fail
A 0.9 0.1
B 0.9 0.1
C 0.9 0.1
D 0.9 0.1

The system fails if A and B both fail, or if C and D both fail.
To work out the probability that the system works properly, first work out the probability it will fail, P (F ). The probability it will work is then P (F c ) = 1 − P (F ).
The system fails if any of the following (mutually exclusive) events occur:
• A, B, C and D all fail: (0.1)⁴ = 0.0001
• exactly three devices fail (ABC, ABD, ACD or BCD): (0.1)³ × 0.9 × 4 = 0.0036
• A and B fail and C and D are okay: 0.1 × 0.1 × 0.9 × 0.9 = 0.0081
• C and D fail and A and B are okay: 0.1 × 0.1 × 0.9 × 0.9 = 0.0081.


So the total probability that the system will fail is 0.0001 + 0.0036 + 0.0081 + 0.0081 = 0.0199, which makes the probability it will run smoothly 1 − 0.0199 = 0.9801.
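Because there are only 2⁴ = 16 on/off configurations of the four devices, the answer can also be verified by exhaustive enumeration (an illustrative addition, not part of the guide):

```python
from itertools import product

# Each device works with probability 0.9, independently; the system operates
# if at least one of A, B works and at least one of C, D works.
p_fail = 0.1
p_works = 0.0
for a, b, c, d in product([True, False], repeat=4):   # True = device works
    prob = 1.0
    for works in (a, b, c, d):
        prob *= (1 - p_fail) if works else p_fail
    if (a or b) and (c or d):
        p_works += prob
# p_works = (1 - 0.1^2) * (1 - 0.1^2) = 0.99 * 0.99 = 0.9801
```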

12. Let B = ‘a beginner student’, T = ‘a transferring student’ and G = ‘will graduate’.


Hence P (B) = 0.73, P (T ) = 0.27, P (G | B) = 0.62 and P (G | T ) = 0.78.
(a) i. We need P (B ∩ G), which can be derived from:

P (G | B) = P (B ∩ G)/P (B)  ⇒  P (B ∩ G) = P (G | B) P (B) = 0.62 × 0.73 = 0.4526.

ii. Using the total probability formula:

P (G) = P (G | B) P (B) + P (G | T ) P (T ) = 0.62 × 0.73 + 0.78 × 0.27 = 0.4526 + 0.2106 = 0.6632.

iii. The required probability is:

P (B ∪ G) = P (B) + P (G) − P (B ∩ G) = 0.73 + 0.6632 − 0.4526 = 0.9406.

(b) We need to check whether P (G ∩ T ) = P (G) P (T ). We have P (G ∩ T ) = P (G | T ) P (T ) = 0.78 × 0.27 = 0.2106, whereas P (G) P (T ) = 0.6632 × 0.27 = 0.1791. Since 0.2106 ≠ 0.1791, the events are not statistically independent.

(c) This is a conditional probability, given by:

P (T | G) = P (T ∩ G)/P (G) = P (G | T ) P (T )/P (G) = 0.2106/0.6632 = 0.3176.

(d) We already have P (G ∩ T ) = 0.2106 and P (G ∩ B) = 0.4526. We also require:

P (Gc ∩ T ) = P (Gc | T ) P (T ) = (1 − 0.78) × 0.27 = 0.0594

and:

P (Gc ∩ B) = P (Gc | B) P (B) = (1 − 0.62) × 0.73 = 0.2774.

Two students being chosen at random can be considered as independent events, hence the required probability is:

(0.2106)² + (0.4526)² + (0.0594)² + (0.2774)² = 0.3297.
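All of these quantities can be recomputed in a few lines (an illustrative addition; the variable names are ad hoc):

```python
# Probabilities given in Question 12.
p_B, p_T = 0.73, 0.27          # beginner / transferring entry
p_G_B, p_G_T = 0.62, 0.78      # graduation rates, conditional on entry route

p_BG = p_G_B * p_B                           # P(B ∩ G)
p_G = p_G_B * p_B + p_G_T * p_T              # total probability formula
p_B_or_G = p_B + p_G - p_BG                  # additive law
p_TG = p_G_T * p_T
independent = abs(p_TG - p_G * p_T) < 1e-9   # False, so not independent
p_T_given_G = p_TG / p_G                     # Bayes' theorem
p_same = p_BG**2 + p_TG**2 + ((1 - p_G_B) * p_B)**2 + ((1 - p_G_T) * p_T)**2
```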

Appendix D
Random variables, the normal and
sampling distributions

D.1 Worked examples


1. A fair die is thrown once. Determine the probability distribution of the value of the
upturned face, X, and find its mean and variance.
Solution:
Each face has an equal chance of 1/6 of turning up, so we get the following table:

X=x 1 2 3 4 5 6 Total
P (X = x) 1/6 1/6 1/6 1/6 1/6 1/6 1
x P (X = x) 1/6 2/6 3/6 4/6 5/6 6/6 21/6 = 3.50
x2 P (X = x) 1/6 4/6 9/6 16/6 25/6 36/6 91/6

Hence the mean is:

E(X) = Σ x P (X = x) = 1 × 1/6 + 2 × 1/6 + · · · + 6 × 1/6 = 21/6 = 3.50.

Also:

E(X²) = Σ x² P (X = x) = 1² × 1/6 + 2² × 1/6 + · · · + 6² × 1/6 = 91/6.

Hence the variance is:

Var(X) = E(X²) − µ² = 91/6 − (3.50)² = 2.92.
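Working with exact fractions avoids any rounding in this calculation (an illustrative addition, not part of the guide):

```python
from fractions import Fraction

# Exact mean and variance of a single fair die roll.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())       # 21/6 = 7/2 = 3.50
ex2 = sum(x**2 * p for x, p in pmf.items())     # 91/6
var = ex2 - mean**2                             # 91/6 - 49/4 = 35/12 ≈ 2.92
```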

2. Two fair dice are thrown.


(a) Determine the probability distribution of the sum of the two dice, X, and find
its mean and variance.
(b) Determine the probability distribution of the absolute difference of the two
dice, Y , and find its mean and variance.
Solution:
(a) The pattern is made clearer by using the same denominator (i.e. 36) below.
X=x 2 3 4 5 6 7
P (X = x) 1/36 2/36 3/36 4/36 5/36 6/36
x P (X = x) 2/36 6/36 12/36 20/36 30/36 42/36
x2 P (X = x) 4/36 18/36 48/36 100/36 180/36 294/36


X=x 8 9 10 11 12 Total
P (X = x) 5/36 4/36 3/36 2/36 1/36 1
x P (X = x) 40/36 36/36 30/36 22/36 12/36 252/36
x2 P (X = x) 320/36 324/36 300/36 242/36 144/36 1,974/36

This yields a mean of E(X) = 252/36 = 7, and a variance of E(X²) − µ² = 1,974/36 − 7² = 5.83. Although not required, a plot of the distribution is:

[Figure: bar chart titled ‘Probability distribution – sum of two dice’, plotting probability (vertical axis, roughly 0.04 to 0.16) against the value of the sum (2 to 12); the distribution is symmetric with its peak at 7.]

(b) Again, for clarity, we use the same denominator.

Y =y 0 1 2 3 4 5 Total
P (Y = y) 6/36 10/36 8/36 6/36 4/36 2/36 1
y P (Y = y) 0 10/36 16/36 18/36 16/36 10/36 70/36
y 2 P (Y = y) 0 10/36 32/36 54/36 64/36 50/36 210/36

This yields a mean of E(Y ) = 70/36 = 1.94, while the variance is:

Var(Y ) = E(Y ²) − µ² = 210/36 − (1.94)² = 2.05.


Again, although not required, a plot of the distribution is:

[Figure: bar chart titled ‘Probability distribution – difference in two dice’, plotting probability (vertical axis, roughly 0.05 to 0.25) against the absolute value of the difference (0 to 5); the distribution peaks at 1 and decreases thereafter.]
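Both distributions can be generated by enumerating all 36 equally likely outcomes (an illustrative addition; the helper functions dist and mean_var are ad hoc names):

```python
from fractions import Fraction
from itertools import product

def dist(stat):
    """Distribution of stat(a, b) over all 36 equally likely rolls (a, b)."""
    d = {}
    for a, b in product(range(1, 7), repeat=2):
        v = stat(a, b)
        d[v] = d.get(v, Fraction(0)) + Fraction(1, 36)
    return d

def mean_var(d):
    """Exact mean and variance of a distribution given as {value: probability}."""
    m = sum(v * p for v, p in d.items())
    return m, sum(v**2 * p for v, p in d.items()) - m**2

mean_sum, var_sum = mean_var(dist(lambda a, b: a + b))          # 7 and 35/6 ≈ 5.83
mean_diff, var_diff = mean_var(dist(lambda a, b: abs(a - b)))   # 35/18 ≈ 1.94 and ≈ 2.05
```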

3. The probability function P (X = x) = 0.02x is defined for x = 8, 9, 10, 11 and 12.


What are the mean and variance of this probability distribution?
Solution:
The probability distribution is:

X=x 8 9 10 11 12
P (X = x) 0.16 0.18 0.20 0.22 0.24

The mean is:

E(X) = Σ x P (X = x) = (8 × 0.16) + (9 × 0.18) + · · · + (12 × 0.24) = 10.20.

To determine the variance, we first find E(X²), given by:

E(X²) = Σ x² P (X = x) = (8² × 0.16) + (9² × 0.18) + · · · + (12² × 0.24) = 106.

Hence:

Var(X) = 106 − (10.20)² = 1.96.

4. In a prize draw, the probabilities of winning various amounts of money are:

Prize (in £) 0 1 50 100 500


Probability of win 0.35 0.50 0.11 0.03 0.01

What is the expected value and standard deviation of the prize?


Solution:
The expected value is:

E(X) = Σ x P (X = x) = (500 × 0.01) + (100 × 0.03) + · · · + (0 × 0.35) = £14.00.

To calculate the standard deviation, we first need E(X²), given by:

E(X²) = Σ x² P (X = x) = ((500)² × 0.01) + ((100)² × 0.03) + · · · + (0² × 0.35) = 3,075.50.

So the variance is:

Var(X) = 3,075.50 − (14)² = 2,879.50

and hence the standard deviation is √2,879.50 = £53.66.

5. If Z ∼ N (0, 1), determine:

(a) P (0 < Z < 1.20)
(b) P (−0.68 < Z < 0)
(c) P (−0.46 < Z < 2.21)
(d) P (0.81 < Z < 1.94)
(e) P (Z < −0.60)
(f) P (Z > −1.28)
(g) P (Z > 2.05).
Solution:
All of these probabilities can be obtained from Table 4 of the New Cambridge Statistical Tables, which provides P (Z ≤ z) for z ≥ 0.
(a) We have:

P (0 < Z < 1.20) = P (Z < 1.20) − P (Z < 0) = 0.8849 − 0.50 = 0.3849.

(b) We have:

P (−0.68 < Z < 0) = P (Z < 0) − (1 − P (Z < 0.68))


= 0.50 − (1 − 0.7517)
= 0.2517.

(c) We have:

P (−0.46 < Z < 2.21) = P (Z < 2.21) − (1 − P (Z < 0.46))


= 0.98645 − (1 − 0.6772)
= 0.66365.

(d) We have:

P (0.81 < Z < 1.94) = P (Z < 1.94) − P (Z < 0.81) = 0.9738 − 0.7910 = 0.1828.

(e) We have:

P (Z < −0.60) = 1 − P (Z < 0.60) = 1 − 0.7257 = 0.2743.


(f) We have:
P (Z > −1.28) = P (Z < 1.28) = 0.8997.

(g) We have:

P (Z > 2.05) = 1 − P (Z < 2.05) = 1 − 0.97982 = 0.02018.
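Instead of Table 4, the standard normal CDF in Python's statistics module can be used to cross-check a few of these lookups (an illustrative addition; the values agree with the tables to four decimal places):

```python
from statistics import NormalDist

# Standard normal distribution, Z ~ N(0, 1).
Z = NormalDist(0, 1)

a = Z.cdf(1.20) - Z.cdf(0)   # (a) ≈ 0.3849
e = Z.cdf(-0.60)             # (e) ≈ 0.2743
g = 1 - Z.cdf(2.05)          # (g) ≈ 0.0202
```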

6. Suppose X ∼ N (10, 4).


(a) Find:
i. P (X > 13.4)

ii. P (8 < X < 9).


(b) Find the value a such that P (10 − a < X < 10 + a) = 0.95.
(c) Find the value b such that P (10 − b < X < 10 + b) = 0.99.
(d) How far above the mean of the standard normal distribution must we go such
that only 1% of the probability remains in the right-hand tail?
(e) How far below the mean of the standard normal distribution must we go such
that only 5% of the probability remains in the left-hand tail?

Solution:
Since X ∼ N (10, 4), we use the transformation Z = (X − µ)/σ with the values µ = 10 and σ = √4 = 2.
(a) i. We have:

P (X > 13.4) = P (Z > (13.4 − 10)/2) = P (Z > 1.70) = 1 − P (Z ≤ 1.70) = 1 − 0.9554 = 0.0446.

ii. We have:

P (8 < X < 9) = P ((8 − 10)/2 < Z < (9 − 10)/2)
             = P (−1 < Z < −0.50)
             = P (Z < −0.50) − P (Z < −1)
             = (1 − P (Z ≤ 0.50)) − (1 − P (Z ≤ 1))
             = (1 − 0.6915) − (1 − 0.8413)
             = 0.1498.


(b) We want to find the value a such that P (10 − a < X < 10 + a) = 0.95, that is:

0.95 = P (((10 − a) − 10)/2 < Z < ((10 + a) − 10)/2)
     = P (−a/2 < Z < a/2)
     = P (Z < a/2) − P (Z < −a/2)
     = 1 − 2 × P (Z < −a/2).

This is the same as 2 × P (Z > a/2) = 0.05, i.e. P (Z > a/2) = 0.025, or P (Z ≤ a/2) = 0.975. Therefore, from Table 4 of the New Cambridge Statistical Tables, a/2 = 1.96, and so a = 3.92.

(c) We want to find the value b such that P (10 − b < X < 10 + b) = 0.99. Similar reasoning shows that P (Z ≤ b/2) = 0.995. Therefore, from Table 4, b/2 = 2.58 (approximately), so that b = 5.16.

(d) We want k such that P (Z > k) = 0.01, or P (Z ≤ k) = 0.99. From Table 4, k = 2.33 (approximately).

(e) We want x such that P (Z < x) = 0.05. This means that x < 0 and P (Z > |x|) = 0.05, or P (Z ≤ |x|) = 0.95, so, from Table 4, |x| = 1.645 (by interpolating between Φ(1.64) and Φ(1.65)) and hence x = −1.645.
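The quantile lookups can be reproduced with an inverse CDF rather than tables (an illustrative addition). Note that the tables' rounded value 2.58 gives b = 5.16, whereas the exact quantile 2.5758... gives b ≈ 5.15:

```python
from statistics import NormalDist

# Standard normal quantiles; X ~ N(10, 4), so σ = 2 multiplies each z-value.
Z = NormalDist(0, 1)

a = 2 * Z.inv_cdf(0.975)   # (b): σ·z with P(Z ≤ z) = 0.975, ≈ 3.92
b = 2 * Z.inv_cdf(0.995)   # (c): ≈ 5.15 (tables give 5.16)
k = Z.inv_cdf(0.99)        # (d): upper 1% point, ≈ 2.33
x = Z.inv_cdf(0.05)        # (e): lower 5% point, ≈ -1.645
```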

7. Your company requires a special type of light bulb which is available from only two
suppliers. Supplier A’s light bulbs have a mean lifetime of 2,000 hours with a
standard deviation of 180 hours. Supplier B’s light bulbs have a mean lifetime of
1,850 hours with a standard deviation of 100 hours. The distribution of the
lifetimes of each type of light bulb is normal. Your company requires that the
lifetime of a light bulb be not less than 1,500 hours. All other things being equal,
which type of bulb should you buy, and why?
Solution:
Let A and B be the random variables representing the lifetimes (in hours) of light
bulbs from supplier A and supplier B, respectively. We are told that:

A ∼ N (2,000, (180)2 ) and B ∼ N (1,850, (100)2 ).

Since the relevant criterion is that light bulbs last at least 1,500 hours, the
company should choose the supplier whose light bulbs have a greater probability of
doing so. We find that:

P (A > 1,500) = P (Z > (1,500 − 2,000)/180) = P (Z > −2.78) = P (Z < 2.78) = 0.99728

and:

P (B > 1,500) = P (Z > (1,500 − 1,850)/100) = P (Z > −3.50) = P (Z < 3.50) = 0.9998.


Therefore, the company should buy light bulbs from supplier B, since they have a
greater probability of lasting the required time.
Note it is good practice to define notation and any units of measurement and to
state the distributions of the random variables. Note also that here it is not essential
to compute the probability values in order to determine what the company should
do, since −2.78 > −3.50 implies that P (Z > −2.78) < P (Z > −3.50).

8. A normal distribution has a mean of 40. If 10% of the distribution falls between the
values of 50 and 60, what is the standard deviation of the distribution?
Solution:
Let X ∼ N (40, σ²). We seek σ, and know that:

P (50 ≤ X ≤ 60) = P ((50 − 40)/σ ≤ Z ≤ (60 − 40)/σ) = P (10/σ ≤ Z ≤ 20/σ) = 0.10.

Hence one z-value (i.e. 20/σ) is twice the other (i.e. 10/σ), and their corresponding probabilities differ by 0.10. We also know that the z-values are positive since both represent numbers larger than the mean of 40. Now we need to use Table 4 of the New Cambridge Statistical Tables to find two such z-values. Looking at Table 4 we find the values to be, approximately, 1.25 and 2.50. Therefore, 10/σ = 1.25, and so σ = 8.

9. The life, in hours, of a light bulb is normally distributed with a mean of 200 hours.
If a consumer requires at least 90% of the light bulbs to have lives exceeding 150
hours, what is the largest value that the standard deviation can have?
Solution:
Let X be the random variable representing the lifetime of a light bulb (in hours), so that for some value σ we have X ∼ N (200, σ²). We want P (X > 150) = 0.90, such that:

P (X > 150) = P (Z > (150 − 200)/σ) = P (Z > −50/σ) = 0.90.

Note that this is the same as P (Z < 50/σ) = 0.90, so 50/σ = 1.28, giving σ = 39.06.

10. The random variable X has a normal distribution with mean µ and variance σ 2 , i.e.
X ∼ N (µ, σ 2 ). It is known that:

P (X ≤ 66) = 0.0359 and P (X ≥ 81) = 0.1151.

(a) Produce a clearly-labelled sketch to represent these probabilities on a normal


curve.
(b) Show that the value of σ = 5.
(c) Calculate P (69 ≤ X ≤ 83).


Solution:
(a) The sketch should show a normal curve with the lower-tail area P (X ≤ 66) = 0.0359 shaded in one colour and the upper-tail area P (X ≥ 81) = 0.1151 shaded in another.

(b) We have X ∼ N (µ, σ²), where µ and σ² are unknown. Using Table 4 of the New Cambridge Statistical Tables, we find that P (Z ≤ −1.80) = 0.0359 and P (Z ≥ 1.20) = 0.1151. Therefore:

P (X ≤ 66) = P (Z ≤ (66 − µ)/σ) = P (Z ≤ −1.80) = 0.0359

and:

P (X ≥ 81) = P (Z ≥ (81 − µ)/σ) = P (Z ≥ 1.20) = 0.1151.

Hence:

(66 − µ)/σ = −1.80 and (81 − µ)/σ = 1.20.

Rearranging, we obtain a pair of simultaneous equations which can be solved to find µ and σ. Specifically:

66 − µ = −1.80σ and 81 − µ = 1.20σ.

Subtracting the first from the second gives 15 = 3σ, and hence σ = 5. For completeness, 81 − µ = 6, so µ = 75. Therefore, X ∼ N (75, 5²).
(c) Given X ∼ N (75, 5²), we have:

P (69 ≤ X ≤ 83) = P ((69 − 75)/5 ≤ Z ≤ (83 − 75)/5)
              = P (−1.20 ≤ Z ≤ 1.60)
              = Φ(1.60) − (1 − Φ(1.20))
              = 0.9452 − (1 − 0.8849)
              = 0.8301.
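The same two-equation solution can be automated (an illustrative addition): recover the z-values from the given tail probabilities, solve for σ and µ, then compute part (c):

```python
from statistics import NormalDist

# Recover µ and σ from the two given tail probabilities.
Z = NormalDist(0, 1)
z1 = Z.inv_cdf(0.0359)        # ≈ -1.80, since P(X ≤ 66) = 0.0359
z2 = Z.inv_cdf(1 - 0.1151)    # ≈  1.20, since P(X ≥ 81) = 0.1151

# Solve 66 = µ + z1·σ and 81 = µ + z2·σ simultaneously.
sigma = (81 - 66) / (z2 - z1)
mu = 66 - z1 * sigma

X = NormalDist(mu, sigma)
p = X.cdf(83) - X.cdf(69)     # part (c), ≈ 0.8301
```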


11. The number of newspapers sold daily at a kiosk is normally distributed with a
mean of 350 and a standard deviation of 30.
(a) Find the probability that fewer than 300 newspapers are sold on a particular
day.
(b) How many newspapers should the newsagent stock each day such that the
probability of running out on any particular day is 10%?
Solution:
Let X be the number of newspapers sold, hence X ∼ N (350, 900).

(a) We have:

P (X < 300) = P (Z < (300 − 350)/30) = P (Z < −1.67) = 1 − P (Z < 1.67) = 1 − 0.9525 = 0.0475.

(b) Let s be the required stock, then we require P (X > s) = 0.10. Hence:

P (Z > (s − 350)/30) = 0.10  ⇒  (s − 350)/30 ≥ 1.28  ⇒  s ≥ 350 + 1.28 × 30 = 388.4.

Rounding up, the required stock is 389.

12. Consider the following set of data. Does it appear to approximately follow a normal
distribution? Justify your answer.

45 31 37 55 54 56
48 54 52 55 52 51
49 46 62 38 45 48
47 46 40 61 50 58
46 35 36 59 50 48
39 48 51 52 43 45

Solution:
To see whether this set of data approximates a normal distribution, we need to
analyse it. Using a calculator we calculate the mean to be µ = 48.1 and the
standard deviation to be σ = 7.3 (assuming population data).
For ± one standard deviation, i.e. µ ± σ, the interval is (40.8, 55.4) which contains
24 observations, representing 24/36 = 67% of the data.
For ± two standard deviations, i.e. µ ± 2σ, the interval is (33.1, 62.7) which
contains 35 observations, representing 35/36 = 97% of the data.


These percentages match very closely what we expect for a normal distribution (approximately 68% and 95%, respectively). We could also construct a histogram of the data, which appears to confirm that the data do indeed approximate a normal distribution.

13. Consider the population below with N = 4 elements:

A B C D
3 6 9 12

(a) Calculate the population mean and variance.


(b) Write down the sampling distribution of the sample mean for samples of size
n = 2 drawn without replacement where order does not matter.
(c) Using the result in (b), calculate the mean of the sampling distribution.
(d) Using the result in (b), calculate the variance of the sampling distribution.

Solution:

(a) The population mean is:

µ = (Σ xᵢ)/N = (3 + 6 + 9 + 12)/4 = 7.50.

The population variance is:

σ² = (Σ xᵢ²)/N − µ² = (9 + 36 + 81 + 144)/4 − (7.50)² = 11.25.


(b) The sampling distribution of the sample mean for samples of size n = 2 is:
Sample Values X̄ = x̄ P (X̄ = x̄)
AB 3 6 4.5 1/6
AC 3 9 6 1/6
AD 3 12 7.5 1/6
BC 6 9 7.5 1/6
BD 6 12 9 1/6
CD 9 12 10.5 1/6
(c) The mean of the sampling distribution is:

E(X̄) = (4.5 + 6 + 7.5 + 7.5 + 9 + 10.5)/6 = 45/6 = 7.50.

(d) The variance of the sampling distribution is:

Var(X̄) = ((4.5)² + (6)² + (7.5)² + (7.5)² + (9)² + (10.5)²)/6 − (7.5)² = 3.75.
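The enumeration in parts (b) to (d) is exactly what itertools.combinations produces (an illustrative addition, using exact fractions):

```python
from fractions import Fraction
from itertools import combinations

# All samples of size n = 2 drawn without replacement (order irrelevant).
population = [3, 6, 9, 12]
means = [Fraction(a + b, 2) for a, b in combinations(population, 2)]

e_xbar = sum(means) / len(means)                                # mean of the sampling distribution
var_xbar = sum(m**2 for m in means) / len(means) - e_xbar**2    # its variance
```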

14. A random variable, X, can take the values 1, 2 and 3, each with equal probability.
List all possible samples of size two which may be chosen when order matters,
without replacement, from this population, and hence construct the sampling
distribution of the sample mean, X̄.
Solution:
Each possible sample has an equal probability of occurrence of 1/6. The sampling
distribution of X̄ is:

Possible samples (1, 2) (1, 3) (2, 3)


(2, 1) (3, 1) (3, 2)

Sample mean, X̄ = x̄ 1.5 2 2.5

Relative frequency, P (X̄ = x̄) 1/3 1/3 1/3

15. Discuss the differences or similarities between a sampling distribution of size 5 and
a single (simple) random sample of size 5.
Solution:
If 5 members are selected from a population such that every possible set of 5
population members has the same probability of being selected, then the sample is
a simple random sample. In a sampling distribution of size 5, every possible sample
of size 5 from the population is averaged and the result is the sampling
distribution. The similarity is the inclusive nature of both the simple random
sample as well as the sampling distribution.

16. A perfectly-machined regular tetrahedral (pyramid-shaped) die has four faces


labelled 1 to 4. It is tossed twice onto a level surface and after each toss the number
on the face which is downward is recorded. If the recorded values are x1 and x2 ,
when order matters, then the observed sample mean is x̄ = (x1 + x2 )/2. Write out


the sampling distribution of the sample mean as a random quantity over repeated
double tosses.
Solution:
Each possible sample has an equal probability of occurrence of 1/16. The sampling
distribution of X̄ is:

Possible samples (1, 1) (1, 2) (1, 3) (1, 4) (2, 4) (3, 4) (4, 4)


(2, 1) (3, 1) (4, 1) (4, 2) (4, 3)
(2, 2) (2, 3) (3, 3)
D (3, 2)

Sample mean, X̄ = x̄ 1 1.5 2 2.5 3 3.5 4

Relative frequency, P (X̄ = x̄) 1/16 1/8 3/16 1/4 3/16 1/8 1/16

17. The weights of a large group of animals have mean 8.2 kg and standard deviation
2.2 kg. What is the probability that a random selection of 80 animals from the
group will have a mean weight between 8.3 kg and 8.4 kg? State any assumptions
you make.
Solution:
We are not told that the population is normal, but n is ‘large’ so we can apply the central limit theorem. The sampling distribution of X̄ is, approximately:

X̄ ∼ N (µ, σ²/n) = N (8.2, (2.2)²/80).

Hence, using Table 4 of the New Cambridge Statistical Tables:

P (8.3 ≤ X̄ ≤ 8.4) = P ((8.3 − 8.2)/(2.2/√80) ≤ Z ≤ (8.4 − 8.2)/(2.2/√80))
                 = P (0.41 ≤ Z ≤ 0.81)
                 = P (Z ≤ 0.81) − P (Z ≤ 0.41)
                 = 0.7910 − 0.6591
                 = 0.1319.
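The same CLT calculation can be done without tables (an illustrative addition; the exact answer differs slightly from 0.1319 because the z-values 0.41 and 0.81 above are rounded to two decimal places):

```python
from math import sqrt
from statistics import NormalDist

# CLT approximation: X̄ ~ N(8.2, (2.2)^2 / 80), approximately.
mu, sigma, n = 8.2, 2.2, 80
Xbar = NormalDist(mu, sigma / sqrt(n))

p = Xbar.cdf(8.4) - Xbar.cdf(8.3)   # close to the table-based answer of 0.1319
```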

18. A random sample of 25 audits is to be taken from a company’s total audits, and
the average value of these audits is to be calculated.
(a) Explain what is meant by the sampling distribution of this average and discuss
its relationship to the population mean.
(b) Is it reasonable to assume that this sampling distribution is normally
distributed?
(c) If the population of all audits has a mean of £54 and a standard deviation of
£10, find the probability that:


i. the sample mean will be greater than £60


ii. the sample mean will be within 5% of the population mean.
Solution:
(a) The sample average (mean) is formed from 25 observations which are subject
to sampling variability, hence the average is also subject to this variability. Its
sampling distribution describes its probability properties. If a large number of
such averages were independently sampled, then their histogram would be the
(approximate) sampling distribution.
(b) It is reasonable to assume that this sampling distribution is normal due to the central limit theorem (CLT) – although the sample size is rather small.
(c) If n = 25, µ = 54 and σ = 10, then the CLT says that, approximately:

X̄ ∼ N (µ, σ²/n) = N (54, 100/25).

i. We have:

P (X̄ > 60) = P (Z > (60 − 54)/√(100/25)) = P (Z > 3) = 0.00135.

ii. We are asked for:

P (0.95 × 54 < X̄ < 1.05 × 54) = P ((−0.05 × 54)/2 < Z < (0.05 × 54)/2)
                              = P (−1.35 < Z < 1.35)
                              = P (Z < 1.35) − (1 − P (Z < 1.35))
                              = 0.9115 − (1 − 0.9115)
                              = 0.8230.

19. The distribution of salaries of lecturers in a university is positively skewed, with


most lecturers earning near the minimum of the pay scale. What would a sampling
distribution of the sample mean of size 2 look like? How about size 5? How about
size 50?
Solution:
Here we are given a non-normal (skewed) distribution. In order that the sampling
distribution of the sample mean is (approximately) normal, a ‘large’ sample size is
required, according to the central limit theorem. Sampling distributions based on
‘small’ sample sizes tend to resemble the population from which they are drawn.
So, for samples of size 2 and 5, the sampling distributions would probably be
positively skewed. However, the sample of size 50 would be approximately normal.
Regardless, in all cases the sampling distribution would have a mean equal to the
population mean, µ, and a variance of σ 2 /n.


D.2 Practice problems


1. Using Table 4 of the New Cambridge Statistical Tables, calculate:
(a) P (Z ≤ 1)
(b) P (Z ≥ 1)
where Z ∼ N (0, 1).

2. Check that (approximately):


(a) 68% of normal random variables fall within 1 standard deviation of the mean.
(b) 95% of normal random variables fall within 2 standard deviations of the mean.
(c) 99% of normal random variables fall within 3 standard deviations of the mean.
Draw the areas concerned on a normal distribution.

3. Given a normal distribution with a mean of 20 and a variance of 4, what proportion of the distribution would be:
(a) above 22
(b) between 14 and 16?

4. In an examination, the scores of students who attend schools of type A are normally distributed with a mean of 50 and a standard deviation of 5. The scores
of students who attend schools of type B are normally distributed with a mean of
55 and a standard deviation of 6. Which type of school would have a higher
proportion of students with marks below 45?

5. The manufacturer of a new brand of lithium battery claims that the mean life of a
battery is 3,800 hours with a standard deviation of 250 hours. Assume the lifespans
of lithium batteries follow a normal distribution.
(a) What percentage of batteries will last for more than 3,500 hours?
(b) What percentage of batteries will last for more than 4,000 hours?
(c) If 700 batteries are supplied, how many should last between 3,500 and 4,000
hours?

6. The following six observations give the time taken, in seconds, to complete a
100-metre sprint by all six individuals competing in a race. Note this is population
data.

Individual Time
A 15
B 14
C 10
D 12
E 20
F 15


(a) Find the population mean, µ, and the population standard deviation, σ, of the
sprint times.
(b) Calculate the sample mean for each possible sample of:
i. two individuals
ii. three individuals
iii. four individuals.
(c) Work out the mean for each set of sample means (it must come to µ) and
compare the standard deviations of the sample means about µ.
This may take some time, but, after you have done it, you should have a
clearer idea about sampling distributions!

D.3 Solutions to Practice problems


1. Z ∼ N (0, 1), i.e. µ = 0 and σ 2 = 1.
(a) Looking at Table 4 of the New Cambridge Statistical Tables, we see the area
under the curve for which Z ≤ 1 is 0.8413, i.e. P (Z ≤ 1) = 0.8413.
(b) Since the total area under the curve is 1, the area for which Z ≥ 1 is
P (Z ≥ 1) = 1 − 0.8413 = 0.1587.

2. We have the following:

[Diagram: a normal curve with the central area of 0.683 shaded between µ − σ and µ + σ, and the wider central area of 0.95 extending from µ − 1.96σ to µ + 1.96σ.]

(a) If we are looking for 0.68 (approximately) as the area under the curve one
standard deviation either side of µ, we need the grey shaded area in the
diagram above. Half of this area must be 0.68/2 = 0.34 and hence the whole
area to the left of µ + σ must be 0.34 + 0.50 = 0.84.
Looking at Table 4, for Φ(z) = 0.84, we see that z is between 0.99 (where Φ(z)
is 0.8389) and 1.00 (where Φ(z) = 0.8413), i.e. approximately correct.


(b) Similarly, look up Φ(z) of 0.95/2 + 0.50 = 0.475 + 0.50 = 0.975, which gives us
a z of exactly 1.96.
(c) Similarly, look up Φ(z) of 0.99/2 + 0.50 = 0.495 + 0.50 = 0.995, which gives z
between 2.57 and 2.58. Therefore, the third example is more approximate!

3. Let X ∼ N (20, 4).

(a) Standardising, we have:

P (X > 22) = P ((X − 20)/√4 > (22 − 20)/√4) = P (Z > 1) = 1 − Φ(1) = 1 − 0.8413 = 0.1587.

(b) Standardising, we have:

P (14 < X < 16) = P ((14 − 20)/√4 < (X − 20)/√4 < (16 − 20)/√4)
              = P (−3 < Z < −2)
              = (1 − Φ(2)) − (1 − Φ(3))
              = (1 − 0.97725) − (1 − 0.99865)
              = 0.0214.

4. Let A ∼ N (50, 5²) and B ∼ N (55, 6²). We have:

P (A < 45) = P ((A − 50)/5 < (45 − 50)/5) = P (Z < −1)

and:

P (B < 45) = P ((B − 55)/6 < (45 − 55)/6) = P (Z < −1.67).

Clearly, P (Z < −1.67) < P (Z < −1) (if you do not see this immediately, shade these regions on a sketch of the standard normal distribution) and hence schools of type A would have a higher proportion of students with marks below 45.

5. Let X ∼ N (3,800, (250)²).

(a) Standardising, we have (noting the symmetry of the standard normal distribution about zero):

P (X > 3,500) = P (Z > (3,500 − 3,800)/250) = P (Z > −1.2) = Φ(1.2) = 0.8849.

Hence, as a percentage, this is 88.49%.

(b) Standardising, we have:

P (X > 4,000) = P (Z > (4,000 − 3,800)/250) = P (Z > 0.8) = 1 − Φ(0.8) = 1 − 0.7881 = 0.2119.

Hence, as a percentage, this is 21.19%.

(c) Using (a) and (b), we have:

P (3,500 < X < 4,000) = (1 − P (X > 4,000)) − (1 − P (X > 3,500)) = (1 − 0.2119) − (1 − 0.8849) = 0.6730.

Therefore, we would expect 700 × 0.6730 ≈ 471 batteries to last between 3,500 and 4,000 hours.

6. (a) The population mean is:

µ = (1/N) Σ xᵢ = (15 + 14 + 10 + 12 + 20 + 15)/6 = 86/6 = 14.33.

The population standard deviation is:

σ = √((1/N) Σ (xᵢ − µ)²) = √(57.33/6) = √9.56 = 3.09.

(b) i. Case of two individuals


Taking samples of two elements, the 15 possible samples and their means
are:
Sample number Individuals Times x̄i
1 A, B 15, 14 14.5
2 A, C 15, 10 12.5
3 A, D 15, 12 13.5
4 A, E 15, 20 17.5
5 A, F 15, 15 15.0
6 B, C 14, 10 12.0
7 B, D 14, 12 13.0
8 B, E 14, 20 17.0
9 B, F 14, 15 14.5
10 C, D 10, 12 11.0
11 C, E 10, 20 15.0
12 C, F 10, 15 12.5
13 D, E 12, 20 16.0
14 D, F 12, 15 13.5
15 E, F 20, 15 17.5

D. Random variables, the normal and sampling distributions

ii. Case of three individuals


Taking samples of three elements, the 20 possible samples and their means
are:
Sample number Individuals Times x̄i
1 A, B, C 15, 14, 10 13.00
2 A, B, D 15, 14, 12 13.67
3 A, B, E 15, 14, 20 16.33
4 A, B, F 15, 14, 15 14.67
5 A, C, D 15, 10, 12 12.33
6 A, C, E 15, 10, 20 15.00
7 A, C, F 15, 10, 15 13.33
8 A, D, E 15, 12, 20 15.67
9 A, D, F 15, 12, 15 14.00
10 A, E, F 15, 20, 15 16.67
11 B, C, D 14, 10, 12 12.00
12 B, C, E 14, 10, 20 14.67
13 B, C, F 14, 10, 15 13.00
14 B, D, E 14, 12, 20 15.33
15 B, D, F 14, 12, 15 13.67
16 B, E, F 14, 20, 15 16.33
17 C, D, E 10, 12, 20 14.00
18 C, D, F 10, 12, 15 12.33
19 C, E, F 10, 20, 15 15.00
20 D, E, F 12, 20, 15 15.67
iii. Case of four individuals
Taking samples of four elements, the 15 possible samples and their means
are:
Sample number Individuals Times x̄i
1 C, D, E, F 10, 12, 20, 15 14.25
2 B, D, E, F 14, 12, 20, 15 15.25
3 B, C, E, F 14, 10, 20, 15 14.75
4 B, C, D, F 14, 10, 12, 15 12.75
5 B, C, D, E 14, 10, 12, 20 14.00
6 A, D, E, F 15, 12, 20, 15 15.50
7 A, C, E, F 15, 10, 20, 15 15.00
8 A, C, D, F 15, 10, 12, 15 13.00
9 A, C, D, E 15, 10, 12, 20 14.25
10 A, B, E, F 15, 14, 20, 15 16.00
11 A, B, D, F 15, 14, 12, 15 14.00
12 A, B, D, E 15, 14, 12, 20 15.25
13 A, B, C, F 15, 14, 10, 15 13.50
14 A, B, C, E 15, 14, 10, 20 14.75
15 A, B, C, D 15, 14, 10, 12 12.75
(c) For the case of two individuals, the mean of all the sample means is:

        µ = (1/15) Σ x̄i = 215/15 = 14.33.


For the case of three individuals, the mean of all the sample means is:

        µ = (1/20) Σ x̄i = 286.67/20 = 14.33.

For the case of four individuals, the mean of all the sample means is:

        µ = (1/15) Σ x̄i = 215/15 = 14.33.

The standard deviations of X̄ are 1.9551, 1.3824 and 0.9776 for samples of size
2, 3 and 4, respectively. This confirms that the accuracy of our population
mean estimator improves as we increase our sample size because we increase
the amount of information about the population in the sample.
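The full enumeration above can be reproduced with itertools.combinations; this sketch (variable names are ours) recovers the common mean 14.33 and the three standard deviations of X̄:

```python
from itertools import combinations
from math import sqrt

times = [15, 14, 10, 12, 20, 15]     # individuals A to F
N = len(times)
mu = sum(times) / N                  # population mean, 14.33

sd_of_means = {}
for n in (2, 3, 4):
    # every possible sample of size n without replacement, and its mean
    means = [sum(s) / n for s in combinations(times, n)]
    centre = sum(means) / len(means)              # equals mu in every case
    sd_of_means[n] = sqrt(sum((m - centre) ** 2 for m in means) / len(means))

print(round(mu, 2), {n: round(sd, 3) for n, sd in sd_of_means.items()})
# 14.33 {2: 1.955, 3: 1.382, 4: 0.978}
```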

Appendix E
Interval estimation

E.1 Worked examples


1. The reaction times, in seconds, for eight police officers were found to be:

0.28, 0.23, 0.21, 0.26, 0.29, 0.21, 0.25 and 0.22.


Determine a 90% confidence interval for the mean reaction time of all police officers.
Solution:
Here n = 8 is small and σ² is unknown, so the formula for a 90% confidence
interval for µ is:

        x̄ ± t0.05, 7 × s/√n.

It is easy to calculate x̄ = 0.2437 and s = 0.0311, so a 90% confidence interval has
endpoints:

        0.2437 ± 1.895 × 0.0311/√8.
Hence the confidence interval is (0.2229, 0.2645).
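The sample statistics and the interval can be checked as follows; the t-value 1.895 is taken from Table 10, as in the solution:

```python
from statistics import mean, stdev
from math import sqrt

times = [0.28, 0.23, 0.21, 0.26, 0.29, 0.21, 0.25, 0.22]
n = len(times)
x_bar = mean(times)        # 0.2437 to four decimal places
s = stdev(times)           # 0.0311, sample standard deviation (divisor n - 1)

t_crit = 1.895             # t_{0.05, 7} from Table 10
half_width = t_crit * s / sqrt(n)
print(round(x_bar - half_width, 4), round(x_bar + half_width, 4))
```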

2. A business requires an inexpensive check on the value of stock in its warehouse. In


order to do this, a random sample of 50 items is taken and valued. The average
value of these is computed to be £320.41 with a (sample) standard deviation of
£40.60. It is known that there are 9,875 items in the total stock.
(a) Estimate the total value of the stock to the nearest £10,000.
(b) Calculate a 95% confidence interval for the mean value of all items and hence
determine a 95% confidence interval for the total value of the stock.
(c) You are told this confidence interval is too wide for decision purposes and you
are asked to assess how many more items would need to be sampled to obtain
an interval with the same degree of confidence, but with half the width.
Solution:
(a) The total value of the stock is 9,875µ, where µ is the mean value of an item of
stock. We know that X̄ is the obvious estimator of µ, so 9,875X̄ is the obvious
estimator of 9,875µ. Therefore, an estimate of the total value of the stock is:

9,875 × 320.41 = £3,160,000

(to the nearest £10,000).


(b) In this question σ² is unknown but n = 50 is large, so we can approximate tn−1
    with the standard normal distribution. Hence for a 95% confidence interval for
    µ we use t0.025, 49 ≈ z0.025 = 1.96 giving:

        x̄ ± z0.025 × s/√n = 320.41 ± 1.96 × 40.6/√50 = 320.41 ± 11.25 ⇒ (£309.16, £331.66).

    Note that because n is large we have used the standard normal approximation.
    It is more accurate to use a t distribution on 49 degrees of freedom. Using
    Table 10 of the New Cambridge Statistical Tables, the nearest available value is
    t0.025, 50 = 2.009. This gives a 95% confidence interval of:

        x̄ ± t0.025, 50 × s/√n = 320.41 ± 2.009 × 40.6/√50 = 320.41 ± 11.54 ⇒ (£308.87, £331.95)

    so not much of a difference.
To obtain a 95% confidence interval for the total value of the stock, 9,875µ,
multiply the interval endpoints by 9,875. This gives (to the nearest £10,000):

(£3,050,000, £3,280,000).

(c) Increasing the sample size by a factor of k reduces the width of the confidence
    interval by a factor of √k. Hence increasing the sample size by a factor of 4
    will reduce the width of the confidence interval by a factor of 2 (= √4).
    Therefore, we need to increase the sample size from 50 to 4 × 50 = 200, i.e.
    collect another 150 observations.

3. A random sample of 100 voters contained 35 supporters of a particular party.


Compute a 95% confidence interval for the proportion of this party’s supporters in
the population.

Solution:
In this question we are estimating a proportion with n = 100. Let π be the
proportion of the party’s supporters in the population. With p = 35/100 = 0.35
and n = 100, a 95% confidence interval for π is calculated as:
        p ± z0.025 × √(p(1 − p)/n) ⇒ 0.35 ± 1.96 × √(0.35 × 0.65/100) ⇒ (0.257, 0.443).

4. A claimant of Extra Sensory Perception (ESP) agrees to be tested for this ability.
Blindfolded, he claims to be able to identify more randomly chosen cards than
would be expected by pure guesswork.
An experiment is conducted in which 200 playing cards are drawn at random, and
with replacement, from a deck of cards, and the claimant is asked to name their
suits (hearts, diamonds, spades or clubs).
Of the 200 cards he identifies 60 correctly. Compute a 95% confidence interval for
his true probability of identifying a suit correctly. Is this evidence of ESP?


Solution:
We have the sample proportion p = 60/200 = 0.30 and an estimated standard error
of:

        √(0.30 × 0.70/200) = 0.0324.
A 95% confidence interval for the true probability of the correct identification of a
suit is:
        p ± z0.025 × √(p(1 − p)/n) ⇒ 0.30 ± 1.96 × 0.0324 ⇒ (0.236, 0.364).

As 0.25 (pure guesswork) is within this interval, then the claimant’s performance is
not convincing!

5. 400 college students, chosen at random, are interviewed and it is found that 240 use
the refectory.
(a) Use these data to compute a:
i. 95% confidence interval
ii. 99% confidence interval
for the proportion of students who use the refectory.
(b) The college has 12,000 students. The college catering officer claims that the
refectory is used by at least 9,000 students and that the survey has yielded a
low figure due to sampling variability. Is this claim reasonable?

Solution:

(a) i. We have p = 240/400 = 0.60 and the corresponding estimated standard
       error is:

        √(0.60 × 0.40/400) = 0.0245.

       Therefore, a 95% confidence interval for the population proportion is:

        0.60 ± 1.96 × 0.0245 ⇒ (0.5520, 0.6480).

ii. Similarly, a 99% confidence interval for the population proportion is:

0.60 ± 2.576 × 0.0245 ⇒ (0.5369, 0.6631).

(b) A 99% confidence interval for the total number of students using the refectory
    is obtained by multiplying the interval endpoints by the total number of
    students, i.e. 12,000. This gives (6,443, 7,957).
The catering officer’s claim of 9,000 is incompatible with these data as it falls
well above the 99% confidence interval upper endpoint. We conclude that the
actual number of users is well below 9,000.


6. A simple random sample of 100 workers had weekly salaries with a mean of £315
and a standard deviation of £20.
(a) Calculate a 90% confidence interval for the mean weekly salary of all workers
in the factory.
(b) How many more workers should be sampled if it is required that the estimate
is to be within £3 of the true average (again, with 90% confidence)?
Note this means a tolerance of £3 – equivalent to a confidence interval width
of £6.
Solution:
(a) We have n = 100, x̄ = 315 and s = 20. The estimated standard error is:

        s/√n = 20/√100 = 2.

    Hence a 90% confidence interval for the true mean is:

        x̄ ± t0.05, n−1 × s/√n ⇒ 315 ± 1.645 × 2 ⇒ (311.71, 318.29)

    where we use the approximation t0.05, 99 ≈ z0.05 = 1.645 since n is large.
(b) For the tolerance to be 3, we require:

        3 ≥ 1.645 × 20/√n.
Solving this gives n ≥ 120.27, so we round up to get n = 121. Hence we need
to take a further sample of 21 workers.
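Part (b) amounts to solving 1.645 × 20/√n ≤ 3 for the smallest integer n, which can be sketched as:

```python
from math import ceil

z = 1.645     # z_{0.05} for 90% confidence
s = 20
tol = 3       # required tolerance (half-width) in pounds

n = ceil((z * s / tol) ** 2)    # smallest n with z * s / sqrt(n) <= tol
print(n, n - 100)               # 121 21
```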

7. In a market research study to compare two chocolate bar wrappings, 30 out of 50


children preferred a gold wrapping to a silver wrapping, whereas 25 out of 40 adults
preferred the silver wrapping.
(a) Compute a 95% confidence interval for the difference in the true proportions in
favour of gold wrapping between the two groups.
(b) It is decided to take further samples of children and adults so that we finally
sample the same number of adults and children such that the final estimator of
the difference between proportions is within 0.06 of the true difference (with
95% confidence). How many more adults and children need to be sampled?
Solution:
(a) For ‘Children − Adults’, the estimate of the difference in population
    proportions is 30/50 − 15/40 = 0.60 − 0.375 = 0.225, since 15 of the 40 adults
    preferred the gold wrapping. The estimated standard error of this estimate is:

        √(p1(1 − p1)/n1 + p2(1 − p2)/n2) = √(0.60 × 0.40/50 + 0.375 × 0.625/40) = 0.1032.

    A 95% confidence interval is:

        0.225 ± 1.96 × 0.1032 ⇒ (0.023, 0.427).

286
E.1. Worked examples

(b) Let n be the required common sample size. We require:

        1.96 × √((0.60 × 0.40 + 0.375 × 0.625)/n) ≤ 0.06.

Solving gives n = 507. Therefore, we must sample 457 more children and 467
more adults.
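Part (b)'s requirement can be solved for n in the same way as before (a sketch, with our own variable names):

```python
from math import ceil

z = 1.96
tol = 0.06
var_sum = 0.60 * 0.40 + 0.375 * 0.625   # p1(1 - p1) + p2(1 - p2) at the current estimates

n = ceil(var_sum * (z / tol) ** 2)      # common sample size required
print(n, n - 50, n - 40)                # 507 457 467
```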

8. Two market research companies each take random samples to assess the public’s
attitude toward a new government policy. If n represents the sample size and r the
number of people against, the results of these independent surveys are as follows:

                 n     r
    Company 1   400   160
    Company 2   900   324

If π is the population proportion against the policy, compute a 95% confidence


interval to comment on the likely representativeness of the two companies’ results.

Solution:
Here we are estimating the difference between two proportions (if the two
companies’ results are compatible, then there should be no difference between the
proportions). The formula for a 95% confidence interval of the difference is:
        p1 − p2 ± z0.025 × √(p1(1 − p1)/n1 + p2(1 − p2)/n2).

The estimates of the proportions are p1 = 160/400 = 0.40 and p2 = 324/900 = 0.36,
so a 95% confidence interval is:

        0.40 − 0.36 ± 1.96 × √(0.40 × 0.60/400 + 0.36 × 0.64/900) ⇒ (−0.017, 0.097).

Since zero is in this interval it is plausible that the two proportions are the same, so
the companies’ surveys are compatible indicating likely representativeness.

9. A sample of 954 adults in early 1987 found that 23% of them held shares.
(a) Given a UK adult population of 41 million, and assuming a proper random
sample was taken, compute a 95% confidence interval estimate for the number
of shareholders in the UK in 1987 (following liberalisation of financial markets
in the UK).
(b) A ‘similar’ survey the previous year (prior to financial liberalisation) had
found a total of 7 million shareholders. Assuming ‘similar’ means the same
sample size, find a 95% confidence interval estimate of the increase in
shareholders between the two years.


Solution:
(a) Let π be the proportion of shareholders in the population in 1987. Start by
estimating π. We are estimating a proportion and n is large, so a 95%
confidence interval for π is:
        p ± z0.025 × √(p(1 − p)/n) ⇒ 0.23 ± 1.96 × √(0.23 × 0.77/954) ⇒ (0.203, 0.257).

    Therefore, a 95% confidence interval for the number (rather than the
    proportion) of shareholders in the UK in 1987 is obtained by multiplying the
    above interval endpoints by 41 million, resulting in:

        9,430,000 ± 1,107,000 ⇒ (8,323,000, 10,537,000).
Therefore, we estimate there were about 9.43 million shareholders in the UK in
1987, with a margin of error of approximately 1.1 million.
(b) Let us start by finding a 95% confidence interval for the difference in the two
proportions. We use the formula:
        p1 − p2 ± z0.025 × √(p1(1 − p1)/n1 + p2(1 − p2)/n2).

    The estimates of the proportions π1 and π2 are 0.230 and 7/41 = 0.171,
    respectively. We know n1 = 954, and although n2 is unknown we can assume it
    is approximately equal to 954 (as the previous year’s survey was ‘similar’), so a
    95% confidence interval is:

        0.230 − 0.171 ± 1.96 × √(0.230 × 0.770/954 + 0.171 × 0.829/954) = 0.059 ± 0.036
giving (0.023, 0.094). Multiply the interval endpoints by 41 million and we get
a confidence interval of:
2,419,000 ± 1,476,000 ⇒ (943,000, 3,895,000).
We estimate that the number of shareholders has increased by about 2.4
million during the period covered by the surveys.
There is quite a large margin of error of approximately 1.5 million, especially
when compared with a point estimate (i.e. interval midpoint) of 2.4 million.
However, it seems financial liberalisation increased the number of shareholders.

10. In order to assess the impact of an advertising campaign, a restaurateur monitors


her daily revenue before and after the campaign. The table below shows some
sample statistics of daily sales calculated over a period of 60 days prior to the
campaign, and 45 days after the campaign. Determine a 95% confidence interval for
the increase in average daily sales due to the campaign. Is there strong evidence
that the advertising campaign has increased sales?

Before campaign After campaign


Number of days 60 45
Mean daily sales £503 £559
Standard deviation £21 £29


Solution:
As both sample sizes are ‘large’ there is no need to use a pooled estimator of the
variance as we would expect s1² ≈ σ1² and s2² ≈ σ2². With subscript ‘2’ denoting
‘after the campaign’, a 95% confidence interval for the difference in means is:

        x̄2 − x̄1 ± z0.025 × √(s1²/n1 + s2²/n2) ⇒ 559 − 503 ± 1.96 × √((21)²/60 + (29)²/45) ⇒ (46.0, 66.0).

Zero is well below the lower endpoint, so there is evidence that the advertising
campaign has increased sales.
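A sketch of the large-sample calculation (subscript 1 = before, 2 = after; variable names are ours):

```python
from math import sqrt

n1, x1, s1 = 60, 503, 21    # before the campaign
n2, x2, s2 = 45, 559, 29    # after the campaign

se = sqrt(s1**2 / n1 + s2**2 / n2)   # no pooling needed: both samples are large
z = 1.96
diff = x2 - x1
print(round(diff - z * se, 1), round(diff + z * se, 1))   # 46.0 66.0
```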

11. A survey is conducted on time spent recording (in hours per track) for 25 music
industry recording studios, classified as successful or unsuccessful according to their
recent chart performances. The relevant statistics resulting from this study are:
Sample size Sample mean Sample standard deviation
Successful studios 12 9.2 1.4
Unsuccessful studios 13 6.8 1.9

(a) Compute a 98% confidence interval for the difference between population mean
recording times between successful and unsuccessful studios.
(b) On the basis of this confidence interval, do you consider this to be sufficient
evidence of a true difference in mean recording times between the different
types of studios? Justify your answer.
Solution:
(a) Both samples are small (combined sample size < 30), so it will be necessary to
    use a pooled variance by assuming a common value of σ² and estimating it
    using the pooled variance estimator:

        Sp² = ((n1 − 1)S1² + (n2 − 1)S2²)/(n1 + n2 − 2).
Also, since the sample sizes are small and the true variance is unknown, the
Student’s t distribution will be used with ν = (n1 − 1) + (n2 − 1) = n1 + n2 − 2
degrees of freedom. (Simply add the degrees of freedom from the two separate
samples.)
Let the subscripts ‘1 ’ and ‘2 ’ refer to the successful and unsuccessful studios,
respectively. Because the studios are in the same market (which is largely why
we are bothering to compare them in the first place!) then it is reasonable to
assume the same variance for the recording times at studios for the two
groups. Under this assumption:

        sp² = ((12 − 1)(1.4)² + (13 − 1)(1.9)²)/(12 + 13 − 2) = 64.88/23 = 2.8209.

This variance is our best estimate of σ 2 , and applies (by assumption) to both
populations. If the true population means µ1 and µ2 are estimated by x̄1 and

289
E. Interval estimation

x̄2 (the sample means), respectively, then the central limit theorem allows us
to use the following approximations:
        Var(X̄1) ≈ Sp²/n1   and   Var(X̄2) ≈ Sp²/n2
with point estimates of 2.8209/12 and 2.8209/13, provided that we use the t
distribution.
The point estimates are 9.2 for µ1 and 6.8 for µ2 . We estimate the difference
between the means by:
x̄1 − x̄2 = 9.2 − 6.8 = 2.4.
Since the random samples are independent, we add the variances, hence:

        sp²/n1 + sp²/n2 = 2.8209/12 + 2.8209/13 = 2.8209 × (1/12 + 1/13)

on 23 degrees of freedom.
Therefore, a 98% confidence interval for the difference between the means is,
noting here t0.01, 23 = 2.500 for 98% confidence:

        2.4 ± 2.500 × √(2.8209 × (1/12 + 1/13)) ⇒ 2.4 ± 2.500 × 0.6724

resulting in the confidence interval (0.7190, 4.0810).
(b) Since zero is not in this interval, it follows (with 98% confidence) that the true
difference is greater than zero, so that more time (on average) is spent
recording at the successful studios.
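The pooled-variance arithmetic can be checked numerically (t-value 2.500 from Table 10, as in the solution; variable names are ours):

```python
from math import sqrt

n1, x1, s1 = 12, 9.2, 1.4    # successful studios
n2, x2, s2 = 13, 6.8, 1.9    # unsuccessful studios

sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)   # pooled variance
se = sqrt(sp2 * (1 / n1 + 1 / n2))
t_crit = 2.500               # t_{0.01, 23}
diff = x1 - x2
print(round(sp2, 4), round(diff - t_crit * se, 3), round(diff + t_crit * se, 3))
# 2.8209 0.719 4.081
```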

12. Two advertising companies each give quotations for nine different campaigns. Their
quotations (in £000s) are shown in the following table. Compute a 95% confidence
interval for the true difference between average quotations. Can you deduce from
this interval if one company is more expensive on average than the other?
Company 1 2 3 4 5 6 7 8 9
A 39 24 36 42 45 30 38 32 39
B 46 26 32 39 51 34 37 41 44

Solution:
The data are paired observations for n = 9 campaigns. First compute the
differences:
7, 2, −4, −3, 6, 4, −1, 9 and 5.
The sample mean of these differences is 2.78 and the sample standard deviation is
4.58. (Make sure you know how to do these calculations!) Therefore, a 95%
confidence interval for the true difference, µd , is:
        x̄d ± t0.025, n−1 × s/√n ⇒ 2.78 ± 2.306 × 4.58/√9 ⇒ (−0.74, 6.30).
Since zero is in this confidence interval we cannot conclude that one company is
more expensive than the other on average. It is possible that the differences in
quotations are due to random variation.
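The paired-differences calculation can be sketched as follows (differences taken as B − A, as in the solution; variable names are ours):

```python
from statistics import mean, stdev
from math import sqrt

company_a = [39, 24, 36, 42, 45, 30, 38, 32, 39]
company_b = [46, 26, 32, 39, 51, 34, 37, 41, 44]
d = [b - a for a, b in zip(company_a, company_b)]   # 7, 2, -4, -3, 6, 4, -1, 9, 5

n = len(d)
d_bar = mean(d)      # 2.78 to two decimal places
s_d = stdev(d)       # 4.58
t_crit = 2.306       # t_{0.025, 8}
half = t_crit * s_d / sqrt(n)
print(round(d_bar - half, 2), round(d_bar + half, 2))   # -0.74 6.3
```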


E.2 Practice problems


1. Would you say the following statement is true or false? ‘When calculated from the
same dataset, a 90% confidence interval is wider than a 95% confidence interval.’
Explain your answer.

2. National figures for a blood test result have been collected and the population
standard deviation is 1.2. You select a random sample of 100 observations and find
a sample mean of 25 units. Construct a 95% confidence interval for the population
mean.

3. Measurements of the diameter of a random sample of 200 ball bearings produced


by a machine gave a sample mean x̄ = 0.824. The sample standard deviation, s, is
0.042. Construct a 95% confidence interval and a 99% confidence interval for the
population mean value of the diameter of the ball bearings.
(Note that although you have been told that you only have a point estimate of the
population standard deviation, you can justify using the standard normal
distribution since the sample size is very large.)

4. Look at Table 10 of the New Cambridge Statistical Tables. Note that different
probability tails are given for ν = 1, 2 etc. Now consider a 95% confidence interval
for µ when n = 21 (i.e. for 20 degrees of freedom, as ν is the same as n − 1 in this
case). You can see the t-value is t0.025, 20 = 2.086. However, when ν → ∞, the
t-value converges to 1.96 – exactly the same as z0.025 for the standard normal
distribution. Although you can see that t-values are given for quite large degrees of
freedom, we generally assume that the standard normal distribution can be used
instead of Student’s t if the degrees of freedom are greater than 30 (some people
say 40, others say 50).

5. An independent assessment is made of the services provided by two holiday


companies. Of a random sample of 300 customers who booked through company A,
220 said they were satisfied with the service. Of a random sample of 250 of
company B’s customers, 200 said they were satisfied. For both companies the total
customer base was very large.
Construct a 95% confidence interval for the difference between the proportions of
satisfied customers between the two companies. Basing your conclusion on this
confidence interval, do you believe that one company generates greater customer
satisfaction than the other?

6. A random sample of 200 students is observed. 30 of them say they ‘really enjoy’
studying Statistics.
(a) Calculate the proportion of students in the sample who say they ‘really enjoy’
studying Statistics and then construct a 95% confidence interval for the
population proportion of students who ‘really enjoy’ studying Statistics.
You now take a further random sample, in another institution. This time there are
100 students, and 18 say they ‘really enjoy’ studying Statistics.


(b) Construct a 95% confidence interval for the population proportion of students
who ‘really enjoy’ studying Statistics based on this second sample.
(c) Construct a 95% confidence interval for the difference between the two
proportions.

E.3 Solutions to Practice problems

1. False. When calculated from the same data set, the only difference is the change in
confidence level. There is a trade-off whereby higher confidence (which is good,
other things equal) leads to a larger multiplier coefficient (a z-value or t-value)
leading to a greater margin of error, and hence a wider confidence interval (which is
bad, other things equal). Therefore a 95% confidence interval would be wider, since
95% confidence is greater than 90% confidence.

2. We are given a population variance of σ 2 = (1.2)2 = 1.44, a sample mean of x̄ = 25,


and a sample size of n = 100.
To get a 95% confidence interval for the mean, we need:

        x̄ ± 1.96 × σ/√n ⇒ 25 ± 1.96 × 1.2/√100 ⇒ (24.7648, 25.2352).

3. We are told that the standard deviation is derived from the sample, i.e.
   x̄ = 0.824, n = 200 and s = 0.042.
   We should use the t distribution, but, looking at Table 10 of the New Cambridge
   Statistical Tables the largest n we see is 121 (i.e. the degrees of freedom are 120), for
   which the appropriate 95% value is 1.98 (compared with 1.96 for an infinite-sized
   sample) and the 99% value is 2.617 (compared with 2.576). As 200 is beyond the
   values given in the tables, in practice we use the values given on the bottom line.
   The estimated standard error is s/√n = 0.042/√200 ≈ 0.00297, giving a 95%
   confidence interval of 0.824 ± 1.96 × 0.00297 ⇒ (0.818, 0.830) and a 99%
   confidence interval of 0.824 ± 2.576 × 0.00297 ⇒ (0.816, 0.832).

4. Do this exactly as stated. Proceed carefully. Look at Table 10 of the New


Cambridge Statistical Tables as suggested!

5. The sample proportions are pA = 220/300 = 0.733 and pB = 200/250 = 0.8. For a
95% confidence interval, we use the z-value of 1.96. A 95% confidence interval is:
        0.8 − 0.733 ± 1.96 × √(0.733 × (1 − 0.733)/300 + 0.8 × (1 − 0.8)/250) ⇒ (−0.0035, 0.1375).

The confidence interval (just) contains zero, suggesting no evidence of a difference


in the true population proportions. Therefore, there is no reason to believe that one
company gives greater satisfaction than the other. However, given the close
proximity of the lower endpoint to zero, this is a marginal case.


6. (a) We are given n = 200 and p = 30/200 = 0.15. So the estimated standard error
is:
        E.S.E.(p) = √(0.15 × 0.85/200) = √0.0006375 = 0.0252.
A 95% confidence interval is:

0.15 ± 1.96 × 0.0252 ⇒ (0.1006, 0.1994).

Taking this as percentages, we have a 95% confidence interval of between


10.06% and 19.94%, to two decimal places.
(b) In the second institution, n = 100 and p = 18/100 = 0.18. So the estimated
standard error is:

        E.S.E.(p) = √(0.18 × 0.82/100) = √0.001476 = 0.0384
which is larger than the one in the first sample owing to the relatively smaller
sample size.
This gives us a 95% confidence interval of:

0.18 ± 1.96 × 0.0384 ⇒ (0.1047, 0.2553).

Taking this as percentages, we have a 95% confidence interval of between


10.47% and 25.53%, to two decimal places.
(c) The estimated standard error of the difference in sample proportions is:

        √(0.15 × 0.85/200 + 0.18 × 0.82/100) = √(0.0006375 + 0.001476) = 0.0460

    which gives a 95% confidence interval of:

        0.18 − 0.15 ± 1.96 × 0.0460 ⇒ (−0.0602, 0.1202)

    or between −6.02% and 12.02%. Since zero lies inside this interval, the data
    provide no real evidence of a difference between the proportions of students
    who ‘really enjoy’ studying Statistics in the two institutions, although the
    interval is fairly wide owing to the modest sample sizes.

Appendix F
Hypothesis testing principles

F.1 Worked examples


1. What are the options of the binary decision in hypothesis testing?

Solution:
In hypothesis testing, the binary decision can result in one of two possible
outcomes based on the analysis of sample data:
• Reject the null hypothesis: This decision indicates that there is enough
evidence in the sample data to conclude that the null hypothesis is not true. It
suggests that there is an effect, a difference, or a change in the population
parameter, as suggested by the alternative hypothesis.
• Fail to reject the null hypothesis: This decision implies that there is insufficient
evidence in the sample data to reject the null hypothesis. It does not
necessarily mean that the null hypothesis is proven true; rather, it suggests
that the available evidence is not strong enough to support a rejection.
The goal of hypothesis testing is to make a decision about the null hypothesis
based on the observed sample data while controlling the risk of making a Type I
error (incorrectly rejecting a true null hypothesis).

2. When testing H0 : µ = µ0 , write down the other hypothesis when conducting:


(a) a two-tailed test
(b) an upper-tailed test
(c) a lower-tailed test.

Solution:
When testing H0 : µ = µ0 , the alternative hypothesis for:
(a) a two-tailed test is H1 : µ ≠ µ0
(b) an upper-tailed test is H1 : µ > µ0
(c) a lower-tailed test is H1 : µ < µ0 .

3. (a) Briefly define Type I and Type II errors in the context of the statistical test of
a hypothesis.
(b) What is the general effect on the probabilities of each type of these errors
happening if the sample size is increased?


Solution:
(a) A Type I error is rejecting H0 when it is true. A Type II error is failing to
reject a false H0 . We can express these as conditional probabilities as follows:

α = P (Type I error) = P (Reject H0 | H0 is true)

and:
β = P (Type II error) = P (Not reject H0 | H1 is true).

(b) Increasing the sample size decreases the probabilities of making both types of
error because there is greater precision in the estimation of parameters.

4. (a) Explain what is meant by the statement: ‘The test is significant at the 5%
significance level’.
(b) How should you interpret a test which is significant at the 10% significance
level, but not at the 5% significance level?
Solution:
(a) ‘The test is significant at the 5% significance level’ means there is a less than
5% chance of getting data as extreme as actually observed if the null
hypothesis was true. This implies that the data are inconsistent with the null
hypothesis, which we reject, i.e. the test is ‘moderately significant’ with
‘moderate evidence’ to justify rejecting H0 .
(b) A test which is significant at the 10% significance level, but not at the 5%
significance level is often interpreted as meaning there is some doubt about the
null hypothesis, but not enough to reject it with confidence, i.e. the test is
‘weakly significant’ with ‘weak evidence’ to justify rejecting H0 .

5. You are interested in researching the average daily sunlight exposure in a certain
region. Sunlight exposure in the region is modelled as a normal distribution with a
mean of µ and a known variance of σ 2 = 36. A random sample of n = 30 days is
taken, yielding a sample mean of 8 hours of sunlight per day, i.e. x̄ = 8 hours.
Independently of the data, three experts provide their opinions about the average
daily sunlight exposure in the region as follows:
• Meteorologist A claims the population mean sunlight exposure is µ = 7.5
hours.
• Climate Scientist B claims the population mean sunlight exposure is µ = 7.2
hours.
• Environmental Scientist C claims the population mean sunlight exposure is
µ = 6.8 hours.
Based on the data evidence, which expert’s claim do you find the most convincing?
Solution:
The sampling distribution of the sample mean is:

        X̄ ∼ N(µ, σ²/n) = N(µ, 36/30) = N(µ, 1.2)


since σ² = 36 and n = 30. We assess the statements of the three experts based on
this sampling distribution, with a standard error of σ/√n = √1.2.
• Under H0 : µ = 7.5 (Meteorologist A’s claim), the probability of being at least
8 − 7.5 = 0.5 hours above and below 7.5 is:
        P(|X̄ − 7.5| ≥ 0.5) = P(X̄ ≤ 7) + P(X̄ ≥ 8)
                           = P(Z ≤ (7 − 7.5)/√1.2) + P(Z ≥ (8 − 7.5)/√1.2)
                           = P(Z ≤ −0.4564) + P(Z ≥ 0.4564)
                           = 0.6481
using =NORM.S.DIST(-0.4564,1)+(1-NORM.S.DIST(0.4564,1)) or Table 4.
• Under H0 : µ = 7.2 (Climate Scientist B’s claim), the probability of being at
least 8 − 7.2 = 0.8 hours above and below 7.2 is:
        P(|X̄ − 7.2| ≥ 0.8) = P(X̄ ≤ 6.4) + P(X̄ ≥ 8)
                           = P(Z ≤ (6.4 − 7.2)/√1.2) + P(Z ≥ (8 − 7.2)/√1.2)
                           = P(Z ≤ −0.7303) + P(Z ≥ 0.7303)
                           = 0.4652
using =NORM.S.DIST(-0.7303,1)+(1-NORM.S.DIST(0.7303,1)) or Table 4.
• Under H0 : µ = 6.8 (Environmental Scientist C’s claim), the probability of
being at least 8 − 6.8 = 1.2 hours above and below 6.8 is:
        P(|X̄ − 6.8| ≥ 1.2) = P(X̄ ≤ 5.6) + P(X̄ ≥ 8)
                           = P(Z ≤ (5.6 − 6.8)/√1.2) + P(Z ≥ (8 − 6.8)/√1.2)
                           = P(Z ≤ −1.0954) + P(Z ≥ 1.0954)
                           = 0.2733
using =NORM.S.DIST(-1.0954,1)+(1-NORM.S.DIST(1.0954,1)) or Table 4.
These conditional probabilities are the p-values for the respective sets of
hypotheses:
H0 : µ = µ0 vs. H1 : µ ≠ µ0
where µ0 is the expert’s claimed value of µ. Note that:
0.2733 < 0.4652 < 0.6481
such that the greater the difference (the greater the incompatibility) between the
data evidence and the claim (in the null hypothesis), the smaller the p-value. In
summary, of the three claims the one we would find the most convincing would be
the claim that µ = 7.5, because if the hypothesis µ = 7.5 is true, the probability of
observing x̄ = 8, or more extreme values (i.e. x̄ ≤ 7 or x̄ ≥ 8), would be as high as
0.6481.
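The three p-values can be reproduced in one loop; this sketch uses the Python standard library rather than the spreadsheet formulas quoted above:

```python
from statistics import NormalDist
from math import sqrt

se = sqrt(36 / 30)     # standard error of the sample mean: sigma^2 = 36, n = 30
x_bar = 8
Z = NormalDist()

p_values = {}
for mu0 in (7.5, 7.2, 6.8):
    z = abs(x_bar - mu0) / se
    p_values[mu0] = 2 * (1 - Z.cdf(z))    # two-sided p-value

print({mu0: round(p, 4) for mu0, p in p_values.items()})
# {7.5: 0.6481, 7.2: 0.4652, 6.8: 0.2733}
```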


F.2 Practice problems


1. Think about each of the following statements. For each, state the null and
alternative hypotheses and say whether they will need one-tailed or two-tailed
tests. If one-tailed, specify whether a lower-tailed or upper-tailed test.
(a) The mean level of family income in a national population is known to be
10,000 ‘ulam’ per year. You take a random sample in an urban area and find
the mean family income is 6,000 ulam per year in that area. Do the families in
the chosen area have a lower mean income than the population as a whole?
(b) You are looking at data from two schools on the weights of children by age.
Are the mean weights of children aged 10–11 different in the two schools?
(c) You are looking at reading scores for children before and after a new teaching
technique is delivered. Have their reading scores improved on average?

2. Complete the following decision space in hypothesis testing.

                              Decision made
                   H0 not rejected    H0 rejected
True state  H0 true
of nature   H1 true

3. Complete the following conditional probabilities in hypothesis testing.

                              Decision made
                   H0 not rejected    H0 rejected
True state  H0 true
of nature   H1 true

4. Explain your attitude to a null hypothesis if a test of the hypothesis is significant:


(a) at the 1% significance level
(b) at the 10% significance level, but not at the 5% significance level.

5. Reproduce the significance level decision tree.

F.3 Solutions to Practice problems


1. Look at the wording in each of (a) to (c) carefully. You can get clues as to whether
you are dealing with a one-tailed test (where H1 will use a < or >), or a two-tailed
test (where H1 will use ≠) according to the use of words like ‘increase’, ‘higher’,
‘greater’ and ‘diminished’ which all imply a one-tailed test; or the use of words like
‘equal’, ‘changed’ and ‘different from’ which all imply a two-tailed test.
(a) Note the phrase ‘have a lower mean income’, so here we want to know whether
the mean income in a particular area (µUrban) is less than the general
population mean (µ = 10,000). So we need a lower-tailed test, where:


H0 : µUrban = 10,000 (i.e. µUrban is equal to µNational)

H1 : µUrban < 10,000 (i.e. µUrban is less than µNational).
(b) Here we are comparing the mean weights in two schools A and B, let us denote
the population means µA and µB , respectively. We want to know if they are
different so we need a two-tailed test.
H0 : µA = µB (i.e. the means are the same)
H1 : µA ≠ µB (i.e. the means are different).
(c) We are looking for evidence that the teaching technique has improved
children’s average reading scores. Let µBefore and µAfter denote the mean
reading scores before and after delivery of the teaching technique, respectively.
Hence we conduct a one-tailed test.
H0 : µBefore = µAfter (i.e. the mean reading scores are the same)
H1 : µBefore < µAfter (i.e. the mean reading score after the teaching
technique is greater than before).
Note we could have expressed the alternative hypothesis as:

H1 : µAfter > µBefore


so whether this is a lower-tailed or upper-tailed test depends on whether we
define the difference in means as ‘µBefore − µAfter’ (which would be a
lower-tailed test) or as ‘µAfter − µBefore’ (which would be an upper-tailed test).
Either approach is fine, there would be no impact on the final test conclusions.

2. The decision space in hypothesis testing is:

Decision made
H0 not rejected H0 rejected
True state H0 true Correct decision Type I error
of nature H1 true Type II error Correct decision

3. The table of conditional probabilities in hypothesis testing is:

Decision made
H0 not rejected H0 rejected
True state H0 true 1−α P (Type I error) = α
of nature H1 true P (Type II error) = β Power = 1 − β

4. (a) A test which is significant at the 1% significance level is highly significant. This
indicates that it is very unlikely to have obtained the sample we did if the null
hypothesis was actually true. Therefore, we would be very confident in
rejecting the null hypothesis with strong evidence to justify doing so.
(b) A test which is significant at the 10% significance level, but not at the 5%
significance level, is weakly significant. This indicates mild (at best) support
for rejecting the null hypothesis, with only weak evidence. The test outcome is
less ‘conclusive’ than in part (a).


5. We have:
Significance level decision tree:

Start at the 5% level.
• Reject H0 : choose the 1% level.
    – Reject H0 : test result is ‘highly significant’.
    – Not reject H0 : test result is ‘moderately significant’.
• Not reject H0 : choose the 10% level.
    – Reject H0 : test result is ‘weakly significant’.
    – Not reject H0 : test result is ‘not significant’.

Appendix G
Hypothesis testing of means and
proportions

G.1 Worked examples


1. A sample of seven batteries is taken at random from a large batch of (nominally 12
volt) batteries. These are tested and their true voltages are shown below:

12.9, 11.6, 13.5, 13.9, 12.1, 11.9 and 13.0.

(a) Test whether the mean voltage of the whole batch is 12 volts using two
appropriate significance levels.
(b) Test whether the mean batch voltage is less than 12 volts using two
appropriate significance levels.
(c) Which test do you think is the more appropriate? Briefly explain why.
Solution:
(a) We are to test H0 : µ = 12 vs. H1 : µ 6= 12. The key points here are that n is
small and that σ 2 is unknown. We can use a t test, and this is valid provided
the data are normally distributed. The test statistic value is:
t = (x̄ − 12)/(s/√n) = (12.7 − 12)/(0.858/√7) = 2.16.

This is compared to a Student’s t distribution on n − 1 = 6 degrees of freedom.


The critical values corresponding to a 5% significance level two-tailed test are
±t0.025, 6 = ±2.447, using Table 10 of the New Cambridge Statistical Tables.
Hence we cannot reject the null hypothesis at the 5% significance level since
2.16 < 2.447. Moving to the 10% significance level the critical values are
±t0.05, 6 = ±1.943, so we can reject H0 and conclude that the test is weakly
significant as there is weak evidence that µ ≠ 12.
(b) We are to test H0 : µ = 12 vs. H1 : µ < 12. There is no need to do a formal
statistical test. As the sample mean is 12.7, which is greater than 12, the test
statistic value will be positive while the lower-tailed test critical value will be
negative and hence there is no evidence whatsoever in favour of the alternative
hypothesis.
(c) In part (a) we are asked to conduct a two-tailed test; in part (b) it is a
lower-tailed test. Which is more appropriate will depend on the purpose of the
experiment, and your suspicions before you conduct it.


• If you suspected before collecting the data that the mean voltage was less
than 12 volts, the test in part (b) would be appropriate.
• If you had no prior reason to believe that the mean was less than 12 volts,
you would use the test in part (a).
• General rule: decide on whether it is a one-tailed or two-tailed test before
collecting the data!

2. A salesperson claims that they can sell at least 100 items per day. Over a period of
11 days their resulting sales were as shown below. Do the data support their claim?

94, 105, 100, 90, 96, 101, 92, 97, 93, 100 and 91.

Solution:
Here we look for evidence to refute the claim. If µ is the mean number of sales per
day, then we wish to test:

H0 : µ = 100 vs. H1 : µ < 100.

The main statistics are:



        n    x̄       s      s/√n
Sales   11   96.273  4.777  4.777/√11 = 1.440
Hence the test statistic value is:
t = (96.273 − 100)/1.440 = −2.59.
Comparing this with t10 (since there are 11 − 1 = 10 degrees of freedom), we see
that it is significant at the 5% significance level (for this lower-tailed test, the
critical value is t0.95, 10 = −1.812 and −2.59 < −1.812).
However, it is not significant at the 1% significance level (for which the critical
value is t0.99, 10 = −2.764 and −2.59 > −2.764). Hence we reject the salesperson’s
claim at the 5% significance level, hence the test is moderately significant with
moderate evidence against the claim.

3. In a particular city it is known, from past surveys, that 25% of households


regularly use a washing powder named ‘Snolite’. After an advertising campaign, a
survey of 300 randomly selected households showed that 100 had recently
purchased ‘Snolite’. Is there evidence that the campaign has been successful? Use a
1% significance level.
Solution:
Here we are testing a proportion: H0 : π = 0.25 vs. H1 : π > 0.25. Note that this is
an upper-tailed test as we have reason to believe that the advertising campaign has
increased sales, and we believe this before collecting any data. As this is a test for a
single proportion and n is large, we compute the test statistic value to be:
z = (p − π)/√(π(1 − π)/n) = (100/300 − 0.25)/√(0.25 × 0.75/300) = 3.33.


At the 1% significance level, we compare this to the critical value of z0.01 = 2.326
(from the bottom row of Table 10) and we see that it is significant at the 1%
significance level since 3.33 > 2.326. The test is highly significant and there is
strong evidence that more than 25% of households use ‘Snolite’. It appears that the
advertising campaign has indeed been successful.
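The z statistic and its p-value can be verified with the Python standard library alone; a minimal sketch, using `math.erf` to evaluate the standard normal CDF:

```python
import math

n, x = 300, 100          # sample size and number of 'Snolite' purchasers
pi0 = 0.25               # hypothesised proportion under H0
p_hat = x / n

# Upper-tailed z test for a single proportion
z = (p_hat - pi0) / math.sqrt(pi0 * (1 - pi0) / n)

# p-value = P(Z >= z), where Phi(z) = (1 + erf(z / sqrt(2))) / 2
p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))
```

The p-value is far below 0.01, confirming a highly significant result.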

4. Consumer Reports evaluated and rated 46 brands of toothpaste. One attribute


examined in the study was whether or not a toothpaste brand carries an American
Dental Association (ADA) seal verifying effective decay prevention. The data for
the 46 brands (coded as 1 = ADA seal, 0 = no ADA seal) are reproduced below.

0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1,
1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1.

(a) State the null and alternative hypotheses testing whether the true proportion
of toothpaste brands with the ADA seal verifying effective decay prevention is
less than 0.50.
(b) Assuming that the sample was random, from a much larger population,
conduct a test between the two hypotheses.
Solution:
(a) Appropriate hypotheses are H0 : π = 0.50 vs. H1 : π < 0.50.
(b) The data show that there are r = 21 ADA seals among n = 46 items in the
sample. Let π be the true proportion. If H0 is true then the sample proportion
is:
   
P ∼ N(π, π(1 − π)/n) = N(0.50, (0.50 × 0.50)/46) = N(0.50, 0.005435)

approximately, since 46 is ‘large’ and hence we can use the normal


approximation. The test statistic value for this test would then be:

z = (21/46 − 0.50)/√0.005435 = −0.59.
This is clearly not going to be significant at any plausible significance level
given that the p-value is:

P (Z < −0.59) = 0.2776.

Therefore, we cannot reject the null hypothesis. The test is not statistically
significant, and there is insufficient evidence that less than half the brands
have the ADA seal.
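Working directly from the raw 0/1 data, the same calculation can be sketched with the standard library only:

```python
import math

ada = [0,0,0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,1,1,1,0,1,
       1,1,1,0,0,0,0,0,1,0,0,1,1,1,0,1,0,1,1,1,0,0,1]

n, r = len(ada), sum(ada)    # n = 46 brands, r = 21 with the ADA seal
pi0 = 0.50

# Lower-tailed z test of H0: pi = 0.50 vs. H1: pi < 0.50
z = (r / n - pi0) / math.sqrt(pi0 * (1 - pi0) / n)
p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))   # P(Z < z)
```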

5. A government agency carries out a large-scale random survey of public attitudes


towards job security. 78 of the 300 workers surveyed indicated they were worried
about losing their job. A newspaper report claims 30% of workers fear losing their
job. Is such a high percentage claim justified? State and carry out an appropriate
hypothesis test at two appropriate significance levels and explain your results.


Solution:
We suspect the newspaper is being ‘sensationalist’ such that we seek evidence that
the true percentage of workers who fear losing their job is actually less than 30%.
The null and alternative hypotheses are:

H0 : π = 0.30 vs. H1 : π < 0.30.

The test statistic under H0 is:

Z = (P − π0)/√(π0(1 − π0)/n) = (P − 0.30)/√(0.30 × (1 − 0.30)/n) ∼ N(0, 1).

The sample proportion is p = 78/300 = 0.26 and the test statistic value is −1.51.
For α = 0.05, the lower-tailed test critical value is z0.95 = −1.645. Since
−1.51 > −1.645, we do not reject H0 at the 5% significance level. Moving to
α = 0.10, the critical value is now z0.90 = −1.282. Since −1.51 < −1.282, we reject
H0 at the 10% significance level. The test is weakly significant and we conclude that
there is weak evidence that fewer than 30% of workers were worried about losing
their jobs, suggesting there is only a marginal reason to doubt the newspaper
report’s claim.

6. If you live in California, the decision to purchase earthquake insurance is a critical


one. An article in the Annals of the Association of American Geographers
investigated many factors which California residents consider when purchasing
earthquake insurance. The survey revealed that only 133 of 337 randomly selected
households in Los Angeles County were protected by earthquake insurance.
(a) What are the appropriate null and alternative hypotheses to test the research
hypothesis that fewer than 40% of households in Los Angeles County were
protected by earthquake insurance?
(b) Do the data provide sufficient evidence to support the research hypothesis?
(Use α = 0.10.)
(c) Calculate and interpret the p-value for the test.

Solution:

(a) Appropriate hypotheses are H0 : π = 0.40 vs. H1 : π < 0.40.


(b) The estimate of the proportion covered by insurance is 133/337 = 0.395. As
this is a test of a single proportion and n is large, the test statistic value is:

z = (p − π)/√(π(1 − π)/n) = (0.395 − 0.40)/0.02669 = −0.20.

Compare this to the critical value of z0.90 = −1.282 (from the bottom row of
Table 10) and you will see that it is not significant at even the 10%
significance level since −0.20 > −1.282. Hence the test is not statistically
significant, i.e. there is insufficient evidence that fewer than 40% of households
in Los Angeles County were protected by earthquake insurance.


(c) To compute the p-value for this lower-tailed test, we determine:

P (Z < −0.20) = 0.4207

using Table 4. Hence we would not reject H0 for any α < 0.4207, remembering
that the p-value is the smallest significance level such that we reject the null
hypothesis.

7. A museum conducts a survey of its visitors in order to assess the popularity of a


device which is used to provide information on the museum exhibits. The device
will be withdrawn if fewer than 30% of all of the museum’s visitors make use of it.
Of a random sample of 80 visitors, 20 chose to use the device.
(a) Carry out a test at the 5% significance level to see if the device should be
withdrawn and state your conclusions.
(b) Determine the p-value of the test.
Solution:
(a) Let π be the population proportion of visitors who would use the device. We
test:
H0 : π = 0.30 vs. H1 : π < 0.30.
The sample proportion is p = 20/80 = 0.25. The standard error of the sample
proportion is √(0.30 × 0.70/80) = 0.0512. The test statistic value is:

z = (0.25 − 0.30)/0.0512 = −0.976.
For a lower-tailed test at the 5% significance level, the critical value is
z0.95 = −1.645, so the test is not significant, as −0.976 > −1.645. Moving to
the 10% significance level, the critical value is z0.90 = −1.282, so again we
cannot reject H0 . Hence the test is not statistically significant, i.e. there is
insufficient evidence to justify a withdrawal of the device.
(b) The p-value of the test is the probability of the test statistic value or a more
extreme value conditional on H0 being true. Hence the p-value is:

P (Z ≤ −0.976) = 0.1645.

So, for any α < 0.1645 we would fail to reject H0 .
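Both the test statistic and the p-value above are quick to check numerically; a minimal standard-library sketch:

```python
import math

n, x = 80, 20
pi0 = 0.30
p_hat = x / n                                      # 0.25

# Lower-tailed z test of H0: pi = 0.30 vs. H1: pi < 0.30
se = math.sqrt(pi0 * (1 - pi0) / n)                # standard error, ~ 0.0512
z = (p_hat - pi0) / se
p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))   # P(Z <= z)
```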

8. A market research company has conducted a survey of adults in two large towns,
either side of an international border, in order to judge attitudes towards a
controversial internationally-broadcast celebrity television programme.
The following table shows some of the information obtained by the survey:

Town A Town Z
Sample size 80 80
Sample number approving of the programme 44 41
(a) Conduct a formal hypothesis test, at two appropriate significance levels, of the
claim that the population proportions approving of the programme in the two
towns are equal.


(b) Would your conclusion be the same if, in both towns, the sample sizes had
been 1,500 (with the same sample proportions of approvals and the same
pooled sample proportion)?
Solution:
(a) We test:
H0 : πA = πZ vs. H1 : πA ≠ πZ .
Note there is no a priori reason to suppose which town would have a greater
proportion approving of the programme in the event that the proportions are
not equal, hence we conduct a two-tailed test.
The sample proportions are pA = 44/80 = 0.55 and pZ = 41/80 = 0.5125, and
the estimate of the common proportion under H0 is:
p = (44 + 41)/(80 + 80) = 0.53125.
We use the test statistic:

Z = (PA − PZ)/√(P(1 − P)(1/nA + 1/nZ)) ∼ N(0, 1)
approximately under H0 . The test statistic value is:
(0.55 − 0.5125)/√(0.53125 × (1 − 0.53125) × (1/80 + 1/80)) = 0.48.

At the 5% significance level, the critical values are ±1.96. Since 0.48 < 1.96 we
do not reject H0 . Following the significance level decision tree, we now test at
the 10% significance level, for which the critical values are ±1.645. Since
0.48 < 1.645, again we do not reject H0 . Hence the test is not statistically
significant, i.e. there is insufficient evidence of a difference in the population
proportions approving of the programme.
For reference, the p-value for this two-tailed test is:

2 × P (Z ≥ 0.48) = 2 × 0.3156 = 0.6312

which is greater than any reasonable significance level, α.


(b) If nA = nZ = 1,500, with the same values for pA , pZ and p, the test statistic
value becomes:
(0.55 − 0.5125)/√(0.53125 × (1 − 0.53125) × (1/1,500 + 1/1,500)) = 2.06.

Since 1.96 < 2.06, we reject H0 at the 5% significance level. Now testing at the
1% significance level we have critical values of ±2.576. Since 2.06 < 2.576, we
do not reject H0 and conclude that the test is moderately significant indicating
moderate evidence of a difference in the population proportions approving of
the programme.
The larger sample sizes have increased the power of the test, i.e. they have
increased our ability to reject H0 whereby the difference in the sample


proportions is attributed to chance (that is, sampling error) when


nA = nZ = 80, while the difference is statistically significant when
nA = nZ = 1,500.
For reference, the p-value for this two-tailed test is:

2 × P (Z ≥ 2.06) = 2 × 0.0197 = 0.0394.

Since 0.01 < 0.0394 < 0.05, the test is moderately significant, as before.
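The effect of the larger sample sizes can be seen directly by coding the pooled two-proportion z statistic; a minimal standard-library sketch (the helper name is illustrative):

```python
import math

def two_prop_z(p1, n1, p2, n2):
    """z statistic for H0: pi1 = pi2, using the pooled proportion."""
    p = (n1 * p1 + n2 * p2) / (n1 + n2)   # pooled estimate under H0
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z_small = two_prop_z(44/80, 80, 41/80, 80)        # ~ 0.48: not significant
z_large = two_prop_z(44/80, 1500, 41/80, 1500)    # ~ 2.06: significant at 5%
```

The same difference in sample proportions becomes significant purely because the standard error shrinks as the sample sizes grow.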

9. A random sample of 250 households in a particular community was taken and in 50


of these households the lead level in the water supply was found to be above an
acceptable level. A sample was also taken from a second community, which adds
anti-corrosives to its water supply and, of these, only 16 out of 320 households were
found to have unacceptably high lead levels. Is this conclusive evidence that the
addition of anti-corrosives reduces lead levels?
Solution:
Here we are testing the difference between two proportions with a one-sided test.
Also, n1 and n2 are both large. Let:
• π1 = the population proportion of households with unacceptable levels of lead
in the community without anti-corrosives
• π2 = the population proportion of households with unacceptable levels of lead
in the community with anti-corrosives.
We wish to test:
H0 : π1 = π2 vs. H1 : π1 > π2 .
The test statistic value is:
z = (p1 − p2)/√(p(1 − p)(1/n1 + 1/n2)) = (50/250 − 16/320)/√(0.116 × (1 − 0.116) × (1/250 + 1/320)) = 5.55

where p is the pooled sample proportion, i.e. p = (50 + 16)/(250 + 320) = 0.116.
This test statistic value is, at 5.55, obviously very extreme and hence is highly
significant (since z0.01 = 2.326 < 5.55), so there is strong evidence that
anti-corrosives reduce lead levels.

10. Two companies supplying a television repair service are compared by their repair
times (in days). Random samples of recent repair times for these companies gave
the following statistics:

Sample size Sample mean Sample variance


Company A 44 11.9 7.3
Company B 52 10.8 6.2
(a) Is there evidence that the companies differ in their true mean repair times?
Give an appropriate significance test to support your conclusions.
(b) Determine the p-value of the test.
(c) What difference would it have made if the sample sizes had each been smaller
by 35 (i.e. sample sizes of 9 and 17, respectively)?


Solution:
(a) H0 : µA = µB vs. H1 : µA 6= µB , where we use a two-tailed test since there is no
indication of which company would have a faster mean repair time. The test
statistic value is:
z = (11.9 − 10.8)/√(7.3/44 + 6.2/52) = 2.06.
For a two-tailed test, this is significant at the 5% significance level (since
z0.025 = 1.96 < 2.06), but it is not at the 1% significance level (since
2.06 < 2.326 = z0.01 ). We reject H0 and conclude that the test is moderately
significant with moderate evidence that the companies differ in terms of their
mean repair times.
(b) The p-value for this two-tailed test is 2 × P (Z > 2.06) = 0.0394.
(c) For small samples, we should use a pooled estimate of the population standard
deviation: s
(9 − 1) × 7.3 + (17 − 1) × 6.2
s= = 2.5626
(9 − 1) + (17 − 1)
on 24 degrees of freedom. Hence the test statistic value in this case is:
t = (11.9 − 10.8)/(2.5626 × √(1/9 + 1/17)) = 1.04.

This should be compared with critical values from the t24 distribution (5%:
1.711 and 10%: 1.318) and is clearly not significant, even at the 10%
significance level. With the smaller samples we fail to detect the difference.
Comparing the two test statistic calculations shows that the different results
flow from differences in the estimated standard errors, hence ultimately (and
unsurprisingly) from the differences in the sample sizes used in the two
situations.
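Both calculations, the large-sample z statistic and the small-sample pooled t statistic, can be sketched with the standard library:

```python
import math

n_a, xbar_a, var_a = 44, 11.9, 7.3
n_b, xbar_b, var_b = 52, 10.8, 6.2

# Large samples: z statistic using the separate sample variances
z = (xbar_a - xbar_b) / math.sqrt(var_a / n_a + var_b / n_b)

# Small samples (n = 9 and 17): pooled standard deviation, then t statistic
n_a2, n_b2 = 9, 17
s_pooled = math.sqrt(((n_a2 - 1) * var_a + (n_b2 - 1) * var_b)
                     / (n_a2 + n_b2 - 2))
t = (xbar_a - xbar_b) / (s_pooled * math.sqrt(1 / n_a2 + 1 / n_b2))
```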

11. A study was conducted to determine the number of hours spent on Facebook by
university and high school students. For this reason, a questionnaire was
administered to a random sample of 14 university and 11 high school students and
the hours per day spent on Facebook were recorded.

Sample size Sample mean Sample variance


University students 14 3.3 0.9
High school students 11 2.9 1.3
(a) Use an appropriate hypothesis test to determine whether the mean hours per
day spent on Facebook is different between university and high school
students. Test at two appropriate significance levels, stating clearly the
hypotheses, the test statistic and its distribution under the null hypothesis.
Comment on your findings.
(b) State clearly any assumptions you made in (a).
(c) Adjust the procedure above to determine whether the mean hours spent per
day on Facebook for university students is higher than that of high school
students.


Solution:
(a) Let µ1 denote the mean hours per day spent on Facebook for university
students and let µ2 denote the mean hours per day spent on Facebook for high
school students. We test:
H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2 .
Since the sample sizes are small, we assume σ1² = σ2² and use the test statistic:

(X̄1 − X̄2)/√(Sp²(1/n1 + 1/n2)) ∼ tn1+n2−2

under H0 . The estimate of the common variance is:


sp² = ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2) = (13 × 0.9 + 10 × 1.3)/(14 + 11 − 2) = 1.0739.
The test statistic value is:
(3.3 − 2.9)/√(1.0739 × (1/14 + 1/11)) = 0.9580.
For α = 0.05, the critical values are ±t0.025, 23 = ±2.069 (the t23 distribution is
used). Since 0.9580 < 2.069, we do not reject H0 . For α = 0.10, the critical
values are ±t0.05, 23 = ±1.714. Since 0.9580 < 1.714, again we do not reject H0 .
We conclude that the test is not statistically significant, i.e. there is
insufficient evidence of a difference in the mean hours spent on Facebook
between university and high school students.
(b) It is assumed that the random samples are independent and that σ1² = σ2² (due to
the small sample sizes).
(c) Now we test:
H0 : µ1 = µ2 vs. H1 : µ1 > µ2 .
All that changes are the critical values. For α = 0.05, we have a critical value
of t0.05, 23 = 1.714. Since 0.9580 < 1.714, we do not reject H0 . For α = 0.10, we
have a critical value of t0.10, 23 = 1.319. Since 0.9580 < 1.319, we again do not
reject H0 . The conclusion is the same as in part (a).
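With only summary statistics available, the pooled-variance t test can be checked using SciPy's `ttest_ind_from_stats`, which takes standard deviations rather than variances; a minimal sketch, assuming SciPy is installed:

```python
import math
from scipy import stats

# Summary statistics only; equal_var=True gives the pooled-variance t test
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=3.3, std1=math.sqrt(0.9), nobs1=14,
    mean2=2.9, std2=math.sqrt(1.3), nobs2=11,
    equal_var=True)
```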

G.2 Practice problems


1. A random sample of 18 observations from the population N(µ, σ²) yields the
sample mean x̄ = 8.83 and the sample variance s² = 1.59. At the 1% significance
level, test the following hypotheses by obtaining critical values:
(a) H0 : µ = 8 vs. H1 : µ > 8.
(b) H0 : µ = 8 vs. H1 : µ < 8.
(c) H0 : µ = 8 vs. H1 : µ ≠ 8.
Repeat the above exercise with the additional assumption that σ² = 1.59. Compare
the results with those derived without this assumption and comment.


2. The manufacturer of a patent medicine claimed that it was 90% effective in


relieving an allergy for a period of 8 hours. In a random sample of 200 people
suffering from the allergy, the medicine provided relief for 160 people.
Determine whether the manufacturer’s claim is legitimate. (Be careful, your
parameter here will be π.) Is your test one-tailed or two-tailed?

3. You have been asked to compare the percentages of people in two groups with
n1 = 16 and n2 = 24 who are in favour of a new plan. You decide to make a pooled
estimate of the proportion and make a test. Which test would you use?

4. Two different methods of determination of the percentage fat content in meat are
available. Both methods are used on portions of the same meat sample. Is there
any evidence to suggest that one method gives a higher reading than the other?

Meat Sample Method Meat Sample Method


I II I II
1 23.1 22.7 9 38.4 38.1
2 23.2 23.6 10 23.5 23.8
3 26.5 27.1 11 22.2 22.5
4 26.6 27.4 12 24.7 24.4
5 27.1 27.4 13 45.1 43.5
6 48.3 46.8 14 27.6 27.0
7 40.5 40.4 15 25.0 24.9
8 25.0 24.9 16 36.7 35.2

G.3 Solutions to practice problems



1. When σ² is unknown, we use the test statistic T = √n(X̄ − 8)/S. Under H0 ,
T ∼ t17 . With α = 0.01, we reject H0 if:
(a) t > t0.01, 17 = 2.567, against H1 : µ > 8.
(b) t < −t0.01, 17 = −2.567, against H1 : µ < 8.
(c) |t| > t0.005, 17 = 2.898, against H1 : µ ≠ 8.
For the given sample, t = 2.79. Hence we reject H0 against the alternative
H1 : µ > 8, but we will not reject H0 against the two other alternative hypotheses.

When σ² is known, we use the test statistic Z = √n(X̄ − 8)/σ. Under H0 ,
Z ∼ N (0, 1). With α = 0.01, we reject H0 if:
(a) z > z0.01 = 2.326, against H1 : µ > 8.
(b) z < −z0.01 = −2.326, against H1 : µ < 8.
(c) |z| > z0.005 = 2.576, against H1 : µ ≠ 8.
For the given sample, z = 2.79. Hence we reject H0 against the alternative
H1 : µ > 8 and H1 : µ ≠ 8, but we will not reject H0 against H1 : µ < 8.
With σ² known, we should be able to perform inference better simply because we
have more information about the population. More precisely, for the given
significance level, we require less extreme values to reject H0 . Put another way, the


p-value of the test is reduced when σ² is given. Therefore, the risk of failing to
reject a false H0 is also reduced.

2. Here the manufacturer is claiming that 90% of the population will be relieved over
8 hours, i.e. π = 0.90. There is a random sample of n = 200 and 160 gained relief,
giving a sample proportion of p = 160/200 = 0.80.
Given the manufacturer’s claim, we would be concerned if we found significant
evidence that fewer than 90% were relieved (more than 90% would not be a
problem!). So we have a one-tailed test with:
H0 : π = 0.90 vs. H1 : π < 0.90.
We use the population value π to work out the standard error, and so the test
statistic value is:
z = (0.80 − 0.90)/√((0.90 × 0.10)/200) = −0.10/0.0212 = −4.717.
This goes beyond Table 4 of the New Cambridge Statistical Tables, so this result is
highly significant, since the p-value is nearly zero. Alternatively, you could look at
the bottom row of Table 10 which provides the 5% significance level critical value
of −1.645 for a lower-tailed test and the 1% significance level critical value of
−2.326. This confirms that the result is highly significant. So we reject H0 and the
manufacturer’s claim is not legitimate. Based on the observed sample, we have
(very) strong evidence that the proportion of the population given relief is
significantly less than the 90% claimed.

3. We would test:
H0 : π1 = π2 vs. H1 : π1 ≠ π2 .
Under H0 , the test statistic would be:
(P1 − P2)/√(P(1 − P)(1/16 + 1/24)) ∼ N(0, 1)
where P is the pooled proportion estimator:
P = (n1P1 + n2P2)/(n1 + n2) = (16P1 + 24P2)/40.
Since the test statistic follows (approximately) the standard normal distribution,
there are no degrees of freedom to consider. For a two-tailed test, the 5% critical
values would be ±1.96.

4. Since the same meat is being put through two different tests, we are clearly dealing
with paired samples. We wish to test:
H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2
but for paired samples we do our calculations using the differenced data, so we
might reformulate this presentation of the hypotheses in the form:
H0 : µ1 − µ2 = 0 vs. H1 : µ1 − µ2 ≠ 0.
If we list the differences from the original data table then we get:


0.4 −0.4 −0.6 −0.8 −0.3 1.5 0.1 0.1


0.3 −0.3 −0.3 0.3 1.6 0.6 0.1 1.5

The main statistics are:



              n    x̄d      sd      sd/√n
Differences   16   0.238   0.745   0.745/√16 = 0.186

Hence the test statistic value is:


t = (0.238 − 0)/0.186 = 1.28.
Looking at Table 10 for the t15 distribution, we find that the critical values for a
two-tailed test at the 5% significance level are ±t0.025, 15 = ±2.131, so this result is
not significant at the 5% significance level since 1.28 < 2.131. Also, the critical
values for a test at the 10% significance level are ±t0.05, 15 = ±1.753, so the result
would still not be significant. We conclude that the test is not statistically
significant, i.e. there is insufficient evidence of a difference in the means.
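The paired t test above can be reproduced directly from the two columns of raw data; a minimal Python sketch, assuming SciPy is installed:

```python
from scipy import stats

method1 = [23.1, 23.2, 26.5, 26.6, 27.1, 48.3, 40.5, 25.0,
           38.4, 23.5, 22.2, 24.7, 45.1, 27.6, 25.0, 36.7]
method2 = [22.7, 23.6, 27.1, 27.4, 27.4, 46.8, 40.4, 24.9,
           38.1, 23.8, 22.5, 24.4, 43.5, 27.0, 24.9, 35.2]

# Paired-samples t test on the differences (Method I - Method II)
t_stat, p_value = stats.ttest_rel(method1, method2)
```

Internally this is just a one-sample t test applied to the sixteen differences, which is why it matches the hand calculation on the differenced data.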

Appendix H
Contingency tables and the
chi-squared test

H.1 Worked examples


1. Give an example of a 2 × 2 contingency table in which there is:
(a) no association
(b) strong association.
Be sure to show why each contingency table exhibits the property.
Solution:
(a) Any 2 × 2 contingency table with equal observed frequencies would suffice. For
example (where k is some constant):
Level 1 Level 2
Level 1 k k
Level 2 k k
Clearly, Oij = Eij = k for all cells, leading to a test statistic value of zero for
the chi-squared test.
(b) A possible example of a 2 × 2 contingency table showing strong association is:
Level 1 Level 2
Level 1 100 200
Level 2 200 100
All expected frequencies are 300 × 300/600 = 150, and (Oij − Eij)²/Eij =
16.67 for all cells. Therefore, the test statistic value for the chi-squared test is
4 × 16.67 = 66.67 which is highly significant given the critical value for the 1%
significance level is 6.635 (using Table 8 of the New Cambridge Statistical
Tables).

2. An analyst of the retail trade uses as analytical tools the concepts of ‘Footfall’ (the
daily number of customers per unit sales area of a shop) and ‘Ticket price’ (the
average sale price of an item in the shop’s offer). Shops are classified as offering low,
medium or high price items and, during any sales period, as having low, medium or
high footfall. During the January sales, the analyst studies a sample of shops and
obtains the following frequency data for the nine possible combined classifications:
Low price Medium price High price
Low footfall 22 43 16
Medium footfall 37 126 25
High footfall 45 75 23


Conduct a suitable test for association between Ticket classification and Footfall
level, and report on your findings.
Solution:
We test:
H0 : No association between ticket price and footfall level
vs.
H1 : Association between ticket price and footfall level.
We next compute the expected values for each cell using:
(row i total × column j total)/(total number of observations)
which gives:

Ticket price
Footfall Low Medium High Total
O1· 22 43 16 81
Low E1· 20.45 47.97 12.58 81
(O1· − E1· )2 /E1· 0.12 0.52 0.93
O2· 37 126 25 188
Medium E2· 47.46 111.34 29.20 188
(O2· − E2· )2 /E2· 2.30 1.93 0.61
O3· 45 75 23 143
High E3· 36.10 84.69 22.21 143
(O3· − E3· )2 /E3· 2.20 1.11 0.03
Total 104 244 64 412

The test statistic value is:


Σi Σj (Oij − Eij)²/Eij = 2.20 + 1.11 + · · · + 0.93 = 9.73.

Since r = c = 3, we have (r − 1)(c − 1) = (3 − 1)(3 − 1) = 4 degrees of freedom.


For α = 0.05, using Table 8 of the New Cambridge Statistical Tables, we obtain an
upper-tail critical value of 9.488. Hence we reject H0 since 9.73 > 9.488. Moving to
the 1% significance level, the critical value is now 13.277 so we do not reject H0
since 9.73 < 13.277. Therefore, the test is moderately significant and we conclude
that there is moderate evidence of an association between ticket price and footfall.
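The whole table of expected frequencies and the test statistic can be generated in one call; a minimal Python sketch, assuming SciPy is installed:

```python
from scipy.stats import chi2_contingency

observed = [[22, 43, 16],
            [37, 126, 25],
            [45, 75, 23]]

# Pearson chi-squared test of association; no continuity correction
# is applied since this is larger than a 2x2 table
chi2, p_value, dof, expected = chi2_contingency(observed)
```

The returned `expected` array matches the hand-computed expected frequencies, and the p-value falls between 0.01 and 0.05, confirming a moderately significant result.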

3. The table below shows the relationship between gender and party identification in
a US state.

Party identification
Gender Democrat Independent Republican Total
Male 108 72 198 378
Female 216 59 142 417
Total 324 131 340 795


Test for an association between gender and party identification at two appropriate
significance levels and comment on your results.
Solution:
We test:
H0 : There is no association between gender and party identification
vs.
H1 : There is an association between gender and party identification.

We construct the following contingency table:


                            Party identification
                    Democrat   Independent   Republican   Total
        O1j              108            72          198     378
Male    E1j           154.05         62.29       161.66     378
        (O1j − E1j)²/E1j 13.77         1.51         8.17
        O2j              216            59          142     417
Female  E2j           169.95         68.71       178.34     417
        (O2j − E2j)²/E2j 12.48         1.37         7.40
Total                    324           131          340     795

The χ² test statistic value is Σi,j (Oij − Eij)²/Eij = 44.71. The number of degrees
of freedom is (2 − 1)(3 − 1) = 2, so we compare the test statistic with the χ²
distribution with 2 degrees of freedom in Table 8. The test is significant at the
5% significance level (critical value is 5.991, and 44.71 > 5.991). It is also
significant at the 1% significance level (critical value is 9.210, and
44.71 > 9.210), so the test is highly significant and we have found (very!) strong
evidence of an association between gender and party identification.
Comparing the observed and expected frequencies, we see that males are more
likely to identify as Republican supporters, while females are more likely to identify
as Democrat supporters.

4. The following table shows the numbers of car accidents in an urban area over a
period of time. These are classified by severity and by type of vehicle. Carry out a
test for association on these data and draw conclusions.
                Severity of accident
               Minor   Medium   Major
Saloon            29       39      16
Van               15       24      12
Sports car         7       20      12

Solution:
We test:
H0 : There is no association between type of vehicle and severity of accident
vs.
H1 : There is an association between type of vehicle and severity of accident.


We construct the following contingency table:


                          Severity of accident
                     Minor   Medium   Major   Total
         O1j            29       39      16      84
Saloon   E1j          24.6     40.1    19.3      84
         (O1j − E1j)²/E1j 0.78   0.03    0.57
         O2j            15       24      12      51
Van      E2j          15.0     24.3    11.7      51
         (O2j − E2j)²/E2j 0.00   0.00    0.00
         O3j             7       20      12      39
Sports   E3j          11.4     18.6     9.0      39
         (O3j − E3j)²/E3j 1.72   0.10    1.03
Total                   51       83      40     174
The χ² test statistic value is Σi,j (Oij − Eij)²/Eij = 4.24. The number of degrees
of freedom is (3 − 1)(3 − 1) = 4, so we compare the test statistic with the χ²
distribution with 4 degrees of freedom in Table 8.
The test is not significant at the 5% significance level (critical value is 9.488,
and 4.24 < 9.488). It is also not significant at the 10% significance level
(critical value is 7.779, and 4.24 < 7.779), so the test is not statistically
significant and there is insufficient evidence of an association between type of
vehicle and severity of accident.

5. In a random sample of 200 children in a school it was found that 171 had been
inoculated against the common cold before the winter. The table below shows the
numbers observed to have suffered from colds over the winter season.

           Inoculated   Not inoculated
No cold            40                5
Cold              131               24

Test for evidence of an association between colds and inoculation, and draw
conclusions.
Solution:
We test:
H0 : There is no association between being inoculated and cold prevention.
vs.
H1 : There is an association between being inoculated and cold prevention.
We construct the following contingency table:
                         Inoculated   Not inoculated   Total
         O1j                     40                5      45
No cold  E1j                 38.475            6.525      45
         (O1j − E1j)²/E1j     0.060            0.356
         O2j                    131               24     155
Cold     E2j                132.525           22.475     155
         (O2j − E2j)²/E2j     0.018            0.103
Total                           171               29     200


The number of degrees of freedom is (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1.


The χ2 test statistic value is 0.060 + 0.356 + 0.018 + 0.103 = 0.537.
For comparison, using Table 8 the χ21 critical value for α = 0.30 (i.e. a 30%
significance level!) is 1.074. Since 0.537 < 1.074, the test statistic value is clearly
not significant. There is no evidence that inoculation prevents colds.
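For a 2 × 2 table there is a well-known shortcut formula for the χ² statistic, n(ad − bc)²/((a + b)(c + d)(a + c)(b + d)), which avoids computing expected frequencies cell by cell. The following illustrative Python sketch (not part of the original solution) checks it against the inoculation table above.

```python
# Illustrative sketch: the 2x2 shortcut formula for the chi-squared statistic,
# applied to the inoculation table (a, b = No cold row; c, d = Cold row).
a, b = 40, 5
c, d = 131, 24
n = a + b + c + d

# Shortcut: chi2 = n(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(round(chi2, 3))  # 0.538
```

The exact value 0.538 agrees with the 0.537 obtained in the solution by summing the (rounded) individual cell contributions.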

6. The classification of examination candidates for four examining bodies is shown
below. Is there evidence that the examining bodies are classifying differently? If
so, explain what the differences are.

               Examining bodies
            A      B      C      D
Pass      233    737    358    176
Refer      16     68     29     20
Fail       73    167    136     64

Solution:
We test:

H0 : There is no association between examining bodies and grade classifications

vs.

H1 : There is an association between examining bodies and grade classifications.

We construct the following contingency table:


                          Examining bodies
                    A        B        C        D    Total
       O1j        233      737      358      176    1,504
Pass   E1j      233.2    703.8    378.7    188.3    1,504
       (O1j − E1j)²/E1j 0.00 1.56    1.13     0.80
       O2j         16       68       29       20      133
Refer  E2j       20.6     62.2     33.5     16.6      133
       (O2j − E2j)²/E2j 1.04 0.53    0.60     0.53
       O3j         73      167      136       64      440
Fail   E3j       68.2    205.9    110.8     55.1      440
       (O3j − E3j)²/E3j 0.34 7.35    5.73     1.47
Total             322      972      523      260    2,077

The χ² test statistic value is Σi,j (Oij − Eij)²/Eij = 21.21. The number of degrees
of freedom is (4 − 1)(3 − 1) = 6, so we compare the test statistic with the χ²
distribution with 6 degrees of freedom in Table 8.
The test is significant at the 5% significance level (the critical value is 12.59, and
21.21 > 12.59), indeed it is also significant at the 1% significance level (the critical
value is 16.81, and 21.21 > 16.81). So we can conclude that the test is highly


significant and hence there is strong evidence of an association between examining
bodies and grade classifications.
Looking at the table for large individual contributors to the test statistic value and
comparing observed and expected frequencies, it appears that Examination body B
fails fewer people than expected under independence while Examination body C
fails more. Note that none of the other individual values of the χ2 test statistic is a
significant fraction of the 21.21 total.
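The informal inspection of large individual contributors can be automated. The following illustrative Python sketch (not part of the original solution) computes every cell's contribution to the χ² statistic for the examining bodies table and reports the largest.

```python
# Illustrative sketch: locating the cells that contribute most to the
# chi-squared statistic in the examining bodies table.
bodies = ["A", "B", "C", "D"]
grades = ["Pass", "Refer", "Fail"]
observed = [
    [233, 737, 358, 176],  # Pass
    [16, 68, 29, 20],      # Refer
    [73, 167, 136, 64],    # Fail
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Contribution of each cell: (O - E)^2 / E with E = row total * column total / n
contributions = {
    (grades[i], bodies[j]):
        (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
    for i in range(len(grades)) for j in range(len(bodies))
}

largest = max(contributions, key=contributions.get)
print(largest, round(contributions[largest], 2))  # ('Fail', 'B') 7.35
```

As noted in the solution, the Fail row for body B dominates, followed by the Fail cell for body C.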

7. Many people believe that when a horse races, it has a better chance of winning if
its starting line-up position is closer to the rail on the inside of the track. The
starting position of 1 is closest to the inside rail, followed by position 2, and so on.
The table below lists the numbers of wins of horses in the different starting
positions. Do the data support the claim that the probabilities of winning in the
different starting positions are not all the same?

Starting position 1 2 3 4 5 6 7 8
Number of wins 29 19 18 25 17 10 15 11

Solution:
We test whether the data follow a discrete uniform distribution of 8 categories. Let
pi = P (X = i), for i = 1, 2, . . . , 8.
We test the null hypothesis H0 : pi = 1/8, for i = 1, 2, . . . , 8.
Note n = 29 + 19 + 18 + 25 + 17 + 10 + 15 + 11 = 144. The expected frequencies are
Ei = 144/8 = 18, for all i = 1, 2, . . . , 8.
Starting position 1 2 3 4 5 6 7 8 Total
Oi 29 19 18 25 17 10 15 11 144
Ei 18 18 18 18 18 18 18 18 144
Oi − Ei 11 1 0 7 −1 −8 −3 −7 0
(Oi − Ei )2 /Ei 6.72 0.06 0 2.72 0.06 3.56 0.50 2.72 16.34

Under H0, Σi (Oi − Ei)²/Ei follows a χ² distribution with 7 degrees of freedom.
At the 5% significance level, the critical value is
14.067. Since 16.34 > 14.067, we reject the null hypothesis. Turning to the 1%
significance level, the critical value is 18.475. Since 16.34 < 18.475, we cannot reject
the null hypothesis, hence we conclude that the test is moderately significant, with
moderate evidence to support the claim that the chances of winning in the different
starting positions are not all the same.
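The goodness-of-fit calculation above can be sketched in a few lines. This is an illustrative Python sketch (not part of the original solution) testing the discrete uniform hypothesis pi = 1/8 for the starting position data.

```python
# Illustrative sketch: chi-squared goodness-of-fit test of a discrete uniform
# distribution across the 8 starting positions.
wins = [29, 19, 18, 25, 17, 10, 15, 11]
n = sum(wins)                     # 144 races in total
expected = n / len(wins)          # 18 per position under H0: p_i = 1/8

chi2 = sum((o - expected) ** 2 / expected for o in wins)
df = len(wins) - 1                # 7 degrees of freedom

print(round(chi2, 2), df)  # 16.33 7
```

Exact arithmetic gives 16.33; the 16.34 in the solution reflects rounding of the individual cell contributions. Either way the statistic falls between the 5% and 1% critical values (14.067 and 18.475).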

H.2 Practice problems


1. A survey has been made of levels of satisfaction with housing by people living in
different types of accommodation. Levels of satisfaction are categorised as being
high, medium, low and very dissatisfied. Levels of housing type are public housing
apartment, public housing house, private apartment, private detached house,
private semi-detached house, miscellaneous (includes boat, caravan etc.). State:


(a) the null and alternative hypotheses


(b) the degrees of freedom
(c) the 5% and 1% critical values for a χ2 test.

2. You have carried out a χ2 test on a 3 × 4 contingency table to study whether there
is evidence of an association between advertising type and level of sales of a
product. You have four types of advertising (A, B, C and D) and three levels of
sales (low, medium and high).
Your calculated χ2 value is 13.50. Giving degrees of freedom and an appropriate
significance level, set out your hypotheses. What would you say about the result?

3. In a survey conducted to decide where to locate a factory, samples from five towns
were examined to see the numbers of skilled and unskilled workers. The data were
as follows.

          Number of         Number of
Area      skilled workers   unskilled workers
A               80                184
B               58                147
C              114                276
D               55                196
E               83                229

(a) Does the population proportion of skilled workers vary with the area?
(b) Test:
H0 : πD = πothers vs. H1 : πD < πothers.
Think about your results and what you would explain to your management
team who had seen the chi-squared results and want, for other reasons, to site
their factory at area D.

4. The table below shows a contingency table for a sample of 1,104 randomly-selected
adults from three types of environment (City, Town and Rural) who have been
classified into two groups by level of exercise (high and low). Test the hypothesis
that there is no association between level of exercise and type of environment and
draw conclusions.

Level of exercise
Environment High Low
City 221 256
Town 230 118
Rural 159 120

5. You have been given the number of births in a country for each of the four seasons
of the year and are asked whether births vary over the year. What would you need
to know in order to carry out a chi-squared test of the hypothesis that births are
spread uniformly between the four seasons? Outline the steps in your work.


H.3 Solutions to Practice problems


1. (a) The null and alternative hypotheses are:
H0 : There is no association between the kind of accommodation people
have and their level of satisfaction with it.
H1 : There is an association between the kind of accommodation people
have and their level of satisfaction with it.
(b) The degrees of freedom are (number of rows − 1) × (number of columns − 1).
Let us take the levels of satisfaction to be rows and the housing types to be
columns. Your table would look like this:
                                     Housing type
Level of          Public      Public   Private     Private    Private semi-   Misc.
satisfaction      apartment   house    apartment   detached   detached
                                                   house      house
High
Medium
Low
Very dissatisfied

There are 4 rows and 6 columns and therefore (4 − 1) × (6 − 1) = 3 × 5 = 15
degrees of freedom.
(c) Now look up the 5% and 1% critical values for a chi-squared distribution with
15 degrees of freedom, using Table 8 of the New Cambridge Statistical Tables.
The 5% critical value is 25.00 and the 1% critical value is 30.58.

2. We test:
H0 : No association between advertising type and level of sales
H1 : There is an association between advertising type and level of sales.
The degrees of freedom are (r − 1)(c − 1) = (3 − 1)(4 − 1) = 6. Using a 5%
significance level, the critical value is 12.59. Since 12.59 < 13.50, we reject the null
hypothesis. Moving to the 1% significance level, the critical value is 16.81, hence we
do not reject the null hypothesis. We conclude that the test is moderately
significant, with moderate evidence of an association between advertising type and
level of sales.
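The two-significance-level convention used throughout these solutions (test at 5%, then at 1%, and report the strength of the result) can be captured in a small helper. This is an illustrative Python sketch, not part of the guide; the critical values 12.59 and 16.81 are the Table 8 values for 6 degrees of freedom quoted above.

```python
# Illustrative sketch of the guide's two-level decision convention:
# compare the test statistic with the 5% critical value, then the 1% value.

def significance(chi2_stat, crit_5pct, crit_1pct):
    """Classify a chi-squared test result using the 5% then 1% convention."""
    if chi2_stat < crit_5pct:
        return "not significant"
    if chi2_stat < crit_1pct:
        return "moderately significant"   # reject H0 at 5% only
    return "highly significant"           # reject H0 at 1% as well

print(significance(13.50, 12.59, 16.81))  # moderately significant
```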

3. (a) We test:
H0 : There is no association between proportion of skilled workers and
area.
H1 : There is an association between proportion of skilled workers and
area.
We get a contingency table which should look like:


                       Number of         Number of           Total
                       skilled workers   unskilled workers
    O1·                      80                 184            264
A   E1·                   72.41              191.59
    (O1· − E1·)²/E1·       0.80                0.30
    O2·                      58                 147            205
B   E2·                   56.22              148.78
    (O2· − E2·)²/E2·       0.06                0.02
    O3·                     114                 276            390
C   E3·                  106.96              283.04
    (O3· − E3·)²/E3·       0.46                0.18
    O4·                      55                 196            251
D   E4·                   68.84              182.16
    (O4· − E4·)²/E4·       2.78                1.05
    O5·                      83                 229            312
E   E5·                   85.57              226.43
    (O5· − E5·)²/E5·       0.08                0.03
Total                       390               1,032          1,422

The χ² test statistic value is Σi,j (Oij − Eij)²/Eij = 5.76. The number of
degrees of freedom is (5 − 1)(2 − 1) = 4, so we compare the test statistic value
with the χ² distribution with 4 degrees of freedom, using Table 8.
The test is not significant at the 5% significance level (the critical value is
9.488). It is also not significant at the 10% significance level (the critical
value is 7.779) so the test is not statistically significant and we have found
insufficient evidence of an association. That is, the proportion of skilled
workers does not appear to vary with area.
(b) We have an upper-tailed z test:

H0 : πD = πothers vs. H1 : πD < πothers .

We have pD = 55/251 = 0.2191, pothers = 335/1,171 = 0.2861, and a pooled sample
proportion of (55 + 335)/(251 + 1,171) = 0.2742.
So the test statistic value is:

z = (0.2861 − 0.2191)/√(0.2742 × (1 − 0.2742) × (1/251 + 1/1,171)) = 0.0670/0.0310 = 2.1613.

The 5% critical value is 1.645 and the 1% critical value is 2.326. So our
calculated test statistic value of 2.1613 lies between the two, meaning we reject
the null hypothesis at the 5% significance level only. Hence the test is
moderately significant.
We can tell the management team that there is moderate evidence of a lower
proportion of skilled workers in area D compared with the others. However, it
is not significant at the 1% significance level and so it would be sensible to
check whether there are other factors of interest to them before deciding
against the area.
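The two-proportion z test in part (b) can be reproduced directly. This illustrative Python sketch (not part of the original solution) carries the arithmetic through in full precision; the worked value 2.1613 differs slightly because it uses rounded intermediate figures.

```python
# Illustrative sketch: the two-proportion z test comparing the proportion of
# skilled workers in area D against the other areas combined.
from math import sqrt

x_d, n_d = 55, 251      # skilled workers and sample size in area D
x_o, n_o = 335, 1171    # skilled workers and sample size in the other areas

p_d = x_d / n_d
p_o = x_o / n_o
p_pooled = (x_d + x_o) / (n_d + n_o)   # pooled proportion under H0

# Standard error under H0, then the z statistic
se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_d + 1 / n_o))
z = (p_o - p_d) / se

print(round(z, 2))  # 2.16: reject H0 at 5% (1.645) but not at 1% (2.326)
```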


4. We test:
H0 : No association between level of exercise and type of environment
H1 : There is an association between level of exercise and type of environment.
Determining row and column totals, we obtain the expected frequencies as shown
in the following table, along with the calculated χ2 values for each cell.
                          Level of exercise
Environment               High      Low    Total
       Observed            221      256      477
City   Expected         263.56   213.44
       (O1· − E1·)²/E1·   6.87     8.49
       Observed            230      118      348
Town   Expected         192.28   155.72
       (O2· − E2·)²/E2·   7.40     9.14
       Observed            159      120      279
Rural  Expected         154.16   124.84
       (O3· − E3·)²/E3·   0.15     0.19
Total                      610      494    1,104
The test statistic value is:
6.87 + 8.49 + 7.40 + 9.14 + 0.15 + 0.19 = 32.24.
There are (r − 1)(c − 1) = (3 − 1)(2 − 1) = 2 degrees of freedom, and the 5%
critical value is 5.991 using Table 8. Since 5.991 < 32.24, we reject H0 . Testing at
the 1% significance level, the critical value is 9.210. Again, we reject H0 and
conclude that the test is highly significant. There is strong evidence of an
association between level of exercise and type of environment. Looking at the large
contributors to the test statistic value, we see that those living in cities tend not to
exercise much, while those in towns tend to exercise more.

5. We are told that we are supplied with the observed frequencies, i.e. the number of
births in a country for each of the four seasons. To test whether births are evenly
distributed between the four seasons, we would need to calculate the expected
number of births for each season. We would need to know the exact number of days
in each season to be able to calculate the expected frequencies, and hence perform
a goodness-of-fit test of an even distribution. Remember the seasons are of slightly
different lengths reflecting the variable number of days in each month. We would
then test:
H0 : There is a uniform distribution of live births throughout the year.
H1 : There is a non-uniform distribution of live births throughout the year.
Denoting the observed and expected frequencies for season i as Oi and Ei,
respectively, the test statistic is:

Σi (Oi − Ei)²/Ei, for i = 1, 2, 3, 4,

which under H0 follows a χ² distribution with 3 degrees of freedom.
At a 5% significance level, the critical value would be 7.815 and so we would reject
the null hypothesis if the test statistic value exceeded 7.815. We would then test at
a suitable second significance level to determine the strength of the test result.
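The key point, that the expected frequencies must be proportional to the number of days in each season, can be sketched as follows. The birth counts in this Python sketch are HYPOTHETICAL (the problem supplies none); only the season lengths are factual, for a non-leap year.

```python
# Illustrative sketch with HYPOTHETICAL birth counts: expected frequencies
# proportional to the number of days in each season, then the goodness-of-fit
# statistic to compare against chi-squared with 3 degrees of freedom.
births = {"spring": 260, "summer": 240, "autumn": 250, "winter": 250}  # hypothetical
days = {"spring": 92, "summer": 92, "autumn": 91, "winter": 90}        # non-leap year

n = sum(births.values())
year_days = sum(days.values())  # 365

# E_i = n * (days in season i) / 365, so longer seasons expect more births
expected = {s: n * d / year_days for s, d in days.items()}

chi2 = sum((births[s] - expected[s]) ** 2 / expected[s] for s in births)
print(round(chi2, 2), "vs the 5% critical value of 7.815")
```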

Appendix I
Sampling and experimental design

I.1 Worked examples


1. What is the difference between judgemental and convenience sampling? Give
examples of where these techniques may be applied successfully.
Solution:
The main difference is that judgemental sampling by design selects sampling units
based on the judgement of the researcher as opposed to the haphazard method of
convenience sampling.
Judgemental sampling could be used, for example, to try out a new product on
consumers who in the judgement of the researcher are sophisticated, highly
discerning and extremely fussy – the researcher may contend that satisfying such a
group may mean high levels of satisfaction in less discerning groups. The success of
a judgemental sample lies in how knowledgeable the researcher is of the target
population and how well they justify the selection of specific elements. Given these
requirements, judgemental sampling is usually limited to relatively small target
populations and qualitative studies.
Convenience sampling could be used in the piloting of a questionnaire. For
example, imagine a postal questionnaire which is to be administered to retired
women. A group of retired women who may be easy to contact and willing to
discuss problems with the questionnaire and are elements of the target population
would make a good convenience sample.
2. What are the main potential disadvantages of quota sampling relative to random
sampling?
Solution:
Quota sampling may yield biased results. The size of the bias may be hard to
assess because the sampling is not probability based. Empirically, it has been
shown to give inferior results.

3. The simplest random sampling technique is simple random sampling. Give two
reasons why it may be desirable to use a sampling design which is more
sophisticated than simple random sampling.
Solution:
Possible reasons include the following.
• We can estimate parameters with greater precision (stratified sampling).
• It may be more cost effective (cluster sampling).


4. A corporation wants to estimate the total number of worker-hours lost for a given
month because of accidents among its employees. Each employee is classified into
one of three categories – labourer, technician and administrator. Which sampling
method do you think would be preferable here? Justify your choice.

Solution:
We can expect the three groups to be homogeneous with respect to accident rates –
for example, labourers probably have the highest rates. Therefore, stratified
sampling is preferable.

5. Describe how stratified sampling is performed and explain how it differs from quota
sampling.

Solution:
In stratified sampling, the population is divided into strata, natural groupings
within the population, and a simple random sample is taken from each stratum.
Stratified sampling differs from quota sampling in the following ways.
• Stratified sampling is random sampling, whereas quota sampling is
non-random sampling.
• In stratified sampling a sampling frame is required, whereas in quota sampling
pre-chosen frequencies in each category are sought.
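The mechanics of stratified sampling can be sketched in a few lines. This illustrative Python sketch uses HYPOTHETICAL stratum sizes (not from the guide) and draws a simple random sample independently within each stratum.

```python
import random

# Illustrative sketch with HYPOTHETICAL stratum sizes: stratified sampling
# takes a simple random sample independently within each stratum.
random.seed(42)  # for reproducibility

population = {
    "labourer": list(range(600)),       # hypothetical: 600 labourers
    "technician": list(range(300)),     # hypothetical: 300 technicians
    "administrator": list(range(100)),  # hypothetical: 100 administrators
}

def stratified_sample(strata, fraction):
    """Take a simple random sample of the given fraction from each stratum."""
    return {name: random.sample(units, round(len(units) * fraction))
            for name, units in strata.items()}

sample = stratified_sample(population, 0.10)
print({name: len(drawn) for name, drawn in sample.items()})
# {'labourer': 60, 'technician': 30, 'administrator': 10}
```

With a common sampling fraction the design is proportionately stratified; the fractions could instead differ by stratum if some groups need more precise estimates.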

6. Provide a similarity and a difference between one-stage cluster sampling and


two-stage cluster sampling.

Solution:
Similarity: In both one-stage and two-stage cluster sampling, only some clusters are
sampled (at random).
Difference: In one-stage every unit within each selected cluster is sampled, while in
two-stage a random sample is taken from each selected cluster.

7. In the context of sampling, explain the difference between item non-response and
unit non-response.

Solution:
Item non-response occurs when a sampled member fails to respond to a specific
question in a survey. Unit non-response occurs when no information is collected
from a sample member at all.

8. Non-response in surveys is considered to be problematic.


(a) Give two possible reasons why non-response may occur.
(b) Why is non-response problematic for the person or organisation conducting
the research?
(c) How can non-response be reduced in telephone surveys and mail surveys,
respectively?


Solution:
(a) There may be people whom the data collection procedure does not reach, who
refuse to respond, or who are unable to respond. As an example, in telephone
surveys, people may not be at home or may not have time to do the survey. In
a mail survey, the survey form may not reach the addressee.
(b) The non-respondents may be different from respondents, leading to biased
inferences. Additionally, the sample size is effectively reduced, leading to a loss
of precision in estimation.
(c) Numerous calls, concentrating on evenings and weekends. Interviewers with
flexible schedules who can make appointments with interviewees at a time
convenient for them. Cooperation can be enlisted by sending an advance letter,
ensure respondents feel their participation is important and having
well-trained interviewers. Clear layout of survey form, easy response tasks,
send reminders and call respondents by telephone.

9. A research group has designed a survey and finds the costs are greater than the
available budget. Two possible methods of saving money are a sample size
reduction or spending less on interviewers (for example, by providing less
interviewer training or taking on less-experienced interviewers). Discuss the
advantages and disadvantages of these two approaches.
Solution:
A smaller sample size leads to less precision in terms of estimation. Reducing
interviewer training leads to response bias. We can statistically assess the reduction
in precision, but the response bias is much harder to quantify, making it hard to
draw conclusions from the survey. Therefore, the group should probably go for a
smaller sample size, unless this will give them too imprecise results for their
purposes.

10. Readers of the magazine Popular Science were asked to phone in (on a premium
rate number) their responses to the following question: ‘Should the United States
build more fossil fuel-generating plants or the new so-called safe nuclear generators
to meet the energy crisis?’. Of the total call-ins, 86% chose the nuclear option.
Discuss the way the poll was conducted, the question wording, and whether or not
you think the results are a good estimate of the prevailing mood in the country.
Solution:
Both selection bias and response bias may have occurred. Due to the phone-in
method, respondents may well be very different from the general population
(selection bias). Response bias is possible due to the question wording – for
example ‘so-called safe nuclear generators’ calls safety into question. It is probably
hard to say much about the prevailing mood in the country based on the survey.

11. A company producing handheld electronic devices (tablets, mobile phones etc.)
wants to understand how people of different ages rate its products. For this reason,
the company’s management has decided to use a survey of its customers and has
asked you to devise an appropriate random sampling scheme. Outline the key
components of your sampling scheme.


Solution:
Note that there is no single ‘right’ answer. A possible set of ‘ingredients’ of a good
answer is given below.
• Propose stratified sampling since customers of all ages are to be surveyed.
• Sampling frame could be the company’s customer database.
• Take a simple random sample from each stratum.
• An obvious stratification factor is age group, another could be gender.
• Contact method: mail, telephone or email (likely to have all details on the
database).
• Minimise non-response through a suitable incentive, such as a discount off the
next purchase.

12. Explain the difference between an experimental design and an observational study.
Solution:
An experimental design involves manipulating an independent variable to observe
its effects on a dependent variable, establishing cause-and-effect relationships. In
contrast, an observational study observes and analyses existing conditions,
variables, or behaviours without intervention. While experimental designs allow for
greater control, observational studies are valuable for exploring real-world
phenomena, but causation is not easily inferred.

13. What is randomisation in the context of experimental design?


Solution:
Randomisation in experimental design involves assigning participants to different
experimental conditions randomly. This process ensures that each participant has
an equal chance of being in any condition, minimising bias and controlling for
potential confounding variables. Randomisation strengthens the internal validity of
experiments by creating comparable groups, enhancing the reliability of causal
inferences.

I.2 Practice problems


1. Think of at least three lists you could use in your country as a basis for sampling.
Remember, each list must be generally available, be up-to-date, and provide a
reasonable target group for the people you might wish to sample.

2. Discuss the statistical problems you might expect to have in each of the following
situations.
(a) Conducting a census of the population.
(b) Setting up a study of single-parent families.
(c) Establishing future demand for post-compulsory education.


3. Think of three quota controls you might use to make a quota sample of shoppers in
order to ask them about the average amount of money they spend on shopping per
week. Two controls should be easy for your interviewer to identify. How can you
help them with the third control?

4. You have been asked to design a random sample in order to study the way school
children learn in your country. Explain the clusters you might choose, and why.

5. Which method of contact might you use for your questionnaire in each of the
following circumstances?
(a) A random sample of school children about their favourite lessons.
(b) A random sample of households about their expenditure on non-essential
items.
(c) A quota sample of shoppers about shopping expenditure.
(d) A random sample of bank employees about how good their computing facilities
are.
(e) A random sample of the general population about whether they liked
yesterday’s television programmes.

6. You are carrying out a random sample survey of leisure patterns for a holiday
company, and have to decide whether to use interviews at people’s homes and
workplaces, postal (mail) questionnaires, or a telephone survey. Explain which
method of contact you would use, and why.

7. Your government is assessing whether it should change the speed limit on its
motorways or main roads. Several countries in your immediate area have lowered
their limit recently by 10 miles per hour.
What control factors might you use in order to examine the likely effect on road
accidents of a change in your country?
8. List some advantages and disadvantages of a longitudinal survey. Describe how you
would design such a survey if you were aiming to study the changes in people’s use
of health services over a 20-year period. State the target group, survey design, and
frequency of contact. You should give examples of a few of the questions you might
ask.

I.3 Solutions to Practice problems


1. Generally available lists include electoral registers, postal addresses used by the
postal services, lists of schools and companies. Whether they are up-to-date or not
depends on the purpose for which the list was intended. For example, looking at
the kinds of lists mentioned above you would expect there to be some government
regulation about registering electors – every year, or perhaps before each national
election.
The post office will register new buildings by postcode on a continuous basis for
postal addresses. For the lists of schools or companies, you need to check who
would be responsible for them, and why and when they would be updated. Is there a
subscription to pay annually, for example, in order to be a
updated. Is there a subscription to pay annually, for example, in order to be a
member of a school or company association, or is there a government regulation
which means that schools or companies must be registered?
The question of target group also affects the usefulness of your sampling frame. If
you want to know about foreign nationals resident in your country, then the
electoral register is hardly appropriate, nor will it help you if you are interested
primarily in contacting people under voting age.
The list of new buildings will help you get in touch with people living in particular
areas so you may find you stratify by the socio-economic characteristics of the place
rather than the people.
Schools and companies are fine as a contact for people for many purposes, but the
fact that they have been approached through a particular organisation may affect
responses to particular questions. Would you tell interviewers who contacted you in
school time and had got hold of you through your school that you hate school, or
even about what you do after school? Would you reply to a postal questionnaire
sent via your company if you were asked how often you took unnecessary sick leave?
The more you think about all this, the more difficult it can seem! So think of three
lists and then list the ways you could find them useful as a basis for sampling and
the problems you might have in particular circumstances.

2. (a) If the population is very large, then this increases the chances of making
non-sampling errors which reduces the quality of the collected data. For
example, large amounts of data collected may result in more data-entry errors
leading to inaccurate data and hence inaccurate conclusions drawn from any
statistical analysis of the data.
(b) An adequate sampling frame of single-parent families is unlikely to be available
for use. This would make random sampling either impossible, or very
time-consuming if a sampling frame had to be created.
(c) Those who would potentially undertake post-compulsory education in the
future will be ‘young’ today. Data protection laws may prevent us from
accessing lists of school-age children. In addition, young children may not
realise or appreciate the merits of post-compulsory education so their current
opinions on the subject may differ from their future decisions about whether to
continue into further education and higher education.

3. The two quota controls which are relatively easy for your interviewer to identify are
gender and age group. They are also useful controls to use as they help you gain a
representative picture of shopping patterns. We would expect people to buy
different things according to whether they are female (particular perfumes, for
example) or male (special car accessories) or by age group (out-of-season holidays
for older people, or pop concert tickets for younger people) to give trivial examples.
If we are wrong, and older people are into pop concerts or lots of females want to
make their cars sportier, then we would find this out if we have made sure we have
sufficient numbers in each of the controls.


The question of a third control is more tricky. We would like to know about
people’s preferences if they spend more or less money when they shop, or according
to their income group, or depending on how many people they are shopping for
(just themselves, or their family, or perhaps their elderly neighbours). Of course,
the interviewer could ask people how much they spent on shopping last week, or
what their family income is, or how many people they are shopping for. People
might reply to the last of these questions, but may well be unhappy about the
other two – so you get a lot of refusals, and lose your sample!
Even if they do reply, the interviewer will then have to discard some of the
interviews they have started. If everyone interviewed so far is shopping for
themselves, for example, and the quota has been filled, the interviewer will have to
ignore the person they have stopped and go to look for someone else to ask whether
they have shopped for others!
If the aim is to interview people who have bought a lot of things at that time, then
the interviewer could wait for people to leave the store concerned, and stop people
who have bought a lot that day, a medium amount, or a small amount, judging by
how full their shopping trolley is! Or do the same if people are accompanied by
children or not on their shopping expedition. An alternative is to interview at more
than one shopping area and deliberately go to some shops in high-income areas and
some in low-income areas, knowing that most of the people you interview will be in
an income category which matches their surroundings.

4. Here an obvious cluster will be the school. This is because a main influence on how
children learn will be the school they are at – you could give yourself an enormous
amount of unnecessary extra work and expense if you choose a sample of 100
children, say, from 100 different schools – you would in fact have to find out about
each of their schools individually. So it makes sense to sample clusters of schools
themselves. For a similar reason, it would be a good idea to cluster children
according to the study group (class) to which they belong. The only problem with
this is that if you have too few clusters, it may be difficult to separate the effect of
outside influences (the effect of a particular teacher, for example) as opposed to the
actual teaching methods used.

5. There are no single ‘right’ answers to these – you may disagree with the
suggestions, but if you do make sure you can explain why you think what you do! In
addition, the explanation must use the kinds of arguments given in this chapter.
(a) This slightly depends on the age of the children concerned. The younger the
child, the more difficult it will be to elicit a clear written reply. Telephone is
probably not an option (annoying to parents). On the whole, interviews at
school make most sense – though we hope the children’s replies would not be
too influenced by their environment.
(b) The items are non-essential – we need clear records – probably a diary kept at
home (if people will agree to do this) would be best. Random digit dialling
might work, but you would need to catch people while they still remembered
their shopping. Interviewing outside the store is not an option – we have been
told this is a random survey so we will need a list of addresses rather than a
quota approach to people as they shop.

I. Sampling and experimental design

(c) A quota sample is specified, so a face-to-face interview contact method would
be appropriate. Suitable quota controls need to be provided to the interviewer
to carry out the exercise in, say, a shopping centre.
(d) Email might be a good idea – quick and easy – but, of course, if their
computer facilities are really poor, we will only get replies from employees with
state-of-the-art facilities! If the company is supporting this, you could use any
method – an interview might obtain detailed and useful information. If we
only want to know the proportions of employees who have experienced
particular problems with their computers, a postal survey would work – and
the response rate should be good, as they are doing it for their employer!
(e) Here time is important – mail will be too slow. Even a face-to-face survey
might take too long – you need to catch the interviewee at home. So the choice
is the following.
• If random, use random digit dialling and telephone people at home the
day following the TV programmes – everyone will remember them and
you can make your questions short and to the point!
• Alternatively, you may prefer to use a quota survey with the possibility of
showing pictures to illustrate your questions and perhaps ask more
detailed questions. You can also do this straight after the TV programmes
are broadcast.

6. Arguments could be made for and against any of the suggested contact methods.
However, on balance, a postal (mail) questionnaire might be most preferable.
Respondents could then return the survey form at their own convenience and
would not be rushed to respond as would be the case with an interview or
telephone survey. Also, this would be the least expensive and least time-consuming
method to employ. An obvious disadvantage is the likely high non-response rate. A
suitable incentive could be used to help reduce non-response.
7. Possible control factors might be the:
• number of vehicles per capita of population (because the effect of speed limits
will be different the more cars there are being driven)
• number of vehicles per mile (or kilometre) of made-up road (for the same
reason)
• age distribution of the driving population (younger drivers are more likely to
be involved in high-speed accidents; the very old have a high incidence of
low-speed, low-fatality accidents)
• ‘mix’ of road types (on the whole, motorways tend to have fewer (but more
serious) accidents compared with minor roads).
There are a few controls you might like to use. You would not use them all at once
– the data would become too detailed for sensible statistical analysis and some of
the figures you want might not be available. You may think of more – check if the
figures exist for your country!


8. Longitudinal surveys have the following advantages and disadvantages:


• Advantages: Can detect change over time, and are likely to get more accurate
results than with a single cross-sectional survey.
• Disadvantages: Due to attrition there may be a lack of representativeness over
time, and we may experience response bias.
To study people’s use of health services over a 20-year period, we would begin in
year 1 by selecting the study participants. These should be representative of the
population in terms of several characteristics such as gender, age group, general
well-being, diet types, different fitness levels etc. To allow for likely attrition over
20 years, a large number of participants would be required. Appropriate incentives
should be provided to induce people to remain in the study over time.
The participants would be interviewed face-to-face on an annual basis or perhaps
every four or five years, depending on the budget resources available. In order to
detect change, data on the same variables would be needed at each interview stage.
Examples of questions to include might be the following.
• ‘How many times have you visited a doctor in the past 12 months?’
• ‘How long did you have to wait from making an appointment to actually
seeing a doctor?’
• ‘Overall, how satisfied are you with the level of health service provision in your
area?’

Appendix J
Correlation and linear regression

J.1 Worked examples


1. The following table shows the number of salespeople employed by a company and
the corresponding values of sales (in £000s).

Salespeople (x) 210 209 219 225 232 221 220 233 200 215 205 227
Sales (y) 206 200 204 215 222 216 210 218 201 212 204 212

Here are some useful summary statistics:


∑xᵢ = 2,616, ∑yᵢ = 2,520, ∑xᵢyᵢ = 550,069, ∑xᵢ² = 571,500 and ∑yᵢ² = 529,746, where all sums run over i = 1, 2, …, 12.

Compute the sample correlation coefficient for these data.


Solution:
We have:
x̄ = 2,616/12 = 218 and ȳ = 2,520/12 = 210.
Therefore:
r = (∑xᵢyᵢ − n x̄ ȳ) / √((∑xᵢ² − n x̄²)(∑yᵢ² − n ȳ²))
  = (550,069 − 12 × 218 × 210) / √((571,500 − 12 × (218)²)(529,746 − 12 × (210)²))
  = 0.8716.

This indicates a strong, positive linear relationship between salespeople and sales.
The more salespeople employed, the higher the sales.
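As a quick sanity check, the calculation can be reproduced from the summary statistics. The following Python sketch (variable names are our own) mirrors the formula used above:

```python
import math

# Summary statistics quoted in the question (n = 12 pairs)
n = 12
sum_x, sum_y = 2616, 2520
sum_xy, sum_x2, sum_y2 = 550069, 571500, 529746

x_bar, y_bar = sum_x / n, sum_y / n   # 218 and 210

# Corrected sums of squares and cross-products
s_xy = sum_xy - n * x_bar * y_bar     # 709
s_xx = sum_x2 - n * x_bar ** 2        # 1212
s_yy = sum_y2 - n * y_bar ** 2        # 546

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 4))                    # 0.8716
```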

2. State whether the following statements are true or false, explaining your answers.
(a) The sample correlation coefficient between x and y is the same as the sample
correlation coefficient between y and x.
(b) If the slope is negative in a regression equation, then the sample correlation
coefficient between x and y would be negative too.
(c) If two variables have a sample correlation coefficient of −1 they are not related.
(d) A large sample correlation coefficient means the regression line will have a
steep slope.

333
J. Correlation and linear regression

Solution:
(a) True. The sample correlation coefficient is:

r = Sxy / √(Sxx Syy).

Switching x and y does not change r because Sxy = Syx.

(b) True. The sign of the slope and the sign of the correlation coefficient between
x and y are the same. A negative slope indicates a negative correlation
between the variables. Using sample data, the sign is determined by the term:

Sxy = ∑ᵢ₌₁ⁿ xᵢyᵢ − n x̄ ȳ

which is the numerator of β̂₁ and r.


(c) False. A sample correlation coefficient of −1 represents a perfect negative
linear relationship between two measurable variables. A sample correlation
coefficient of zero would indicate that the two variables are not linearly related
(they may be independent or possibly non-linearly related).
(d) False. A large sample correlation coefficient means a strong linear relationship
between two measurable variables. The magnitude of the correlation coefficient
says nothing about how steep the regression line is.

3. Write down the simple linear regression model, explaining each term in the model.
Solution:
The simple linear regression model is:

y = β0 + β1 x + ε

where:
• y is the dependent (or response) variable
• x is the independent (or explanatory) variable
• β0 is the y-intercept
• β1 is the slope of the line
• ε is a random error term.
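To see what the model says in practice, it can be simulated. In the sketch below every numerical value (β₀ = 2, β₁ = 0.5, σ = 1 and the x values) is made up purely for illustration:

```python
import random

random.seed(42)  # reproducible illustration

beta0, beta1, sigma = 2.0, 0.5, 1.0   # illustrative parameter values
x_values = [1, 2, 3, 4, 5]

# Each observed y is the systematic part beta0 + beta1*x plus a
# random error drawn from N(0, sigma^2)
y_values = [beta0 + beta1 * x + random.gauss(0, sigma) for x in x_values]
print(y_values)
```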

4. List the assumptions which we make when using the simple linear regression model
to explain changes in y by changes in x (i.e. the ‘regression of y on x’).
Solution:
We have the following four assumptions.
• A linear relationship between the variables of the form y = β₀ + β₁x + ε.
• The existence of three model parameters: the linear equation parameters β₀
and β₁, and the error term variance, σ².


• Var(εᵢ) = σ² for all i = 1, 2, …, n, i.e. the error term variance is constant and
does not depend on the independent variable.
• The εᵢs are independent and N(0, σ²) distributed for all i = 1, 2, …, n.

5. The table below shows the cost of fire damage for ten fires together with the
corresponding distances of the fires to the nearest fire station.

Distance in miles (x) 4.9 4.5 6.3 3.2 5.0 5.7 4.0 4.3 2.5 5.2
Cost in £000s (y) 31.1 31.1 43.1 22.1 36.2 35.8 25.9 28.0 22.9 33.5

Here are some useful summary statistics:


∑xᵢ = 45.6, ∑yᵢ = 309.7, ∑xᵢyᵢ = 1,475.1, ∑xᵢ² = 219.46 and ∑yᵢ² = 9,973.99, where all sums run over i = 1, 2, …, 10.

Fit a straight line to these data.


Solution:
We have:
x̄ = 45.6/10 = 4.56 and ȳ = 309.7/10 = 30.97.
The slope estimate is:
β̂₁ = (∑xᵢyᵢ − n x̄ ȳ) / (∑xᵢ² − n x̄²) = (1,475.1 − 10 × 4.56 × 30.97) / (219.46 − 10 × (4.56)²) = 5.46.
The intercept estimate is:
β̂₀ = ȳ − β̂₁x̄ = 30.97 − 5.46 × 4.56 = 6.07.
Hence the estimated regression line is ŷ = 6.07 + 5.46x.
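The two estimates can be checked quickly in Python (a sketch with our own variable names; the rounding mirrors the solution's working):

```python
# Summary statistics for the fire damage data (n = 10)
n = 10
sum_x, sum_y, sum_xy, sum_x2 = 45.6, 309.7, 1475.1, 219.46

x_bar, y_bar = sum_x / n, sum_y / n   # 4.56 and 30.97

# Slope, rounded to 2 decimal places as in the worked solution
b1 = round((sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2), 2)
b0 = round(y_bar - b1 * x_bar, 2)     # intercept
print(f"y-hat = {b0} + {b1}x")        # y-hat = 6.07 + 5.46x
```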

6. Consider the following dataset (which should be treated as sample data):

Observation 1 2 3 4 5 6 7 8 9 10
xᵢ 1.1 3.9 2.8 3.2 2.9 4.4 3.4 4.9 2.3 3.8
yᵢ 6.4 17.0 12.8 14.4 13.1 18.7 15.1 20.6 11.0 16.6

Summary statistics for the dataset are:


∑xᵢ = 32.7, ∑yᵢ = 145.7, ∑xᵢyᵢ = 516.19, ∑xᵢ² = 117.57 and ∑yᵢ² = 2,271.39, where all sums run over i = 1, 2, …, 10.

Calculate the estimates of β0 and β1 for the regression of y on x based on the above
sample data.
Solution:
We have:
x̄ = 32.7/10 = 3.27 and ȳ = 145.7/10 = 14.57.
The slope estimate is:
β̂₁ = (∑xᵢyᵢ − n x̄ ȳ) / (∑xᵢ² − n x̄²) = (516.19 − 10 × 3.27 × 14.57) / (117.57 − 10 × (3.27)²) = 3.7356.
The intercept estimate is:
β̂₀ = ȳ − β̂₁x̄ = 14.57 − 3.7356 × 3.27 = 2.3546.
Hence the estimated regression line is ŷ = 2.3546 + 3.7356x.

7. In a study of household expenditure a population was divided into five income
groups with the mean income, x, and the mean expenditure, y, on essential items
recorded (in euros per month). The results are in the following table.

x y
1,000 871
2,000 1,300
3,000 1,760
4,000 2,326
5,000 2,950

Here are some useful summary statistics:


∑xᵢ = 15,000, ∑yᵢ = 9,207, ∑xᵢyᵢ = 32,805,000, ∑xᵢ² = 55,000,000 and ∑yᵢ² = 19,659,017, where all sums run over i = 1, 2, …, 5.

(a) Fit a straight line to the data.


(b) How would you use the fit to predict the percentage of income which
households spend on essential items? Comment on your answer.
Solution:
(a) We have:
x̄ = 15,000/5 = 3,000 and ȳ = 9,207/5 = 1,841.4.
The slope estimate is:
β̂₁ = (∑xᵢyᵢ − n x̄ ȳ) / (∑xᵢ² − n x̄²) = (32,805,000 − 5 × 3,000 × 1,841.4) / (55,000,000 − 5 × (3,000)²) = 0.5184.
The intercept estimate is:
β̂₀ = ȳ − β̂₁x̄ = 1,841.4 − 0.5184 × 3,000 = 286.2.
Hence the estimated regression line is ŷ = 286.2 + 0.5184x.

(b) The percentage of income which households spend on essential items is:

(y/x) × 100%

which is approximated by:

((β̂₀ + β̂₁x)/x) × 100%.

This percentage is decreasing with increasing x, so that poorer households
spend a larger proportion of their income on essentials.
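The decreasing share is easy to verify numerically. A small Python sketch, using the fitted coefficients from part (a) (the helper name is our own):

```python
# Fitted line from part (a): y-hat = 286.2 + 0.5184x (euros per month)
b0, b1 = 286.2, 0.5184

def essential_share(x):
    """Predicted percentage of income spent on essential items."""
    return (b0 + b1 * x) / x * 100

shares = [essential_share(x) for x in (1000, 2000, 3000, 4000, 5000)]
print([round(s, 2) for s in shares])

# The share falls steadily as income rises
assert shares == sorted(shares, reverse=True)
```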

8. The following table shows the number of computers (in 000s), x, produced by a
company each month and the corresponding monthly costs (in £000s), y, for
running its computer maintenance department.

Number of computers (in 000s), x: 7.2 8.1 6.4 7.7 8.2 6.8 7.3 7.8 7.9 8.1
Maintenance costs (in £000s), y: 100 116 98 112 115 103 106 107 112 111

The following statistics can be calculated from the data.


∑xᵢ = 75.5, ∑yᵢ = 1,080, ∑xᵢyᵢ = 8,184.9, ∑xᵢ² = 573.33 and ∑yᵢ² = 116,988, where all sums run over i = 1, 2, …, 10.

(a) Calculate the sample correlation coefficient for computers and maintenance
costs.
(b) Find the best-fitting straight line relating y and x.
(c) Plot the points on a scatter diagram and draw the line of best fit.
(d) Comment on your results. How would you check on the strength of the
relationship you have found?
Solution:
(a) We have:
x̄ = 75.5/10 = 7.55 and ȳ = 1,080/10 = 108.
The sample correlation coefficient is:
r = (∑xᵢyᵢ − n x̄ ȳ) / √((∑xᵢ² − n x̄²)(∑yᵢ² − n ȳ²))
  = (8,184.9 − 10 × 7.55 × 108) / √((573.33 − 10 × (7.55)²)(116,988 − 10 × (108)²))
  = 0.9111.

(b) The slope estimate is:
β̂₁ = (∑xᵢyᵢ − n x̄ ȳ) / (∑xᵢ² − n x̄²) = (8,184.9 − 10 × 7.55 × 108) / (573.33 − 10 × (7.55)²) = 9.349.
The intercept estimate is:
β̂₀ = ȳ − β̂₁x̄ = 108 − 9.349 × 7.55 = 37.415.
Hence the estimated regression line is ŷ = 37.415 + 9.349x.
(c) We have:
[Scatter diagram of maintenance costs (in £000s) against the number of computers (in 000s), with the ten data points plotted and the line of best fit drawn.]

Note that to plot the line you need to compute any two points on the line. For
example, for x = 7 and x = 8 the y-coordinates are determined, respectively,
as:
37.415 + 9.349 × 7 ≈ 103 and 37.415 + 9.349 × 8 ≈ 112
giving points (7, 103) and (8, 112). The line of best fit should be drawn passing
through these two points.
(d) The sample correlation coefficient is close to 1, hence there is a strong, positive
linear relationship between computers and maintenance costs. More computers
means higher monthly maintenance costs. For each additional 1,000
computers, maintenance costs increase by £9,349.
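All three quantities in parts (a) and (b) can be reproduced directly from the raw data. A minimal sketch in Python (names are our own):

```python
import math

# Raw data: computers (in 000s) and maintenance costs (in £000s)
x = [7.2, 8.1, 6.4, 7.7, 8.2, 6.8, 7.3, 7.8, 7.9, 8.1]
y = [100, 116, 98, 112, 115, 103, 106, 107, 112, 111]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(a * a for a in x) - n * x_bar ** 2
s_yy = sum(b * b for b in y) - n * y_bar ** 2

r = s_xy / math.sqrt(s_xx * s_yy)   # sample correlation coefficient
b1 = s_xy / s_xx                    # slope estimate
b0 = y_bar - b1 * x_bar             # intercept estimate
print(round(r, 4), round(b1, 3), round(b0, 2))
```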

9. The examination marks of thirteen students in two statistics papers (a foundation
paper, Paper I, and a more advanced paper, Paper II) were as follows:
Paper I (x) 65 73 42 52 84 60 70 79 60 83 57 77 54
Paper II (y) 78 88 60 73 92 77 84 89 70 99 73 88 70
Useful summary statistics for these data are:
∑xᵢ = 856, ∑xᵢ² = 58,402, ∑yᵢ = 1,041, ∑yᵢ² = 84,801 and ∑xᵢyᵢ = 70,203, where all sums run over i = 1, 2, …, 13.


(a) Calculate the sample correlation coefficient and comment on its value.
(b) Determine the line of best fit of y on x.
(c) Plot the points on a scatter diagram and draw the line of best fit.
(d) If you were told a student achieved a mark of 68 in Paper I, what would you
predict for the student’s mark in Paper II? Would you trust this prediction?
Solution:
(a) We have:
x̄ = 856/13 = 65.846 and ȳ = 1,041/13 = 80.077.
Therefore:
r = (∑xᵢyᵢ − n x̄ ȳ) / √((∑xᵢ² − n x̄²)(∑yᵢ² − n ȳ²))
  = (70,203 − 13 × 65.846 × 80.077) / √((58,402 − 13 × (65.846)²)(84,801 − 13 × (80.077)²))
  = 0.9671.
This indicates a very strong, positive linear relationship between Paper I and
Paper II examination marks. Assuming these papers are sat in order, the
higher the examination mark in Paper I, the higher the examination mark in
Paper II.
(b) The slope estimate is:
β̂₁ = (∑xᵢyᵢ − n x̄ ȳ) / (∑xᵢ² − n x̄²) = (70,203 − 13 × 65.846 × 80.077) / (58,402 − 13 × (65.846)²) = 0.8132.
The intercept estimate is:
β̂₀ = ȳ − β̂₁x̄ = 80.077 − 0.8132 × 65.846 = 26.5310.
Hence the estimated regression line is ŷ = 26.5310 + 0.8132x.
(c) We have:
[Scatter diagram of Paper II examination marks against Paper I examination marks, with the thirteen data points plotted and the line of best fit drawn.]


Note that to plot the line you need to compute any two points on the line. For
example, for x = 50 and x = 70 the y-coordinates are determined, respectively,
as:
26.5310 + 0.8132 × 50 ≈ 67 and 26.5310 + 0.8132 × 70 ≈ 83
giving points (50, 67) and (70, 83). The line of best fit should be drawn passing
through these two points.
(d) We have:
ŷ = 26.5310 + 0.8132 × 68 = 81.8286.
Since examination marks are integers, we round this to 82.
The Paper I mark of 68 is within the range of our x data, hence this is
interpolation so we trust the accuracy of the prediction.
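The prediction follows directly from the fitted line. A quick check (the coefficients are the rounded values from part (b)):

```python
# Fitted line from part (b): y-hat = 26.5310 + 0.8132x
b0, b1 = 26.5310, 0.8132

paper1_mark = 68                     # inside the observed range 42-84: interpolation
predicted = b0 + b1 * paper1_mark    # 81.8286
print(round(predicted))              # 82, since marks are integers
```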

10. The following table, published in USA Today, lists divorce rates and mobility rates
for different regions of the USA. The divorce rate is measured as the annual
number of divorces per 1,000 population, and the mobility rate is the percentage of
people living in a different household from five years before.
Region Mobility rate (x variable) Divorce rate (y variable)
New England 41 4.0
Middle Atlantic 37 3.4
East North Central 44 5.1
West North Central 46 4.6
South Atlantic 47 5.6
East South Central 44 6.0
West South Central 50 6.5
Mountain 57 7.6
Pacific 56 5.9
Summary statistics for the dataset are:
∑xᵢ = 422, ∑xᵢ² = 20,132, ∑yᵢ = 48.7, ∑yᵢ² = 276.91 and ∑xᵢyᵢ = 2,341.6, where all sums run over i = 1, 2, …, 9.

(a) Calculate the sample correlation coefficient and comment on its value.
(b) Calculate the regression equation.
(c) Plot the points on a scatter diagram and draw the line of best fit.
(d) Compute the expected divorce rate if the mobility rate is 40.
(e) Why might it be reasonable to use the divorce rate as the y variable?
Solution:
(a) The sample correlation coefficient is:
r = (∑xᵢyᵢ − n x̄ ȳ) / √((∑xᵢ² − n x̄²)(∑yᵢ² − n ȳ²))
  = (2,341.6 − 9 × 46.89 × 5.41) / √((20,132 − 9 × (46.89)²)(276.91 − 9 × (5.41)²))
  = 0.8552

which suggests a strong, positive linear relationship between divorce rate and
mobility rate (as seen from the scatter diagram in part (c)).

(b) The slope estimate is:
β̂₁ = (∑xᵢyᵢ − n x̄ ȳ) / (∑xᵢ² − n x̄²) = 0.1685.
The intercept estimate is:
β̂₀ = ȳ − β̂₁x̄ = −2.4893.
Hence the estimated regression line is ŷ = −2.4893 + 0.1685x.

(c) We have:
[Scatter diagram of divorce rate (per 1,000 population) against mobility rate, with the nine data points plotted and the line of best fit drawn.]
Note that to plot the line you need to compute any two points on the line. For
example, for x = 45 and x = 50 the y-coordinates are determined, respectively,
as:
−2.4893 + 0.1685 × 45 ≈ 5 and − 2.4893 + 0.1685 × 50 ≈ 6

giving points (45, 5) and (50, 6). The line of best fit should be drawn passing
through these two points.

(d) For x = 40, the expected divorce rate is −2.4893 + 0.1685 × 40 = 4.25 per
1,000 population.

(e) The use of the divorce rate as the dependent variable is reasonable due to the
likely disruptive effect moving home may have on relationships (any reasonable
argument would be acceptable here).
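As a check, the whole of parts (a), (b) and (d) can be reproduced from the raw data in a few lines of Python (a sketch; names are our own):

```python
import math

# Mobility rate (x) and divorce rate (y) for the nine US regions
x = [41, 37, 44, 46, 47, 44, 50, 57, 56]
y = [4.0, 3.4, 5.1, 4.6, 5.6, 6.0, 6.5, 7.6, 5.9]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(a * a for a in x) - n * x_bar ** 2
s_yy = sum(b * b for b in y) - n * y_bar ** 2

r = s_xy / math.sqrt(s_xx * s_yy)   # part (a)
b1 = s_xy / s_xx                    # part (b): slope
b0 = y_bar - b1 * x_bar             # part (b): intercept
prediction = b0 + b1 * 40           # part (d): divorce rate at mobility rate 40
print(round(r, 4), round(prediction, 2))
```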


J.2 Practice problems


1. Sketch a scatter diagram for each of the following situations:
(a) r = 1
(b) r is strong and positive
(c) r is moderate and negative
(d) r = 0 with no relationship between the variables
(e) r is strong and positive, but with a possible curvature between the variables.
(f) r = 0 with a clear non-linear relationship between the variables.

2. Think of an example where you feel the correlation is clearly spurious (that is,
there is a correlation, but no causal connection) and explain how it might arise.
Also, think of a ‘clear’ correlation and the circumstances in which you might accept
causality.

3. Work out β̂₀ and β̂₁ in Example 10.7 using advertising costs as the dependent
variable, and sales as the independent variable. Now predict advertising costs when
sales are £460,000.
Make sure you understand how and why your results are different from Example
10.7!

4. Try to think of a likely linear relationship between x and y which would probably
work over some of the data, but then break down like that in the anthropologist
case in Example 10.8. This should make sure you understand the difference
between interpolation and extrapolation.

5. The following data were recorded during an investigation into the effect of fertiliser
in g/m2 , x, on crop yields in kg/ha, y.

Crop yields (kg/ha) 160 168 176 179 183 186 189 186 184
Fertiliser (g/m2 ) 0 1 2 3 4 5 6 7 8
Here are some useful summary statistics:
∑xᵢ = 36, ∑yᵢ = 1,611, ∑xᵢyᵢ = 6,627, ∑xᵢ² = 204 and ∑yᵢ² = 289,099, where all sums run over i = 1, 2, …, 9.

(a) Plot the data and comment on the appropriateness of using the simple linear
regression model.
(b) Calculate a least squares regression line for the data.
(c) Predict the crop yield for 3.5 g/m2 of fertiliser.
(d) Would you feel confident predicting a crop yield for 10 g/m2 of fertiliser?
Briefly justify your answer.


J.3 Solutions to Practice problems


1. The following scatter diagrams satisfy each respective requirement.

2. Here you should think about two things which have risen or fallen over time
together, but have no obvious connection. Examples might be the number of
examination successes and the number of films shown per year in a town.
A ‘clear’ correlation might be recovery rates from a particular illness in relation to
the amount of medicine given. You might accept this correlation as perhaps being
causal ‘other things equal’ if everyone who got the disease were given medicine as
soon as it was diagnosed and also if recovery began only after the medicine was
administered.

3. We now have:
β̂₁ = (12 × 191,325 − 410 × 5,445) / (12 × 2,512,925 − (5,445)²) = 0.1251
and:
β̂₀ = (410 − 0.1251 × 5,445) / 12 = −22.5975.
So the regression equation, if we decide that advertising costs depend on sales, is:

ŷ = −22.5975 + 0.1251x.

We are assuming that as sales rise, the company concerned spends more on
advertising. When sales are £460,000 we get predicted advertising costs of:

ŷ = −22.5975 + 0.1251 × 460 = 34.9485

i.e. £34,948.50.
Note that the xs and the ys were given in thousands, so be careful over the units of
measurement!
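The working can be checked in Python using the same summary statistics (n = 12, with x = sales and y = advertising costs, both in £000s); the rounding mirrors the solution:

```python
# Summary statistics with x = sales and y = advertising costs (both in £000s)
n = 12
sum_x, sum_y, sum_xy, sum_x2 = 5445, 410, 191325, 2512925

# Slope and intercept from the raw-sum form of the least squares formulas
b1 = round((n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2), 4)
b0 = round((sum_y - b1 * sum_x) / n, 4)
predicted = b0 + b1 * 460            # sales of £460,000
print(b1, b0, round(predicted, 4))   # 0.1251 -22.5975 34.9485
```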


4. An example might be an equation linking national income as an independent
variable to the percentage of employed persons as the dependent variable. Clearly
it is impossible to employ more than 100% of the population!

5. (a) We have:
[Scatter diagram of crop yields (kg/ha) against fertiliser (g/m²), with the nine data points plotted.]
Fitting a linear model seems reasonable, although there is a strong hint of
non-linearity.
(b) We have:
x̄ = 36/9 = 4 and ȳ = 1,611/9 = 179.
The slope estimate is:
β̂₁ = (∑xᵢyᵢ − n x̄ ȳ) / (∑xᵢ² − n x̄²) = (6,627 − 9 × 4 × 179) / (204 − 9 × 4²) = 3.05.
The intercept estimate is:
β̂₀ = ȳ − β̂₁x̄ = 179 − 3.05 × 4 = 166.8.
Hence the estimated regression line is ŷ = 166.8 + 3.05x.
(c) When x = 3.5, the predicted crop yield is:
ŷ = 166.8 + 3.05 × 3.5 = 177.475 kg/ha.
(d) No, we would not feel confident since this would be clear extrapolation. The x
data values do not exceed 8, and the scatter diagram suggests that fertiliser
values above 7 g/m² may actually have a negative effect on crop yield.
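Parts (b) and (c) can be verified with a short calculation (a Python sketch using the summary statistics; names are our own):

```python
# Fertiliser (x, g/m^2) and crop yield (y, kg/ha) summary statistics (n = 9)
n = 9
sum_x, sum_y, sum_xy, sum_x2 = 36, 1611, 6627, 204

x_bar, y_bar = sum_x / n, sum_y / n              # 4 and 179
b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)   # slope: 3.05
b0 = y_bar - b1 * x_bar                          # intercept: 166.8
yield_at_3_5 = b0 + b1 * 3.5                     # interpolation at x = 3.5
print(b1, b0, yield_at_3_5)
```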

Appendix K
Examination formula sheet

Expected value of a discrete random variable:
μ = E(X) = ∑_{i=1}^{N} pᵢxᵢ

Standard deviation of a discrete random variable:
σ = √σ² = √(∑_{i=1}^{N} pᵢ(xᵢ − μ)²)

The transformation formula for standardisation:
Z = (X − μ)/σ

Finding Z for the sampling distribution of the sample mean:
Z = (X̄ − μ)/(σ/√n)

Finding Z for the sampling distribution of the sample proportion:
Z = (P − π)/√(π(1 − π)/n)

Confidence interval endpoints for a single mean (σ known):
x̄ ± z_{α/2} × σ/√n

Confidence interval endpoints for a single mean (σ unknown):
x̄ ± t_{α/2, n−1} × s/√n

Confidence interval endpoints for a single proportion:
p ± z_{α/2} × √(p(1 − p)/n)

Sample size determination for estimating a population mean:
n ≥ (z_{α/2})² σ² / e²

Sample size determination for estimating a population proportion:
n ≥ (z_{α/2})² p(1 − p) / e²

z test of hypothesis for a single mean (σ known):
Z = (X̄ − μ₀)/(σ/√n)

t test of hypothesis for a single mean (σ unknown):
T = (X̄ − μ₀)/(S/√n)

z test of hypothesis for a single proportion:
Z ≈ (P − π₀)/√(π₀(1 − π₀)/n)

z test for the difference between two means (variances known):
Z = (X̄₁ − X̄₂ − (μ₁ − μ₂))/√(σ₁²/n₁ + σ₂²/n₂)

t test for the difference between two means (variances unknown):
T = (X̄₁ − X̄₂ − (μ₁ − μ₂))/√(S_p²(1/n₁ + 1/n₂))

Confidence interval endpoints for the difference between two means:
x̄₁ − x̄₂ ± t_{α/2, n₁+n₂−2} × √(s_p²(1/n₁ + 1/n₂))

Pooled variance estimator when assuming equal variances:
S_p² = ((n₁ − 1)S₁² + (n₂ − 1)S₂²)/(n₁ + n₂ − 2)

t test for the difference in means in paired samples:
T = (X̄_d − μ_d)/(S_d/√n)

Confidence interval endpoints for the difference in means in paired samples:
x̄_d ± t_{α/2, n−1} × s_d/√n

z test for the difference between two proportions:
Z = (P₁ − P₂ − (π₁ − π₂))/√(P(1 − P)(1/n₁ + 1/n₂))

Pooled proportion estimator when assuming equal proportions:
P = (R₁ + R₂)/(n₁ + n₂)

Confidence interval endpoints for the difference between two proportions:
p₁ − p₂ ± z_{α/2} × √(p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂)

Chi-squared test statistic for tests of association:
∑_{i=1}^{r} ∑_{j=1}^{c} (Oᵢⱼ − Eᵢⱼ)²/Eᵢⱼ

Sample correlation coefficient:
r = (∑_{i=1}^{n} xᵢyᵢ − n x̄ ȳ)/√((∑_{i=1}^{n} xᵢ² − n x̄²)(∑_{i=1}^{n} yᵢ² − n ȳ²))

Spearman rank correlation:
r_s = 1 − (6 ∑_{i=1}^{n} dᵢ²)/(n(n² − 1))

Simple linear regression line estimates:
β̂₀ = ȳ − β̂₁x̄ and β̂₁ = (∑_{i=1}^{n} xᵢyᵢ − n x̄ ȳ)/(∑_{i=1}^{n} xᵢ² − n x̄²)

Appendix L
Sample examination paper

Candidates should answer THREE questions: all parts of Section A (50 marks in total)
and TWO questions from Section B (25 marks each).

Section A
Answer all parts of question 1 (50 marks in total).

1. (a) Suppose that x₁ = −45, x₂ = 25, x₃ = 3, and y₁ = −5, y₂ = 4, y₃ = 3.
Calculate the following quantities:

i. ∑_{i=1}^{2} 1/yᵢ   ii. ∑_{i=1}^{3} |xᵢyᵢ|   iii. √(−x₁) − ∑_{i=2}^{3} xᵢ²yᵢ².

(6 marks)
(b) Classify each one of the following variables as either measurable (continuous)
or categorical. If a variable is categorical, further classify it as either nominal
or ordinal. Justify your answer. (No marks will be awarded without a
justification.)
i. Types of musical instrument.
ii. Interest rates set by a central bank.
iii. Finishing position of athletes in a sprint race.
(6 marks)
(c) State whether the following are true or false and give a brief explanation. (No
marks will be awarded for a simple true/false answer.)
i. If A and B are mutually exclusive events with P (A) > 0 and P (B) > 0,
then P (A | B) = P (A).
ii. The probability that a normal random variable is less than one standard
deviation from its mean is 95%.
iii. If a 90% confidence interval for π is (0.412, 0.428), then this means that
there is a 90% probability that 0.412 < π < 0.428.
iv. A hypothesis test which is not significant at the 10% significance level can
be significant at the 1% significance level.
v. If the value of β̂₁ in a simple linear regression is −0.1, then the variables x
and y must have a weak, negative sample correlation coefficient.
(10 marks)


(d) In your own words, define the term non-response bias and provide a real-world
example.
(4 marks)
(e) A machine fills bottles with water. The amount that the machine delivers is
normally distributed, with a mean of 1,000 cm³ and a variance of σ². 10% of
filled bottles are known to have less than 995 cm³ of water.
i. Calculate σ to three decimal places.
(3 marks)

ii. A bottle overflows if it is filled with more than 1,010 cm³ of water.
Calculate the probability that a bottle overflows, given that the machine
fills it with at least 1,005 cm³ of water. Express your answer in terms of Φ,
the cumulative distribution function of the standard normal distribution.
(4 marks)

iii. A random sample of 20 bottles is measured, and the mean of this sample
is required. Calculate the probability that this sample mean is less than
1,001 cm³ to four decimal places.
(4 marks)
(f) The probability distribution of a random variable X is given below.

X=x −5 −1 1 5
P (X = x) 3c 2c 2c 3c

i. Explain why c = 0.10.


(1 mark)
ii. Calculate the expected value of X and also determine the median of X,
briefly explaining your reasoning.
(3 marks)
 
iii. Calculate P(|X| < 4 | X ≥ −1) to four decimal places.

(3 marks)
(g) You wish to estimate a population mean, µ, and know to use the following
formula to determine the sample size:

n ≥ (z_{α/2})² σ² / e².

Briefly explain how you would decide numerical values for:

i. z_{α/2}   ii. σ²   iii. e.

(6 marks)

Section B
Answer two out of the three questions from this section (25 marks each).

2. (a) A survey was conducted in order to examine potential differences of opinion
regarding a new policy on national insurance contributions in 3 major cities of
the United Kingdom (London, Glasgow, Liverpool). The responses were
measured on a binary scale (in favour, against) and are summarised in the
table below.
London Glasgow Liverpool
In favour 56 34 28
Against 44 46 42

i. Based on the data in the table, and without conducting any significance
test, would you say there is an association between public opinion on the
new policy and the city of residence? Provide a brief justification for your
answer.
ii. Calculate the chi-squared statistic for the hypothesis of independence
between public opinion on the new policy and the city of residence, and
test that hypothesis at two appropriate levels. What do you conclude?
(13 marks)
(b) You work for a market research company and your boss has asked you to carry
out a random sample survey for a mobile phone company to identify whether a
recently launched mobile phone is attractive to younger people. Limited time
and money resources are available at your disposal. You are being asked to
prepare a brief summary containing the items below in no more than three
sentences for each of them.
i. Choose an appropriate probability sampling scheme. Provide a brief
justification for your answer.
ii. Describe the sampling frame and the method of contact you will use.
Briefly explain the reasons for your choices.
iii. Provide an example in which response bias may occur. State an action
that you would take to address this issue.
iv. State the main research question of the survey. Identify the variables
associated with this question.
(12 marks)

349
L. Sample examination paper

3. (a) A study was conducted to determine whether smoking is associated with
alcohol consumption. The data in the table below provide the number of
cigarettes smoked per day (y) and the number of alcohol units consumed per
week (x) for 9 randomly selected people.
Participant A B C D E F G H I
x 5 7.5 5 7 8 3 2 8 11
y 10 20 15 17 25 5 2 13 30
The summary statistics for these data are:
Sum of x data: 56.5 Sum of the squares of x data: 417.25
Sum of y data: 137 Sum of the squares of y data: 2,737
Sum of the products of x and y data: 1,047

i. Draw a scatter diagram of these data. Carefully label the diagram.


ii. Calculate the sample correlation coefficient. Interpret its value.
iii. Calculate and report the least squares line of y on x. Draw the line on the
scatter diagram.
iv. Based on the regression model above, how many cigarettes per day would
you expect for someone consuming on average 6 alcohol units per week?
Would you trust this value? Justify your answer.
(13 marks)
(b) A survey was conducted in order to compare the number of hours in the office
per day between male and female employees in a big company. A random
sample was drawn consisting of various employees in the company and the
average number of hours in the office per day was recorded. The data are
summarised in the following table:

              Sample size   Average hours in the office   Sample standard deviation
Males             41                   9.0                         1.9
Females           29                   7.5                         1.1

i. Use an appropriate hypothesis test to determine whether there is a
difference in the mean hours in the office between men and women. State
clearly the hypotheses, the test statistic and its distribution under the null
hypothesis, and carry out the test at two appropriate significance levels.
Comment on your findings.
ii. State clearly any assumptions you made in (b) part i.

iii. Give a 98% confidence interval for the mean hours in the office for women.
(12 marks)

4. (a) The data below are the exam marks of 30 students in a particular course.

42 44 45 45 47
47 48 52 53 54
55 55 56 56 57
58 59 60 62 63
63 64 64 65 66
66 67 73 95 98

i. Find the mean, the median and the interquartile range of the data above.
It is given that the sum of the data is 1,779.
ii. Carefully construct, draw and label a boxplot of these data.
iii. Comment on the data, given the shape of the boxplot and the measures
which you have calculated.
(12 marks)
(b) i. A doctor is conducting an experiment to test whether a new treatment for
a disease is effective. In this context, a treatment is considered effective if
it is successful with a probability of more than 0.50. The treatment was
applied to 40 randomly sampled patients and it was successful for 26 of
them. You are asked to use an appropriate hypothesis test to determine
whether the treatment is effective in general. State the test hypotheses,
and specify your test statistic and its distribution under the null
hypothesis. Comment on your findings.
ii. A second experiment followed where a placebo pill was given to another
group of 30 randomly sampled patients. A placebo pill contains no
medication and is prescribed so that the patient will expect to get well. In
some situations, this expectation is enough for the patient to recover. This
effect, also known as the placebo effect, occurred in the second experiment
where 16 patients recovered. You are asked to consider an appropriate
hypothesis test to incorporate this new evidence with the previous data
and reassess the effectiveness of the new treatment.
(13 marks)

[END OF PAPER]

Appendix M
Sample examination paper –
Solutions

1. (a) i. We have:

∑_{i=1}^{2} 1/y_i = −1/5 + 1/4 = 1/20 = 0.05.

ii. We have:

∑_{i=1}^{3} |√(x_i y_i)| = |√(−45 × −5)| + |√(25 × 4)| + |√(3 × 3)| = 15 + 10 + 3 = 28.

iii. We have:

−x_1 − ∑_{i=2}^{3} x_i² y_i² = −(−45) − (25² × 4²) − (3² × 3²) = 45 − 10,000 − 81 = −10,036.

(b) i. Categorical, nominal. No sense of ordering of musical instruments.

ii. Measurable. Interest rates can be measured in percentage points (or basis
points) to several decimal places.

iii. Categorical, ordinal. Finishing position is in rank order, i.e. 1st, 2nd, 3rd
etc.

(c) i. False. Since A and B are mutually exclusive, then P(A ∩ B) = 0. Since
P(A | B) = P(A ∩ B)/P(B), then P(A | B) = 0, but we are told P(A) > 0.

ii. False. The probability that a normal random variable is less than one
standard deviation from its mean is ≈ 68%.

iii. False. 90% of the time the computed confidence interval covers π. As π is
unknown, it either does or does not fall in the interval (0.412, 0.428).

iv. False. If a hypothesis test is not significant at the 10% significance level,
then it cannot be significant at any lower percentage significance level.
Alternatively, this could be illustrated with a suitable diagram or with
p-values.
v. False. Since β̂₁ = −0.1 < 0, the correlation is negative, but we
cannot infer the strength of the correlation from the slope alone.


(d) Non-response bias occurs in survey research when individuals chosen to
participate in a study do not respond, and their non-participation introduces a
systematic error into the results. This bias can affect the generalisability of
findings to the larger population because those who choose not to respond may
differ in important ways from those who do respond.
Imagine a researcher conducts a survey on public opinion about a new
government policy by randomly selecting households to participate. However,
only 60% of the selected households respond to the survey. If the households
that did not respond have different opinions on the policy compared to those
who did respond, there is a non-response bias.

(e) i. Let X ∼ N(1,000, σ²). We have that:

P(X < 995) = P(Z < (995 − 1,000)/σ) = 0.10.

Hence:

−5/σ = −1.282  ⇒  σ = 3.900 cm³.

In the above, it is important to at least get the correct probability
expression, then identify the correct z-value and finally solve for σ.
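As a numerical check of part i. (a sketch using Python's standard library; the solution's 3.900 comes from the rounded table value −1.282):

```python
from statistics import NormalDist

# Solve P(X < 995) = 0.10 for sigma, where X ~ N(1000, sigma^2).
z = NormalDist().inv_cdf(0.10)    # lower-tail z-value, about -1.2816
sigma = (995 - 1000) / z          # from (995 - 1000)/sigma = z

print(round(sigma, 2))            # about 3.90
```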

ii. We require:

P(X > 1,010 | X > 1,005) = P({X > 1,010} ∩ {X > 1,005}) / P(X > 1,005)
                         = P(X > 1,010) / P(X > 1,005)
                         = (1 − P(X ≤ 1,010)) / (1 − P(X ≤ 1,005))
                         = (1 − Φ((1,010 − 1,000)/3.900)) / (1 − Φ((1,005 − 1,000)/3.900))
                         = (1 − Φ(2.564)) / (1 − Φ(1.282)).

In the above it is essential to correctly use the conditional probability
formula. The remaining steps involve simplifying the numerator, using
standardisation and, finally, answering in terms of Φ as requested in the
question.
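The final expression in terms of Φ can be evaluated numerically (a sketch; it reuses σ = 3.900 from part i.):

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal CDF

# P(X > 1010 | X > 1005) = (1 - Phi(2.564)) / (1 - Phi(1.282)), with sigma = 3.900.
num = 1 - Phi((1010 - 1000) / 3.900)
den = 1 - Phi((1005 - 1000) / 3.900)
prob = num / den

print(round(prob, 3))   # about 0.052
```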

iii. We have X̄ ∼ N(1,000, (3.900)²/20). Hence:

P(X̄ < 1,001) = P(Z < (1,001 − 1,000)/(3.900/√20)) = P(Z < 1.15) = 0.8749.

It is essential to make correct use of the sampling distribution in the
above.

(f) i. We require:
∑_x p(x) = 3c + 2c + 2c + 3c = 1  ⇒  c = 0.10.

ii. We have:
E(X) = ∑_x x p(x) = (−5 × 0.30) + (−1 × 0.20) + (1 × 0.20) + (5 × 0.30) = 0

and since the distribution of X is symmetric, the median is also 0.


iii. We have:

P(|X| < 4 | X ≥ −1) = P({−4 < X < 4} ∩ {X ≥ −1}) / P(X ≥ −1)
                    = P(−1 ≤ X < 4) / P(X ≥ −1)
                    = (0.20 + 0.20) / 0.70
                    = 0.5714.
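Parts i.–iii. can be verified with a short sketch (the support {−5, −1, 1, 5} and the weights 3c, 2c, 2c, 3c are taken from the working above):

```python
# Probability function p(x), proportional to the weights below.
weights = {-5: 3, -1: 2, 1: 2, 5: 3}
c = 1 / sum(weights.values())               # probabilities must sum to 1
p = {x: w * c for x, w in weights.items()}

mu = sum(x * px for x, px in p.items())     # E(X)

# P(|X| < 4 | X >= -1) = P(-1 <= X < 4) / P(X >= -1)
num = sum(px for x, px in p.items() if -1 <= x < 4)
den = sum(px for x, px in p.items() if x >= -1)

print(c, round(mu, 10), round(num / den, 4))   # 0.1 0.0 0.5714
```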

(g) i. z_{α/2} would depend on the level of confidence we required. A default choice
would be 95%, i.e. z_{α/2} = 1.96, but a case could be made for other
confidence levels.
ii. σ² could be an assumed value of the population variance. Alternatively,
use a pilot study and use the sample variance, s², to estimate σ².
iii. e would be our desired tolerance on estimation error, whose value would
depend on our requirements. Any sensible value would be acceptable.

2. (a) i. An example of a ‘good’ answer is given below:


There are some differences in public opinion on the new policy and the
city of residence. More specifically, 56% of those who are in London are in
favour, in contrast with only 40% of those in Liverpool. Hence there seems
to be an association although this needs to be investigated further.
ii. H0: No association between public opinion on the new policy and the
city of residence vs. H1: Association between public opinion on the new
policy and the city of residence.
It is essential to calculate the expected values, which are shown below:
             London   Glasgow   Liverpool
In favour     47.2      37.8       33.0
Against       52.8      42.2       37.0
The test statistic formula is:

∑_{i,j} (O_ij − E_ij)² / E_ij

which gives a value of 5.273. This is a 2 × 3 contingency table so the
degrees of freedom are (2 − 1)(3 − 1) = 2.


For α = 0.05, the critical value is 5.991, hence we do not reject H0 . For
α = 0.10, the critical value is 4.605, hence we reject H0 . There is weak
evidence of an association between public opinion on the new policy and
the city of residence.
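The arithmetic can be sketched in Python. The observed counts below are reconstructed from the figures quoted in part i. (56 of 100 London respondents in favour, 28 of 70 in Liverpool, and 34 of 80 in Glasgow, consistent with the expected counts above); they are an assumption, not given directly here, though they reproduce the stated statistic:

```python
# Observed counts (rows: in favour / against; columns: London, Glasgow, Liverpool).
observed = [[56, 34, 28],
            [44, 46, 42]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence: row total x column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_sq = sum((o - e) ** 2 / e
             for o_row, e_row in zip(observed, expected)
             for o, e in zip(o_row, e_row))

print(round(chi_sq, 3))   # 5.273, with (2 - 1)(3 - 1) = 2 degrees of freedom
```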
(b) General instructions: we are asked for accuracy and random (probability)
sampling, so this implies using some kind of a list.
i. Stratified random sampling is appropriate here due to the accuracy
requirement.
ii. An example answer is given below:
• Use list provided by the mobile phone company to identify those who
bought the recently launched mobile phone.
• List could be postal address, phone or email.
• Stratify by age group or by gender of buyer.
• Explanation as to which you would prefer. For example, email is fast
if all have it but there may be no response.
iii. If anonymity was not guaranteed, then response bias may occur. Ensuring
anonymity is conveyed to respondents should mitigate this.
iv. Main research question: How does the new mobile phone compare with
previous models?
Associated variables: mobile phone model, and a measure of consumer
preference.
3. (a) i. A reasonable scatter diagram is shown below:

ii. The summary statistics can be substituted into the formula for the sample
correlation coefficient to obtain the value 0.9260. The higher the alcohol
consumption, the higher the cigarette consumption. The fact that the
value is close to 1 suggests that this is a strong, positive linear
relationship.
iii. The regression line can be written as:

ŷ = β̂₀ + β̂₁x  or  y = β₀ + β₁x + ε.

The formula for β̂₁ is:

β̂₁ = (∑ x_i y_i − n x̄ ȳ) / (∑ x_i² − n x̄²)

and by substituting the summary statistics we get β̂₁ = 2.988. The
formula for β̂₀ is:

β̂₀ = ȳ − β̂₁ x̄

so we get β̂₀ = −3.539.
Hence the regression line can be written as:

ŷ = −3.539 + 2.988x  or  y = −3.539 + 2.988x + ε.

It should also be plotted on the scatter diagram as shown in part (a).


iv. The expected number of cigarettes is ŷ = −3.539 + 2.988 × 6 ≈ 14.39
per day. This prediction is reasonable, since x = 6 is within the range of
the x data, hence this is interpolation.
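The correlation, the least squares coefficients and the prediction all follow from the summary statistics; a sketch:

```python
import math

# Summary statistics given in the question (n = 9).
n, sum_x, sum_x2 = 9, 56.5, 417.25
sum_y, sum_y2, sum_xy = 137, 2737, 1047

x_bar, y_bar = sum_x / n, sum_y / n
s_xy = sum_xy - n * x_bar * y_bar   # corrected sum of cross-products
s_xx = sum_x2 - n * x_bar ** 2      # corrected sum of squares of x
s_yy = sum_y2 - n * y_bar ** 2      # corrected sum of squares of y

r = s_xy / math.sqrt(s_xx * s_yy)   # sample correlation coefficient
b1 = s_xy / s_xx                    # least squares slope
b0 = y_bar - b1 * x_bar             # least squares intercept
y_hat = b0 + b1 * 6                 # predicted cigarettes per day at x = 6

print(round(r, 4), round(b1, 3), round(b0, 3), round(y_hat, 2))
# 0.926 2.988 -3.539 14.39
```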
(b) i. The working of the exercise is shown below, with µ1 denoting the mean
hours in the office for males and µ2 the mean hours in the office for
females.
• H0: µ1 = µ2 vs. H1: µ1 ≠ µ2.
• Test statistic value: 3.8180 (if equal variances assumed). (If equal
variances are not assumed, the test statistic value is 4.1639.)
For reference the test statistic formula is (corresponding assumption
to be provided in part ii.):

(x̄1 − x̄2) / √(s_p²(1/n1 + 1/n2))  or  (x̄1 − x̄2) / √(s1²/n1 + s2²/n2).

• For α = 0.05, critical values are ±1.96 (±2.00 if the t60 distribution is
used).
• Decision: reject H0.
• Choose smaller α, say α = 0.01, hence critical values are ±2.576 (or
±2.660), hence reject H0.
• The test is highly significant, with strong evidence of a difference in
the mean hours in the office between males and females.
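Both versions of the test statistic can be reproduced from the summary table; a sketch:

```python
import math

# Summary data: males (1) and females (2).
n1, xbar1, s1 = 41, 9.0, 1.9
n2, xbar2, s2 = 29, 7.5, 1.1

# Pooled-variance version (equal variances assumed).
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t_pooled = (xbar1 - xbar2) / math.sqrt(sp2 * (1/n1 + 1/n2))

# Unpooled version (equal variances not assumed).
t_unpooled = (xbar1 - xbar2) / math.sqrt(s1**2/n1 + s2**2/n2)

print(round(t_pooled, 4), round(t_unpooled, 4))   # 3.818 4.1639
```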


ii. The assumptions for i. were:


• about whether variances are equal
• about whether n1 + n2 is ‘large’ so that the normality assumption is
satisfied
• about independent samples.
iii. The working for the 98% confidence interval is given below:
• Knowledge of method, i.e. using 7.5 ± 2.467 × 1.1/√29.
• Important to identify the correct t-value: 2.467, with n − 1 = 28
degrees of freedom.
• Correct endpoints: (6.996, 8.004).
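A sketch of the interval calculation, using the table value t = 2.467 quoted above:

```python
import math

# 98% confidence interval for the mean hours of females,
# with t(28) critical value 2.467 taken from statistical tables.
n, xbar, s = 29, 7.5, 1.1
t_value = 2.467

half_width = t_value * s / math.sqrt(n)
ci = (round(xbar - half_width, 3), round(xbar + half_width, 3))

print(ci)   # (6.996, 8.004)
```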

4. (a) i. • Mean mark: 59.3.


• Median mark: 57.5.
• Q1: 52.25 (an interpolation between the 8th and the 9th smallest
values, i.e. 52 + 0.25 × (53 − 52)).
• Q3 : 64 (the 22nd smallest value or an interpolation between the 22nd
and the 23rd smallest value).
• IQR: 64 − 52.25 = 11.75.
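The summary measures can be checked with Python's standard library; note that the quartile convention matters, and the 'inclusive' method happens to match the interpolation used here:

```python
from statistics import mean, median, quantiles

# Exam marks of the 30 students, as listed in the question.
marks = [42, 44, 45, 45, 47, 47, 48, 52, 53, 54,
         55, 55, 56, 56, 57, 58, 59, 60, 62, 63,
         63, 64, 64, 65, 66, 66, 67, 73, 95, 98]

q1, q2, q3 = quantiles(marks, n=4, method='inclusive')
iqr = q3 - q1

print(round(mean(marks), 1), median(marks), q1, q3, iqr)
# 59.3 57.5 52.25 64.0 11.75
```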
ii. An acceptable boxplot is displayed below:

iii. The two main things to note here are positive/right skewness and the fact
that mean > median.
(b) This is a standard exercise for a one-sided hypothesis test for a single proportion
(part i.) and differences between two proportions (part ii.). The working of the
exercise is given below.
i. Let πT denote the true probability for the new treatment to work. We can
use the following test:
• H0 : πT = 0.50 vs. H1 : πT > 0.50.
• Standard error: √(0.50 × (1 − 0.50)/40) = 0.0791.
• Test statistic value: 1.897.
• For α = 0.05, the critical value is 1.645.
• Decision: reject H0 .
• The test is moderately significant, with moderate evidence that this
treatment is better than doing nothing.
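A sketch of the one-proportion calculation:

```python
import math

# One-sample test of H0: pi = 0.50 vs. H1: pi > 0.50,
# with 26 successes out of 40 patients.
n, successes = 40, 26
p_hat = successes / n

se = math.sqrt(0.50 * (1 - 0.50) / n)   # standard error under H0
z = (p_hat - 0.50) / se

print(round(se, 4), round(z, 3))   # 0.0791 1.897
```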
ii. Let πP denote the true probability for the patient to recover with the
placebo.
• H0: πT = πP vs. H1: πT > πP. For reference the test statistic is:

(pT − pP) / s.e.(pT − pP) ∼ N(0, 1).

• Calculation of standard error:

s.e.(pT − pP) = √((42/70) × (28/70) × (1/40 + 1/30)) = 0.1183.

• Test statistic value = (26/40 − 16/30)/0.1183 = 0.986.


• For α = 0.05, the critical value is 1.645.
• Do not reject H0 at the 5% significance level.
• The test is not statistically significant, with insufficient evidence of
higher effectiveness than the placebo effect.
• There is insufficient evidence to recommend the treatment for use.
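The pooled two-proportion calculation can be sketched as:

```python
import math

# Treatment: 26 successes of 40; placebo: 16 of 30.
n_t, x_t = 40, 26
n_p, x_p = 30, 16
p_t, p_p = x_t / n_t, x_p / n_p

# Pooled proportion under H0: pi_T = pi_P.
pooled = (x_t + x_p) / (n_t + n_p)   # 42/70
se = math.sqrt(pooled * (1 - pooled) * (1/n_t + 1/n_p))
z = (p_t - p_p) / se

print(round(se, 4), round(z, 3))   # 0.1183 0.986
```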

Dennis V. Lindley, William F. Scott, New Cambridge Statistical Tables, (1995) © Cambridge University Press, reproduced with permission.
