StudyGuide001 2015 4 B STA1502
Department of Statistics
STA1502
Statistical inference I
CONTENTS
ORIENTATION
STUDY UNIT 1
1.1 Introduction
1.2 Inference about the Difference Between Two Population Means: Independent Samples
1.3 Observational and Experimental Data
1.4 Inference about the Difference Between Two Population Means: Matched Pairs Experiment
1.5 Inference about the Ratio of Two Variances
1.6 Self-correcting Exercises for Unit 1
1.7 Solutions to Self-correcting Exercises for Unit 1
1.8 Learning Outcomes
STUDY UNIT 2
2.1 Introduction
2.2 Inference about the Difference Between Two Population Proportions
2.3 One-Way Analysis of Variance
2.4 Multiple Comparisons
2.5 Analysis of Variance experimental designs (read only)
2.6 Randomized Block (two-way) Analysis of Variance
2.7 Self-correcting Exercises for Unit 2
2.8 Solutions to Self-correcting Exercises for Unit 2
2.9 Learning Outcomes
STUDY UNIT 3
3.1 Chi-square test
3.2 Chi-squared goodness-of-fit test
3.3 Chi-squared test of a Contingency Table
3.4 Summary of test on nominal data
STUDY UNIT 4
4.1 Simple linear regression and correlation
4.2 Estimating the coefficients
4.3 Error variable: required conditions
4.4 Assessing the model
4.5 Using the regression equation
4.6 Regression diagnostics
STUDY UNIT 5
5.1 Non-parametric statistics
5.2 Wilcoxon Rank Sum Test
5.3 Sign test and Wilcoxon signed rank sum test
STUDY UNIT 6
6.1 Time series analysis and time series forecasting
6.2 Components of time series and smoothing possibilities
6.3 Smoothing techniques
6.4 Trend and seasonal effects
6.5 Introduction to forecasting
6.6 Forecasting models
ORIENTATION
Welcome
Welcome to STA1502. This module is the second one of the first-year statistics courses. STA1501
and STA1502 form the first year Statistics course for students from the College of Economic and
Management Sciences. If you are a BSc student in the College of Science, Engineering and
Technology, the three modules STA1501, STA1502 and STA1503 form the first year in Statistics.
In the preceding module STA1501, we treated probability and probability distributions, and unless
one has a proper understanding of the laws of probability, the mechanisms underlying statistical data
analysis will not be understood properly. Probability theory is the tool that makes statistical inference
possible. In STA1502, we consider the applications of the probability distributions. You have
learned in STA1501 that the shape of the normal distribution is determined by the value of the mean
µ and the variance $\sigma^2$, whilst the shape of the binomial distribution is determined by the sample size
n and the probability of a success p. These critical values are called parameters. We most often
don’t know what the values of the parameters are and thus we cannot "utilise" these distributions (i.e.
use the mathematical formula to draw a probability density graph or compute specific probabilities)
unless we somehow estimate these unknown parameters. It makes perfect logical sense that to
estimate the value of an unknown population parameter, we compute a corresponding or comparable
characteristic of the sample.
The objective of this module is to focus on the issues related to prediction and inference in statistics
and therefore it is called Statistical Inference and the "I" in the title indicates that it is a module at
the first level. We draw inference about a population (a complete set of data) based on the limited
information contained in a sample. In dictionary terms, inference is the act or process of inferring;
to infer means to conclude or judge from premises or evidence; meaning to derive by reasoning.
In general, the term implies a conclusion based on experience or knowledge. More specifically in
statistics, we have as evidence the limited information contained in the outcome of a sample and
we want to conclude something about the unknown population from which the sample was drawn.
The set of principles, procedures and methods that we use to study populations by making use of
information obtained from samples is called statistical inference.
Learning outcomes
There are very specific outcomes for this module, listed below. Throughout your study of this module
you must come back to this page, sit back and reflect upon them, think them through, digest them
into your system and feel confident in the end that you have mastered the following outcomes:
For this module you have to study certain sections from six chapters of the prescribed textbook:
you so prefer, you are welcome to write and reference your solutions in your own book or file, if the
space we supply is insufficient or not to your liking.
We realise that you might feel overwhelmed by the volumes and volumes of printed matter that
you have to absorb as a student! How do you eat an elephant? Bite by bite! We have divided
the 6 chapters of the textbook into 5 study units or "sessions". Make very sure about the sections
indicated in each study unit since some sections of the textbook are excluded and we do not want
you frustrated by working through unnecessary work. Regular contact with statistics will ensure that
your study becomes personally rewarding.
Doing exercises on your own will not only enhance your understanding of the work, but it will give you
confidence as well. Feedback is given immediately after the activity to help you check whether you
understand the specific concept. The activities are designed (i.e. specific exercises are selected) so
that you can reflect on a concept discussed in the textbook. You can only obtain maximum benefit
from this activity-feedback process if you discipline yourself not to peep at the solution before you
have attempted it on your own!
We know that many of you have some "math anxiety" to deal with, but we will do our best to make
your statistics understandable and not too theoretical. Studying statistics is sometimes not "exciting"
or "fun" but keep in mind that the considerable effort to master the content of this module can be very
rewarding. We claim that knowledge of statistics will enable you to make effective decisions in your
business and to conduct quantitative research into the many larger and detailed data sources that
are available. Statistical literacy will enable you to understand statistical reports you might encounter
as a manager in your business.
We are there to assist you in a process where you shift yourself from a supported school learner to
an independent learner. Studying through distance education is neither easy nor quick. There will
be times when you feel frustrated and discouraged and then only your attitude will pull you through!
You are the master of your own destiny.
In a paper by Sue Gordon¹ (1995) from the University of Sydney, the following metaphor is given:
"The learning of statistics is like building a road. It’s a wonderful road, it will take you to places you
did not think you could reach. But when you have constructed one bit of road you cannot sit back and
think ‘Oh, that’s a great piece of road!’ and stop at that. Each bit leads you on, shows the direction
to go, opens the opportunity for more road to be built. And furthermore, the part of the road that
you built a few weeks ago, that you thought you were finished with, is going to develop pot holes
the instant you turn your back on it. This is not to be construed as failure on your part, this is not
inadequacy. This is just part of road building. This is what learning statistics is about: go back and
repair, go on and build, go back and repair."
¹ Gordon, Sue (1995). A Theoretical Approach to Understanding Learners of Statistics. Journal of Statistics Education,
v. 3, n. 3. University of Sydney.
(You can skip the following section if you have read through it when you did STA1501.)
We realise that in the South African schooling system commas are used to indicate the decimal digit
values. You have been penalised at school for using a point. Now we sit between two fires: the
school system and common practice in calculators and computers! Most computer packages use
decimal points (ignoring the option to change it) and Keller (the author) also uses the decimal point
in our textbook (Statistics for Management and Economics). Therefore we use the decimal point in
our study guide, assignments and examination.
The emphasis in the textbook is well beyond the arithmetic of calculating statistics and the focus is
on the identification of the correct technique, interpretation and decision making. This is achieved
with a flexible design giving both manual calculations and computer steps.
It is a good idea that you initially go through the laborious manual computations to enhance your
understanding of the principles and mathematics but we strongly urge you to manage the Excel
computations because using computers reflects the real world outside. The additional advantage of
using a computer is that you can do calculations for larger and more realistic data sets. Whether
you use a computer program or a statistical calculator as tool for your calculations is irrelevant to us.
However, the emphasis in this module will always be on the interpretation and how to articulate the
results in report writing.
CD Appendixes and a Study Guide are provided on the CD-ROM (included in the textbook) in PDF
format. The slide shot below is just to give you an idea of some of the topics covered. Although it will
not be to your disadvantage if you do not use the CD, we encourage you to try your best to have at
least a few sessions on a computer. Statistical Software makes Statistics exciting - so, play around
on the computer should you have access!
STUDY UNIT 1
1.1 Introduction
You should not attempt to do this module without knowledge of the contents of STA1501 as it is a
continuation in the same textbook of the follow-up chapters. Chapters 2 and chapters 4 - 12 were
covered in STA1501 and we now continue with Chapter 13. In chapter 12 you learnt about statistical
inference for a single population and derived hypothesis tests and confidence intervals from the
information contained in a single sample. You did this for
In this study unit we will focus on statistical inference for two populations and derive hypothesis
tests and confidence intervals from the information contained in two separate samples. Recall how
a confidence interval is derived for (µ1 − µ2 ) using the sampling distribution of $(\bar{X}_1 - \bar{X}_2)$. Similar
to the practical problems with inference for a single population mean, µ, you will understand that we
again work with a t-distribution because of the more realistic set-up where we assume that both the
population variances are unknown and we have to estimate them.
STUDY
Keller Chapter 13 Inference about comparing two populations
13.1 Inference about the difference between two means: independent samples
Make sure that you understand figure 13.1 of Keller: Note that we need subscripts to distinguish
between the parameters of two different variables!
We are now sampling from two independent populations where the means of the populations are
our focus.
1. We have two independent populations from which we draw small random samples.
2. Both populations have normal distributions.
3. Both populations have the same variance, i.e. $\sigma_1^2 = \sigma_2^2 = \sigma^2$.
[ :-) I like to add the subscript "pooled" to remind me that it is a combined/composed variance
and not the subscript consisting only of "p" as Keller does!]
The test statistic is
$$t_{(\bar{x}_1 - \bar{x}_2)} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s^2_{pooled}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
which has a t-distribution with $\nu = n_1 + n_2 - 2$ degrees of freedom.
[ :-) I like to add the subscript "$(\bar{x}_1 - \bar{x}_2)$" to t to remind me that it is a different t-statistic from
what we used in chapter 12 of Keller ]
The test statistic can be used directly to perform a hypothesis test or be manipulated to create a
lower and an upper bound for the confidence interval.
The null hypothesis H0 : (µ1 − µ2 ) = D0 may be tested at the α% level of significance against one
of the following alternatives:
(i) $H_1: (\mu_1 - \mu_2) \neq D_0$ or
(ii) H1 : (µ1 − µ2 ) < D0 or
(iii) H1 : (µ1 − µ2 ) > D0
The symbol D0 implies a known, specified difference under H0 and is usually (mostly!)
the value 0, indicating that we are testing H0 : µ1 = µ2 .
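The computation of the test statistic can be sketched in a few lines of Python. This is only an illustration under the three assumptions listed above; the function name `pooled_t_statistic` and the data are hypothetical and do not come from Keller. The pooled variance is computed as $((n_1 - 1)s_1^2 + (n_2 - 1)s_2^2)/(n_1 + n_2 - 2)$.

```python
import math
import statistics


def pooled_t_statistic(sample1, sample2, d0=0.0):
    """Return (t, degrees of freedom) for the equal-variances t-test.

    d0 is the difference (mu1 - mu2) specified under H0 (usually 0).
    """
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    # statistics.variance divides by n - 1, i.e. it is the sample variance.
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    df = n1 + n2 - 2
    s2_pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    standard_error = math.sqrt(s2_pooled * (1 / n1 + 1 / n2))
    t = ((mean1 - mean2) - d0) / standard_error
    return t, df


# Hypothetical data, for illustration only.
group_a = [4.0, 6.0, 8.0]
group_b = [1.0, 3.0, 5.0]
t_value, df = pooled_t_statistic(group_a, group_b)
print(f"t = {t_value:.4f} with {df} degrees of freedom")
```

The computed t is then compared with the critical value from the t-table for the chosen alternative (two-sided for (i), one-sided for (ii) and (iii)).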
To obtain a (1 − α)100% confidence interval estimate for the difference between the two
population means, (µ1 − µ2 ), we compute
$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2;\,(n_1+n_2-2)} \sqrt{s^2_{pooled}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
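[Figure: the $t_{n_1+n_2-2}$ distribution, centred at 0, with the critical values $-t_{\alpha/2;\,n_1+n_2-2}$ and $t_{\alpha/2;\,n_1+n_2-2}$ marked.]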
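As a rough sketch, the confidence interval can be computed from summary statistics as follows. The function name and the numbers are hypothetical. The critical value is typed in from a t-table rather than computed, because the Python standard library has no t-distribution.

```python
import math


def pooled_confidence_interval(mean1, mean2, s2_pooled, n1, n2, t_critical):
    """Return (lower, upper) for (mu1 - mu2), assuming equal variances.

    t_critical is t_{alpha/2; n1+n2-2}, looked up in a t-table.
    """
    standard_error = math.sqrt(s2_pooled * (1 / n1 + 1 / n2))
    margin = t_critical * standard_error
    centre = mean1 - mean2
    return centre - margin, centre + margin


# Hypothetical summary statistics; 2.101 is t_{0.025; 18} (95% confidence).
lower, upper = pooled_confidence_interval(20.0, 17.0, 9.0, 10, 10, 2.101)
print(f"95% CI for (mu1 - mu2): ({lower:.4f} ; {upper:.4f})")
```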
After you have studied section 13.1 of chapter 13 of the textbook you should try and work through
activities 1.1 and 1.2 to enhance your understanding of a hypothesis test for the difference between
two population means.
Activity 1.1
Say whether the following statements are correct or incorrect and try to rectify the incorrect
statements to make them true.
.............................................................................. ..............................................................................
.............................................................................................................................................................
(b) If we derive a confidence interval for (µ1 − µ2 ) we use $SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
but if we test H0 : µ1 = µ2 we use $SE = \sqrt{s^2_{pooled}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$.
.............................................................................................................................................................
.............................................................................................................................................................
(c) In a one-tailed test for the difference between two population means, (µ1 − µ2 ), if the null
hypothesis is rejected when the alternative hypothesis, H1 : µ1 < µ2 is false, a Type I error
is committed.
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
Feedback
(a) Correct. With a little algebraic manipulation it follows from the definitions of
$s_1^2 = \frac{\sum (x_{1i} - \bar{x}_1)^2}{n_1 - 1}$ and $s_2^2 = \frac{\sum (x_{2i} - \bar{x}_2)^2}{n_2 - 1}$ that $(n_1 - 1)s_1^2 = \sum (x_{1i} - \bar{x}_1)^2$
and that $(n_2 - 1)s_2^2 = \sum (x_{2i} - \bar{x}_2)^2$.
(b) Incorrect. We use $SE = \sqrt{s^2_{pooled}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$ for both the hypothesis test and the confidence
interval!
(c) Correct.
You will find that in most of the exercises on this section, whether they are for an assignment, the
examination or exercises in Keller, the information you have to work with will either be
                   Population 1     Population 2
Sample size        $n_1$            $n_2$
Sample mean        $\bar{x}_1$      $\bar{x}_2$
Sample variance    $s_1^2$          $s_2^2$
There could be "variations" on the theme of summarised data where computed sums are given
instead of sample statistics, e.g. $\sum x_{1i}$ instead of $\bar{x}_1$, or $\sum x_{1i}^2$ and $\sum x_{1i}$ instead of $s_1^2$.
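When only the sums are given, the sample mean and variance can be recovered with the computational formula $s^2 = (\sum x^2 - n\bar{x}^2)/(n - 1)$. The short sketch below uses a hypothetical function name and made-up numbers, purely to show the arithmetic.

```python
def mean_and_variance_from_sums(n, sum_x, sum_x_squared):
    """Recover the sample mean and sample variance from n, sum(x) and sum(x^2)."""
    mean = sum_x / n
    variance = (sum_x_squared - n * mean ** 2) / (n - 1)
    return mean, variance


# Hypothetical sample 2, 4, 6: n = 3, sum of x = 12, sum of x squared = 56.
mean, variance = mean_and_variance_from_sums(3, 12, 56)
print(mean, variance)
```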
In the case of raw data, you must try to have at least a Scientific Pocket Calculator with Statistical
Functions that will enable you to compute the sample statistics:
Activity 1.2
Psychologists have claimed that the scores on a tolerance measurement scale have a normal
distribution. Suppose that this scale is administered to two independent random samples of males
and females and their tolerance towards other road users is measured. (The higher the score, the
more tolerant you are.) The following scores were obtained:
Males: 12 8 11 14 10
Females: 15 12 14 11 13 14 12
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
(b) Compute a 99% confidence interval for the difference $(\mu_{males} - \mu_{females})$. How do you interpret
this interval?
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
(c) What can you conclude from questions (a) and (b)?
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
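Feedback
(a) Step 1:
We have to test $H_0: \mu_{males} = \mu_{females} \implies H_0: (\mu_1 - \mu_2) = 0$
against $H_1: \mu_{males} \neq \mu_{females} \implies H_1: (\mu_1 - \mu_2) \neq 0$
Step 2:
We use the test statistic
$$t_{(\bar{x}_1 - \bar{x}_2)} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s^2_{pooled}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \sim t_{(n_1+n_2-2)}.$$
$\bar{x}_1 = \frac{\sum x_{1i}}{n_1} = \frac{55}{5} = 11$;
$\bar{x}_2 = \frac{\sum x_{2i}}{n_2} = \frac{91}{7} = 13$;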
Step 3:
Find the critical values.
From Table 4 (see Appendix B, Keller) we find $t_{\alpha/2;\,(n_1+n_2-2)} = t_{0.01/2;\,(5+7-2)} = t_{0.005;\,10} = 3.169$ which
= −2 ± 3.3194
= (−5.3194 ; 1.3194).
We are 99% confident that the unknown difference $(\mu_{males} - \mu_{females})$ lies between −5.3194
and 1.3194. We see that (−5.3194 ; 1.3194) includes the null value 0, which implies that, at the 1% level
of significance, we cannot reject the hypothesis that the mean for the males is the same as the mean for
the females.
[Extra explanation: We translate the phrase "the mean for the males is the same as the mean for
the females" as $\mu_{males} = \mu_{females}$, which is in general $\mu_1 = \mu_2$. But if $\mu_{males} = \mu_{females}$, it implies
that $(\mu_1 - \mu_2) = 0$.
So, to decide whether $\mu_{males} = \mu_{females}$ is plausible, we have to check whether zero is included in the
confidence interval. ]
(c) We conclude from questions (a) and (b) that using a two-sided confidence interval and performing
a two-sided hypothesis test must always lead to the same conclusion because it is a different
"juggle" of the same information! This is indeed the case with this exercise!
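If you have access to a computer, the figures in this activity can be checked with a few lines of Python. This is only a sketch: the critical value 3.169 is typed in from the t-table, and the value of the test statistic is computed here rather than quoted from the feedback.

```python
import math
import statistics

males = [12, 8, 11, 14, 10]
females = [15, 12, 14, 11, 13, 14, 12]

n1, n2 = len(males), len(females)
mean1, mean2 = statistics.mean(males), statistics.mean(females)  # 11 and 13
df = n1 + n2 - 2

# Pooled variance: weighted average of the two sample variances.
s2_pooled = ((n1 - 1) * statistics.variance(males)
             + (n2 - 1) * statistics.variance(females)) / df
standard_error = math.sqrt(s2_pooled * (1 / n1 + 1 / n2))

t_critical = 3.169  # t_{0.005; 10}, read from the t-table
t_statistic = (mean1 - mean2) / standard_error

margin = t_critical * standard_error
lower = (mean1 - mean2) - margin
upper = (mean1 - mean2) + margin

reject_h0 = abs(t_statistic) > t_critical  # two-sided test at the 1% level
zero_in_interval = lower < 0 < upper       # the 99% interval covers 0

print(f"pooled variance = {s2_pooled:.4f}, t = {t_statistic:.4f}")
print(f"99% CI: ({lower:.4f} ; {upper:.4f})")
print(f"reject H0: {reject_h0}, zero in interval: {zero_in_interval}")
```

The two-sided test does not reject H0 exactly when zero lies inside the interval, which is the point made in (c).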
You will find that in most of the exercises on this section, whether they are for an assignment, the
examination or exercises in Keller, we will simply state: " Assume that.....blah-blah-blah" and then we
conveniently take care of the assumptions of normality and equal variances! But, strictly speaking,
we should have first checked whether these conditions are met before we proceed with the test.
There exist additional preliminary tests where we can formally test for normality and for the equality
of variances. The tests for normality are covered in detail in your second-year statistics syllabus.
Most statistical packages will provide you with a statistical test to formally test $H_0: \sigma_1^2 = \sigma_2^2$. In
the module STA2601 you will be formally introduced to the statistical package JMP. In case you do
not continue with statistics but anyhow apply your first-year knowledge using a statistical package of
your own choice, be aware that most statistical software packages will automatically include a test
for the equality of variances when you request to do a test for means! (This also happens when you
request to do an ANOVA test for means – a procedure you will learn about in the following study unit.)
The output for the test for the equality of variances will be a so-called F-test. An F-test, in general, is
basically the ratio of two quantities – in this application two variances. The p-value associated with
the F-test could be interpreted exactly like you have learned to do for any other test. If it is significant
(i.e. p-value < α) you will reject $H_0: \sigma_1^2 = \sigma_2^2$.
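To make the idea of "a ratio of two variances" concrete, the sketch below computes such a ratio for the tolerance scores of Activity 1.2. It assumes the statistic is formed as $s_1^2 / s_2^2$ with $n_1 - 1$ and $n_2 - 1$ degrees of freedom; a statistical package may arrange the ratio differently. The p-value is not computed, because the F-distribution is not available in the Python standard library.

```python
import statistics


def variance_ratio(sample1, sample2):
    """Return (F, numerator df, denominator df) with F = s1^2 / s2^2."""
    f_ratio = statistics.variance(sample1) / statistics.variance(sample2)
    return f_ratio, len(sample1) - 1, len(sample2) - 1


males = [12, 8, 11, 14, 10]
females = [15, 12, 14, 11, 13, 14, 12]

f_ratio, df1, df2 = variance_ratio(males, females)
print(f"F = {f_ratio:.2f} with ({df1}, {df2}) degrees of freedom")
```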
STUDY
Keller Chapter 13 Inference about comparing two populations
13.2 Observational and experimental data
Although this is a section of less than two pages, it is vitally important to grasp what Keller wants to
convey and to always keep this in mind whenever you interpret results.
STUDY
Keller Chapter 13 Inference about comparing two populations
13.3 Inference about the Difference Between Two Population Means:
Matched Pairs Experiment
Have you noticed when we derived the sampling distribution of $(\bar{x}_1 - \bar{x}_2)$, we used the fact that
$E(\bar{x}_1 - \bar{x}_2) = (\mu_1 - \mu_2)$ (the minus sign stays), but that
$\operatorname{var}(\bar{x}_1 - \bar{x}_2) = \left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)$ (the minus sign disappears)?
(Yes, there is a plus sign even though you might expect a minus sign!) In other words, if we create
a new variable by subtracting two variables, the variance of this new variable will – provided they
are independently distributed – be the sum of the variances of the two original variables.
Strictly speaking there is (in general) a third term that takes care of the dependency between the two
variables. We did not even bother to mention it in section 1.1 because this dependency term falls
away if we assume that X and Y are independent.
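For reference, the general rule from probability theory, written in standard notation rather than Keller's, is shown below. The "third term" is twice the covariance between the two variables.

```latex
\operatorname{Var}(X - Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) - 2\operatorname{Cov}(X, Y)
% If X and Y are independent, Cov(X, Y) = 0 and the rule reduces to
\operatorname{Var}(X - Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)
```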
However, if we cannot assume that we have two samples from two independent populations, we
have a problem with $\operatorname{var}(\bar{x}_1 - \bar{x}_2)$.
Using $\hat{\sigma}^2 = s^2_{pooled} = \frac{\sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2i} - \bar{x}_2)^2}{n_1 + n_2 - 2}$ is not valid anymore!
So, whenever there is a "connectedness" between one set of values (sample 1) and the second
set of values (sample 2), we could take care of the dependency by treating the data as matched
pairs. We remove the dependency by reducing the two samples to one set of scores. This would
immediately imply that n1 = n2 .
Thus, we create a single random sample by taking the paired differences di = x1i − x2i . With a little
adaptation (and imagination) we are now back to the set-up discussed in STA1501 (depending on
whether we consider the sample as having a known or unknown population variance!) i.e. go back
to Keller regarding the topics:
Comparing the means of two dependent data sets is always a separate choice (or sub-menu
in computer jargon) of the test procedures available for testing means (main-menu in computer
jargon) in any statistical software package. It is generally known as a “paired samples t-test” and
observations of a single sample, obtained by first taking the differences, are used.
For dependent observations, the hypothesis test for the difference between the two means therefore
boils down to the hypothesis test for a single sample.
H0 : µX = µY is the same as H0 : µD = 0 .
It is interesting to note that in the paired observations test, the degrees of freedom are half of what
they are if the samples are not paired. (When the samples are not paired two kinds of variation are
present: differences among the groups and differences among the subjects.)
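The reduction to a single sample of differences can be sketched as follows. The function name and the data are hypothetical; the sketch simply takes the differences, applies the one-sample t formula with $\mu_D = 0$ under H0, and reports $n - 1$ degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(first, second):
    """Return (t, degrees of freedom) for H0: mu_D = 0 using paired differences."""
    if len(first) != len(second):
        raise ValueError("matched pairs require two samples of equal size")
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    mean_d = statistics.mean(differences)
    sd_d = statistics.stdev(differences)  # uses the n - 1 divisor
    t = mean_d / (sd_d / math.sqrt(n))
    return t, n - 1


# Hypothetical before/after measurements on the same four subjects.
before = [10.0, 12.0, 9.0, 11.0]
after = [8.0, 11.0, 9.0, 8.0]

t_value, df_paired = paired_t_statistic(before, after)
df_unpaired = len(before) + len(after) - 2
print(f"t = {t_value:.4f}, paired df = {df_paired}, unpaired df = {df_unpaired}")
```

With four pairs the paired test has 3 degrees of freedom, half of the 6 that the independent-samples test would have.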
Activity 1.3
Say whether the following statements are correct or incorrect and try to rectify the incorrect
statements to make them true.
(a) Repeated measurements from the same individuals constitute an example of data collected from
a matched pairs experiment.
.............................................................................. ..............................................................................
.............................................................................................................................................................
(b) The number of degrees of freedom associated with the t-test, when the data are gathered from a
matched pairs experiment with 8 pairs, is 7.
.............................................................................................................................................................
.............................................................................................................................................................
(c) The matched pairs experiment always produces a larger test statistic than the independent samples
experiment.
.............................................................................................................................................................
.............................................................................................................................................................
(d) In comparing two population means of interval data, we must decide whether the samples are
independent (in which case the parameter of interest is µ1 − µ2 ) or matched pairs (in which case
the parameter is µD ) in order to select the correct test statistic.
.............................................................................................................................................................
.............................................................................................................................................................
(e) When comparing two population means using data that are gathered from a matched pairs
experiment, the test statistic for µD has a Student t-distribution with ν = nD − 1 degrees of
freedom, provided that the differences are normally distributed.
.............................................................................................................................................................
.............................................................................................................................................................
Feedback
(a) Correct.
(b) Correct.
(c) Incorrect. We may say that the matched pairs produce a smaller estimated SE because we
eliminate the often considerable variability due to individual variation in the separate samples.
(d) Correct.
(e) Correct.
Activity 1.4
Suppose that person A believes that sons, upon maturity, are in general taller than their fathers.
Person B, on the other hand, argues that the opposite is true. In order to investigate this issue, we
measure the heights of a random sample of nine father-son pairs. The following are the results (in
cm):
Pair      1    2    3    4    5    6    7    8    9
Son     185  173  168  178  188  173  165  183  175
Father  180  175  160  178  183  175  160  173  178
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
(b) Find a 95% confidence interval estimate for (µ1 − µ2 ), the mean difference in heights of fathers
and sons.
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
Feedback
We have dependent (paired) observations and we need to work with the differences of the pairs,
H0 : µD = 0 vs H1 : µD ≠ 0
where the differences di (son − father) are
di : 5 −2 8 0 5 −2 5 10 −3
The test statistic is
\[ t = \frac{\bar{x}_D - 0}{s_D/\sqrt{n}} \]
where
\[ \bar{x}_D = \frac{\sum d_i}{n} = \frac{26}{9} = 2.889 ; \]
\[ s_D^2 = \frac{\sum d_i^2 - n\bar{x}_D^2}{n-1} = \frac{256 - 9(2.889)^2}{8} = 22.611 ; \]
\[ s_D = \sqrt{22.611} = 4.755 . \]
Therefore,
\[ t = \frac{2.889 - 0}{4.755/\sqrt{9}} = 1.8227 . \]
Decision rule
Since $t \sim t_{n-1}$ we will reject H0 if $t \le -t_{0.025;\,8}$ or if $t \ge t_{0.025;\,8}$.
From Table 4 (see Appendix B, Keller) $t_{0.025;\,8} = 2.306$.
Since 1.8227 < 2.306 we cannot reject H0 . The heights of sons and fathers do not differ
significantly at the 5% level of significance.
(b) For a 95% confidence interval we need $t_{\alpha/2;\,n-1} = t_{0.05/2;\,9-1} = t_{0.025;\,8} = 2.306$.
\[ \bar{x}_D \pm t_{\alpha/2;\,n-1}\,\frac{s_D}{\sqrt{n}} = 2.889 \pm (2.306)\frac{4.755}{\sqrt{9}} = 2.889 \pm 3.655 = (-0.766;\ 6.544). \]
We are 95% confident that the mean difference in heights of fathers and sons is between −0.766
and 6.544. (Sons seem to be taller than their fathers but not significantly.)
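The calculations in this feedback can be checked with a short script. This is only a sketch: the variable names are illustrative, the differences are taken as son − father, and the critical value 2.306 is the table value quoted above rather than something the code computes.

```python
import math
import statistics

# Heights (cm) of the nine father-son pairs from Activity 1.4
sons = [185, 173, 168, 178, 188, 173, 165, 183, 175]
fathers = [180, 175, 160, 178, 183, 175, 160, 173, 178]

# Paired differences d_i = son - father
diffs = [s - f for s, f in zip(sons, fathers)]
n = len(diffs)

mean_d = statistics.mean(diffs)   # x-bar_D
sd_d = statistics.stdev(diffs)    # s_D, with divisor n - 1
se_d = sd_d / math.sqrt(n)        # standard error of the mean difference

# Test statistic for H0: mu_D = 0
t_stat = (mean_d - 0) / se_d

# Assumption: t_{0.025; 8} = 2.306 is read from Table 4 (Keller)
T_CRIT = 2.306
reject_h0 = abs(t_stat) >= T_CRIT

# 95% confidence interval for mu_D
half_width = T_CRIT * se_d
ci = (mean_d - half_width, mean_d + half_width)

print(f"t = {t_stat:.4f}, reject H0: {reject_h0}")
print(f"95% CI: ({ci[0]:.3f}; {ci[1]:.3f})")
```

Small differences in the last decimal compared with the hand calculation are due to rounding of the intermediate values.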
Activity 1.5
Question 1
In testing the hypothesis H0 : µD = 5 vs. H1 : µD > 5, two random samples from two
dependent normal populations produced the following statistics: xD = 9, nD = 20, and sD = 7.5.
What conclusion can we draw at the 1% significance level?
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
Question 2
Promotional Campaigns
The general manager of a chain of fast food chicken restaurants wants to determine how effective
their promotional campaigns are. In these campaigns “20% off” coupons are widely distributed.
These coupons are only valid for one week. To examine their effectiveness, the executive records
the daily gross sales (in R1000’s) in one restaurant during the campaign and during the week after
the campaign ends. The data is shown below.
(a) Can they infer at the 5% significance level that sales increase during the campaign?
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
(b) Find the 95% confidence interval for the difference in sales during the week.
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
(c) What can you conclude from the answers in (a) and (b)?
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
Feedback
Question 1
\[ t = \frac{\bar{x}_D - \mu_D}{s_D/\sqrt{n_D}} = \frac{9 - 5}{7.5/\sqrt{20}} = 2.385 \]
Decision rule
Since $t \sim t_{n_D-1}$ we will reject H0 if $t \ge t_{0.01;\,20-1} = 2.539$ (from Table 4, Appendix B, Keller).
Since 2.385 < 2.539 we cannot reject H0 at the 1% level of significance.
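When only summary statistics are given, as in Question 1, the test statistic can be computed directly from them. The sketch below uses a hypothetical helper name, t_from_summary, and again takes the critical value 2.539 from the table as an input.

```python
import math


def t_from_summary(mean_d, sd_d, n, mu0):
    """Paired t statistic from summary statistics of the differences."""
    return (mean_d - mu0) / (sd_d / math.sqrt(n))


# Question 1: x-bar_D = 9, s_D = 7.5, n_D = 20, H0: mu_D = 5 vs H1: mu_D > 5
t_stat = t_from_summary(mean_d=9, sd_d=7.5, n=20, mu0=5)

# Assumption: t_{0.01; 19} = 2.539 is read from Table 4 (Keller)
T_CRIT = 2.539
reject_h0 = t_stat >= T_CRIT   # right-tailed test

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```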
Question 2
We have dependent (paired) observations and we need to work with the differences of the pairs.
H0 : µD = 0 vs H1 : µD > 0.
\[ t = \frac{\bar{x}_D - 0}{s_D/\sqrt{n}} \]
where
\[ \bar{x}_D = \frac{\sum d_i}{n} = \frac{6.7}{7} = 0.95714 ; \]
\[ s_D^2 = \frac{\sum d_i^2 - n\bar{x}_D^2}{n-1} = \frac{8.69 - 7(0.95714)^2}{6} = 0.37953 ; \]
\[ s_D = \sqrt{0.37953} = 0.61606 . \]
Therefore,
\[ t = \frac{0.95714 - 0}{0.61606/\sqrt{7}} = 4.111 . \]
Decision rule
Since $t \sim t_{n-1}$ we will reject H0 if $t \ge t_{0.05;\,6}$.
From Table 4 (see Appendix B, Keller) $t_{0.05;\,6} = 1.943$.
Since 4.111 > 1.943 we reject H0 . Yes, they may infer at the 5% significance level that sales
increase during the campaign.
(b) For a 95% confidence interval we need $t_{\alpha/2;\,n-1} = t_{0.05/2;\,7-1} = t_{0.025;\,6} = 2.447$.
\[ \bar{x}_D \pm t_{\alpha/2;\,n-1}\,\frac{s_D}{\sqrt{n}} = 0.95714 \pm (2.447)\frac{0.61606}{\sqrt{7}} = 0.957 \pm 0.57 = (0.387;\ 1.527). \]
We are 95% confident that the mean difference in sales is between 0.387 and 1.527 thousand
rand.
(c) We estimate that the daily sales during the campaign increase on average between 0.387 and
1.527 thousand rand.
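The feedback for Question 2 works from the totals Σdi = 6.7 and Σdi² = 8.69 rather than from the individual differences. A minimal sketch of that shortcut formula follows; the function name is hypothetical and both critical values are the table values quoted in the feedback.

```python
import math


def mean_and_sd_from_sums(n, sum_d, sum_d_sq):
    """Mean and standard deviation of the differences from n, sum(d) and sum(d^2)."""
    mean_d = sum_d / n
    var_d = (sum_d_sq - n * mean_d ** 2) / (n - 1)   # shortcut formula for s_D^2
    return mean_d, math.sqrt(var_d)


n = 7
mean_d, sd_d = mean_and_sd_from_sums(n, sum_d=6.7, sum_d_sq=8.69)
se_d = sd_d / math.sqrt(n)

# (a) Right-tailed test of H0: mu_D = 0 vs H1: mu_D > 0
t_stat = (mean_d - 0) / se_d
T_ONE_TAIL = 1.943   # assumption: t_{0.05; 6} from Table 4 (Keller)
reject_h0 = t_stat >= T_ONE_TAIL

# (b) 95% confidence interval for mu_D
T_TWO_TAIL = 2.447   # assumption: t_{0.025; 6} from Table 4 (Keller)
half_width = T_TWO_TAIL * se_d
ci = (mean_d - half_width, mean_d + half_width)

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
print(f"95% CI: ({ci[0]:.3f}; {ci[1]:.3f}) thousand rand")
```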
STUDY
Keller Chapter 13 Inference about comparing two populations
13.4 Inference about the Ratio of two variances
The interest in this section is on variability in two populations, using the F -tables. It is a small but
significant section in Keller.
· If both populations are normal and $\sigma_1^2 = \sigma_2^2$, the ratio $S_1^2/S_2^2$ is F distributed.
The hypothesis testing follows the same pattern as you have had in previous sections, namely
- define the null and alternative hypotheses according to the information given in the question (they
have to involve the parameter $\sigma_1^2/\sigma_2^2$)
- know that $F = S_1^2/S_2^2$ is the test statistic with $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$ degrees of freedom
- the LCL is $\frac{S_1^2}{S_2^2}\cdot\frac{1}{F_{\alpha/2,\nu_1,\nu_2}}$ with $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$
- the UCL is $\frac{S_1^2}{S_2^2}\cdot F_{\alpha/2,\nu_2,\nu_1}$ with $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$
- find the cut-off value for the rejection region from the F -table.
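The two confidence limits in the list above follow from rearranging a probability statement about the F-distributed ratio. The sketch below assumes independent samples from normal populations and uses the standard relation $F_{1-\alpha/2,\nu_1,\nu_2} = 1/F_{\alpha/2,\nu_2,\nu_1}$, which is why the degrees of freedom are swapped in the UCL.

```latex
\begin{align*}
1-\alpha
 &= P\left( F_{1-\alpha/2,\nu_1,\nu_2} < \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} < F_{\alpha/2,\nu_1,\nu_2} \right) \\
 &= P\left( \frac{S_1^2}{S_2^2}\cdot\frac{1}{F_{\alpha/2,\nu_1,\nu_2}} < \frac{\sigma_1^2}{\sigma_2^2}
      < \frac{S_1^2}{S_2^2}\cdot\frac{1}{F_{1-\alpha/2,\nu_1,\nu_2}} \right) \\
 &= P\left( \frac{S_1^2}{S_2^2}\cdot\frac{1}{F_{\alpha/2,\nu_1,\nu_2}} < \frac{\sigma_1^2}{\sigma_2^2}
      < \frac{S_1^2}{S_2^2}\cdot F_{\alpha/2,\nu_2,\nu_1} \right)
\end{align*}
```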
If you are not sure about finding these table-values, page back to Chapter 8, 8.4 Other Continuous
Distributions in Keller, where the F-distribution is explained in detail. This section formed part of the
STA1501 (STS1113) syllabus. The advantage of using the same textbook for the modules STA1501
and STA1502 is that you can go back to previous knowledge whenever needed.
Study the example Testing the quality of two-bottle filling in detail so you can understand the
procedure of this ratio of variances test.
Activity 1.6
Question 1
In constructing a 90% interval estimate for the ratio of two population variances, $\sigma_1^2/\sigma_2^2$, two independent
samples of sizes 40 and 60 are drawn from the populations. If the sample variances are 515 and 920,
then the lower confidence limit is:
1. 0.244
2. 0.352
3. 0.341
4. 0.890
5. 0.918
Question 2
a) Do the sample variances provide enough evidence at the 10% significance level to infer that the
two population variances differ?
b) Estimate with 90% confidence the ratio of the two population variances.
c) Describe what the interval estimate tells you and briefly explain how to use the interval estimate
to test the hypotheses.
Feedback
Question 1
The formula for the LCL is $\frac{S_1^2}{S_2^2}\cdot\frac{1}{F_{\alpha/2,\nu_1,\nu_2}}$ and you have to substitute the correct values into this
formula.
\[ \frac{S_1^2}{S_2^2} = \frac{515}{920} = 0.55978\ldots \]
Go to the F -table with heading 0.05 (because α = 0.1 and you need α/2) and where the values for 40
and 60 meet, you will read off the value 1.59.
\[ \frac{S_1^2}{S_2^2}\cdot\frac{1}{F_{\alpha/2,\nu_1,\nu_2}} = 0.55978\cdot\frac{1}{1.59} = 0.352, \]
which is option 2.
Question 2
a) $H_0: \sigma_1^2/\sigma_2^2 = 1$ versus $H_1: \sigma_1^2/\sigma_2^2 \ne 1$
Rejection region: $F > F_{0.05,15,13} = 2.53$ or $F < F_{0.95,15,13} = \frac{1}{F_{0.05,13,15}} = \frac{1}{2.45} \approx 0.408$
Test statistic: $F = \frac{55}{118} = 0.466$
Conclusion: Since 0.408 < 0.466 < 2.53, don’t reject the null hypothesis. No, the sample variances don’t provide enough
evidence at the 10% significance level to infer that the two population variances differ.
b) The 90% confidence interval for the ratio of the two population variances:
\[ LCL = \frac{S_1^2}{S_2^2}\cdot\frac{1}{F_{\alpha/2,\nu_1,\nu_2}} = \frac{55}{118}\cdot\frac{1}{F_{0.05,15,13}} = 0.466\cdot\frac{1}{2.53} = 0.1842 \]
\[ UCL = \frac{S_1^2}{S_2^2}\cdot F_{\alpha/2,\nu_2,\nu_1} = \frac{55}{118}\cdot F_{0.05,13,15} = 0.466\cdot 2.45 = 1.1417 \]
c) We estimate that the ratio $\sigma_1^2/\sigma_2^2$ lies between 0.1842 and 1.1417. Since the hypothesized value 1 is
included in the 90% interval estimate, we fail to reject the null hypothesis at α = 0.10.
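The link between the two-tail F-test in a) and the interval in b) and c) can be illustrated in a few lines. This is a sketch only: the function names are hypothetical, and the two F-table values (2.53 and 2.45) are passed in as given in the feedback instead of being computed.

```python
def variance_ratio_interval(s1_sq, s2_sq, f_12, f_21):
    """Confidence limits for sigma1^2 / sigma2^2.

    f_12 is the table value F_{alpha/2, nu1, nu2} and
    f_21 is the table value F_{alpha/2, nu2, nu1}.
    """
    ratio = s1_sq / s2_sq
    return ratio / f_12, ratio * f_21


def two_tail_f_test(s1_sq, s2_sq, f_12, f_21):
    """Return the F statistic and whether H0: sigma1^2 = sigma2^2 is rejected."""
    f_stat = s1_sq / s2_sq
    lower_cutoff = 1.0 / f_21      # F_{1 - alpha/2, nu1, nu2}
    reject = f_stat > f_12 or f_stat < lower_cutoff
    return f_stat, reject


# Question 2: s1^2 = 55, s2^2 = 118, F_{0.05,15,13} = 2.53, F_{0.05,13,15} = 2.45
f_stat, reject_h0 = two_tail_f_test(55, 118, 2.53, 2.45)
lcl, ucl = variance_ratio_interval(55, 118, 2.53, 2.45)

# The test rejects H0 exactly when the value 1 falls outside the interval
one_inside_interval = lcl <= 1 <= ucl

print(f"F = {f_stat:.3f}, reject H0: {reject_h0}")
print(f"90% CI: ({lcl:.4f}; {ucl:.4f}), contains 1: {one_inside_interval}")
```

The limits printed by the script differ slightly in the last decimals from the hand calculation because the feedback rounds 55/118 to 0.466 before multiplying.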
Question 2
Do EXERCISE 13.7 of chapter 13 Keller.
Question 3
Do EXERCISE 13.41 of chapter 13 Keller.
Question 4
Do EXERCISE 13.43 of chapter 13 Keller.
Question 2
Solution to 13.7
Step 1:
We have to test H0 : (µ1 − µ2 ) = 0
against H1 : (µ1 − µ2 ) < 0.
Step 2:
We use the test statistic
Step 3:
Find the critical values.
From Table 4 (see Appendix B, Keller) we find tα; (n1 +n2 −2) = t0.10; (6+6−2) = t0.10; 10 = 1.372 which
means we will reject H0 if t < −1.372.
Since −0.43 < −1.372 we reject the null hypothesis, and conclude that the manager should
choose to use guards.
(Please note: Using statistical software you will find the p-value = 0.0795. Since p < α = 0.10 we
reject H0 .)
(Also note: Using statistical software you will find the Two-tail F-test: F = 1.24, p-value = 0.8194 ⇒
cannot reject $H_0: \sigma_1^2 = \sigma_2^2$ ⇒ it is valid to use the equal-variances test statistic.)
Question 3
Solution to 13.41
We have dependent (paired) observations and we need to work with the differences of the pairs.
(Note that this depends on how you defined the difference: If ABS brakes are more effective
(implying fewer seconds!) than non-ABS brakes, it implies that (ABS − non ABS) would be a negative
value under the alternative hypothesis.)
Therefore,
\[ t = \frac{-0.175 - 0}{0.225/\sqrt{8}} = -2.199 . \]
Decision rule
Rejection region: We will reject H0 if $t < -t_{\alpha;\,n-1}$.
From Table 4 (see Appendix B, Keller) $t_{0.05;\,7} = 1.895$.
Since −2.199 < −1.895 we reject H0 . There is enough evidence at the 5% significance level that
ABS brakes are more effective (implying fewer seconds!) than non-ABS brakes.
Question 4
Solution to 13.43
We have dependent (paired) observations and we need to work with the differences of the pairs.
H0 : µD = 0 vs H1 : µD < 0.
(Note that this depends on how you defined the difference: If the new fertilizer is more effective
than the current fertilizer, it implies that the difference in crop yields of (current fertilizer − new
fertilizer) would be negative under the alternative hypothesis.)
\[ t = \frac{\bar{x}_D - 0}{s_D/\sqrt{n}} \]
where
\[ \bar{x}_D = \frac{\sum d_i}{n} = \frac{-12}{12} = -1.00 ; \]
\[ s_D^2 = \frac{\sum d_i^2 - n\bar{x}_D^2}{n-1} = \frac{112 - 12(-1.00)^2}{11} = 9.0909 ; \]
\[ s_D = \sqrt{9.0909} = 3.0151 . \]
Therefore,
\[ t = \frac{-1.00 - 0}{3.02/\sqrt{12}} = -1.15 . \]
Decision rule
Rejection region: We will reject H0 if $t < -t_{\alpha;\,n-1}$ since $t \sim t_{n-1}$.
From Table 4 (see Appendix B, Keller) $t_{0.05;\,11} = 1.796$.
Since −1.15 > −1.796 we cannot reject H0 . They may not infer at the 5% significance level that
the new fertilizer is more effective than the current fertilizer.
(b) For a 95% confidence interval we need $t_{\alpha/2;\,n-1} = t_{0.05/2;\,12-1} = t_{0.025;\,11} = 2.201$.
\[ \bar{x}_D \pm t_{\alpha/2;\,n-1}\,\frac{s_D}{\sqrt{n}} = -1.00 \pm (2.201)\frac{3.02}{\sqrt{12}} = -1.00 \pm 1.92 = (-2.92;\ 0.92). \]
We are 95% confident that the mean difference in crop yields is between −2.92 and 0.92.
(c) The differences are required to be normally distributed.
(d) No, the histogram of the differences is bimodal.
(e) The data are experimental.
(f) The experimental design should be independent samples.
Can you
· perform a small-sample statistical test for the difference between two population means in the
case of independent random samples?
· derive a small-sample confidence interval for the difference between two population means
(µ1 − µ2 ) in the case of independent random samples?
· perform a small-sample statistical test for the difference between two population means in the
case of dependent random samples?
· derive a small-sample confidence interval for the difference between two population means
(µ1 − µ2 ) in the case of dependent random samples?
· use a confidence interval estimator to test hypotheses for the ratio of two variances when two
independent samples are drawn from normal populations?
Key Terms/Symbols
t-distribution
F-distribution
degrees of freedom
dependent and independent random samples
paired difference test
STUDY UNIT 2
2.1 Introduction
In this study unit we tie up some loose ends. We continue our inference about comparing two
populations, but we shift from means and comparing two variances to proportions. In the last section
we move back to means but extend it to more than two populations.
STUDY
Keller Chapter 13 Inference about comparing two populations
13.5 Inference about the Difference Between Two Population Proportions
We are now sampling from two independent populations where the proportions of the populations
have a certain attribute.
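If $\hat{p}_1 = x_1/n_1$ is the proportion in a random sample of size n1 from a population with parameter p1 and
$\hat{p}_2 = x_2/n_2$ is the proportion in a random sample of size n2 from a second independent population with
parameter p2 , we use the test statistic
\[ Z = \frac{\left(\frac{x_1}{n_1} - \frac{x_2}{n_2}\right) - (p_1 - p_2)}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \]
which has an approximate n(0; 1) distribution. Here $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ is the pooled sample proportion.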
Please note that, similar to the argument for the one-sample case, which we treated in STA1501
(STS1113) and Keller, chapter 12, the SE expression for the hypothesis test is not the same as the
SE expression which we will use when we derive a confidence interval for p1 − p2 . Computing a
pooled estimate makes sense only under the assumption that p1 = p2 (in other words "case 1" or
the hypothesis H0 : p1 − p2 = 0), an assumption which is absent when we construct a confidence interval.
You must not be confused by the very "rare case" or "case 2" which Keller talks about. For this case
2 scenario the null hypothesis is H0 : p1 − p2 = D and the SE expression for the hypothesis test is
exactly the same as the SE expression which we will use when we derive a confidence interval for
p1 − p2 .
Please also note that in the one-sample case in STA1501 (STS1113) our rule of thumb was that np
and n(1 − p) must be greater than 5 for the inference to be valid. We extend these conditions to two
samples meaning that n1 p1 ; n1 (1 − p1 ) ; n2 p2 and n2 (1 − p2 ) must all be greater than 5 for the
inference to be "good".
After you have studied section 13.5 of chapter 13 of the textbook you should try and work through
activities 2.1 and 2.2 to enhance your understanding of a large sample test of hypotheses for the
difference between two binomial proportions.
Activity 2.1
Say whether the following statements are correct or incorrect and try to rectify the incorrect
statements to make them true.
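(a) If we derive a confidence interval for (p1 − p2 ) we use $SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
but if we test H0 : p1 = p2 we use $SE = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$ with $\hat{p} = \frac{X_1 + X_2}{n_1 + n_2}$.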
.............................................................................. ..............................................................................
.............................................................................................................................................................
(b) In testing a hypothesis about the difference between two population proportions (p1 − p2 ) , the z
test statistic measures how close the computed sample difference between two proportions has
come to the hypothesized value of zero.
.............................................................................................................................................................
.............................................................................................................................................................
(c) In a one-tailed test for the difference between two population proportions (p1 − p2 ), if the null
hypothesis is rejected when the alternative hypothesis, H1 : p1 > p2 , is false, a Type I error is
committed.
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
(d) If we derive a confidence interval for (p1 − p2 ), we use $SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
and if we test H0 : p1 − p2 = 0.15, we will also use $SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$ for the z test
statistic.
.............................................................................. ..............................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
Feedback
(a) Correct.
(b) Correct.
(c) Correct.
(d) Correct.
Activity 2.2
A seed distributor, called Easy Grow Seeds, claims that 75% of a specific variety of maize, called
Golden Glow, will germinate. A random sample of n1 = 300 seeds was selected from this batch
and 207 germinated. Denote the population proportion of seeds that germinate as p1 . Suppose that
a second, independent seed distributor, called Seeds of All Kinds, claims that 80% of their stock of
the same variety of maize, called Golden Glow, will germinate. (Denote this population proportion of
seeds that germinate as p2 .) From this population we draw a random sample of size n2 = 200 and
the number of seeds that germinate in this sample is 153.
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
.............................................................................................................................................................
Feedback
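The pooled sample proportion is
\[ \hat{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{207 + 153}{300 + 200} = 0.72 \]
\[ SE_{pooled}(\hat{p}_1 - \hat{p}_2) = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{0.72(1 - 0.72)\left(\frac{1}{300} + \frac{1}{200}\right)} = 0.040988 \]
\[ Z = \frac{\left(\frac{x_1}{n_1} - \frac{x_2}{n_2}\right) - (p_1 - p_2)}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{\left(\frac{207}{300} - \frac{153}{200}\right) - 0}{0.040988} = \frac{-0.075}{0.040988} = -1.8298 \]
Since |Z| = |−1.8298| = 1.8298 > 1.645 ⇒ we reject H0 . It seems likely that the two populations
do not have the same proportions.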
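The difference between the pooled SE (used in the test above) and the unpooled SE (used for a confidence interval) can be seen side by side in the following sketch. The function names are hypothetical, the critical value 1.645 is the table value used above, and the p-value is obtained from the standard normal distribution via math.erfc.

```python
import math


def pooled_z_test(x1, n1, x2, n2):
    """z statistic and pooled SE for H0: p1 - p2 = 0."""
    p_pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return z, se


def unpooled_interval(x1, n1, x2, n2, z_crit):
    """Confidence interval for p1 - p2 using the unpooled SE."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z_crit * se, diff + z_crit * se


# Activity 2.2: 207 of 300 and 153 of 200 seeds germinated
z_stat, se_pooled = pooled_z_test(207, 300, 153, 200)

# Two-sided p-value: 2 * P(Z > |z|) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z_stat) / math.sqrt(2))

# 90% confidence interval, z_{0.05} = 1.645
ci = unpooled_interval(207, 300, 153, 200, z_crit=1.645)

print(f"z = {z_stat:.4f}, pooled SE = {se_pooled:.6f}, p-value = {p_value:.4f}")
print(f"90% CI for p1 - p2: ({ci[0]:.3f}; {ci[1]:.3f})")
```

The interval excludes zero, which agrees with rejecting H0 at α = 0.10.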
Extra explanation:
With a confidence interval our focus is on the inside of the probability statement and with a
hypothesis test our focus is on the outside of the probability statement. For example, for a 90%
confidence interval
[Figure: standard normal curve with an area of 0.05 in each tail; the rejection regions lie below −1.645 and above 1.645. Two-sided hypothesis test using α = 0.10]
STUDY
Keller Chapter 14 Analysis of Variance (not 14.5 and 14.6)
14.1 One-Way Analysis of Variance
Suppose we mix the data observations of the k different groups in one big box and disregard for a
moment which score belongs to which group. Even as one big sample, the scores are not the same!
They vary from a smallest score to a largest score – hence we have variability.
Consider $(n_1 + n_2 + n_3 + \dots + n_k) = n$ as one big sample of n values. Now, suppose that
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$ is true! Then the k "mixed-together" groups can be considered as one big random
sample from the same population! This can of course only be true under the original assumption
of equal variances, i.e. that $\sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2$ for the k populations. We denote this common
population variance by σ² (say).
From estimation theory we learn that the most efficient, unbiased point estimator of a population
variance, σ² in general, is given by the sample variance
\[ \frac{\sum (x_i - \bar{x})^2}{n - 1}. \]
An expansion of the notation to elegantly accommodate different xi -values from different
samples would be to use an x with a double subscript ij instead of just i as well as double
summation instead of single summation. You might wonder why on earth we would like to do this,
but the beauty of this notation is that it allows us to keep track of every single observation from every
possible sample!
So, for a k sample data set we write $x_{ij}$ where j = 1; 2; 3; ...; k (the number of samples) and
i = 1; 2; 3; ...; $n_j$ (the size of the individual sample).
[:-) This means if j = 1 we list the first group as $x_{11}; x_{21}; x_{31}; \dots$ up to $x_{n_1 1}$ (for our first data set) and
if j = 2 we list the second group as $x_{12}; x_{22}; x_{32}; \dots$ up to $x_{n_2 2}$ (for our second data set) etc. up to
$x_{1k}; x_{2k}; x_{3k}; \dots$ up to $x_{n_k k}$ (for our k -th data set).]
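Even though we momentarily consider the data as one sample, we can still calculate (k + 1) possible
different means. There is of course the overall mean of all the observations, indicated as $\bar{\bar{x}}$, and then
there are also $\bar{x}_1$ and $\bar{x}_2$ etc. up to $\bar{x}_k$ for the k respective group means.
\[ \bar{\bar{x}} = \frac{\sum_{j=1}^{k}\sum_{i=1}^{n_j} x_{ij}}{n} \]
\[ \bar{x}_1 = \frac{\sum_{i=1}^{n_1} x_{i1}}{n_1} \]
\[ \bar{x}_2 = \frac{\sum_{i=1}^{n_2} x_{i2}}{n_2} \]
\[ \vdots \]
\[ \bar{x}_k = \frac{\sum_{i=1}^{n_k} x_{ik}}{n_k} \]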
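[1] The total sum of squares of deviations from the overall mean is given as SSTotal.
Let $SSTotal = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{\bar{x}})^2$ then
\[ E\left(\frac{SSTotal}{n-1}\right) = \sigma^2 . \]
[2] Let $SSWithin = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_j)^2$ then
\[ E\left(\frac{SSWithin}{n-k}\right) = \sigma^2 . \]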
$\frac{SSWithin}{n-k}$ provides an accurate estimate of σ², whether or not the sample means are
equal.
[:-) If there are only two samples, so that n1 + n2 = n, then $S^2_{pooled} = \frac{SSWithin}{n-2}$.]
In many applications σ² is considered as a measure of "error" hence SSWithin = SSError and
if we divide by the degrees of freedom we call $\frac{SSWithin}{n-k} = \frac{SSError}{n-k}$ the Mean Square Error.
[3] Let $SSBetween = \sum_{j=1}^{k} n_j(\bar{x}_j - \bar{\bar{x}})^2$ then
\[ E\left(\frac{SSBetween}{k-1}\right) = \sigma^2 . \]
Why the third expression is a possible estimate of σ² is more tricky to explain, but it makes intuitive
sense (and it simplifies matters) if the sample sizes are equal (i.e. the same). Assume that
$n_1 = n_2 = \dots = n_k = n_j$ (say). Under the assumption of the null hypothesis, $H_0: \mu_1 = \mu_2 = \dots = \mu_k$,
the means of the different groups are actually k estimates of the overall population mean µ but the
means (when considered as variables) have a smaller variance than individual observations when
we compute their deviations from the overall mean. [ :-) Think back to what you learned about the
sampling distribution of a sample mean: It has variance σ²/n.] In other words, when we compute a
sample variance for the k observed means (which are now considered as a sample of size k), this
sample variance is an estimate of the value $\sigma^2/n_j$, where $n_j$ is the common sample size.
In the true jargon of experimental design, the different groups/samples are considered to be different
levels of a treatment, hence SSBetween = SSTreatment which measures the variation between
samples. If we divide by the degrees of freedom we call $\frac{SSTreatment}{k-1}$ the Mean Square Treatment.
This estimate only provides an accurate estimate of σ² if the sample means are equal.
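The three sums of squares are linked by a standard identity that is useful for checking hand calculations. The sketch below writes SSBetween as $\sum_j n_j(\bar{x}_j - \bar{\bar{x}})^2$, with $\bar{\bar{x}}$ the overall mean and $\bar{x}_j$ the mean of group j; the cross-product term vanishes because the deviations within each group sum to zero.

```latex
\begin{align*}
\sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij}-\bar{\bar{x}})^2
  &= \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij}-\bar{x}_j)^2
   + \sum_{j=1}^{k} n_j(\bar{x}_j-\bar{\bar{x}})^2 \\
SSTotal &= SSWithin + SSBetween \\
(n-1) &= (n-k) + (k-1)
\end{align*}
```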
Where does the F -distribution get into the picture? If there is no difference between the means we
would expect the ratio (estimate 3)/(estimate 2) to be equal to one. According to statistical distribution theory the
ratio (estimate 3)/(estimate 2) has a so-called F -distribution. A-ha, and here we have the makings of a hypothesis
test! We can compare the computed F -value with a critical value obtained from a Critical Values of
F-table. (See Keller, Table 6 Appendix B.) If the computed value of the test statistic deviates "too
much" from 1 we will become suspicious of H0 .
\[ F = \frac{\text{estimated population variance based on the variation among the sample means}}{\text{estimated population variance based on the variation within each of the samples}} = \frac{MSTreatment}{MSError} \sim F_{\upsilon_1;\,\upsilon_2} \]
with
υ1 = k − 1
υ2 = n − k
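The F ratio described above can be computed step by step from the group data. The following is a minimal sketch with a hypothetical function name and a small made-up data set of three groups (not data from the guide); it returns the sums of squares, the two mean squares and F, and leaves the comparison with the table value to the reader.

```python
def one_way_anova(groups):
    """Sums of squares, mean squares and F ratio for a list of samples."""
    k = len(groups)                               # number of samples
    n = sum(len(g) for g in groups)               # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Variation between samples (SSBetween = SSTreatment)
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation within samples (SSWithin = SSError)
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    # Total variation around the overall mean
    ss_total = sum((x - grand_mean) ** 2 for g in groups for x in g)

    ms_treatment = ss_between / (k - 1)
    ms_error = ss_within / (n - k)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ss_total": ss_total,
        "df": (k - 1, n - k),
        "f": ms_treatment / ms_error,
    }


# Hypothetical data, chosen only so that the arithmetic is easy to follow
example_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
result = one_way_anova(example_groups)
print(result)
```

For this made-up data SSBetween = 26 and SSWithin = 6, so MSTreatment = 13, MSError = 1 and F = 13 with (2; 6) degrees of freedom.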
The F-distribution
The F-distribution has two parameters, also called degrees of freedom. In any F-table with critical
values you will need to know these two values, often indicated as υ1 and υ2 . For very small values
of υ 1 and υ 2 the density function does not look like the typical "skewed-to-the-right-normal" density
function.
[Figure: density function of the F4; 55 distribution]
For an ANOVA test your critical region will always look like a right-sided test even though it
is a two-sided test! This means you use "all of α” on the right side.
[:-) This principle, where the focus is on variances but the test statistic is actually sensitive to
differences between means, applies even to two groups. It is important to note that the ANOVA
test for the case where k = 2, i.e. when we test H0 : µ1 = µ2 , is only valid for a two-sided
alternative, i.e. H1 : µ1 ≠ µ2 . (For a specific application "k" will be replaced with "2" or will be
replaced with "3" or whatever, where "k" = number of samples.)]
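For k = 2 the ANOVA F statistic equals the square of the equal-variances t statistic, which is one way to see why the test cannot distinguish the direction of a difference. The sketch below checks this numerically; the function names and the two small samples are hypothetical.

```python
import math


def pooled_t(a, b):
    """Equal-variances (pooled) t statistic for two independent samples."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    ss1 = sum((x - m1) ** 2 for x in a)
    ss2 = sum((x - m2) ** 2 for x in b)
    sp_sq = (ss1 + ss2) / (n1 + n2 - 2)          # pooled variance
    return (m1 - m2) / math.sqrt(sp_sq * (1 / n1 + 1 / n2))


def anova_f_two_groups(a, b):
    """One-way ANOVA F statistic for k = 2 samples."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    m1, m2 = sum(a) / n1, sum(b) / n2
    grand_mean = (sum(a) + sum(b)) / n
    ss_between = n1 * (m1 - grand_mean) ** 2 + n2 * (m2 - grand_mean) ** 2
    ss_within = sum((x - m1) ** 2 for x in a) + sum((x - m2) ** 2 for x in b)
    return (ss_between / (2 - 1)) / (ss_within / (n - 2))


# Hypothetical samples, used only to illustrate the identity F = t^2
sample_a = [4, 5, 6, 7]
sample_b = [6, 8, 9, 9]

t_stat = pooled_t(sample_a, sample_b)
f_stat = anova_f_two_groups(sample_a, sample_b)
print(f"t = {t_stat:.4f}, t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
```

Because F = t², a large F arises from a large positive or a large negative t alike, so only the two-sided alternative can be tested.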
If we reject the null hypothesis, we conclude that at least two means differ. The "extension of
ANOVA" to be able to conclude which means are responsible for the differences, is called multiple
comparisons. This is treated in section 14.2 of Keller. We will discuss this soon.
Activity 2.3
The marketing manager of a pizza chain is in the process of examining some of the demographic
characteristics of her customers. In particular, she would like to investigate the belief that the ages
of the customers of pizza parlors, hamburger huts, and fast-food chicken restaurants are different.
As an experiment, the ages of eight customers randomly selected from each of the restaurants are
recorded and listed below. Assume that we know from previous analyses that the ages are normally
distributed with the same variances.
Customers’ Ages
Pizza Hamburger Chicken
23 26 25
19 20 28
25 18 36
17 35 23
36 33 39
25 25 27
28 19 38
31 17 31
[:-) Always keep in mind that small differences could be due to rounding errors!]
(c) Do these data provide enough evidence at the 5% significance level to infer that there are
differences in ages among the customers of the three restaurants?
Feedback
(iv) Correct. $SS_{\text{Between}} = SS_{\text{Treatment}} = \sum_{j=1}^{k} n_j(\bar{x}_j - \bar{\bar{x}})^2 = 8(25.448) = 203.584$
(b)
Source of Variation SS df MS F F0.05;2;21
Treatments 203.584 2 101.792 2.475 3.47
Error 863.760 21 41.131
Total 1067.344 23
(c) $F = \dfrac{MSTr}{MSE} = \dfrac{101.792}{41.131} = 2.475 < F_{0.05;2;21} = 3.47 \implies$ we cannot reject $H_0: \mu_1 = \mu_2 = \mu_3$.
The data do not provide enough evidence at the 5% significance level to infer that there are
differences in ages among the customers of the three restaurants.
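The manual calculation above can be checked with a few lines of code. The following is only a minimal sketch in Python (the function name `one_way_anova` is our own choice, not something from Keller), and the critical value is simply copied from the F table as in the feedback:

```python
from statistics import mean

def one_way_anova(samples):
    """Return (SST, SSE, MST, MSE, F) for a list of samples, one list per treatment."""
    k = len(samples)                      # number of treatments
    n = sum(len(s) for s in samples)      # total number of observations
    grand_mean = sum(sum(s) for s in samples) / n
    # Between-treatments variation: n_j * (treatment mean - grand mean)^2
    sst = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    # Within-treatments variation: squared deviations from each treatment mean
    sse = sum((x - mean(s)) ** 2 for s in samples for x in s)
    mst = sst / (k - 1)
    mse = sse / (n - k)
    return sst, sse, mst, mse, mst / mse

pizza = [23, 19, 25, 17, 36, 25, 28, 31]
hamburger = [26, 20, 18, 35, 33, 25, 19, 17]
chicken = [25, 28, 36, 23, 39, 27, 38, 31]

sst, sse, mst, mse, f_stat = one_way_anova([pizza, hamburger, chicken])
F_CRITICAL = 3.47  # F(0.05; 2; 21), read from the F table as in the feedback

print(f"SST = {sst:.3f}, SSE = {sse:.3f}, MST = {mst:.3f}, MSE = {mse:.3f}, F = {f_stat:.3f}")
print("reject H0" if f_stat > F_CRITICAL else "cannot reject H0")
```

The unrounded sums of squares differ slightly from the table in the third decimal place, which is exactly the kind of rounding difference mentioned earlier.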
:-) Do you agree that doing an ANOVA manually is usually arduous work?
To appreciate the assistance of a computer even more, and to understand the workings of ANOVA,
you can try to do the next activity.
You will notice that this activity challenges you to manipulate your computational formulae implying
that you understand what you do!
Activity 2.4
Do Keller exercise 14.1.
Feedback
$SS_{\text{Treatment}} = \sum_{j=1}^{k} n_j(\bar{x}_j - \bar{\bar{x}})^2 = 5(10 - 15)^2 + 5(15 - 15)^2 + 5(20 - 15)^2 = 250$
ANOVA Table
Source of Variation   Sum of squares   df           Mean Squares                         F
Treatments            250              k − 1 = 2    MSTr = SSTr/(k − 1) = 250/2 = 125    F = MSTr/MSE = 125/50 = 2.50
Error                 600              n − k = 12   MSE = SSE/(n − k) = 600/12 = 50
Total                 850              n − 1 = 14
$SS_{\text{Treatment}} = \sum_{j=1}^{k} n_j(\bar{x}_j - \bar{\bar{x}})^2 = 10(10 - 15)^2 + 10(15 - 15)^2 + 10(20 - 15)^2 = 500$ (this value
increased).
ANOVA Table
Source of Variation   Sum of squares   df           Mean Squares                         F
Treatments            500              k − 1 = 2    MSTr = SSTr/(k − 1) = 500/2 = 250    F = MSTr/MSE = 250/50 = 5.00
Error                 1350             n − k = 27   MSE = SSE/(n − k) = 1350/27 = 50
Total                 1850             n − 1 = 29
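To see how the sample size drives the F statistic, the two tables above can be reproduced from summary statistics only. This is a rough sketch: the helper `anova_from_summary` is hypothetical, it assumes equal sample sizes, and the common sample variance of 50 is our inference from the error sums of squares (600 = 3 × 4 × 50 and 1350 = 3 × 9 × 50), not a value quoted in the feedback.

```python
def anova_from_summary(means, n_per_group, sample_variance):
    """Return (SST, SSE, F) from treatment means, a common group size and a common variance."""
    k = len(means)
    n = k * n_per_group
    grand_mean = sum(means) / k  # valid because all groups have the same size
    sst = sum(n_per_group * (m - grand_mean) ** 2 for m in means)
    sse = k * (n_per_group - 1) * sample_variance  # sum of (n_j - 1) * s_j^2
    mst = sst / (k - 1)
    mse = sse / (n - k)
    return sst, sse, mst / mse

small = anova_from_summary([10, 15, 20], n_per_group=5, sample_variance=50)
large = anova_from_summary([10, 15, 20], n_per_group=10, sample_variance=50)

print("n = 5 per sample :", small)
print("n = 10 per sample:", large)
```

MSE stays at 50 in both cases, so doubling the sample sizes doubles SST and therefore doubles F.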
STUDY
Keller Chapter 14 Analysis of variance
14.2 Multiple comparisons
Performing an analysis of variance test to determine whether differences exist between two or more
population means is a good start, but it is not nearly enough for a practical application where it is
necessary to identify which treatment means are responsible for the differences. The statistical
method used to determine this is called multiple comparisons. We will consider three methods for
this purpose, namely
· Fisher’s least significant difference (LSD) method, which is used if you want to find areas for further
investigation.
· The Bonferroni method, which is used if you want to identify two or three pairwise comparisons.
· Tukey’s method, which is used when you want to consider all possible population combinations.
These three methods are discussed in Keller. Make sure that you understand them and can apply
the knowledge. The formulas for the three methods are different, but you need not remember them.
In fact, rather go through activity 2.5 and its solution to see how the three methods are applied.
As your knowledge of statistics expands, lengthy calculations will interest you less and less, seeing
that your interest should move to the actual statistical analysis. There is a very delicate balance
between the importance of the calculation and the statistical analysis: if the calculation is incorrect,
the analysis has no meaning. Still, you are being trained to make a meaningful and correct analysis.
Once you understand the method applied in the calculation, that part can be taken over by statistical
software. This is why most statisticians start to use statistical software for their calculations at an
early stage. We are introducing students at second level in STA2601 to the software package JMP. It
is therefore advisable for you to take note of the given Excel and Minitab printouts in Keller. Try to do
them yourself if you have access to Excel or Minitab and if you do not have access, study them and
note what information they supply and how to interpret it. No professional statistician can function
properly without knowledge of and using statistical software.
Activity 2.5
Question 1
An investor studied the percentage rates of return of three different types of mutual funds. Random
samples of percentage rates of return for five periods were taken from each fund. The results appear
in the table below:
Mutual Funds Percentage Rates
Fund 1 Fund 2 Fund 3
12 4 9
15 8 3
13 6 5
14 5 7
17 4 4
Use Tukey’s method with α = .05 to determine which population means differ.
Question 2
Do Keller exercise 14.21.
Feedback
Question 1
ω = 2.684 x̄1 = 14.2 x̄2 = 5.4 x̄3 = 5.6
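The comparison step of Tukey's method can be sketched as follows. The value of ω is taken as given from the feedback above (it is not recomputed here), and the variable names are our own:

```python
from itertools import combinations
from statistics import mean

funds = {
    "Fund 1": [12, 15, 13, 14, 17],
    "Fund 2": [4, 8, 6, 5, 4],
    "Fund 3": [9, 3, 5, 7, 4],
}
OMEGA = 2.684  # critical difference quoted in the feedback, taken as given

means = {name: mean(values) for name, values in funds.items()}

# Two population means are declared different when |difference of sample means| > omega
differs = {}
for a, b in combinations(funds, 2):
    difference = abs(means[a] - means[b])
    differs[(a, b)] = difference > OMEGA
    verdict = "differ" if differs[(a, b)] else "no evidence of a difference"
    print(f"{a} vs {b}: |difference| = {difference:.1f} -> {verdict}")
```

Under these inputs the mean of Fund 1 differs from the means of Fund 2 and Fund 3, while Fund 2 and Fund 3 cannot be distinguished.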
Question 2
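a) α = .05, $t_{\alpha/2,\,n-k} = t_{.025,27} = 2.052$
\[ LSD = t_{\alpha/2}\sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} = 2.052\sqrt{700\left(\frac{1}{10} + \frac{1}{10}\right)} = 24.28 \]
Conclusion: µ2 differs from µ1 and µ3 because |27.3| > 24.28 and |−32.3| > 24.28.
b) $C = \frac{3(2)}{2} = 3$, $\alpha_E = .05$, $\alpha = \frac{\alpha_E}{C} = 0.0167$, $t_{\alpha/2,\,n-k} = t_{.0083,27} = 2.552$ (from Excel)
\[ LSD = t_{\alpha/2}\sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} = 2.552\sqrt{700\left(\frac{1}{10} + \frac{1}{10}\right)} = 30.20 \]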
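Both LSD values can be verified quickly. In this sketch the function name `lsd` is our own, and the two t values are simply the ones quoted in the feedback (from the t table and from Excel) rather than being computed:

```python
from math import sqrt

def lsd(t_critical, mse, n_i, n_j):
    """Least significant difference for comparing treatments i and j."""
    return t_critical * sqrt(mse * (1 / n_i + 1 / n_j))

MSE, N = 700, 10

# a) Fisher's LSD with alpha = 0.05, t(0.025; 27) = 2.052
fisher = lsd(2.052, MSE, N, N)

# b) Bonferroni adjustment: C = k(k - 1)/2 pairwise comparisons share alpha_E
k = 3
comparisons = k * (k - 1) // 2
alpha_adjusted = 0.05 / comparisons
bonferroni = lsd(2.552, MSE, N, N)  # t(0.0083; 27) = 2.552

print(f"Fisher LSD = {fisher:.2f}")
print(f"C = {comparisons}, adjusted alpha = {alpha_adjusted:.4f}, Bonferroni LSD = {bonferroni:.2f}")
```

The Bonferroni threshold is larger than Fisher's, so it is harder to declare a pair of means different once the error rate is shared among all comparisons.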
READ
Keller Chapter 14 Analysis of variance
14.3 Analysis of variance experimental designs
In this section an overview is given of two experimental designs and different concepts are described.
Read through the three paragraphs - most probably a few times to get a proper overview of single and
multifactor designs; independent samples; randomized block designs; repeated measures; two-way
analysis of variance for fixed and random effects.
STUDY
Keller Chapter 14 Analysis of variance
14.4 Randomized block(two-way) analysis of variance
The calculations for this type of analysis are so time consuming that Keller gives only computer printouts
in the explanations. This way you can learn about the method and its application. In the examination
we will not be testing your calculation skills, but your knowledge of the process and the analysis
itself.
When we want to reduce the within-treatments variation so that differences between the treatment
means are easier to detect, we design a randomized block experiment. Total variation is then
partitioned into three different sources, namely treatments, blocks and error:
SS(Total) = SST + SSB + SSE.
With this design we can test not only whether the treatment means differ, but also whether the block
means differ. Of course, if the block means do not differ, blocking was not needed and this design was
not the correct one!
Compare the two test statistics $F = \dfrac{MST}{MSE}$ with ν1 = k − 1; ν2 = n − k − b + 1 degrees of freedom
and
$F = \dfrac{MSB}{MSE}$ with ν1 = b − 1; ν2 = n − k − b + 1 degrees of freedom.
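A short worked step may help with the error degrees of freedom. Assuming one observation for every treatment-block combination, so that n = kb (an assumption we add here; it is not stated explicitly above), the expression simplifies as follows:

```latex
\[
\nu_2 = n - k - b + 1 = kb - k - b + 1 = (k - 1)(b - 1)
\]
```

For instance, a hypothetical design with k = 3 treatments and b = 4 blocks would give ν2 = 2 × 3 = 6.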
Study the example in Keller and give special attention to the interpretation of the results.
Activity 2.6
Question 1
Do question 14.31 in Keller
Question 2
A partial ANOVA table in a randomized block design is shown below, where the treatments refer to
different high blood pressure drugs, and the blocks refer to different groups of men with high blood
pressure. Use the given ANOVA table to answer the questions:
Source of Variation SS df MS F
__________________________________________________________
Treatments 6,720 4 1,680 14.6087
Blocks 3,120 6 520 4.5217
Error 2,760 24 115
__________________________________________________________
Total 12,600 34
a) Can we infer at the 5% significance level that the treatment means differ?
b) Can we infer at the 5% significance level that the block means differ?
Feedback
Question 1
ANOVA Table
Source       Degrees of Freedom   Sum of Squares   Mean Squares   F
Treatments   2                    100              50.00          24.04
Blocks       6                    50               8.33           4.00
Error        12                   25               2.08
Total        20                   175
Question 2
a) H0 : µ1 = µ2 = µ3 = µ4 = µ5 versus:
Ha : At least two means differ
Rejection region: F > Fα,ν1,ν2 = F0.05,4,24 = 2.78
Test statistic: F = 14.6087
Conclusion: Reject the null hypothesis. Yes, at least two of the treatment means differ.
b) H0 : µ1 = µ2 = µ3 = µ4 = µ5 = µ6 = µ7 versus:
Ha : At least two block means differ.
Rejection region: F > Fα,ν1,ν2 = F0.05,6,24 = 2.51
Test statistic: F = 4.5217
Conclusion: Reject the null hypothesis. Yes, at least two of the block means differ.
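The two F statistics in Question 2 follow directly from the sums of squares in the partial ANOVA table. The sketch below uses a hypothetical helper, `randomized_block_f`; the values k = 5 and b = 7 are read off the degrees of freedom in the table, and the critical values are the table values quoted above:

```python
def randomized_block_f(ss_treat, ss_block, ss_error, k, b):
    """Return (F for treatments, F for blocks) in a randomized block design."""
    df_treat = k - 1
    df_block = b - 1
    df_error = (k - 1) * (b - 1)  # equals n - k - b + 1 when n = k * b
    mse = ss_error / df_error
    f_treat = (ss_treat / df_treat) / mse
    f_block = (ss_block / df_block) / mse
    return f_treat, f_block

f_treat, f_block = randomized_block_f(6720, 3120, 2760, k=5, b=7)

F_CRIT_TREAT = 2.78  # F(0.05; 4; 24) from the F table
F_CRIT_BLOCK = 2.51  # F(0.05; 6; 24) from the F table

print(f"Treatments: F = {f_treat:.4f}, reject H0: {f_treat > F_CRIT_TREAT}")
print(f"Blocks:     F = {f_block:.4f}, reject H0: {f_block > F_CRIT_BLOCK}")
```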
2.7 Self-correcting Exercises for Unit 2
Question 1
Do Keller: Exercise 13.68.
Question 2
Do Keller: Exercise 13.73.
Question 3
Consider the following ANOVA table:
Source of Variation   Sum of squares   df   Mean Squares   F
Treatments            128              4    32             2.963
Error                 270              25   10.8
Total                 398              29
(e) Assume that the above ANOVA is applied to independent samples taken from normally distributed
populations with equal variances. If the null hypothesis is rejected, then we can infer that at least
two population means differ.
Question 4
Do Keller: Exercise 14.5
2.8 Solutions to Self-correcting Exercises for Unit 2
Question 1
(Solution to Keller: Exercise 13.68)
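\[ p_{\text{pooled}} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{225(0.60) + 225(0.55)}{225 + 225} = 0.575 \]
[:-) This was tricky and mean and something you simply had to figure out on your own!]
\[ Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{p_{\text{pooled}}(1 - p_{\text{pooled}})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{(0.60 - 0.55) - 0}{\sqrt{0.575(1 - 0.575)\left(\frac{1}{225} + \frac{1}{225}\right)}} = 1.0728 \]
(b) Now
\[ p_{\text{pooled}} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{225(0.95) + 225(0.90)}{225 + 225} = 0.925 \]
\[ Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{p_{\text{pooled}}(1 - p_{\text{pooled}})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{(0.95 - 0.90) - 0}{\sqrt{0.925(1 - 0.925)\left(\frac{1}{225} + \frac{1}{225}\right)}} = 2.0135 \]
(d)
\[ p_{\text{pooled}} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{225(0.10) + 225(0.05)}{225 + 225} = 0.075 \]
Note that $p_{\text{pooled}}(1 - p_{\text{pooled}})$ is the same value as in (b) ⟹ z is the same for both expressions ⟹ the
p-value will be exactly the same as in question (b).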
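The three pooled calculations can be repeated with a small function. This is a sketch only; `pooled_z` is a name we introduce here, and it covers just the case where the hypothesised difference is zero:

```python
from math import sqrt

def pooled_z(p1_hat, p2_hat, n1, n2):
    """z statistic for H0: p1 - p2 = 0, using the pooled proportion estimate."""
    pooled = (n1 * p1_hat + n2 * p2_hat) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return pooled, (p1_hat - p2_hat) / se

pooled_a, z_a = pooled_z(0.60, 0.55, 225, 225)
pooled_b, z_b = pooled_z(0.95, 0.90, 225, 225)
pooled_d, z_d = pooled_z(0.10, 0.05, 225, 225)

print(f"(a) pooled = {pooled_a:.3f}, z = {z_a:.4f}")
print(f"(b) pooled = {pooled_b:.3f}, z = {z_b:.4f}")
print(f"(d) pooled = {pooled_d:.3f}, z = {z_d:.4f}")
```

Parts (b) and (d) give the same z because 0.925 × 0.075 and 0.075 × 0.925 are the same product, while the difference between the sample proportions is 0.05 in both cases.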
Question 2
(Solution to Keller: Exercise 13.73)
(a) Test the null hypothesis H0 : (p1 − p2) = 0 vs H1 : (p1 − p2) > 0. (If popularity decreases
⟹ p1 > p2.)
\[ p_{\text{pooled}} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{1100(0.56) + 800(0.46)}{1100 + 800} = 0.51789 \]
\[ Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{p_{\text{pooled}}(1 - p_{\text{pooled}})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{(0.56 - 0.46) - 0}{\sqrt{0.51789(1 - 0.51789)\left(\frac{1}{1100} + \frac{1}{800}\right)}} = 4.3070 \]
Reject H0 if z > z0.05 = 1.645. Since 4.3070 > 1.645 ⟹ we reject the null hypothesis and
conclude his popularity decreased.
(b) For this question we have to test the null hypothesis H0 : (p1 − p2) = 0.05 vs H1 : (p1 − p2) > 0.05.
Now p_pooled does not exist and SE must be computed differently because under H0 p1 ≠ p2:
\[ SE = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = \sqrt{\frac{0.56(1 - 0.56)}{1100} + \frac{0.46(1 - 0.46)}{800}} = 0.023119 \]
\[ Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{SE} = \frac{(0.56 - 0.46) - 0.05}{0.023119} = 2.1627 \]
Reject H0 if z > z0.05 = 1.645. Since 2.1627 > 1.645 ⟹ we reject the null hypothesis and
conclude his popularity decreased by more than 5%.
(c) If we derive a confidence interval for (p1 − p2) we use
\[ SE = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = 0.023119. \]
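When the hypothesised difference is not zero, the standard error uses the two sample proportions separately. A minimal sketch of part (b), with function names of our own choosing:

```python
from math import sqrt

def unpooled_se(p1_hat, p2_hat, n1, n2):
    """Standard error of (p1_hat - p2_hat) without pooling the two samples."""
    return sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

def z_for_difference(p1_hat, p2_hat, n1, n2, hypothesised_diff):
    """z statistic for H0: p1 - p2 = hypothesised_diff (a non-zero difference)."""
    se = unpooled_se(p1_hat, p2_hat, n1, n2)
    return ((p1_hat - p2_hat) - hypothesised_diff) / se

se = unpooled_se(0.56, 0.46, 1100, 800)
z = z_for_difference(0.56, 0.46, 1100, 800, hypothesised_diff=0.05)
Z_CRITICAL = 1.645  # z(0.05) from the normal table

print(f"SE = {se:.6f}, z = {z:.4f}, reject H0: {z > Z_CRITICAL}")
```

The same `unpooled_se` value is the one that would be used for the confidence interval in part (c).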
Question 3
The statements are all correct.
Question 4
(Solution to Keller 14.5)
H0 : µ1 = µ2 = µ3
$SSTr = \sum_{j=1}^{k} n_j(\bar{x}_j - \bar{\bar{x}})^2 = 6(1.333 - 2.167)^2 + 6(2.5 - 2.167)^2 + 6(2.667 - 2.167)^2 = 6.339$
ANOVA Table
Source of Variation   Sum of squares   df           Mean Squares                              F
Treatments            6.339            k − 1 = 2    MSTr = SSTr/(k − 1) = 6.339/2 = 3.1695    F = MSTr/MSE = 3.1695/1.88 = 1.686
Error                 28.200           n − k = 15   MSE = SSE/(n − k) = 28.200/15 = 1.88
Total                 34.539           n − 1 = 17
Cannot reject H0 : µ1 = µ2 = µ3 . There is not enough evidence to conclude that differences exist
between the three brands.
2.9 Learning Outcomes
Can you
· define SE for $(\hat{p}_1 - \hat{p}_2)$ under the assumption that p1 = p2?
· demonstrate an understanding of the connections between the concepts significance level and
p-value?
· interpret computer output regarding inferences about an F-test for two population variances?
· differentiate between one- and two-way analysis of variance experimental designs as well as
randomized block designs?
Key Terms/Symbols
degrees of freedom
F-test for two population variances
ANOVA-test
within-treatments variation
sum of squares for error
between-treatments variation
SS Within
SS Between
SS Blocks
SS Error
SS Treatment
overall mean
STUDY UNIT 3
3.1 Chi–square test
It is just as important to consider the sampled population as it is to know the data type of your
sample. What do you want to know about a specific population or populations? In the earlier study
units we were always interested in the parameters of the population, which implied that we had some
information about the population (e.g. we knew that it was normally, or approximately so, distributed).
What we have discussed so far implied so-called parametric techniques, where we considered the
statistics of a sample to predict the parameters of the distribution describing the population. In the first
part of this study unit we consider other very important parametric techniques, namely chi-squared
tests. In the second part of this unit we then venture into something new, addressing the dilemma
when one cannot make assumptions about the shape of the sampled population. As statisticians
we are often faced with this reality. Do you think that it is still possible to use a random sample
drawn from such a population and make a sensible analysis and even predictions about that sampled
population? Yes! You are going to see that there are also nonparametric techniques that you can use
if you do not know about the distribution of the sampled population. As usual, apart from explaining
the methods, we also describe the necessary conditions under which these alternatives apply.
Of course, choosing the correct technique for the particular data type stays important.
The first part of this study unit covers two applications of the continuous chi-squared distribution,
which is the technique applicable if the data are nominal. In STA1501 you heard about this distribution;
here the hypothesis tests and the conditions for their application are discussed. Only the
chi-squared goodness-of-fit test and the chi-squared test of a contingency table form part of the contents
of this module (the test for normality is therefore not included). In the second part of this study unit
you will be introduced to three nonparametric techniques. You will see that the sampled populations
are nonnormal and that dependence and independence of the samples play an important role. The
techniques you have to know for this module are the Wilcoxon rank sum test for ordinal or interval
data from two independent samples, the sign test for ordinal data in the form of matched pairs and
lastly the Wilcoxon signed rank test for interval data, also in the form of matched pairs. There are
other nonparametric tests in the prescribed book, but they are not included in the contents of this
module. Remember about them because you never know if you may need to use one of them in
future. Then you simply take Keller and read up about them!
As you study these different tests, please do not be discouraged by all the different definitions that
are given and are used in the manual examples. Remember that we are statisticians and we do not
want to test your memory, but your knowledge of the different procedures and their conditions. In the
examination you will be given a list of formulas from which you can select the one you need (should
we ask a question in an examination paper where you need a formula).
STUDY
Keller Chapter 15 Chi-squared tests
15.1 Chi-Squared Goodness-of-Fit Test
◦ Test statistic
◦ Required condition
In distance learning the pronunciation of words or symbols is often a problem. If you wonder about
the word "chi" or its symbol χ, think of the words "pie" or "sky" in English, because "chi" rhymes with
them. The ch is pronounced as a k, which means that you actually say "kai".
For the symbol χ² you say "kai-square".
Recall the knowledge given to you in STA1501 about a binomial experiment and the binomial
distribution. Just a reminder - the prefix bi- refers to two, while the prefix multi- refers to many.
Chi-square is a family of distributions commonly used for significance testing. A chi-square test
(also chi-squared or χ2 test) is any statistical hypothesis test in which the sampling distribution of
the test statistic is a chi-square distribution when the null hypothesis is true, or any in which this
is asymptotically true, meaning that the sampling distribution (if the null hypothesis is true) can be
made to approximate a chi-square distribution as closely as desired by making the sample size large
enough. A number of tests exist, but you are required to focus only on this one.
Below is a table illustrating the similarities and differences between a binomial and a multinomial
experiment.
Binomial experiment consists of                      Multinomial experiment consists of
a fixed number n of trials                           a fixed number n of trials
two possible outcomes per trial                      k categories (cells) of outcomes per trial
constant probability outcomes p and 1 − p            constant probabilities pi for each cell i
two probabilities p (success) and 1 − p (failure)    k probabilities pi and p1 + p2 + ... + pk = 1
different independent trials                         different independent trials
x successes in n trials                              observed frequencies fi of outcomes in cell i
expected value µ = np                                expected frequencies ei = npi
The discussion in STA1501 on the chi-squared distribution was very brief. In this section you are
going to learn more about different tests where the test statistic has a chi-squared distribution.
There are many interesting and practical applications of the chi-squared distribution. Researchers
are also very keen to use a chi-squared test and we hope that you will now study research results
and see if the conditions for application of this distribution are satisfied. The purpose of an analysis
can be to determine if the sample is from a specified population or the interest can be to determine
if there is a relationship between two populations, e.g. between predicted values and actual values.
An example of the latter: suppose a telecommunications company, interested in customer care, is
uncertain about the continuation or not of a specific product. They decide to ask customers if they
would like the service to continue for the next year or not (this would be categorical or nominal data).
The recorded data (two categories of ’yes’ and ’no’) can be saved and the product continued for a
year. Then data (’yes’ or ’no’) can again be collected and a chi-square analysis can be made to see if
there is a relationship between what the people said and what they actually did. If the null hypothesis
is rejected, it indicates that there is a relationship between the two populations. In this scenario the
managers can then decide to use data where customers say what they are going to do, because the
data are reliable enough for their planning.
If you study the examples in the book and in the activities, see if you understand the following
comment: Samples should not be too large for applications of the chi-squared test, and in practice,
analysts carefully study the distribution of the items in the chi-square table and do not only rely on
the numerical value of the test.
Goodness-of-fit test
Make sure that you understand the hypothesis testing procedure and the sampling distribution of the
test statistic for the goodness-of-fit test.
Test statistic
How would you express the formula for the test statistic of the goodness-of-fit test in your own words?
\[ \chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} \]
Is that not easier to remember than the formula itself? It tells you exactly what to do. Can you explain
it to someone else?
If you are still not so sure, we illustrate with the words:
Square, $(\ldots)^2$, the difference, $(f_i - e_i)$, between the observed frequency $f_i$ and the expected frequency $e_i$,
and divide it by the expected frequency $e_i$ for each cell.
Add all these answers, $\sum_{i=1}^{k}$, and it gives you the formula for the test statistic of the
chi-squared goodness-of-fit test.
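The description in words translates almost line by line into code. The sketch below is illustrative only: the function name `chi_squared_statistic` is our own, and the data (60 rolls of a die that is fair under the null hypothesis) are hypothetical and do not come from Keller or from the activities.

```python
def chi_squared_statistic(observed, probabilities):
    """Return the expected frequencies and the goodness-of-fit test statistic."""
    n = sum(observed)
    expected = [n * p for p in probabilities]  # e_i = n * p_i
    # Square the difference, divide by the expected frequency, add over all cells
    statistic = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    return expected, statistic

observed = [8, 12, 9, 11, 6, 14]  # hypothetical counts for faces 1 to 6
probabilities = [1 / 6] * 6       # H0: every face is equally likely

expected, chi2 = chi_squared_statistic(observed, probabilities)
rule_of_five_ok = all(e >= 5 for e in expected)  # every expected frequency at least 5

print("expected frequencies:", expected)
print(f"chi-squared = {chi2:.2f}, expected frequencies all at least 5: {rule_of_five_ok}")
```

The statistic is then compared with the upper-tail critical value of the chi-squared distribution with k − 1 degrees of freedom, read from the table.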
Activity 3.1
Question 1
Employee absenteeism has become a serious problem which cannot be ignored. The personnel
department at a university decided to record the weekdays during which lecturers in the Faculty of
Humanities in a sample of 300 called in sick over the past several months. Determine if the given
data suggests that absenteeism is higher on some days of the week than on others.
From existing medical evidence the following information is specified in the null hypothesis for the
consecutive days of the week:
Monday p1 = 0.3, Tuesday p2 = 0.1, Wednesday p3 = 0.2, Thursday p4 = 0.2, Friday p5 = 0.2
Day of the week    Monday   Tuesday   Wednesday   Thursday   Friday
Number absent      84       24        56          64         72
Question 2
In a goodness-of-fit test, suppose that a sample showed that the observed frequency fi and expected
frequency ei were equal for each cell i. Then, the null hypothesis is
Question 3
The critical value in a goodness-of-fit test with 6 degrees of freedom, considered at the 5%
significance level, is
1. equal to 18.5476
2. equal to 12.6
3. equal to 0.872085
4. always greater than the test statistic
5. always less than the test statistic
Question 4
A chi-squared goodness-of-fit test is always conducted as
1. a lower-tail test
2. an upper-tail test
3. a two-tailed test
4. a measure of the size of the cells
5. any of the above
Question 5
Five statements are given below. Only one of them is a true statement. Which option is true?
1. For a chi-squared distributed random variable with 10 degrees of freedom and a level of
significance of 0.025, the chi-squared table value is 20.4831. The computed value of the test
statistic is 16.857. This will lead us to reject the null hypothesis.
2. Whenever the expected frequency of a cell is less than 5, one remedy for this condition is to
decrease the size of the sample.
3. For a chi-squared distributed random variable with 12 degrees of freedom and a level of
significance of 0.05, the chi-squared value from the table is 21.0. The computed value of the
test statistics is 25.1687. This will lead us to reject the null hypothesis.
4. The chi-squared goodness-of-fit test can be used for any type of data.
5. In a multinomial experiment the probability pi that the outcome will fall into cell i can change from
one trial to the next.
STUDY
Keller Chapter 15 Chi-squared tests
15.2 Chi-Squared Test of a Contingency Table
◦ Test statistic
◦ Rejection region and p-value
◦ Rule of five
You need to realize that there are many similarities between the two χ² tests in this chapter, and that
there are also definite differences.
In statistics, contingency tables are used to record and analyse the relationship between two or
more variables, most usually categorical variables. Suppose that we have two variables, sex (male
or female) and handedness (right- or left-handed). We observe the values of both variables in a
random sample of 100 people. Then a contingency table can be used to express the relationship
between these two variables, as follows:
Right-handed Left-handed TOTAL
Male 43 9 52
Female 44 4 48
TOTAL 87 13 100
The figures in the right-hand column and the bottom row are called marginal totals and the figure
in the bottom right-hand corner is the grand total. The table allows us to deduce at a glance that
the proportion of men who are right-handed is about the same as the proportion of women who are
right-handed. However the two proportions are not identical and the statistical significance of the
difference between them can be tested statistically using one of a number of available methods. In
our case we will use a nonparametric method called a Pearson’s chi-square test. In this case the
entries provided in the table must represent a random sample from the population contemplated in
the null hypothesis. If the proportions of individuals in the different columns vary between rows (and,
therefore, vice versa) we say that the table shows contingency between the two variables. If there is
no contingency, we say that the two variables are independent.
If we make a table of comparisons it might help you to remember the different principles involved and
the calculation methods.
                 Goodness-of-fit test                        Test of a contingency table
Applies to:      Only applicable for nominal data produced   Only applicable for nominal data
                 by a multinomial experiment.                arranged in a contingency table.
Test statistic:  χ² = Σ (fi − ei)² / ei, i = 1, ..., k       χ² = Σ (fi − ei)² / ei, i = 1, ..., k
H0:              H0 lists values for the probabilities pi.   H0 states the two variables are independent.
The manual calculation of the χ² values for the contingency table is rather cumbersome, but not that
complex!
Make sure that you understand the process of
· calculating the expected frequencies for each cell - multiply total of row and total of column and
divide by the grand total
· writing the given (observed) frequencies and calculated (expected) frequencies next to each other
for each cell in a new contingency table
· calculation of the test statistic, which involves only this last contingency table for each cell: subtract
the two frequencies, square the answer, then divide by the calculated (expected) frequency
If you calculate these values with Excel or Minitab it is of course not so complex, but remember that,
at this first-year level, you have to know the "how" of the process itself and not only the interpretation
of the χ² and p-values.
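As an illustration of the three steps, here is a minimal Python sketch applied to the sex and handedness table given earlier in this section. The variable names are our own; the calculation follows the row total times column total divided by grand total rule.

```python
# Observed frequencies: rows = Male, Female; columns = Right-handed, Left-handed.
observed = [[43, 9],
            [44, 4]]

row_totals = [sum(row) for row in observed]        # marginal totals 52, 48
col_totals = [sum(col) for col in zip(*observed)]  # marginal totals 87, 13
grand_total = sum(row_totals)                      # 100

# Expected frequency of a cell = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Test statistic: (observed - expected)^2 / expected, summed over all cells.
chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)
degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)

print(expected)
print(round(chi2, 4), degrees_of_freedom)
```

All four expected frequencies exceed 5, so the rule of five is satisfied. With 1 degree of freedom the table value at the 5% level is 3.84, so a statistic of about 1.78 would not lead to rejection of the hypothesis of independence.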
STUDY
Keller Chapter 15 Chi-squared tests
15.3 Summary of tests on nominal data
This section emphasises the contexts in which the various chi–square tests apply. Study the entire
section in the prescribed book, and especially understand Table 15.1 of the prescribed book.
Activity 3.2
Question 1
Do question 15.22 in Keller.
Question 2
The number of degrees of freedom for a contingency table with 5 rows and 7 columns is
1. 35
2. 12
3. 10
4. 24
5. 30
Question 3
In a chi-squared test of a contingency table, the test statistic value was χ² = 12.678, and the critical
value at α = 0.025 was 14.4494. Thus,
Question 4
Which of the following statements is/are false?
1. A chi-squared test for independence is applied to a contingency table with 3 rows and 4 columns
for two qualitative variables. The degrees of freedom for this test must be 12.
2. A chi-squared test for independence with 10 degrees of freedom results in a test statistic of 17.894.
Using the chi-squared table, the most accurate statement that can be made about the p-value for
this test is that 0.05 < p-value < 0.10.
3. In a chi-squared test of independence, the value of the test statistic was 15.652, and the critical
value at α = 0.025 was 11.1433. Thus, we must reject the null hypothesis at α = 0.025.
4. A chi-squared test for independence with 6 degrees of freedom results in a test statistic of 13.25.
Using the chi-squared table, the most accurate statement that can be made about the p-value for
this test is that p-value is greater than 0.025 but smaller than 0.05.
5. The chi-squared test of a contingency table is used to determine if there is enough evidence to
infer that two nominal variables are related, and to infer that differences exist among two or more
populations of nominal variables.
Activity 3.3
Question 1
A statistics professor posted the following grade distribution guidelines for his elementary statistics
class:
8% A, 35% B, 40% C, 12% D, and 5% F.
A sample of 100 elementary statistics grades at the end of last semester showed
12 A’s, 30 B’s, 35 C’s, 15 D’s, and 8 F’s.
Suppose that you test at the 5% significance level to determine whether the actual grades deviate
significantly from the posted grade distribution guidelines. Compare your calculations with the step
by step calculations given below. Indicate in which step the first error was made.
Question 2
Which of the following tests is appropriate for nominal data if the problem objective is to compare two
or more populations and the number of categories is at least 2?
Feedback
Activity 3.1
Question 1
H0 : p1 = 0.3, p2 = 0.1, p3 = 0.2, p4 = 0.2, p5 = 0.2
H1 : At least one pi is not equal to its specified value.
Cell i   fi    ei               (fi − ei)   (fi − ei)²/ei
1        84    300(0.3) = 90    −6          0.40
2        24    300(0.1) = 30    −6          1.20
3        56    300(0.2) = 60    −4          0.27
4        64    300(0.2) = 60    4           0.27
5        72    300(0.2) = 60    12          2.40
Total    300   300                          χ² = 4.54
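The table can be checked, and the test completed, with the short Python sketch below. The question does not state a significance level, so the 5% level is an assumption here; the critical value 9.49 is the table value for 5 − 1 = 4 degrees of freedom at that level. The unrounded statistic is about 4.53; the 4.54 in the table comes from adding the rounded cell values.

```python
observed = [84, 24, 56, 64, 72]           # Monday to Friday
probabilities = [0.3, 0.1, 0.2, 0.2, 0.2]  # values specified in H0

n = sum(observed)                          # sample size, 300
expected = [n * p for p in probabilities]

chi2 = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1

critical_value = 9.49  # assumed: table value for 4 df at the 5% level
reject_h0 = chi2 > critical_value

print(round(chi2, 2), degrees_of_freedom, reject_h0)
```

Under these assumptions the statistic lies below the critical value, so the null hypothesis is not rejected: the data do not show that absenteeism deviates from the specified pattern.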
Question 2
Answer: 4
The chi-squared goodness-of-fit test involves the difference between the expected and observed
frequencies. In this question there is never a difference between the two, with the result that the null
hypothesis will never be rejected.
Question 3
Answer: 2
From the χ² table in Keller, find the cell where the column under χ²(0.050) in the first row meets the row
with 6 in the first column. The value written there is 12.6.
Question 4
Answer: 2
If you are not sure, look at the little picture at the top of the page listing the χ² table (Keller) and
you will see that the shaded area lies on the right-hand side.
Question 5
Answer: 3
1. False. The table value is correct, but the value 16.857 does not fall in the critical region and
therefore the null hypothesis will not be rejected.
2. False. The remedy is to combine cells should any expected value in a cell be less than 5.
3. True. The test statistic 25.1687 is greater than the table value 21.0 and the null hypothesis would be rejected.
4. False. Only nominal data may be used in applications of the test.
5. False. These probabilities have to remain constant for each trial of a multinomial experiment.
Activity 3.2
Question 1
It is sometimes convenient to distinguish between employees doing more physical work ("blue collar"
workers) and those who are doing desk work ("white collar" workers). In this problem they wanted to
find out if the job description of an employee has an influence on their choice of opinion.
130 50 20 200
χ² = 2.2658
Degrees of freedom: (3 − 1) (2 − 1) = 2
From the χ² table, for 2 degrees of freedom and significance level α = 0.050 the χ² value is 5.99.
This is more than the value of the test statistic and therefore the null hypothesis cannot be rejected.
There is not enough evidence that the response to the proposed revision plan depends on the group
(according to job description in the company) of the employee.
Question 2
Answer: 4
(5 − 1)(7 − 1) = 4 × 6 = 24
Question 3
Answer: 2
The number of degrees of freedom was 6, as can be seen from the χ² table if you find the cell under
χ²(0.025) with the value 14.4 written in it. Furthermore, because 14.4 is larger than the calculated 12.678,
the null hypothesis cannot be rejected. For option 5, if you look at the table and you decrease the
significance level to 0.010, the critical value χ²(0.010) is 16.8 and the null hypothesis would still not be
rejected because 12.678 < 16.8.
Question 4
Answer: 1
Option 1 is false in the number of degrees of freedom. It is not 3 × 4 = 12 but (3 − 1)(4 − 1) = 2 × 3 = 6.
Option 2 is true. The p-values can only be determined accurately with computer software.
However, we can have some indication from the χ² table. 17.894 lies between the table values 16.0
and 18.3, which correspond respectively with significance levels of 0.100 and 0.050. Therefore the
comment about the range of the p-value is true.
Option 3 is true because the test statistic’s value 15.652 is more than the table value 11.1433, which
places it in the rejection region at level α = 0.025.
Option 4 is true for the same reasons as option 2 is true.
Option 5 is true.
Activity 3.3
Question 1
Answer: 4
H0 : p1 = 0.08, p2 = 0.35, p3 = 0.40, p4 = 0.12, p5 = 0.05.
H1 : At least one pi is not equal to its specified value.
Cell i   fi    ei                (fi − ei)   (fi − ei)²/ei
1        12    100(0.08) = 8     4           2.0
2        30    100(0.35) = 35    −5          0.714
3        35    100(0.40) = 40    −5          0.625
4        15    100(0.12) = 12    3           0.75
5        8     100(0.05) = 5     3           1.80
Total    100   100                           χ² = 5.889
With 5 − 1 = 4 degrees of freedom the critical value at the 5% significance level is 9.49. The test
statistic does not fall in the rejection region, therefore the null hypothesis cannot be rejected.
The error lies in the interpretation of the calculated value. The last comment is correct as the null
hypothesis is not rejected (as should have been the case).
Question 2
Answer: 3
STUDY UNIT 4
4.1 Simple linear regression and correlation
In this study unit the discussion is about the relationship between interval variables. In regression
analysis involving two variables, one of the variables is used to make predictions about the other
variable. Recall that interval data are real numbers, such as heights, weights, incomes and distance,
as was said in chapter 2 (or STA1501), where you were told that interval data can also be referred to
as quantitative or numerical data. In this unit the so-called probabilistic model for regression analysis
is described, with initial interest in the first-order linear model (also called the simple linear regression
model). In this model an error variable is introduced. Finding the equation of the regression line is
the first step, but this has to be followed by an assessment of the fit of the line to the data as well as
looking into the relationship between the dependent and independent variables. The importance of
the error variable and the conditions that apply to it form the basis of many of the discussions that
follow.
Please read through the discussion on section 16.4 to get the feeling of what regression analysis
entails. You will not be examined on all the sections of chapter 16. The topics covered in these
sections are very important and should you continue with statistics, you will surely learn about them
in a second-level module.
STUDY
Keller Chapter 16 Simple linear regression and correlation
16.1 Model
16.2 Estimating the coefficients
◦ Interesting facts about the coefficients b0 and b1
The model
The graph of a straight line and its equation is introduced in school mathematics in grade 8/9. This
means that, even if you did not choose mathematics as a grade 12 subject, the equation of a standard
form of a straight line should not be new to you!!! The notation in school and that given in Chapter
16 may be different, but the meaning of the variables and the constants is the same. The only new
concept in Keller’s line equation is that extra term epsilon (ε), but that we will explain to you shortly.
              Dependent       Slope times                Constant
              variable        independent variable       (number)
School:       y          =    m · x                 +    c
Keller:       y          =    β1 · x                +    β0    + ε
Do not allow the abstract form of these equations to mislead you. If we give the general form of the
equation as y = β0 + β1x + ε, it is in symbolic terms. For the equation of each particular straight line
the β0 and the β1 will not be there, but there will be numbers in their places, e.g. y = 2 + 3x + ε. The
x and y will, however, always be there. They are the variables and in particular x is the independent
and y the dependent variable. The two go together as a pair (x, y).
· The number with the x (in the two equations above it is m and β1) indicates the slope of the line.
· The number without an x (but not the ε ) is often referred to as a constant and it indicates the
value on the vertical axis (or y-axis) where the line passes through it.
· The new symbol in Keller’s line equation is the epsilon ε, written there to accommodate the
possible error in the model, making it a probabilistic model instead of a deterministic model. An
easy way to remember that ε is the symbol for error is to think in terms of the first letter of the
word "e"rror.
· It is customary in regression to write the terms in the particular order of β0 + β1x, which is the
reverse of the school form mx + c, but you will get used to that as well.
· For every xi-value in the data set, there is a linked yi-value in the pair (xi, yi). The line that must
pass through these data points does not go through all of the (xi, yi) points. It may even be that
the ’best’ line does not pass through any of these observed points! How do we find this ’best’ line
and its equation?
· The name of this ’best’ line is the least squares line for a specific reason. Think abstractly and
imagine that you have (in some way) determined the equation of a line passing through the sample
data. Take each (xi, yi) pair, substitute its xi-value into the least squares line equation and
find a calculated y-value for it. To distinguish between the observed yi and the calculated (or
estimated) value, this last one is given a hat and it becomes ŷi. You have then, for each xi-value, two
y-values: the one is the observed yi and the other is the estimated ŷi.
· The correct least squares line must be determined such that for each observed pair (xi, yi) and
its calculated pair (xi, ŷi) the difference between yi and ŷi, namely (yi − ŷi), is squared to give
(yi − ŷi)², and the sum of all these squared differences,
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$
must be as small as possible!
Do you think that is an easy task? We do not! Mathematics has to be used to calculate the
equation of this least squares line.
· Many statisticians talk about ŷ = b0 + b1 x as the least squares regression line. You see, the least
squares criterion is applied in the calculation of what we call the regression line. Keller uses least
squares line or regression line. From now on we will call it the regression line.
· Once the equation of the regression line ŷ = b0 + b1x is known, the slope b1 and y-intercept b0 are
used to estimate the values of the population parameters β1 and β0 in the first-order linear model
equation y = β0 + β1x + ε.
Keller’s example 16.1 states the aim, then illustrates how a data set consisting of pairs of interval
variables is used to find the equation of the regression line. Figures 16.1 and 16.2 show the data
points, the calculated regression line passing through them and then the little vertical lines, called
the residuals. Make sure that you understand that the equation ŷ = 0.934 + 2.114x was calculated
using the data set.
As an example, look at the residual y4 − ŷ4. The value of y4 is 5 (from the data set, the y with
x = 4). The value of ŷ4 we have to calculate from ŷ4 = 0.934 + 2.114(4) = 9.39. Therefore
y4 − ŷ4 = 5 − 9.39 = −4.39. Although the particular value of 9.39 was not indicated in Figure 16.2, it is
there on the line and possible to calculate. The reason why the residuals are squared (removing the
possible negativity of a residual) is because our interest lies in the distance between the data point
and the calculated y -value and not whether it lies above or below the line.
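The least squares calculation can be sketched in Python as follows. The six (x, y) pairs are an assumption: they are taken to be the years-of-service and bonus data of Keller's example 16.1, which this guide does not list. They are consistent with the pair x = 4, y = 5 mentioned above and they reproduce the quoted line up to rounding.

```python
# Assumed data of the example: x = years of service, y = annual bonus.
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance and sample variance of x (shortcut-free definitions).
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

b1 = s_xy / s_x2          # slope
b0 = y_bar - b1 * x_bar   # y-intercept

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
sse = sum(r ** 2 for r in residuals)  # the quantity that least squares minimises

print(round(b0, 3), round(b1, 3))
print(round(residuals[3], 2))  # residual for the pair with x = 4
```

Any other straight line through these points would give a larger sum of squared residuals than the value stored in `sse`.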
The slope b1
This number with the x indicates the slope of the line. Remember the characteristic that it occurs with
the x and is independent of the position where it is written in the equation: ŷ = 0.934 + 2.114x and
ŷ = 2.114x + 0.934 is the same line and for both the slope is 2.114.
· The value of b1 can be either positive or negative – nothing wrong with that!
· If b1 is positive, the values of the two variables increase together and the line slopes upward
from left to right. If b1 = 2.114, it implies that for each additional year of service, the annual
bonus increases, on average, by 2.114 units. Some books say there is a direct relationship between
the variables if b1 is positive.
· If the value of b1 is negative, the one variable increases when the other decreases and the line
slopes downward from left to right.
The y-intercept b0
The number b0 indicates where the line passes through the y-axis, which is the value of y when x = 0.
In our example it should therefore indicate the amount of the bonus when a person starts working.
Does that make sense? Not really, because it is a ’service bonus’, which implies that it is only paid
out after a term of service! Maybe it would have been less misleading if the author did not draw the
intercept of the line on the y-axis, but ’started’ to draw the line from above the value of x = 1! You
must be careful in the interpretation of the y -intercept – it depends on the nature of the variables.
Keller also comments on this topic with reference to the example about the relationship between the
odometer reading and the selling price of a vehicle. (Have you noticed the error on p 625, where the
sentence reads "The slope coefficient b0 is −0.0669, .."? The slope coefficient is b1 .)
We hope you note that calculating these coefficients in the regression line involves a large amount
of arithmetic. Remember about the shortcut formulae and do not hesitate to use your scientific
calculator – that is if you do not have a computer handy! Remember once again about us testing
insight rather than your calculation skills in the examination.
Activity 4.1
Question 1
The regression line ŷ = 3 + 2x has been fitted to the data points (4, 8), (2, 5), and (1, 2). The sum of
the squared residuals will be
1. 7
2. 15
3. 8
4. 22
5. 7.5
Question 2
If an estimated regression line has a y -intercept of 10 and a slope of 4, then when x = 2 the actual
value of y is
1. 15
2. 24
3. 18
4. 14
5. unknown
Question 3
Given the least squares regression line ŷ = 5 − 2x, choose the correct statement:
Question 4
A regression analysis between weight y (in kilogram) and height x (in centimetre) resulted in the
following least squares line: ŷ = 70 + 2x. This implies that if the height is increased by 1 centimetre,
the weight, on average, is expected to
1. increase by 1 kilogram
2. decrease by 2 kilogram
3. increase by 2 kilogram
4. decrease by an unknown amount
5. increase with an unknown amount.
_________________________________________________________________
STUDY
Keller Chapter 16 Simple linear regression and correlation
16.3 Error variable: required conditions
The residuals are considered as observations of the error variable. There are special requirements
for this error variable in order that the regression equation may be used for estimation or predictions.
These are explicitly given in Keller, but in short they stipulate that the error variable must be normally
distributed, with mean zero, constant variance and independence of all errors. You need only read
the paragraph in which observational and experimental data are compared.
Activity 4.2
Question 1
In regression analysis, the residuals represent the
Question 2
In a simple linear regression problem, the following statistics are calculated from a sample of 10
observations: Σ(x − x̄)(y − ȳ) = 2250, sx = 10, Σx = 50, Σy = 75. The least squares estimates
__________________________________________________________________________
STUDY
Keller Chapter 16 Simple linear regression and correlation
16.4 Assessing the model
Regression analysis looks at the relationship between two variables; usually to determine how the
independent variable relates to the dependent variable. It can also be applied simply to determine
whether two variables are related. An inferential method is used to go beyond the presentation of a
linear regression equation (based on sample data) to the estimation of the coefficients of the linear
regression model that fits the population.
It is logical that a relationship between two variables need not be linear. What about a quadratic
relationship? Then the graph representing the relationship is a parabola and not a straight line. The
statistician should determine the strength of the linear relationship before accepting it as correct.
This implies that the sum of the squares for error must be determined and used to determine the
standard error of estimate, the t-test of the slope and the coefficient of determination.
We would really like you to read the paragraphs "Developing an Understanding of Statistical
Concepts" and "Cause-and-Effect Relationship". The discussion is very informal and something
to note for future reference.
In STA1501 you learnt about the correlation coefficient for a sample or a population, which is a
numerical description (a value between −1 and +1) of the strength of the relationship between two
variables. Now a description is given of how it can also be used to test for a relationship between two
variables, as described in a short paragraph about the difference between the t-test of the population
correlation coefficient ρ and the t-test of the population slope β1.
The reason why you should read through all these sections is for future use. You might be confronted
with choices like these in your job situation and you will be surprised about the human brain and its
memory potential. Consider keeping your Keller prescribed book as it can be a very helpful reference
for basic practical statistics!
STUDY
Keller Chapter 16 Simple linear regression and correlation
16.5 Using the Regression Equation
Activity 4.3
A random sample of 11 statistics students produced the following data where x is the third test score,
out of 100, and y is the final exam score, out of 300. Can you predict the final exam score of a
random student if you know the third test score?
You can easily show by estimating the slope and intercept that the best fit line for the third exam/final
exam example has the equation: ŷ = −173.51 + 4.83x.
What would be the expected final scores for students who obtained third exam scores of (i) 68, (ii)
78 and (iii) 94?
Study
Keller Chapter 16 Simple linear regression and correlation
16.6 Regression diagnostics – I
In a diagnostic analysis the requirements for the error variable and the influence of very large or
small observations must be investigated. In 16.6 you need not be able to apply the different tests,
but you have to know about them and what they mean.
Please read the procedure of regression diagnostics. You must understand the consecutive steps,
but need not memorize them!
Activity 4.4
Question 1
Do question 16.1 in Keller.
Question 2
Which value of the coefficient of correlation r indicates a stronger correlation than 0.65?
1. 0.55
2. −0.75
3. 0.60
4. 0.05
5. −0.65
Question 3
In a regression problem the following pairs of (x, y) are given: (3, 1), (3, −1), (3, 0), (3, −2) and (3, 2).
That indicates that the
Feedback
Activity 4.1
Question 1
Answer: 4
Substitute the values x = 4, 2 and 1 into the equation and determine the corresponding values of ŷ.
Then determine the difference between the given y-values of 8, 5, and 2 and these calculated values
(these are the residuals). Finally square these answers and add them:
x = 4: ŷ = 3 + 2(4) = 11, residual 8 − 11 = −3
x = 2: ŷ = 3 + 2(2) = 7, residual 5 − 7 = −2
x = 1: ŷ = 3 + 2(1) = 5, residual 2 − 5 = −3
Sum of squared residuals: (−3)² + (−2)² + (−3)² = 9 + 4 + 9 = 22
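The same steps can be verified with a few lines of Python. This is only a sketch of the arithmetic for the three data points of the question; the function name is our own.

```python
points = [(4, 8), (2, 5), (1, 2)]  # the given (x, y) pairs

def predict(x):
    """Fitted value from the given regression line y-hat = 3 + 2x."""
    return 3 + 2 * x

# Residual = observed y minus calculated y-hat.
residuals = [y - predict(x) for x, y in points]
sum_squared_residuals = sum(r ** 2 for r in residuals)

print(residuals, sum_squared_residuals)
```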
Question 2
Answer: 5
We can say nothing about the actual value of y. The estimated regression line gives only the
estimated value ŷ = 10 + 4(2) = 18; the actual value of y at x = 2 also depends on the error variable
and is therefore unknown.
Question 3
Answer: 2
In the least squares regression line ŷ = 5 − 2x the value of the slope is −2, which is negative;
therefore the relationship is negative (if the one increases, the other will decrease).
Question 4
Answer: 3
The relationship can be expressed based on the slope. From the equation ŷ = 70 + 2x we know the
slope of the line is 2, which implies that the ratio rise/run is 2/1. For each move forward (x, height) the
movement up (y, weight) will be double that.
Activity 4.2
Question 1
Answer: 1
Question 2
Answer: 4
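$$s_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1} = \frac{2250}{10 - 1} = 250 \qquad s_x^2 = 10^2 = 100$$
$$b_1 = \frac{s_{xy}}{s_x^2} = \frac{250}{100} = 2.5$$
$$b_0 = \bar{y} - b_1 \bar{x} = \frac{75}{10} - 2.5 \cdot \frac{50}{10} = 7.5 - 12.5 = -5$$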
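Activity 4.3
We are given the equation ŷ = −173.51 + 4.83x for this estimation. Thus, for those who obtained third
exam scores of (i) 68, (ii) 78 and (iii) 94 we would expect the final exam scores of:
(i) ŷ = −173.51 + 4.83(68) = 154.93
(ii) ŷ = −173.51 + 4.83(78) = 203.23
(iii) ŷ = −173.51 + 4.83(94) = 280.51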
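Substituting the three scores into the regression equation ŷ = −173.51 + 4.83x of the activity can be sketched in Python as follows (the function name is our own):

```python
def predict_final_score(third_test_score):
    """Expected final exam score (out of 300) from the fitted line."""
    return -173.51 + 4.83 * third_test_score

# Third test scores (i) 68, (ii) 78 and (iii) 94.
predictions = {score: round(predict_final_score(score), 2) for score in (68, 78, 94)}

for score, expected_final in predictions.items():
    print(score, expected_final)
```

Remember that these are expected (average) final scores; an individual student's actual score will differ because of the error variable.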
Activity 4.4
Question 1
We advise you to read that paragraph of statistical history!
Compare the given equation of the regression line with the standard form of the regression line:
ŷ = b0 + b1 x
Son’s height = 33.73 + 0.516 · Father’s height
This implies that the dependent variable y represents the son’s height and the independent variable
x represents the father’s height, both measured in inches. We assume that
both father and son are measured when they are fully grown.
Does anything in this equation bother you? You should be worried about these very TALL people!
Can they all be 33.73 metres plus about half of the father’s height? Of course not! The prescribed book
as well as the scenario described in this question is from America and the Americans still measure
height in the imperial system of inches, feet or yards. The older people in South Africa know these
non-metric measures and will be able to tell you that an inch is a little more than 2.5 cm, a foot a
little more than 30 cm and a yard a little less than a metre (ask a granddad or grandmother if they
know about these measurements). How many metres will 33.73 inches be?
(a) The intercept b0 = 33.73 is where the regression line and the y-axis intersect and at that point
x = 0. As argued earlier, be careful in the interpretation of this. It does not mean that when the
father’s height is 0 (not born yet ??) the son’s height is 33.73 inches. You can see that makes no
sense - it is meaningless!
The slope coefficient b1 = 0.516 implies that for each additional inch of the father’s height the
son’s height increases on average by 0.516 inches.
(b) The line predicts a son exactly as tall as his father where x = 33.73 + 0.516x, that is at about
x = 69.7 inches. Take this as the cut-off value: ’tall’ fathers are taller than 69.7 inches and ’short’
fathers are shorter than 69.7 inches. Therefore, if the father is tall, the son would on average be
shorter than his father.
(c) If the father is short, then on average the son will be taller than his father.
Question 2
Answer: 2
Remember that we said that the closer the value of r is to either +1 or −1, the stronger the
relationship between the variables. The fact that we compare positive and negative values is
irrelevant if the only issue is the strength of the relationship. A value of r close to zero indicates
a very weak relationship. (See 4.4 in Keller "Measures of linear relationship" under the heading
Coefficient of Correlation.)
Question 3
Answer: 3
STUDY UNIT 5
5.1 Non parametric statistics
Non-parametric (or distribution-free) inferential statistical methods are mathematical procedures for
statistical hypothesis testing which, unlike parametric statistics, make no assumptions about the
probability distributions of the variables being assessed.
STUDY
Keller Chapter 19 Nonparametric statistics
19.1 Wilcoxon rank sum test
◦ procedure and test statistic
◦ understanding the required conditions
There are different nonparametric methods that can be used, but not at random. For each test there
are specified conditions about the nature of the data that must be satisfied. You are given a summary
of the different tests, their conditions and their parametric counterparts at the end of this study unit.
We do not expect you at first-year level to know all these tests, so we only discuss three of them: the
rank sum test, the sign test and the signed rank sum test.
The rank sum test for two independent samples of either ordinal or interval data is the nonparametric
counterpart of the two-sample pooled t-test. If there is doubt about the interval scale of data, the
normality of the sampled populations or equality of the variances, this rank sum test should be used.
The sizes of the two samples can be small and need not necessarily be equal. Furthermore, with
both sample sizes ≥ 10, there is a normal approximation of the Wilcoxon rank sum test which can be
used.
This test determines the differences between the placement (location) of two independent
populations, using the median as measure of location, and therefore it is preceded by a ranking
process for the data. The name of the test says it all, don’t you think so? You rank and then you sum
the data! Once that is done, you have also calculated the test statistic - as easy as that!
Given numbers 8 5 0 2 5 0 4 5
Ranked numbers 0 0 2 4 5 5 5 8
Rank allocations 1.5 1.5 3 4 6 6 6 8
Instead of allocating rank 1 to the smallest value (0), and rank 2 to the other smallest value (0), both
are given the rank 1.5. Two identical values cannot have different ranks. The average rank of 1.5 is
halfway between rank 1 and rank 2. With similar reasoning the three 5’s must have the same rank.
Instead of placing one 5 in rank 5, another 5 in rank 6 and the third 5 in rank 7, they are all given the
average rank of 6. Note that you have to "skip" rank 5 and rank 7 because they have already been
"used".
· Re-group the data and their ranks into the original samples and sum the ranks for the data in each
sample.
· The sample with the smallest total is then named "sample 1". Further calculations and
interpretations are based on "total sample 1", which is the observed value of the test statistic.
· Make sure that you can formulate the hypotheses and use Table 9 containing the critical values
for this rank sum test according to the formulation of the alternative hypothesis. Specification can
simply be that the locations are different which implies a two-tailed test, while specification of the
relative position of the two populations implies a one-tailed test.
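The tie-handling rule used in the ranking step can be sketched in a few lines of Python. This is only an illustration: the function name average_ranks is a hypothetical helper, and the data are the eight numbers from the ranking example above.

```python
def average_ranks(values):
    """Return the rank of each value; tied values share their average rank."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(ordered):
        first = ordered.index(v) + 1          # rank position of the first tied value
        last = first + ordered.count(v) - 1   # rank position of the last tied value
        rank_of[v] = (first + last) / 2       # average of the positions used up
    return [rank_of[v] for v in values]


given = [8, 5, 0, 2, 5, 0, 4, 5]
ranks = average_ranks(given)
print(ranks)  # [8.0, 6.0, 1.5, 3.0, 6.0, 1.5, 4.0, 6.0]
```

The ranks always add up to 1 + 2 + ... + n (here 36), which is a quick check that no rank was lost or counted twice.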
The formula given for sample sizes larger than 10 is a normal approximation. It is calculated
without the tables (because they do not list values larger than 10!!) and uses only the sizes of the two
independent samples and the test statistic.
Activity 5.1
Question 1
Consider the following data set: 14, 14, 15, 16, 18, 19, 19, 20, 21, 22, 23, 25, 25, 25, 25, and 28.
The rank assigned to the four observations of value 25 is
1. 12
2. 12.5
3. 13
4. 13.5
5. 14
Question 2
The Wilcoxon rank sum test statistic T is approximately normally distributed whenever the sample
sizes are
1. larger than 10
2. smaller than 10
3. between 5 and 15
4. larger than 20 but smaller than 30
5. smaller than 20
Question 3
A Wilcoxon rank sum test for comparing two populations involves two independent samples of sizes
5 and 7. The alternative hypothesis is stated as: The location of population 1 is different from the
location of population 2. The appropriate critical values at the 5% significance level are
1. 20 and 45
2. 22 and 43
3. 33 and 58
4. 35 and 56
5. 12 and 32
Question 4
Consider the following two independent samples:
Sample A: 16 17 19 22 47
Sample B: 27 31 34 37 40
The value of the test statistic for a left-tail Wilcoxon rank sum test is
1. 6
2. 20
3. 35
4. 55
5. 121
Question 5
Two observers are placed on two different observation points (randomly chosen) for a specified
period of time. They have to observe the drivers of the cars passing by and count the number of
them driving by while talking on a cell phone. The data given below were recorded at Point A for 6 days
and at Point B for 7 days. At the 0.10 level, can we conclude that the median number of drivers talking on cell
phones differs between the two locations?
Point A 74 61 73 67 80 89
Point B 90 73 97 81 77 61 79
Question 6
A Wilcoxon rank sum test for comparing two populations involves two independent samples of sizes
15 and 20. The unstandardized test statistic (that is the rank sum) is T = 210. The value of the
standardized test statistic z is
1. 14.0
2. 10.5
3. 6.0
4. 0.7
5. −2.0
STUDY
Keller Chapter 19 Nonparametric statistics
19.2 Sign test and Wilcoxon signed rank sum test
◦ sign test
◦ Wilcoxon signed rank sum test
· Alternative hypothesis: the population locations are different (can be one- or two-sided).
Activity 5.2
Question 1
It is important to sponsors of television shows that viewers remember as much as possible about
the commercials. The advertising executive of a large company is trying to decide which of two
commercials to use on a weekly half-hour sit-com. To help make a decision she decides to have 12
individuals watch both commercials. After each viewing, each respondent is given a quiz consisting
of 10 questions. The number of favourable responses is recorded and listed below. Assume that the
quiz results are not normally distributed.
Quiz Scores
Respondent Commercial 1 Commercial 2
1 7 9
2 8 9
3 6 6
4 10 10
5 5 4
6 7 9
7 5 7
8 4 5
9 6 8
10 7 9
11 5 6
12 8 10
(a) Which test is appropriate for this situation?
(b) Do these data provide enough evidence at the 5% significance level to conclude that the two
commercials differ?
Question 2
In a normal approximation to the sign test, the standardized test statistic is calculated as z = -1.58.
To test the alternative hypothesis that the location of population 1 is to the left of the location of
population 2, the p-value of the test is
1. 0.1142
2. 0.2215
3. 0.0571
4. 0.2284
5. 0.4429
For the
- sign test the data is ordinal
- signed rank sum test the data is interval
For the signed rank sum test the rules are as follows:
Study the manual computations in example 19.4 and you will find these ’rules of the game’ given
above easy to remember.
Activity 5.3
Question 1
Do question 19.22 in Keller.
Question 2
Do question 19.23 in Keller.
Question 3
In a Wilcoxon signed rank sum test, the test statistic is calculated as T = 91. There are 18 observation
pairs of which 3 have zero differences and a two-tail test is performed at the 5% significance level.
Choose the correct option below:
Question 4
In a Wilcoxon signed rank sum test with n = 30, the rank sums of the positive and negative
differences are 198 and 165, respectively. The value of the standardized test statistic z is
1. 232.50
2. -0.7096
3. -2.8125
4. 48.6107
5. 0.6425
Feedback
Activity 5.1
Question 1
Answer: 4
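The data set is already sorted (we wanted to test something other than sorting), so only the tied
values need attention. The four observations of value 25 occupy positions 12, 13, 14 and 15, so each
receives the average rank (12 + 13 + 14 + 15)/4 = 13.5.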
14, 14, 15, 16, 18, 19, 19, 20, 21, 22, 23, 25, 25, 25, 25, and 28.
Data 14 14 15 16 18 19 19 20 21 22 23 25 25 25 25 28
Ranks 1.5 1.5 3 4 5 6.5 6.5 8 9 10 11 13.5 13.5 13.5 13.5 16
Question 2
Answer: 1
In the discussion about the sampling distribution of the Wilcoxon rank sum test statistic it is stated
that T is approximately normally distributed whenever the sample sizes are larger than 10.
Question 3
Answer: 1
n1 = 5 and n2 = 7. The values for n1 are listed in the first row and those for n2 in the first column.
The statement in the alternative hypothesis about the location of the populations being different does
not imply that the location of population 1 lies to the left or the right of population 2. It is a two-tailed
statement. The appropriate critical values at the 5% (two-tailed) significance level are 20 and 45.
Question 4
Answer: 2
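Ranked data 16 17 19 22 27 31 34 37 40 47
Ranks 1 2 3 4 5 6 7 8 9 10
Sample A holds ranks 1, 2, 3, 4 and 10, so its rank sum is 20. Sample B holds ranks 5 to 9, so its rank
sum is 35. Sample A has the smaller total, so the test statistic is T = 20.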
Question 5
(This is not a multiple choice question.)
Point A 74 61 73 67 80 89
Point B 90 73 97 81 77 61 79
Ranked data 61 61 67 73 73 74 77 79 80 81 89 90 97
Ranks 1.5 1.5 3 4.5 4.5 6 7 8 9 10 11 12 13
Point A has the smaller rank sum (35, against 56 for Point B), so the test statistic is equal to 35.
If we are only testing for a "difference" in the data from the two points, it is a two-sided test. From
Table 9(b) the limits for n1 = 6 and n2 = 7 are 30 and 54. The test statistic of 35 falls between
these limits, so the null hypothesis cannot be rejected at the 10% level. We conclude that the median
number of persons talking on their cell phones while driving could be the same at points A and B.
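The rank sums in this feedback can be checked with a short Python sketch. The helper rank_sums is a hypothetical name, and the critical values 30 and 54 are simply the Table 9(b) limits quoted above, not values computed by the code.

```python
def rank_sums(sample_a, sample_b):
    """Rank the pooled data (ties get average ranks) and sum the ranks per sample."""
    pooled = sorted(sample_a + sample_b)

    def rank(value):
        first = pooled.index(value) + 1
        last = first + pooled.count(value) - 1
        return (first + last) / 2

    return sum(rank(v) for v in sample_a), sum(rank(v) for v in sample_b)


point_a = [74, 61, 73, 67, 80, 89]
point_b = [90, 73, 97, 81, 77, 61, 79]
t_a, t_b = rank_sums(point_a, point_b)

# Limits quoted in the feedback for n1 = 6, n2 = 7 (two-tailed, 10% level)
t_lower, t_upper = 30, 54
reject_h0 = t_a <= t_lower or t_a >= t_upper

print(t_a, t_b, reject_h0)  # 35.0 56.0 False
```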
Question 6
Answer: 5
This answer is simply substitution into formulae.
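$$E(T) = \frac{n_1(n_1 + n_2 + 1)}{2} = \frac{15(15 + 20 + 1)}{2} = 270$$

$$\sigma_T = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} = \sqrt{\frac{15 \cdot 20\,(15 + 20 + 1)}{12}} = 30$$

$$z = \frac{T - E(T)}{\sigma_T} = \frac{210 - 270}{30} = -2.0$$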
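The substitution for Question 6 can also be verified numerically. The sketch below implements the normal approximation for the rank sum statistic as used in this answer; the function name rank_sum_z is an illustrative choice.

```python
import math


def rank_sum_z(t, n1, n2):
    """Standardized Wilcoxon rank sum statistic (normal approximation)."""
    expected = n1 * (n1 + n2 + 1) / 2                 # E(T)
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # standard deviation of T
    return (t - expected) / sigma


z = rank_sum_z(210, 15, 20)
print(z)  # -2.0
```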
Activity 5.2
Question 1
Quiz Scores
Respondent Commercial 1 Commercial 2 Difference
1 7 9 -2
2 8 9 -1
3 6 6 0
4 10 10 0
5 5 4 1
6 7 9 -2
7 5 7 -2
8 4 5 -1
9 6 8 -2
10 7 9 -2
11 5 6 -1
12 8 10 -2
(a) The appropriate test for this situation is the sign test.
(b) Do these data provide enough evidence at the 5% significance level to conclude that the two
commercials differ?
ANSWER:
Two respondents have a difference of zero and are not counted in the sample size. Therefore n = 10 and
x = 1 (only one positive difference).
Conclusion: Reject the null hypothesis. Yes, these data provide enough evidence at the 5%
significance level to conclude that the two commercials differ.
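As a cross-check of this conclusion, the sketch below counts the signs and computes two quantities: an exact two-sided binomial p-value (an alternative to the table-based approach, added here only for illustration) and the usual normal-approximation statistic for the sign test, z = (x − 0.5n)/(0.5√n). The variable names are assumptions made for this example.

```python
from math import comb

# Differences (Commercial 1 minus Commercial 2) from the table above
differences = [-2, -1, 0, 0, 1, -2, -2, -1, -2, -2, -1, -2]

nonzero = [d for d in differences if d != 0]   # zero differences are discarded
n = len(nonzero)                               # effective sample size
x = sum(1 for d in nonzero if d > 0)           # number of positive differences

# Exact binomial tails under H0: P(plus) = 0.5
lower_tail = sum(comb(n, k) for k in range(0, x + 1)) / 2 ** n
upper_tail = sum(comb(n, k) for k in range(x, n + 1)) / 2 ** n
p_two_sided = min(1.0, 2 * min(lower_tail, upper_tail))

# Normal approximation of the sign test statistic
z = (x - 0.5 * n) / (0.5 * n ** 0.5)

print(n, x, round(p_two_sided, 4), round(z, 2))  # 10 1 0.0215 -2.53
```

Both routes point the same way: the p-value is below 0.05 and |z| exceeds 1.96, which agrees with rejecting the null hypothesis.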
Question 2
The standardized test statistic is z = -1.58. Because the alternative hypothesis places population 1
to the left of population 2, the p-value is the left-tail probability:
p-value = P (z < −1.58) = P (z > 1.58) = 1 − 0.9429 = 0.0571
Answer: 3
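The table lookup can be reproduced without a printed normal table. The sketch below builds the standard normal cumulative probability from Python's math.erf; the function name normal_cdf is an illustrative choice.

```python
import math


def normal_cdf(z):
    """Cumulative probability P(Z < z) for a standard normal variable."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


p_left = normal_cdf(-1.58)   # left-tail p-value for z = -1.58
print(round(p_left, 4))
```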
Activity 5.3
Question 1
H0 : The two population locations are the same.
H1 : The location of population 1 is to the right of the location of population 2.
T = T+ = 3457, with n = 108

$$E(T) = \frac{n(n + 1)}{4} = \frac{108 \cdot 109}{4} = 2943$$

$$\sigma_T = \sqrt{\frac{n(n + 1)(2n + 1)}{24}} = \sqrt{106438.5} = 326.25$$

Rejection region:

$$z = \frac{T - E(T)}{\sigma_T} = \frac{3457 - 2943}{326.25} \approx 1.575$$
There is not enough evidence to conclude that population 1 is located to the right of the location of
population 2.
Question 2
H0 : The two population locations are the same.
H1 : The location of population 1 is different from the location of population 2.
Rejection region: T ≥ TU = 19 or T ≤ TL = 2.
Pair   Sample 1   Sample 2   Difference   |Difference|   Ranks
1      9          5           4           4              5.5
2      12         10          2           2              3.5
3      13         11          2           2              3.5
4      8          9          -1           1              1.5
5      7          3           4           4              5.5
6      10         9           1           1              1.5
Totals: T+ = 19.5 and T− = 1.5
T = 19.5. There is enough evidence to infer that the population locations differ.
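The manual ranking in this table can be verified with a small sketch of the signed rank sum procedure: take the differences, drop zeros, rank the absolute differences with average ranks for ties, and sum the ranks separately for positive and negative differences. The function name signed_rank_sums is hypothetical.

```python
def signed_rank_sums(sample_1, sample_2):
    """Return (T+, T-) for matched pairs, using average ranks for tied |differences|."""
    diffs = [a - b for a, b in zip(sample_1, sample_2) if a != b]  # zeros dropped
    abs_sorted = sorted(abs(d) for d in diffs)

    def rank(value):
        first = abs_sorted.index(value) + 1
        last = first + abs_sorted.count(value) - 1
        return (first + last) / 2

    t_plus = sum(rank(abs(d)) for d in diffs if d > 0)
    t_minus = sum(rank(abs(d)) for d in diffs if d < 0)
    return t_plus, t_minus


sample_1 = [9, 12, 13, 8, 7, 10]
sample_2 = [5, 10, 11, 9, 3, 9]
t_plus, t_minus = signed_rank_sums(sample_1, sample_2)
print(t_plus, t_minus)  # 19.5 1.5
```

With n non-zero differences the two totals must add up to n(n + 1)/2, here 21, which is a useful check on the hand calculation.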
Question 3
Answer: 4
There are 18 pairs of which 3 have zero differences, so n = 15. The value of the test statistic is
T = 91, which lies inside the ’safe’ (non-rejection) region of [25, 95] for a two-tailed test at the 5%
significance level. The null hypothesis is therefore not rejected.
Question 4
Answer: 2
T = T+ = 198

$$E(T) = \frac{n(n + 1)}{4} = \frac{30 \cdot 31}{4} = 232.5$$

$$\sigma_T = \sqrt{\frac{n(n + 1)(2n + 1)}{24}} = \sqrt{2363.75} = 48.6184$$

Rejection region:

$$z = \frac{T - E(T)}{\sigma_T} = \frac{198 - 232.5}{48.6184} = -0.7096$$
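The same substitution can be sketched in code. The function below applies the normal approximation for the signed rank sum statistic exactly as in the formulae of this answer; signed_rank_z is an illustrative name.

```python
import math


def signed_rank_z(t, n):
    """Standardized Wilcoxon signed rank sum statistic (normal approximation)."""
    expected = n * (n + 1) / 4                           # E(T)
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)    # standard deviation of T
    return (t - expected) / sigma


z = signed_rank_z(198, 30)
print(round(z, 4))  # -0.7096
```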
· flow chart of different techniques applied in inference as set out in Figure A13.1
· summary of the different statistical techniques for nominal data given in Table 15.1
Deciding which test to use is the task of the statistician in practice and it is our aim to supply you
with the tools to make such a decision. This is not always as straightforward as it may seem and
that is why Keller gave us the flow chart in Figure A13.1. The significance of data type is obvious,
but note how the study objective gives direction. Even now, while you are still studying, make a point
of looking at published statistical information and determine if it involves "lying with statistics" or not.
Two tables made from the information in the above-mentioned summaries are given below. Look at
them, but try to make your own. Making such a summary is a very valuable method of studying.
Basic principles of hypothesis testing apply
STUDY UNIT 6
6.1 Time series analysis and forecasting
In this study unit we will, as Keller puts it in clear language, "only scratch the surface of this topic".
Time series and forecasts based on time series are very relevant and significant in modern times. At
first year level, fortunately these concepts are simple and easy to explain. If you sit and think about
it, you can make a long list of events that you can observe at regular time intervals. If you drive to
work in a car or taxi or train, you can record the traffic every first day of the month, or every Friday,
or every day of the week, or...; if you have a favourite take-away food store you can record the length
of the queue at regular time intervals; an obvious example is to record the monthly rainfall at your
home. The list never ends, as government bodies, researchers, economists, etc. all record different
phenomena over short and long periods of time. Such observations, collected at regular time intervals,
are known as a time series.
The question is – what do we do with the time series? Do you record the data simply to look at it, is
it just for the sake of fun, or what? As statisticians we are going to teach you how to look at, interpret
and even ’smooth’ the time series data, but is that the end of the process? It would be a
sad day if everything stopped just there! The point is that a pattern observed in the past
could well be repeated in the future. Techniques have therefore been developed in which the data
of a time series and the characteristics of that particular phenomenon are used to predict what
can be expected in the future. Of course, statisticians are always very careful not to say that anything
is certain (think in terms of hypothesis testing!), so they use models in their predictions. We will only
look at three elementary models, but there are many other models, some of which are much more
complex.
STUDY
Keller Chapter 20 Time series analysis and forecasting
20.1 Time series components
Keller clearly explains the characteristics of long term trend, cyclic, seasonal and random variation as
well as graphs to illustrate the first three components. Random variation (sometimes called ’noise’)
can camouflage the effects of other components in a time series to a great extent and it is important
to minimize its effect. Why? Even if random events can significantly influence the time series data
(think of a war or a hurricane), they are irregular happenings and their influence should be temporary. It
is therefore necessary that short-term fluctuations be ’removed’ from the data using a technique of
smoothing.
Of course, one must make sure that it is really a random happening and we hope that analysts do
think! In a war-torn country or a region known for hurricane occurrences, such events cannot be
considered as irregular. They then form an inherent part of the time series ’pattern’.
STUDY
Keller Chapter 20 Time series analysis and forecasting
20.2 Smoothing techniques
◦ Moving averages
◦ Centred moving averages
◦ Exponential smoothing
The first technique of smoothing is to determine moving averages. Remember that the data
points in a time series are consecutive values, i.e. they are ordered. The idea of an average
is nothing new and in this case you substitute the actual observations of a time series with a
list of averages. You can compute a three-period moving average, which is the average of three
consecutive observations, or you can compute a four-period moving average, which is the average
of four consecutive observations, etc. Make sure that you understand how these three, or four, or
... moving averages are calculated. In a three-period moving average each observation (except the
first two and the last two values) is part of three averages.
Suppose we have real observations indicated as A, B, C, D, E, F and G, then the three, four and
five-period moving averages would be as follows:
Three-period: (A+B+C)/3, (B+C+D)/3, (C+D+E)/3, (D+E+F)/3, (E+F+G)/3
Four-period: (A+B+C+D)/4, (B+C+D+E)/4, (C+D+E+F)/4, (D+E+F+G)/4
Five-period: (A+B+C+D+E)/5, (B+C+D+E+F)/5, (C+D+E+F+G)/5
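A minimal Python sketch of these calculations is given here. The seven numbers stand in for the observations A to G and are purely hypothetical; the function names are also illustrative. Because a four-period average falls between two time periods, the sketch also shows the centred version, which averages each pair of neighbouring four-period averages.

```python
def moving_average(series, k):
    """k-period moving averages of consecutive observations."""
    return [sum(series[i:i + k]) / k for i in range(len(series) - k + 1)]


def centred_moving_average(series, k):
    """For an even k: average each pair of adjacent k-period moving averages."""
    ma = moving_average(series, k)
    return [(a + b) / 2 for a, b in zip(ma, ma[1:])]


y = [10, 12, 14, 16, 18, 20, 22]            # hypothetical values for A, B, ..., G

three = moving_average(y, 3)                 # [12.0, 14.0, 16.0, 18.0, 20.0]
four = moving_average(y, 4)                  # [13.0, 15.0, 17.0, 19.0]
five = moving_average(y, 5)                  # [14.0, 16.0, 18.0]
centred_four = centred_moving_average(y, 4)  # [14.0, 16.0, 18.0]
print(three, four, five, centred_four)
```

Note how the smoothed list gets shorter as the number of periods grows: seven observations give five three-period averages but only three five-period averages.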
Exponential smoothing
This method is mathematically more complex, but still a ’relatively crude method’ to remove random
variation. However, it removes two of the concerns mentioned above when the method of moving
averages is used for smoothing out random variation. These are the following:
· With every calculation all the observations up to that particular observation form part of the
calculation, in other words give weight to the answer.
· The smoothing process starts from the very first observation and continues up to the very last
observation.
The formula given may look a little complex, but with constant use it is manageable. Applying
the formula smooths the series by calculating a weighted average of each observation and
the previous smoothed value. The smoothing constant w is a number between 0 and
1, and since w is multiplied by the actual observation yt (at time t), you should understand that
the closer w is to 1, the more influence the actual observation yt will have. That is the sort of decision
the statistician has to make. Choosing the value of w will therefore depend on the importance of the
actual observations.
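The effect of w is easiest to see by trying a few values. The sketch below assumes the usual form of the exponential smoothing formula, S_t = w·y_t + (1 − w)·S_{t−1} with S_1 = y_1; check it against the formula in Keller before relying on it. The data values are hypothetical.

```python
def exponential_smoothing(series, w):
    """Exponentially smoothed series: S_1 = y_1, S_t = w*y_t + (1 - w)*S_(t-1)."""
    if not 0 <= w <= 1:
        raise ValueError("the smoothing constant w must lie between 0 and 1")
    smoothed = [series[0]]                       # assumed starting value S_1 = y_1
    for y_t in series[1:]:
        smoothed.append(w * y_t + (1 - w) * smoothed[-1])
    return smoothed


y = [10, 20, 30, 20]                             # hypothetical observations

half = exponential_smoothing(y, 0.5)             # [10, 15.0, 22.5, 21.25]
heavy = exponential_smoothing(y, 0.1)            # stays close to the starting value
none = exponential_smoothing(y, 1.0)             # reproduces the observations
print(half, heavy, none)
```

With w = 1 the "smoothed" series is just the original series, while a small w lets the earlier smoothed values dominate, so the series reacts slowly to new observations.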
Keep in mind that you will receive a list of formulas in the examination. You simply have to recognize
which formula to use where and to know the meaning of the different symbols.
Activity 6.1
Question 1
Test your knowledge.
Link each of the descriptions below to one of the four time series components (long-term trend,
cyclic, seasonal or random variation):
1. The time series component that reflects a long-term, relatively smooth pattern or direction
exhibited by a time series over a long time period (more than one year)
2. The time series component that reflects variability over short repetitive time periods and has
duration of less than one year
3. The time series component that reflects the irregular changes in a time series that are not
caused by any other component, and tends to hide the existence of the other more predictable
components
4. The time series component that reflects a wave-like pattern describing a long-term trend that is
generally apparent over a number of years
Question 2
In exponentially smoothed time series, the smoothing constant w is chosen on the basis of how much
smoothing is required. In general, which of the following statements is true?
1. A small value of w such as w = 0.1 results in very little smoothing, while a large value such as
w = 0.8 results in too much smoothing.
2. A small value of w such as w = 0.1 results in too much smoothing, while a large value such as
w = 0.8 results in very little smoothing.
3. A small value of w such as w = 0.1 and a large value such as w = 0.8 may both result in very little
smoothing.
4. A small value of w such as w = 0.1 and a large value such as w = 0.8 may both result in too much
smoothing.
5. It is impossible to have too much or too little smoothing, regardless of the value of w.
Question 3
Monthly sales (in R11,000) of a computer store are shown below.
STUDY
Keller Chapter 20 Time series analysis and forecasting
20.3 Trend and seasonal effects
◦ Trend analysis
◦ Seasonal analysis
◦ Deseasonalizing a time series
Once you can see that there is a trend in a time series, you have to determine what the ’nature’ of the
trend is. This we do using mathematics. Do you remember the following from school mathematics?
At this stage you should know enough about fitting a regression line through given data
and about the principles involved in such a method. In time series analysis, such a fitted line
can help you see whether there is a trend in the data.
The ŷ then becomes the trend line estimate of the y of the regression model y = β0 + β1 t + ε. The
slope of the line indicates the trend. If the slope is positive, you know the trend is positive, and the
larger the numerical value of the slope, the stronger the positive trend.
These arguments about a graph assisting us to find the trend in a time series also apply if the relationship is
nonlinear. Should a quadratic model be needed to fit the time series, the trend equation relies on the
multiple regression technique (not included in this module).
To detect seasonality in a time series, several ’seasons’ must be observed. Seasonal indexes can be
calculated and used to either inflate or deflate the trend in the series. Depending on the choice, an index will
either express the degree to which the seasons differ from one another or be used to remove
the seasonal variation. The purpose of removing the seasonality is that other changes in the series
can then be detected. This has many benefits, especially in forecasting.
Activity 6.2
Question 1
Do Question 20.24 in Keller.
Question 2
The Pyramid of Giza is one of the most visited monuments in Egypt. The number of visitors per
quarter has been recorded (in thousands) as shown in the accompanying table:
                   Year
Quarter   2000   2001   2002   2003
Winter    210    215    218    220
Spring    260    275    282    290
Summer    480    490    505    525
Autumn    250    255    265    270
(a) Plot the time series.
(b) Discuss your observations. Would exponential smoothing be recommended for this data?
STUDY
Keller Chapter 20 Time series analysis and forecasting
20.4 Introduction to forecasting
How accurate is my forecast? This is a question the statistician has to ask him/herself, as there is
a variety of forecasting models available. What can we do to evaluate the accuracy of a forecasting
procedure? We are going to consider the following two measures of accuracy:
· Mean Absolute Deviation (MAD). This is a measure of the consistency of moderately accurate
forecasts. The interest is in the size of the error, not the direction, and one chooses the model
with the lowest mean absolute error as the best-fit model.
· Sum of Squares for Forecast Error (SSE). This measure shows how close the forecasts are to the
actual values. This criterion chooses the model with the lowest sum of squared errors
(compare this to the least-squares criterion when you determine the regression equation).
Formulas for both MAD and SSE are given in Keller. There is also a worked-out example where three
forecasting models are subjected to these measures. These criteria are very useful if you fit more
than one model to the same time series.
STUDY
Keller Chapter 20 Time series analysis and forecasting
20.5 Forecasting models
◦ Forecasting with exponential smoothing
◦ Forecasting with seasonal indexes
The selected model for forecasting a time series is determined by the components present in the
recorded time series. The choice of model is therefore based on measures of accuracy and precision.
In general, the way a particular smoothing method works can give you an indication of how its
forecasts will behave. If you think about the method applied in exponential smoothing, you can imagine that for
a time series with a small positive trend, the forecast will tend to be too low and if there is a small negative
trend, the forecast will tend to be too high.
A proper analysis of the given data must underlie the choice and you have to realize that one should
not try to forecast too far in the future as the accuracy decreases with each additional time frame
added.
At first-year level we only introduce you to forecasting and expect you to understand three relatively
elementary forecasting models. The exponential smoothing and seasonal models will be easy for you to understand.
Should you feel uncertain about the autocorrelation model, it may be necessary for you to read the
section on Nonindependence of the Error Variable in Chapter 16 again.
Forecasting model: Exponential smoothing
   Conditions:  No trend; No seasonal variation
   Forecasting: Preferably used for one time period forecast but can be more
   Action:      Smoothing constant; Assume initial exponential smoothing forecast; Substitute St with Ft+1

Forecasting model: Seasonal indexes
   Conditions:  Long-term trend; Seasonal variation
   Forecasting: Preferably one season but can be more
   Action:      Regression equation is used as well as seasonal index for period t

Forecasting model: Autoregressive model
   Conditions:  Autocorrelation; No trend; No seasonality
   Forecasting: Can be complex if the time series values are themselves correlated
   Action:      Based on correlation of consecutive terms (first order autocorrelation)
Activity 6.3
Question 1
Do question 20.32 in the textbook.
Question 2
The following is the list of mean absolute deviation (MAD) statistics for each of the models you have
estimated from time-series data:
Model MAD
Linear trend 1.38
Quadratic trend 1.22
Exponential trend 1.39
Autoregressive 0.71
1. linear trend
2. quadratic trend
3. exponential trend
4. autoregressive
5. not possible to answer
Feedback
Activity 6.1
Question 1
1. long-term trend
2. seasonal variation
3. random variation
4. cyclical variation
Question 2
Answer: 2
Question 3
                  Moving averages
Month   Sales   Three-month   Five-month
Jan     73
Feb     65      70
March   72      73            75.6
April   82      80            79.0
May     86      86
June    90
Activity 6.2
Question 1
Year   Quarter   Period t   y    ŷ      y/ŷ
2001   1         1          52   62.9   0.827
       2         2          67   64.1   1.046
       3         3          85   65.2   1.303
       4         4          54   66.4   0.813
2002   1         5          57   67.6   0.843
       2         6          75   68.8   1.090
       3         7          90   70.0   1.286
       4         8          61   71.1   0.857
2003   1         9          60   72.3   0.830
       2         10         77   73.5   1.048
       3         11         94   74.7   1.259
       4         12         63   75.9   0.830
2004   1         13         66   77.0   0.857
       2         14         82   78.2   1.048
       3         15         98   79.4   1.234
       4         16         67   80.6   0.831
                         Quarter
Year             1       2       3       4       Total
2001             0.827   1.046   1.303   0.813
2002             0.843   1.090   1.286   0.857
2003             0.830   1.048   1.259   0.830
2004             0.857   1.048   1.234   0.831
Average          0.839   1.058   1.271   0.833   4.001
Seasonal index   0.839   1.058   1.270   0.833   4.000
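The tables for this question can be reproduced with a short sketch: fit the least-squares trend line to the 16 quarterly values, divide each observation by its trend value, average the ratios per quarter and rescale the averages so that they add up to 4. The variable names are illustrative, and small differences in the third decimal can arise because the tables above divide by rounded trend values.

```python
# Quarterly observations y for periods t = 1, ..., 16 (2001 Q1 to 2004 Q4)
y = [52, 67, 85, 54, 57, 75, 90, 61, 60, 77, 94, 63, 66, 82, 98, 67]
t = list(range(1, len(y) + 1))
n = len(y)
seasons = 4

# Least-squares trend line: y-hat = intercept + slope * t
t_bar = sum(t) / n
y_bar = sum(y) / n
slope = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
         / sum((ti - t_bar) ** 2 for ti in t))
intercept = y_bar - slope * t_bar
trend = [intercept + slope * ti for ti in t]

# Ratio of each observation to its trend value
ratios = [yi / yh for yi, yh in zip(y, trend)]

# Average the ratios per quarter, then rescale so the indexes sum to 4
averages = [sum(ratios[q::seasons]) / (n // seasons) for q in range(seasons)]
indexes = [a * seasons / sum(averages) for a in averages]

print(round(intercept, 2), round(slope, 3))
print([round(i, 3) for i in indexes])
```

An index above 1 (quarters 2 and 3) marks a season that lies above the trend line, and an index below 1 marks a season that lies below it.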
Question 2
(a)
[Figure: time series plot of the quarterly number of visitors (in thousands). Vertical axis: Number of
Visitors, 0 to 600; horizontal axis: Year, 2000 to 2003.]
We note a distinct pattern of seasonal variation in the series. This could have been detected in
the data, but in the graph one can see it without even thinking!
(b) Exponential smoothing is a method to remove the random variation in a time series and makes
it easier to detect the trend. In the further discussions you will see that exponential smoothing is
not an accurate forecasting method if the time series has clear seasonal effects.
Activity 6.3
Question 1
$$\text{MAD} = \frac{|57 - 63| + |60 - 72| + |70 - 86| + |75 - 71| + |70 - 60|}{5} = \frac{6 + 12 + 16 + 4 + 10}{5} = \frac{48}{5} = 9.6$$

$$\text{SSE} = (57 - 63)^2 + (60 - 72)^2 + (70 - 86)^2 + (75 - 71)^2 + (70 - 60)^2 = 552$$
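Both measures are easy to compute in code, which helps when several forecasting models have to be compared. In the sketch below the first number of each pair in the calculation above is treated as the actual value and the second as the forecast; that labelling is an assumption, but it does not affect MAD or SSE because only the size of each error matters.

```python
actual = [57, 60, 70, 75, 70]     # first value of each pair (assumed to be y_t)
forecast = [63, 72, 86, 71, 60]   # second value of each pair (assumed to be F_t)

errors = [a - f for a, f in zip(actual, forecast)]

mad = sum(abs(e) for e in errors) / len(errors)   # mean absolute deviation
sse = sum(e ** 2 for e in errors)                 # sum of squares for forecast error

print(mad, sse)  # 9.6 552
```

Squaring gives large errors much more weight: the single error of 16 contributes 256 of the 552 in SSE, but only 16 of the 48 in the MAD numerator.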
Question 2
Answer: 4
Learning Outcomes
After you have completed this study unit, use the chapter summary as a checklist to evaluate whether
you have mastered the knowledge in this chapter and have really acquired a good understanding
of the work covered.
Can you
• list and understand principles involved in the general procedures when applying chi-squared
testing?
• apply your knowledge of the chi-square test, for nominal scale variables, to describe a single
population and/or to determine the relationship between two populations?
• apply non-parametric statistical tests?
• employ the Wilcoxon rank sum test, the sign test and the Wilcoxon signed rank sum test to
compare two populations of ordinal data?
• analyse the relationship between two interval variables using simple linear regression?
• explain and decompose the components of a time series?
• explain how trend and seasonal variation are measured?
• describe exponential smoothing, seasonal indexes and the autoregressive model for forecasting
in time series?
References
Keller, Gerald et al. (2005) Instructor’s Suite CD for the Student Edition of Statistics for Management
and Economics, Belmont, CA USA: Duxbury, Thomson.
Weiers, Ronald M. (2005) Introduction to Business Statistics, Brooks/Cole, Duxbury, Thomson.