Sta301 Lec45
Lecture No. 45
of the course on
Statistics and Probability
by
Miss Saleha Naghmi Habibullah
IN THE LAST LECTURE,
YOU LEARNT
The mean of the frequency distribution is
$$\bar{X} = \frac{\sum fx}{n}, \quad \text{where } n = \sum f$$
Thus we have the following
calculations:
Number of Arrivals    Frequency
x                     f            fx
0                     84           0
1                     114          114
2                     70           140
3                     60           180
4                     32           128
5                     16           80
6                     15           90
7                     4            28
8                     5            40
Total                 400          800
Hence:
$$\text{Mean } \bar{X} = \frac{\sum fx}{n} = \frac{800}{400} = 2$$
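The calculation above can be sketched in Python as follows. This is only a minimal illustration using the frequency table of this example; the variable names are chosen here for convenience and are not part of the lecture.

```python
# Frequency table from the lecture: x = number of arrivals, f = frequency.
arrivals = [0, 1, 2, 3, 4, 5, 6, 7, 8]
frequencies = [84, 114, 70, 60, 32, 16, 15, 4, 5]

n = sum(frequencies)                                        # n = sum of f
sum_fx = sum(x * f for x, f in zip(arrivals, frequencies))  # sum of f * x
mean = sum_fx / n                                           # mean = (sum of fx) / n

print(n, sum_fx, mean)
```

The printed totals should agree with the last row of the table, and the mean with the value 2 obtained above.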
Replacing $\mu$ by $\bar{X}$, the formula for the
Poisson probabilities is
$$f(x) = \frac{e^{-\bar{X}} \, \bar{X}^{x}}{x!} = \frac{e^{-2} \, 2^{x}}{x!}$$
Hence, we obtain:
Number of Customer   Observed       Poisson Probabilities   Expected Frequencies
Arrivals             Frequencies    f(x)                    400 f(x)
0                    84             0.1353                  54.12
1                    114            0.2707                  108.28
2                    70             0.2707                  108.28
3                    60             0.1804                  72.16
4                    32             0.0902                  36.08
5                    16             0.0361                  14.44
6                    15             0.0120                  4.80
7                    4              0.0034                  1.36
8                    5              0.0009                  0.36
9 or more            0              0.0002                  0.08
Total                400            1                       400
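The Poisson probabilities and expected frequencies in this table can be reproduced with a short Python sketch. The function name `poisson_pmf` is a hypothetical helper introduced here, and the "9 or more" class is assumed to take whatever probability remains after x = 0, 1, ..., 8.

```python
import math

def poisson_pmf(x, mu):
    """Poisson probability f(x) = e^(-mu) * mu^x / x!."""
    return math.exp(-mu) * mu ** x / math.factorial(x)

mu = 2.0   # sample mean, used as the estimate of the Poisson parameter
n = 400    # total number of observations

probs = [poisson_pmf(x, mu) for x in range(9)]   # x = 0, 1, ..., 8
probs.append(1.0 - sum(probs))                   # "9 or more" takes the remainder

expected = [n * p for p in probs]                # expected frequency = n * f(x)

labels = list(range(9)) + ["9 or more"]
for label, p, e in zip(labels, probs, expected):
    print(label, round(p, 4), round(e, 2))
```

The probabilities agree with the table to four decimal places. The expected frequencies may differ slightly in the second decimal place, because the table multiplies the rounded probabilities by 400 whereas the sketch uses unrounded ones.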
Next, we apply the chi-square test of
goodness of fit according to the following
procedure:
Hypothesis-Testing Procedure:
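Step-1:
Null and Alternative Hypotheses:
H0 : The number of customer arrivals follows a Poisson distribution.
HA : The number of customer arrivals does not follow a Poisson distribution.
Step-2:
Level of Significance:
$\alpha = 0.05$
Step-3:
Test Statistic:
$$\chi^2 = \sum_i \frac{(o_i - e_i)^2}{e_i}$$
which, if H0 is true, follows the chi-square
distribution having
k - 1 - r degrees of freedom
(where k = number of categories after having
carried out the necessary mergers, and r =
number of parameters that we estimate from
the sample data).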
Step-4:
Computations:
The necessary calculations are shown
in the following table:
Number of Customer   Observed         Expected
Arrivals             Frequency o_i    Frequency e_i    (o - e)    (o - e)^2    (o - e)^2 / e
0                    84               54.12            29.88      892.81       16.50
1                    114              108.28           5.72       32.72        0.30
2                    70               108.28           -38.28     1465.36      13.53
3                    60               72.16            -12.16     147.87       2.05
4                    32               36.08            -4.08      16.65        0.46
5                    16               14.44            1.56       2.43         0.17
6                    15               4.80
7                    4                1.36
8                    5                0.36
9 or more            0                0.08
6 or more (merged)   24               6.60             17.40      302.76       45.87
Total                400              400                                      $\chi^2$ = 78.88
With reference to the above, it should
be noted that, since some of the expected
frequencies are less than the required
minimum of 5, it became necessary to
combine some of those classes.
Combination is best accomplished
working from the bottom up.
In order that we obtain a number greater
than 5, the last four expected frequencies had
to be combined.
Hence, the effective number of
categories becomes 7.
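The merging rule and the computation of the test statistic can be sketched in Python as follows. The helper `merge_from_bottom` is a hypothetical name introduced here, and it only pools classes at the lower end of the table (which is all that this example needs). The expected frequencies are those of the table, rounded to two decimal places.

```python
def merge_from_bottom(observed, expected, minimum=5.0):
    """Pool the last classes until the final expected frequency reaches the minimum."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and exp[-1] < minimum:
        # Remove the last class and add it to the class just above it.
        last_e = exp.pop()
        last_o = obs.pop()
        exp[-1] += last_e
        obs[-1] += last_o
    return obs, exp

observed = [84, 114, 70, 60, 32, 16, 15, 4, 5, 0]
expected = [54.12, 108.28, 108.28, 72.16, 36.08, 14.44, 4.80, 1.36, 0.36, 0.08]

obs, exp = merge_from_bottom(observed, expected)

# Chi-square statistic: sum of (o - e)^2 / e over the merged classes.
chi_square = sum((o - e) ** 2 / e for o, e in zip(obs, exp))

print(len(exp))               # number of classes after merging
print(round(chi_square, 2))   # value of the test statistic
```

With the data of this example, the last four classes are pooled into one class with observed frequency 24 and expected frequency 6.60, leaving 7 classes.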
Step-5:
Determination of the
Critical Region:
Since the effective number of
categories becomes 7
therefore k = 7.
Also, since the one lone parameter of
the Poisson distribution has been
estimated from the sample data, hence r =
1.
Hence:
Our statistic follows the chi-square
distribution having
k - 1 - r = 7 - 1 - 1 = 5
degrees of freedom.
The critical region is given by
$\chi^2 \geq \chi^2_{0.05}(5) = 11.07$
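[Figure: CRITICAL REGION. Chi-square curve with 5 degrees of freedom, starting at 0; the critical region of area 0.05 lies to the right of 11.07, and the computed value 78.88 falls well inside it.]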
Step-6:
Conclusion:
Since the computed value of our test
statistic i.e. 78.88 is much larger than the
critical value 11.07, therefore, we reject H0
and conclude that the distribution is probably
not a Poisson distribution with parameter 2.
(With only 5% risk of committing Type-1
error, we conclude that the fit is not good.)
In fact, the computed value of our test
statistic i.e. 78.88 is so large that it is possible
that if we had set the level of significance at
1%, even then it would have exceeded the
critical value.
The students are encouraged to check
this up themselves.
If the computed value does fall in the
critical region corresponding to 1% level of
significance, then our result is highly
significant.
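One way to check a computed value against any level of significance is to find the area to its right under the chi-square curve. The sketch below assumes a closed-form expression for the upper-tail area that holds for 5 degrees of freedom only; this formula and the function name `chi2_sf_5df` are not from the lecture and are used here purely for illustration.

```python
import math

def chi2_sf_5df(x):
    """Area to the right of x under the chi-square curve with 5 d.f. (5 d.f. only)."""
    root = math.sqrt(x)
    normal_part = math.erfc(math.sqrt(x / 2.0))
    density_part = math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * (root + root ** 3 / 3.0)
    return normal_part + density_part

# Sanity check: the area to the right of the 5% critical value should be about 0.05.
print(round(chi2_sf_5df(11.07), 3))

# Area to the right of the computed test statistic.
p_value = chi2_sf_5df(78.88)
print(p_value < 0.05, p_value < 0.01)
```

If the area to the right of the computed value is smaller than 0.01, the computed value lies in the critical region at the 1% level as well.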
RATIONALE OF THE CHI-SQUARE
TEST
OF GOODNESS OF FIT:
It is clear that
$$\chi^2 = \sum_i \frac{(o_i - e_i)^2}{e_i}$$
will be a
small quantity when all the $o_i$'s are close to
the corresponding $e_i$'s.
(In fact, if the observed frequencies are
exactly equal to the expected ones, then $\chi^2$
will be exactly equal to zero.)
The $\chi^2$-statistic will become larger
when the differences between the $o_i$'s and $e_i$'s
become larger.
Thus, $\chi^2$ measures the amount of
deviation (or discrepancy) between the
observed and the expected results.
ASSUMPTIONS OF THE
CHI-SQUARE TEST OF GOODNESS OF
FIT:
$$e_{22} = \frac{(300)(250)}{500} = 150$$
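Hence, we have:
Expected Frequencies:
                 Men    Women    Total
Want PC          100    100      200
Don't Want PC    150    150      300
Total            250    250      500
Next, we compute $o_{ij} - e_{ij}$, $(o_{ij} - e_{ij})^2$ and $(o_{ij} - e_{ij})^2 / e_{ij}$, as shown
below:
Observed         Expected
Frequency o_ij   Frequency e_ij   o_ij - e_ij   (o_ij - e_ij)^2   (o_ij - e_ij)^2 / e_ij
120              100              20            400               4.00
80               100              -20           400               4.00
130              150              -20           400               2.67
170              150              20            400               2.67
Total 500        500                                              $\chi^2$ = 13.33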
Hence, the computed value of our test
statistic comes out to be
$\chi^2 = 13.33$
v) Critical Region:
$\chi^2 \geq \chi^2_{0.05}(1) = 3.84$
vi) Conclusion:
Since 13.33 is bigger than 3.84, we
reject H0 and conclude that desire to own
a personal computer set and sex are
associated.
Now that we have concluded that
gender and desire for PC are associated, the
natural question is, “Which gender is it
where the proportion of persons wanting a
PC is higher?”
We have:
                 Men    Women    Total
Want PC          120    80       200
Don't Want PC    130    170      300
Total            250    250      500
A close look at the given data indicates
clearly that the proportion of persons who
are desirous of owning a personal computer
is higher among men than among women.
Moreover, since our test statistic has come
out to be significant, we can say
that the proportion of men wanting a PC is
significantly higher than the proportion of
women wanting to own a PC.
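The test of independence for this 2 x 2 table, together with the two proportions just compared, can be sketched in Python as follows. The variable names are illustrative only; the critical value 3.84 is the one quoted in the lecture for 1 degree of freedom at the 5% level.

```python
# Observed 2 x 2 table: rows = Want PC / Don't want PC, columns = Men / Women.
observed = [[120, 80],
            [130, 170]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence: (row total * column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (o - e)^2 / e over all four cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

prop_men = observed[0][0] / col_totals[0]     # proportion of men wanting a PC
prop_women = observed[0][1] / col_totals[1]   # proportion of women wanting a PC

print(expected)
print(round(chi_square, 2), chi_square > 3.84)
print(prop_men, prop_women)
```

The sample proportions come out as 0.48 for men and 0.32 for women, which is the difference that the significant test statistic refers to.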
Let us consider another example:
EXAMPLE
A national survey was conducted in a
country to obtain information regarding the
smoking patterns of adult males by
marital status.
A random sample of 1772 citizens, 18
years old and over, yielded the following
data :
                        SMOKING PATTERN
Marital      Total         Only         Regular
Status       Abstinence    at times     Smoker     Total
Single       67            213          74         354
Married      411           633          129        1173
Widowed      85            51           7          143
Divorced     27            60           15         102
Total        590           957          225        1772
Use this data to decide whether there is
an association between marital status and
smoking patterns.
The students are encouraged to work
on this problem on their own, and to decide
for themselves whether to accept or reject
the null hypothesis.
(In this problem, the null and the
alternative hypotheses will be:
H0 : Marital status and smoking patterns
are statistically independent.
HA : Marital status and smoking patterns
are not statistically independent.)
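For students who wish to check their hand calculations, the following is a minimal Python sketch of the same procedure for a table with any number of rows and columns. The function name `chi_square_independence` is hypothetical. The Married / "Only at times" cell is taken as 633, which is an assumption: it is the value implied by the row total 1173 and the column total 957.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count under H0
            statistic += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

# Rows: Single, Married, Widowed, Divorced.
# Columns: Total abstinence, Only at times, Regular smoker.
# Assumption: the Married / "Only at times" cell is 633 (implied by the totals).
smoking = [[67, 213, 74],
           [411, 633, 129],
           [85, 51, 7],
           [27, 60, 15]]

statistic, df = chi_square_independence(smoking)
print(round(statistic, 2), df)
```

The printed statistic is then compared with the tabulated chi-square value for the printed degrees of freedom at the chosen level of significance.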
This brings us to the end of the series
of topics that were to be discussed in some
detail for this course on Statistics and
Probability.
For the remaining part of today’s
lecture, we will be discussing some
interesting and important concepts.
First and foremost, let us consider the
concept of Degrees of Freedom:
As you will recall, when discussing the
t-distribution, the chi-square distribution, and
the F-distribution, it was conveyed to you
that the parameters that exist in the
equations of those distributions are known as
degrees of freedom.
But the question is, ‘Why are these
parameters called degrees of freedom?’
Let us try to obtain an answer to this
question by considering the following :
Consider the two-dimensional plane, and
consider a straight line segment in the plane.
If one end-point of the line segment is fixed at
some point (x0, y0), the line segment can be
rotated in the plane such that the fixed end-point
stays in its place. In other words, we can say
that the line segment is free to move in the
plane with one restriction. Hence, if we fix
one end-point of the line segment, then we are
left with one degree of freedom for its
movement.
Next, consider the case when we fix both
end-points of the line segment in the plane. In
this case, both degrees of freedom are lost, and
therefore the line can no longer move in the
plane.
But, if we view the above situation with
reference to the three-dimensional space --- the
one that we live in --- we note that the whole
plane (containing the fixed line segment) can
move in three dimensions, and hence, we have
one degree of freedom for its movement.
Let us try to understand this concept in
another way:
Suppose we have a sample of size n =
6, and suppose that the sum of the sample
values is 20. That is, we have the following
situation:
Our Sample :
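A minimal Python sketch of this situation is given below. The five values chosen freely are hypothetical (they are not from the lecture); the point is only that, once the sum is fixed at 20, the sixth value is no longer free.

```python
n = 6
required_sum = 20

# Five hypothetical values chosen freely (illustrative, not from the lecture).
free_values = [2, 5, 1, 4, 3]

# The sixth value is forced by the restriction on the sum.
last_value = required_sum - sum(free_values)
sample = free_values + [last_value]

# One restriction (the fixed sum) removes one degree of freedom.
degrees_of_freedom = n - 1

print(sample, sum(sample), degrees_of_freedom)
```

Whatever five values are chosen, the last one is determined by the restriction, so only n - 1 = 5 of the six values are free to vary.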
$H_0 : \mu_1 - \mu_2 = 0$
$H_A : \mu_1 - \mu_2 \neq 0$
(Two-tailed test)
The computed value of our test statistic
came out to be 3.43, whereas, at the 2%
level of significance (0.01 in each tail), the critical values were
±2.33; hence, we rejected H0.
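[Figure: standard normal curve with critical values Z.01 = -2.33 and Z.01 = +2.33 on either side of Z = 0; the calculated Z = 3.43 lies in the upper rejection region. On the scale of $\bar{X}_1 - \bar{X}_2$, the centre $\mu_1 - \mu_2 = 0$ corresponds to Z = 0, and the observed difference $\bar{X}_1 - \bar{X}_2 = 1.15$ corresponds to Z = 3.43.]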
Hence, we concluded that there was a
significant difference between the average
hourly wage of a temporary computer
analyst and the average hourly wage of a
temporary registered nurse.
This conclusion could also have been
reached by using the
p-value method:
I. Looking up the probability of Z > 3.43 in
the area table of the standard normal
distribution yields an area of .5000 – .4996
= .0004.
II. To compute the p-value, we need to be
concerned with the region less than –3.43 as
well as the region greater than 3.43 (because
the rejection region is in both tails).
p-value = 0.0004+0.0004 = 0.0008
[Figure: standard normal curve on the scale of z, with an area of .05/2 = .025 in each tail beyond -1.96 and +1.96, and an area of 0.0004 in each tail beyond -3.43 and +3.43.]
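The two-tailed p-value can also be computed directly rather than read from the area table. The sketch below uses the complementary error function from Python's standard library; the function name `two_tailed_p_value` is a hypothetical helper introduced here.

```python
import math

def two_tailed_p_value(z):
    """Two-tailed p-value for a standard normal test statistic z."""
    # erfc(|z| / sqrt(2)) equals 2 * P(Z > |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2.0))

p_value = two_tailed_p_value(3.43)
print(round(p_value, 4))
print(p_value < 0.05)
```

The value obtained in this way is roughly 0.0006, slightly smaller than the table-based 0.0008, because the area table is rounded to four decimal places. The conclusion is unchanged: the p-value is far below the level of significance.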
Acknowledgement:
Munir Ahmad
(President, Islamic Society of Statistical Sciences)
Also, at this point, the students are
reminded of what was conveyed to them in
the very first lecture of this course:
Statistical thinking will one day be as
necessary for efficient citizenship as the
ability to read and write.
If this course has been able to inculcate
in the students the fundamentals of
statistical and probabilistic thinking, then
this course has served its purpose.
Education is a wholesome process.
The essence of effective teaching is
effective and inspiring communication on
the part of the teacher, and concentrated
attention and effort on the part of the
student.
May Allah be with you in your pursuit
of knowledge.
May He bless you with the wealth of
knowledge, and the capability to utilize this
knowledge for the betterment of humanity.