Lecture 13-14-15: Chi-Square Test

1. The document discusses analysis of categorical data through goodness-of-fit tests and contingency tables.
2. Goodness-of-fit tests determine whether observed data fit a hypothesized categorical distribution using a chi-square statistic.
3. Contingency tables examine the relationship between two categorical variables using independence tests that compare observed and expected counts in a cross-tabulation.


Analysis of Categorical Data
1. Goodness-of-Fit Test
Many experiments result in measurements that are
qualitative or categorical rather than quantitative.
◦ People classified by ethnic origin
◦ Cars classified by color
These data sets have the characteristics of a
multinomial experiment.
The Multinomial Experiment
1. The experiment consists of n identical trials.
2. Each trial results in one of k categories.
3. The probability that the outcome falls into a
particular category i on a single trial is pi and
remains constant from trial to trial. The sum of
all k probabilities, p1+p2 +…+ pk = 1.
4. The trials are independent.
5. We are interested in the number of outcomes in each category, O1, O2, …, Ok, with O1 + O2 + … + Ok = n.
The Binomial Experiment
A special case of the multinomial experiment with k = 2.
Categories 1 and 2: success and failure
p1 and p2: p and q
O1 and O2: x and n - x
We made inferences about p (and q = 1 - p).

In the multinomial experiment, we make inferences about all the probabilities, p1, p2, p3, …, pk.
Pearson's Chi-Square Statistic
We have some preconceived idea about the values of
the pi and want to use sample information to see if
we are correct.
The expected number of times that outcome i will
occur is Ei = npi.
The farther the observed cell counts, Oi, are from what we hypothesize under H0, the more likely it is that H0 should be rejected.
Pearson's Chi-Square Statistic
We use the Pearson chi-square statistic:

χ2 = Σ (Oi - Ei)2 / Ei

where the sum is taken over all k cells.
• When H0 is true, the differences Oi - Ei will be small; when H0 is false, they will tend to be large.
• Look for large values of χ2 based on the chi-square distribution with a particular number of degrees of freedom.
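As a quick numeric illustration (a sketch with made-up counts, not part of the lecture), the statistic can be computed directly from its definition:

```python
# Pearson chi-square statistic: chi2 = sum of (O_i - E_i)^2 / E_i.
# Hypothetical data: n = 100 trials over k = 3 categories.
observed = [18, 55, 27]
probs = [0.2, 0.5, 0.3]                       # hypothesized p_i under H0

n = sum(observed)                             # 100
expected = [n * p for p in probs]             # E_i = n * p_i -> [20.0, 50.0, 30.0]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 6))                         # 1.0 -> small, consistent with H0
```

A small value like this suggests the observed counts are close to what H0 predicts; only large values of χ2 lead to rejection.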
Degrees of Freedom

These will be different depending on the


application.
1. Start with the number of categories or cells in the
experiment.
2. Subtract 1df for each linear restriction on the cell
probabilities. (You always lose 1 df since p1+p2
+…+ pk = 1.)
3. Subtract 1 df for every population parameter you
have to estimate to calculate or estimate Ei.
Assumptions
Assumptions for Pearson’s Chi-Square:
1. The cell counts O1, O2, …,Ok must satisfy the conditions of
a multinomial experiment, or a set of multinomial
experiments created by fixing either the row or the column
totals.
2. The expected cell counts E1, E2, …, Ek ≥ 5.

If not (one or more is < 5):
1. Choose a larger sample size n.
The larger the sample size, the closer the chi-square
distribution will approximate the distribution of your test
statistic χ2.
2. It may be possible to combine one or more of the cells with
small expected cell counts, thereby satisfying the
assumption.
The Goodness of Fit Test

• The simplest of the applications.
• A single categorical variable is measured, and exact
numerical values are specified for each of the pi.
• Expected cell counts are Ei = npi
• Degrees of freedom: df = k-1
Example 1
• Toss a die 300 times with the following results. Is the die
fair or biased?

Upper Face 1 2 3 4 5 6
Number of times 50 39 45 62 61 43

A multinomial experiment with k = 6 and O1 to O6 given in the table.
We test:
H0: p1= 1/6; p2 = 1/6;…p6 = 1/6 (die is fair)
H1: at least one pi is different from 1/6 (die is biased)
Example 1 - Solution
•Calculate the expected cell counts:
Ei = npi = 300(1/6) = 50
Upper Face 1 2 3 4 5 6
Oi 50 39 45 62 61 43

Ei 50 50 50 50 50 50

Test statistic and rejection region:

χ2 = Σ (Oi - Ei)2 / Ei = (50 - 50)2/50 + (39 - 50)2/50 + … + (43 - 50)2/50 = 9.2

With df = k - 1 = 5 and α = 0.05, reject H0 if χ2 > 11.07. Since 9.2 < 11.07, do not reject H0. There is insufficient evidence to indicate that the die is biased.
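The die example can be reproduced in a few lines of Python (a sketch, not part of the original slides; the constant 11.0705 is the upper-tail 0.05 chi-square value for df = 5, from a chi-square table):

```python
# Goodness-of-fit test for the die-tossing example (k = 6, n = 300).
observed = [50, 39, 45, 62, 61, 43]
n = sum(observed)                      # 300
expected = [n / 6] * 6                 # E_i = n * (1/6) = 50 under H0: fair die

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                 # k - 1 = 5

CRITICAL_05 = 11.0705                  # chi-square 0.05 critical value, df = 5
print(round(chi2, 1))                  # 9.2
print("reject H0" if chi2 > CRITICAL_05 else "do not reject H0")
```

Since 9.2 < 11.0705, the script prints "do not reject H0", matching the conclusion above.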
Some Notes
The test statistic, χ2 has only an approximate
chi-square distribution.
For the approximation to be accurate, statisticians
recommend Ei ≥ 5 for all cells.
Goodness-of-fit tests differ from previous tests, since the experimenter uses H0 for the model he thinks is true:
H0: model is correct (as specified)
H1: model is not correct
2. CONTINGENCY TABLES: A TWO-WAY CLASSIFICATION

The test of independence of variables is used to determine whether two variables are independent when a single sample is selected.

The experimenter measures two qualitative variables to generate bivariate data.
⚫ Gender and colorblindness
⚫ Age and opinion
⚫ Professorial rank and type of university
Summarize the data by counting the observed number
of outcomes in each of the intersections of category levels
in a contingency table.
R × C CONTINGENCY TABLE

The contingency table has r rows and c columns, giving rc total cells.
1 2 … c
1 O11 O12 … O1c
2 O21 O22 … O2c
… … … … ….
r Or1 Or2 … Orc

• We study the relationship between the two variables. Is one method of classification contingent or dependent on the other?
Does the distribution of measurements in the various categories for
variable 1 depend on which category of variable 2 is being observed?
If not, the variables are independent.
CHI-SQUARE TEST OF INDEPENDENCE

H0: classifications are independent


H1 : classifications are dependent

• Observed cell counts are Oij for row i and column j.
• Expected cell counts are Eij = npij.
✔ If H0 is true and the classifications are independent, then pij = pipj = P(falling in row i)P(falling in column j).
CHI-SQUARE TEST OF INDEPENDENCE

The test statistic χ2 = Σ (Oij - Eij)2 / Eij, summed over all rc cells, has an approximate chi-square distribution with df = (r - 1)(c - 1).
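In practice the pij are estimated from the margins, so each expected count is Eij = (row i total)(column j total)/n. A minimal sketch with hypothetical 2 × 2 counts:

```python
# Expected counts under independence: E_ij = (row_i total)(col_j total) / n.
observed = [[30, 20],
            [10, 40]]                  # hypothetical 2 x 2 table

row_totals = [sum(row) for row in observed]            # [50, 50]
col_totals = [sum(col) for col in zip(*observed)]      # [40, 60]
n = sum(row_totals)                                    # 100

expected = [[r * c / n for c in col_totals] for r in row_totals]
print(expected)                        # [[20.0, 30.0], [20.0, 30.0]]

df = (len(observed) - 1) * (len(observed[0]) - 1)      # (2-1)(2-1) = 1
```

Note that the expected rows are proportional to each other: under independence, every row follows the same column distribution.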
EXAMPLE
Furniture defects are classified according to type of
defect and shift on which it was made.
Shift
Type 1 2 3 Total
A 15 26 33 74
B 21 31 17 69
C 45 34 49 128
D 13 5 20 38
Total 94 96 119 309

Do the data present sufficient evidence to indicate that the type of furniture defect varies with the shift during which the piece of furniture is produced? Test at the 1% level of significance.

H0: type of defect is independent of shift
H1: type of defect depends on the shift
EXAMPLE
• Calculate the expected cell counts. For example:

E11 = (row 1 total)(column 1 total)/n = 74(94)/309 = 22.51

With df = (4 - 1)(3 - 1) = 6 and α = 0.01, reject H0 if χ2 > 16.812. (The rejection region is always upper-tailed, so you don't need to divide α by 2.)

Since χ2 = 19.178 > 16.812, reject H0. There is sufficient evidence to indicate that the proportions of defect types vary from shift to shift.
EXAMPLE
• Calculate the expected cell counts. For example:

Chi-Square Test: 1, 2, 3
Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

             1        2        3    Total
    1       15       26       33       74
         22.51    22.99    28.50
         2.506    0.394    0.711

    2       21       31       17       69
         20.99    21.44    26.57
         0.000    4.266    3.449

    3       45       34       49      128
         38.94    39.77    49.29
         0.944    0.836    0.002

    4       13        5       20       38
         11.56    11.81    14.63
         0.179    3.923    1.967

Total       94       96      119      309

Chi-Sq = 19.178, DF = 6, P-Value = 0.004

Reject H0. There is sufficient evidence to indicate that the proportions of defect types vary from shift to shift.
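Minitab itself isn't needed to check these numbers. A short pure-Python sketch (not part of the original slides; 16.812 is the upper-tail 0.01 chi-square value for df = 6, from a chi-square table) reproduces the test statistic:

```python
# Chi-square test of independence for the furniture-defect table.
observed = [[15, 26, 33],
            [21, 31, 17],
            [45, 34, 49],
            [13,  5, 20]]

row_totals = [sum(row) for row in observed]            # [74, 69, 128, 38]
col_totals = [sum(col) for col in zip(*observed)]      # [94, 96, 119]
n = sum(row_totals)                                    # 309

# chi2 = sum over all cells of (O_ij - E_ij)^2 / E_ij, E_ij = r_i * c_j / n
chi2 = sum((observed[i][j] - r * c / n) ** 2 / (r * c / n)
           for i, r in enumerate(row_totals)
           for j, c in enumerate(col_totals))
df = (len(observed) - 1) * (len(observed[0]) - 1)      # (4-1)(3-1) = 6

CRITICAL_01 = 16.812                                   # chi-square 0.01 value, df = 6
print(round(chi2, 3), df)                              # 19.178 6
print("reject H0" if chi2 > CRITICAL_01 else "do not reject H0")
```

The statistic matches the Minitab value of 19.178 and exceeds 16.812, so the script prints "reject H0".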


Example
Random samples of 200 voters in each of four wards were surveyed
and asked if they favor candidate A in a local election.
Ward
1 2 3 4 Total
Favor A 76 53 59 48 236
Do not favor A 124 147 141 152 564
Total 200 200 200 200 800

Do the data present sufficient evidence to indicate that the fraction of voters favoring candidate A differs in the four wards?

H0: fraction favoring A is independent of ward
H1: fraction favoring A depends on the ward

Equivalently, H0: p1 = p2 = p3 = p4, where pi = fraction favoring A in each of the four wards.
Example - Solution

• Calculate the expected cell counts. For example:

E11 = (row 1 total)(column 1 total)/n = 236(200)/800 = 59

The test statistic is χ2 = Σ (Oij - Eij)2 / Eij = 10.72 with df = (2 - 1)(4 - 1) = 3. Since 10.72 exceeds the 0.05 critical value of 7.815, reject H0. There is sufficient evidence to indicate that the fraction of voters favoring A varies from ward to ward.
Example - Solution
Since we know that there are differences among the four wards, what is the nature of the differences?
Look at the proportions in favor of candidate A in the four
wards.

Ward 1 2 3 4
Favor A 76/200=.38 53/200 = .27 59/200 = .30 48/200 = .24

Candidate A is doing best in the first ward, and worst in the


fourth ward. More importantly, he does not have a majority of the
vote in any of the wards!
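The whole ward analysis, including the sample proportions, can be checked with a short Python sketch (an illustration, not part of the original slides; 7.815 is the upper-tail 0.05 chi-square value for df = 3):

```python
# Test of homogeneity: is the fraction favoring A the same in all four wards?
favor     = [76, 53, 59, 48]
not_favor = [124, 147, 141, 152]
observed = [favor, not_favor]          # 2 x 4 table, column totals fixed at 200

row_totals = [sum(row) for row in observed]            # [236, 564]
col_totals = [sum(col) for col in zip(*observed)]      # [200, 200, 200, 200]
n = sum(row_totals)                                    # 800

chi2 = sum((observed[i][j] - r * c / n) ** 2 / (r * c / n)
           for i, r in enumerate(row_totals)
           for j, c in enumerate(col_totals))
df = (len(observed) - 1) * (len(observed[0]) - 1)      # (2-1)(4-1) = 3

CRITICAL_05 = 7.815                                    # chi-square 0.05 value, df = 3
print(round(chi2, 3))                                  # 10.722
print("reject H0" if chi2 > CRITICAL_05 else "do not reject H0")

# Sample proportions favoring A, ward by ward:
print([f / 200 for f in favor])                        # [0.38, 0.265, 0.295, 0.24]
```

The proportions confirm the slide's reading: candidate A does best in ward 1 and worst in ward 4, and is below 0.5 everywhere.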
