
Chapter 5-6 Estimation Hypothesis

1) Point estimate: x̄ = 1.52 hrs. Known population variance: σ² = 2.25 hrs². Sample size: n = 20. Confidence level: 95%. Critical value (Z0.025): 1.96. Margin of error: e = Z0.025 × σ/√n = 1.96 × √(2.25/20) ≈ 1.96 × 0.33 ≈ 0.65. 95% CI: (1.52 − 0.65, 1.52 + 0.65) = (0.87, 2.17). 2) The CI would be narrower, since the margin of error decreases as the sample size increases.

Uploaded by

Lielina Endris

Biostatistics & Epidemiology

Berhanu Teshome(MSc)
Assistant Professor of Biostatistics
berteshome19@gmail.com

Berhanu Teshome, SPHMMC 11/4/2023


5. Elementary statistical estimation theory
• Point and interval estimation on means and proportions
• Sample size estimation for cross sectional studies: mean and proportion estimation
Statistical Inference
Based on how the collected data are used, statistics may be classified as:
• Inferential Statistics: estimation and hypothesis testing.
• Descriptive Statistics: is concerned with summary calculations, graphs, charts and tables.
Statistical Inference
[Figure: diagram labelled "Sampling" and "Descriptive Statistics"]
Statistical Inference
Inferential Statistics
• Estimation:
  – Point estimation
  – Interval estimation
• Hypothesis testing:
  – One-sample t-test
  – Two-sample t-test
  – ANOVA
  – Paired t-test
  – Chi-square test of independence, etc.
Statistical estimation
• The process of drawing conclusions about an entire population based on the data in a sample is known as statistical inference.
• Methods of inference usually fall into one of two broad categories: estimation or hypothesis testing.
Estimation
• Is concerned with estimating the values of specific population parameters based on sample statistics.
• Is about using information in a sample to make estimates of the characteristics (parameters) of the source population.
Estimation, Estimator & Estimate
♣ Estimation is the computation of a statistic from sample data, often yielding a value that is an approximation (guess) of its target, an unknown true population parameter value.
♣ The statistic itself is called an estimator and can be of two types: point or interval.
♣ The value or values that the estimator assumes are called estimates.
• Two methods of estimation are commonly used: point estimation and interval estimation.
• Point estimation involves the calculation of a single number to estimate the population parameter.
• Interval estimation specifies a range of reasonable values for the parameter.
Point versus Interval Estimators
♣ An estimator that represents a "single best guess" is called a point
estimator.
♣ When the estimate is of the form of a "range of plausible values", it
is called an interval estimator.
 Thus,
 A point estimate is of the form: [ Value ],
 Whereas, an interval estimate is of the form: [ lower limit, upper
limit ]

Properties of good estimators
A. Unbiased
♣ A statistic is said to be an unbiased estimator if its expected value is equal to the parameter being estimated.
• The sample mean (x̄) is an unbiased estimator of the population mean: E(x̄) = µ.
B. Consistent
♣ A statistic is said to be a consistent estimator if its value gets closer to the parameter as the sample size grows larger.
C. Relatively Efficient
♣ A statistic is said to be a relatively efficient estimator if its variance is smaller than that of competing estimators.
Estimating the Sampling Error
• Any estimates derived from samples are subject to sampling error.
• This comes from the fact that only a part of the population was observed, instead of the whole.
• Different samples could have come up with different results. The amount of variation that exists among the estimates from the different possible samples is the sampling error.
• The set of sample means in repeated random samples of size n from a given population has variance σ²/n.
• The standard deviation of this set of sample means is σ/√n and is referred to as the standard error of the mean (sem) or the standard error.
• The sem is estimated by s/√n if σ is unknown.
Estimating the Sampling Error
• The sampling error depends on the sample size (n), the variability of individual sample points (σ), and the sampling and estimation methods.
• As n increases, the sample mean (x̄) and the sample variance s² approach the values of the true population parameters, µ and σ², respectively.
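The claim that sample means vary with standard deviation σ/√n can be checked with a small simulation. The sketch below is illustrative only: the function name `simulate_sem`, the seed, the number of repetitions and the normal population (µ = 1.52, σ = 1.5, n = 20) are assumptions chosen for the demonstration, not part of the slides.

```python
import random
import statistics

def simulate_sem(mu, sigma, n, reps=5000, seed=0):
    """Draw `reps` samples of size n from N(mu, sigma) and return the
    standard deviation of the resulting sample means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.stdev(means)

# Assumed illustrative population: mu = 1.52, sigma = 1.5, samples of n = 20.
sigma, n = 1.5, 20
empirical = simulate_sem(mu=1.52, sigma=sigma, n=n)
theoretical = sigma / n ** 0.5  # standard error of the mean, sigma / sqrt(n)
print(f"empirical SD of sample means: {empirical:.3f}")
print(f"theoretical sigma/sqrt(n):    {theoretical:.3f}")
```

With a few thousand repetitions the two numbers should agree closely, and increasing n shrinks both.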

Confidence Intervals
• Give a plausible range of values that is likely to include the “true” (population) value with a given confidence level.
• An interval estimate provides more information about a population characteristic than does a point estimate.
• Such interval estimates are called confidence intervals.
• CIs also give information about the precision of an estimate.
• How much uncertainty is associated with a point estimate of a population parameter?
• When sampling variability is high, the CI will be wide to reflect the uncertainty of the observation.
• Wider CIs indicate less certainty.
Confidence Intervals
• CIs can also answer the question of whether or not an association exists or a treatment is beneficial or harmful (analogous to p-values).
  e.g., if the CI of an odds ratio includes the value 1.0, we cannot be confident that exposure is associated with disease.
• A CI in general:
  – Takes into consideration variation in sample statistics from sample to sample
  – Is based on observations from one sample
  – Gives information about closeness to unknown population parameters
  – Is stated in terms of a level of confidence
  – Is never 100% sure
Confidence Intervals
General Formula:
The general formula for all CIs is:
point estimate ± (measure of how confident we want to be) × (standard error)
• Point estimate: the value of the statistic in the sample (e.g., mean, odds ratio, etc.).
• Measure of how confident we want to be (critical value): from a Z table or a t table, depending on the sampling distribution of the statistic.
• Standard error: the standard error of the statistic.
Confidence Intervals
Lower limit = Point Estimate - (Critical Value) x (Standard Error)
Upper limit = Point Estimate + (Critical Value) x (Standard Error)
• A wide interval suggests imprecision of estimation.
• Narrow CI widths reflect large sample size or low variability or both.
• Note: Measure of how confident we want to be = critical value = confidence coefficient
• Confidence Level
  – The confidence with which the interval will contain the unknown population parameter
  – A percentage (less than 100%)
  – Example: 95%
  – Also written (1 - α) = 0.95
5.1. Estimation for Single Population
1. CI for a Single Population Mean
A. Known variance or large sample size
There are 3 elements to a CI:
1. Point estimate
2. SE of the point estimate
3. Confidence coefficient
• Consider the task of computing a CI estimate of μ for a population distribution that is normal with σ known.
• Available are data from a random sample of size n.
Assumptions
§ Population standard deviation (σ) is known
§ Sample size is large (n ≥ 30)
§ Population normally distributed
1. CI for a Single Population Mean
• Use the Z-distribution.
• A 100(1 − α)% CI for µ is: x̄ ± Zα/2 · σ/√n
• α is to be chosen by the researcher; the most common values of α are 0.05, 0.01 and 0.1.
• Commonly used CLs are 90%, 95%, and 99%.
1. CI for a Single Population Mean

Confidence Level ((1-α) 100%)    Confidence Coefficient (Zα/2)
99%                              2.576
95%                              1.960
90%                              1.645
80%                              1.282
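The confidence coefficients in this table can be reproduced from the standard normal distribution. The following is a minimal sketch using Python's standard library; the helper name `z_critical` is an assumption introduced here for illustration.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value Z(alpha/2) for a given confidence level."""
    alpha = 1 - confidence
    # The upper alpha/2 tail starts at the (1 - alpha/2) quantile.
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.99, 0.95, 0.90, 0.80):
    print(f"{level:.0%}: {z_critical(level):.3f}")
```

The same function gives the critical value for any other confidence level a researcher might choose.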

1. CI for a Single Population Mean
Finding the Critical Value
[Figure: finding the critical value]
1. CI for a Single Population Mean
Margin of Error (Precision of the estimate)
Margin of Error (e): the amount added to and subtracted from the point estimate to form the confidence interval. For a mean with known σ, e = Zα/2 · σ/√n.
1. CI for a Single Population Mean
Factors Affecting Margin of Error
• The CI for the mean and its margin of error are determined by n, σ or s, and α.
– As n increases, the size of the CI decreases.
– As σ or s increases, the size of the CI increases.
– As the confidence level increases (α decreases), the size of the CI increases.
1. CI for a Single Population Mean
Example
1. Waiting times (in hours) at a particular hospital are believed to be approximately normally distributed with a variance of 2.25 hr².
a. A sample of 20 outpatients revealed a mean waiting time of 1.52 hours. Construct the 95% CI for the estimate of the population mean.
b. Suppose that the mean of 1.52 hours had resulted from a sample of 32 patients. Find the 95% CI.
c. What effect does larger sample size have on the CI?
Solution
a. 1.52 ± 1.96 √(2.25/20) = 1.52 ± 1.96(0.33) = 1.52 ± 0.65 = (0.87, 2.17)
1. CI for a Single Population Mean
• We are 95% confident that the true mean waiting time is between 0.87 and 2.17 hrs.
• Although the true mean may or may not be in this interval, 95% of the intervals formed in this manner will contain the true mean.
• An incorrect interpretation is that there is a 95% probability that this interval contains the true population mean.
b. 1.52 ± 1.96 √(2.25/32) = 1.52 ± 1.96(0.27) = 1.52 ± 0.53 = (0.99, 2.05)
c. A larger sample size makes the CI narrower (more precise).
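The waiting-time example can be reproduced in a few lines. This is a sketch under the example's stated assumptions (normal population, known variance 2.25); the function name `z_interval` is introduced here for illustration. Because the code keeps the standard error unrounded, the limits may differ from the hand calculation in the second decimal place.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(mean, variance, n, confidence=0.95):
    """CI for a population mean when the population variance is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value
    margin = z * sqrt(variance / n)                     # z * sigma / sqrt(n)
    return mean - margin, mean + margin

ci_20 = z_interval(1.52, 2.25, 20)  # part (a): n = 20
ci_32 = z_interval(1.52, 2.25, 32)  # part (b): n = 32
print(f"n = 20: ({ci_20[0]:.2f}, {ci_20[1]:.2f})")
print(f"n = 32: ({ci_32[0]:.2f}, {ci_32[1]:.2f})")
```

Comparing the two widths illustrates part (c): the interval for n = 32 is narrower than the one for n = 20.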
1. CI for a Single Population Mean
• When constructing CIs, it has been assumed that the standard deviation of the underlying population, σ, is known.
• What if σ is not known?
• In practice, if the population mean μ is unknown, then the standard deviation, σ, is probably unknown as well.
• In this case, the population standard deviation σ in the SE can be replaced by the sample standard deviation s if the sample size is large enough (n ≥ 30). With a large sample size, we assume the sample mean is normally distributed.
Example:
• It was found that a sample of 35 patients were 17.2 minutes late for appointments, on average, with an SD of 8 minutes. What is the 90% CI for µ? Ans: 17.2 ± 1.645(8/√35) = (15.0, 19.4).
• Since the sample size is fairly large (≥ 30) and the population SD is unknown, we assume the sample mean to be normally distributed based on the CLT, and use the sample SD to replace the population σ.
1. CI for a Single Population Mean
B. Unknown variance (and small sample size, n < 30)
• What if the σ for the underlying population is unknown and the sample size is small?
• As an alternative we use Student’s t distribution.
Assumptions
§ Population standard deviation (σ) is unknown
§ Sample size is small (n < 30)
§ Population normally distributed
§ If the population is not normal, use the CLT (which requires a large sample)
1. CI for a Single Population Mean
What happens to the CI as the sample gets larger?
Z-based interval: x̄ ± Z · s/√n
t-based interval: x̄ ± t · s/√n
For large samples, Z and t values become almost identical, so the CIs are almost identical.
CI for a Single Population Mean
Degrees of Freedom (df)
df = number of observations that are free to vary after the sample mean has been calculated: df = n − 1.
Example: The mean of 3 numbers is 8.0. Once any two of the numbers are chosen, the third is fixed, so df = 3 − 1 = 2.
CI for a Single Population Mean
[Table: Student’s t table]
CI for a Single Population Mean
[Table: t distribution values, with comparison to the Z value]
CI for a Single Population Mean
Example:

Standard error =

t-value at 90% CL at 19 df =1.729

CI for a Single Population Mean
Exercise:
Compute a 95% CI for the mean birth weight based on n = 10, sample mean = 116.9 and s = 21.70.
From the t table, t9, 0.975 = 2.262.
Ans: 116.9 ± 2.262(21.70/√10) = (101.4, 132.4)
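The birth-weight exercise can be checked as follows. This is a sketch: the helper `t_interval` is a name introduced here, and the critical value 2.262 is taken from the t table as quoted in the exercise rather than computed, since Python's standard library has no t distribution.

```python
from math import sqrt

def t_interval(mean, sd, n, t_crit):
    """CI for a mean with unknown sigma; t_crit comes from a t table (n - 1 df)."""
    margin = t_crit * sd / sqrt(n)  # t * s / sqrt(n)
    return mean - margin, mean + margin

low, high = t_interval(116.9, 21.70, 10, t_crit=2.262)
print(f"({low:.1f}, {high:.1f})")  # prints (101.4, 132.4)
```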
2. CIs for single population proportion, p
An interval estimate for the population proportion (p) can be calculated by adding an allowance for uncertainty to the sample proportion.
It is based on the three elements of a CI:
• Point estimate
• SE of point estimate
• Confidence coefficient
2. CIs for single population proportion, p
Recall that the distribution of the sample proportion is approximately normal if the sample size is large, with standard deviation √(p(1 − p)/n).
We will estimate this with sample data: SE(p̂) = √(p̂(1 − p̂)/n).
Upper and lower confidence limits for the population proportion are calculated with the formula:
p̂ ± Zα/2 · √(p̂(1 − p̂)/n)
Lower limit = Point Estimate - (Critical Value) x (Standard Error of Estimate)
Upper limit = Point Estimate + (Critical Value) x (Standard Error of Estimate)
Hence, p̂ ± 1.96 · √(p̂(1 − p̂)/n) is an approximate 95% CI for the true proportion p.
2. CIs for single population proportion, p
Example: A random sample of 100 people shows that 25 are left-handed. Form a 95% CI for the true proportion of left-handers.
p̂ = 25/100 = 0.25; 0.25 ± 1.96 √(0.25(0.75)/100) = 0.25 ± 0.0849 = (0.1651, 0.3349)
Interpretation:
• We are 95% confident that the true percentage of left-handers in the population is between 16.51% and 33.49%.
• Although this range may or may not contain the true proportion, 95% of the intervals formed from samples of size 100 in this manner will contain the true proportion.
2. CIs for single population proportion, p
Changing the sample size
An increase in the sample size reduces the width of the confidence interval.
Example: If the sample size in the above example is doubled to 200, and if 50 are left-handed in the sample, then the interval is still centered at 0.25, but it narrows to (0.19, 0.31).
Example: It was found that 28.1% of 153 cervical-cancer cases had never had a Pap smear prior to the time of the case’s diagnosis. Calculate a 95% CI for the percentage of cervical-cancer cases who never had a Pap test.
A 95% CI is given by 0.281 ± 1.96 √(0.281(0.719)/153).
2. CIs for single population proportion, p
Example:
Suppose that among 10,000 female operating-room nurses, 60 women have developed
breast cancer over five years. Find the 95% CI for p based on the point estimate.
 Point estimate = 60/10,000 = 0.006

 The 95% CI for p is given by the interval:

The 95% CI for p is:
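The proportion intervals in the examples above (left-handers, Pap smear, operating-room nurses) all follow the same recipe, which can be sketched as below. The function name `proportion_interval` is an assumption introduced for illustration, and the normal approximation is assumed to be adequate for each sample.

```python
from math import sqrt

def proportion_interval(p_hat, n, z=1.96):
    """Approximate CI for a proportion: p_hat +/- z * sqrt(p_hat(1 - p_hat)/n)."""
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

left_handed = proportion_interval(25 / 100, 100)    # left-handers, n = 100
doubled = proportion_interval(50 / 200, 200)        # same proportion, n = 200
pap = proportion_interval(0.281, 153)               # never had a Pap smear
nurses = proportion_interval(60 / 10_000, 10_000)   # breast cancer in nurses

for label, (low, high) in [("left-handed", left_handed), ("n doubled", doubled),
                           ("Pap smear", pap), ("nurses", nurses)]:
    print(f"{label}: ({low:.4f}, {high:.4f})")
```

Running it shows how the interval narrows as n grows while the point estimate stays fixed.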

Estimation for Two Populations
3. CI for the difference between population means
A. Known variances (2 independent samples)
When σ1 and σ2 are known and both populations are normal or both sample sizes are at least 30, the critical value is a z-value and the standard error for the mean difference is
SE(x̄1 − x̄2) = √(σ1²/n1 + σ2²/n2)
The point estimate for the difference is x̄1 − x̄2.
The confidence interval for µ1 − µ2 is:
(x̄1 − x̄2) ± Zα/2 · √(σ1²/n1 + σ2²/n2)
3. CI for the difference between population means
Assumptions
• Samples are randomly and independently drawn
• Population distributions are normal or both sample sizes are ≥ 30
• Population standard deviations are known
Illustration
• A researcher performs a drug trial involving two independent groups.
• A control group is treated with a placebo while, separately, the intervention group is treated with an active agent.
• Interest is in a comparison of the mean control response with the mean intervention response under the assumption that the responses are independent.
3. CI for the difference between population means

Example
 We are interested in the similarity of the two groups.

1) Is mean blood pressure the same for males and females?


2) Is body mass index (BMI) similar for breast cancer cases versus non-
cancer patients?
3) Is length of stay (LOS) for patients in hospital “A” the same as that for
similar patients in hospital “B”?
Thus, evidence of similarity of the two groups is reflected in a difference
between means that is “near” zero.
3. CI for the difference between population means
Example:
• Researchers are interested in the difference between serum uric acid levels in patients with and without Down’s syndrome.
• Patients without Down’s syndrome: n = 12, sample mean = 4.5 mg/100ml, σ² = 1.0
• Patients with Down’s syndrome: n = 15, sample mean = 3.4 mg/100ml, σ² = 1.5
• Calculate the 95% CI.
• SE = √(1.0/12 + 1.5/15) = 0.43; 95% CI = 1.1 ± 1.96(0.43) = (0.26, 1.94)
• We are 95% confident that the true difference between the two population means is between 0.26 and 1.94.
3. CI for the difference between population means
B. Unknown variances (Independent samples)
I. Population variances equal (large sample)
Assumptions:
• Samples are randomly and independently drawn
• Both of the sample sizes are ≥ 30
• Population standard deviations are unknown
Forming confidence estimates:
• Use the sample standard deviations s1 and s2 to estimate σ1 and σ2; the critical value is a z-value.
• The confidence interval for µ1 − µ2 is (x̄1 − x̄2) ± Zα/2 · √(s1²/n1 + s2²/n2)
3. CI for the difference between population means
Example:
• The mean CD4+ cell count for 112 men with HIV infection was 401.8 with an SD of 226.4. For 75 men without HIV, the mean and SD were 828.2 and 274.9, respectively. Calculate a 99% CI for the difference between population means.
• SE of the difference between the two means = √(226.4²/112 + 274.9²/75) = 38.28
• 99% CI = 426.4 ± 2.58(38.28) = (327.6, 525.2)
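Both z-based two-sample examples (serum uric acid with known variances, and CD4+ counts with large samples) use the same arithmetic, sketched below. The function name `diff_means_interval` is an assumption introduced here; it takes variances, so sample SDs are squared before being passed in, and the critical values 1.96 and 2.58 are the ones used in the examples.

```python
from math import sqrt

def diff_means_interval(mean1, var1, n1, mean2, var2, n2, z):
    """z-based CI for mu1 - mu2 from two independent samples."""
    se = sqrt(var1 / n1 + var2 / n2)  # SE of the difference in means
    diff = mean1 - mean2
    return se, (diff - z * se, diff + z * se)

# Serum uric acid: without vs with Down's syndrome, known variances, 95% CI.
se_ua, ci_ua = diff_means_interval(4.5, 1.0, 12, 3.4, 1.5, 15, z=1.96)

# CD4+ counts: men without vs with HIV, sample SDs, 99% CI.
se_cd4, ci_cd4 = diff_means_interval(828.2, 274.9**2, 75,
                                     401.8, 226.4**2, 112, z=2.58)

print(f"uric acid: SE = {se_ua:.2f}, CI = ({ci_ua[0]:.2f}, {ci_ua[1]:.2f})")
print(f"CD4+:      SE = {se_cd4:.2f}, CI = ({ci_cd4[0]:.1f}, {ci_cd4[1]:.1f})")
```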

3. CI for the difference between population means
II. Population variances equal (small sample)
Assumptions:
• Populations are normally distributed
• The populations have equal variances
• Samples are independent
• One or both sample sizes are < 30
• Population standard deviations are unknown
• If 0.5 ≤ s1²/s2² ≤ 2, then we assume that the population variances are equal.
Forming confidence estimates:
• The population variances are assumed equal, so use the two sample standard deviations and pool them to estimate σ.
• The critical value is a t value with (n1 + n2 – 2) degrees of freedom.
• The pooled estimate (s²p) is the weighted average of the two sample variances.
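3. CI for the difference between population means
• The pooled standard deviation is: sp = √(((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2))
• The standard error of the estimate is given by: SE = sp · √(1/n1 + 1/n2)
• The confidence interval for µ1 − µ2 is: (x̄1 − x̄2) ± tα/2 · sp · √(1/n1 + 1/n2), with n1 + n2 − 2 df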
3. CI for the difference between population means
Example:
A study was conducted to compare the serum iron levels of children with cystic fibrosis to those of healthy children. Serum iron levels were measured for random samples of n1 = 9 healthy children and n2 = 13 children with cystic fibrosis.
The two underlying populations of serum iron levels are independent and normally distributed.
• The difference in sample means can be used as a point estimate for the true difference in population means µ1 − µ2.
• We could also construct a confidence interval for µ1 − µ2.
• The t-value at 95% CL with 20 df is 2.086.
• The resulting 95% confidence interval for µ1 − µ2 is (1.4, 12.6).
Example:
Birth weights of children born to 14 heavy smokers (group 1) and to 15 non-smokers (group 2) were sampled from live births at a large teaching hospital. For the heavy smokers, sample mean = 3.17 kg and SD = 0.46 kg; for non-smokers, sample mean = 3.63 kg and SD = 0.36 kg.
Sp = 0.4112, SE = 0.1528, t-value at 27 df = 2.05
95% CI for µ2 − µ1 = 0.46 ± 2.05(0.1528) = (0.15, 0.77)
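The pooled-variance calculation for the birth-weight example can be sketched as follows. The function name `pooled_interval` is an assumption introduced here, and the t critical value 2.05 (27 df) is the one quoted in the example. The difference is taken as non-smokers minus heavy smokers; results computed without intermediate rounding may differ from hand-rounded figures in the third or fourth decimal place.

```python
from math import sqrt

def pooled_interval(mean1, sd1, n1, mean2, sd2, n2, t_crit):
    """Pooled-variance t interval for mu1 - mu2 (equal variances assumed)."""
    df = n1 + n2 - 2
    # Pooled SD: weighted average of the two sample variances, then square root.
    sp = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    se = sp * sqrt(1 / n1 + 1 / n2)
    diff = mean1 - mean2
    return sp, se, (diff - t_crit * se, diff + t_crit * se)

# Non-smokers (n = 15) minus heavy smokers (n = 14), birth weight in kg.
sp, se, (low, high) = pooled_interval(3.63, 0.36, 15, 3.17, 0.46, 14, t_crit=2.05)
print(f"sp = {sp:.4f}, SE = {se:.4f}, 95% CI = ({low:.2f}, {high:.2f})")
```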
3. CI for the difference between population means
III. Population variances unequal (small sample)
• The confidence interval for µ1 − µ2 is: (x̄1 − x̄2) ± t(d’) · √(s1²/n1 + s2²/n2)
• Where the degrees of freedom (d’) are given by the Welch-Satterthwaite approximation:
  d’ = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
• Round d’ down to the nearest integer.
3. CI for the difference between population means
Example:
• For the tuberculosis meningitis example, a random sample of n1 = 37 HIV infected patients has mean age at diagnosis x̄1 years and standard deviation s1 = 5.6 years.
• A sample of n2 = 19 uninfected patients has mean age at diagnosis x̄2 years and standard deviation s2 = 21.7 years.
• d’ = (5.6²/37 + 21.7²/19)² / [ (5.6²/37)²/36 + (21.7²/19)²/18 ] = 19.24 ≈ 19
• For a t distribution with 19 df at 95% CL, the t-value is 2.093.
• Therefore, a 95% confidence interval would take the form (x̄1 − x̄2) ± 2.093 · √(s1²/n1 + s2²/n2)
• Using the data from the two samples of patients with tuberculosis meningitis, the 95% CI for μ1 − μ2 is (−21.5, −0.3).
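The Welch-Satterthwaite degrees of freedom for this example can be checked with a short sketch. The function name `welch_df` is an assumption introduced here; only the sample sizes and standard deviations given in the example are used, so the code reports the degrees of freedom, the standard error and the margin of error rather than the interval itself.

```python
from math import floor, sqrt

def welch_df(sd1, n1, sd2, n2):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2  # per-sample variance of the mean
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

df = welch_df(5.6, 37, 21.7, 19)
se = sqrt(5.6**2 / 37 + 21.7**2 / 19)  # SE of the difference in means
margin = 2.093 * se                    # t-value for 19 df at 95% CL

print(f"d' = {df:.2f}, rounded down to {floor(df)}")
print(f"SE = {se:.2f}, margin of error = {margin:.1f}")
```

The margin of error of about 10.6 matches the half-width of the interval (−21.5, −0.3).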
3. CI for the difference between population means
C. Paired Samples
§ Concerns the means of 2 related populations
  – Paired or matched samples
  – Repeated measures (before/after)
  – Use the difference between paired values: d = x1 − x2
§ Eliminates variation among subjects
Assumptions:
§ Both populations are normally distributed,
§ Or, if not normal, use large samples.
• Paired data arise when each individual in a sample is measured twice.
• Measurements might be “pre/post”, “before/after”, “right/left”, “parent/child”, etc.
3. CI for the difference between population means

Examples of paired data


1. Blood pressure prior to and following treatment,
2. Number of cigarettes smoked per week measured prior to and following
participation in a smoking cessation program,
3. Number of sex partners in the month prior to and in the month following
an HIV education campaign.
 Notice in each of these examples that the two occasions of measurement
are linked by virtue of the two measurements being made on the same
individual.
 Longitudinal or follow-up study

3. CI for the difference between population means
Paired differences
• If two measurements of the same phenomenon (e.g. blood pressure, # cigarettes/week, etc.) X and Y are measured on an individual and if each is normally distributed, then their difference is also normally distributed.
• The interest is in the difference between the two measurements.
• The ith paired difference is di, where di = xi − yi.
• The point estimate for the population mean paired difference is d̄ = Σdi / n.
3. CI for the difference between population means
• The sample standard deviation of the differences is sd = √(Σ(di − d̄)² / (n − 1))
• n is the number of pairs in the paired sample
• The confidence interval for the paired-sample mean difference is d̄ ± tα/2 · sd/√n
• Where tα/2 has n − 1 df.
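The paired-sample interval can be sketched in code as below. The blood pressure readings are made-up values for illustration only and are not data from these slides; the function name `paired_interval` is likewise an assumption. The critical value 2.262 is the tabulated t value for 9 df at the 95% level.

```python
from math import sqrt
from statistics import mean, stdev

def paired_interval(before, after, t_crit):
    """CI for the mean paired difference: d_bar +/- t * s_d / sqrt(n)."""
    diffs = [b - a for b, a in zip(before, after)]  # one difference per pair
    d_bar = mean(diffs)
    s_d = stdev(diffs)                              # uses n - 1 in the denominator
    margin = t_crit * s_d / sqrt(len(diffs))
    return d_bar, s_d, (d_bar - margin, d_bar + margin)

# Hypothetical SBP readings (mm Hg) for 10 patients; NOT the slide data.
before = [180, 172, 165, 190, 176, 168, 185, 174, 179, 171]
after = [170, 165, 160, 178, 170, 166, 172, 169, 170, 165]

d_bar, s_d, (low, high) = paired_interval(before, after, t_crit=2.262)
print(f"mean difference = {d_bar:.1f}, SD = {s_d:.2f}")
print(f"95% CI = ({low:.2f}, {high:.2f})")
```

Working with the differences reduces the problem to a one-sample t interval, which is why the degrees of freedom are n − 1 pairs.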
3. CI for the difference between population means
Example:
• Ten hypertensive patients are screened at a neighborhood health clinic and are given methyl dopa, a strong antihypertensive medication for their condition. They are asked to come back 1 week later and have their blood pressures measured again. Suppose the initial and follow-up SBPs (mm Hg) of the patients are given below.
1. What are the mean and SD of the difference?
2. What is the standard error of the mean?
3. Assuming that the difference is normally distributed, construct a 95% CI for µd.
Solution
We have the following data and summary statistics
4. Two Population Proportions
• We are often interested in comparing proportions from 2 populations:
  – Is the incidence of disease A the same in two populations?
  – Patients are treated with either drug D, or with placebo. Is the proportion “improved” the same in both groups?
Goal: Form a confidence interval for, or test a hypothesis about, the difference between two population proportions, p1 − p2.
Assumptions:
• The point estimate for the difference is p̂1 − p̂2.
4. Two Population Proportions
• Confidence Interval for Two Population Proportions
• SE of the difference = √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)
• The confidence interval for p1 − p2 is: (p̂1 − p̂2) ± Zα/2 · SE
• The following formula is also equally used
• An approximate 95% confidence interval takes the form (p̂1 − p̂2) ± 1.96 · SE
4. Two Population Proportions
Example: In a clinical trial for a new drug to treat hypertension, n1 = 50 patients were randomly assigned to receive the new drug, and n2 = 50 patients to receive a placebo. 34 of the patients receiving the drug showed improvement, while 15 of those receiving placebo showed improvement.
• Compute a 95% CI estimate for the difference between proportions improved.
• p̂1 = 34/50 = 0.68, p̂2 = 15/50 = 0.30
• The point estimate for the difference is: p̂1 − p̂2 = 0.68 − 0.30 = 0.38
• SE of the difference = √(0.68(0.32)/50 + 0.30(0.70)/50) = 0.0925
• 95% CI
  – Lower = (point estimate) − (Zα/2)(SE) = 0.38 − (1.96)(0.0925) = 0.20
  – Upper = (point estimate) + (Zα/2)(SE) = 0.38 + (1.96)(0.0925) = 0.56
  – 95% CI = (0.20, 0.56)
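The hypertension trial calculation can be reproduced with a short sketch. The function name `diff_proportions_interval` is an assumption introduced here for illustration; it uses the unpooled standard error shown in the example.

```python
from math import sqrt

def diff_proportions_interval(x1, n1, x2, n2, z=1.96):
    """Approximate CI for p1 - p2 from two independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return se, (diff - z * se, diff + z * se)

# 34 of 50 improved on the drug, 15 of 50 improved on placebo.
se, (low, high) = diff_proportions_interval(34, 50, 15, 50)
print(f"SE = {se:.4f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Because the interval excludes 0, the data are consistent with a real difference in improvement between the two groups.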
5. Sample size estimation for cross sectional studies: mean and proportion estimation
• In planning any investigation we must decide how many people need to be studied in order to answer the study objectives.
• If the study is too small we may fail to detect important effects, or may estimate effects too imprecisely.
• If the study is too large then we will waste resources.
• The eventual sample size is usually a compromise between what is desirable and what is feasible.
• The feasible sample size is determined by the availability of resources.
Sample Size
• If the sample (“n”) is
  – Large: increased accuracy, but costly / complex
  – Small: decreased accuracy, but less costly
• Take the optimum sample. How?
Factors to determine sample size
• Size of population
• Resources – subjects, financial, manpower
• Method of sampling – random, stratified
• Degree of difference to be detected
• Variability (S.D.) – pilot study, historical
• Degree of accuracy (or errors)
  - Type I error (alpha): p < 0.05
  - Type II error (beta): less than 0.2 (20%)
  - Power of the test: more than 0.8 (80%)
• Dropout rate, non-compliance
Sample size for Single population
To estimate the sample size for a single survey using simple or systematic random sampling, we need to know:
• Estimate of the prevalence of the outcome
• Precision desired
• Design effect
• Size of total population
• Level of confidence (always use 95%)
This is the situation in which the variable of interest is categorical.
The possible sources of this proportion are:
– the results of a previous study,
– a pilot study,
– the judgment of the researcher,
– simply taking 50%.
Sample size for Single population
Then the formula for the sample size of single population proportion is defined as:
n = (Zα/2)² · p(1 − p) / w²
Where
• α = the level of significance, which can be obtained as 1 − confidence level
• p = best estimate of the population proportion
• w = maximum acceptable difference
• Zα/2 = the value from the standard normal table for the given confidence level
§ Consider the total size of the population (N): if N < 10,000 then we need to correct the formula, using
  nf = n0 / (1 + n0/N)
§ Where nf = final sample size, n0 = sample size from the above formula and N = total population. Take the design effect into account if needed.
Sample size for Single population
• Example:
One MPH student wants to conduct research on the prevalence of ANC utilization among mothers in Sululta town. Given that the prevalence from a previous study was found to be 45.7%, what sample size should he take to address his objective?
Solution:
– Margin of error w = 5%
– A confidence level of 95% gives Zα/2 = 1.96.
– Then using the formula
  n = (Zα/2)² · p(1 − p) / w² = (1.96)² · 0.457(1 − 0.457) / (0.05)² = (1.96)² · 0.457(0.543) / (0.05)² = 381.3 ≈ 382
Sample size for single population mean
This is the condition in which the research question is about a mean.
Standard deviation (σ) of the population:
It is rare that a researcher knows the exact standard deviation of the population.
Typically, the standard deviation of the population is estimated:
– from the results of a previous survey,
– from a pilot study,
– from secondary data,
– from the judgment of the researcher.
Sample size for single population mean
Maximum acceptable difference (w): This is the maximum amount of error that you are willing to accept.
Desired confidence level (Zα/2): is your level of certainty that the sample mean does not differ from the true population mean by more than the maximum acceptable difference. Commonly we use a 95% confidence level.
Then the sample size determination formula for single population mean is defined by:
n = (Zα/2)² · σ² / w²
Where
• α = the level of significance, which can be obtained as 1 − confidence level
• σ = standard deviation of the population
• w = maximum acceptable difference
• Zα/2 = the value from the standard normal table for the given confidence level
Comparison of two proportions
n = (Zα/2 + Zβ)² · (p1(1 − p1) + p2(1 − p2)) / (p1 − p2)²
For a specified pair of values p1 and p2, we can find the sample sizes n1 = n2 = n required to give a test of size α that has a specified type II error β.
Example: d = p1 − p2 = 0.7 − 0.5 = 0.2
β = 0.10, α = 0.05; Z1−α/2 = 1.96
Power = 1 − β = 0.90; Z1−β = 1.282
n = (1.96 + 1.282)² · (0.7(0.3) + 0.5(0.5)) / (0.7 − 0.5)² = 10.51(0.46) / 0.04 = 4.835 / 0.04 = 120.9 ≈ 121 per group
Using the pooled proportion p̄ = (p1 + p2)/2 = (0.7 + 0.5)/2 = 0.6 instead, the total sample size for both groups is 4(Zα/2 + Zβ)² · p̄(1 − p̄) / (p1 − p2)² = 10.09 / 0.04 = 252.25 ≈ 253.
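The two-proportion sample size can be evaluated numerically as a check on the arithmetic. The sketch below is illustrative: both function names are assumptions, the first implements the per-group formula as displayed above, and the second implements a pooled-proportion variant, 4(Zα/2 + Zβ)² p̄(1 − p̄)/(p1 − p2)², which is an assumption about how a total of 253 across both groups can arise.

```python
from math import ceil

def n_per_group(p1, p2, z_alpha, z_beta):
    """Per-group n: (z_a + z_b)^2 * (p1 q1 + p2 q2) / (p1 - p2)^2."""
    return ((z_alpha + z_beta) ** 2
            * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)

def n_total_pooled(p1, p2, z_alpha, z_beta):
    """Total n over both groups using the pooled proportion p_bar."""
    p_bar = (p1 + p2) / 2
    return 4 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2

per_group = n_per_group(0.7, 0.5, z_alpha=1.96, z_beta=1.282)
total = n_total_pooled(0.7, 0.5, z_alpha=1.96, z_beta=1.282)
print(f"per group: {per_group:.1f} -> {ceil(per_group)}")
print(f"total (pooled variant): {total:.2f} -> {ceil(total)}")
```

The per-group formula gives about 121 subjects in each arm, while the pooled variant gives about 253 subjects in total.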
Comparison of two means (sample size in each group)
• For the two-sided alternative hypothesis with significance level α, the sample size n1 = n2 = n required to detect a true difference in means of µ1 − µ2 with power at least 1 – β is:
  n = 2 · [ (Zβ + Zα/2) · σ / (µ1 − µ2) ]²
  Then the total sample size is equal to
  n = 4 · [ (Zβ + Zα/2) · σ / (µ1 − µ2) ]²
• For a one-sided alternative hypothesis with significance level α, this sample size is given by:
  n = 2 · [ (Zβ + Zα) · σ / (µ1 − µ2) ]²
Comparison of two means (sample size in each group)
Example: We are interested in the size for a sample from a population of blood cholesterol levels. We know that typically σ is about 30 mg/dl for these populations. How large a sample would be needed for comparing two approaches to cholesterol lowering using α = 0.05, to detect a difference of d = 20 mg/dl or more with Power = 1 − β = 0.90?
Solution:
σ = 30 mg/dl, β = 0.10, α = 0.05; Z1−α/2 = 1.96
Power = 1 − β; Z1−β = 1.282; µ1 − µ2 = 20 mg/dl.
The required total sample size is:
n = 4σ² · (Zβ + Zα/2)² / (µ1 − µ2)² = 4(30)²(1.96 + 1.282)² / (20)² = 4 × 900 × (3.242)² / 400 = 37838.03 / 400 = 94.6 ≈ 95
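The cholesterol example can be verified with a short sketch. The function name `n_per_group_two_means` is an assumption introduced for illustration; it implements the two-sided per-group formula, and the total is twice that value.

```python
from math import ceil

def n_per_group_two_means(sigma, delta, z_alpha, z_beta):
    """Per-group n = 2 * ((z_beta + z_alpha) * sigma / delta)^2."""
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

per_group = n_per_group_two_means(sigma=30, delta=20, z_alpha=1.96, z_beta=1.282)
total = 2 * per_group  # same as 4 * ((z_beta + z_alpha) * sigma / delta)^2

print(f"per group: {per_group:.1f}, total: {total:.1f} -> {ceil(total)}")
```

In practice each group is rounded up separately (48 per group here), which can make the enrolled total one or two larger than the rounded total of 95.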
Exercise
• An investigator is planning a clinical trial to evaluate the efficacy of a new drug designed to reduce systolic blood pressure. The plan is to enroll participants and to randomly assign them to receive either the new drug or a placebo. Systolic blood pressures will be measured in each participant after 12 weeks on the assigned treatment. Based on prior experience with similar trials, the investigator expects that 10% of all participants will be lost to follow up or will drop out of the study. If the new drug shows a 5 unit reduction in mean systolic blood pressure, this would represent a clinically meaningful reduction. How many patients should be enrolled in the trial to ensure that the power of the test is 80% to detect this difference? A two-sided test will be used with a 5% level of significance. Assume that the standard deviation of systolic blood pressure is 19.0.
Z1−α/2 = 1.96
Z1−β = 0.84
Determine the minimum sample size.
6. Hypothesis testing
• Definition of technical terms
• Type I and type II errors
• Test of significance
• Test on mean
• Test on proportion
• Common test of significance
• Chi-square test
• T-test
• Z-test
6. Hypothesis testing
The purpose of hypothesis testing is to aid the clinician, researcher or administrator in reaching a decision (conclusion) concerning a population by examining a sample from that population.
Hypothesis
• Is a statement about one or more populations
• Is a claim (assumption) about a population parameter
• Is frequently concerned with the parameters of the population about which the statement is made.
• Is a formal scientific process that accounts for statistical uncertainty
6. Hypothesis testing
• Two types of hypothesis:
Research hypothesis:
  – Is the supposition or conjecture that motivates the research. It may be proposed after numerous repeated observations.
  – A research hypothesis is generated about an unknown population parameter.
  – It leads directly to statistical hypotheses.
Statistical hypothesis:
  – Stated in such a way that it can be evaluated by using an appropriate statistical technique.
6. Hypothesis testing
Examples of Research Hypotheses
Population Mean
• The average length of stay of patients admitted to the hospital is five days.
• The mean birthweight of babies delivered by mothers with low SES is
lower than that of babies delivered by mothers with higher SES.
Population Proportion
• The proportion of adult smokers in Addis Ababa is believed to be p = 0.40.
• The prevalence of HIV among non-married adults is higher than that among
married adults.

6. Hypothesis testing
Types of Hypothesis (statistical)
1. The Null Hypothesis, H0
• Is a statement claiming that there is no difference between the
hypothesized value and the population value (the effect of interest is zero)
• No difference, no change
• States the assumption (hypothesis) to be tested

6. Hypothesis testing
• H0 is a statement of agreement (or no difference)
• H0 is always about a population parameter, not about a sample statistic
• Begin with the assumption that H0 is true (similar to the notion of
innocent until proven guilty)
• Always contains the “=” sign
• May or may not be rejected

6. Hypothesis testing
2. Research Hypothesis (The Alternative Hypothesis), HA
• Is a statement of what we will believe is true if our sample data
cause us to reject H0
• Is generally the hypothesis that is believed (or needs to be supported)
by the researcher
• Is a statement that disagrees with (opposes) H0 (the effect of interest is
not zero)
• Never contains the “=” sign
• May or may not be accepted

Steps in Hypothesis Testing
1. Formulate the appropriate statistical hypotheses clearly
• Specify H0 and HA
H0: µ = µ0      H0: µ = µ0      H0: µ = µ0
HA: µ ≠ µ0      HA: µ > µ0      HA: µ < µ0
two-tailed      one-tailed      one-tailed
2. Set up a suitable significance level:
• The level of significance is the probability of rejecting a true null
hypothesis.
• It is therefore the probability of committing a type I error.

Steps in Hypothesis Testing
• It is usually denoted by α and should be specified before any
samples are drawn.
• The level of significance is an arbitrarily chosen small number, usually
0.05 or 0.01.

Errors in Hypothesis Tests

In fact        | Inference: reject H0 | Inference: do not reject H0
H0 is true     | Type I error (α)     | Correct decision (1 − α)
H0 is false    | Power (1 − β)        | Type II error (β)

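Type I error & type II error
[Figure: sampling distributions centered at µ0 (under H0) and µ1 (under HA). The critical value separates the "Do not reject H0" region from the "Reject H0" region; the area 1 − β under the HA curve is the power.]
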
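
The relationship between α, β and power can be made concrete with a small calculation. The sketch below assumes a two-sided Z test with known σ; the function name and the example values (µ0 = 30, true mean 27, σ² = 20, n = 10) are illustrative assumptions, not results stated on the slides.

```python
from statistics import NormalDist

def power_two_sided_z(mu0, mu1, sigma, n, alpha=0.05):
    """Probability of rejecting H0: mu = mu0 when the true mean is mu1."""
    std_norm = NormalDist()
    z_crit = std_norm.inv_cdf(1 - alpha / 2)         # critical value, e.g. 1.96
    shift = (mu1 - mu0) / (sigma / n ** 0.5)         # mean of Z under the alternative
    # Under the alternative Z ~ N(shift, 1); reject when |Z| > z_crit
    upper = 1 - std_norm.cdf(z_crit - shift)
    lower = std_norm.cdf(-z_crit - shift)
    return upper + lower

power = power_two_sided_z(mu0=30, mu1=27, sigma=20 ** 0.5, n=10)
type_ii_error = 1 - power                            # beta

# When the true mean equals mu0, the rejection probability is alpha itself
size = power_two_sided_z(mu0=30, mu1=30, sigma=20 ** 0.5, n=10)
```

With these assumed values the power is a little over one half, so a true mean of 27 would be missed in a large share of samples of size 10.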
Steps in Hypothesis Testing
3. Decide on the appropriate test statistic for the hypothesis. E.g., for one
population mean:
Z = (x̄ − µ0) / (σ/√n)   OR   t = (x̄ − µ0) / (s/√n)
4. Determine the critical region: the set of test-statistic values for which
H0 is rejected (the rejection region). The remaining values form the
acceptance (non-rejection) region of H0.

Steps in Hypothesis Testing
5. Doing the computation: compute the test statistic and other results from
the sample, then see whether the sample result falls in the rejection region
or in the acceptance region.
6. Making the decision: finally, we draw statistical conclusions. A statistical
decision consists of either rejecting the null hypothesis or failing to reject it.
Remark:
1. From the z or t table, if the absolute value of the calculated statistic is
greater than the tabulated value (two-tailed test), the null hypothesis is
rejected, i.e. the statistical results are significant.

Steps in Hypothesis Testing
2. The p-value
• The p-value is the probability of obtaining a value of the test statistic at
least as extreme as that observed, if the null hypothesis is true.
• The p-value for a test of a hypothesis is the smallest value of α for
which the null hypothesis can be rejected.
• When the p-value is below the cut-off level (α), say 0.05, the result is
called statistically significant; when it is above 0.05 it is called not
significant.
• Reject H0 if p-value < α; do not reject H0 if p-value ≥ α.

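
As a rough illustration of how a p-value is obtained from a calculated Z, the sketch below uses the standard normal distribution from Python's standard library. The function name, the labels for the alternatives and the illustrative value z = −2.12 are assumptions made for the example.

```python
from statistics import NormalDist

def p_value_z(z_cal, alternative="two-sided"):
    """p-value of a Z statistic for HA: 'two-sided', 'less' or 'greater'."""
    cdf = NormalDist().cdf
    if alternative == "less":        # HA: parameter < hypothesized value
        return cdf(z_cal)
    if alternative == "greater":     # HA: parameter > hypothesized value
        return 1 - cdf(z_cal)
    return 2 * cdf(-abs(z_cal))      # both tails

p_two = p_value_z(-2.12)             # two-tailed
p_low = p_value_z(-2.12, "less")     # lower-tailed
print(round(p_two, 4), round(p_low, 4))
```

The two-tailed p-value is simply twice the one-tailed area, which is why the same Z can be significant in a one-tailed test and not in a two-tailed one.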
Steps in Hypothesis Testing
• In a one-tail test, the rejection region is at one end of the distribution or the other.
  • Decision rule: reject H0 if Zcal > Ztab (upper tail) or if Zcal < −Ztab (lower tail)
• In a two-tail test, the rejection region is split between the two tails.
  • Decision rule: reject H0 if |Zcal| > Ztab
• Which one is used depends on the way the alternative hypothesis is written.
• The same is true for the t-test.

Rules for Stating Statistical Hypotheses
1. One population
• An indication of equality (either =, ≤ or ≥) must appear in H0.
H0: μ = μ0, HA: μ ≠ μ0
H0: P = P0, HA: P ≠ P0
• Can we conclude that a certain population mean is
  • not 30? H0: μ = 30 and HA: μ ≠ 30
  • greater than 50? H0: μ = 50 and HA: μ > 50
• Can we conclude that the proportion of patients with leukemia who survive more
than six years is not 60%? H0: P = 0.6 and HA: P ≠ 0.6
Rules for Stating Statistical Hypotheses
In summary,
1. What you hope to conclude should be placed in the HA.
2. The H0 should have a statement of equality, =.
3. The H0 is the hypothesis that is tested
4. The H0 and HA are complementary.

1. Hypothesis Testing of a Single Mean
1.1 Known Variance
Example: Two-Tailed Test
1. A simple random sample of 10 people from a certain population has a mean age
of 27. Can we conclude that the mean age of the population is not 30? The
population variance is 20. Let α = 0.05.
A. Data
n = 10, sample mean x̄ = 27, σ² = 20, α = 0.05
B. Assumptions
Simple random sample
Normally distributed population
C. Hypotheses
H0: µ = 30
HA: µ ≠ 30
D. Test statistic
As the population variance is known, we use Z as the test statistic:
Z = (x̄ − µ0) / (σ/√n)

1. Hypothesis Testing of a Single Mean
E. Decision rule
• Reject H0 if the Zcal value falls in the rejection region.
• Do not reject H0 if the Zcal value falls in the non-rejection region.
• Because of the structure of HA, it is a two-tail test. Therefore, reject H0
if Zcal < −1.96 or Zcal > 1.96, i.e. if |Zcal| > 1.96.
F. Calculation of test statistic
Z = (27 − 30) / (√20/√10) = −3/1.414 = −2.12
G. Statistical decision
We reject H0 because Z = −2.12 is in the rejection region (−2.12 < −1.96). The
value is significant at the 5% level.
H. Conclusion
We conclude that µ is not 30. P-value = 0.0340 < 0.05.
A Z value of −2.12 corresponds to a tail area of 0.0170. Since there are two parts to
the rejection region in a two-tail test, the P-value is twice this, which is 0.0340.

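
The steps of this example can be sketched in code. The function below is a minimal illustration, assuming a known population variance; its name, the `tail` labels and the returned tuple are choices made for this sketch. Critical values come from the standard normal distribution in Python's standard library.

```python
from statistics import NormalDist

def z_test_mean(x_bar, mu0, sigma2, n, alpha=0.05, tail="two"):
    """Return (Zcal, Ztab, reject) for a one-sample Z test with known variance."""
    z_cal = (x_bar - mu0) / (sigma2 / n) ** 0.5
    std_norm = NormalDist()
    if tail == "two":                     # HA: mu != mu0
        z_tab = std_norm.inv_cdf(1 - alpha / 2)
        reject = abs(z_cal) > z_tab
    elif tail == "lower":                 # HA: mu < mu0
        z_tab = std_norm.inv_cdf(alpha)
        reject = z_cal < z_tab
    else:                                 # HA: mu > mu0
        z_tab = std_norm.inv_cdf(1 - alpha)
        reject = z_cal > z_tab
    return z_cal, z_tab, reject

# Data from the example: n = 10, sample mean 27, variance 20, mu0 = 30
z_cal, z_tab, reject = z_test_mean(27, 30, 20, 10)
print(round(z_cal, 2), round(z_tab, 2), reject)
```

Calling the same function with `tail="lower"` applies the one-tailed rule, where the whole of α sits in the left tail and the critical value is about −1.645.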
1. Hypothesis Testing of a Single Mean
Hypothesis test using confidence interval
• A problem like the above example can also be solved using a
confidence interval.
• A confidence interval will show that the hypothesized value µ0 = 30 does
not fall within the boundaries of the interval. However, it will not
give a probability (p-value).
• Confidence interval: x̄ ± Zα/2 · σ/√n = 27 ± 1.96 × 1.414 = 27 ± 2.77 = (24.23, 29.77)

1. Hypothesis Testing of a Single Mean
Example: One-Tailed Test
A simple random sample of 10 people from a certain population has a
mean age of 27. Can we conclude that the mean age of the population is
less than 30? The population variance is known to be 20. Let α = 0.05.
• Data
n = 10, sample mean x̄ = 27, σ² = 20, α = 0.05
• Hypotheses
H0: µ = 30, HA: µ < 30

1. Hypothesis Testing of a Single Mean
• Test statistic
Z = (x̄ − µ0) / (σ/√n) = (27 − 30) / (√20/√10) = −2.12
• Rejection region
Lower-tail test
• With α = 0.05 and the "<" inequality in HA, the entire rejection region is in the
left tail. The critical value is Ztab = −1.645. Reject H0 if Zcal < −1.645.
1. Hypothesis Testing of a Single Mean
• Statistical decision
  • We reject H0 because −2.12 < −1.645.
• Conclusion
  • We conclude that µ < 30.
  • p = 0.0170 this time, because it is a one-tail test and not a
  two-tail test.

1. Hypothesis Testing of a Single Mean
1.2 Unknown Variance
In most practical applications the standard deviation of the underlying
population is not known.
• In this case, σ can be estimated by the sample standard deviation s.
• If the underlying population is normally distributed and n < 30, then
the test statistic is:
t = (x̄ − µ0) / (s/√n), with n − 1 degrees of freedom

1. Hypothesis Testing of a Single Mean
Example: Two-Tailed Test
• A simple random sample of 14 people from a certain population gives a
sample mean body mass index (BMI) of 30.5 and a standard deviation of 10.64.
Can we conclude that the mean BMI is not 35 at α = 5%?
• H0: µ = 35, HA: µ ≠ 35
• Test statistic
t = (x̄ − µ0) / (s/√n) = (30.5 − 35) / (10.64/√14) = −4.5/2.844 = −1.58
• If the assumptions are correct and H0 is true, the test statistic follows
Student's t distribution with 13 degrees of freedom.

1. Hypothesis Testing of a Single Mean
• Decision rule
  • We have a two-tailed test. With α = 0.05, each tail has area 0.025. The
  critical ttab values with 13 df are −2.1604 and 2.1604.
  • We reject H0 if tcal < −2.1604 or tcal > 2.1604.
• Do not reject H0 because −1.58 is not in the rejection region. Based on the
data of the sample, it is possible that µ = 35. P-value = 0.1375.
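
The BMI example can be checked with a few lines of code. This is only a sketch: the function name is illustrative, and the critical value 2.1604 is taken from the slide (t table, 13 df) rather than computed, because Python's standard library has no t distribution.

```python
import math

def t_statistic(x_bar, mu0, s, n):
    """One-sample t statistic when the population variance is unknown."""
    return (x_bar - mu0) / (s / math.sqrt(n))

# Data from the example: n = 14, mean BMI 30.5, sd 10.64, mu0 = 35
t_cal = t_statistic(30.5, 35, 10.64, 14)
df = 14 - 1

t_tab = 2.1604                      # tabulated t(13, 0.025), as given on the slide
reject = abs(t_cal) > t_tab         # two-tailed decision rule
print(round(t_cal, 2), df, reject)
```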
1. Hypothesis Testing of a Single Mean
1.3. Sampling from a population that is not normally distributed
• Here, we do not know whether the population follows a normal distribution.
• However, with a large sample size, we know from the Central Limit
Theorem that the sampling distribution of the sample mean is approximately
normal.
• With a large sample (n ≥ 30), we can use Z as the test statistic,
calculated using the sample standard deviation.

Example: The National Center for Health Statistics (NCHS) reports the mean
total cholesterol for adults is 203. Is the mean total cholesterol in
Framingham Heart Study participants significantly different?
In 3310 participants the mean is 200.3 with a standard deviation of 36.8.
H0: µ = 203
H1: µ ≠ 203, α = 0.05
Test statistic: Z = (x̄ − µ0) / (s/√n)
Decision rule: Reject H0 if Z > 1.96 or if Z < −1.96
Compute test statistic
Z = (x̄ − µ0) / (s/√n) = (200.3 − 203) / (36.8/√3310) = −4.22
• Significance of the findings: Z = −4.22.
• Conclusion: Reject H0 because −4.22 < −1.96. We have statistically
significant evidence at α = 0.05 to show that the mean total cholesterol
is different in the Framingham Heart Study participants.

2. Hypothesis Tests for Proportions
• Involves categorical values
• Two possible outcomes
  • “Success” (possesses a certain characteristic)
  • “Failure” (does not possess that characteristic)
• The fraction or proportion of the population in the “success” category is denoted by p

2. Hypothesis Tests for Proportions
1.4. Hypothesis Testing about a Single Population Proportion (Normal
Approximation to Binomial Distribution)

2. Hypothesis Tests for Proportions
• Example: We are interested in the probability of developing asthma over a given
one-year period for children 0 to 4 years of age whose mothers smoke in the home.
In the general population of 0 to 4-year-olds, the annual incidence of asthma is 1.4%.
If 10 cases of asthma are observed over a single year in a sample of 500 children
whose mothers smoke, can we conclude that this is different from the underlying
probability of p0 = 0.014? α = 5%
H0: p = 0.014
HA: p ≠ 0.014
• The test statistic is given by:
Z = (p̂ − p0) / √(p0(1 − p0)/n) = (0.02 − 0.014) / √(0.014 × 0.986/500) = 1.14, where p̂ = 10/500 = 0.02
2. Hypothesis Tests for Proportions
• The critical value of Zα/2 at α = 5% is ±1.96.
• Do not reject H0, since Z (= 1.14) is in the non-rejection region between ±1.96.
P-value = 0.2542.
• We do not have sufficient evidence to conclude that the probability of
developing asthma for children whose mothers smoke in the home is
different from the probability in the general population.

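
A minimal sketch of the one-sample proportion test, applied to the asthma example, is shown below. The function name and return values are illustrative. Because the code does not round Z before looking up the tail area, its p-value differs slightly from the table-based 0.2542.

```python
import math
from statistics import NormalDist

def z_test_proportion(x, n, p0):
    """Two-sided Z test of H0: p = p0 using the normal approximation."""
    p_hat = x / n
    se0 = math.sqrt(p0 * (1 - p0) / n)     # standard error under H0
    z_cal = (p_hat - p0) / se0
    p_value = 2 * NormalDist().cdf(-abs(z_cal))
    return p_hat, z_cal, p_value

# 10 asthma cases among 500 children, p0 = 0.014
p_hat, z_cal, p_value = z_test_proportion(10, 500, 0.014)
reject = abs(z_cal) > 1.96
print(p_hat, round(z_cal, 2), round(p_value, 3), reject)
```

The normal approximation is usually considered reasonable when n·p0 and n·(1 − p0) are both at least 5; here n·p0 = 7.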
2. Hypothesis Tests for Proportions
Example:
The NCHS reports that the prevalence of cigarette smoking among adults
in 2002 was 21.1%. Is the prevalence of smoking lower among
participants in the Framingham Heart Study? In 3536 participants, 482
reported smoking.
H0: p = 0.211
H1: p < 0.211, α = 0.05
Test statistic: Z = (p̂ − p0) / √(p0(1 − p0)/n)
Decision rule: Reject H0 if Z < −1.645
2. Hypothesis Tests for Proportions
Compute test statistic
p̂ = 482/3536 = 0.136
Z = (p̂ − p0) / √(p0(1 − p0)/n) = (0.136 − 0.211) / √(0.211(1 − 0.211)/3536) = −10.93
Conclusion: Reject H0 because −10.93 < −1.645. We have statistically
significant evidence at α = 0.05 to show that the prevalence of smoking is
lower among the Framingham Heart Study participants (p < 0.0001).

3. Hypothesis Testing about the Difference Between Two Population Means
When studying one-sample tests for a continuous random variable, the unknown mean μ of
a single population was compared to some known value μ0.
• We are usually interested in comparing the means of two different populations when the
values of both means are unknown.
Two sample means: independent samples

3. Hypothesis Testing about the Difference Between Two Population Means
3.1 Known Variances (Independent Samples)
• When two independent samples are drawn from normally distributed
populations with known variances, the test statistic for testing the H0 of equal
population means is:
Z = ((x̄1 − x̄2) − (µ1 − µ2)) / √(σ1²/n1 + σ2²/n2)
Example:
Researchers wish to know whether there is a difference in mean serum uric acid (SUA)
levels between normal individuals and individuals with Down’s syndrome. The mean SUA
levels in 12 individuals with Down’s syndrome and 15 normal individuals are 4.5 and
3.4 mg/100 ml, respectively, with known variances σ1² = 1 and σ2² = 1.5, respectively.
Is there a difference between the means of the two groups at α = 5%?
3. Hypothesis Testing about the Difference Between Two Population Means
Hypotheses:
H0: µ1 − µ2 = 0 or H0: µ1 = µ2
HA: µ1 − µ2 ≠ 0 or HA: µ1 ≠ µ2
• With α = 0.05, the critical values of Ztab are −1.96 and +1.96. We reject H0 if
Zcal < −1.96 or Zcal > +1.96.
• Test statistic: Z = (4.5 − 3.4) / √(1/12 + 1.5/15) = 1.1/0.428 = 2.57
• Reject H0 because 2.57 > 1.96.
• From these data, it can be concluded that the population means are not equal. A 95%
CI would give the same conclusion. P-value = 0.01.
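
The SUA example can be reproduced with a short sketch of the two-sample Z test for known variances. The function name and the choice to return a two-sided p-value are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def z_test_two_means(x1, x2, var1, var2, n1, n2):
    """Two-sided Z test of H0: mu1 = mu2 with known population variances."""
    se = math.sqrt(var1 / n1 + var2 / n2)      # standard error of x1 - x2
    z_cal = (x1 - x2) / se
    p_value = 2 * NormalDist().cdf(-abs(z_cal))
    return z_cal, p_value

# Down's syndrome group: n = 12, mean 4.5, variance 1
# Normal group:          n = 15, mean 3.4, variance 1.5
z_cal, p_value = z_test_two_means(4.5, 3.4, 1.0, 1.5, 12, 15)
print(round(z_cal, 2), round(p_value, 2))
```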
3. Hypothesis Testing about the Difference Between Two Population Means
3.2 Unknown Variances
I. Equal variances (Independent samples)
• With equal population variances, we can obtain a pooled value from
the sample variances.
• The test statistic for µ1 − µ2 is:
t = ((x̄1 − x̄2) − (µ1 − µ2)) / (Sp √(1/n1 + 1/n2))
where t has (n1 + n2 − 2) df, and
Sp² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)

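
A minimal sketch of the pooled-variance calculation follows. The function name is illustrative, and the summary statistics passed to it are hypothetical values chosen only to show the arithmetic; they are not data from the slides.

```python
import math

def pooled_t(x1, s1, n1, x2, s2, n2):
    """Pooled-variance t statistic for H0: mu1 = mu2 (equal variances assumed)."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df   # pooled variance
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (x1 - x2) / se, sp2, df

# Hypothetical summary statistics, for illustration only
t_cal, sp2, df = pooled_t(x1=12.0, s1=2.0, n1=10, x2=10.0, s2=2.0, n2=10)
print(round(t_cal, 3), sp2, df)
```

The resulting tcal is compared with the tabulated t value for n1 + n2 − 2 degrees of freedom, one-tailed or two-tailed according to HA.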
3. Hypothesis Testing about the Difference Between Two Population Means
• Example:
• We wish to know if we may conclude, at the 95% confidence level, that
smokers, in general, have greater lung damage than do non-smokers.
Calculation of Pooled Variance

3. Hypothesis Testing about the Difference Between Two Population Means
• Hypotheses:
H0: µ1 − µ2 = 0, HA: µ1 > µ2
• With α = 0.05 and df = 23, the critical value of ttab is 1.7139. We reject H0
if tcal > 1.7139.
• Test statistic: tcal = 2.6563
• Reject H0 because 2.6563 > 1.7139. On the basis of the data, we conclude
that µ1 > µ2.

3. Hypothesis Testing about the Difference Between Two Population Means
II. Unequal variances (Independent samples)
• We are still interested in testing
H0: μ1 = μ2 vs HA: μ1 ≠ μ2
• The test statistic used is:
t = ((x̄1 − x̄2) − (µ1 − µ2)) / √(s1²/n1 + s2²/n2)
• To compute the test statistic, we simply substitute s1² for σ1² and s2² for σ2².
• If tcal > td″,α/2 or tcal < −td″,α/2, then reject H0, where d″ denotes the
approximate degrees of freedom.

3. Hypothesis Testing about the Difference Between Two Population Means
Example:
• Suppose we want to compare the characteristics of tuberculosis meningitis for
patients infected with HIV and those not infected with HIV. In particular, we are
interested in comparing age at diagnosis. A random sample of n1 = 37 HIV
infected patients has mean age at diagnosis x̄1 = 27.9 years and s1 = 5.6
years. A sample of n2 = 19 uninfected patients has mean age at diagnosis x̄2 =
38.8 years and s2 = 21.7 years.
• The test statistic is:
t = (27.9 − 38.8) / √(5.6²/37 + 21.7²/19) = −10.9/5.063 = −2.15

3. Hypothesis Testing about the Difference Between Two Population Means
• For a t distribution with 19 df, the area to the left of tcal = −2.15 is
between 0.01 and 0.025.
• Therefore, 0.02 < p < 0.05 (tcal = −2.15 < −t19,0.025 = −2.093).
• For a test conducted at α = 0.05, H0 is rejected.
• We conclude that among patients diagnosed with tuberculosis meningitis,
those who are infected with HIV tend to be younger than those who are
not.

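
The tuberculosis meningitis example can be checked with the sketch below. It assumes that the approximate degrees of freedom come from the Satterthwaite formula, which the slides do not show explicitly; the function name is illustrative.

```python
import math

def unequal_variance_t(x1, s1, n1, x2, s2, n2):
    """t statistic and approximate df when population variances are unequal."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2          # variance of each sample mean
    t_cal = (x1 - x2) / math.sqrt(v1 + v2)
    # Assumption: Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_cal, df

# HIV infected: n1 = 37, mean 27.9, sd 5.6; uninfected: n2 = 19, mean 38.8, sd 21.7
t_cal, df = unequal_variance_t(27.9, 5.6, 37, 38.8, 21.7, 19)
reject = abs(t_cal) > 2.093          # tabulated t(19, 0.025) from the slide
print(round(t_cal, 2), math.floor(df), reject)
```

Under this assumption the approximate df comes out just above 19, which agrees with the 19 df used on the slide once it is rounded down.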
3. Hypothesis Testing about the Difference Between Two Population Means
3.3. Sampling from populations that are not normally distributed
• In this situation, the results of the CLT may be employed if the sample sizes
are large (≥ 30).
• If the population variances are known, they are used; if unknown, the
sample variances based on the large samples are used as estimates.
• The test statistic for µ1 − µ2 is:
Z = ((x̄1 − x̄2) − (µ1 − µ2)) / √(σ1²/n1 + σ2²/n2)

4. Hypothesis Testing for Paired Samples
• Two samples are paired when each data point of the first sample is matched and is
related to a unique data point of the second sample.
• Tests means of 2 related populations
  • Paired or matched samples
  • Repeated measures (before/after)
  • Longitudinal or follow-up study
Assumptions:
• Both populations are normally distributed
• Or, if not normal, use large samples
The Paired t Test
The test statistic for the mean difference is:
t = (d̄ − µd) / (Sd/√n)
where t has n − 1 df and Sd is:
Sd = √(Σ(di − d̄)² / (n − 1))
n is the number of pairs in the paired sample
Sd = sample standard deviation of the differences
4. Hypothesis Testing for Paired Samples
Example:
• The following data show the SBP levels (mm Hg) in 10 women while not using
(baseline) and while using (follow-up) oral contraceptives. Can we conclude that
there is a difference between mean baseline and follow-up SBP at α = 5%? di =
follow-up − baseline
i  | SBP (baseline) | SBP (follow-up) | di
1  | 115            | 128             | 13
2  | 112            | 115             | 3
3  | 107            | 106             | -1
4  | 119            | 128             | 9
5  | 115            | 122             | 7
6  | 138            | 145             | 7
7  | 126            | 132             | 6
8  | 105            | 109             | 4
9  | 104            | 102             | -2
10 | 115            | 117             | 2
4. Hypothesis Testing for Paired Samples
d̄ = (13 + 3 + … + 2)/10 = 4.80
S²d = [(13 − 4.8)² + … + (2 − 4.8)²]/9 = 20.844
Sd = √20.844 = 4.566
tcal = 4.80/(4.566/√10) = 4.80/1.44 = 3.32
• From the table, t9, 0.025 = 2.262
• Since tcal (= 3.32) > t9,α/2 (= 2.262), H0 is rejected
• P-value is between 0.001 and 0.01
• Since 3.32 falls in the rejection region, there is a significant difference
between the population mean SBP while not using and while using OCs.

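
The paired calculation can be reproduced directly from the ten pairs in the table. The sketch below uses only Python's standard library; the variable names are illustrative, and the differences are taken as follow-up minus baseline, which matches the di column.

```python
import math
from statistics import mean, stdev

baseline  = [115, 112, 107, 119, 115, 138, 126, 105, 104, 115]
follow_up = [128, 115, 106, 128, 122, 145, 132, 109, 102, 117]

# Differences within each pair (follow-up minus baseline)
d = [f - b for b, f in zip(baseline, follow_up)]

d_bar = mean(d)                     # mean difference
s_d = stdev(d)                      # sample sd of the differences (n - 1 divisor)
n = len(d)
t_cal = d_bar / (s_d / math.sqrt(n))

t_tab = 2.262                       # tabulated t(9, 0.025) from the slide
print(d_bar, round(s_d, 3), round(t_cal, 2), abs(t_cal) > t_tab)
```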
5. Hypothesis Tests about the Difference Between Two Population Proportions
Since we begin by assuming the null hypothesis is true, we assume p1 = p2 and
pool the two estimates.
The pooled estimate for the overall proportion is:
p̂ = (X1 + X2) / (n1 + n2)
where X1 = the observed number of events in the first sample
and X2 = the observed number of events in the second sample
The test statistic for p1 − p2 is:
Z = (p̂1 − p̂2) / √(p̂(1 − p̂)(1/n1 + 1/n2))

5. Hypothesis Tests about the Difference Between Two Population Proportions
Example:
• A study was conducted to investigate the possible cause of a gastroenteritis outbreak
following a lunch served in a high school cafeteria. Among the 225 students who ate
the sandwiches, 109 became ill, while among the 38 students who did not eat the
sandwiches, 4 became ill. Is there a significant difference between the two groups at
α = 5%?
• We wish to test H0: p1 = p2 against the alternative HA: p1 ≠ p2
• Assume that the sample sizes are large enough and the normal approximation to the
binomial distribution is valid.
• If H0 is true, then p1 = p2 = p
5. Hypothesis Tests about the Difference Between Two Population Proportions
p̂1 = 109/225 = 0.484, p̂2 = 4/38 = 0.105, p̂ = (109 + 4)/(225 + 38) = 0.430
Z = (0.484 − 0.105) / √(0.430 × 0.570 × (1/225 + 1/38)) = 4.36
The area under the standard normal curve to the
right of 4.36 is less than 0.0001. Therefore, p < 0.0002. We reject H0
at the 0.05 level (4.36 > 1.96).
The proportion of students who became ill differs in the two groups;
those who ate the prepared sandwiches were more likely to develop
gastroenteritis.

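
The gastroenteritis example can be sketched in code as below. The function name and return values are illustrative. Because the code keeps full precision instead of rounding the proportions to three decimals, it gives Z of about 4.37 rather than 4.36; the conclusion is unchanged.

```python
import math
from statistics import NormalDist

def z_test_two_proportions(x1, n1, x2, n2):
    """Two-sided Z test of H0: p1 = p2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                 # pooled estimate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z_cal = (p1 - p2) / se
    p_value = 2 * NormalDist().cdf(-abs(z_cal))
    return z_cal, p_value

# Ate sandwiches: 109 ill of 225; did not eat: 4 ill of 38
z_cal, p_value = z_test_two_proportions(109, 225, 4, 38)
print(round(z_cal, 2), p_value < 0.0002)
```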
Chi-square test of independence/association (χ²)
• The chi-square test for independence tests whether two categorical variables are
independent of one another.
• It is a non-parametric test used to measure the association between two categorical
variables.
• The data are often summarized in a contingency table.
• Example: Suppose a new postoperative procedure is administered to a group of patients
at a particular hospital.
• The chi-square test for independence tests the null hypothesis that the variables in the
rows and columns are independent of one another.
Chi-square test of independence/association (χ²)
• A chi-square (χ²) statistic is used to investigate whether distributions of categorical
(i.e. nominal/ordinal) variables differ from one another.
• Assumptions for the chi-square test
1. Identify the variable and the level of measurement.
2. The data are obtained from a random sample.
3. The expected frequency for each category must be 5 or more.
The hypotheses are:
• H0: The two variables are independent (no relationship).
• H1: The two variables are dependent (there is a relationship between the two variables).
• If the null hypothesis is rejected, there is some relationship between the variables.

Chi-square test of independence/association (χ²)
• χ² tests are based on the agreement between expected (under H0) and
observed (sample) frequencies.
• Degrees of freedom are calculated as: (# rows − 1) × (# columns − 1)
• If H0 is true, χ² will be close to 0; if H0 is false, χ² will be large.
• Reject H0 if χ² > critical value from the χ² table
General Notation for a chi square 2x2 Contingency Table

Variable 2  | Variable 1: Data Type 1 | Variable 1: Data Type 2 | Totals
Category 1  | a                       | b                       | a + b
Category 2  | c                       | d                       | c + d
Total       | a + c                   | b + d                   | a + b + c + d

χ² = (ad − bc)² (a + b + c + d) / [(a + b)(c + d)(a + c)(b + d)]

Example for a chi square 2x2 Contingency Table
Example:
A sample of 200 college students participated in a study designed to evaluate the
level of college students’ knowledge of a certain group of common diseases. The
following table shows the students classified by major field of study and level of
knowledge of the group of diseases. Do these data suggest that there is a
relationship between knowledge of the group of diseases and major field of study of the
college students from which the present sample was drawn? Let α = 0.05.

Example for a chi square 2x2 Contingency Table
Observed cells (four cells, a four-fold table):
16   | 24   | 40
20   | 140  | 160
36   | 164  | 200
Expected cells:
E11 = 40 × 36/200 = 7.2;    E12 = 40 × 164/200 = 32.8
E21 = 160 × 36/200 = 28.8;  E22 = 160 × 164/200 = 131.2
7.2  | 32.8
28.8 | 131.2
Example for a chi square 2x2 Contingency Table
χ² = (ad − bc)² n / [(a + b)(c + d)(a + c)(b + d)]
   = (16 × 140 − 24 × 20)² × 200 / (40 × 160 × 36 × 164)
   = 16.396
df = (R − 1)(C − 1) = 1
χ²0.05,1 = 3.84
Reject H0 at α = 0.05, since 16.396 > 3.84.
There is a relationship between knowledge of the group of diseases and major field of
study of the college students.
Students majoring in premedical studies have higher rates of knowledge of the diseases.
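
The shortcut formula for a 2x2 table is easy to verify in code. The sketch below is a minimal illustration without any continuity correction; the function name is a choice made for this example, and the critical value 3.84 is the tabulated one from the slide.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 table using the shortcut formula."""
    n = a + b + c + d
    numerator = (a * d - b * c) ** 2 * n
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Observed cells from the example: a = 16, b = 24, c = 20, d = 140
chi2 = chi_square_2x2(16, 24, 20, 140)
reject = chi2 > 3.84                # tabulated chi-square(0.05, 1 df)
print(round(chi2, 3), reject)
```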
The χ² statistic
Pearson’s chi-square test uses the following formula to calculate the chi-square (χ²)
statistic:
χ² = Σ (O − E)² / E
Where:
• χ² is the chi-square statistic
• Σ is the summation operator (i.e., it “takes the sum of”)
• O is the observed frequency
• E is the expected frequency
The larger the difference between the observations and expectations (O − E in the
equation), the bigger the chi-square statistic will be.
The chi-square statistic is compared with a critical value using a chi-square critical
value table or statistical software.
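
For tables of any size, the same statistic can be computed from observed and expected frequencies. The sketch below is one possible implementation; the function name and return values are illustrative. It is applied to the 2x2 knowledge-by-major table, where it should agree with the shortcut formula.

```python
def chi_square_independence(observed):
    """Pearson chi-square statistic, df and expected counts for an r x c table."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    # Expected count under independence: (row sum)(column sum) / grand total
    expected = [[r * c / total for c in col_sums] for r in row_sums]
    chi2 = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(row_sums) - 1) * (len(col_sums) - 1)
    return chi2, df, expected

chi2, df, expected = chi_square_independence([[16, 24], [20, 140]])
print(round(chi2, 3), df, expected)
```

Before trusting the result, it is worth checking that every expected count is at least 5, as required by the assumptions of the test.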
Example: Hospitals and Infections
• A researcher wishes to see if there is a relationship between the
hospital and the type of patient infections. A sample of 3 hospitals
was selected, and the number of infections for a specific year has
been reported. The data are shown next.

Example: Hospitals and Infections
Step 1: State the hypotheses and identify the claim.
• H0: The type of infection is independent of the hospital.
• H1: The type of infection is dependent on the hospital (claim).
Step 2: Find the critical value.
• The significance level is α = 0.05.
• d.f. = (# rows − 1) × (# columns − 1) = (3 − 1)(3 − 1) = (2)(2) = 4
• With 4 d.f. the critical value is 9.488 (it can be found with GeoGebra or a χ² table).
• In order to test the null hypothesis, one must compute the expected
frequencies, assuming the null hypothesis is true.
Example: Hospitals and Infections
First compute the expected values:
E = (row sum)(column sum) / grand total
Let’s see where these calculated values end up on the next slide.
Example: Hospitals and Infections
Observed vs Expected
Step 3: Compute the test value.
χ² = Σ (O − E)² / E = 30.698

Example: Hospitals and Infections
Step 4: Make the decision.
The decision is to reject the null hypothesis, since the test value 30.698 is in the
critical region (30.698 > 9.488).

Step 5: Conclusion
There is enough evidence to support the claim that the type of infection is related
to the hospital where it occurred.
