Chapter 5-6 Estimation Hypothesis
Berhanu Teshome(MSc)
Assistant Professor of Biostatistics
berteshome19@gmail.com
[Figure: Sampling; Descriptive Statistics]
11/4/2023
Statistical Inference
Inferential Statistics
- Estimation:
  - Point estimation
  - Interval estimation
- Hypothesis testing:
  - One-sample t-test
  - Two-sample t-test
  - ANOVA
  - Paired t-test
  - Chi-square test of independence, etc.
Statistical estimation
The process of drawing conclusions about an entire population based on
the data in a sample is known as statistical inference.
Methods of inference usually fall into one of two broad categories: estimation and hypothesis testing.
Properties of good estimators
A. Unbiased
♣ A statistic is said to be an unbiased estimator if its expected value is equal to the parameter being estimated.
The sample mean ($\bar{x}$) is an unbiased estimator of the population mean: $E(\bar{x}) = \mu$
B. Consistent
♣ A statistic is said to be a consistent estimator if its value gets closer to the parameter as the sample size grows larger.
C. Relatively Efficient
♣ A statistic is said to be a relatively efficient estimator if its variance is smaller than that of other estimators of the same parameter.
Estimating the Sampling Error
Any estimates derived from samples are subject to the sampling error.
This comes from the fact that only a part of the population was observed,
instead of the whole.
Different samples could have come up with different results. The amount of variation that exists among the estimates from the different possible samples is the sampling error.
The set of sample means in repeated random samples of size n from a given population has variance $\sigma^2/n$.
The standard deviation of this set of sample means is $\sigma/\sqrt{n}$ and is referred to as the standard error of the mean (sem) or the standard error.
The sem is estimated by $s/\sqrt{n}$ if σ is unknown.
The sampling error depends on the sample size (n), the variability of individual sample points (σ), and the sampling and estimation methods.
As n increases, the sample mean ($\bar{x}$) and the sample variance $s^2$ approach the values of the true population parameters, µ and $\sigma^2$, respectively.
Confidence Intervals
Give a plausible range of values of the estimate likely to include the “true”
(population) value with a given confidence level.
An interval estimate provides more information about a population characteristic
than does a point estimate
Such interval estimates are called confidence intervals.
CIs also give information about the precision of an estimate.
How much uncertainty is associated with a point estimate of a population
parameter?
When sampling variability is high, the CI will be wide to reflect the uncertainty of
the observation.
Wider CIs indicate less certainty.
CIs can also answer the question of whether or not an association exists or a
treatment is beneficial or harmful. (analogous to p-values…)
e.g., if the CI of an odds ratio includes the value 1.0 we cannot be
confident that exposure is associated with disease.
A CI in general:
Takes into consideration variation in sample statistics from sample to
sample
Based on observation from 1 sample
Confidence Level
The confidence with which the interval will contain the unknown population parameter.
Example: 95%
1. CI for a Single Population Mean
A. Known variance or large sample size
There are 3 elements to a CI:
1. Point estimate
2. SE of the point estimate
3. Confidence coefficient
Consider the task of computing a CI estimate of μ for a population distribution
that is normal with σ known.
Available are data from a random sample of size n.
Assumptions
- Population standard deviation (σ) is known
- Sample size is large (n ≥ 30)
- Population normally distributed
Use the Z distribution.
A 100(1 − α)% CI for µ is:
$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
Factors Affecting Margin of Error
$1.52 \pm 1.96\sqrt{\frac{2.25}{32}} = 1.52 \pm 1.96(0.27) = 1.52 \pm 0.53 = (0.99, 2.05)$
c. A larger sample size makes the CI narrower (more precise).
When constructing CIs, it has been assumed that the standard deviation of the underlying population, σ, is known.
What if σ is not known?
We can use the sample standard deviation s in place of σ if the sample size is large enough (n ≥ 30). With a large sample size, we assume a normal distribution for the sample mean.
Example:
It was found that a sample of 35 patients were 17.2 minutes late for appointments, on average, with an SD of 8 minutes. What is the 90% CI for µ? Ans: (15.0, 19.4).
Since the sample size is fairly large (≥ 30) and the population SD is unknown, we assume the sample mean to be normally distributed based on the CLT and use the sample SD in place of the population σ.
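The large-sample interval for the patient-lateness example can be checked with a short sketch. The function name `z_confidence_interval` is an illustrative assumption, not part of the course material; it computes the mean plus or minus z times SD/√n, taking z from the standard normal distribution.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(mean, sd, n, confidence=0.95):
    """Normal-based CI for a mean: mean ± z * sd / sqrt(n)."""
    # Two-sided critical value, e.g. about 1.645 for 90% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin


# 35 patients, mean lateness 17.2 minutes, SD 8 minutes, 90% confidence
low, high = z_confidence_interval(17.2, 8, 35, confidence=0.90)
print(f"90% CI: ({low:.1f}, {high:.1f})")
```

Raising the confidence level or lowering n in this sketch widens the interval, which matches the remarks on the margin of error.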
B. Unknown variance (and small sample size, n < 30)
What if σ for the underlying population is unknown and the sample size is small?
As an alternative we use Student’s t distribution.
Assumptions
- Population standard deviation (σ) is unknown
- Sample size is small (n < 30)
- Population normally distributed
- If population is not normal, use CLT
[Table: Student’s t table]
[Table: t distribution values, with comparison to the Z value]
Example:
Standard error =
Exercise:
Compute a 95% CI for the mean birth weight based on n = 10, sample mean = 116.9
and s =21.70.
From the t table, $t_{9,\,0.975}$ = 2.262
Ans: (101.4, 132.4)
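The birth-weight exercise can be reproduced with a minimal sketch. The helper `t_confidence_interval` is a hypothetical name, and the critical value 2.262 is passed in from the t table because the Python standard library has no t distribution.

```python
from math import sqrt


def t_confidence_interval(mean, s, n, t_critical):
    """Small-sample CI for a mean: mean ± t * s / sqrt(n)."""
    se = s / sqrt(n)            # estimated standard error of the mean
    margin = t_critical * se
    return mean - margin, mean + margin


# n = 10, sample mean = 116.9, s = 21.70, t(9 df, 0.975) = 2.262 from the table
low, high = t_confidence_interval(116.9, 21.70, 10, t_critical=2.262)
print(f"95% CI: ({low:.1f}, {high:.1f})")
```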
2. CIs for single population proportion, p
Upper and lower confidence limits for the population proportion are calculated with the formula:
$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
Interpretation:
- We are 95% confident that the true percentage of left-handers in the population is between 16.51% and 33.49%.
- Although this range may or may not contain the true population proportion, 95% of the intervals formed from samples of size 100 in this manner will contain the true proportion.
Changing the sample size
Increasing the sample size reduces the width of the confidence interval.
Example: If the sample size in the above example is doubled to 200, and if 50 are left-handed in the sample, then the interval is still centered at 0.25, but it narrows to (0.19, 0.31).
Example: It was found that 28.1% of 153 cervical-cancer cases had never had a Pap smear prior to the time of the case’s diagnosis. Calculate a 95% CI for the percentage of cervical-cancer cases who never had a Pap test.
A 95% CI is given by
$0.281 \pm 1.96\sqrt{\frac{0.281(1-0.281)}{153}} = 0.281 \pm 0.071 = (0.210, 0.352)$
Example:
Suppose that among 10,000 female operating-room nurses, 60 women have developed
breast cancer over five years. Find the 95% CI for p based on the point estimate.
Point estimate = 60/10,000 = 0.006
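The proportion examples in this section all use the same normal-approximation interval, so one small helper covers them. This is a sketch under the assumption that the large-sample formula is appropriate; `proportion_ci` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(p_hat, n, confidence=0.95):
    """Normal-approximation CI for a proportion: p ± z * sqrt(p(1-p)/n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Left-handers: 25 of 100, then 50 of 200 (same proportion, larger sample)
lh_100 = proportion_ci(0.25, 100)
lh_200 = proportion_ci(0.25, 200)
# Operating-room nurses: 60 of 10,000
nurses = proportion_ci(60 / 10_000, 10_000)

print(f"n=100: ({lh_100[0]:.4f}, {lh_100[1]:.4f})")
print(f"n=200: ({lh_200[0]:.2f}, {lh_200[1]:.2f})")
print(f"nurses: ({nurses[0]:.4f}, {nurses[1]:.4f})")
```

The second call illustrates the point about sample size: the centre stays at 0.25 while the interval narrows.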
Estimation for Two Populations
3. CI for the difference between population means
Assumptions
Samples are randomly and independently drawn
Illustration
A researcher performs a drug trial involving two independent groups.
Example
We are interested in the similarity of the two groups.
Example:
Researchers are interested in the difference between serum uric acid levels in
patients with and without Down’s syndrome.
Patients without Down’s syndrome
We are 95% confident that the true difference between the two population means is between 0.26 and 1.94.
Example:
The mean CD4+ cell count for 112 men with HIV infection was 401.8 with an SD of 226.4. For 75 men without HIV, the mean and SD were 828.2 and 274.9, respectively. Calculate a 99% CI for the difference between population means.
SE of the difference between two means = $\sqrt{\frac{226.4^2}{112} + \frac{274.9^2}{75}}$ = 38.28
99% CI = (828.2 − 401.8) ± 2.58(38.28) = (327.6, 525.2)
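The CD4 example can be checked numerically with the sketch below. The function name is hypothetical, and the critical value comes from the standard normal distribution (about 2.576), so the limits differ slightly from those obtained with the rounded table value 2.58.

```python
from math import sqrt
from statistics import NormalDist


def diff_means_ci(mean1, sd1, n1, mean2, sd2, n2, confidence=0.95):
    """Large-sample CI for mean1 - mean2 from two independent samples."""
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)   # SE of the difference
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = mean1 - mean2
    return diff - z * se, diff + z * se, se


# Men without HIV (n = 75) minus men with HIV (n = 112), 99% confidence
low, high, se = diff_means_ci(828.2, 274.9, 75, 401.8, 226.4, 112,
                              confidence=0.99)
print(f"SE = {se:.2f}, 99% CI = ({low:.1f}, {high:.1f})")
```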
Example:
A study was conducted to compare the serum iron levels of children with
cystic fibrosis to those of healthy children. Serum iron levels were measured
for random samples of n1 = 9 healthy children and n2 = 13 children with
cystic fibrosis.
The two underlying populations of serum iron levels are independent and
normally distributed.
95% CI for $\mu_1 - \mu_2$ = (1.4, 12.6)
Example:
Birth weights of children born to 14 heavy smokers (group 1) and to 15 non-smokers (group 2) were sampled from live births at a large teaching hospital. For the heavy smokers, sample mean = 3.17 kg and SD = 0.46 kg; for non-smokers, sample mean = 3.63 kg and SD = 0.36 kg.
$s_p$ = 0.4121, SE = 0.1531, t-value at 27 df = 2.05
95% CI = (0.14, 0.77)
Example:
For the tuberculosis meningitis example, a random sample of n1 = 37 HIV-infected patients has mean age at diagnosis $\bar{x}_1$ = 27.9 years and standard deviation s1 = 5.6 years.
A sample of n2 = 19 uninfected patients has mean age at diagnosis $\bar{x}_2$ = 38.8 years and standard deviation s2 = 21.7 years.
Degrees of freedom (unequal variances): df = 19.24 ≈ 19
95% CI for $\mu_1 - \mu_2$ = (-21.5, -0.3)
C. Paired Samples
§ Tests Means of 2 Related Populations
Assumptions:
§ Both populations are normally distributed,
§ Or, if not normal, use large samples.
Paired differences
If two measurements of the same phenomenon (eg. blood pressure, #
Example:
Ten hypertensive patients are screened at a neighborhood health clinic and are given
methyl dopa, a strong antihypertensive medication for their condition. They are asked
to come back 1 week later and have their blood pressures measured again. Suppose
the initial and follow-up SBPs (mm Hg) of the patients are given below.
1. What are the mean and SD of the difference?
2. What is the standard error of the mean difference?
3. Assuming that the difference is normally distributed, construct a 95% CI for $\mu_d$.
Solution
We have the following data and summary statistics
4. Two Population Proportions
We are often interested in comparing proportions from 2 populations:
• Is the incidence of disease A the same in two populations?
• Patients are treated with either drug D, or with placebo. Is the proportion
“improved” the same in both groups?
Goal: Form a confidence interval for, or test a hypothesis about, the difference between two population proportions, $p_1 - p_2$.
Assumptions:
Example: In a clinical trial for a new drug to treat hypertension, n1 = 50 patients were randomly
assigned to receive the new drug, and n2 = 50 patients to receive a placebo. 34 of the patients
receiving the drug showed improvement, while 15 of those receiving placebo showed
improvement.
Compute a 95% CI estimate for the difference between proportions improved.
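Point estimate = $\hat{p}_1 - \hat{p}_2$ = 34/50 − 15/50 = 0.68 − 0.30 = 0.38
SE of the difference = $\sqrt{\frac{0.68(0.32)}{50} + \frac{0.30(0.70)}{50}}$ = 0.0925
95% CI
Lower = (point estimate) − ($z_{\alpha/2}$)(SE) = 0.38 − (1.96)(0.0925) = 0.20
Upper = (point estimate) + ($z_{\alpha/2}$)(SE) = 0.38 + (1.96)(0.0925) = 0.56
95% CI = (0.20, 0.56)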
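The hypertension-trial calculation can be sketched as follows. The function name `diff_proportions_ci` is an assumption for illustration; it uses the unpooled standard error, as in the worked example.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportions_ci(x1, n1, x2, n2, confidence=0.95):
    """CI for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1 - p2
    return diff - z * se, diff + z * se, se


# 34 of 50 improved on the new drug, 15 of 50 improved on placebo
low, high, se = diff_proportions_ci(34, 50, 15, 50)
print(f"SE = {se:.4f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Because the interval excludes 0, it points to a real difference in the proportion improved, in line with the earlier remark that CIs can answer questions of association.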
5. Sample size estimation for cross sectional studies: mean
and proportion estimation
In planning any investigation we must decide how many people need to be
studied in order to answer the study objectives.
If the study is too small we may fail to detect important effects, or may
estimate effects too imprecisely.
If the study is too large then we will waste resources.
Sample Size
If the sample size (n) is:
- Large: increased accuracy, but costly and complex
- Small: decreased accuracy, but less costly
Take the optimum sample size. How?
Factors to determine sample size
Size of population
Resources – subjects, financial, manpower
Method of Sampling- random, stratified
Degree of difference to be detected
Variability (S.D.) – pilot study, historical
Degree of Accuracy (or errors)
- Type I error (alpha): p < 0.05
- Type II error (beta): less than 0.2 (20%)
- Power of the test: more than 0.8 (80%)
Dropout rate, non-compliance
Sample size for Single population
To estimate the sample size for a single survey using simple or systematic random sampling, we need to know:
- Estimate of the prevalence of the outcome
- Precision desired
- Design effect
- Size of the total population
- Level of confidence (commonly 95%)
This is the situation in which the variable of interest is categorical.
The possible sources of this proportion are:
- the results of a previous study,
- a pilot study,
- the judgment of the researcher,
- simply taking 50%.
Then the formula for the sample size of a single population proportion is:
$n = \frac{z_{\alpha/2}^2 \, p(1-p)}{w^2}$
where
α = the level of significance, obtained as 1 − confidence level
p = best estimate of the population proportion
w = maximum acceptable difference
$z_{\alpha/2}$ = the value from the standard normal table for the given confidence level
- Consider the total size of the population (N): if N < 10,000, the formula needs a correction, defined by
$n_f = \frac{n_0}{1 + \frac{n_0}{N}}$
where $n_f$ = final sample size, $n_0$ = sample size from the above formula, and N = total population size.
- Take the design effect into account if needed.
Example:
An MPH student wants to conduct research on the prevalence of ANC utilization among mothers in Sululta town. Given that the prevalence in a previous study was found to be 45.7%, what sample size should the student take to address this objective?
Solution:
- Margin of error w = 5%
$n = \frac{z_{\alpha/2}^2 \, p(1-p)}{w^2} = \frac{(1.96)^2 (0.457)(0.543)}{(0.05)^2} \approx 382$
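The ANC example and the finite population correction can be combined in one small sketch. The function name and the population size of 5,000 in the second call are hypothetical, chosen only to show the correction; the source example does not give N.

```python
from math import ceil
from statistics import NormalDist


def sample_size_proportion(p, w, confidence=0.95, population=None):
    """Sample size for estimating one proportion with margin of error w.

    If a finite population size is given, apply n0 / (1 + n0 / N).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z**2 * p * (1 - p) / w**2
    if population is not None:
        n0 = n0 / (1 + n0 / population)
    return ceil(n0)  # always round up to a whole participant


n_anc = sample_size_proportion(0.457, 0.05)
# Hypothetical total population of 5,000 (an assumption, not from the example)
n_corrected = sample_size_proportion(0.457, 0.05, population=5000)
print(n_anc, n_corrected)
```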
Sample size for single population mean
This is the condition in which the research question is about mean.
Standard deviation (σ) of the population:
It is rare that a researcher knows the exact standard deviation of the
population.
Typically, the standard deviation of the population is estimated:
Ø from the results of a previous survey,
Ø from a pilot study,
Ø from secondary data,
Ø from judgment of the researcher.
Maximum acceptable difference (w): This is the maximum amount of error that you are
willing to accept.
Desired confidence level ($z_{\alpha/2}$): your level of certainty that the sample mean does not differ from the true population mean by more than the maximum acceptable difference. Commonly we use a 95% confidence level.
Then the sample size determination formula for a single population mean is:
$n = \frac{z_{\alpha/2}^2 \, \sigma^2}{w^2}$
where
α = the level of significance, obtained as 1 − confidence level
σ = standard deviation of the population
w = maximum acceptable difference
$z_{\alpha/2}$ = the value from the standard normal table for the given confidence level
Comparison of two proportions
$n = \frac{(z_{\alpha/2} + z_{\beta})^2 \, [p_1(1-p_1) + p_2(1-p_2)]}{(p_1 - p_2)^2}$
• For a two-sided alternative hypothesis with significance level α, the sample size n1 = n2 = n required to detect a true difference in means of µ1 − µ2 with power at least 1 − β is:
$n = 2\left[\frac{(z_{\alpha/2} + z_{\beta})\,\sigma}{\mu_1 - \mu_2}\right]^2$
The total sample size is then equal to
$n = 4\left[\frac{(z_{\alpha/2} + z_{\beta})\,\sigma}{\mu_1 - \mu_2}\right]^2$
• For a one-sided alternative hypothesis with significance level α, this sample size is given by:
$n = 2\left[\frac{(z_{\alpha} + z_{\beta})\,\sigma}{\mu_1 - \mu_2}\right]^2$
Comparison of two means (sample size in each group)
Example: We are interested in the size for a sample from a population of blood cholesterol levels. We know that typically σ is about 30 mg/dl for these populations. How large a sample would be needed for comparing two approaches to cholesterol lowering using α = 0.05, to detect a difference of d = 20 mg/dl or more with power = 1 − β = 0.90?
Solution:
σ = 30 mg/dl, β = 0.10, α = 0.05; $z_{1-\alpha/2}$ = 1.96
Power = 1 − β; $z_{1-\beta}$ = 1.282
µ1 − µ2 = 20 mg/dl
Total sample size:
$n = 4\sigma^2 \frac{(z_{\alpha/2} + z_{\beta})^2}{(\mu_1 - \mu_2)^2} = \frac{4(30)^2 (1.96 + 1.282)^2}{(20)^2} = \frac{4 \times 900 \times (3.242)^2}{400} = \frac{37838.03}{400} = 94.6 \approx 95$
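The cholesterol calculation can be reproduced with a brief sketch. `sample_size_two_means` is an illustrative name; it returns the unrounded per-group size for equal groups, and the total is twice that value.

```python
from statistics import NormalDist


def sample_size_two_means(sigma, delta, alpha=0.05, power=0.80, two_sided=True):
    """Per-group sample size to detect a mean difference delta (equal groups)."""
    inv = NormalDist().inv_cdf
    z_alpha = inv(1 - alpha / 2) if two_sided else inv(1 - alpha)
    z_beta = inv(power)
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2


# sigma = 30 mg/dl, difference of 20 mg/dl, alpha = 0.05, power = 0.90
n_per_group = sample_size_two_means(30, 20, alpha=0.05, power=0.90)
n_total = 2 * n_per_group
print(f"per group: {n_per_group:.1f}, total: {n_total:.1f}")
```

If a dropout rate is expected, one common adjustment (an assumption here, not stated in the example) is to divide the required size by one minus the dropout proportion before rounding up.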
Exercise
An investigator is planning a clinical trial to evaluate the efficacy of a new drug designed to
reduce systolic blood pressure. The plan is to enroll participants and to randomly assign them to
receive either the new drug or a placebo. Systolic blood pressures will be measured in each
participant after 12 weeks on the assigned treatment. Based on prior experience with similar
trials, the investigator expects that 10% of all participants will be lost to follow up or will drop
out of the study. If the new drug shows a 5 unit reduction in mean systolic blood pressure, this
would represent a clinically meaningful reduction. How many patients should be enrolled in the
trial to ensure that the power of the test is 80% to detect this difference? A two sided test will
be used with a 5% level of significance. Assume that the standard deviation of systolic blood
pressure was 19.0.
$z_{1-\alpha/2}$ = 1.96
$z_{1-\beta}$ = 0.84
6. Hypothesis testing
Two types of hypothesis:
Research hypothesis:
- Is the supposition or conjecture that motivates the research. It may be proposed after numerous repeated observations.
- A research hypothesis is generated about an unknown population parameter.
- It leads directly to statistical hypotheses.
Statistical hypothesis:
- Stated in such a way that it can be evaluated by using an appropriate statistical technique.
Examples of Research Hypotheses
Population Mean
The average length of stay of patients admitted to the hospital is five
days
The mean birthweight of babies delivered by mothers with low SES is
lower than those from higher SES. Etc
Population Proportion
The proportion of adult smokers in Addis Ababa is believed to be p = 0.40
Null hypothesis (H0):
· H0 is a statement of agreement (or no difference).
· H0 is always about a population parameter, not about a sample statistic.
Alternative hypothesis (HA), set by the researcher:
· Is what the investigator believes to be true.
· Is a statement that disagrees with (opposes) H0 (the effect of interest is not zero, i.e. there is a difference).
· Never contains the “=” sign.
· May or may not be accepted.
Steps in Hypothesis Testing
1. Formulate the appropriate statistical hypotheses clearly
• Specify H0 and HA
H0: µ = µ0     H0: µ = µ0     H0: µ = µ0
HA: µ ≠ µ0     HA: µ > µ0     HA: µ < µ0
two-tailed     one-tailed     one-tailed
2. Set up a suitable significance level:
• The level of significance is the probability of rejecting a true null hypothesis, i.e. the probability of committing a type I error.
It is usually denoted by α and should be specified before any samples are drawn.
The level of significance is an arbitrarily chosen small number, usually 0.05 or 0.01.
Errors in Hypothesis Tests
In fact        | Inference: reject H0            | Inference: do not reject H0
H0 is true     | Type I error (α)                | Correct decision (1 − α)
H0 is false    | Correct decision, power (1 − β) | Type II error (β)
Type I error & type II error
[Figure: sampling distributions centered at µ0 and µ1, with the critical value separating the "do not reject H0" region from the "reject H0" region; the power 1 − β is marked]
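3. Decide on the appropriate test statistic for the hypothesis, e.g. for one population mean:
$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ OR $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
4. Determine the critical region: the set of test-statistic values for which H0 is rejected (the rejection region). The remaining values form the acceptance (non-rejection) region of H0.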
5. Computation: compute the test statistic and other results from the sample, then check whether the sample result falls in the rejection region or in the acceptance region.
6. Decision: finally we draw statistical conclusions. A statistical decision consists of either accepting (not rejecting) the null hypothesis or rejecting it.
Remark:
1. Using the z or t table, if the calculated value is more extreme than the tabulated (critical) value, the null hypothesis is rejected, i.e. the statistical results are significant.
2. The p-value
The p-value is the probability of obtaining a value of the test statistic at least as extreme as that observed, if the null hypothesis is true.
The p-value for a test of a hypothesis is the smallest value of α for which the null hypothesis can be rejected.
When the p-value is below the cut-off level (α), say 0.05, the result is called statistically significant; when it is above 0.05 it is called not significant.
Reject H0 if p-value < α; do not reject H0 if p-value ≥ α.
In a one-tail test, the rejection region is at one end of the distribution or the other.
Decision rule: reject H0 if Zcal > Ztab (upper-tail test) or if Zcal < −Ztab (lower-tail test).
In a two-tail test, the rejection region is split between the two tails.
Decision rule: reject H0 if |Zcal| > Ztab.
Which one is used depends on the way the alternative hypothesis is written.
The same is true for the t-test.
Rules for Stating Statistical Hypotheses
1. One population
Indication of equality (either =, ≤ or ≥) must appear in H0.
H0: μ = μ0, HA: μ ≠ μ0
H0: P = P0, HA: P ≠ P0
Can we conclude that a certain population mean is
not 30?
H0: μ = 30 and HA: μ ≠ 30
greater than 50?
H0: μ = 50 and HA: μ > 50
Can we conclude that the proportion of patients with leukemia who survive more than six years is not 60%?
H0: P = 0.6 and HA: P ≠ 0.6
In summary,
1. What you hope to conclude should be placed in the HA.
2. The H0 should have a statement of equality, =.
3. The H0 is the hypothesis that is tested
4. The H0 and HA are complementary.
1. Hypothesis Testing of a Single Mean
C. Hypotheses
H0: µ = 30
HA: µ ≠ 30
D. Test statistic
As the population variance is known, we use Z as the test statistic.
E. Decision Rule
Reject H0 if the Zcal value falls in the rejection region.
Don’t reject Ho if the Zcal value falls in the non-rejection region.
Because of the structure of H0 it is a two tail test. Therefore, reject H0
if Zcal < -1.96 or Zcal > 1.96 or |Zcal| > 1.96
F. Calculation of test statistic
G. Statistical decision
We reject H0 because Z = -2.12 is in the rejection region (-2.12 < -1.96). The value is significant at the 5% level.
H. Conclusion
We conclude that µ is not 30. P-value = 0.0340 < 0.05
A Z value of -2.12 corresponds to a lower-tail area of 0.0170. Since there are two parts to the rejection region in a two-tail test, the P-value is twice this, which is 0.0340.
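The tail areas quoted for Z = -2.12 can be checked with the standard normal distribution from the Python standard library. The helper name `z_p_value` and its `tail` argument are illustrative assumptions.

```python
from statistics import NormalDist


def z_p_value(z, tail="two-sided"):
    """P-value for an observed Z statistic under the standard normal."""
    cdf = NormalDist().cdf
    if tail == "lower":          # HA: parameter < null value
        return cdf(z)
    if tail == "upper":          # HA: parameter > null value
        return 1 - cdf(z)
    return 2 * cdf(-abs(z))      # two-sided: both tails


p_one = z_p_value(-2.12, tail="lower")
p_two = z_p_value(-2.12)
print(f"one-tail p = {p_one:.4f}, two-tail p = {p_two:.4f}")
```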
Test statistic
Zcal = -2.12
Rejection region (lower-tail test)
With α = 0.05 and the “<” inequality in HA, the entire rejection region is in the left tail. The critical value is Ztab = -1.645. Reject H0 if Zcal < -1.645.
Statistical decision
We reject the H0 because -2.12 < -1.645.
Conclusion
We conclude that µ < 30.
p = .0170 this time because it is only a one tail test and not a
two tail test.
If the assumptions are correct and Ho is true, the test statistic follows
Student's t distribution with 13 degrees of freedom.
Decision rule
We have a two tailed test. With α = 0.05 it means that each tail is 0.025. The
critical ttab values with 13 df are -2.1604 and 2.1604.
We reject H0 if the tcal < -2.1604 or tcal > 2.1604.
Do not reject H0 because tcal = -1.58 is not in the rejection region. Based on the sample data, it is possible that µ = 35. P-value = 0.1375
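However, with a large sample size, we know from the Central Limit Theorem that the sample mean is approximately normally distributed, so the Z test can be used even when σ is estimated by s.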
Example: The National Center for Health Statistics (NCHS) reports the mean
total cholesterol for adults is 203. Is the mean total cholesterol in
Framingham Heart Study participants significantly different?
In 3310 participants the mean is 200.3 with a standard deviation of 36.8.
H0: µ = 203
H1: µ ≠ 203, α = 0.05
Test statistic: $Z = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$
Decision rule: Reject H0 if z > 1.96 or if z < -1.96
Compute test statistic:
$Z = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{200.3 - 203}{36.8/\sqrt{3310}} = -4.22$
Conclusion: Reject H0 because -4.22 < -1.96. We have statistically significant evidence at α = 0.05 to show that the mean total cholesterol is different in the Framingham Heart Study participants.
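The Framingham cholesterol test can be reproduced with a minimal sketch of the one-sample Z test. The function name is hypothetical, and the sample SD is used in place of σ, as the large sample size allows.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, mu0, s, n):
    """Z statistic for H0: mu = mu0, using the sample SD for large n."""
    return (sample_mean - mu0) / (s / sqrt(n))


z = one_sample_z(200.3, 203, 36.8, 3310)
critical = NormalDist().inv_cdf(0.975)    # two-sided test at alpha = 0.05
reject = abs(z) > critical
print(f"Z = {z:.2f}, critical = ±{critical:.2f}, reject H0: {reject}")
```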
2. Hypothesis Tests for Proportions
Involves categorical values
Two possible outcomes
“Success” (possesses a certain characteristic)
“Failure” (does not possesses that characteristic)
P-value = 0.2542
We do not have sufficient evidence to conclude that the probability of
Example:
The NCHS reports that the prevalence of cigarette smoking among adults
in 2002 is 21.1%. Is the prevalence of smoking lower among
participants in the Framingham Heart Study? In 3536 participants, 482
reported smoking.
H0: p = 0.211
H1: p < 0.211, α = 0.05
Test statistic: $Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$
Decision rule: Reject H0 if z < -1.645
Compute test statistic, with $\hat{p}$ = 482/3536 = 0.136:
$Z = \frac{0.136 - 0.211}{\sqrt{\frac{0.211(1 - 0.211)}{3536}}} = -10.93$
Since -10.93 < -1.645, we reject H0.
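The smoking-prevalence test can be sketched as below. The function name is an assumption; the code uses the unrounded sample proportion 482/3536, so the statistic comes out near -10.9 rather than exactly the -10.93 obtained with the rounded value 0.136.

```python
from math import sqrt


def one_sample_proportion_z(x, n, p0):
    """Z statistic for H0: p = p0, with the SE computed under H0."""
    p_hat = x / n
    se = sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / se


z = one_sample_proportion_z(482, 3536, 0.211)
reject = z < -1.645   # lower-tail test at alpha = 0.05
print(f"Z = {z:.2f}, reject H0: {reject}")
```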
Summary
3. Hypothesis Testing about the Difference Between Two
Population Means
When studying one-sample tests for a continuous random variable, the unknown mean μ of
a single population was compared to some known value μ0.
We are usually interested in comparing the means of two different populations when the
values of both means are unknown
Independent Samples
Two Sample Means,
3.1 Known Variances (Independent Samples)
When two independent samples are drawn from normally distributed populations with known variances, the test statistic for testing the H0 of equal population means is:
$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
Example:
Researchers wish to know whether there is a difference in mean serum uric acid (SUA) levels between normal individuals and individuals with Down’s syndrome. The mean SUA levels for 12 individuals with Down’s syndrome and 15 normal individuals are 4.5 and 3.4 mg/100 ml, with variances $\sigma_1^2 = 1$ and $\sigma_2^2 = 1.5$, respectively. Is there a difference between the means of the two groups at α = 5%?
Hypotheses:
H0: µ1 − µ2 = 0, HA: µ1 − µ2 ≠ 0
Example:
We wish to know if we may conclude, at the 95% confidence level, that
smokers, in general, have greater lung damage than do non-smokers.
Hypotheses:
H0: µ1-µ2 = 0, HA: µ1 > µ2
With α = 0.05 and df = 23, the critical value of ttab is 1.7139. We reject H0
if tcal > 1.7139.
Test statistic: tcal = 2.6563
Reject H0 because 2.6563 > 1.7139. On the basis of the data, we conclude that µ1 > µ2.
II. Unequal variances (Independent samples)
We are still interested in testing
H0 : μ1 = μ2 vs HA: μ1 ≠ μ2
The test statistic used is:
$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
To compute the test statistic, we simply substitute $s_1^2$ for $\sigma_1^2$ and $s_2^2$ for $\sigma_2^2$.
If $t_{cal} > t_{d'',\alpha/2}$ or $t_{cal} < -t_{d'',\alpha/2}$, then reject H0.
Example:
Suppose we want to compare the characteristics of tuberculosis meningitis for
patients infected with HIV and those not infected with HIV. In particular, we are
interested in comparing age at diagnosis. A random sample of n1 = 37 HIV
infected patients has mean age at diagnosis $\bar{x}_1$ = 27.9 years and $s_1$ = 5.6 years. A sample of n2 = 19 uninfected patients has mean age at diagnosis $\bar{x}_2$ = 38.8 years and $s_2$ = 21.7 years.
The test statistic is:
$t = \frac{27.9 - 38.8}{\sqrt{\frac{5.6^2}{37} + \frac{21.7^2}{19}}} = -2.15$
For a t distribution with 19 df, the area to the left of tcal = −2.15 is between 0.01 and 0.025.
Therefore, 0.02 < p < 0.05 (tcal = −2.15 < $-t_{19,\,0.025}$ = −2.093).
For a test conducted at α= 0.05, H0 is rejected
We conclude that among patients diagnosed with tuberculosis meningitis,
those who are infected with HIV tend to be younger than those who are
not
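The unequal-variance calculation for the tuberculosis meningitis data can be sketched as follows. The function name is hypothetical, and the degrees of freedom are computed with the Welch-Satterthwaite approximation, which is an assumption about how the value 19.24 quoted for this example was obtained.

```python
from math import sqrt


def welch_t(mean1, s1, n1, mean2, s2, n2):
    """t statistic and approximate df for two samples with unequal variances."""
    v1, v2 = s1**2 / n1, s2**2 / n2          # variance of each sample mean
    t = (mean1 - mean2) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


# HIV-infected (n = 37) versus uninfected (n = 19) patients
t, df = welch_t(27.9, 5.6, 37, 38.8, 21.7, 19)
print(f"t = {t:.2f}, df = {df:.2f}")
```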
3.3. Sampling from populations that are not normally distributed
In this situation, the results of the CLT may be employed if sample sizes
are large (≥30).
If the population variances are known, they are used; but if unknown, the sample variances are used as estimates.
4. Hypothesis Testing for Paired Samples
Two samples are paired when each data point of the first sample is matched and is
related to a unique data point of the second sample.
Tests means of 2 related populations:
- Paired or matched samples
- Repeated measures (before/after)
- Longitudinal or follow-up study
The Paired t Test
The test statistic for the mean difference is
$t = \frac{\bar{d} - \mu_d}{s_d/\sqrt{n}}$
where t has n − 1 df and $s_d$ is:
$s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}}$
Assumptions:
- Both populations are normally distributed
5. Hypothesis Tests about the Difference Between Two
Population Proportions
Since we begin by assuming the null hypothesis is true, we assume p1 = p2 and pool the two estimates.
The pooled estimate for the overall proportion is:
$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$
where x1 and x2 are the numbers of successes in the two samples.
Example:
A study was conducted to investigate the possible cause of gastroenteritis outbreak
following a lunch served in a high school cafeteria. Among the 225 students who ate
the sandwiches, 109 became ill. While, among the 38 students who did not eat the
sandwiches, 4 became ill. Is there a significant difference between the two groups at α = 5%?
We wish to test H0: p1 = p2 versus HA: p1 ≠ p2.
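A sketch of the pooled two-proportion Z test applied to the gastroenteritis data is given below. The function name is illustrative, and the resulting statistic is computed here rather than taken from the slides, so it should be read as a check under the pooled-SE assumption.

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """Pooled Z statistic for H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)            # overall proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se, pooled


# 109 of 225 sandwich eaters ill versus 4 of 38 non-eaters ill
z, pooled = two_proportion_z(109, 225, 4, 38)
reject = abs(z) > 1.96    # two-sided test at alpha = 0.05
print(f"pooled p = {pooled:.4f}, Z = {z:.2f}, reject H0: {reject}")
```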
Chi-square (χ²) test of independence/association
The chi squared test for independence tests whether two categorical variables are
independent of one another.
A non-parametric test that is used to measure the association between two categorical
variables.
The data is often summarized in a contingency table.
Example: Suppose a new postoperative procedure is administered to a group of patients
at a particular hospital.
The chi-square test for independence tests the null hypothesis that the row and column variables are independent of one another.
A chi-square (χ²) statistic is used to investigate whether distributions of categorical (i.e. nominal/ordinal) variables differ from one another.
Assumptions for Test of Goodness of Fit
H0: The two variables are independent (there is no relationship between the two variables).
H1: The two variables are dependent (there is a relationship between the two variables).
If the null hypothesis is rejected, there is some relationship between the variables.
χ² tests are based on the agreement between expected (under H0) and observed (sample) frequencies.
Degrees of freedom are calculated as: (# rows – 1) × (# columns – 1)
If H0 is true, χ² will be close to 0; if H0 is false, χ² will be large.
Reject H0 if χ² > critical value from the χ² table
General Notation for a chi square 2x2 Contingency Table
              Variable 1
Variable 2    Data Type 1   Data Type 2   Totals
Category 1    a             b             a+b
Category 2    c             d             c+d
Total         a+c           b+d           a+b+c+d
$\chi^2 = \frac{(ad - bc)^2 (a+b+c+d)}{(a+b)(c+d)(a+c)(b+d)}$
Example for a chi square 2x2 Contingency Table
Example:
A sample of 200 college students participated in a study designed to evaluate the level of college students’ knowledge of a certain group of common diseases. The following table shows the students classified by major field of study and level of knowledge of the group of diseases.
Do these data suggest that there is a relationship between knowledge of the group of diseases and major field of study of the college students from which the present sample was drawn? Let α = 0.05.
Observed cells (four-fold table):
16     24
20     140
Expected cells:
7.2    32.8
28.8   131.2
$\chi^2 = \frac{(ad - bc)^2\, n}{(a+b)(c+d)(a+c)(b+d)} = \frac{(16 \times 140 - 24 \times 20)^2 \times 200}{40 \times 160 \times 36 \times 164} \approx 16.40$
$\chi^2_{0.05,\,1} = 3.84$
Reject H0 at α = 0.05.
There is a relationship between knowledge of the group of diseases and major field of study of the college students.
Students majoring in premedical studies have higher rates of knowledge of the diseases.
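The 2x2 example can be verified in two ways: with the shortcut formula for a four-fold table and with the general sum of (O − E)²/E over all cells. The sketch below does both; the function names are illustrative assumptions, and the critical value 3.84 is taken from the example.

```python
def chi_square_2x2(a, b, c, d):
    """Shortcut chi-square for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + b) * (c + d) * (a + c) * (b + d))


def chi_square_general(observed):
    """Pearson chi-square from a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    return stat, expected


shortcut = chi_square_2x2(16, 24, 20, 140)
general, expected = chi_square_general([[16, 24], [20, 140]])
print(f"shortcut = {shortcut:.2f}, general = {general:.2f}")
print("reject H0:", shortcut > 3.84)   # critical value for 1 df, alpha = 0.05
```

The general function also accepts larger tables, for example a 3x3 table with (3 − 1)(3 − 1) = 4 degrees of freedom.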
The χ² statistic
Pearson’s chi-square test uses the following formula to calculate the chi-square (χ²) statistic:
$\chi^2 = \sum \frac{(O - E)^2}{E}$
Where:
• χ² is the chi-square statistic
• Σ is the summation operator (i.e., it “takes the sum of”)
• O is the observed frequency
• E is the expected frequency
The larger the difference between the observations and expectations (O – E in the
equation), the bigger the chi-square statistic will be.
The chi-square statistic is compared with a critical value from a chi-square critical value table or statistical software.
Example: Hospitals and Infections
A researcher wishes to see if there is a relationship between the
hospital and the type of patient infections. A sample of 3 hospitals
was selected, and the number of infections for a specific year has
been reported. The data are shown next.
Step 1: State the hypotheses and identify the claim.
H0: The type of infection is independent of the hospital.
H1: The type of infection is dependent on the hospital.
Step 2: Find the critical value.
The degrees of freedom are (3 – 1)(3 – 1) = (2)(2) = 4.
GeoGebra can be used to find the critical value (CV).
In order to test the null hypothesis, one must compute the expected frequencies, assuming the null hypothesis is true.
Step 3: Compute the test value.
Step 4: Make the decision.
The decision is to reject the null hypothesis since 30.698 is in the critical
region.
Step 5: Conclusion
There is enough evidence to support the claim that the type of infection is related
to the hospital where they occurred.