07 Analysis of Variance

This presentation discusses hypothesis testing with multiple populations, including paired t-tests, ANOVA, and MANOVA. It begins with an overview of hypothesis testing with two populations using paired t-tests and pooled t-tests. It then covers hypothesis testing with multiple populations using analysis of variance (ANOVA) techniques such as one-way ANOVA, two-way ANOVA, and MANOVA. Key concepts and assumptions for paired t-tests, such as normality of differences, are also reviewed. Normality tests such as Shapiro-Wilk, Kolmogorov-Smirnov, and Anderson-Darling are briefly introduced.

Uploaded by

iartificial711
Lecture #07

Hypothesis Testing with Multiple Populations

Dr. Debasis Samanta


Professor
Department of Computer Science & Engineering
Quote of the day..

Try not to become a person of success, but rather try to become a person of value.
ALBERT EINSTEIN, Theoretical Physicist

CS 61061: Data Analytics 2


This presentation includes…
 Hypothesis testing with two populations
 Paired t-test
 Pooled t-test

 Hypothesis testing with multiple populations
 Analysis of variance (ANOVA)
 Basic concepts and terminologies
 One-way ANOVA
 Two-way ANOVA
 MANOVA



Single versus Multiple population



Paired t-test



The Paired t-test?
 Suppose you are given two populations and have to test whether their means are the same.

 The paired t-test is used to test whether the mean difference between pairs of measurements is zero.

Example 7.1
A drug manufacturing company invented a drug to control the blood pressure of patients. A group of patients is enrolled in an experiment and their blood pressures are noted before taking the drug. All patients are then administered the drug for a week, after which their blood pressures are measured again.

Here, we have a pair of measurements for each person, so we can compute the differences. We then test whether the mean difference is zero.

Patient ID BP (before drug) BP (after drug) Difference in BP


… … … …
… … … …



The Paired t-test?
Assumption for Paired t-test
 Subjects must be independent i.e., measurements for one subject do not affect measurements for any
other subjects.
 Each of the paired measurements must be obtained from the same subject.
 The measured differences are normally distributed.

Example 7.2
Two different techniques are invented to type messages on a small hand-held device. Technique T1
uses a touch screen and technique T2 uses a keyboard. A group of users each possessing a specific
model of a hand-held device participated in an experiment, where each of them had to type a
particular text, and their text entry speed was recorded.

User ID Text entry time with T1 Text entry time with T2 Difference in times
… … … …
… … … …

Here, we are to test if both T1 and T2 are of equal efficiency or not.



The Paired t-test Procedure
 Let n be the number of pairs to be tested, and let the differences of the pairs be $\{d_1, d_2, \ldots, d_n\}$.
 Assume that the differences are normally distributed.

Steps
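1. Calculate the mean of the differences
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} d_i$$
2. Calculate the variance of the differences
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(d_i - \bar{X}\right)^2 = \frac{n\sum d_i^2 - \left(\sum d_i\right)^2}{n(n-1)}$$
3. Calculate the test statistic
$$t = \frac{\text{Average difference}}{\text{Standard error}} = \frac{\bar{X}}{S/\sqrt{n}} = \frac{\sum d_i}{\sqrt{\dfrac{n\sum d_i^2 - \left(\sum d_i\right)^2}{n-1}}}$$
This t-value is compared with the critical value at the chosen level of significance (with n − 1 degrees of freedom) to decide whether to reject the null hypothesis that the mean difference is zero.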
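The paired t-test procedure gives the test statistic in two forms: one via the mean and standard deviation of the differences, and one written directly in terms of the sums of d_i and d_i^2. The following minimal Python sketch checks that both forms agree. The function names and the sample differences are hypothetical and chosen only for illustration.

```python
import math


def paired_t_definition(d):
    """t = mean(d) / (S / sqrt(n)), where S is the sample standard deviation."""
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    return mean / math.sqrt(var / n)


def paired_t_shortcut(d):
    """t = sum(d) / sqrt((n * sum(d^2) - (sum d)^2) / (n - 1))."""
    n = len(d)
    s1 = sum(d)
    s2 = sum(x * x for x in d)
    return s1 / math.sqrt((n * s2 - s1 ** 2) / (n - 1))


# Hypothetical differences, for illustration only.
diffs = [2.0, -1.0, 3.0, 0.5, 1.5]
t_def = paired_t_definition(diffs)
t_short = paired_t_shortcut(diffs)
print(t_def, t_short)
```

The shortcut form avoids a second pass over the data, which is convenient for hand calculation; both forms are algebraically the same quantity.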
Example 7.3
An instructor wants to check if two exams are of equal difficulty level. For this, she conducted the two exams with a group of 16 students. The scores on the two exams are shown in the table below (Difference = Exam II − Exam I).

Student ID   Exam I   Exam II   Difference
1            63       69          6
2            65       65          0
3            56       62          6
4            100      91         -9
5            88       78        -10
6            83       87          4
7            77       79          2
8            92       88         -4
9            90       85         -5
10           84       92          8
11           68       69          1
12           74       81          7
13           87       84         -3
14           64       75         11
15           71       84         13
16           88       82         -6

Check the validity of the Paired t-test:


1. Subjects are independent: Each student does their own work on the two exams (i.e. no copy from one to
another, etc.)
2. The same group of subjects: Each pair of measurements is obtained from the same subject; each student takes both exams.
3. Normal distribution of differences.
Example 7.3 (continued)

1. Hypotheses
$H_0: \mu_d = 0$ (the population mean of the differences is zero)
$H_1: \mu_d \neq 0$ (the population mean of the differences is not zero)

2. Standard error (the estimated standard deviation of the sample mean difference)
$$S_{\bar{X}} = \frac{S}{\sqrt{n}} = 1.75, \qquad \bar{x} = 1.31$$

3. Test statistic (the mean difference divided by its standard error)
$$t = \frac{\bar{x}}{S_{\bar{X}}} = \frac{\bar{x}}{S/\sqrt{n}} = \frac{1.31}{1.75} = 0.750$$

4. The critical t-value from the table for a two-tailed test with α = 0.05 and df = n − 1 = 16 − 1 = 15 is 2.131.

5. Decision: The test statistic is lower than the critical t-value at α = 0.05. Thus we fail to reject the null hypothesis that the mean difference is zero. In other words, the data give no evidence that the two exams differ in difficulty.

p-value estimation: $p = P(|T| > t) = 0.4650$
This means that, when the underlying population mean difference is zero, the probability of seeing a sample mean difference at least as extreme as 1.31 is about 47%. Since p is far larger than α = 0.05, the observed difference is not statistically significant.
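The arithmetic of Example 7.3 can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the scores are copied from the table in the example, the critical value 2.131 is the one quoted on the slide, and the variable names are illustrative.

```python
import math

# Scores of the 16 students, copied from the table of Example 7.3.
exam1 = [63, 65, 56, 100, 88, 83, 77, 92, 90, 84, 68, 74, 87, 64, 71, 88]
exam2 = [69, 65, 62, 91, 78, 87, 79, 88, 85, 92, 69, 81, 84, 75, 84, 82]

# Difference = Exam II - Exam I for each student.
diffs = [b - a for a, b in zip(exam1, exam2)]
n = len(diffs)

mean_d = sum(diffs) / n
s = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
se = s / math.sqrt(n)   # standard error of the mean difference
t = mean_d / se         # paired t statistic

T_CRITICAL = 2.131      # two-tailed, alpha = 0.05, df = 15 (value from the slide)
reject_h0 = abs(t) > T_CRITICAL

print(f"mean = {mean_d:.2f}, SE = {se:.2f}, t = {t:.3f}, reject H0: {reject_h0}")
```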
Validity of the Paired t-test

 An important assumption is that the paired t-test is applicable only if the differences are drawn from a normally distributed population.
 In many situations (e.g., when the sample size is small or the population is uniformly distributed) this assumption of normality may not be satisfied.

 In such situations, we can use a non-parametric approach to statistical analysis.
 A non-parametric approach is suitable when a large amount of data is available and/or the data values are not necessarily normally distributed.

 For the paired t-test, the non-parametric counterpart is the Wilcoxon Signed-Rank test.
Normality Testing
 Well-known tests for checking whether data come from a normally distributed population are

1. Shapiro-Wilk test

 applicable for 𝑛 < 50

2. Kolmogorov-Smirnov test

 is used for 𝑛 ≥ 50

3. Anderson-Darling test



Shapiro-Wilk Test
 The Shapiro-Wilk test is more appropriate for small sample sizes (< 50 samples), although it can also handle larger sample sizes.
 It is essentially a goodness-of-fit test and is used to determine if a random sample $x_i$, i = 1, 2, …, n, is drawn from a normal distribution with true mean $\mu$ and variance $\sigma^2$.
 In other words, it is a hypothesis test where
 $H_0$: The random sample was drawn from a normal population
 $H_1$: The random sample does not follow a normal distribution

 To test the hypothesis, use the Shapiro-Wilk test statistic, which is given by
$$w = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
 Here, $x_{(i)}$ are the ordered sample values (the i-th smallest value) and $a_i$ are constants generated by the expression
$$(a_1, a_2, \ldots, a_n) = \frac{m^T V^{-1}}{\left(m^T V^{-1} V^{-1} m\right)^{1/2}}$$
where $m$ is the vector of expected values of the order statistics of a standard normal sample of size n and $V$ is their covariance matrix.
(This vector can be obtained from the Shapiro-Wilk coefficient tables given in statistics textbooks.)


Simplified Shapiro-Wilk Test

1. $H_0$: The data follow a normal distribution
$H_1$: The data do not follow a normal distribution

2. Rank the sample values in increasing order
$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$$

3. Calculate
$$b = a_1\left(x_{(n)} - x_{(1)}\right) + a_2\left(x_{(n-1)} - x_{(2)}\right) + \cdots + a_m\left(x_{(n-m+1)} - x_{(m)}\right)$$
(here $m = \lfloor n/2 \rfloor$ is the number of pairs)

4. Calculate the test statistic
$$w = \frac{b^2}{(n-1)S^2}$$
where S is the sample standard deviation.

5. Obtain the critical value $w^*$ from the Shapiro-Wilk table.

6. If $w < w^*$, then reject the null hypothesis.
Shapiro-Wilk Test

Example 7.4
Given ranked data: [20 20 21 26 43 43 54 54 55]

b = 0.5888 × (55 − 20) + 0.3244 × (54 − 20) + 0.1976 × (54 − 21) + 0.0947 × (43 − 26) = 39.7683

Calculate the test statistic
$$w = \frac{b^2}{(n-1)s^2} = \frac{(39.7683)^2}{(9-1)(15.52)^2} = 0.8203$$
The critical value for n = 9 and α = 0.05 is $w^* = 0.8293$.
Since $w < w^*$, we reject the null hypothesis; that is, at the 5% significance level we conclude that the given data are not from a normal distribution.
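The simplified Shapiro-Wilk procedure can be sketched in Python as follows. The function name `shapiro_wilk_w` is hypothetical; the data, the coefficients a_1, …, a_4 and the critical value are the ones quoted in Example 7.4, and a real application would look the coefficients up in a Shapiro-Wilk table for the given n.

```python
def shapiro_wilk_w(sorted_x, coeffs):
    """Return (b, w) for the simplified Shapiro-Wilk statistic.

    sorted_x : sample values in increasing order
    coeffs   : tabulated coefficients a_1, ..., a_m with m = n // 2
    """
    n = len(sorted_x)
    # b pairs the largest value with the smallest, the 2nd largest with the 2nd smallest, ...
    b = sum(a * (sorted_x[n - 1 - i] - sorted_x[i]) for i, a in enumerate(coeffs))
    mean = sum(sorted_x) / n
    s2 = sum((x - mean) ** 2 for x in sorted_x) / (n - 1)  # sample variance
    return b, b * b / ((n - 1) * s2)


data = [20, 20, 21, 26, 43, 43, 54, 54, 55]   # ranked data from Example 7.4
a = [0.5888, 0.3244, 0.1976, 0.0947]          # coefficients used on the slide (n = 9)
W_CRITICAL = 0.8293                           # critical value quoted on the slide

b, w = shapiro_wilk_w(data, a)
reject_normality = w < W_CRITICAL
print(f"b = {b:.4f}, w = {w:.4f}, reject normality: {reject_normality}")
```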
Pooled Two-Sampled t-test

 There are many situations in which we take two samples that are not necessarily of equal size; however, the two samples are drawn from a given population (which implies that the population variance is the same for both).

 The test that assumes equal population variances is referred to as the pooled t-test. Here, pooling refers to finding a weighted average of the two independent sample variances.

 The pooled test statistic uses this weighted average of the two sample variances.
Pooled Two-Sampled t-Test
 Suppose two samples of sizes $n_1$ and $n_2$ have variances $S_1^2$ and $S_2^2$, respectively. Then the pooled sample variance can be calculated as
$$S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2} = \frac{n_1-1}{n_1+n_2-2}\,S_1^2 + \frac{n_2-1}{n_1+n_2-2}\,S_2^2$$
 Special case: If $n_1 = n_2 = n$, then
$$S_p^2 = \frac{1}{2}\left(S_1^2 + S_2^2\right)$$
 Note
The larger sample receives more weight.
The degrees of freedom are $n_1 + n_2 - 2$; if $n_1 = n_2 = n$, then $df = 2(n-1)$.


Pooled Two-Sampled t-Test
 The pooled test statistic follows Student's t-distribution with $n_1 + n_2 - 2$ degrees of freedom.
 The t-statistic can be calculated as
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{\bar{x}_1 - \bar{x}_2}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
 The hypothesis test procedure follows the same steps as standard hypothesis testing.


Pooled Two-Sampled t-Test
Example 7.5
 Suppose two independent random soil samples are collected from two different agricultural areas to test whether the two areas have equal fertilization rates. The fertilization rates (in cm/month) of the two samples are given in the table below:

Sample 1 Sample 2
3.2 4.5
4.5 6.2
3.8 5.8
4.0 6.0
3.7 7.1
3.2 6.8
4.1 7.2

 Test the hypothesis that the two agricultural areas have the same fertilization rate. Assume the significance level α = 5%.
1. The test hypotheses are
$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 < \mu_2$

2. The pooled sample variance is
$$S_p^2 = \frac{(7-1)(0.474)^2 + (7-1)(0.936)^2}{7+7-2} = 0.55$$

3. The t-statistic based on the sample is
$$t = \frac{3.79 - 6.23}{\sqrt{0.55\left(\dfrac{1}{7} + \dfrac{1}{7}\right)}} = \frac{-2.44}{0.396} = -6.16$$

4. This is a one-sided test with 7 + 7 − 2 = 12 degrees of freedom. The critical value with 12 degrees of freedom and α = 0.05 is −1.782.
5. Decision: The sample-based test statistic is less than the critical value, so we reject the null hypothesis.

6. Conclusion: The two agricultural areas do not have equal fertilization rates.

p-value estimation:
$p = P(T \le -6.16)$, and we can check that $p \le 0.05$.

Confidence interval estimation:
A CI-based quantitative estimate can also be obtained with pooled two-sample hypothesis testing.
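In this case, the standard error is $S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$.

Thus
$$CI = (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\, S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
$$CI = (3.79 - 6.23) \pm 1.782\sqrt{0.55\left(\frac{1}{7} + \frac{1}{7}\right)} = -2.44 \pm 0.7064 = (-3.146,\ -1.734)$$

Note: The multiplier 1.782 is the one-sided critical value at α = 0.05 with 12 degrees of freedom, so this interval corresponds to a 90% two-sided confidence level. The whole interval is negative, i.e., $\mu_1 - \mu_2 < 0$; in magnitude, the difference lies between 1.734 and 3.146.

The result implies that the land of sample 2 has a higher rate than the land of sample 1.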
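The pooled two-sample calculation of Example 7.5 can be checked with a short Python sketch. It uses only the standard library; the data are copied from the example, the critical value 1.782 is the one quoted on the slides, and the variable names are illustrative. Small differences from the slide values come from the slide rounding intermediate results.

```python
import math
import statistics

# Fertilization rates from Example 7.5.
sample1 = [3.2, 4.5, 3.8, 4.0, 3.7, 3.2, 4.1]
sample2 = [4.5, 6.2, 5.8, 6.0, 7.1, 6.8, 7.2]

n1, n2 = len(sample1), len(sample2)
m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
v1, v2 = statistics.variance(sample1), statistics.variance(sample2)  # sample variances

# Pooled variance: weighted average of the two sample variances.
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
t = (m1 - m2) / se

T_CRITICAL = 1.782                 # one-sided, alpha = 0.05, df = 12 (from the slide)
reject_h0 = t < -T_CRITICAL        # left-tailed test, H1: mu1 < mu2

half_width = T_CRITICAL * se       # interval half-width with the slide's multiplier
ci = (m1 - m2 - half_width, m1 - m2 + half_width)

print(f"Sp^2 = {sp2:.2f}, t = {t:.2f}, reject H0: {reject_h0}")
print(f"CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```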
Analysis of Variance



Example : Single vs. Multiple population



What is Analysis of Variance

Single population

Multiple population



Example 7.6
Suppose, students from different coaching centers take the same IIT-JEE
Examination in a year. We want to see if one center outperforms the others. The
test data set is given below with means 𝜇A , 𝜇B , 𝜇C and 𝜇D , respectively.

Center A Center B Center C Center D


… … … …
… … … …
𝜇A 𝜇B 𝜇C 𝜇D

ANOVA determines whether there is any difference between the means of different groups. In other words, it tests for differences among the population means by examining the variation within each sample relative to the variation between the samples. Analysis of variance tests the hypothesis that the means of two or more populations are equal.
Example 7.7
Let us consider another simple example to understand an application of
ANOVA test.

Suppose there are three different drugs A, B, and C available from three drug manufacturing companies. We have to study the effectiveness of the drugs in curing a disease. The data are given to us as three samples, one per drug, with population means and variances as follows.

Drug A: $\mu_A$, $\sigma_A^2$
Drug B: $\mu_B$, $\sigma_B^2$
Drug C: $\mu_C$, $\sigma_C^2$
Example 7.8
A typical example of sample data is shown below.

Drug A    Drug B    Drug C
100.07    90.54     108.00
90.60     105.05    107.25
103.45    84.15     92.46
95.70     83.18     105.31
110.00    92.35     83.27
125.28    100.00    100.48
121.32    88.45     80.24
114.46    77.33     97.08

Note
1. Here, the entries represent the measured times to cure the disease after the administration of the drugs.
2. For each drug, a separate group of subjects participated in the test.
3. Group sizes are not necessarily the same.

Given this data, the objective is to test at the 0.05 significance level whether the mean times for the three drugs to cure the disease are equal (this is the null hypothesis H0).
The Issue in Statistical Testing
A recent study claims that using music in a class enhances the concentration
and consequently helps students absorb more information.

 What if it affected the results of the students in a negative way?


or
 What kind of music would be a good choice for this?

We should have some proof that it actually works or not.



Example 7.9: Design of Experiment
• The teacher decided to try it on a smaller group of randomly selected students from three different classes.

Three groups of ten randomly selected students were taken from three different classrooms.

The three classrooms were given three different study environments.
 Classroom A had constant music being played in the background
 Classroom B had variable music being played in the background
 Classroom C was a regular class with no music playing

 A test was conducted after one month for all three groups and their test scores were collected.


Test Result
Class                       Test scores of students (out of 10)   Mean
Class A (constant music)    7  9  5  8  6  8  6  10  7  4         7
Class B (variable music)    4  3  6  2  7  5  5  4   1  3         4
Class C (no music)          6  1  3  5  3  4  6  5   7  3         4.3

Grand mean: 5.1
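The group means and the grand mean in the test result table can be recomputed with the short Python sketch below. The scores are copied from the table; the dictionary keys and variable names are illustrative.

```python
# Test scores (out of 10) for the three classrooms, copied from the table.
scores = {
    "A (constant music)": [7, 9, 5, 8, 6, 8, 6, 10, 7, 4],
    "B (variable music)": [4, 3, 6, 2, 7, 5, 5, 4, 1, 3],
    "C (no music)":       [6, 1, 3, 5, 3, 4, 6, 5, 7, 3],
}

# Mean of each group.
group_means = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Grand mean: mean of all observations combined, irrespective of the group.
all_scores = [x for vals in scores.values() for x in vals]
grand_mean = sum(all_scores) / len(all_scores)

for name, mean in group_means.items():
    print(f"Class {name}: mean = {mean}")
print(f"Grand mean = {grand_mean}")
```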


Observations from the results
 It is noticed that the mean score of students from Group A is definitely greater
than the other two groups, so the treatment must be helpful.

 Maybe it’s true, but there is also a slight chance that we happened to select the
best students from class A, which resulted in better test scores (remember, the
selection was done at random).

 This leads to a few questions:

1. How do we decide that these three groups performed differently because of the
different situations and not merely by chance?

2. In a statistical sense, how different are these three samples from each other?



Analysis of Variance (ANOVA)
Definition 7.1

• Analysis of Variance (ANOVA) is derived from a partitioning of total variability into its component parts.
• ANOVA is a statistical technique used to check whether the means of two or more groups are significantly different from each other.
• ANOVA checks the impact of one or more factors by comparing the means of different samples.

 This technique was invented by Sir Ronald Aylmer Fisher (1921), and is often referred to as Fisher's ANOVA.


Why ANOVA?



Statistical Inferences
• ANOVA is a statistical technique.
• It is similar in application to techniques such as the t-test, z-test and χ²-test in that it is used to compare means and the relative variance between them.
 Why not use the t-test, z-test or χ²-test?

 Why analysis of variance for comparing means?


Using t-test
The t-test is used to:

• infer the mean of a single population
• compare the means of two populations

However, the t-test is not suitable for comparing the means of more than two populations.


Extending the two population procedure
• Construct pairwise comparisons on all means.
• For 5 populations → 10 possible pairs.
• With α = 0.05, the probability of correctly failing to reject the null hypothesis in all 10 tests is $(0.95)^{10} \approx 0.60$, assuming that the tests are independent.
• Thus the true value of α for this set of comparisons is about 0.40, instead of 0.05.
• That is, multiple pairwise testing inflates the Type I error.
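The inflation of the Type I error under repeated pairwise testing can be computed directly. The sketch below assumes, as the slide does, that the pairwise tests are independent; the function name `familywise_error_rate` is hypothetical.

```python
from math import comb


def familywise_error_rate(k, alpha=0.05):
    """Return (number of pairwise tests, probability of at least one false rejection).

    Assumes the pairwise tests are independent, as stated on the slide.
    """
    m = comb(k, 2)                  # number of distinct pairs among k populations
    return m, 1 - (1 - alpha) ** m


pairs, fwer = familywise_error_rate(5)
print(f"{pairs} pairs, overall Type I error = {fwer:.4f}")
```

With k = 5 populations this gives 10 pairs and an overall error rate of roughly 0.40, which is the motivation for testing all means at once with ANOVA.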


Extending the two population procedure
• Statistical Inference I
• A car magazine wishes to compare the average petrol consumption of THREE models of car and has available SIX vehicles of each model.
Model 1 Model 2 Model 3

• There are THREE populations
• There are three samples, one from each population, each of size six


Extending the two population procedure
• Statistical Inference II
• A teacher is interested in comparing the average percentage marks obtained in the examinations of five different subjects and has available the marks of eight students who all completed each examination.
Subject 1 Subject 2 Subject 3 Subject 4 Subject 5

• What is the number of populations?
• How many samples are there? What are their sizes? Are the samples independent of each other?
Example 7.10: Why ANOVA?
Consider the two sets of contrived data shown below:

            Set 1                               Set 2
Sample 1   Sample 2   Sample 3      Sample 1   Sample 2   Sample 3
5.7        9.4        14.2          3.0        5.0        11.0
5.9        9.8        14.4          4.0        7.0        13.0
6.0        10.0       15.0          6.0        10.0       16.0
6.1        10.2       15.6          8.0        13.0       17.0
6.3        10.6       15.8          9.0        15.0       18.0
ȳ = 6.0    ȳ = 10.0   ȳ = 15.0      ȳ = 6.0    ȳ = 10.0   ȳ = 15.0

Observations:
 Looking only at the means, we can see that the three sample means are identical in both sets.
 Using the means alone, we would state that there is no difference between the two sets.


Box plots of the two experiments
Observation from Box plots
 It appears that there is stronger evidence of differences among means in Set 1 than among means in Set 2.

 The observations within the samples are more closely bunched in Set 1 than they are in Set 2.

 We know that sample means from populations with smaller variances will also be less variable (Central Limit Theorem).
 Thus, although the variances among the means for the two sets are identical, the variance among the observations within the individual samples is smaller for Set 1, and this is the reason for the apparently stronger evidence of different means.

 This observation is the basis for using the analysis of variance for making inferences about differences among means.
 The analysis of variance is based on the comparison of the variance among the means of the populations to the variance among sample observations within the individual populations.
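The comparison described above can be made concrete for the two contrived data sets of Example 7.10. The sketch below assumes the standard definitions of the between-group and within-group mean squares (sum of squares divided by k − 1 and N − k, respectively), which the slides have described only informally at this point; the function name `mean_squares` is hypothetical.

```python
def mean_squares(groups):
    """Return (between-group mean square, within-group mean square)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variability of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    return ss_between / (k - 1), ss_within / (n_total - k)


set1 = [[5.7, 5.9, 6.0, 6.1, 6.3],
        [9.4, 9.8, 10.0, 10.2, 10.6],
        [14.2, 14.4, 15.0, 15.6, 15.8]]
set2 = [[3.0, 4.0, 6.0, 8.0, 9.0],
        [5.0, 7.0, 10.0, 13.0, 15.0],
        [11.0, 13.0, 16.0, 17.0, 18.0]]

msb1, msw1 = mean_squares(set1)
msb2, msw2 = mean_squares(set2)
print(f"Set 1: between = {msb1:.2f}, within = {msw1:.2f}, ratio = {msb1 / msw1:.1f}")
print(f"Set 2: between = {msb2:.2f}, within = {msw2:.2f}, ratio = {msb2 / msw2:.1f}")
```

Both sets have the same between-group value because their sample means are identical; only the within-group value differs, so the ratio is much larger for Set 1.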
The problems to be analyzed
 Factor
 A characteristic under consideration, thought to influence the measured observations

 Level (also called group)
 A value of the factor

Level   Observations                                  Total   Mean
1       $y_{11}$  $y_{12}$  …  $y_{1n_1}$
2       $y_{21}$  $y_{22}$  …  $y_{2n_2}$
…       …         …         …  …
k       $y_{k1}$  $y_{k2}$  …  $y_{kn_k}$


Between Group Variability
Variance among the means of the populations
• Consider the distributions of the two samples below.

• As these samples overlap, their individual means won't differ by a great margin.

• Hence, the difference between their individual means and the grand mean won't be significant.

• The mean is the simple (arithmetic) average of a range of values. There are two kinds of means that we use in ANOVA calculations: the separate sample means ($\mu_1$ and $\mu_2$) and the grand mean $\mu$.

• The grand mean is the mean of the sample means or the mean of all observations combined, irrespective of the sample (the two coincide when the samples are of equal size).
Between Group Variability
Now consider these two sample distributions. As the samples differ from each other by a big margin, their individual means also differ. The difference between the individual means and the grand mean is therefore also significant.

 Such variability between the distributions is called between-group variability, or variance among the means of the populations.

 Each sample is looked at, and the difference between its mean and the grand mean is calculated to measure the variability.

 If the distributions overlap or are close, the grand mean will be similar to the individual means, whereas if the distributions are far apart, the differences between the individual means and the grand mean will be large.


Within Group Variability
Variance among sample observations
Consider the given distributions of three samples. As the spread (variability) of each sample is increased, their distributions overlap and they become part of a big population.

Now consider another distribution of the same three samples but with less variability. Although the means of the samples are similar to those in the above image, they seem to belong to different populations.


How ANOVA?



Some Terminologies

 Dependent variable:
The quantity that is measured in the experiment.

 Independent variable:
The thing being varied in the experiment. It is also called a factor.

 Group:
A subcategory (level) of the independent variable(s).
Example 7.11
A sample data is shown below.
Drug A Drug B Drug C
100.07 90.54 108.00
90.60 105.05 107.25
103.45 84.15 92.46
95.70 83.18 105.31
110.00 92.35 83.27
125.28 100.00 100.48
121.32 88.45 80.24
114.46 77.33 97.08

Dependent variable: Time (in hours) to cure the disease
Independent variable (factor): Drug
Groups (levels): Drug A, Drug B, Drug C
Categorization of ANOVA

Depending on the number of dependent variables and independent variables (i.e., factors), the ANOVA test can be classified as shown below.

ANOVA
  Factorial ANOVA
    One-way ANOVA
    Two-way ANOVA
  MANOVA
Factorial ANOVA

 A Factorial ANOVA is an analysis of variance test with one or more independent variable(s), that is, factor(s).

 Thus, with One-way ANOVA there is only one factor, whereas Two-way and Three-way refer to 2 and 3 factors, respectively.

 Four-way ANOVA and above are rarely used because the test results are complex and difficult to interpret.
Variants of ANOVA
Based on the number of independent variables and dependent variables considered for the study, there are different variants of ANOVA.

1. One-way ANOVA: Only one independent variable (factor) with greater than 2 levels.
2. Two-way ANOVA: Two independent variables (i.e., factors).
3. Three-way ANOVA: Three independent variables (i.e., factors).
4. Multivariate ANOVA (MANOVA): It is used to test the significance of the effect of the independent variable(s) on more than one dependent variable.


One-way ANOVA
A one-way ANOVA is an analysis of variance with only one independent variable. In other words, a one-way ANOVA is used to see if there are any significant differences between the means of the groups (levels) of the independent variable.

Example:
Effectiveness of three drugs manufactured by three drug manufacturing companies to cure a disease.

Dependent variable: Time to cure disease
Independent variable (factor): Drug
Group (level): {Drug A, Drug B, Drug C}

Drug A    Drug B    Drug C
100.07    90.54     108.00
90.60     105.05    107.25
103.45    84.15     92.46
95.70     83.18     105.31
110.00    92.35     83.27
125.28    100.00    100.48
121.32    88.45     80.24
114.46    77.33     97.08
Case Study 1
Clearly identify the dependent, independent variables and group (level)
in the following situations.

Case 1: We wanted to test the effectiveness of different types of teas (Black tea,
Green tea, No tea) on weight loss. For this purpose, we collected data with a
group of individuals randomly splitting into smaller groups and asked them to
drink a specific tea for a specific group with a certain period. The weight losses of
the individuals in each group are recorded.

Black Tea   Green Tea   No Tea
…           …           …
…           …           …
…           …           …

Dependent variable: ??
Independent variable (factor): ??
Group (level): ??

Case Study 1 (solution)

Dependent variable: Weight loss (numerical)
Independent variable (factor): Type of tea (categorical)
Group (level): 3 (Black Tea, Green Tea, No Tea) (categorical)
Case Study 2

From a population, we selected a number of individuals who are categorized into the following four categories (called weight groups): Underweight, Overweight, Obese, and Normal. For each category of individuals, their sprinting skill (i.e., running time in a 100 m race) was recorded.

In this case, identify the dependent variable, independent variable (factor) and groups (levels). Draw a table structure where data can be recorded for the ANOVA test.

UW   OW   O    N
…    …    …    …
…    …    …    …
…    …    …    …

Dependent variable: ??
Independent variable (factor): ??
Group (level): ??

Case Study 2 (solution)

Dependent variable: Sprinting speed, recorded as running time in a 100 m race (numeric quantity)
Independent variable (factor): Weight group (categorical attribute)
Group (level): 4 (UW, OW, O, N) (categorical attribute)
Two-way ANOVA
 The Two-way ANOVA is an extension of the One-way ANOVA.
 With the One-way ANOVA, we have one independent variable (one factor) affecting a dependent variable.
 With a Two-way ANOVA, there are two independent variables (two factors).
 Hence, it is also alternatively called the two-factorial ANOVA test.

 We use the Two-way ANOVA when we have one measurement variable and two nominal variables.

 In other words, if an experiment has a quantitative outcome and we have two categorical explanatory variables, then the Two-way ANOVA test is appropriate.
Case Study 3

 We want to find out if there is an interaction between Income (Low (L), Medium (M) and High (H)) and Gender (Male (M), Female (F) and Transgender (T)) for the performance score (e.g., performance in a competitive test).

1. What would a table to record the data look like?
2. Clearly identify the dependent variable, factors and levels in this case.

Case Study 3 (solution)

Gender   Income   Score
M/F/T    L/M/H    …
…        …        …
…        …        …

Dependent variable: Score in the competitive examination
Independent variables (factors): Gender, Income
Group (level): [M, F, T] × [L, M, H] = 3 × 3 = 9 levels

Note:
In this case, nine different means are to be analyzed:
{$\mu_{FL}$, $\mu_{FM}$, $\mu_{FH}$, $\mu_{ML}$, $\mu_{MM}$, $\mu_{MH}$, $\mu_{TL}$, $\mu_{TM}$, $\mu_{TH}$}
That is, the variance across the means of nine different groups is to be analyzed.
MANOVA

A MANOVA test is just an ANOVA test with several dependent variables.
Case Study 4
Suppose three video lectures of different durations, 30 minutes (S), 60 minutes (M) and 90 minutes (L), have been prepared to teach the Data Analytics course. A number of students are randomly split into smaller groups, and each group follows either the 30-minute, 60-minute or 90-minute video lectures. To test their competence in the subject, two measures, namely long-term recall (LTR) and short-term recall (STR), are measured by means of a quiz and a subjective test, respectively. These performances of the students are recorded in a table for the ANOVA test.

Clearly identify the

a) Dependent variables
b) Independent variables
c) Groups in this analysis
Case Study 4

Videos   STR Score   LTR Score
S/M/L    …           …
…        …           …
…        …           …

Dependent variables:
1. Long-term recall measurement
2. Short-term recall measurement

Independent variable (factor): Video lectures
Group (level): 3 [S, M, L]

What are the different means in this case whose variances are to be analyzed?
Case Study 5

A study was conducted to see the impact of socio-economic class (Rich (R), Middle (M) and Poor (P)) and gender (Male (M), Female (F)) on TV hours/day and study hours/day. A sample of 24 people was collected, as shown below.

Gender  S-E-C  TV-Hour  Study-Hour      Gender  S-E-C  TV-Hour  Study-Hour
M       R      5        3               F       R      2        3
M       R      4        6               F       R      3        5
M       R      3        4               F       R      5        3
M       R      2        4               F       R      4        2
M       M      4        6               F       M      9        8
M       M      3        6               F       M      6        5
M       M      5        4               F       M      7        6
M       M      5        5               F       M      8        9
M       P      7        5               F       P      8        9
M       P      4        3               F       P      9        8
M       P      3        1               F       P      3        7
M       P      7        2               F       P      5        7
Case Study 5 (solution)

Dependent variables:
1. TV hours
2. Study hours

Independent variables (factors):
1. Gender
2. Socio-economic class

Group (level):
[M, F] × [R, M, P] = 2 × 3 = 6 levels
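To see what the six groups of Case Study 5 look like in practice, the sketch below organizes the 24 records by (gender, socio-economic class) and computes the mean TV hours and mean study hours in each cell. The records are copied from the data table of the case study; the variable names and the tuple layout are illustrative choices.

```python
from collections import defaultdict

# (gender, socio-economic class, TV hours/day, study hours/day)
records = [
    ("M", "R", 5, 3), ("M", "R", 4, 6), ("M", "R", 3, 4), ("M", "R", 2, 4),
    ("M", "M", 4, 6), ("M", "M", 3, 6), ("M", "M", 5, 4), ("M", "M", 5, 5),
    ("M", "P", 7, 5), ("M", "P", 4, 3), ("M", "P", 3, 1), ("M", "P", 7, 2),
    ("F", "R", 2, 3), ("F", "R", 3, 5), ("F", "R", 5, 3), ("F", "R", 4, 2),
    ("F", "M", 9, 8), ("F", "M", 6, 5), ("F", "M", 7, 6), ("F", "M", 8, 9),
    ("F", "P", 8, 9), ("F", "P", 9, 8), ("F", "P", 3, 7), ("F", "P", 5, 7),
]

# Group the two dependent variables by the combination of the two factors.
cells = defaultdict(list)
for gender, sec, tv, study in records:
    cells[(gender, sec)].append((tv, study))

# Mean (TV hours, study hours) for each of the 2 x 3 = 6 groups.
cell_means = {
    key: (sum(tv for tv, _ in obs) / len(obs), sum(st for _, st in obs) / len(obs))
    for key, obs in cells.items()
}

for key, (tv_mean, study_mean) in sorted(cell_means.items()):
    print(key, f"TV = {tv_mean:.2f}", f"Study = {study_mean:.2f}")
```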
Statistical Analysis with ANOVA

As with the statistical tests learned previously, an ANOVA test is also a hypothesis test. A few cases are illustrated below.

Case Study 1: One-way ANOVA

Drug A    Drug B    Drug C
100.07    90.54     108.00
90.60     105.05    107.25
103.45    84.15     92.46
95.70     83.18     105.31
110.00    92.35     83.27
125.28    100.00    100.48
121.32    88.45     80.24
114.46    77.33     97.08

Hypothesis testing
H0: The mean times for the three drugs to cure the disease are equal.
H1: The mean times for the three drugs to cure the disease are not all equal.
Statistical Analysis with ANOVA
Case Study 4: Two-way ANOVA
1. A two-way ANOVA estimates the main effect of each factor and an
interaction effect between the factors.

2. A main effect is similar to the effect tested in a one-way ANOVA: each
factor's effect is considered separately. For the interaction effect, all
factors are considered at the same time.

The hypotheses that can be tested are

H01: All the income groups have equal mean performance score.

H02: All the gender groups have equal mean performance score.

H03: The factors do not interact, that is, the effect of income group on
performance score is the same for every gender (and vice versa).
Statistical Analysis with ANOVA

Case Study 5: MANOVA

With this analysis, we can answer many research questions. With reference to
Case Study 5 (the effect of gender and socio-economic class on TV hours and
study hours), some of the questions include

1. Do changes to the independent variables have significant effects on the
dependent variables?
2. What are the interactions among the dependent variables?
3. What are the interactions among the independent variables?
Assumptions for ANOVA Tests

1. The populations must be close to normally distributed.
2. Samples must be independent.
3. Population variances must be equal.
4. Groups should preferably have equal sample sizes (a balanced design); the
F-test can still be applied when the sample sizes differ.

The F-test is used in ANOVA tests.
One-way ANOVA

CS 61061: Data Analytics 70


One-way ANOVA
• The purpose of the procedure is to compare the means of $k$ populations
using sample data.
• In general, the one-way ANOVA technique can be used to study the effect of
$k$ ($> 2$) levels of a single factor.

• To determine whether different levels of the factor affect the measured
observations differently, the following hypotheses are tested.

$H_0: \mu_i = \mu$ for all $i = 1, 2, \dots, k$
$H_1: \mu_i \neq \mu$ for some $i = 1, 2, \dots, k$
That is, at least one equality is not satisfied,

where $\mu_i$ is the population mean for level $i$.


Assumptions
• When applying one-way analysis of variance, three key assumptions should
be satisfied.

1. The observations are obtained independently and randomly from the
populations defined by the factor levels.
2. The population at each factor level is (approximately) normally
distributed.
3. These normal populations have a common variance, $\sigma^2$.

• Thus, for factor level $i$, the population is assumed to have the
distribution $N(\mu_i, \sigma^2)$.


One-way ANOVA
Level   Observations                         Total       Average
1       $y_{11}$  $y_{12}$  …  $y_{1n}$      $y_{1.}$    $\bar{y}_{1.}$
2       $y_{21}$  $y_{22}$  …  $y_{2n}$      $y_{2.}$    $\bar{y}_{2.}$
…       …                                    …           …
k       $y_{k1}$  $y_{k2}$  …  $y_{kn}$      $y_{k.}$    $\bar{y}_{k.}$
                                             $y_{..}$    $\bar{y}_{..}$

An entry in the table (e.g., $y_{ij}$) represents the $j$th observation taken
at level $i$ of the factor.
• There will be, in general, $n$ observations at the $i$th level ($n_i$ when
the sample sizes differ).
• $y_{i.}$ represents the total of the observations at the $i$th level.
• $\bar{y}_{i.}$ represents the average of the observations at the $i$th level.
• $y_{..}$ represents the grand total of all the observations.
• $\bar{y}_{..}$ represents the grand average of all the observations.


One-way ANOVA
Expressed symbolically,

$$y_{i.} = \sum_{j=1}^{n_i} y_{ij}, \qquad \bar{y}_{i.} = \frac{y_{i.}}{n_i}, \qquad i = 1, 2, \dots, k$$

$$y_{..} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}, \qquad \bar{y}_{..} = \frac{y_{..}}{N}$$

Here, $N$ is the total number of observations, that is,
$N = n_1 + n_2 + \dots + n_k$.


Overall Variability in Data

The corrected sum of squares for each factor level is

$$SS_i = \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2, \qquad i = 1, 2, \dots, k$$

Alternatively, it can be proved that this equals the computational form

$$SS_i = \sum_{j=1}^{n_i} y_{ij}^2 - \frac{y_{i.}^2}{n_i}$$
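The equivalence of the two forms of the corrected sum of squares follows by expanding the square. A short derivation, using only the definitions $y_{i.} = \sum_j y_{ij}$ and $\bar{y}_{i.} = y_{i.}/n_i$ given earlier:

$$
\begin{aligned}
SS_i &= \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2 \\
     &= \sum_{j=1}^{n_i} y_{ij}^2 - 2\,\bar{y}_{i.} \sum_{j=1}^{n_i} y_{ij} + n_i\,\bar{y}_{i.}^2 \\
     &= \sum_{j=1}^{n_i} y_{ij}^2 - 2\,\frac{y_{i.}^2}{n_i} + \frac{y_{i.}^2}{n_i} \\
     &= \sum_{j=1}^{n_i} y_{ij}^2 - \frac{y_{i.}^2}{n_i}
\end{aligned}
$$

As a check against the light-bulb example in this lecture, Brand 1 has $n_1 = 5$, sum 80 and sum of squares 1316, so $SS_1 = 1316 - 80^2/5 = 36$, which agrees with $(5 - 1) \times 9 = 36$ from its sample variance.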


Overall Variability in Data
We then calculate a pooled sum of squares

$$SS_p = \sum_{i=1}^{k} SS_i$$

Finally, the pooled sample variance is

$$s_p^2 = \frac{SS_p}{\text{pooled degrees of freedom}} = \frac{SS_p}{\sum_i n_i - k}$$

Note that if the individual sample variances are available, the same
quantity can be computed as

$$s_p^2 = \frac{\sum_{i=1}^{k} (n_i - 1)\, s_i^2}{\sum_i n_i - k}$$

where $s_i^2$ are the variances of the individual samples. This is also
called the variance within samples and is commonly denoted by
$\hat{\sigma}_W^2$.
Example 5: Variance within Samples
• The table below shows the lifetimes under controlled conditions, in hours in
excess of 1000 hours, of samples of 60 W electric light bulbs of three
different brands.

     Brand
1    2    3
16   18   26
15   22   31
13   20   24
21   16   30
15   24   24


Solution : Variance within Samples
• Here, there is one factor (brand) at three levels (1, 2 and 3). Also, the
sample sizes are all equal (to 5).

• The sample mean and variance (divisor $(n - 1)$) for each level are
as follows.

                  Brand
                  1      2      3
Sample size       5      5      5
Sum               80     100    135
Sum of squares    1316   2040   3689
Mean              16     20     27
Variance          9      10     11


Solution : Variance within Samples
• A pooled estimate of variance can then be calculated as follows.

$$\hat{\sigma}_W^2 = \frac{(5 - 1) \times 9 + (5 - 1) \times 10 + (5 - 1) \times 11}{5 + 5 + 5 - 3} = 10$$

• This quantity is called the variance within samples.

• It is an estimate of $\sigma^2$ based on $\nu = 5 + 5 + 5 - 3 = 12$ degrees of
freedom.
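The pooled calculation above can be reproduced in a few lines. This is a minimal sketch using only Python's standard library; the function name `pooled_variance` and the dictionary layout are illustrative choices, not part of the lecture.

```python
from statistics import mean, variance

# Bulb lifetimes (hours in excess of 1000) copied from the example table.
brands = {
    1: [16, 15, 13, 21, 15],
    2: [18, 22, 20, 16, 24],
    3: [26, 31, 24, 30, 24],
}


def pooled_variance(samples):
    """Return (pooled variance, degrees of freedom).

    Implements sum((n_i - 1) * s_i^2) / (sum(n_i) - k), where s_i^2 is the
    sample variance with divisor (n_i - 1).
    """
    numerator = sum((len(s) - 1) * variance(s) for s in samples)
    dof = sum(len(s) for s in samples) - len(samples)
    return numerator / dof, dof


# Per-brand mean and variance, matching the summary table.
for brand, lifetimes in brands.items():
    print(brand, mean(lifetimes), variance(lifetimes))

sigma2_w, dof_w = pooled_variance(list(brands.values()))
print(sigma2_w, dof_w)
```

The per-brand variances come out as 9, 10 and 11, and the pooled value as 10 on 12 degrees of freedom, in agreement with the slide.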


Heuristic Justification of ANOVA
• From the sampling distribution of the mean, we know that a sample mean
computed from a random sample of size $n$ from a population with mean $\mu$
and variance $\sigma^2$ is a random variable with mean $\mu$ and variance
$\sigma^2/n$ [Central Limit Theorem].

• Let us see what we can conclude in the case of $k$ ($k > 1$) populations,
which may have different means $\mu_i$ but have the same variance $\sigma^2$.


Heuristic Justification of ANOVA
• If the null hypothesis is true, that is, each of the $\mu_i$ has the same
value, say $\mu$, then each of the $k$ sample means $\bar{y}_{i.}$ has a
distribution with mean $\mu$ and variance $\sigma^2/n$.

• It then follows that, if we calculate a variance using the sample means as
observations,

$$\hat{\sigma}_B^2 = \frac{\sum_{i=1}^{k} (\bar{y}_{i.} - \bar{y}_{..})^2}{k - 1}$$

then this quantity is an estimate of $\sigma^2/n$.

• Hence, $n\hat{\sigma}_B^2$ is an estimate of $\sigma^2$.

• This estimate has $k - 1$ degrees of freedom and is independent of the
pooled estimate of $\sigma^2$.


Heuristic Justification of ANOVA
• Out of several sampling distributions, the F-distribution describes the
ratio of two independent estimates of a common variance.

• The parameters of the distribution are the degrees of freedom of the
numerator and denominator variances, respectively.

• If the null hypothesis of equal means is true, then we can compute two
estimates of $\sigma^2$, namely $n\hat{\sigma}_B^2$, where

$$\hat{\sigma}_B^2 = \frac{\sum_{i=1}^{k} (\bar{y}_{i.} - \bar{y}_{..})^2}{k - 1}$$

and $s_p^2$, the pooled variance.

• Therefore, the ratio $n\hat{\sigma}_B^2 / s_p^2$ has the F-distribution with
$(k - 1)$ and $N - k$ degrees of freedom, where $N = kn$ is the total number
of observations.


Heuristic Justification of ANOVA
• Thus, the procedure for testing the hypothesis is as follows.

$H_0: \mu_i = \mu$ for all $i = 1, 2, \dots, k$
$H_1$: at least one equality is not satisfied

• We reject $H_0$ if the calculated value of $F = n\hat{\sigma}_B^2 / s_p^2$
exceeds the upper-$\alpha$ critical value of the F-distribution with
$(k - 1)$ and $N - k$ degrees of freedom, where $\alpha$ is the significance
level and $N = kn$ is the total number of observations.


Example 6: F-Test
            Set 1                              Set 2
Sample 1  Sample 2  Sample 3       Sample 1  Sample 2  Sample 3
5.7       9.4       14.2           3.0       5.0       11.0
5.9       9.8       14.4           4.0       7.0       13.0
6.0       10.0      15.0           6.0       10.0      16.0
6.1       10.2      15.6           8.0       13.0      17.0
6.3       10.6      15.8           9.0       15.0      18.0
ȳ = 6.0   ȳ = 10.0  ȳ = 15.0       ȳ = 6.0   ȳ = 10.0  ȳ = 15.0

• For both sets, the value of $n\hat{\sigma}_B^2$ is 101.67. However, for Set 1,
$s_p^2 = 0.250$, while for Set 2, $s_p^2 = 10.67$. Thus for Set 1, F = 406.67
and for Set 2, F = 9.53.

• This confirms that the relative magnitude of the two variances is the
important factor for detecting differences among means.
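The two F ratios quoted in this example can be recomputed directly from the table. The sketch below assumes equal sample sizes (as in both sets), so the pooled variance is simply the average of the sample variances; the helper name `f_ratio_equal_n` is hypothetical.

```python
from statistics import mean, variance

set_1 = [
    [5.7, 5.9, 6.0, 6.1, 6.3],
    [9.4, 9.8, 10.0, 10.2, 10.6],
    [14.2, 14.4, 15.0, 15.6, 15.8],
]
set_2 = [
    [3.0, 4.0, 6.0, 8.0, 9.0],
    [5.0, 7.0, 10.0, 13.0, 15.0],
    [11.0, 13.0, 16.0, 17.0, 18.0],
]


def f_ratio_equal_n(samples):
    """Return (between estimate, within estimate, F) for equal-sized samples.

    between = n * variance of the sample means   (k - 1 degrees of freedom)
    within  = pooled variance s_p^2              (k(n - 1) degrees of freedom)
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("this sketch assumes equal sample sizes")
    between = n * variance([mean(s) for s in samples])
    within = mean([variance(s) for s in samples])  # pooled variance when all n_i are equal
    return between, within, between / within


for name, data in (("Set 1", set_1), ("Set 2", set_2)):
    between, within, f = f_ratio_equal_n(data)
    print(name, round(between, 2), round(within, 3), round(f, 2))
```

Both sets share the same sample means, so the between-samples estimate is identical; only the within-samples spread changes, which is what drives F from about 406.67 down to about 9.53.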


Example 7: Variance between Samples
• The table below shows the lifetimes under controlled conditions, in hours in
excess of 1000 hours, of samples of 60 W electric light bulbs of three
different brands.

     Brand
1    2    3
16   18   26
15   22   31
13   20   24
21   16   30
15   24   24

• Assuming all lifetimes to be normally distributed with common variance,
test, at the 1% significance level, the hypothesis that there is no difference
between the three brands with respect to mean lifetime.


Solution : Variance between Samples
• The variability between samples may be estimated from the three sample
means as follows.

                  Brand
                  1     2     3
Sample mean       16    20    27

Sum               63
Sum of squares    1385
Mean              21
Variance          31

• This variance (divisor $(3 - 1) = 2$), denoted by $\hat{\sigma}_{\bar{B}}^2$,
is called the variance between sample means. Since it is calculated using
sample means, it is an estimate of $\sigma^2/5$ (that is, $\sigma^2/n$ in
general) based upon $(3 - 1) = 2$ degrees of freedom, but only if the null
hypothesis is true. If $H_0$ is false, then the resulting 'large' differences
between the sample means will result in $5\hat{\sigma}_{\bar{B}}^2$ being an
inflated estimate of $\sigma^2$.


Solution : F-Test
• The two estimates of $\sigma^2$, $n\hat{\sigma}_{\bar{B}}^2$ (here
$5\hat{\sigma}_{\bar{B}}^2$) and $\hat{\sigma}_W^2$, may be tested for equality
using the F-test with

$$F = \frac{5\hat{\sigma}_{\bar{B}}^2}{\hat{\sigma}_W^2}$$

as lifetimes may be assumed to be normally distributed.

• Recall that the F-test requires the two variances to be independently
distributed (from independent samples). Although this is by no means obvious
here (both were calculated from the same data), $\hat{\sigma}_W^2$ and
$\hat{\sigma}_{\bar{B}}^2$ are in fact independently distributed.

• The test is always one-sided, upper-tail, since if $H_0$ is false,
$5\hat{\sigma}_{\bar{B}}^2$ is inflated whereas $\hat{\sigma}_W^2$ is unaffected.

• Thus in analysis of variance, the convention of placing the larger sample
variance in the numerator of the F-statistic is NOT applied.


Solution
• The solution is thus summarized and completed as follows.

◦ $H_0: \mu_i = \mu$ for all $i = 1, 2, 3$
◦ $H_1: \mu_i \neq \mu$ for some $i = 1, 2, 3$
◦ Significance level, $\alpha = 0.01$
◦ Degrees of freedom, $\nu_1 = 2$, $\nu_2 = 12$
◦ Critical region is $F > 6.927$
◦ Test statistic is $F = \frac{5\hat{\sigma}_{\bar{B}}^2}{\hat{\sigma}_W^2} = \frac{155}{10} = 15.5$

• This value does lie in the critical region. There is evidence, at the 1%
significance level, that the true mean lifetimes of the three brands of bulb
do differ.


Notation and computational formulae
• In essence, given a population with a single factor at $k$ levels, we have
to calculate two estimates of $\sigma^2$.

• Sampling variance between groups, with $(k - 1)$ degrees of freedom (for a
common sample size $n$):

$$n\hat{\sigma}_B^2 = \frac{n \sum_{i=1}^{k} (\bar{y}_{i.} - \bar{y}_{..})^2}{k - 1}$$

• Sampling variance within groups, with $\left(\sum_i n_i - k\right)$ degrees
of freedom:

$$\hat{\sigma}_W^2 = \frac{\sum_{i=1}^{k} SS_i}{\sum_i n_i - k}$$


Notation and computational formulae
• The calculations undertaken in the previous example are somewhat
cumbersome, and are prone to inaccuracy with non-integer sample means. They
also require considerable changes when the sample sizes are unequal.
Equivalent computational formulae are available which cater for both equal
and unequal sample sizes.

• First, some notation.

Number of samples (or levels)                      = $k$
Number of observations in the $i$th sample         = $n_i$, $i = 1, 2, \dots, k$
Total number of observations                       = $n = \sum_i n_i$
$j$th observation in the $i$th sample              = $y_{ij}$, $j = 1, 2, \dots, n_i$
Sum of the $n_i$ observations in the $i$th sample  = $T_i = \sum_j y_{ij}$
Sum of all $n$ observations                        = $T = \sum_i T_i = \sum_i \sum_j y_{ij}$


Notation and computational formulae
• The computational formulae now follow.

Total sum of squares:            $SS_T = \sum_i \sum_j y_{ij}^2 - \frac{T^2}{n}$
Between samples sum of squares:  $SS_B = \sum_i \frac{T_i^2}{n_i} - \frac{T^2}{n}$
Within samples sum of squares:   $SS_W = SS_T - SS_B$

• A mean square (or unbiased variance estimate) is given by

(sum of squares) ÷ (degrees of freedom), e.g. $\hat{\sigma}^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Hence

Total mean square:            $MS_T = \frac{SS_T}{n - 1}$
Between samples mean square:  $MS_B = \frac{SS_B}{k - 1}$
Within samples mean square:   $MS_W = \frac{SS_W}{n - k}$

• Note that for the degrees of freedom: $(k - 1) + (n - k) = (n - 1)$


Example 8: F-Test using Formula
• For the previous example on 60 W electric light bulbs, use these
computational formulae to show the following.

(a) $SS_T = 430$
(b) $SS_B = 310$
(c) $MS_B = 155$ (that is, $5\hat{\sigma}_{\bar{B}}^2$)
(d) $MS_W = 10$ (that is, $\hat{\sigma}_W^2$)

Note that $F = \frac{MS_B}{MS_W} = \frac{155}{10} = 15.5$, as previously.


ANOVA Table
• It is convenient to summarize the results of an analysis of variance in a
table. For a one factor analysis this takes the following form.

Source of variation   Sum of squares   Degrees of freedom   Mean square   F ratio
Between samples       $SS_B$           $k - 1$              $MS_B$        $MS_B / MS_W$
Within samples        $SS_W$           $n - k$              $MS_W$
Total                 $SS_T$           $n - 1$


Example 9: F-Test for Unbalanced Data
• In a comparison of the cleaning action of four detergents, 20 pieces of
white cloth were first soiled with India ink. The cloths were then washed
under controlled conditions with 5 pieces washed by each of the detergents.
Unfortunately three pieces of cloth were 'lost' in the course of the
experiment. Whiteness readings, made on the 17 remaining pieces of cloth, are
shown below (- marks a lost piece).

      Detergent
A     B     C     D
77    74    73    76
81    66    78    85
61    58    57    77
76    -     69    64
69    -     63    -

• Assuming all whiteness readings to be normally distributed with common
variance, test the hypothesis of no difference between the four brands as
regards mean whiteness readings after washing.


Solution
◦ $H_0: \mu_i = \mu$ for all $i = 1, 2, 3, 4$
◦ $H_1: \mu_i \neq \mu$ for some $i = 1, 2, 3, 4$
◦ Significance level, $\alpha = 0.05$ (say)
◦ Degrees of freedom, $\nu_1 = k - 1 = 3$ and $\nu_2 = n - k = 17 - 4 = 13$
◦ Critical region is $F > 3.411$


Solution
         A     B     C     D     Total
$n_i$    5     3     5     4     17 = $n$
$T_i$    364   198   340   302   1204 = $T$

$$\sum_i \sum_j y_{ij}^2 = 86362$$

$$SS_T = 86362 - \frac{1204^2}{17} = 1090.47$$

$$SS_B = \frac{364^2}{5} + \frac{198^2}{3} + \frac{340^2}{5} + \frac{302^2}{4} - \frac{1204^2}{17} = 216.67$$

$$SS_W = 1090.47 - 216.67 = 873.80$$


Solution
• The ANOVA table is now as follows.

Source of variation   Sum of squares   Degrees of freedom   Mean square   F ratio
Between detergents    216.67           3                    72.22         1.07
Within detergents     873.80           13                   67.22
Total                 1090.47          16

• The F ratio of 1.07 does not lie in the critical region.

• Thus there is no evidence, at the 5% significance level, to suggest a
difference between the four brands as regards mean whiteness after washing.
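The computational formulae handle unequal sample sizes without modification, which can be checked on the detergent data. The following is a sketch in plain Python; the function name `one_way_anova` and the returned dictionary keys are illustrative, and the critical value 3.411 is simply the tabulated value quoted in the solution, not something the code computes.

```python
detergents = {
    "A": [77, 81, 61, 76, 69],
    "B": [74, 66, 58],
    "C": [73, 78, 57, 69, 63],
    "D": [76, 85, 77, 64],
}


def one_way_anova(samples):
    """One-way ANOVA via the computational formulae (unequal n_i allowed)."""
    k = len(samples)
    n = sum(len(s) for s in samples)
    grand_total = sum(sum(s) for s in samples)           # T
    correction = grand_total ** 2 / n                     # T^2 / n
    ss_total = sum(y * y for s in samples for y in s) - correction
    ss_between = sum(sum(s) ** 2 / len(s) for s in samples) - correction
    ss_within = ss_total - ss_between
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return {
        "SST": ss_total, "SSB": ss_between, "SSW": ss_within,
        "df_between": k - 1, "df_within": n - k,
        "MSB": ms_between, "MSW": ms_within,
        "F": ms_between / ms_within,
    }


result = one_way_anova(list(detergents.values()))
F_CRITICAL = 3.411  # upper 5% point of F(3, 13), taken from the solution above

for key, value in result.items():
    print(f"{key}: {value:.2f}")
print("reject H0" if result["F"] > F_CRITICAL else "do not reject H0")
```

The printed sums of squares, mean squares and F ratio should match the ANOVA table to two decimal places, and the decision is not to reject the null hypothesis.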


Two-way ANOVA



Two way (factor) ANOVA
• This is an extension of the one factor situation to take account of a second
factor.

• The levels of this second factor are often determined by groupings of
subjects or units used in the investigation. As such it is often called a
blocking factor because it places subjects or units into homogeneous
groups called blocks. The design itself is then called a randomised block
design.


Example 10: Two-factor Analysis
• A computer manufacturer wishes to compare the speed of four of the firm's
compilers. The manufacturer can use one of two experimental designs.

a) Use 20 similar programs, randomly allocating 5 programs to each
compiler.
b) Use 4 copies of any 5 programs, allocating 1 copy of each program to
each compiler.

• Which of (a) and (b) would you recommend, and why?


Solution
• In (a), although the 20 programs are similar, any differences between them
may affect the compilation times and hence perhaps any conclusions. Thus in
the 'worst scenario', the 5 programs allocated to what is really the fastest
compiler could be the 5 requiring the longest compilation times, resulting in
the compiler appearing to be the slowest! If used, the results would require a
one factor analysis of variance, the factor being compiler at 4 levels.

• In (b), since all 5 programs are run on each compiler, differences between
programs should not affect the results. Indeed it may be advantageous to use
5 programs that differ markedly so that comparisons of compilation times are
more general. For this design, there are two factors: compiler (4 levels) and
program (5 levels). The factor of principal interest is compiler, whereas the
other factor, program, may be considered as a blocking factor as it creates 5
blocks each containing 4 copies of the same program.

• Thus (b) is the better designed investigation.


Solution
• The actual compilation times, in milliseconds, for this two factor
(randomised block) design are shown in the following table.

             Compiler
             1       2       3       4
Program A    29.21   28.25   28.20   28.62
Program B    26.18   26.02   26.22   25.56
Program C    30.91   30.18   30.52   30.09
Program D    25.14   25.26   25.20   25.02
Program E    26.16   25.16   25.26   25.46


Assumptions and Interaction
• The three assumptions for a two factor analysis of variance when there is
only one observed measurement at each combination of levels of the two
factors are as follows.
1. The population at each factor level combination is (approximately)
normally distributed.
2. These normal populations have a common variance, $\sigma^2$.
3. The effect of one factor is the same at all levels of the other factor.

• Hence from assumptions 1 and 2, when one factor is at level $i$ and the
other at level $j$, the population has a distribution which is
$N(\mu_{ij}, \sigma^2)$.

• Assumption 3 is equivalent to stating that there is no interaction
between the two factors.


Assumptions and Interaction
• Now interaction exists when the effect of one factor depends upon the level
of the other factor. For example, consider the effects of the two factors
sugar (levels none and 2 teaspoons) and stirring (levels none and 1 minute)
on the sweetness of a cup of tea.

• Stirring has no effect on sweetness if sugar is not added but certainly does
have an effect if sugar is added. Similarly, adding sugar has little effect on
sweetness unless the tea is stirred.

• Hence factors sugar and stirring are said to interact.

• Interaction can only be assessed if more than one measurement is taken at
each combination of the factor levels. Since such situations are beyond the
scope of this text, it will always be assumed that interaction between the two
factors does not exist.
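The sugar and stirring example can be made concrete with a small numeric sketch. The sweetness scores below are invented purely for illustration (they do not come from the lecture); the point is only that the effect of stirring differs between the two sugar levels, which is what interaction means.

```python
# Hypothetical mean sweetness scores on a 0-10 scale (invented numbers).
sweetness = {
    ("no sugar", "not stirred"): 1.0,
    ("no sugar", "stirred"): 1.0,
    ("sugar", "not stirred"): 3.0,
    ("sugar", "stirred"): 8.0,
}


def stirring_effect(sugar_level):
    """Change in sweetness caused by stirring, at a fixed sugar level."""
    return sweetness[(sugar_level, "stirred")] - sweetness[(sugar_level, "not stirred")]


effect_without_sugar = stirring_effect("no sugar")
effect_with_sugar = stirring_effect("sugar")

# With no interaction the two effects would be equal and this contrast would be 0.
interaction_contrast = effect_with_sugar - effect_without_sugar

print(effect_without_sugar, effect_with_sugar, interaction_contrast)
```

A non-zero contrast indicates that the factors interact. With a single measurement per cell, as in the randomised block design here, such a contrast cannot be separated from random error, which is why the analysis assumes it is zero.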


Assumptions and Interaction
• Thus, for example, since it would be most unusual to find one compiler
particularly suited to one program, the assumption of no interaction between
compilers and programs appears reasonable.


Notation and Computational Formulae
• As illustrated earlier, the data for a two-way ANOVA can be displayed in a
two-way table. It is thus convenient, in general, to label the factors as
a row factor and a column factor.

• Notation, similar to that for the one factor case, is then as follows.

Number of levels of row factor       = $r$
Number of levels of column factor    = $c$
Total number of observations         = $rc$
Observation in the $(i, j)$th cell of the table
($i$th level of row factor and $j$th level of column factor)
                                     = $x_{ij}$, $i = 1, 2, \dots, r$, $j = 1, 2, \dots, c$


Notation and computational formulae
Sum of $c$ observations in the $i$th row      = $T_{Ri} = \sum_j x_{ij}$
Sum of $r$ observations in the $j$th column   = $T_{Cj} = \sum_i x_{ij}$
Sum of all $rc$ observations                  = $T = \sum_i \sum_j x_{ij} = \sum_i T_{Ri} = \sum_j T_{Cj}$

• These lead to the following computational formulae, which again are similar
to those for one-way ANOVA except that there is an additional sum of
squares, etc. for the second factor.


Notation and computational formulae
Total sum of squares:             $SS_T = \sum_i \sum_j x_{ij}^2 - \frac{T^2}{rc}$
Between rows sum of squares:      $SS_R = \sum_i \frac{T_{Ri}^2}{c} - \frac{T^2}{rc}$
Between columns sum of squares:   $SS_C = \sum_j \frac{T_{Cj}^2}{r} - \frac{T^2}{rc}$
Error (residual) sum of squares:  $SS_E = SS_T - SS_R - SS_C$

What are the degrees of freedom for $SS_T$, $SS_R$ and $SS_C$ when
there are 20 observations in a table of 5 rows and 4 columns?
What are the degrees of freedom of $SS_E$?
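For the 5 × 4 table in the question above, the degrees of freedom follow from the general expressions for a two-way layout with one observation per cell. A short worked answer, with $r = 5$ rows and $c = 4$ columns:

$$
\begin{aligned}
\text{df}(SS_T) &= rc - 1 = 20 - 1 = 19 \\
\text{df}(SS_R) &= r - 1 = 4 \\
\text{df}(SS_C) &= c - 1 = 3 \\
\text{df}(SS_E) &= (r - 1)(c - 1) = 4 \times 3 = 12
\end{aligned}
$$

As a check, $4 + 3 + 12 = 19$, so the component degrees of freedom add up to the total, just as the sums of squares do.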


ANOVA Table and Hypothesis Test
For a two factor analysis of variance this takes the following form.

Source of variation   Sum of squares   Degrees of freedom   Mean square   F ratio
Between rows          $SS_R$           $r - 1$              $MS_R$        $MS_R / MS_E$
Between columns       $SS_C$           $c - 1$              $MS_C$        $MS_C / MS_E$
Error (residual)      $SS_E$           $(r - 1)(c - 1)$     $MS_E$
Total                 $SS_T$           $rc - 1$

• Notes:
1. The three sums of squares, $SS_R$, $SS_C$ and $SS_E$, are independently
distributed.
2. For the degrees of freedom:
$(r - 1) + (c - 1) + (r - 1)(c - 1) = rc - 1$


ANOVA Table and Hypothesis Test
• Using the F ratios, tests for significant row effects and for significant
column effects can be undertaken.

Row factor:
H0: no effect due to row factor
H1: an effect due to row factor
Critical region: $F > F_{\alpha}\big((r - 1), (r - 1)(c - 1)\big)$
Test statistic: $F_R = \frac{MS_R}{MS_E}$

Column factor:
H0: no effect due to column factor
H1: an effect due to column factor
Critical region: $F > F_{\alpha}\big((c - 1), (r - 1)(c - 1)\big)$
Test statistic: $F_C = \frac{MS_C}{MS_E}$


Example 11: Two-way ANOVA
• Consider again the compilation times, in milliseconds, for each of five
programs run on four compilers.
• Test, at the 1% significance level, the hypothesis that there is no
difference between the performance of the four compilers.
• Has the use of programs as a blocking factor proved worthwhile? Explain.
• The data, given earlier, are reproduced below.

             Compiler
             1       2       3       4
Program A    29.21   28.25   28.20   28.62
Program B    26.18   26.02   26.22   25.56
Program C    30.91   30.18   30.52   30.09
Program D    25.14   25.26   25.20   25.02
Program E    26.16   25.16   25.26   25.46


Solution : Dataset
• To ease computations, these data have been transformed (coded) by
x = 100 × (time - 25)
to give the following table of values and totals.

                            Compiler
                            1      2      3      4      Row totals ($T_{Ri}$)
Program A                   421    325    320    362    1428
Program B                   118    102    122    56     398
Program C                   591    518    552    509    2170
Program D                   14     26     20     2      62
Program E                   116    14     26     46     202
Column totals ($T_{Cj}$)    1260   985    1040   975    4260 = $T$

$\sum_i \sum_j x_{ij}^2 = 1757768$


Solution : Parameters
• The sums of squares are now calculated as follows.
(Rows = Programs, Columns = Compilers)

$$SS_T = 1757768 - \frac{4260^2}{20} = 850388$$

$$SS_R = \frac{1}{4}\left(1428^2 + 398^2 + 2170^2 + 62^2 + 202^2\right) - \frac{4260^2}{20} = 830404$$

$$SS_C = \frac{1}{5}\left(1260^2 + 985^2 + 1040^2 + 975^2\right) - \frac{4260^2}{20} = 10630$$

$$SS_E = 850388 - 830404 - 10630 = 9354$$


Solution: ANOVA Table

Source of variation   Sum of squares   Degrees of freedom   Mean square   F ratio
Between programs      830404           4                    207601.0      266.33
Between compilers     10630            3                    3543.3        4.55
Error (residual)      9354             12                   779.5
Total                 850388           19


Solution : Hypothesis Test
• H0: no effect on compilation times due to compilers
• H1: an effect on compilation times due to compilers
• Significance level, $\alpha = 0.01$
• Degrees of freedom, $\nu_1 = c - 1 = 3$
and $\nu_2 = (r - 1)(c - 1) = 4 \times 3 = 12$
• Critical region is $F > 5.953$
• Test statistic $F_C = 4.55$

• This value does not lie in the critical region. Thus there is no evidence, at
the 1% significance level, to suggest a difference in compilation times
between the four compilers.
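The two-way calculations for this example can be sketched in plain Python using the computational formulae given earlier. The function name `two_way_anova` and the dictionary keys are illustrative. The data are the coded values exactly as printed in the coded table; note that the coded entry for Program E on compiler 2 is 14 there, whereas the raw time 25.16 would code to 16, and the sketch follows the coded table so that the totals match the slides.

```python
# Coded compilation times x = 100 * (time - 25); rows = programs, columns = compilers.
coded = {
    "A": [421, 325, 320, 362],
    "B": [118, 102, 122, 56],
    "C": [591, 518, 552, 509],
    "D": [14, 26, 20, 2],
    "E": [116, 14, 26, 46],  # 14 as printed in the coded table
}


def two_way_anova(rows):
    """Two-way ANOVA with one observation per cell (no interaction term)."""
    r, c = len(rows), len(rows[0])
    grand_total = sum(sum(row) for row in rows)                 # T
    correction = grand_total ** 2 / (r * c)                      # T^2 / rc
    ss_total = sum(x * x for row in rows for x in row) - correction
    ss_rows = sum(sum(row) ** 2 for row in rows) / c - correction
    ss_cols = sum(sum(col) ** 2 for col in zip(*rows)) / r - correction
    ss_error = ss_total - ss_rows - ss_cols
    ms_error = ss_error / ((r - 1) * (c - 1))
    return {
        "SST": ss_total, "SSR": ss_rows, "SSC": ss_cols, "SSE": ss_error,
        "F_rows": (ss_rows / (r - 1)) / ms_error,
        "F_cols": (ss_cols / (c - 1)) / ms_error,
    }


table = two_way_anova(list(coded.values()))
F_CRITICAL = 5.953  # upper 1% point of F(3, 12), taken from the solution above

for key, value in table.items():
    print(f"{key}: {value:.2f}")
print("reject H0 for compilers" if table["F_cols"] > F_CRITICAL
      else "do not reject H0 for compilers")
```

Because the coding is a linear transformation of the raw times, it rescales every sum of squares by the same factor and leaves both F ratios unchanged.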


Reference

• Detailed material related to this lecture can be found in

Design and Analysis of Experiments (8th Edition), Douglas C. Montgomery,
John Wiley & Sons, 2013.


Any questions?

You may post your question(s) at the "Discussion Forum" maintained on the
course web page!


Problem 1

• Draw a straight line of between 20 cm and 25 cm on a sheet of plain white
card (only you know its exact length).
• Collect 6 to 10 volunteers from each of Class VII, Class X and Class
XII. Ask each volunteer to estimate independently the length of the
line.

• Do differences in class means appear to outweigh differences within
classes?

What is/are the factor(s) and levels here?


Problem 2
• Make a list of 10 food/household items purchased regularly by your
family.

• Obtain the current prices of the items in three different shops, preferably
a small 'corner' shop, a small supermarket and a large supermarket or
hypermarket.

• Compare total shop prices.

What is/are the factor(s) and levels here?


Questions of the day…
