Nested and Repeated Measures ANOVA
Nested designs are used when levels of one factor are not represented within all levels
of another factor. Often this is because there is no alternative. For instance, if we were
concerned with the effects of acid rain on productivity in British and American lakes, we
might select at random 5 lakes in each country and make 10 productivity measurements
at the surface. The lakes would constitute a random effect while country would be a
fixed effect. However, each lake does not occur in both countries, so lake is,
necessarily, nested within country. Such a design confounds the lake main effect with the
lake by country interaction, since estimating the interaction would require measurements
of each lake within both countries, which is impossible. In such a situation, one analyzes
the data as if they represent a fully factorial design with all factors completely crossed,
but then the interaction term (lake by country in this case) is pooled with the effect of the
nested factor (lake in this case), and the country main effect is tested over the effect of
the nested factor pooled with the interaction.
Source     df      MS                     F
Country    2 - 1   SSCountry/dfCountry    MSCountry/MS(Lake + Lake by Country)
The appropriate F ratio is MSCountry/MS(Lake + Lake by Country). Alternatively, if factor A is the
non-nested factor and factor B is the nested factor, the F ratio to test for the effect of the
non-nested factor is F = MS(A)/MS(A/B), where A/B denotes the effect of factor B nested
within factor A and is equal to the sum of the main effect of factor B and the interaction of
factors A and B that one would obtain from treating the data as being generated by a
fully crossed ANOVA design. The results of this analysis should be equivalent to an
analysis in which the values in each lake were first averaged and a one-way ANOVA
was performed on the averages to test for a country effect.
Nested ANOVA in R
In R, one can obtain the nested analysis simply using the 'aov' command, or using the
'Anova' command from the 'car' package if you have an unbalanced design and want to
use Type II or III sums of squares. However, as for other ANOVA models, F may need
to be re-calculated if the model includes a random factor.
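As a sketch of the re-calculation described above, the following fits a balanced nested design to simulated data, tests the fixed factor over the nested random factor, and checks that the result matches a one-way ANOVA on the lake means. The data values and all object names (d, lake_effect, F_nested, and so on) are invented for illustration and do not come from any study.

```r
# Simulated balanced design: 2 countries, 5 lakes per country, 10 measurements per lake
set.seed(1)
d <- data.frame(
  country = factor(rep(c("UK", "US"), each = 50)),
  lake    = factor(rep(1:10, each = 10))   # lakes 1-5 are in UK, 6-10 in US
)
lake_effect <- rnorm(10, sd = 2)            # random lake effects (illustrative values)
d$y <- 20 + 3 * (d$country == "US") + lake_effect[as.integer(d$lake)] + rnorm(100)

# Nested fit: lake within country
tab <- anova(lm(y ~ country/lake, data = d))

# Test country over lake-within-country rather than over the residual
F_nested <- tab["country", "Mean Sq"] / tab["country:lake", "Mean Sq"]
p_nested <- pf(F_nested, tab["country", "Df"], tab["country:lake", "Df"],
               lower.tail = FALSE)

# Equivalent analysis: one-way ANOVA on the 10 lake means
lake_means <- aggregate(y ~ lake + country, data = d, FUN = mean)
F_means <- anova(lm(y ~ country, data = lake_means))["country", "F value"]

c(F_nested = F_nested, F_means = F_means, p = p_nested)
```

The two F values agree because the design is balanced; with unequal numbers of measurements per lake they generally would not.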
In a study designed to estimate the number of associated bacterial species as a
function of life history stage for a moth, DNA was extracted from 5 field-collected
individuals of each life stage, and 3 replicate PCRs were run on the bacterial 16S gene.
The PCR product was sequenced using next-generation sequencing. The resulting DNA
sequence reads were grouped into clusters at the 98% level of similarity, and the number
of bacterial OTUs (operational taxonomic units) was counted for each replicate
sample. These data are in the linked file (micro.csv). Note that individuals were
selected at random, so we must eventually compute the correct F ratios and p values
using the mean squares provided by R. The expected mean squares and F-ratios for
this specific study design are given in the notes on Repeated Measures and Nested
ANOVA, in section III, Table B.
Perform the ANOVA using the 'aov' command. If factor B is nested within the levels of
factor A, then the model syntax will include the term A/B. In our example 'individual' is
nested within 'treat', where 'treat' is life history stage.
summary(aov(dat1$numbact~dat1$treat/factor(dat1$individual)))
## Df Sum Sq Mean Sq F value Pr(>F)
## dat1$treat 2 52.04 26.022 7.184 0.00282 **
## dat1$treat:factor(dat1$individual) 12 28.53 2.378 0.656 0.77737
## Residuals 30 108.67 3.622
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# compute the correct F ratio and determine its statistical significance
fcor=26.022/2.378
fcor
## [1] 10.94281
1-pf(fcor,df1=2,df2=12)
## [1] 0.0019724
To use the 'car' Anova command, first load the 'car' package.
library(car)
# perform the ANOVA using lm, and the Anova command from 'car'
# Note that since we have a balanced design the results are the same
dd=lm(dat1$numbact~dat1$treat/factor(dat1$individual))
Anova(dd,type=2)
## Anova Table (Type II tests)
##
## Response: dat1$numbact
## Sum Sq Df F value Pr(>F)
## dat1$treat 52.044 2 7.1840 0.002823 **
## dat1$treat:factor(dat1$individual) 28.533 12 0.6564 0.777372
## Residuals 108.667 30
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Although the summary table from the 'car' Anova command does not give the Mean
squares, dividing the Sums of Squares by their respective degrees of freedom will yield
the same values as we obtained using the 'aov' command since we have equal sample
sizes.
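The arithmetic can be sketched as follows, using the sums of squares and degrees of freedom copied from the Type II table above. The vector names (ss, df, ms) and the label indiv_in_treat are assumptions made for this illustration.

```r
# Sums of squares and degrees of freedom copied from the Type II table above
ss <- c(treat = 52.044, indiv_in_treat = 28.533, resid = 108.667)
df <- c(treat = 2,      indiv_in_treat = 12,     resid = 30)

# Mean squares are sums of squares divided by their degrees of freedom
ms <- ss / df
round(ms, 3)

# Individuals are a random factor, so test treat over individuals within treat
f_cor <- ms[["treat"]] / ms[["indiv_in_treat"]]
p_cor <- pf(f_cor, df[["treat"]], df[["indiv_in_treat"]], lower.tail = FALSE)
f_cor
p_cor
```

Small differences from the earlier hand calculation are expected because the table values are rounded.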
Plotting the data shows that the number of associated bacteria is greatest in the adult
stage and lower in the larval stages examined.
library(ggplot2)
p=ggplot(dat1,aes(x=treat,y=numbact)) +
geom_boxplot()
p
Note that the boxplot supports the conclusion that there are differences in the number of
bacterial associates between life stages. We do not test the effect of individuals
nested within life stage since it is a random effect.
It is also possible to compute a nested ANOVA by treating the data as if they were not
nested and pooling the sums of squares. However, the nested factor level must be
coded as if it were not nested. In the example above, the data file micro.csv also has a
column labeled "individual2" in which we pretend that the same five individuals were
observed under each level of the treat factor (stage). For the study
described above, we would perform a 2-factor ANOVA with life stage (treat) and
individual2 as our two factors. Since individuals are nested within life stage, we know
that we really cannot estimate an interaction between these factors. However, we trick
R into computing an effect of treat (A), individual2 (B), and a treat:individual2
interaction (A:B). We would then need to pool (add together) the Sum of Squares for
individual2 and the Sum of Squares for treat:individual2 to obtain the Sum of Squares for
individual nested within treatment (treat/individual). We would pool their degrees of
freedom as well. In the example from above, the individual effect would have 5 - 1 = 4
degrees of freedom, and the interaction would have (5 - 1) x (3 - 1) = 8 degrees of
freedom, so the total would be 12, as obtained when we explicitly include the nested
factor in the ANOVA model. In general, any nested design can be computed using this
pooling of Sums of Squares approach, although it is a bit more cumbersome.
Again, to use the pooling approach with factor B nested within the levels of factor A:
SS(A/B) = SS(B) + SS(A:B)
df(A/B) = df(B) + df(A:B)
MS(A/B) = SS(A/B)/df(A/B)
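The pooling can be checked with a short sketch on simulated data laid out like the moth example (3 stages, 5 individuals per stage, 3 replicates each). The response values are random numbers and the stage labels s1 to s3 are placeholders; only the column names treat, individual and individual2 echo the text.

```r
set.seed(42)
d <- expand.grid(rep = 1:3, individual2 = 1:5, treat = c("s1", "s2", "s3"))
d$treat       <- factor(d$treat)
d$individual2 <- factor(d$individual2)                 # same 5 labels reused in every stage
d$individual  <- interaction(d$treat, d$individual2, drop = TRUE)  # 15 unique labels
d$y <- rnorm(nrow(d), mean = 10)                       # illustrative response

# Explicit nesting versus the "pretend it is crossed" coding
nested  <- anova(lm(y ~ treat/individual, data = d))
crossed <- anova(lm(y ~ treat * individual2, data = d))

# Pool the main effect of B with the A:B interaction
ss_pooled <- crossed["individual2", "Sum Sq"] + crossed["treat:individual2", "Sum Sq"]
df_pooled <- crossed["individual2", "Df"] + crossed["treat:individual2", "Df"]
ms_pooled <- ss_pooled / df_pooled

# These should match the nested term
c(pooled = ss_pooled, nested = nested["treat:individual", "Sum Sq"])
c(pooled = df_pooled, nested = nested["treat:individual", "Df"])
```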
In this example, one could have averaged the values for each individual, and then
computed a single factor ANOVA to compare differences among life history stages. The
results would be identical to fitting the nested model.
Within subjects designs are used in agricultural and psychological research and have
many applications in biology. These designs are called split-plot, repeated measures, or
generically within subjects. The primary purpose of these designs is to eliminate
uncontrolled variation due to a priori differences or differences in the level of
responsiveness in primary sampling units from the estimate of experimental error. In
this sense we can see that these designs are a way to remove confounding variation by
adding classificatory controls or strata.
Repeated Measures ANOVA has been used increasingly in biology for several reasons.
The first is that it allows us to better control for inter-subject variability. It allows us to
use a subject as its own control. Secondly, it is more economical in use of subjects,
which is especially important when subjects (or study sites) are difficult to locate or get
to, or are limited in number.
and Kaiser in the supplemental readings to learn more about these two approaches to
repeated measures ANOVA.
In R, data for repeated measures factors can be on the same line for a single subject. If
a subject is observed under all three levels of factor A, then all three response values
can be on a single line of data. For a two factor design with repeated measures on one
factor (with 3 factor levels for the repeated measures factor and 2 for the between
subjects factor) data would look like this:
The integer code represents the level of the between subjects factor in which each
individual subject is nested, and the three real numbers in successive columns
represent the observation of the random variable of interest in levels 1, 2, and 3 of the
within subjects factor, respectively. Entering data on additional repeated measures
factors would entail adding the additional observations to each line of the data file since
each line represents an individual subject.
Feldman and Connor (1992) examined the effects of rock type and stream pH on the
colonization of cobble habitats by invertebrates. They placed wire mesh baskets full of
similar-sized cobbles of 3 different rock types into 3 streams with slightly acidic
pH values of approximately 5.8 and into 3 streams with neutral pH values of approximately
7. These data are in the linked data file (rock.csv). The data represent the number of
invertebrates detected in each basket in each combination of pH and rock type. The
expected mean squares and F ratios for this specific study design are given in the notes
on Repeated Measures and Nested ANOVA in section II, Table C.
dat2=read.csv("K:/Biometry/Biometry-Fall-2015/Lab8/rock.csv",header=TRUE)
head(dat2)
## pH Rock1 Rock2 Rock3
## 1 1 248 206 250
## 2 1 360 216 198
## 3 1 332 176 234
## 4 2 390 404 521
## 5 2 523 594 486
## 6 2 416 446 433
After attaching the data object, dat2, combine the 3 levels of the repeated measures
factor (within-subjects factor) into a matrix called rocks. Then define the factor
'rocktype'. Finally, using the 'lm' function from R, fit a multivariate linear model where
'rocks' is the dependent variable and pH is the only factor.
attach(dat2)
rocks=cbind(Rock1,Rock2,Rock3)
rocktype=factor(c("Rock1","Rock2","Rock3"))
mlm1=lm(rocks~factor(pH))
In this last step we are essentially asking if a linear combination of the levels of the rock
type factor differs by pH.
Now load the 'car' package, and apply the 'Anova' command. Note that we need to
define a data.frame for our repeated measures factor (the 'idata' argument), and a model
for the within subjects design (the 'idesign=~rocktype' part of the command).
# load the package "car"
library(car)
# Using the car Anova command perform an ANOVA with Type 2 sums of squares
ww=Anova(mlm1,idata=data.frame(rocktype),idesign=~rocktype,type=2)
Next we will obtain a summary of this last model. This summary is lengthy, but I will
describe its contents.
The first part of the summary focuses on the Multivariate approach. Here you will find
Multivariate F- values, degrees of freedom, and significance levels for all your within-
subjects tests. The reason for this is that one can conceive of a within-subjects factor as
either a single response variable examined under a series of different treatment levels
(a univariate approach), or as several different response variables (a multivariate
approach). Usually the Univariate and Multivariate approaches give the same answer,
although F-values, dfs, and p-values will not be exactly the same. The reason for using
the Multivariate approach is that one only has to meet the assumption of Multivariate
Normality and homogeneity of the variance/covariance matrix for this test to work well.
Tests for multivariate normality are found in the MVN package in R. Bartlett's test
(bartlett.test), available in the base installation of R, tests homogeneity of variances
for one response at a time rather than of whole variance/covariance matrices. In the
univariate approach, however, one has to meet the more
restrictive assumption of sphericity of the variance/covariance matrix.
The second part of the summary focuses on the Univariate approach. It first shows an
ANOVA table which includes between- and within-subjects factors with no adjustment for
non-sphericity of the variance/covariance matrix (assuming sphericity). After the
ANOVA table, R reports Mauchly's test for sphericity of your variance/covariance
matrix, and two different estimates of epsilon (a measure of how much your data depart
from meeting this assumption). Epsilon lies between its lower bound, 1/(k - 1) for a
within-subjects factor with k levels, and 1, and it is multiplied by the degrees of
freedom for both the numerator and
denominator mean squares to adjust the test for non-sphericity. Deviations from
sphericity lead to too many Type I errors, so by adjusting dfs downward the resulting
univariate F-tests have error rates close to the nominal error rates (if you test at
α = 0.05, the Type I error rate will be about 0.05 if you meet the other assumptions of the test).
I recommend using the Greenhouse-Geisser adjustment.
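To make the role of epsilon concrete, here is a minimal sketch of the textbook Greenhouse-Geisser formula, epsilon = (tr M)^2 / ((k - 1) tr(M^2)), where M is the error sums-of-squares-and-products matrix expressed in orthonormal within-subjects contrasts. It uses the six rows printed by head(dat2) above. The object names (Y, E, C, M, eps_gg) are assumptions, and the result has not been compared with the 'car' output.

```r
# Rock data as printed by head(dat2) in this handout
Y <- cbind(Rock1 = c(248, 360, 332, 390, 523, 416),
           Rock2 = c(206, 216, 176, 404, 594, 446),
           Rock3 = c(250, 198, 234, 521, 486, 433))
pH <- factor(c(1, 1, 1, 2, 2, 2))
k  <- ncol(Y)                          # number of within-subjects levels

# Error sums of squares and products after removing the pH group means
E <- crossprod(resid(lm(Y ~ pH)))

# Orthonormal contrasts among the k repeated measures
C <- contr.poly(k)
M <- t(C) %*% E %*% C

# Greenhouse-Geisser epsilon: 1 under sphericity, 1/(k - 1) at worst
eps_gg <- sum(diag(M))^2 / ((k - 1) * sum(M^2))
eps_gg

# Apply it to the pH by rock type interaction (F = 3.5236 on 2 and 8 df in the table)
F_int   <- 3.5236
p_unadj <- pf(F_int, 2, 8, lower.tail = FALSE)
p_gg    <- pf(F_int, eps_gg * 2, eps_gg * 8, lower.tail = FALSE)
c(unadjusted = p_unadj, greenhouse_geisser = p_gg)
```

The trace of M is the within-subjects error sum of squares, so it can be compared with the Error SS column of the univariate table as a sanity check.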
# Obtain a summary
summary(ww)
## Sum of squares and products for error:
## (Intercept)
## (Intercept) 61858.67
##
## Multivariate Tests: (Intercept)
## Df test stat approx F num Df den Df Pr(>F)
## Pillai 1 0.99111 446.0004 1 4 2.9718e-05 ***
## Wilks 1 0.00889 446.0004 1 4 2.9718e-05 ***
## Hotelling-Lawley 1 111.50011 446.0004 1 4 2.9718e-05 ***
## Roy 1 111.50011 446.0004 1 4 2.9718e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------
##
## Term: factor(pH)
##
## Response transformation matrix:
## (Intercept)
## Rock1 1
## Rock2 1
## Rock3 1
##
## Sum of squares and products for the hypothesis:
## (Intercept)
## (Intercept) 662008.2
##
## Sum of squares and products for error:
## (Intercept)
## (Intercept) 61858.67
##
## Multivariate Tests: factor(pH)
## Df test stat approx F num Df den Df Pr(>F)
## Pillai 1 0.914544 42.80779 1 4 0.0028205 **
## Wilks 1 0.085456 42.80779 1 4 0.0028205 **
## Hotelling-Lawley 1 10.701947 42.80779 1 4 0.0028205 **
## Roy 1 10.701947 42.80779 1 4 0.0028205 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ------------------------------------------
##
## Term: rocktype
##
## Response transformation matrix:
## rocktype1 rocktype2
## Rock1 1 0
## Rock2 0 1
## Rock3 -1 -1
##
## Sum of squares and products for the hypothesis:
## rocktype1 rocktype2
## rocktype1 3601.5 -1960.000
## rocktype2 -1960.0 1066.667
##
## Sum of squares and products for error:
## rocktype1 rocktype2
## rocktype1 28376 23794.00
## rocktype2 23794 28788.67
##
## Multivariate Tests: rocktype
## Df test stat approx F num Df den Df Pr(>F)
## Pillai 1 0.4753856 1.359243 2 3 0.37998
## Wilks 1 0.5246144 1.359243 2 3 0.37998
## Hotelling-Lawley 1 0.9061619 1.359243 2 3 0.37998
## Roy 1 0.9061619 1.359243 2 3 0.37998
##
## ------------------------------------------
##
## Term: factor(pH):rocktype
##
## Response transformation matrix:
## rocktype1 rocktype2
## Rock1 1 0
## Rock2 0 1
## Rock3 -1 -1
##
## Sum of squares and products for the hypothesis:
## rocktype1 rocktype2
## rocktype1 22693.5 -5412.000
## rocktype2 -5412.0 1290.667
##
## Sum of squares and products for error:
## rocktype1 rocktype2
## rocktype1 28376 23794.00
## rocktype2 23794 28788.67
##
## Multivariate Tests: factor(pH):rocktype
## Df test stat approx F num Df den Df Pr(>F)
## Pillai 1 0.790732 5.667849 2 3 0.095731 .
## Wilks 1 0.209268 5.667849 2 3 0.095731 .
## Hotelling-Lawley 1 3.778566 5.667849 2 3 0.095731 .
## Roy 1 3.778566 5.667849 2 3 0.095731 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Univariate Type II Repeated-Measures ANOVA Assuming Sphericity
##
## SS num Df Error SS den Df F Pr(>F)
## (Intercept) 2299083 1 20620 4 446.0004 2.972e-05 ***
## factor(pH) 220669 1 20620 4 42.8078 0.00282 **
## rocktype 4419 2 22247 8 0.7945 0.48447
## factor(pH):rocktype 19597 2 22247 8 3.5236 0.07990 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
The last part of the summary shows Mauchly's test for sphericity of the
variance/covariance matrix. R gives both the Greenhouse-Geisser epsilon (preferred)
and the Huynh-Feldt epsilon, which measure departure from sphericity. A value
of 1 indicates that the variance/covariance matrix is spherical. Adjusted
p-values for the within-subjects factors are also given.
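As a cross-check on the univariate table, the same sphericity-assumed tests can be sketched in base R with 'aov' and an Error term. The long-format data frame is rebuilt from the six rows printed by head(dat2). The names wide, long and subject are assumptions introduced here.

```r
# Rock data as printed by head(dat2) in this handout
wide <- data.frame(pH    = c(1, 1, 1, 2, 2, 2),
                   Rock1 = c(248, 360, 332, 390, 523, 416),
                   Rock2 = c(206, 216, 176, 404, 594, 446),
                   Rock3 = c(250, 198, 234, 521, 486, 433))

# One row per observation; each subject (basket set) appears three times
long <- data.frame(
  subject    = factor(rep(1:6, times = 3)),
  pH         = factor(rep(wide$pH, times = 3)),
  rocktype   = factor(rep(c("Rock1", "Rock2", "Rock3"), each = 6)),
  numinverts = c(wide$Rock1, wide$Rock2, wide$Rock3)
)

# pH is tested between subjects; rocktype and pH:rocktype within subjects
fit <- aov(numinverts ~ pH * rocktype + Error(subject/rocktype), data = long)
s <- summary(fit)

between <- s[[1]][[1]]   # stratum for subjects: pH
within  <- s[[2]][[1]]   # stratum for subject:rocktype: rocktype, pH:rocktype

F_pH   <- between[["F value"]][1]
F_rock <- within[["F value"]][1]
F_int  <- within[["F value"]][2]
c(pH = F_pH, rocktype = F_rock, interaction = F_int)
```

This approach gives no sphericity test or epsilon correction, so it complements the 'car' summary rather than replacing it.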
Finally, it is always good to plot the data to get a better feel for the patterns revealed.
numinverts=c(Rock1,Rock2,Rock3)
numinverts
## [1] 248 360 332 390 523 416 206 216 176 404 594 446 250 198 234 521 486
## [18] 433
rocks=factor(c(rep(1,6),rep(2,6),rep(3,6)))
rocks
## [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3
## Levels: 1 2 3
pHH=factor(rep(pH,3))
pHH
## [1] 1 1 1 2 2 2 1 1 1 2 2 2 1 1 1 2 2 2
## Levels: 1 2
# obtain a boxplot from a long-format data frame (one row per observation)
long2=data.frame(pHH,rocks,numinverts)
p2=ggplot(long2,aes(x=pHH,y=numinverts,fill=rocks)) +
geom_boxplot()
p2
The graph supports the conclusion that more invertebrates were detected in streams
with neutral pH than in low pH streams, and shows some suggestion of an interaction
between pH and rock type, although it was not statistically significant.
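The stacking done by hand for the plot can also be sketched with base R's 'reshape', which converts the one-line-per-subject layout into one line per observation. The data are the six rows printed by head(dat2). The names wide, long and subject are assumptions for this illustration.

```r
# Rock data as printed by head(dat2) in this handout
wide <- data.frame(pH    = c(1, 1, 1, 2, 2, 2),
                   Rock1 = c(248, 360, 332, 390, 523, 416),
                   Rock2 = c(206, 216, 176, 404, 594, 446),
                   Rock3 = c(250, 198, 234, 521, 486, 433))
wide$subject <- factor(seq_len(nrow(wide)))   # one id per line of the wide file

# Stack Rock1, Rock2, Rock3 into a single response column
long <- reshape(wide, direction = "long",
                varying = c("Rock1", "Rock2", "Rock3"),
                v.names = "numinverts",
                timevar = "rocks", times = 1:3,
                idvar   = "subject")
long$rocks <- factor(long$rocks)
long$pH    <- factor(long$pH)

head(long)
```

The resulting columns correspond to numinverts, rocks and pHH above, with the subject identifier kept alongside them.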
Data on insect damage on oak trees was collected in order to answer questions
concerning the effects of shading on damage levels. Six oak trees were selected at
random, 3 in the shade and 3 in the sun. The data are included in the linked data file
nest.csv. The file has the following form:
Column 1: Light level: 1 = shade; 2 = sun
Reiterating, this file contains data which describe the response of trees in terms of
damage; thus damage is the response variable. The data come from observations taken
from six randomly selected oak trees. Three are located in the shade; and three are
located in the sun. Thus this is a nested design, with trees nested within the level of light
(shade or sun) and leaves nested within trees.
To use the R syntax to perform a nested ANOVA, use the tree coding variable (tree2).
To perform the analysis for pooling the sums of squares use the tree coding variable
(tree).
1.1) What kind of design is this? Draw a graphical representation of the experimental
design, where A = light level, B = trees, and G = group of subjects.
When I ask for a graphical representation of the experimental design, I am looking for a
diagram like I have shown in class that shows all the factors in the experiment, their
levels, and represents each group of subjects with the letter G appropriately subscripted
under each treatment combination. I also ask that you accurately describe the pattern of
crossing and nesting of factor levels and subjects.
a) specify the null hypotheses being tested and turn in the edited output
b) pool the appropriate sums-of-squares and perform the appropriate F test
d) How could the data have been treated differently so a nested ANOVA could have
been avoided?
In our particular example, our nested factor (trees) is a random effects factor since we
chose them at random from a large number of possible trees. Therefore, the final test of
the effect of factor A will involve computing an F ratio with the MS(B(A)) as the
denominator.
Given that our nested factor is a random effects factor, it also makes little sense to
compute a hypothesis test for the B(A) effect, but if one insisted on doing so then the
within cell or error mean square would serve as the denominator of the F-ratio.
Finally, for our particular problem, an alternative way to analyze the data would be to
average the values among the 15 leaves within each tree, and use the tree means to
compute an independent-groups t-test for differences in leaf damage between trees in
the sun versus the shade. The results of this test would be identical to the test of the
light effect in the Nested ANOVA (except the t-value would equal the square root of the
F value). Hence, it is sometimes possible to turn a nested ANOVA into a simpler
problem, particularly if the nesting of factor levels arises because multiple observations
are made on the same subject. Here, by using tree averages, we turn a nested random
factor (the tree factor) into our subjects, hence removing the nested factor levels and
simplifying the analysis.
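The equivalence between the t-test on tree means and the nested F test can be sketched on simulated data with the same layout (2 light levels, 3 trees per level, 15 leaves per tree). The damage values are random numbers, not the contents of nest.csv, and the object names are assumptions.

```r
set.seed(7)
d <- data.frame(
  light = factor(rep(c("shade", "sun"), each = 45)),
  tree  = factor(rep(1:6, each = 15))    # trees 1-3 in shade, 4-6 in sun
)
tree_effect <- rnorm(6, sd = 3)           # random tree effects (illustrative values)
d$damage <- 10 + 4 * (d$light == "sun") + tree_effect[as.integer(d$tree)] +
  rnorm(90, sd = 2)

# Nested ANOVA: light tested over trees within light
tab <- anova(lm(damage ~ light/tree, data = d))
F_light <- tab["light", "Mean Sq"] / tab["light:tree", "Mean Sq"]
p_light <- pf(F_light, 1, tab["light:tree", "Df"], lower.tail = FALSE)

# Simpler route: average the leaves within each tree, then a pooled-variance t-test
tree_means <- aggregate(damage ~ tree + light, data = d, FUN = mean)
tt <- t.test(damage ~ light, data = tree_means, var.equal = TRUE)

c(F = F_light, t_squared = unname(tt$statistic)^2)
c(p_nested = p_light, p_ttest = tt$p.value)
```

The match relies on equal numbers of leaves per tree and on the equal-variance form of the t-test (var.equal = TRUE).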
In this experiment, ten small-mammal trapping grids were established in order to study
the effect of food addition on population densities. Five of the grids received food
supplements while the others were not manipulated. The population levels on each grid
were monitored 3 times at monthly intervals following the food addition. The data are
contained in the linked data file rep.csv. It is in the following form:
Reiterating, here we have 10 grids. 5 have food added; 5 do not. The response variable,
population density, was measured on each grid (subject) 3 times (a repeated measures
factor).
2.1) What kind of design is this? Draw a graphical representation of the experimental
design, where A = food treatment, B = time or month, and G = group of subjects.
2.2) Conduct an appropriate analysis of variance on these data and report the results
for tests of the main effects of food treatment, time, and their interaction (no pooling of
sums-of-squares is necessary).
a) specify the null hypotheses being tested and interpret the results of the tests (at α =
0.05). Don't forget to include your syntax and results in your R Markdown file.
b) How could the study have been designed differently so that a repeated measures
ANOVA could have been avoided?