Lec9 Manova
MANOVA
We considered testing for a mean difference between two multivariate normal samples in
Lecture 3. Let $X_{11}, \dots, X_{1n_1}$ be i.i.d. $N_p(\mu_1, \Sigma)$ and $X_{21}, \dots, X_{2n_2}$ be i.i.d. $N_p(\mu_2, \Sigma)$. Consider
testing
\[ H_0 : \mu_1 = \mu_2 \quad \text{versus} \quad H_1 : \mu_1 \neq \mu_2, \quad \text{where } \Sigma \text{ is unknown.} \]
Since
\[ \bar{X}_1 - \bar{X}_2 \sim N_p\!\left(\mu_1 - \mu_2,\ \frac{n_1 + n_2}{n_1 n_2}\,\Sigma\right), \]
\[ (n_1 + n_2 - 2)\,S_P \sim W_p(n_1 + n_2 - 2,\ \Sigma), \]
• the p-value,
which makes the table useful. Unlike the ANOVA table, the one-way MANOVA table consists
of matrix-valued sums of squares ($\mathbf{T}$, $\mathbf{B}$, and $\mathbf{W}$ are $p \times p$ matrices). MANOVA uses the $p \times p$
matrix $\mathbf{W}^{-1}\mathbf{B}$ as an analogue of the F value.
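The matrices in the table can be computed directly from grouped data. The sketch below assumes the standard definitions (total, between-group, and within-group sums of squares and cross-products), which are not restated in this part of the notes; the function name `manova_sums_of_squares` and the simulated data are illustrative only.

```python
import numpy as np

def manova_sums_of_squares(X, labels):
    """Return (T, B, W): total, between-group and within-group
    sums-of-squares-and-cross-products matrices, each p x p."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    p = X.shape[1]
    grand_mean = X.mean(axis=0)

    centered = X - grand_mean
    T = centered.T @ centered            # total variation about the grand mean

    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for g in np.unique(labels):
        Xg = X[labels == g]
        d = Xg.mean(axis=0) - grand_mean
        B += Xg.shape[0] * np.outer(d, d)   # group mean vs. grand mean
        R = Xg - Xg.mean(axis=0)
        W += R.T @ R                        # observations vs. own group mean
    return T, B, W

# Simulated example (assumed setup): K = 3 groups, 20 observations each, p = 2.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 20)
shifts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X = rng.normal(size=(60, 2)) + shifts[labels]

T, B, W = manova_sums_of_squares(X, labels)
W_inv_B = np.linalg.solve(W, B)          # W^{-1} B without forming the inverse
print(np.allclose(T, B + W))
print(W_inv_B)
```

The decomposition T = B + W holds exactly (up to rounding), which is the matrix counterpart of the ANOVA identity for sums of squares.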
The matrix-valued statistic $\mathbf{W}^{-1}\mathbf{B}$ is in fact closely related to the likelihood ratio statistic.
One can show that the m.l.e. of $\Sigma$ is $S_0 = \mathbf{T}/n$ under $H_0$ and $S_1 = \mathbf{W}/n = S_P\,\frac{n-K}{n}$ under
$H_1$. The likelihood ratio statistic is
\[ W = n \log \frac{|S_0|}{|S_1|} = n \log \frac{|\mathbf{T}|}{|\mathbf{W}|} = n \log \frac{|\mathbf{W} + \mathbf{B}|}{|\mathbf{W}|}, \]
which has approximately the $\chi^2_{p(K-1)}$ distribution for large $n$. Bartlett's modification of
the likelihood ratio statistic is
\[ W^* = \left[(n - 1) - \tfrac{1}{2}(p + K)\right] \log \frac{|\mathbf{W} + \mathbf{B}|}{|\mathbf{W}|}. \]
The hypothesis test based on $W^*$ (or on $W$) is called Wilks' test. Note that $W$ is an
increasing function of
\[ U = \frac{|\mathbf{W} + \mathbf{B}|}{|\mathbf{W}|} = |I_p + \mathbf{W}^{-1}\mathbf{B}|. \]
Moreover, the statistic $U$ is a function of the eigenvalues $\lambda_i$ of $\mathbf{W}^{-1}\mathbf{B}$, since
\[ U = |I_p + \mathbf{W}^{-1}\mathbf{B}| = \prod_{i=1}^{p} (1 + \lambda_i). \]
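The statistics above can be checked numerically. The following is a minimal sketch on simulated data: it computes log U both from the eigenvalues of W^{-1}B and from the determinant ratio, then forms the likelihood ratio statistic and Bartlett's modification. The function name `wilks_statistics` is hypothetical, and referring the Bartlett-corrected statistic to the same chi-square distribution with p(K-1) degrees of freedom is the usual convention, assumed here rather than taken from the notes.

```python
import numpy as np
from scipy.stats import chi2

def wilks_statistics(X, labels):
    """Likelihood ratio statistic, Bartlett's modification and an
    approximate chi-square p-value for one-way MANOVA."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, p = X.shape
    groups = np.unique(labels)
    K = len(groups)
    grand_mean = X.mean(axis=0)

    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for g in groups:
        Xg = X[labels == g]
        d = Xg.mean(axis=0) - grand_mean
        B += len(Xg) * np.outer(d, d)
        R = Xg - Xg.mean(axis=0)
        W += R.T @ R

    # Eigenvalues of W^{-1} B are real and non-negative in exact arithmetic.
    lam = np.linalg.eigvals(np.linalg.solve(W, B)).real
    log_U = np.log1p(lam).sum()                       # log prod(1 + lambda_i)
    log_U_det = np.linalg.slogdet(W + B)[1] - np.linalg.slogdet(W)[1]

    W_stat = n * log_U                                # likelihood ratio statistic
    W_star = ((n - 1) - (p + K) / 2) * log_U          # Bartlett's modification
    df = p * (K - 1)
    return {"log_U": log_U, "log_U_det": log_U_det, "W": W_stat,
            "W_star": W_star, "df": df, "p_value": chi2.sf(W_star, df)}

# Simulated example (assumed setup): K = 3 groups, 20 observations each, p = 2.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 20)
shifts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X = rng.normal(size=(60, 2)) + shifts[labels]

res = wilks_statistics(X, labels)
print(res)
```

Working with log U through `log1p` and `slogdet` avoids overflow in the determinants when p or the sums of squares are large.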
Two bad ideas
Situation I A physician has a multivariate dataset consisting of patient information (all quantitative)
related to a disease. He strongly believes that there are two or three subtypes
of the disease. Accordingly, a statistician helps him find clusters in the dataset
(cluster analysis). Three clusters are found, and each is nicely interpreted with
distinct characteristics. He goes further and tests whether the mean vectors of these
clusters differ. The p-value is extremely small. Does this mean that there
are actually three clusters?
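A small simulation makes the question in Situation I concrete. The sketch below draws a single Gaussian sample with no subgroups at all, splits it into three clusters with a basic k-means (Lloyd's algorithm, written out here for self-containment), and then applies the Bartlett-corrected Wilks statistic to the cluster labels. The sample size, dimension, and helper names are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import chi2

def kmeans_labels(X, K, rng, n_iter=50):
    """Basic Lloyd's algorithm; returns a cluster label for each row of X."""
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):          # keep the old center if a cluster empties
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def bartlett_wilks_pvalue(X, labels):
    """Approximate chi-square p-value based on [(n-1) - (p+K)/2] log(|T|/|W|)."""
    n, p = X.shape
    groups = np.unique(labels)
    K = len(groups)
    C = X - X.mean(axis=0)
    T = C.T @ C
    W = np.zeros((p, p))
    for g in groups:
        R = X[labels == g] - X[labels == g].mean(axis=0)
        W += R.T @ R
    log_U = np.linalg.slogdet(T)[1] - np.linalg.slogdet(W)[1]
    stat = ((n - 1) - (p + K) / 2) * log_U
    return chi2.sf(stat, p * (K - 1))

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))        # one homogeneous population, no subtypes
labels = kmeans_labels(X, 3, rng)
pval = bartlett_wilks_pvalue(X, labels)
print(np.bincount(labels), pval)
```

The p-value comes out extremely small even though the data contain no clusters, because the labels were chosen by the clustering step precisely to separate the group means. The test's null distribution assumes labels fixed in advance, so the small p-value says nothing about whether three subtypes exist.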
Situation II The effects of two drug compounds are compared with a baseline (or placebo), measured
by differentially expressed genes of lab mice. Suppose we have a sample of size n = 120
in total. The number of genes is p > 20,000, so the usual MANOVA is not
applicable (why?). Initial dimension reduction leads to d = 10 principal components.
MANOVA is applied to this dataset of size 10 × 120. The p-value suggests that the
difference in the effects of the drug compounds is statistically significant. A biologist is further
interested in a list of genes that are responsible for the difference. For the more than 20,000
genes, applying ANOVA to each gene (variable) yields more than 1,000
genes with a p-value less than 0.01. Does this mean that all 1,000 genes are important?
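For Situation II, it helps to see how many small p-values arise by chance alone. The sketch below simulates 20,000 genes with no group effect whatsoever and runs a one-way ANOVA on each gene; about 20,000 × 0.01 = 200 genes are expected to fall below 0.01. The equal split into three groups of 40 (baseline and two compounds) and the independent standard normal expression values are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(0)
n_genes, K, n_per_group = 20000, 3, 40
n = K * n_per_group                       # 120 samples in total

# Null data: every gene has the same mean in all three groups.
data = rng.normal(size=(K, n_per_group, n_genes))

group_means = data.mean(axis=1)           # shape (K, n_genes)
grand_mean = data.mean(axis=(0, 1))       # shape (n_genes,)

# One-way ANOVA F statistic for each gene, computed in a vectorized way.
ss_between = n_per_group * ((group_means - grand_mean) ** 2).sum(axis=0)
ss_within = ((data - group_means[:, None, :]) ** 2).sum(axis=(0, 1))
F = (ss_between / (K - 1)) / (ss_within / (n - K))
pvals = f.sf(F, K - 1, n - K)

n_small = int((pvals < 0.01).sum())
print(n_small)                            # close to 20000 * 0.01 = 200 on average
```

So a sizeable share of a 1,000-gene list could consist of genes with no effect at all, and the unadjusted per-gene p-values cannot tell which ones. This is a multiple-testing problem, and the list needs some adjustment before it is interpreted.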