Symbolic Description of Factorial Models for Analysis of Variance

Author(s): G. N. Wilkinson and C. E. Rogers


Source: Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 22,
No. 3 (1973), pp. 392-399
Published by: Wiley for the Royal Statistical Society

Stable URL: https://www.jstor.org/stable/2346786

This content downloaded from


132.174.253.119 on Tue, 11 May 2021 14:39:23 UTC
All use subject to https://about.jstor.org/terms
Symbolic Description of Factorial Models for
Analysis of Variance

By G. N. WILKINSON and C. E. ROGERS


Rothamsted Experimental Station

SUMMARY
The paper describes the symbolic notation and syntax for specifying factorial
models for analysis of variance in the control language of the GENSTAT
statistical program system at Rothamsted. The notation generalizes that
of Nelder (1965). Algorithm AS 65 (Rogers, 1973) converts factorial model
formulae in this notation to a list of model terms represented as binary
integers.
A further extension of the syntax is discussed for specifying models
generally (including non-linear forms).

1. INTRODUCTION
GENERAL computer programs for analysing experiments need a concise, flexible
notation for specifying the appropriate factorial models. The notation in this paper,
and various others due to Zyskind (1962), Hemmerle (1964), Nelder (1965), Fowlkes
(1969) and Claringbold (1969), were discussed at an international workshop meeting
on the computational aspects of analysis of variance at the University of Wisconsin in
1970 (Muller and Wilkinson, 1970).
The present notation for model formulae includes the addition, crossing and
nesting operators common to most of the notations mentioned, a dot operator for
defining multi-factor model terms and deletion operators for eliminating unwanted
terms from otherwise simple formulae. Submodel functions may be substituted for
factors in a formula, to specify regression sub-models for partitioning factorial effects.
The notation is implemented in the GENSTAT language (Nelder et al., 1973) (which
also includes a special pseudo-factor operator not described here). The GENSTAT
system is currently in operation at Rothamsted, the Edinburgh Regional Computing
Centre, Cambridge and Bristol Universities and other centres. Algorithm AS 65
(Rogers, 1973) converts symbolic factorial model formulae to a list of model terms
represented as binary integers.
Further extensions of the notation are readily envisaged, e.g. a diallel function of
parental genotype factors and a similarity-link operator for combining random terms
with a common variance, such as rows and columns in a lattice square design. A
general extension of notation to include linear or non-linear regression models is
described in Section 4.

2. OUTLINE OF THE NOTATION


2.1. Simple Factorial Models
Factorial models can be expressed symbolically as a sum of model terms, using the
operator +, and a dot operator to link factor names in multifactor terms. Thus
the following two alternative models for a two-way table of observations indexed by
factors A and B,

    (i)  y_{ij} = m + a_i + b_j + (ab)_{ij},
    (ii) y_{ij} = m + a_i + (ab)_{ij},                    (1)

where i = 1, 2, ..., p and j = 1, 2, ..., q, can be written symbolically, with a grand mean
term taken as implicit, as

    (i)  A + B + A.B,
    (ii) A + A.B.                                          (2)

Formulae of this form are termed simple factorial model formulae.


Readers unfamiliar with computing languages should note that there is an essential
modularization of information here. Formulae such as (2) specify only the factorial
structure of the model. Other information needed for analysis, such as the numbers of
levels of the factors and the positions of the observations in the A x B table, would be
specified by other statements in the programming language.
Note also that the meaning of the term A.B in the models (2) is affected by the
marginal terms that precede it, and a modern stepwise fitting algorithm (cf. Wilkinson,
1970) will automatically generate constraints in the parameter estimates (effects)
corresponding to marginal aliasing, such as that of (ab)_{i.} with a_i, etc. Thus A.B
automatically represents, in formula (2) (i), interaction effects with zero row and column
sums, and in (2) (ii), within-class deviations with respect to A.
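
The two readings of A.B can be checked numerically. The following is a minimal sketch in Python, not GENSTAT code and not the stepwise algorithm of Wilkinson (1970); it assumes one observation per cell of a small table whose values are invented for illustration, and the variable names are hypothetical.

```python
# Illustrative 2 x 3 table y[i][j]; the numbers are made up for this sketch.
y = [[3.0, 5.0, 7.0],
     [4.0, 8.0, 9.0]]
p, q = len(y), len(y[0])

grand = sum(map(sum, y)) / (p * q)
row_mean = [sum(row) / q for row in y]
col_mean = [sum(y[i][j] for i in range(p)) / p for j in range(q)]

a = [r - grand for r in row_mean]   # effects for factor A (rows)
b = [c - grand for c in col_mean]   # effects for factor B (columns)

# Model (2)(i), A + B + A.B: the A.B term is what remains after both margins.
ab_crossed = [[y[i][j] - grand - a[i] - b[j] for j in range(q)]
              for i in range(p)]

# Model (2)(ii), A + A.B: the A.B term is what remains after the A margin only.
ab_nested = [[y[i][j] - grand - a[i] for j in range(q)]
             for i in range(p)]

for name, table in (("crossed", ab_crossed), ("nested", ab_nested)):
    row_sums = [sum(row) for row in table]
    col_sums = [sum(table[i][j] for i in range(p)) for j in range(q)]
    print(name, "row sums:", row_sums, "column sums:", col_sums)
```

In this sketch the crossed residuals sum to zero over both rows and columns, whereas the nested residuals sum to zero only within each level of A.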

2.2. Crossing and Nesting Operators


Simple factorial models can usually be expressed more concisely by using the
crossing and nesting operators (Nelder, 1965), represented here by * and / respectively,
to indicate a multi-way table with or without certain margins. The following expansions
to simple form show the meaning of these operators:

    (i)  A*B = A + B + A.B,
    (ii) A/B = A + A.B.                                    (3)

The nesting operator suppresses the marginal term B, which is irrelevant in a
hierarchical model.

2.3. Block and Treatment Formulae


It is statistically necessary, of course, to distinguish between the random (or error)
terms in the model, arising from the physical (block) structure of the experimental
units, and the systematic (treatment) terms. This is done in the GENSTAT language with
separate declarations of what are termed block and treatment model formulae, for
instance

'blocks' blocks/plots, (4)

'treatments' nitrate*density. (5)

2.4. Crossed and Nested Formulae


With the introduction of bracketing where necessary, crossed and nested formulae
suffice to specify the usual models for most experimental designs. For instance,

Latin squares, Youden squares, lattice squares and plaid designs have a block structure

rows*cols (6)

or reps/(rows*cols) (7)

while split-plot or split-row and split-column designs have block structures such as

blocks/mainplots/subplots (8)

or reps/(rows*cols)/subplots (9)

or (rows/subrows)*(cols/subcols). (10)

Treatment model structures are usually of the crossed form

e.g. nitrate*phos*potash

but nested structures (hierarchical models) are sometimes needed,

e.g. group/variety (11)

or spray/(type*dose), (12)

where spray is a factor indicating whether experimental plots were sprayed or not
(with insecticide, say). Note that in this example the factors type and dose would
include null levels associated with the unsprayed plots.
The general rules for determining simple factorial formulae from formulae such as
(7)-(12) are given in Section 3.
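
As a worked illustration, applying the definitions (22) and (23) of Section 3.2 by hand to formulae (7), (8) and (12) gives the expansions below. They are derived here from those rules rather than quoted from GENSTAT output, and the ordering of terms is simply the order in which the rules generate them.

```latex
\begin{align*}
\text{reps}/(\text{rows}*\text{cols})
  &= \text{reps} + \text{reps}.\text{rows} + \text{reps}.\text{cols}
     + \text{reps}.\text{rows}.\text{cols} \\
\text{blocks}/\text{mainplots}/\text{subplots}
  &= \text{blocks} + \text{blocks}.\text{mainplots}
     + \text{blocks}.\text{mainplots}.\text{subplots} \\
\text{spray}/(\text{type}*\text{dose})
  &= \text{spray} + \text{spray}.\text{type} + \text{spray}.\text{dose}
     + \text{spray}.\text{type}.\text{dose}
\end{align*}
```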

2.5. Deletion Operators


Corresponding to the operators +, * and / are three deletion operators with meanings
as follows:

    Operator    Meaning
    (i)   -     Delete the specified term(s) from the preceding model.
    (ii)  -*    As for (i), and also any corresponding higher-order terms.
    (iii) -/    Delete only the corresponding higher-order terms.
These are useful for deleting unwanted terms from crossed and nested formulae when
the corresponding simple factorial sum of terms would be otherwise too lengthy.
The following equivalent model expressions illustrate their meaning (see rules in
Section 3).

    A*B*C - A.B.C = A + B + C + A.B + A.C + B.C,    (13)

    A*B*C -* B.C = A*(B + C),                        (14)

    A*B*C -/ A = A + B*C.                            (15)

The ANOVA directive of the GENSTAT language also provides a model-order control
parameter for suppressing, from the analysis, all treatment terms above the order
specified.
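
The three deletion operators can be mimicked on sets of terms. The sketch below is in Python rather than GENSTAT and is only an illustration: each model term is represented as a frozenset of factor names (not the binary integers of Algorithm AS 65), and the function names are hypothetical. It reproduces the right-hand sides of (13)-(15).

```python
from itertools import combinations


def full_factorial(factors):
    """All non-empty dot-products of the factors: the expansion of A*B*C..."""
    return {frozenset(c)
            for r in range(1, len(factors) + 1)
            for c in combinations(factors, r)}


def delete(s, t):
    """The '-' operator: drop from s any term listed in t."""
    return s - t


def delete_higher(s, t):
    """The '-/' operator: drop terms of s that strictly contain a term of t."""
    return {term for term in s if not any(d < term for d in t)}


def delete_with_higher(s, t):
    """The '-*' operator: drop the terms of t and their higher-order terms."""
    return delete_higher(delete(s, t), t)


def show(s):
    """List terms by increasing number of factors, then alphabetically."""
    ordered = sorted(s, key=lambda term: (len(term), sorted(term)))
    return " + ".join(".".join(sorted(term)) for term in ordered)


abc = full_factorial("ABC")
print(show(delete(abc, {frozenset("ABC")})))            # compare (13)
print(show(delete_with_higher(abc, {frozenset("BC")}))) # compare (14)
print(show(delete_higher(abc, {frozenset("A")})))       # compare (15)
```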

2.6. Submodels
An important requirement for the analysis of variance of factorial models is the
ability to specify submodels for partitioning factorial effects into regression components;
linear, quadratic and cubic trends of main effects for instance, and interactions
of these such as linear (A) × linear (B), linear (A) × quadratic (B), etc., where A
and B are different factors.
It is usually sufficient in practice to specify submodels only for the main effects of
each factor, since these then define by implication the corresponding compound
submodels for higher-order factorial terms. Submodels are specified in the GENSTAT
language by substituting for factors in the treatment model formula the appropriate
submodel functions of them, which are of the form

submodel-function (factor, order [,X]), (16)

the square brackets indicating an optional third parameter.


The functions currently available are POL (polynomial regression), REG (multiple
regression), POLND and REGND, where adding the affix ND (no deviations) indicates
that a deviations term is not to be considered a part of the submodel for that factor
when generating compound submodels for higher-order terms, and leads to suppression
of terms like deviations (A) × linear (B) in the compound submodel for A × B
interactions, say.
The order parameter indicates the order of polynomial required or the number of
x-variates of a multiple regression.
The X parameter, if present, specifies an x-variate for polynomial regression or a
matrix whose rows are the x-variates of a multiple regression. If omitted (in the case
of POL, POLND) a polynomial regression on the quantitative levels of the factor is
implied.
Example. The treatment formula

    GENOTYPE*POL(SITE,1,SITEMEAN)*POLND(DENSITY,2)    (17)

would produce the type of genotype × site analysis described by Finlay and Wilkinson
(1963), in which the sensitivity of each genotype with respect to site is characterized by a
linear regression of yield values for that genotype on the site means (over all genotypes);
together with an extension of the analysis for linear and quadratic trends of yield on
density (sowing rate) and their interactions with genotype and site. If a linear submodel
were also specified for genotype, as for site, single degrees of freedom for
non-additivity would be produced.
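
The regression-on-site-means step of this example can be sketched directly. The Python fragment below is an illustration only: it does not reproduce the GENSTAT POL function or the full analysis of variance, the yields are invented, and the names are hypothetical. It computes, for each genotype, the slope of its yields on the site means.

```python
# Hypothetical yields: one list per genotype, one entry per site (made-up data).
yields = {
    "G1": [2.0, 3.0, 4.0],
    "G2": [1.0, 3.0, 5.0],
    "G3": [3.0, 3.0, 3.0],
}
n_sites = 3

# Site means over all genotypes play the role of the x-variate SITEMEAN.
site_mean = [sum(y[s] for y in yields.values()) / len(yields)
             for s in range(n_sites)]


def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# One sensitivity coefficient per genotype.
sensitivity = {g: slope(site_mean, y) for g, y in yields.items()}
for genotype, value in sensitivity.items():
    print(genotype, value)
```

Because the x-variate is itself the mean of the yields over genotypes, the slopes in such a sketch average to one.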

3. GENERAL SYNTAX AND INTERPRETATION


3.1. Syntax
A factorial model formula in the GENSTAT language is an expression, interpreted
from left to right, with factors as the basic operands, bracketing where needed and the
following dyadic operators with precedence values (1 = highest) as indicated:

    Operator      .    /    *    +    -    -/   -*
    Precedence    1    2    3    4    4    4    4        (18)

(We omit from consideration here the substitution of submodels for factors, which
does not affect the primary, factorial model.)

The assignment of operator-precedence reduces the need for bracketing to define
the left and right operands of each operator in model expressions. Compare the familiar
assignment of precedence to the operators × and / over + and - in arithmetic
expressions. The brackets in the expression (a + b) × c are needed to define the left-hand
operand of ×, whereas in the expression a + b × c the left-hand operand of × is
unambiguously defined as b by the operator precedence, so that bracketing as in
a + (b × c) is unnecessary.
The left-to-right rule also reduces the amount of bracketing when successive operators
have the same precedence. Thus a/b/c is unambiguously interpreted as (a/b)/c.
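
The effect of the precedence values and the left-to-right rule can be seen by parsing a formula and writing it back fully bracketed. The sketch below is in Python and uses a standard precedence-climbing parser; it is an assumption-laden illustration and not Algorithm AS 65 or the GENSTAT parser. The table PRECEDENCE transcribes (18); all function names are hypothetical.

```python
import re

# Precedence values from table (18); 1 binds most tightly. "." is the dot operator.
PRECEDENCE = {".": 1, "/": 2, "*": 3, "+": 4, "-": 4, "-/": 4, "-*": 4}
LOWEST = max(PRECEDENCE.values())
# The two-character deletion operators are tried before the one-character "-".
TOKEN = re.compile(r"\s*(-/|-\*|[.+\-*/()]|[A-Za-z_]\w*)")


def tokenize(text):
    """Split a formula into factor names, operators and brackets."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    """Return a nested (operator, left, right) tuple for the formula."""
    tokens = tokenize(text)
    pos = 0

    def operand():
        nonlocal pos
        if pos == len(tokens):
            raise ValueError("formula ends where an operand is expected")
        token = tokens[pos]
        pos += 1
        if token == "(":
            node = expression(LOWEST)
            if pos == len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing bracket")
            pos += 1
            return node
        if token == ")" or token in PRECEDENCE:
            raise ValueError(f"operand expected, found {token!r}")
        return token

    def expression(weakest):
        # Accept only operators whose precedence value is at most `weakest`.
        nonlocal pos
        left = operand()
        while (pos < len(tokens)
               and PRECEDENCE.get(tokens[pos], LOWEST + 1) <= weakest):
            op = tokens[pos]
            pos += 1
            # The right operand may contain only tighter operators, which
            # yields left-to-right grouping at equal precedence.
            right = expression(PRECEDENCE[op] - 1)
            left = (op, left, right)
        return left

    tree = expression(LOWEST)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree


def bracketed(node):
    """Write the tree back with every operator's operands bracketed."""
    if isinstance(node, str):
        return node
    op, left, right = node
    return f"({bracketed(left)}{op}{bracketed(right)})"


print(bracketed(parse("a/b/c")))
print(bracketed(parse("reps/(rows*cols)/subplots")))
print(bracketed(parse("A*B*C -/ A")))
```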

3.2. Evaluation Rules


The interpretation of a model formula follows from its expansion as a simple
factorial model, i.e. its evaluation as a sum of factorial terms, a term being either a
factor or a dot-product of factors. In the rules given below, A and B denote model
terms, S and T sums of model terms and L and M model formulae:
Simplification.

Delete repetitions of operands in a dot-product, (19)

e.g. A.B.A = A.B.

Delete repetitions of model terms in a sum, (20)

e.g. A + B + A = A + B.

Ordering of model terms. Some of the evaluation rules below may not produce a
statistically appropriate ordering of model terms, so that re-ordering may be required.
An essential order requirement is that any term in a simple factorial model should
precede all terms to which it is marginal, i.e. A before A.B, etc. A stronger requirement
(implemented in GENSTAT) is that terms be arranged in increasing order with
respect to the number of factors in a term, with terms of the same order arranged in a
natural sequence with respect to the factors defining them.

Distributive rule for dot-product.

    S.T = Σ A.B  for all terms A in S, B in T.    (21)

For example,

    (A + B).C = A.C + B.C.

General definitions of crossing and nesting operations. These may be defined in
terms of + and dot operations as follows:

    (i)  L*M = L + M + L.M,           (22)

    (ii) L/M = L + FAC(L).M,          (23)

where FAC(L) is the dot-product of all factors in L. It will usually be a term in the
expansion of L. For example,

    (A + B)*C = A + B + C + (A + B).C
              = A + B + C + A.C + B.C,

    (A*B)/C = A + B + A.B + A.B.C,

    (A + B)/C = A + B + A.B.C.
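
Rules (19)-(23) translate almost line for line into operations on lists of terms. The Python sketch below is illustrative only: terms are frozensets of factor names, a formula is a list of terms in the order generated, and all function names are hypothetical. It makes no attempt at the re-ordering of terms discussed above, and it is not Algorithm AS 65.

```python
def factor(name):
    """A model formula consisting of the single term `name`."""
    return [frozenset([name])]


def add(left, right):
    """L + M with rule (20): repeated terms are kept once."""
    out = list(left)
    for term in right:
        if term not in out:
            out.append(term)
    return out


def dot(s, t):
    """Rule (21); taking the set union of factors applies rule (19)."""
    out = []
    for a in s:
        for b in t:
            if a | b not in out:
                out.append(a | b)
    return out


def fac(formula):
    """FAC(L): the dot-product of all factors appearing in L."""
    return [frozenset().union(*formula)]


def cross(left, right):
    """Rule (22): L*M = L + M + L.M."""
    return add(add(left, right), dot(left, right))


def nest(left, right):
    """Rule (23): L/M = L + FAC(L).M."""
    return add(left, dot(fac(left), right))


def show(formula):
    return " + ".join(".".join(sorted(term)) for term in formula)


A, B, C = factor("A"), factor("B"), factor("C")
print(show(cross(add(A, B), C)))   # (A+B)*C
print(show(nest(cross(A, B), C)))  # (A*B)/C
print(show(nest(add(A, B), C)))    # (A+B)/C
```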

Deletion operations.

    S - T:   Delete from S any terms in T,                                (24)

    S -/ T:  Delete from S any terms to which a term of T is marginal,    (25)

    S -* T = S - T -/ T.                                                  (26)

4. EXTENSION TO GENERAL MODELS


A set of n observations may be regarded as a point in an n-dimensional vector space,
the sample space. A model for the observations specifies a subspace, not necessarily
linear, of the sample space, and a parameterization of this subspace.

4.1. Linear Models


For the linear factorial models considered so far the model subspace is determined
by the incidence variates associated with the factorial parameters. The symbolic
notation so far described characterizes the relevant subspace without explicit naming
of the parameters, which are assumed to be in 1-1 correspondence with the incidence
vectors.
The notation can be extended to describe any linear subspace by admitting as
operands not only factors but also x-variates, or more generally matrices whose
columns define a set of x-variates, and by extending the definition of dot-product.
Definition of dot-product. If X1, X2 denote n × p, n × q matrices (incidence matrices
if X1 and/or X2 are factor names), the symbolic dot-product X1.X2 represents an
n × pq matrix, each row of which is the direct product of the corresponding rows of
X1, X2 respectively, i.e. comprises all products x1 x2 of elements x1, x2 from the
respective rows.
Note that the simplification rule (19) for repeated operands in a dot-product
applies only to factors (or matrices X with the property that X and X.X determine
the same subspace).
Examples. (i) With X1 and X2 defined as above, the symbolic model

    X1*X2 = X1 + X2 + X1.X2    (27)

indicates the linear model subspace determined by the p + q + pq variates associated
with X1, X2 and X1.X2.
(ii) Introducing a symbolic exponentiation operator **, a complete second-degree
model with respect to X1 and X2 is concisely described as

    (X1 + X2)**2 = (X1 + X2)*(X1 + X2)
                 = X1 + X2 + X1.X1 + X1.X2 + X2.X2,    (28)

deleting repeated terms.
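
The extended dot-product is a row-by-row direct product, which is easily written out. The Python sketch below is an illustration under stated assumptions: matrices are plain lists of rows, the factor levels and x-values are invented, and the function names are hypothetical. It also shows why rule (19) holds for a factor but not for a general x-variate.

```python
def row_direct_product(x1, x2):
    """Dot-product X1.X2: each row holds all products of the paired rows."""
    if len(x1) != len(x2):
        raise ValueError("both matrices need the same number of rows n")
    return [[u * v for u in r1 for v in r2] for r1, r2 in zip(x1, x2)]


def incidence(levels, n_levels):
    """n x n_levels incidence matrix of a factor with levels 0..n_levels-1."""
    return [[1 if level == k else 0 for k in range(n_levels)]
            for level in levels]


a = incidence([0, 0, 1, 1], 2)   # factor A: n = 4, p = 2 (made-up levels)
b = incidence([0, 1, 2, 0], 3)   # factor B: n = 4, q = 3 (made-up levels)
ab = row_direct_product(a, b)    # n x pq incidence matrix of the A.B cells
for row in ab:
    print(row)

# For a factor, A.A carries the columns of A again (plus all-zero columns),
# so it determines the same subspace and rule (19) applies.
print(row_direct_product(a, a))

# For an x-variate, X.X holds the squares and is a genuinely new variate.
x = [[2.0], [3.0], [5.0], [7.0]]
print(row_direct_product(x, x))
```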

4.2. Non-linear Models


In the context of a programming language an important feature of the symbolic
notation for linear models is that explicit naming of the parameters associated with the
model vectors is avoided.

A non-linear model, however, requires an algebraic specification which necessarily
involves explicit names for parameters, as in the asymptotic regression formula

    MAXVAL*{1 - exp(-RATE*X)}    (29)

in which MAXVAL and RATE are the parameters, X is a column vector of known
x-values and * here denotes item-by-item multiplication. This raises two points:
(1) Declaration of parameters. A distinction must be made between symbols in an
algebraic expression that represent parameters and those that represent variables with
known values. This can be done for instance with declarations such as

'PARAMETERS' MAXVAL,RATE. (30)

(2) Modes of expression. Since the algebraic and symbolic modes of expression
involve the same operator symbols +, -, *, / but with different meanings, an unambiguous
indication of which mode of expression is being used in particular contexts
is required. Thus, a directive 'MODEL' might carry a modifier to indicate mode:

    'MODEL/A' algebraic expression
or                                                              (31)
    'MODEL/S' symbolic expression (linear model only).

Mixed mode expressions may sometimes be necessary. For instance, the parameters
MAXVAL and RATE in (29) might depend on certain factorial treatments with
symbolic model A*B. This can be effected by introducing a special function

LM(symbolic expression, parameter name), (32)

where LM stands for linear model. The symbolic argument defines a set of x-variates
and the second argument is a name for identifying the corresponding array of parameters.
The functional notation enables symbolic expressions to be introduced into
otherwise algebraic expressions. Thus, (29) could be modified to

    LM(A*B,MAXVAL)*{1 - exp(-LM(A*B,RATE)*X)}.    (33)

Mixing of modes as in (33) can be avoided by allowing models to be named and
used in specifying other models, or by extending the definition of parameters. For
instance if MAXVAL and RATE are defined to be linear models with the declaration

'MODEL/S' MAXVAL = A*B: RATE=A*B (34)

or if, alternatively, the parameter definition (30) is extended to include the appropriate
symbolic models, e.g.
'PARAMETERS' MAXVAL,RATE $ A*B, (35)

the simpler formula (29) can be used in place of (33).
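
The evaluation of a mixed-mode formula such as (33) can be sketched as follows. This Python fragment is an illustration only and not a proposal for the GENSTAT implementation: the function lm is a hypothetical stand-in for LM, the coding of A*B by one indicator column per A.B cell is an assumption (it spans the same subspace as A*B with the implicit grand mean), and the parameter values and x-values are invented. No fitting is attempted.

```python
import math


def lm(x_rows, params):
    """Hypothetical stand-in for LM(symbolic expression, parameter name):
    one value per observation, the inner product of that observation's
    x-variates with the named parameter array."""
    return [sum(x * b for x, b in zip(row, params)) for row in x_rows]


def asymptotic(maxval, rate, x):
    """Formula (29) item by item: MAXVAL*{1 - exp(-RATE*X)}."""
    return [m * (1.0 - math.exp(-r * xi))
            for m, r, xi in zip(maxval, rate, x)]


# Assumed coding of A*B for a 2 x 2 layout: one indicator per A.B cell,
# with one observation in each cell (made-up design).
cells = [[1, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1]]
x = [0.0, 1.0, 2.0, 4.0]                 # known x-values (invented)
maxval_params = [10.0, 12.0, 8.0, 9.0]   # parameter array named MAXVAL (invented)
rate_params = [0.5, 0.5, 1.0, 1.0]       # parameter array named RATE (invented)

# Formula (33): LM(A*B,MAXVAL)*{1 - exp(-LM(A*B,RATE)*X)}
fitted = asymptotic(lm(cells, maxval_params), lm(cells, rate_params), x)
print(fitted)
```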

5. ACKNOWLEDGEMENT
The authors wish to thank a referee for valuable suggestions which substantially
improved the presentation.

REFERENCES
CLARINGBOLD, P. (1969). An approach to conversational statistics. In Statistical Computation
(R. C. Milton and J. A. Nelder, eds.) pp. 267-283. New York: Academic Press.
FINLAY, K. W. and WILKINSON, G. N. (1963). The analysis of adaptation in a plant-breeding
programme. Aust. J. Agric. Res., 14, 742-754.

FOWLKES, E. B. (1969). Some operators for ANOVA calculations. Technometrics, 11, 511-526.
HEMMERLE, W. J. (1964). Algebraic specifications of statistical models for analysis of variance
computations. J. Ass. Comput. Mach., 11, 234-239.
MULLER, M. E. and WILKINSON, G. N. (1970). Statistical algorithms and computational aspects
of the analysis of variance. Report on ANOVA Workshop, 1970, University of Wisconsin.
NELDER, J. A. (1965). The analysis of randomized experiments. I. Block structure and the null
analysis of variance. II. Treatment structure and the general analysis of variance. Proc. Roy.
Soc. A, 283, 147-178.
NELDER, J. A. et al. (1973). GENSTAT Reference Manual, Rothamsted Experimental Station.
ROGERS, C. E. (1973). Algorithm AS 65. Interpreting structure formulae. Appl. Statist., 22,
414-424.
WILKINSON, G. N. (1969). Facilities in a statistical program system for analysis of
multiple-indexed data. In Statistical Computation (R. C. Milton and J. A. Nelder, eds.), pp. 201-228.
New York: Academic Press.
WILKINSON, G. N. (1970). A general recursive procedure for analysis of variance. Biometrika,
57, 19-46.
ZYSKIND, G. (1962). On structure, relation, sigma and expectation of mean squares. Sankhya, A,
24, 115-148.
