EIAR Manual


1. Introduction
Field experimentation is used to obtain new information or to improve on previous findings (such as a new variety or improved practices). It helps to answer questions such as:

• Is the improved variety higher yielding than the local varieties?
• Which fertilizer level gives the optimum yield?
• Which insecticide is the most effective?

What is required is to obtain information through observations that are:

• objective
• verifiable

Experimentation is thus the application of knowledge and experience from related subjects to devise new research tools. There are several ways of experimenting: one may simply observe a scenario and decide based on one's own subjective judgment, or one may require other tools and methods to assist in the process of decision making. Here, the word scientific signifies that the experimentation employs appropriate planning, data collection, and application of appropriate statistical, mathematical and/or specific testing tools and devices.

One of these methods is known as the scientific, or experimental, method. It is used for investigating basic principles, for finding causes and effects, and for drawing conclusions.

The steps in the experimental method are as follows:

1. Define and state the problem.
   (From literature review, socio-economic studies or senior researchers' experience.)
2. State objectives.
3. Develop a hypothesis.
   (An assumption to be accepted or rejected depending on observed facts.)
4. Test the hypothesis.
   (An appropriate experiment is conducted to obtain information and facts that help to identify the true situation: treatment selection, application, experimental design, ...)
5. Data analysis.
6. Interpretation of results.
7. Inferences and conclusions.
8. Preparation of a complete & precise report.
   (Here you present the outcome of the present work and also propose further investigation.)

Therefore, a scientific experiment may be defined as a systematic examination of selected factors in a controlled situation, where precise observations are made in relation to the objective.

In the process of experimentation there are several sources of error that may be encountered at all stages of the work. Common sources of error may be summarized as follows:

• Inaccurate equipment
• Faulty objectives
• Personal bias
• Inadequate controls/checks
• Inadequate replication
• Lack of uniformity in soil fertility
• Topography or drainage
• Damage by rodents, birds, insects & diseases
• Others (poor seed germination, incorrect measurement, wrong plot size, lack of uniformity in cultural practices, competition effects, ...)

The next question is: what do statistical methods contribute?

In real-life situations there are several factors that may affect the performance of varieties, for example. This is because there is heterogeneity in soil, differences in slope, fertility gradients, moisture content, etc., and it is not possible to find a plot of land that is perfectly homogeneous, even for the smallest possible plots. The biological world is full of variability. This makes it difficult to visually observe and rate the performance of a variety. Therefore, statistical methods are required to reveal the inherent variability and to separate the part of the observed performance that is due to this variability from the part due to true genetic capacity, for ease of comparison or estimation of mean values. Statistical methods are thus used to measure the random variation (residual); this helps to differentiate treatment variation from random variation. For statistics to contribute effectively, the experiment must be conducted under optimum conditions.

An error introduced at a particular stage of a trial often propagates to all stages that follow. This means that even if someone takes extra care in the selection of material, set-up of the field, etc., a minor error contracted at one of these stages may prevent the required goal from being achieved. Therefore, to reach trustworthy conclusions and inferences, statistical methods should be considered at the design, data collection, analysis, interpretation and result presentation stages.

Once the data are collected, appropriate analytical tools are selected and the data subjected to analysis. This step is crucial in the sense that it shows how the statistical procedures are serially connected: to produce an acceptable result the trial must be designed properly, the data must be collected properly, and the correct analytical method must be used. It is quite difficult to compensate at the analysis stage if the design was wrong from the start.

The purpose of this manual is therefore to provide a framework linking the design, data management and analysis stages. In each chapter, designs or models are discussed first, then programs for SAS, Genstat or MSTATC are presented. Finally, the outputs from the statistical packages are discussed.

This manual is organized in 12 chapters. The first 4 chapters deal with the conceptual framework and methods of analysis for some standard designs. Chapter five deals with the design and analysis of incomplete blocks. The remaining chapters treat different issues and topics such as specialized analysis methods, correlation & regression, cluster analysis and a comparative discussion of different statistical packages.

We are very grateful to EARO management for making available resources for
preparation of the manual. We are very thankful to research staff who read the draft of
this document and rendered constructive comments. Finally, we thank all those who
directly or indirectly contributed to the success of this work.

2. Developing Field Experiments
As discussed in the previous section, it is important to relate the use of statistical/biometrical methods to the stages of experimentation, because statistical/biometrical planning should begin right at the inception of the study. Unless the problem is clearly stated, and the methods to overcome the problem through experimentation are well established, the plan for the statistical component will not be successful.

2.1 Steps for developing field experiments

The following points may help as a guideline:

 State the problem to be solved clearly.

 Find out what has been done in the same line of work, to avoid duplication. This is done through literature search (books, journals, manuals, reviews, handbooks) and consultation with other experienced people (old farmers, extension workers, researchers, ...).

 After the literature search, the work plan must be stated concisely and clearly. This constitutes a "proposal". A proposal should contain the following major points:

Introduction \ Background
Justification
Objective
Experimental procedures
Expected output, Activities and Indicator of Performance
Literature review
Materials & methods
Beneficiaries
Work plan
Financial requirement

Example of an objective

• To compare the response of wheat variety X to different levels of nitrogen on vertisols in two different cropping seasons at two locations.

This is a crucial stage in experimentation, and care should be taken in selecting an appropriate design, and in determining the characteristics to be measured, the treatments, the number of replications, whether a control is required, etc. The choices are all guided by the objective of the study. If you have doubts about your formulation, it is good to see a biometrician as early as possible.

Also remember that the presence of control treatments can be an advantage in situations where the investigator cannot control the environmental factors at the desired level of precision. Controls help to make treatment comparisons (especially of new selections) easier and more meaningful.

2.2 Hypothesis Testing
Once the experiment is set up properly and the objectives are indicated clearly, the next step is to formulate the hypotheses to be tested. Hypotheses are normally derived from the objectives and are put in the form of statements. For example, in comparing an improved variety with a standard check, the hypothesis to be tested is that "the yielding ability of the improved variety is the same as that of the check"; this is normally referred to as the null hypothesis (Ho). The reverse is known as the alternative hypothesis (H1). Often, several hypotheses can be derived from a single objective. The concept may be elaborated as follows.

There are three possibilities for testing a hypothesis. The first assumes that the mean yield of the improved variety is calculated from a sample while the check mean is an absolute (population) mean. The second considers both the new variety mean and the check mean to have been obtained from sample observations (which is often the case in experimentation). The third is to compare two varieties from sample observations, often known as an independent samples test.

Possibility I          Possibility II          Possibility III

Ho: x̄ = µ              Ho: x̄ = x̄o             Ho: x̄1 = x̄2
H1: x̄ ≠ µ              H1: x̄ ≠ x̄o             H1: x̄1 ≠ x̄2

The corresponding test statistics, in the same order, are:

Z = (x̄ − µ)/σ          t = (x̄ − x̄o)/S         F = Treatment MS / Error MS

For the third possibility both the t and F test statistics can serve, but where more than two samples are compared the F test is used as an overall test to detect whether any two means differ, while the t-test is used to make pairwise comparisons. Note that for the two-sample case F = t².
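For illustration: if a two-treatment comparison gives t = 2.5 on the error degrees of freedom, the corresponding ANOVA F-value is t² = 6.25 with (1, error d.f.), and the two tests give exactly the same significance level.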

Example

Suppose that the mean yield of the standard variety is 20 per acre, with a standard error (s.e.) of 2, and a mean yield of 25 is observed for the new variety.

Z = (25 − 20)/2 = 2.5

Looking at the normal probability table, the probability of obtaining, by chance, a value of Z greater than 2.5 or less than −2.5 is 0.0124. Therefore, 0.0124 is called the significance level of this test. We say that the observed yield of the new variety is significantly greater than the mean of the control at the 0.0124 level. The question now is which levels should be termed significant and which should not.
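The same calculation can be reproduced in software. A minimal SAS sketch (the data step and dataset name are illustrative additions, not part of the original example):

data ztest;
   z = (25 - 20)/2;                /* observed Z statistic */
   p = 2*(1 - probnorm(abs(z)));   /* two-sided normal tail probability */
run;
proc print data=ztest;
run;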

What is the Significance level?

• If the probability of obtaining a value as extreme as this, or more extreme, is small, then we say that a yield of 25 cannot plausibly have occurred by chance; it must be different from the mean. There are three common levels: 5%, 1% and 0.1%.
• If significant at 5%, we say the event has provided evidence against the hypothesis.
• If significant at 1%, we say strong evidence, etc.

Choice of Significance Levels

In principle one can calculate the exact probability level at which a particular difference is significant, but in this way there could be hundreds of significance levels and it would not be possible to standardize results. Therefore, statisticians and subject-matter experimenters agreed to set a threshold probability below which significance can be declared. This value was chosen to be 5%; but since there are numerous probability points below 5%, experimenters further agreed to quote the 5%, 1% and 0.1% levels, as these are believed to give a sufficient range of significance levels for most practical purposes.

In this exercise two mistakes can be made:

1) One may demand too high a level of significance and accept the null hypothesis on many occasions when in fact it is not true.

2) Too low a level of significance will mean that the hypothesis is often rejected when in fact it is true.

The balance between the two has to be struck by considering their likely consequences. In practice both too high and too low significance levels have advantages and disadvantages. Consider the following hypothesis:

Ho: Irrigation has no effect on grassland Yield

Suppose t = 2.5 was obtained after collecting relevant data.

Now consider the two possibilities:

1) Suppose that this hypothesis is tested at the 5% level. We are saying that if a mean difference that produces a significant result at 5% is obtained, then we conclude that irrigation has an effect. But look at it from another angle: had all possibilities been explored, there might have been other ways of increasing yield further. Because we declared a significant result at 5% and recommended that irrigation be installed, we might stop looking for other alternatives for increasing yield. Therefore, opportunities for increased yield might be lost just because we selected the 5% level.

2) If we choose the 1% level, on the other hand, we might declare non-significant a result which would have been significant at 5%. In this case we are overlooking the improvement that irrigation contributed to yield. We would then advise policy makers that irrigation is not important (whereas in reality there is some evidence of its importance at a higher probability level), and consequently expensive equipment purchased for irrigation might be dumped.

Note: Classification of a result as significant is not a proof, but only evidence.

The above two statements are related to type I and type II errors.
1) Type I error: rejecting Ho when it is true. Its probability is represented by α.
2) Type II error: accepting Ho when it is false. Its probability is represented by β.
3) Power of a test: the probability of accepting H1 when it is true.

Power = 1 − Pr(type II error). Therefore, the objective of statistical tests should be to maximize power.
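For example, if the probability of a type II error is 0.2, then the power is 1 − 0.2 = 0.8; that is, a true difference of the stated size would be detected in about 80% of repeated experiments.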

The lesson from this example is that both high and low significance levels have their own advantages and disadvantages, and the choice of level must depend on the nature of the trial. For example, tests used in the social sciences normally use high probability levels (like 10%) because the objective is often to detect small differences, and the results may not bring irreversible problems to the subject of study even if the conclusion is wrong. In medicine, on the other hand, very small levels are used because more precision is required to protect life.

3. Principles of Experimental Design
In this section the reasons for applying statistical design will be assessed, and important concepts and terms required to implement a designed experiment will be discussed in detail.

3.1 Basic objectives in experimental design:


The major objectives may be stated as follows.

1. To provide estimates of treatment effects or of differences among treatment effects.
2. To provide an efficient way of confirming or denying conjectures about the response to treatments.
3. To estimate the reliability of conjectures.
4. To estimate the variability of the experimental material (the plots, animals, etc. that constitute the experimental material).
5. To provide a systematic pattern for conducting an experiment.

3.2 Some basic concepts


Number of plots

In any experiment, e.g. a plant breeding trial, a fertilizer trial, a livestock trial, etc., the design must give fairly reliable measures of the relevant sources of variation.

For example, in a comparison of two varieties, if one variety is planted on one plot and the other variety on another plot:

• we can observe the difference in yield between the two plots,
• but we cannot measure the experimental error (only one unit is used per treatment), since we do not know whether the observed difference is due to varietal difference or to the difference between the two plots.

Experimental error

Random variation among experimental units occurs due to biological variation, soil variation, variation in technique, etc. If the breeder wants to test the significance of the difference between varieties and to compute an interval estimate of the difference, a measure of experimental error is needed.

Two main sources of experimental error

i) Inherent variability in the experimental material to which the treatments are applied.
ii) Lack of uniformity in the physical conduct of the experiment, in other words failure to standardize the experimental technique.

Example: an experiment on the effectiveness of a certain medication, where the response variable is time to recovery.

Inherent variability may result from:

- the age of the individuals
- sex, weight, ...

Lack of uniformity in the physical conduct may be due to:

- the time of taking the drug, or
- the dose.

Replication

Replication provides an estimate of the experimental error: the breeder repeats, or replicates, each variety more than once in the experiment. Replication serves a number of purposes in a designed experiment:

1. It provides an estimate of experimental error.
2. It increases precision by reducing the standard error.
3. It increases representativeness, since a wider area is used for the experiment.

The question often asked is: how many replications? There is no simple answer, because it depends on many considerations, including the following:

 The resources available
 The availability of experimental material
 The treatment structure
 The relative importance of the different comparisons

For variable experimental material (e.g. on-farm fields) more replication is required than for less variable material (e.g. a controlled experimental situation). Nevertheless, it is always a good idea to safeguard the degrees of freedom for the error sum of squares, since the quality of an experiment is normally judged by the amount of information from which the experimental error is estimated. Some people argue that the degrees of freedom for the error term in on-station field experiments must be more than 16, and some journals indirectly follow such rules for the acceptance of manuscripts.
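As a quick illustration of safeguarding the error degrees of freedom: in an RCBD with p treatments and r blocks the error d.f. is (r − 1)(p − 1), so 6 treatments in 4 blocks give 3 × 5 = 15 d.f., just under the suggested 16, whereas a fifth block would raise this to 4 × 5 = 20.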

Randomization

Randomization ensures that no treatment is consistently favored by being placed under the best conditions. Either plots are allocated to treatments randomly, or treatments are allocated randomly to plots. Randomization serves the following purposes:

1. To eliminate bias. Randomization ensures that no treatment is favored or discriminated against by systematic assignment to plots.

2. To ensure independence among the observations. This is a necessary condition for the validity of the assumptions behind significance tests and confidence intervals.

Blocking

Blocking refers to the grouping of plots into blocks of more or less uniform plots. That is, plots within the same block are homogeneous.

Reasons for blocking:

1. It can increase the precision of an experiment: differences among blocks are removed from the experimental error in the analysis of the results.
2. Treatments are compared under more nearly equal conditions, because comparisons are made within blocks of uniform plots.
3. It can increase the information from an experiment, because blocks can be placed at different locations.

Let us now illustrate the use of blocking, replication and randomization using a simple (trivial) approach. How the objective, the treatment contrast and the hypothesis-testing exercise are related is best illustrated as follows.

Suppose that we are working on an agronomic trial comparing the yielding ability of a new variety A and a standard variety B. The logical hypothesis is that "A" will yield more than "B". The objective of the experiment will be to test this hypothesis. The question now is how the testing is performed.

Suppose we prepared two plots side by side and plant one plot to A and the other to B.

A B

Grain yield of the two plots will be collected and used to compare the two varieties.

Suppose the plot planted to A yielded 5 q/ha and the plot planted to B yielded 4.5 q/ha. Is it possible to say that A is higher yielding than B? Why?

The answer is obviously "No!", because there are many factors that can influence the yields of the two plots, such as differences in soil fertility, moisture, damage, etc. For this reason, even if we had planted the two plots to the same variety, the yields would probably have been different. Therefore, the difference in yield between the two plots, one planted to A and the other to B, could be due to something other than varietal differences.

The next question is how to separate differences due to the performance of the varieties from these other factors.

Let's now introduce some concepts. Consider two plots planted to same variety A, and
two plots planted to varieties A and B.

Plot 1 Plot 2 Plot 1 Plot 2

A A A B

A (plot 1) - A (Plot 2) = Experimental error

A - B = observed treatment differences

Therefore, variety A would be judged superior to B only if the observed treatment difference is greater than the experimental error. This implies that, as a researcher, you must be able to generate an estimate of experimental error so that you can judge your experimental results. Often we deal with more than two treatments, and the measure of experimental error is implicitly calculated from the "average" of the differences between two or more plots that received the same treatment.

In any experiment, experimental error is bound to exist; it cannot be totally eliminated, but it can be minimized. A concept that should be clear here is that one can change the size of the experimental error in an experiment, but one cannot change the real difference between the two varieties. This implies that the ability to detect a true treatment difference depends on the ability to minimize, or precisely estimate, experimental error.

If two people conducted an experiment with the same varieties at different places, they might come up with different conclusions simply because of differences in their abilities to measure and control experimental error.

The question now is how can we measure and control experimental error?

Since the experimental error is the difference between plots treated alike, the way to
measure it is to treat a number of plots alike. This is known as replication. Each
treatment must be replicated a number of times in experimental plots.

A B A B

B A B A
Rep I Rep II Rep III Rep IV

If we accept the importance of replication, the next question is how to arrange the replications.

Suppose there are four plots for two varieties, that the fertility gradient runs along the width of the field, and that we have allocated treatment B to plots more fertile than the remaining two.

B A

B A

When we try to estimate the variety differences, the estimated difference reflects not only the real differences but also the fertility differences between plots. That is, the estimates are biased in favor of variety B. This introduces what is known as systematic error: the error of deliberately allocating variety B to the good plots.

This arrangement, therefore, makes the apparent experimental error smaller by the amount of the systematic error, which is instead added to the variety differences, leading to misleading conclusions.

To overcome this kind of problem we use a concept known as RANDOMIZATION. It ensures an equal chance for every experimental plot to receive any treatment, and it avoids consistently favoring one or more treatments. Randomization is the assignment of treatments to the experimental plots at random. Both randomization and replication are therefore necessary conditions for obtaining an unbiased estimate of experimental error: replication provides the estimate, and randomization ensures that the estimate is unbiased.

Having obtained an unbiased estimate of experimental error, we now look for possibilities of reducing it. This is what we call error control, or control of local variation.

There are several techniques for this. One of them is "blocking", which we will see in the next section.

4. Experimental Designs for Field Experiments.
To select the experimental design that best suits the objectives of an experiment, the following must be known:

• the type and number of treatments involved;
• the degree of precision desired;
• the size of the "uncontrollable" variation.

In selecting a design it is a good idea to go for the simplest design that controls variability and attains the desired precision. Commonly used designs are the CRD, RCBD, split plot, strip plot and lattice.

4.1. Completely Randomized Design (CRD)

The CRD is one of the least used designs, but it can be helpful in situations where the experimental area is believed to be uniform and few treatments are examined.

It is a flexible design, and the statistical analysis is simple even when there are unequal replications or missing values.

Suppose that we have 8 uniform experimental plots for a two variety comparison

1 5
2 6
3 7
4 8

Since we believe that the 8 plots are uniform with no systematic variation, the next step is to randomly allocate the two varieties to the plots.

For the random allocation we can use either a random number table or the lottery method.

Random number tables are found at the back of most statistical textbooks. They are pseudo-random numbers generated on a computer using a random number generator.

To allocate the varieties to plots, randomly select one of them, say A. Then start at any point in the table and move in any direction you like, considering only the first digit. Allocate A to the plot numbers you draw, as long as the number picked is between 1 and 8.

For example, moving downwards from the top, we obtained the numbers 4, 8, 2, 1, skipping repetitions and numbers outside 1-8. Therefore, variety A is allocated to plots 1, 2, 4 and 8, and variety B to the remaining plots.
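The same randomization can also be generated by software instead of a printed table. A minimal SAS sketch using PROC PLAN (the seed is arbitrary; the variety labels are then attached to the randomized plot order by hand):

proc plan seed=20754;    /* any seed; omit SEED= to use a clock-based seed */
   factors plot=8;       /* prints the plot numbers 1-8 in random order */
run;

Variety A would then be planted on the first four plot numbers listed and variety B on the remaining four.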

Method of Analysis

Model: xij = µ + αi + εij
where µ = grand mean
      αi = treatment effect
      εij = random error

Taking the best estimators for the parameters, rearranging, summing and squaring gives

∑∑(xij − x̄..)² = ∑∑(xij − x̄i.)² + r∑(x̄i. − x̄..)²

The computational formulas are:

SST = ∑∑(xij − x̄..)² = ∑∑x²ij − x²../(rp)
SSt = r∑(x̄i. − x̄..)² = ∑x²i./r − x²../(rp)
SSE = SST − SSt

The hypothesis tested is

Ho: µ1 = µ2 = ... = µp, or equivalently
Ho: αi = 0 for all i.

ANOVA

Source d.f. SS MS F- value


Total rp-1 SST
Treatment p-1 SSt MSt = SSt/(p-1) F=MSt/MSE
Error p(r-1) SSE MSE = SSE/ (p(r-1))

The total variation is partitioned into two components:

a. variation among treatment means
b. variation among plots within treatments (error).

If F- calculated is larger than the tabulated value then we reject the null hypothesis and
conclude that there are significant differences among the treatment means.

Coefficient of variation (cv)

It is used to measure the reliability of the experiment. The theoretical framework is that
if an attempt is made to compare two experiments based on mean value then the
comparison may be biased since measurements may be done on same scale. On the
other hand if variance is used for comparison, experience showed that variance is highly
related to mean and if mean is large variance tends to be large invalidating the
comparison. Therefore, to compare two similar experiments, a ratio of variance by mean

14
is selected as this quantity has no unit of measurement and since it remains constant
throughout.

_
CV = (√MSE) x 100 expressed as a percentage , where Y = Grand mean
_
Y
Usually the common ranges of CV under controlled experiments are:

6 to 8% for variety trials,


10 to 12% for fertilizer trials,
13 to 15% for pesticide and herbicide trials.

Example

An animal scientist in performing a study of weight gains randomly assigned 8 animals


to each of 3 diets. The following weight gains (Kg) were recorded. Here it is assumed
that the 8 animals are very similar in all aspect, otherwise use of CRD is not valid.

Treatment (Diet)

A B C
14.20 12.85 14.15
14.30 13.65 13.90
15.00 13.40 13.65
14.60 14.20 13.60
14.55 12.75 13.20
15.15 13.35 13.20
14.60 12.50 14.05
14.55     12.80        13.80               Total
Sum    y1. = 116.95   y2. = 105.50   y3. = 109.55   y.. = 332.00
Mean   ȳ1 = 14.619    ȳ2 = 13.188    ȳ3 = 13.694    ȳ = 13.833

To get the entries of the ANOVA table, first compute the correction term:

C = (y..)²/(rp) = (332.00)²/((8)(3)) = 4,592.667

SST = ∑i∑j(yij)² − C = [(14.20)² + (14.30)² + ...] − C = 12.2683
SSt = (1/r)∑i(yi.)² − C = (1/8)[(116.95)² + (105.50)² + (109.55)²] − C = 8.4277
SSE = SST − SSt = 3.8407

ANOVA
Source d.f SS MS F
Total 23 12.2683
Diet 2 8.4277 4.2139 FT = 23.04**
Error 21 3.8407 0.1829

CV = (√0.1829 / 13.83) × 100 = 3.1%

Presentation of results

Here diet is the treatment to be tested using the F-ratio. Note that the degrees of freedom for the error term is 21, which gives a fairly good estimate of the experimental error. Remember that if insufficient information (d.f.) is used for estimating error, the F-ratio may not be valid and the tests may not make sense. The tabulated F-values for the 5% and 1% levels are:

F.05(2, 21) = 3.47,  F.01(2, 21) = 5.78

Therefore, the differences among the weight gain means of the diets are highly
significant.

This test is known as an overall test of difference between treatments: so far we only know that the 3 diets are not all the same, or that at least one of them differs from the other two. The next step is to investigate which ones are different from each other; t-tests are often used for this purpose.

t = (ȳ1 − ȳ2) / √(2·MSE/r) = (14.62 − 13.19) / √(2 × 0.183/8) ≈ 6.7

Since this far exceeds t0.01(21) = 2.83, diet A is superior to B. The t-value for comparing diet B with C is 2.4, which is significant at 5%. Now let us ask a question: the difference between the means of B and C is only 0.5, yet it is significant at the 5% level. Why? Although the numerical mean difference between the two is small, because of the small experimental error (MSE = 0.183) even such a small difference turns out to be significant. That means not only the mean values but also the level of variability contributes to significance tests. The standard error of a mean is given by:

√(MSE/r) = √(0.183/8) = 0.15

which is a good indicator of the level of precision with which the mean has been estimated. The standard error of the difference between two means is given by:

√(2·MSE/r) = √(2 × 0.183/8) = 0.21

The same result may be obtained using a statistical program. For example, the following SAS commands produce the ANOVA result that follows.

data gains;              /* a dataset name must be supplied before infile */
infile "filename";
input diet weitg;
run;
Proc ANOVA;
Class diet;
Model weitg = diet;
Run;

ANOVA result for CRD design, SAS Computer output(as it is)

The SAS System 25


02:43 Thursday, November 9, 2000

Analysis of Variance Procedure

Dependent Variable: WEITG


Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 8.42770833 4.21385417 23.04 0.0001

Error 21 3.84062500 0.18288690

Corrected Total 23 12.26833333

R-Square C.V. Root MSE WEITG Mean

0.686948 3.091466 0.427653 13.83333
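Pairwise comparisons like the t-tests above can also be requested directly from SAS. One possible addition to the PROC ANOVA step (a sketch; with equal replication the LSD option is equivalent to pairwise t-tests at the stated level):

Proc ANOVA;
Class diet;
Model weitg = diet;
Means diet / lsd;    /* least significant difference comparisons of the diet means */
Run;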

Unequal Replications

If the treatments are not equally replicated only minor modifications are necessary both
for hand calculation and for the SAS Program.

Proc GLM is preferable for unequal replications in SAS.

Proc GLM;
Class diet;
Model weitg = diet;
LSMEANS diet;
Run;
Here LSMEANS produces adjusted means for the diets.

4.2. Randomized Complete Block Design (RCBD)

The CRD assumes a world with little variability, which is not realistic in most cases, because there are always fertility trends, and these cast doubt on the applicability of the CRD. In the presence of a trend the use of the CRD is misleading, because there is a chance that one of the varieties will receive all the good or all the bad plots, which leads to a biased estimate of error.

Suppose that we have a plot of land on which to test the yielding ability of 4 varieties, A, B, C, D, and suppose that fertility increases in the direction indicated by the arrow.

..........>

This means that the land is no longer uniform, and the use of the CRD is not supported.

Next, we have to divide the land into blocks of 4 plots each, so that the plots within a block are as uniform as possible. Once the size of the land is determined, it is the plot size which determines the number of blocks; in this particular example, for simplicity, let us assume we require two blocks. Blocking is done perpendicular to the direction of the fertility gradient: divide the field into equal parts, and divide each block into 4 plots in the direction of the fertility trend.

Block I Block II

Next, randomly allocate the four varieties to the four plots in each block. Note that the advantage of the RCBD over the CRD is that the between-block variation is calculated separately, which otherwise would have entered into the experimental error.

Block I Block II
A C
C A
D D
B B

Analysis of variance of RCBD

Model: xij = µ + αi + βj + εij

Assuming ∑αi = 0, ∑βj = 0,
εij ~ NID(0, σ²), so that
xij ~ N(µ + αi + βj, σ²)

The observation can be decomposed as

xij = µ + αi + βj + εij
    = µ + (µi. − µ) + (µ.j − µ) + (xij − µi. − µ.j + µ)

Replacing the parameters by their best estimates gives

xij = x̄.. + (x̄i. − x̄..) + (x̄.j − x̄..) + (xij − x̄i. − x̄.j + x̄..)

Squaring both sides and summing over i and j gives

∑∑(xij − x̄..)² = r∑(x̄i. − x̄..)² + p∑(x̄.j − x̄..)² + ∑∑(xij − x̄i. − x̄.j + x̄..)²

SST = SSt + SSB + SSE

The short-cut computational formulas are:

SST = ∑∑x²ij − x²../(rp)
SSt = ∑x²i./r − x²../(rp)
SSB = ∑x².j/p − x²../(rp)
SSE = SST − SSt − SSB

ANOVA
Source d.f SS MS F
Total rp-1 SST
Block r-1 SSB MSB FB = MSB/MSE
Treatment p-1 SSt MSt Ft = MSt/MSE
Error (r-1)(p-1) SSE MSE

Coefficient of variation:

CV = (√MSE / ȳ) × 100

Significance tests:

Null hypothesis: there is no difference among the treatment means.

If Ft exceeds the table value (Fα with p−1 and (r−1)(p−1) d.f.) we reject the null hypothesis and conclude that there are significant differences among the treatment means.

Relative Efficiency:

To see how much precision has been gained by the randomized block design relative to the simpler, less restrictive completely randomized design, we use the relative efficiency:

RE = [(r−1)MSB + r(p−1)MSE] / [(rp−1)MSE]

If RE > 1, blocking has been effective. The quantity (RE − 1) measures the proportional increase in precision due to blocking.

Example

To compare the effects of six different sources of nitrogen on the yield of wheat, 24 plots were grouped into four blocks with soil type as the blocking factor (block size = 6).

The yields (kg/plot) are shown below.

Soil Type
Treatment I II III IV Total Mean
1 32.1 35.6 41.9 35.4 145.0 36.50
2 30.1 31.5 37.1 30.8 129.5 32.38
3 25.4 27.4 33.8 31.1 117.7 29.42
4 24.1 33.0 35.6 31.4 124.1 31.02
5 26.1 31.0 33.8 31.9 122.8 30.70
6(control) 23.2 24.8 26.7 26.7 101.4 25.35
Total 161.0 183.3 208.9 187.3 740.5
Mean 26.83 30.55 34.82 31.22 30.85

The manual calculation for the ANOVA is as follows:

SST = (32.1)² + (35.6)² + ... + (26.7)² − (740.5)²/(6×4) = 492.36

SSt = (1/4)[(145.0)² + (129.5)² + ... + (101.4)²] − (740.5)²/(6×4) = 255.28

SSB = (1/6)[(161.0)² + (183.3)² + (208.9)² + (187.3)²] − (740.5)²/(6×4) = 192.55

SSE = 492.36 − 255.28 − 192.55 = 44.53

The mean squares are calculated as

MSt = SSt/(p−1) = 255.28/5 = 51.06
MSB = SSB/(r−1) = 192.55/3 = 64.18
MSE = SSE/((p−1)(r−1)) = 44.53/15 = 2.97

Ft = MSt/MSE = 51.06/2.97 = 17.19

ANOVA of wheat yield Data


Source d.f. SS MS F
Total 23 492.36
Blocks 3 192.55 64.18
Treatment 5 255.28 51.06 17.19**
Error 15 44.53 2.97

CV = (√2.97 / 30.85) × 100 = 5.6%

RE = [(r−1)MSB + r(p−1)MSE] / [(rp−1)MSE] = [(3)(64.18) + 4(5)(2.97)] / [(23)(2.97)] = 3.69

F0.01 with [p−1, (r−1)(p−1)] = (5, 15) d.f. is 4.56.

Presentation of results

The difference between this and the CRD ANOVA is the presence of the block term. If these data were treated as if they came from a CRD, the error sum of squares would have been 237.08, since the block effect would be added to the error.

Since Ft = 17.19 > 4.56, there are highly significant differences among the sources of nitrogen in their effect on the yield of wheat. Blocking by soil type increased the efficiency of the randomized block design by about 269% compared with a completely randomized design. As described earlier, the ANOVA table shows evidence of differences between the six treatments, but we still do not know which ones differ, and we need to do individual comparisons of means. At this point there are a number of alternatives: either to run multiple range tests or contrasts. This will be discussed in section 4.4. An important point to remember in RCBD is that although we have two sources of variation, we are interested in testing only the treatment and not the block effect, because the objective is not to detect differences between blocks but to control variability. However, if we want to know whether blocking was effective, the relative efficiency can tell us. Suppose that blocking was not effective: should we change our analysis to CRD or report it as it is? This is a source of controversy even among biometricians, but one important thing to note is that the major difference between the designs is the randomization. Since in RCBD treatments are randomized within blocks, there is a strict restriction on the treatments, and this makes it different from CRD, where treatments are randomized across the whole experimental area. Therefore, analyzing an experiment designed as an RCBD as if it were a CRD would simply override this reality and affect the result.

The same analysis can also be done using a computer. For example, to use MSTATC, select "FACTOR" from the main menu and then select "RCBD 1 factor" from the 35 sub-menus listed. The ANOVA structure is exactly the same as the one shown above. In addition, MSTATC gives the LSD value and the s.e. (sȳ) for the different mean groups.

To use SAS for same analysis, the procedure may be given as (Omitting data step):

Proc ANOVA;
Class rep treat;
Model yield = rep treat;
Run;

Similarly, proc GLM could do the same purpose.

Note that here we have two factor columns, rep & treatment, and the model statement contains both of them. The output is in two stages: the first part of the ANOVA contains the model & error components only; the second part splits the model component into two, the individual soil and nitrogen effects.

SAS output for the RCBD design

Analysis of Variance Procedure

Dependent Variable: YIELD


Sum of Mean
Source DF Squares Square F Value Pr > F
Model 8 447.8316667 55.9789583 18.86 0.0001
Error 15 44.5279167 2.9685278

Corrected Total 23 492.3595833

R-Square C.V. Root MSE YIELD Mean

0.909562 5.584146 1.722942 30.85417

Analysis of Variance Procedure

Dependent Variable: YIELD

Source DF Anova SS Mean Square F Value Pr > F

SOIL 3 192.5545833 64.1848611 21.62 0.0001


NITROGEN 5 255.2770833 51.0554167 17.20 0.0001

4.2.1. Use of the Coefficient of Variation (CV)

There are three quantities often calculated from an ANOVA which are used not for particular comparisons within the experiment, but as general measures of mean yield and precision, for comparison with other, similar, experiments.

They are:

1) the grand mean (x̄)
2) the unit standard deviation (S = √S²)
3) the CV (CV = S/x̄)

Uses
Uses

1) The amount of variability is often found to be related to the mean level of yield in an experiment. Hence the CV is a more stable quantity than the unit standard deviation, and it can be used to compare the precision of two similar experiments.

2) Sometimes the aim of an experiment is stated in terms of being able to detect, with a given probability, a stated percentage difference between treatment means. The CV can then be used to decide how many replicates are needed.

S = (CV)·x̄

P(Z > (x − x̄)/(S/√n)) = α

⇒ Z = (x − x̄)/(S/√n), and writing x − x̄ = d,

⇒ (S/√n)·Z = d  ⇒  n = (S·Z)²/d²

where x = an observed value, x̄ = the mean value, S = the unit standard deviation, d = the mean difference to be detected, and Z = the Z-score under the normal distribution.
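As a numerical illustration of this use of the CV (all values assumed): with CV = 10% and a mean of 20, S = 2; to detect a difference d = 2 at the two-sided 5% level, Z = 1.96 and n = (S·Z)²/d² = (2 × 1.96/2)² ≈ 3.8, i.e. about 4 replicates. A minimal SAS sketch of the same calculation:

data reps;
   cv = 0.10; xbar = 20;    /* assumed CV and mean */
   s = cv*xbar;             /* unit standard deviation */
   d = 2;                   /* difference to be detected */
   z = probit(0.975);       /* two-sided 5% point of the normal, 1.96 */
   n = ceil((s*z/d)**2);    /* required number of replicates */
run;
proc print data=reps;
run;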

Drawbacks

1) Unfortunately the amount of variation is not always simply related to the mean, and the CV can then be a meaningless quantity.

Examples

a)       Expt 1    Expt 2    Expt 3

   S     2.4       3         1.5
   µ     14        18        10

   CV    17%       16%       15%

In this example similar experiments were done 3 times and the CVs are nearly constant. It is therefore meaningful to quote the CV for future use and reference when a further experiment is performed along the same lines. On the other hand, in the following experiments the CVs are not constant (assuming that the experiments were done at an acceptable level of precision). Hence there is no guarantee that, when the next experiment is done, the CV will be close to these ones; the CV here is therefore not as useful.

b)       Expt 1    Expt 2    Expt 3

   S     3         5         2
   µ     9         25        4

2) As in the above example, the CV is only a useful quantity where experience shows that it tends to take approximately the same value in different experiments. For example, the CV of whole-lactation milk yield has been found to be 25% (Mead and Curnow, 1998). This enables us to calculate, with a certain amount of confidence, the number of cows that would be required to achieve some degree of accuracy in the measurement of lactation milk yield in the future.

3) For some measurements the s.e. may be just as stable as the CV, in which case the CV may not be necessary.

In general, if the CV is found to be different from the typical value expected (i.e. that of other experiments), then:

a) either S or x̄ has changed;

b) if the CV is large, look at your data for rogue values to be checked.

4.3. More Restrictive Designs
In randomized block design we control one source of variation by grouping the
experimental units into blocks. There are experimental designs which control two or
more sources of variation.

4.3.1. Latin Square Designs


In RCBD we assumed that the fertility gradient occurs in only one direction. What happens if it occurs in two directions? This gives rise to a different design known as the Latin square.

The Latin square design controls two sources of variation. To construct a Latin square for p treatments we require p² experimental units. These units are classified into p groups, of p units each, based on one source of variation; this is commonly called the row classification. The units are then grouped into p groups, of p units each, based on the second source of variation; this is commonly called the column classification.

Suppose that we have 3 varieties to be tested in a situation where there is a fertility gradient in one direction and a slope in the other direction:

Fertility: ------------>
Slope: (perpendicular to the fertility gradient)

Since homogeneity is lost, it is logical to block following the trend in both directions.

Columns

Row

Because of the additional blocking structure, there are some restrictions on randomization: every treatment must occur once and only once in each row and each column.

Columns
A B C
Rows B C A
C A B

The advantage of the Latin square is again to reduce experimental error. This can be shown as follows:

total variation = treatment variation + error,
but error = random variation + row variation + column variation.

Therefore, the design removes the row and column variation from the error term.

The Latin square design is not suitable for a large number of treatments, because the number of replications equals the number of treatments: for many treatments, say 25 or more, a lot of replication is required, which may not be attainable, and the randomization procedure itself becomes difficult.

What about RCBD? The answer again is no. Remembering that in RCBD plots within a block are expected to be uniform, as the number of plots per block increases the block tends to lose uniformity. Therefore, it is advisable not to use blocks of more than about 16 plots for small-sized plots, and even fewer for relatively large plots, except in well-established research centers where research plots have been made relatively uniform over time. Loss of homogeneity in a block has several consequences. The first is that the treatments may not be compared efficiently within a block. As a result, the difference between any two treatments in a block will also include differences between blocks, which may lead to failure of the additivity assumption, and the use of ANOVA in such conditions will not be valid. One of the major problems in this regard in this country is that optimum plot sizes have not been determined for most crops, and optimum block sizes are thus not known. An attempt has been made by Biometrics Services to alleviate this problem: work has been done to determine the optimum plot size for wheat, and the same activity is under way for potato.

Therefore, complete blocks and Latin squares do not serve for variety trials, since several entries are normally handled in a variety trial.

Analysis

Model: xijk = µ + αi + βj + γk + εijk
where µ = overall mean
      αi = treatment effect
      βj = row effect
      γk = column effect
      εijk = error term

Using the best estimates for the parameters of the given model, and squaring and summing over all subscripts, we get

∑∑∑(xijk − x̄...)² = p∑(x̄i.. − x̄...)² + p∑(x̄.j. − x̄...)² + p∑(x̄..k − x̄...)² + ∑∑∑(xijk − x̄i.. − x̄.j. − x̄..k + 2x̄...)²

The computational formulas are

SST = ∑∑∑x²ijk − CF
SSt = ∑x²i../p − CF
SSR = ∑x².j./p − CF
SSC = ∑x²..k/p − CF
SSE = SST − SSt − SSR − SSC

where CF = x².../p²

ANOVA for the Latin square

Source     d.f.        SS    MS                      F
Total      p²−1        SST
Rows       p−1         SSR   MSR = SSR/(p−1)         FR = MSR/MSE
Columns    p−1         SSC   MSC = SSC/(p−1)         FC = MSC/MSE
Treatment  p−1         SSt   MSt = SSt/(p−1)         Ft = MSt/MSE
Error      (p−1)(p−2)  SSE   MSE = SSE/((p−1)(p−2))

Example

An experiment was designed to determine the effects of diets A, B, C, D and a control, E, on liver cholesterol in sheep. Recognized sources of variation were body weight and age. The researcher randomly selected a 5 × 5 Latin square and arranged the experiment accordingly; the results for the different treatments are presented below.

Age group

Weight group I II III IV V Total


1 A 3.38 B 3.37 D 3.04 C 3.27 E 2.44 15.49
2 D 3.70 E 2.88 B 3.35 A 3.46 C 3.34 16.73
3 C 3.58 D 3.56 A 3.69 E 2.67 B 3.51 17.01
4 E 3.32 A 3.71 C 3.74 B 3.81 D 3.41 17.72
5 B 3.48 C 3.91 E 3.27 D 3.74 A 3.64 17.61
Total 17.46 17.43 17.09 16.95 16.34 84.56

Total diets
A B C D E
17.87 17.52 17.41 17.18 14.58

ANOVA of liver cholesterol Data


Source     d.f.             SS              MS            F
Total      p²−1 = 24        SSTot = 2.9666
Rows       p−1 = 4          SSR = 0.8740    MSR = 0.2185  FR = 7.12**
Columns    p−1 = 4          SSC = 0.1656    MSC = 0.0414  FC = 1.35
Treatment  p−1 = 4          SSt = 1.5589    MSt = 0.3897  Ft = 12.69**
Error      (p−1)(p−2) = 12  SSE = 0.3681    MSE = 0.0307

CV = (√0.0307 / 3.411) × 100 = 5.1%

Presentation of result

There are highly significant differences among the diets. There is also evidence of significant differences among the weight groups.

Age groups appeared to have little effect. Therefore, further experiments on this or similar sheep might use a randomized block design with weight groups as blocks, since blocking by age group was not useful. Since there is evidence of differences between the 5 diets, the next step is to investigate which ones differ from each other by using t-tests, LSD or another multiple range test if required.

SAS procedure for analyzing the above data set is as follows, omitting data entry step.

Proc ANOVA;
Class weight age diets;
Model Y = weight age diets;
Run;

Computer output for latin square design


The SAS System

Analysis of Variance Procedure

Dependent Variable: CHOLEST


Sum of Mean
Source DF Squares Square F Value Pr > F

Model 12 2.59859200 0.21654933 7.06 0.0010

Error 12 0.36799200 0.03066600

Corrected Total 24 2.96658400

R-Square C.V. Root MSE CHOLEST Mean

0.875954 5.134194 0.175117 3.410800

Analysis of Variance Procedure

Dependent Variable: CHOLEST

Source DF Anova SS Mean Square F Value Pr > F

WEIGHT 4 0.87402400 0.21850600 7.13 0.0035


AGE 4 0.16562400 0.04140600 1.35 0.3079
DIETS 4 1.55894400 0.38973600 12.71 0.0003

Latin square design can also be analyzed using MSTATC

Relative Efficiency

The effectiveness of either the row or the column grouping may be estimated by computing the relative efficiency, with the rows or the columns regarded as the blocks of a randomized block design.

For rows:     RE = [MSR + (p−1)MSE] / (p·MSE)

For columns:  RE = [MSC + (p−1)MSE] / (p·MSE)


For this example:

RE(rows) = [0.2185 + 4(0.0307)] / [5(0.0307)] = 2.22

RE(columns) = [0.0414 + 4(0.0307)] / [5(0.0307)] = 1.07

The result shows that the row (weight group) blocking was effective, while the column (age group) blocking contributed little, in agreement with the F-tests above.

Restrictions

From a practical standpoint there are some restrictions on the number of treatments which can be tested in a Latin square:

1. If p is very small (p < 4), the d.f. for error is very small.
2. If p is very large (p > 10), more than 100 (i.e., p²) experimental units are needed.
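For example, a 3 × 3 Latin square leaves only (3 − 1)(3 − 2) = 2 d.f. for error, far too few for a reliable error estimate; the multiple Latin squares described next are one remedy.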

Multiple Latin squares

These designs came about as a solution to some limitations of the Latin square. For example, if p is small, the d.f. for error may be increased by using more than one square.

Multiple Latin squares can be applied in two ways: the squares may be taken independently, or they may be merged to form a rectangle in which each treatment appears once in each column and twice or more in each row, depending on how many squares are joined. This difference also leads to different forms of analysis. The example given below corresponds to the case of independent squares.

Example of ANOVA for independent squares.

Source d.f.
Total np2-1
Squares n-1
Rows in squares n(p-1)
Columns in squares n(p- 1)
Treatment p-1
Error n(p-1)(p-2)+(n-1)(p-1)

Example

A researcher wanted to compare the effects of three feed additives, A, B, C, on the milk production of dairy cattle. The animals had to be individually fed in digestion stalls, of which only three were available. He decided to run a series of Latin squares. Since three additives were to be tested, and since only three digestion stalls were available, he decided to use 3 × 3 Latin squares, with rows representing feed periods and columns representing the cows assigned to the individual digestion stalls. To obtain sufficient degrees of freedom for error, and sufficient replication for reasonable precision, he decided to run four squares. The experimental plan and the milk production (pounds) during the last 5 days of each 7-day period are given below.

Animal
Square Period 11 12 13
11 C 115 A 139 B 127
I 12 A 138 B 209 C 224
13 B 125 C 186 A 172

21 22 23
21 C 176 B 163 A 135
II 22 A 186 C 201 B 175
23 B 146 A 101 C 134

31 32 33
31 C 186 B 194 A 166
III 32 A 130 C 180 B 162
33 B 123 A 97 C 137

41 42 43
41 A 128 C 154 B 150
IV 42 C 137 B 129 A 106
43 B 164 A 138 C 168

ANOVA of milk production data

Source d.f. SS MS F
Total 35 33,624.97
Square 3 1,738.30 579.43 6.25
Period in square 8 19,094.67 2,386.83 25.74
Cow in square 8 5,944.67 743.08 8.01
Treatment 2 5,549.05 2,774.52 29.92**
Error 14 1,298.28 92.72

CV = (√92.72 / 152.53) × 100 = 6.3%

Summary of results

There were highly significant differences in the mean 5-day yields among the three additives. Removal of period differences was most effective, while removal of differences among squares was least effective. It is suggested that a similar design be used in future work of this type.

Note that it is not possible to analyze data from such designs in MSTATC.

SAS procedure for analyzing the above data set is as follows (omitting the data step)

proc glm;
class square period cow feed;
model yield = feed square period(square) cow(square);
run;

Analysis of Variance Procedure in SAS

Dependent Variable: YIELD

Source DF Sum of Squares Mean Square F Value Pr > F

Model 21 32458.916 1545.662 17.27 0.0001

Error 14 1252.7222 89.480

Corrected Total 35 33711.638

R-Square C.V. Root MSE YIELD Mean

0.962 6.190 9.459 152.805

Source DF Anova SS Mean Square F Value Pr > F

FEED 2 5599.055 2799.527 31.29 0.0001


SQUARE 3 1729.416 576.472 6.44 0.0058
period(square)8 19163.555 2395.444 26.77 0.0001
cow(Square) 8 5966.888 745.861 8.34 0.0003

4.3.2 Cross-Over Designs
In dairy husbandry and biological assay a design has been used which closely resembles the Latin square but may have some advantages when the number of treatments is small. The cross-over design is particularly appropriate when the difference between the rows is substantially the same in all replicates. Even if the difference between rows is known to be variable, the cross-over design may be preferable in small experiments where few degrees of freedom are available for error.

In the simplest case, where there are two treatments A and B, the units are first grouped into pairs as if a randomized block design were to be used. Suppose that, from previous knowledge, one unit in each pair is expected to give a higher response than the other, and that the difference in favor of the superior unit is expected to be about the same in all pairs. It will clearly be advisable to ensure that each treatment is applied to the "better" member in half the replicates and to the "poorer" member in the other half. The pairs of replicates, which must be even in number, are divided at random into two equal sets, the first set receiving treatment A on the superior member of each pair, and the second receiving B.

In dairy husbandry the cross-over may be used to compare the effects of two feeding rations on the amount and quality of milk produced by the cow. Since cows vary greatly in their milk production, each ration is tested on every cow by feeding it during either the first or the second half of the period of lactation, so that each cow gives a separate replicate. The milk yield of a cow declines sharply from the first to the second half of its lactation, so that the first half is always "better" in the sense above. Whether a cross-over is superior to a set of Latin squares depends on circumstances. The rate of decline is not constant from cow to cow; it is greater in general for cows of high yielding ability. An alternative is to make each pair a separate 2 × 2 Latin square. That plan is likely to give a smaller error than the cross-over, though sometimes not sufficiently smaller to counterbalance the loss of degrees of freedom.

The design can be used with any number of treatments, subject to the restriction that the number of replicates must be a multiple of the number of treatments. With three treatments, for example, a plan can be drawn up from the 3 cycles ABC, BCA, CAB, where the order of the letters denotes the row to which each treatment is applied. Each cycle is allotted at random to one-third of the replicates. For higher numbers of treatments a design is constructed in the same way from the columns of any Latin square. When the number of treatments exceeds four, however, the degrees of freedom for error are sufficiently large that a set of Latin squares is usually preferable.

Once the cross-over effect is taken care of by such a design, the analysis is just like that of a Latin square.

Columns (replications)
                   1        2       ...    a        Total
First (poorer)     A X11    B X12   ...    B X1a    X1.
Second (better)    B X21    A X22   ...    A X2a    X2.
Totals             X.1      X.2     ...    X.a      X..

Correction factor: C.F. = X..²/(2a)

Total SS: [X²11 + ... + X²2a] − C.F.

Columns SS: (1/2)[X².1 + X².2 + ... + X².a] − C.F.

Rows SS: (X1. − X2.)²/(2a)

Treatments SS: (A − B)²/(2a), where A and B are the two treatment totals.

In general, if there are t treatments and r replicates, the degrees of freedom subdivide as follows:

Columns      (r − 1)
Rows         (t − 1)
Treatments   (t − 1)
Error        (t − 1)(r − 2)
Total        (tr − 1)
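Once the data are arranged as above, the analysis can be run in a package just as for a Latin square. A possible SAS sketch for the two-treatment cross-over (variable names are illustrative: row = first/second half, rep = replicate pair):

proc glm;
   class row rep treat;
   model yield = row rep treat;   /* rows, columns and treatments as in the table above */
run;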

4.3.3 Greco-Latin Square Design

This design allows for one or more additional restrictions on randomization. If it is desired to group the experimental units in three or more ways, use is made of the Greco-Latin square: consider a p × p Latin square, and superimpose on it a second set of treatments denoted by Greek letters. Three-way variability may not be a common problem, but the design is very useful if such a problem arises.

4 × 4 Greco-Latin square

Row   1    2    3    4
1     Aα   Bβ   Cγ   Dδ
2     Bδ   Aγ   Dβ   Cα
3     Cβ   Dα   Aδ   Bγ
4     Dγ   Cδ   Bα   Aβ

The statistical analysis of a Greco-Latin square design is a straightforward extension of that of the Latin square.

Analysis of variance

Model : Xijkl = µ; + αi + βj + γk + δ l + εijkl

Where αi = treatment (Latin) effect


δ l = Greek effect

Using the best estimators for the parameters, transposing 8 to the LHS , squaring both
sides and taking the sum over all the subscripts, be obtained as follows: the ANOVA
equation.

SS Latin = ∑ (latin)2 - CF
N
SS Row = ∑ (Row total) 2 - CF
n
SS Column = ∑ (Column total) 2 - CF
n
SS Greek = ∑ (greek) 2 - CF
n
SST = ∑∑ X2 jk - CF
SSE = SST - SS lattin -SS Row - SS column - SS Greek.
where CF = ∑ X2jk
n2

The ANOVA table

Source   d.f.       SS
Total    n²−1       SST
Row      n−1        SSR
Column   n−1        SSC
Latin    n−1        SSLatin
Greek    n−1        SSGreek
Error    n²−4n+3    SSE
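A Greco-Latin square can be analyzed in SAS in the same way as the Latin square, with one extra classification. A sketch (variable names are illustrative):

proc glm;
   class row column latin greek;
   model y = row column latin greek;
run;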

4.3.4. Youden Square Design


A Youden square design is a design that can be obtained by omission of one row or one
column from a Latin square design, i.e. it is an incomplete Latin square. This design is
particularly important in animal research where one of the rows or the columns are
missing. For example, in a latin square of 4 period by 4 parity and if the fourth period
was not performed for some reason, then it is still possible to analyze such data set using
Yuden square.

The analysis is as follows.

Model: xijk = µ + αi + βj + γk + εijk
∑αi = ∑βj = ∑γk = 0
εijk ~ NID(0, σ²)

The computational formulas:

SST = ∑∑∑x²ijk − x².../N

SSR = ∑x².j./(number of columns) − x².../N

SSC = ∑x²..k/(number of rows) − x².../N

SSt(adjusted) = ∑Q²i / (a(a−1)λ)

where Qi = (a−1)xi.. − ∑h nih x..h, and

nih = 1 if treatment i appears in block h, and 0 otherwise.

Note that an adjustment is required for SSt because not all treatments appear in all rows.
Example

A researcher wanted to compare five feed additives, A, B, C, D, E, on milk production of
dairy cattle. The animals had to be individually fed in digestion stalls, of which only
four were available: S1, S2, S3, S4. Since five additives were to be tested and only four
digestion stalls were available, he decided to run a Youden square. The milk production
(pounds) during the first 5 days (D1-D5) is given below. The design becomes a Youden
rather than a Latin square because the fifth stall (column) is missing.

S1 S2 S3 S4 Total
D1 A:31 B:24 C:20 D: 20 95
D2 B:21 C:27 D:23 E: 25 96
D3 C:22 D:27 E: 25 A:29 103
D4 D:20 E:25 A:33 B:25 103
D5 E:18 A:37 B:24 C:24 103
Total 112 140 125 123 500

Feed A B C D E
Total 130 94 93 90 93

The data are analyzed as follows:

1. SST = (31)² + (24)² + . . . + (24)² − (500)²/20 = 424

2. Days SS = [(95)² + (96)² + . . . + (103)²]/4 − (500)²/20 = 17
3. Stalls SS = [(112)² + (140)² + . . . + (123)²]/5 − (500)²/20 = 79.6
4. To find the Feed SS, first find the Q's

Q1 = (a − 1)X1.. − ∑ n1j X.j.
   = 4(130) − (95 + 103 + 103 + 103)
   = 116
Q2 = 4(94) − (95 + 96 + 103 + 103)
   = −21
Q3 = 4(93) − (95 + 96 + 103 + 103)
   = −25
Q4 = 4(90) − (95 + 96 + 103 + 103)
   = −37
Q5 = 4(93) − (96 + 103 + 103 + 103)
   = −33

SSt (adj) = [(116)² + (−21)² + . . . + (−33)²] / [5(5 − 1)(3)] = 283

where λ = r(k − 1)/(a − 1) = 4(3)/4 = 3

ANOVA table

Source       d.f.   S.S.   MS      F
Feed (adj)   4      283    70.75   12.75**
Days         4      17     4.2
Stall        3      79.6   26.5
Error        8      44.4   5.55
Total        19     424

Presentation of results

Because the blocks are incomplete, i.e. each treatment could not appear in every row
(day), the feed sum of squares was adjusted for days and stalls. We will consider the
concept of incomplete block designs in detail in the next chapter. Also note that we are
interested in testing the effect of feed only; the other two factors were used as local
control.

The SAS procedure for analyzing the above problem is as follows:
data youden;
  do days=1 to 5;            /* rows (days) of the Youden square */
    do stall=1 to 4;         /* columns (stalls) */
      input feed $ yield @@; /* @@ holds the line for repeated reading */
      output;
    end;
  end;
datalines; /* this replaced the 'cards' statement of older versions */
a 31 b 24 c 20 d 20
b 21 c 27 d 23 e 25
c 22 d 27 e 25 a 29
d 20 e 25 a 33 b 25
e 18 a 37 b 24 c 24
;
run;
proc glm;
  class days stall feed;
  model yield = days stall feed; /* feed is adjusted for days and stalls */
run;

SAS output with the above procedure is:

Analysis of Variance Procedure

Dependent Variable: YIELD


Source DF Sum of Squares F Value Pr > F

Model 11 380.10000000 6.30 0.0074

Error 8 43.90000000

Corrected Total 19 424.00000000

R-Square C.V. YIELD Mean

0.896462 9.370165 25.00000000

Source DF Anova SS F Value Pr > F

DAYS 4 17.00000000 0.77 0.5715


STALL 3 79.60000000 4.84 0.0332
FEED (adj) 4 283.50000000 12.92 0.0014

Note that Youden squares cannot be analyzed using MSTATC.

4.4. Mean Comparison Methods

There are two major approaches to mean comparison, multiple range tests and contrast
methods.

4.4.1. Multiple Comparison Tests

A multiple range (comparison) test makes all possible pairwise comparisons of the
treatment means. For example, an experiment with 5 treatments has 5(5 − 1)/2 = 10
pairwise comparisons.

When do we use them?

• Do not use them with structured treatments. That is, multiple comparison tests should
  not be carried out on factors with quantitative levels, for example 0, 50, 100, 150
  levels of fertilizer (use contrasts instead).
• It is desirable to make pre-determined comparisons.
• They are appropriate when the individual comparison, rather than the whole experiment,
  is the conceptual unit of interest.
• Making comparisons suggested by the data, or making all possible comparisons not
  initially planned, inflates the error rates known as comparisonwise, experimentwise or
  familywise error rates.

Note: If one is prepared to test unplanned comparisons with a multiple range test,
he/she has to take the risk of larger error rates and should be cautious about the
results. There are basically three types of error rates.

Comparisonwise error rate (CWER):

It is the probability of rejecting the null hypothesis when it is true in repeated
experiments, and it is given as:

CWER = (number of erroneous tests) / (total number of tests)

Comparisonwise errors can occur in planned experiments as well, but those of non-
planned experiments are much greater.

Experimentwise Type I error rate:

Assume that an experiment is repeated several times. It is the probability of obtaining
at least one significant difference when in fact there are no differences:

Pr(≥ 1 significant | all null hypotheses true) = 1 − (1 − α)^p

where p = number of comparisons being made
      α = significance level

If, for example, we have five treatments in an experiment and are interested in all
pairwise comparisons, the probability of obtaining at least one significant difference is
1 − (1 − 0.05)^10 = 0.40. This means we have a very high probability of obtaining at
least one significant difference when in reality there is none. As the formula shows,
this probability increases as the number of comparisons increases. That is why the use of
multiple range tests is criticized.

NOTE: In an experiment where only a single comparison is made, the two error rates
are the same.

When using the protected LSD (i.e. pairwise tests carried out only after a significant
F test), the unit for judging the error is no longer comparisonwise but experimentwise.

Familywise error rate:

A family is a subset of hypotheses. All possible pairwise comparisons, each treatment
versus a control, a meaningful set of contrasts, or a subset of pairwise comparisons of
interest are each a family.

Familywise error rate = (number of families with at least one error) / (total number of families tested)

There are a number of multiple range tests in practice. Let's now look at them one by
one.

1) Least significant difference (LSD)

LSD = t(α/2, f) √[s²(1/ni + 1/nj)]

where f is the error degrees of freedom and s² the error mean square from the ANOVA.
The procedure is to compare any pairwise mean difference against the calculated LSD
value and conclude that two means differ if their difference is greater than the LSD
value. The advantages of the LSD over other multiple range tests are that:

• it is simple;
• it uses the pooled variance (error mean square) from the ANOVA table.

2) Duncan's multiple range test

Rp = rα(p, f) Sȳ, with Sȳ = √(MSE/n)

where p = number of means in the group being compared (p = 2, 3, ..., t), f = error
degrees of freedom and α = significance level; rα(p, f) is read from Duncan's table.

Steps
∗ Put the treatment averages in ascending order (rank them).
∗ Compute Rp and compare the smallest mean with the largest.
∗ Then compare the next smallest with the largest, and so on and so forth.
∗ Continue this way until all pairs are compared.

For hand calculation of Duncan's multiple range test consult Gomez and Gomez. For
computing purposes, MSTATC and SAS give multiple range tests (as sketched below), but
packages like GENSTAT do not, because their developers want to discourage the use of
multiple range tests.
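As a minimal sketch (assuming a one-factor data set with a class variable treat and a response yield), the MEANS statement of PROC GLM can request the common multiple range tests; in practice only the option for the test actually wanted would be kept:

proc glm;
  class treat;
  model yield = treat;
  means treat / lsd duncan snk tukey scheffe alpha=0.05; /* requests several multiple range tests at once */
run;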

Drawbacks of Duncan's

a) It requires a greater observed difference to detect significance. Duncan's is
   therefore termed conservative: some differences which are not significant with
   Duncan's test could be significant with other tests.

b) It uses a multi-valued critical value, so that the difference between two treatments
   required for significance depends on the number of treatments in the experiment. In
   reality, however, it is hard to believe that the difference between two treatment
   means depends in any way on which other treatments were included in the experiment.

3) Newman-Keuls test

• Compute the critical values Kp = qα(p, f) Sȳ, where qα(p, f) is the studentized range
  percentage point for a group of p means.
• The comparison is done as in Duncan's test.

Drawbacks

• Conservative
• Low power
• Multi-valued critical value

4) Tukey's test

Tα = qα(a, f) Sȳ, where a is the number of treatments.

Drawbacks
• Its strict control of the experimentwise Type I error makes it very conservative and
  rather theoretical.

5) Scheffé's test

S = √[ft Fα(ft, fe)], where ft = treatment d.f., fe = error d.f., and Fα is read from
the F table.

Critical value for a contrast = S × Sq, where Sq = √[s² ∑(c²i/ni)]

Drawbacks

∗ Large critical value, which implies conservativeness
∗ Low power
∗ Designed for testing contrasts rather than simple pairwise comparisons

Example

Suppose there are three researchers, A, B and C, who all want to compare 10 varieties of
wheat in 4 blocks of 10 plots each.

Researcher A:

His plan is to look for the variety that gives the highest yield among the 10 varieties.
Hence he planned to compare all possible pairs of varieties. After harvest he used the 5%
significance level and carried out the LSD test on all 45 pairs.

1) Since his conceptual unit of interest is the individual pair of comparisons, and not
   the entire experiment (the family of all 45 pairs), he has not committed an
   experimentwise error (which would be 1 − (0.95)^45) (Cramer & Walker).

2) But since the LSD is not protected, he expects to make about 0.05 × 45 = 2.25, i.e.
   roughly 2, Type I errors out of the 45 comparisons if all 10 varieties are genetically
   alike.

Researcher B:

He initially planned to compare only three varieties he suspected could give high yields.
But after data collection he changed his mind and wanted to test effects suggested by the
data: he compared the two extreme means at the (supposed) 5% level, in fact after
performing an F test.

1) The (supposedly) 5% level is simply an illusion. He is actually testing at about a 60%
   level; that is, he will obtain significant results 60% of the time (Cochran and Cox,
   1980) when in fact there are no real differences.

2) The comparisonwise error rate is much more than 0.05; it is equal to the
   experimentwise error rate (Steel and Torrie).

Researcher C:

C carried out a different, well-designed experiment. He wanted to compare two types of
fertilizer, two sowing dates and irrigation in a split-plot design. After analyzing the
results he found that the main effects and two-way interactions were significant. He then
decided to run mean separation in MSTATC on the main effects and interactions.

1) Multiple comparison procedures accept only a single MSE, but separating means of
   interactions in a split-plot design involves both the main-plot and sub-plot MSEs.
   Hence this leads to erroneous conclusions.
2) Comparisons between levels of different factors, which are not needed, may be carried
   out.
3) Multiple comparisons are not suitable for factorial (structured) treatments.

Conclusion
• In general the comparisonwise error rate is more sound and useful than the
  experimentwise error rate in pairwise comparisons (Cramer and Walker, 1982). But when
  making unplanned comparisons (of no special prior interest) the conceptual unit is the
  experiment itself, and one incurs the experimentwise error rate.

• It is always advisable to make only pre-determined comparisons; otherwise you are
  bound to get significant results even when there are none.
• Do not make comparisons suggested by the data. If you do, you always get an exaggerated
  significance level (Cochran and Cox).
• Avoid the use of factorial treatment structures with multiple comparisons.
• There is considerable disagreement and confusion among statisticians and researchers
  in the choice of pairwise multiple comparison methods. As we have seen, many of them
  have drawbacks that limit their usage: most are conservative, have low power, or are
  very theoretical. However, for its simplicity and fewer drawbacks, the LSD seems to
  attract the attention of many experimenters.

Dangers of Making Many Comparisons

• It is always advisable to make only pre-determined comparisons, because the
  probability of obtaining at least one significant result when in fact there are no
  differences among treatments is

  Pr(≥ 1 significant | all null hypotheses true) = 1 − (1 − α)^p ≈ pα (for small α)

  where p = number of comparisons being made
        α = level of significance

• Do not make comparisons suggested by the data. For example, if the smallest and largest
  means are compared at a (supposedly) 5% level of significance, the probability of a
  significant result is:

  0.13 when comparing 3 treatments
  0.40 when comparing 6 treatments
  0.90 when comparing 20 treatments

when in fact there are no real differences between the treatment means. Therefore beware
of multiple comparison methods. Multiple range tests have a place only when the
treatments are completely unstructured, and even then one should use them keeping the
above two points in mind.

4.4.2 Orthogonal Contrasts

We have already discussed in detail the shortcomings of multiple range tests. One may
ask: what is the alternative? This takes us to another concept known as orthogonal
contrasts. A contrast, as the name indicates, is any comparison made between treatments.
The word "orthogonal", however, qualifies it further: two contrasts are orthogonal if the
sum of the products of their coefficients is zero. Orthogonality simply means
independence. Consider a simple example where three treatments, A, B, C, are compared.
In principle, since we have 3 treatments, the number of orthogonal contrasts is only 2
(equal to the treatment degrees of freedom). If we compare A with B and A with C, do we
really need to compare B with C? The answer is no, because by comparing A with both B
and C we already have an idea about the comparison between B and C. The comparison
between B and C is therefore redundant (it adds no new information beyond the earlier
two contrasts). Therefore, one must be strategic when searching for orthogonal
contrasts. This is easy in situations where the treatments have structure. For example,
if A is a control and B and C are new technologies, then the contrasts must be designed
so that the treatment questions are answered. Here the obvious approach is a) to compare
the control with the average of the other two and b) to compare the new technologies
between themselves. Any third contrast would depend on these two and can't be included;
a worked check follows.
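For instance, with A as the control, contrast 1 (A vs. the average of B and C) has coefficients (1, −1/2, −1/2) and contrast 2 (B vs. C) has coefficients (0, 1, −1). The sum of the products of the coefficients is (1)(0) + (−1/2)(1) + (−1/2)(−1) = 0, so the two contrasts are orthogonal.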

Consider the response of millet to added nitrogen. The following 4 treatments were used:

1. Control - no nitrogen
2. Boot - 100 kg nitrogen applied at booting
3. Thin - 100 kg nitrogen applied at thinning
4. Split - 50 kg nitrogen applied on each occasion
A SAS program for calculating the contrasts may be given as follows (including the data step). Note that PROC GLM orders the levels of a character class variable alphabetically (boot, control, split, thin), so the contrast coefficients must be written in that order:

options nocenter ps=65 ls=65;

data crd;
input treat $ yield;
datalines;
boot 11.4
split 10
thin 12.1
control 9.9
control 12.3
boot 12.9
split 12.2
thin 13.4
split 11.9
thin 12.9
control 11.4
boot 12.7
split 11.3
boot 12.3
thin 13.8
control 10.2
;
proc glm;
class treat;
model yield = treat;
/* coefficients in alphabetical level order: boot control split thin;     */
/* multiplying a contrast by a constant (to clear fractions) does not     */
/* change its F test                                                      */
contrast 'boot v split'               treat  1  0 -1  0;
contrast 'thin v av of boot and split' treat -1  0 -1  2;
contrast 'control v average of others' treat -1  3 -1 -1;
run;

The SAS output for the above program is as follows :


General Linear Models Procedure

Dependent Variable: YIELD

Source DF Sum of Squares F Value Pr > F

Model 3 10.82687500 4.57 0.0235

Error 12 9.47750000

Corrected Total 15 20.30437500

R-Square C.V. YIELD Mean

0.533229 7.456338 11.91875000

Source DF Type I SS F Value Pr > F

TREAT 3 10.82687500 4.57 0.0235

Contrast DF Contrast SS F Value Pr > F

boot v split 1 1.90125000 2.41 0.1470

thin v av of boot & split 1 3.92041667 4.96 0.0458
control v average of others 1 5.00520833 6.34 0.0270

The ANOVA table shows a significant treatment effect. Among the contrasts, 'control v
average of others' and 'thin v av of boot and split' are significant at the 5% level,
while 'boot v split' is not.

Note that contrasts can also be calculated in MSTATC.

4.5 Analysis of Covariance


Experiments may be planned very well initially, taking all possible local control into
consideration. But unforeseen circumstances may occur after commissioning of the trial,
and some experimental units can be affected negatively or positively. The analysis can
then be modified at the data analysis stage, without altering the initial design, so that
such environmental effects are eliminated from the estimates of the treatment effects, in
the hope of making these estimates more accurate. Sometimes an initial measurement, x,
can be made on each experimental unit, such as the initial age or weight of an animal,
which, in the absence of treatment effects, might be expected to be strongly correlated
with the yield variable, y. The accuracy of the experiment can then be improved by
adjusting the values of the y variable for this initial variable, x, often referred to
as a concomitant variable or covariate. The technique is potentially very useful. It
often happens that some source of variation which cannot be controlled by the design can
be measured by taking additional observations. In biological sciences, problems such as
waterlogging, damage by birds or damage by insects, which often occur after planting,
may be accounted for by recording additional information on the extent of the damage,
which is then used to adjust the mean yield. A key assumption here is that, since the
additional observations are meant to measure environmental effects, they must not be
influenced by the treatments.

The analysis of covariance (ANCOVA) has the following uses

1. To increase precision in randomized experiments.

2. To adjust for sources of bias in observational studies.
3. To throw light on the nature of treatment effects in randomized experiments.
4. To study regression under multiple classification conditions.

Model specification and the analysis method may be given as follows:

Model: Yij = µ + αi + β(Xij − X̄..) + εij

where µ = overall mean
      αi = ith treatment effect
      β = slope on the covariate
(a bar denotes a mean; e.g. X̄.. is the grand mean of the covariate)

To analyze covariance by hand, first obtain the sums of squares and products for the
different effects, then estimate the coefficients. We present the hand calculation for
ease of understanding the concept; otherwise the steps are tedious and not convenient
for hand calculation.

1. Obtain the sums of squares and products (a treatments, n units per treatment):

Total SS for Y:  Tyy = ∑∑(Yij − Ȳ..)²
Total SS for X:  Txx = ∑∑(Xij − X̄..)²
Total SP for XY: Txy = ∑∑(Yij − Ȳ..)(Xij − X̄..)

Between-classes SS for Y:  Byy = n∑(Ȳi. − Ȳ..)²
Between-classes SS for X:  Bxx = n∑(X̄i. − X̄..)²
Between-classes SP for XY: Bxy = n∑(X̄i. − X̄..)(Ȳi. − Ȳ..)

Error SS for Y:  Eyy = Tyy − Byy
Error SS for X:  Exx = Txx − Bxx
Error SP for XY: Exy = Txy − Bxy

2. The least squares estimates of µ, β and αi are Ȳ.., b and α̂i respectively, where

b = Exy/Exx    and    α̂i = Ȳi. − Ȳ.. − b(X̄i. − X̄..)

The error (or residual) SS is:

SSE1 = Eyy − E²xy/Exx, with a(n − 1) − 1 degrees of freedom.

Suppose instead that there is no treatment effect, i.e.

Yij = µ + β(Xij − X̄..) + εij

In this case the least squares estimates of µ and β are Ȳ.. and b = Txy/Txx, and the
residual SS is SSE2 = Tyy − T²xy/Txx, with an − 2 degrees of freedom.

Therefore, to test H0: αi = 0 for all i,

F = [(SSE2 − SSE1)/(a − 1)] / [SSE1/(a(n − 1) − 1)]

To test the significance of β,

F = (E²xy/Exx)/MSE1 ~ F(1, a(n − 1) − 1)

where MSE1 = SSE1/[a(n − 1) − 1].

Note once again that analysis of covariance by hand is tedious and it is always better to
use statistical packages.

Consider a hypothetical experiment with four varieties, each variety replicated three
times. Assume that some of the varieties were affected by rodents in some plots. In
particular, variety number 2, which performed very well in the first two replications,
performed poorly in the last replication. This means that some varieties responded below
their genetic potential due to factors not initially considered. Now, the question is:
what will happen if we proceed with the ANOVA using the observed data? The answer from
the analysis of this data set is clear. The comparison will be in favour of the
varieties not affected by rodents. Thus the results show non-significance among the four
varieties despite a clear mean difference in the first two replications.

This is obviously not fair, and adjustment is required. This is also the difficult part
of the exercise. In animal experiments the use of a covariate is relatively
straightforward: where the purpose is to attain high final weight, the animals are
expected to differ in their initial weight, and an animal with a high initial weight may
attain a high final weight even if the treatment was not very effective. Hence the
initial weight is normally used to adjust for initial differences in weight, and this is
simply called covariance analysis. Here, however, we have to construct another variable
to be used as an adjustment factor. For this particular data set we assumed that a
three-point rating of the condition of the yield is sufficient. The smallest score (1)
was given to the plot damaged most by rodents, a score of 3 to plots not damaged at all,
and a score of 2 to intermediate damage. When this variable was used as a covariate, the
result of the ANOVA changed completely, differences between treatments being significant
at about the 3% probability level. The analysis has a number of benefits. First, the CV
for the analysis without the covariate is much higher than that with the covariate,
possibly because the former is based on unadjusted means and larger variability. Second,
the variety means are adjusted for the damage caused by rodents. Third, the precision of
the ANOVA and of the estimates improves tremendously. Means are adjusted using the
following formula.

Adjusted Ȳi = unadjusted Ȳi − b(X̄i − X̄)

where Ȳi = mean grain yield for the ith variety
      X̄i = mean covariate for the ith variety
      X̄ = grand mean of the covariate

and b is the slope of the regression of yield on the covariate, given by

b = Exy/Exx = 20.79
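As a check, for variety 2 (see the outputs below) the unadjusted mean is 58.667 and its covariate mean is 2.333, while the covariate grand mean is 2.667; hence the adjusted mean is 58.667 − 20.794 × (2.333 − 2.667) = 58.667 + 6.93 = 65.60, in agreement with the adjusted means reported by both MSTATC and SAS.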

The data and the analyses using both MSTATC and SAS are given below for comparison.

Rep   Factor A   Yield   Covariate
1 1 30 3
1 2 78 3
1 3 23 3
1 4 28 2
2 1 45 3
2 2 80 3
2 3 36 3
2 4 65 3
3 1 21 2
3 2 18 1
3 3 34 3
3 4 29 3

Analysis using MSTATC

The following analysis is based on the observed data without use of the covariate. As
discussed earlier, the difference between treatment means is not significant at the 5%
probability level, when in fact the difference between treatment number 2 and number 3
is visually very large.

Data file: COVAR_


Title: covariance analysis
Function: FACTOR
Experiment Model Number 7:
One Factor Randomized Complete Block Design

Variable 3: yield

Grand Mean = 40.583 Grand Sum = 487.000 Total Count = 12

T A B L E O F M E A N S

1 2 3 Total
-------------------------------------------------
1 * 39.750 159.000
2 * 56.500 226.000
3 * 25.500 102.000
-------------------------------------------------
* 1 32.000 96.000
* 2 58.667 176.000
* 3 31.000 93.000
* 4 40.667 122.000
-------------------------------------------------

A N A L Y S I S O F V A R I A N C E T A B L E

K Degrees of Sum of Mean F


Value Source Freedom Squares Square Value Prob
-----------------------------------------------------------------------
1 Replication 2 1926.167 963.083 3.1453 0.1163
2 Factor A 3 1477.583 492.528 1.6085 0.2835
-3 Error 6 1837.167 306.194
-----------------------------------------------------------------------
Total 11 5240.917
-----------------------------------------------------------------------

Coefficient of Variation: 43.12%

s_ȳ for means group 1:  8.7492   Number of Observations: 4

s_ȳ for means group 2: 10.1027   Number of Observations: 3

Let's now use covariance analysis to adjust for the damage and re-test all the effects.
At the beginning of the output, the table of means shows the unadjusted and adjusted
means. The adjusted mean for treatment two is larger than the unadjusted one, while
treatment number 3, which was not affected by the damage, is adjusted downwards, so that
the effect of the rodent damage is removed from the comparison among treatments. This is
an indication of how important covariance analysis is in the test of hypotheses.

Title: covariance analysis

Function: FACTOR (with covariance analysis)

Experiment Model Number 7:


One Factor Randomized Complete Block Design

Variable 3: yield

Grand Mean = 40.583 Grand Sum = 487.000 Total Count = 12

T A B L E O F M E A N S

            Covariate    Unadjusted              Adjusted
            mean (V4)    mean (V3)     Total     mean (V3)
-------------------------------------------------
1 * 2.750 39.750 159.000 38.017
2 * 3.000 56.500 226.000 49.569
3 * 2.250 25.500 102.000 34.164
-------------------------------------------------
* 1 2.667 32.000 96.000 32.000
* 2 2.333 58.667 176.000 65.598
* 3 3.000 31.000 93.000 24.069
* 4 2.667 40.667 122.000 40.667

-------------------------------------------------

The numbers in the "Unadjusted 3" column are the means based on the actual data in the
file. The "Adjusted 3" column contains the means adjusted with the formula
Adj Yj = Unadj Yj − b(Xj − X̄), where Adj Yj is the adjusted variable 3 mean, Unadj Yj is
the unadjusted variable 3 mean, Xj is the variable 4 mean for that treatment combination,
X̄ is the variable 4 grand mean, and b = Exy/Exx is the slope.

b = Exy/Exx = 20.79412

U N A D J U S T E D S U M S O F P R O D U C T S

K Degrees of
Value Source Freedom V4xV4 V4xV3 V3xV3
-------------------------------------------------------------------
1 Replication 2 1.167 46.083 1926.167
2 Factor A 3 0.667 -27.667 1477.583
-3 Error 6 2.833 58.917 1837.167
----------------------------------------------------------------------
Total 11 4.667 77.333 5240.917
----------------------------------------------------------------------

A N A L Y S I S O F C O V A R I A N C E T A B L E

K Degrees of Sum of Mean F


Value Source Freedom Squares Square Value Prob
-----------------------------------------------------------------------
1 Replication 2 395.037 197.518 1.6136 0.2879
2 Factor A 3 2423.686 807.895 6.5999 0.0344
Covariate 1 1225.120 1225.120 10.0084
-3 Error 5 612.047 122.409
-----------------------------------------------------------------------

Coefficient of Variation: 27.26%

K Value Effective Error Mean Square S. E. of Mean Number of Obs.


------- --------------------------- ------------- --------------
1 147.6112 6.0748 4
2 132.0100 6.6335 3

NOTE: Use appropriate effective error mean square for mean


separation in RANGE or CONTRAST when Error df >= 20
and only when treatments have no significant effect on x.
If these conditions are not met, consult a statistician
for appropriate mean separation.

Analysis using SAS software


proc glm;
  class rep factora;
  model yield = covar rep factora; /* covar is not in the CLASS statement, so it enters as a covariate */
  lsmeans factora;                 /* covariate-adjusted (least squares) means */
run;

Note the analysis in SAS carefully. SAS distinguishes the variable "COVAR" as a covariate
(as opposed to a factor) because this term was not declared in the CLASS statement. Thus
it is treated as a continuous variable used for adjustment and assigned 1 df.
Analysis with covariate

General Linear Models Procedure

Dependent Variable: YIELD

Source DF Sum of Squares F Value Pr > F

Model 6 4628.87009804 6.30 0.0309

Error 5 612.04656863

Corrected Total 11 5240.91666667

R-Square C.V. YIELD Mean

0.883218 27.26211 40.58333333

Source DF Type III SS F Value Pr > F

REP 2 395.03676471 1.61 0.2879


FACTORA 3 2423.68557423 6.60 0.0344
COVAR 1 1225.12009804 10.01 0.0250

T for H0: Pr > |T| Std Error of


Parameter Estimate Parameter=0 Estimate

INTERCEPT -21.20343137 B -1.27 0.2609 16.73084870


REP 1 3.85294118 B 0.45 0.6688 8.48560420
2 15.40441176 B 1.67 0.1566 9.24697280
3 0.00000000 B . . .
FACTORA 1 -8.66666667 B -0.96 0.3814 9.03361551
2 24.93137255 B 2.68 0.0437 9.29551367
3 -16.59803922 B -1.79 0.1342 9.29551367
4 0.00000000 B . . .
COVAR 20.79411765 3.16 0.0250 6.57292075

NOTE: The X'X matrix has been found to be singular and a


generalized inverse was used to solve the normal
equations. Estimates followed by the letter 'B' are
biased, and are not unique estimators of the parameters.

FACTORA YIELD Std Err Pr > |T| LSMEAN


LSMEAN LSMEAN H0:LSMEAN=0 Number

1 32.0000000 6.3877308 0.0041 1


2 65.5980392 6.7530341 0.0002 2
3 24.0686275 6.7530341 0.0161 3
4 40.6666667 6.3877308 0.0014 4

Pr > |T| H0: LSMEAN(i)=LSMEAN(j)

i/j 1 2 3 4
1 . 0.0153 0.4325 0.3814
2 0.0153 . 0.0090 0.0437
3 0.4325 0.0090 . 0.1342
4 0.3814 0.0437 0.1342 .

NOTE: To ensure overall protection level, only probabilities


associated with pre-planned comparisons should be
used.

Also note that this time SAS gives additional output. In the parameter estimates section
of the output, an estimate is given for each level of the factors (rep and Factor A) and
a slope for the covariate (COVAR). This is because the GLM procedure fits a linear model
of the regression type. The coefficient given to the covariate (20.794) in SAS is
exactly the same as the slope (b) calculated by MSTATC. The advantage of SAS is that it
also gives the test statistic (t = 3.16 for the covariate, with P[|T| > t]) and the
respective standard error. LSMEANS in SAS represent the means adjusted for the covariate,
which are exactly the same as the adjusted means obtained in MSTATC. In the last few
lines of the output, SAS gives the probability (p-value) for the test that adjusted mean
LSMEANi differs from LSMEANj. For example, the p-value for comparing LSMean1 with
LSMean2 is 0.0153; therefore there is a significant difference between the two. Note
that if we specify the MEANS statement in a SAS program we can request multiple range
tests, but if we specify LSMEANS we obtain probabilities for differences between any two
treatments, not multiple range test results; a sketch follows.
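As a minimal sketch (using the same variable names as in the program above), the pairwise p-values and standard errors for the adjusted means can be requested explicitly:

proc glm;
  class rep factora;
  model yield = covar rep factora;
  lsmeans factora / pdiff stderr; /* p-values for all pairwise differences of the adjusted means, with standard errors */
run;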

4.6 Analysis of multiple experiments.


Multi-location experiments are important in plant breeding and agronomic research. They
are used to assess yield stability and the response of genotypes across environments, to
estimate and predict yield accurately, and to guide the selection of the best performing
genotypes. The objective of this section is not to give a full account of the analysis of
multi-location trials, as this would require a complete manual by itself; for the purpose
of this training manual we give the analysis for some standard approaches currently found
in statistical packages.

The conventional analysis of variance usually begins with a separate analysis for each
environment. Note that for separate analysis the trials must have been designed in such
a way that sufficient information (error df) is available for valid interpretation of
the results. The results of the separate analyses are used for two purposes:

1) for inferences regarding a particular site, and

2) for obtaining the information, such as means and variances, which is needed for
   further analysis.

Once the separate analyses are done, and before combining the data, it is necessary to
check the assumption of variance homogeneity. If the variances of the environments are not

homogeneous, it is not possible to combine the data as they are. A transformation is
required to first fulfil the assumption; the data are then combined and the analysis made
on the transformed scale. A good account of the types of transformation available for
each data (distribution) type is given in Gomez and Gomez, and we direct readers to this
book for further reference.

The following data were taken from a maize breeding program to describe the necessary
steps in a combined analysis. The data contain 18 environments (2 years x 9 locations),
each with 3 replications and 20 entries; an RCBD was used for planting. Hence the data
file contains 1080 observations. The maximum MSE is 82.6, for environment 4, and the
minimum is 31.7, for environment 13; the ratio of the largest MSE to the smallest is
2.58. When Bartlett's test was done, there was evidence of heterogeneity of variance at
about the 5% level. Therefore it was decided that a square-root transformation be made
(after inspection of the data distribution) on yield for all environments and the
resulting data re-analyzed. A combined analysis was done in two ways:

1. using the untransformed data, and

2. using the square-root transformed data.

The results of the ANOVA from the two analyses, however, turned out to be very similar,
probably for two reasons. First, since the MSE ratio is around two, even if Bartlett's
test gave a significant result the heterogeneity is not that serious; hence the result
from the transformed data may not be very different from that of the raw data. Second,
finding a perfect transformation is often difficult (sometimes even impossible), and it
might not be possible to stabilize the variance through transformation as required.

Therefore, unless the level of heterogeneity is serious, it is better to work with the
raw data. When there is serious variance heterogeneity, it is advisable to try different
transformations and select the one that stabilizes the variance best. Under Ethiopian
conditions it is very common to get a variance ratio of more than two, and this does not
always mean there is a problem of heterogeneity. There is considerable evidence that
Ethiopian environments are variable even within a small locality. Since the variance is
related to the mean, the performance of genotypes in different environments may be
subject to different levels of variability owing to the different soil and agro-climatic
factors. Therefore we should be able to tolerate some level of heterogeneity, possibly
considering as heterogeneous only those cases significant at the 1% level with
Bartlett's test.

Tolerance is also important because it is usually difficult to find an appropriate
transformation for moderate heterogeneity. Sometimes, in trying to stabilize the variance
through transformation, we might even end up in a more heterogeneous situation simply
because an appropriate transformation was not found. In practice, therefore, one has to
try a number of transformations, test for heterogeneity each time with the transformed
data, and select the one which stabilizes the variance most (shows the least
heterogeneity); a sketch of this check is given below.
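A minimal sketch of this check (the data set name maize and the variable names env and yield are assumptions; HOVTEST= is available on the MEANS statement of PROC GLM for one-way models in later SAS releases):

data trans;
  set maize;
  sqyield = sqrt(yield); /* candidate square-root transformation */
run;
proc glm data=trans;
  class env;
  model sqyield = env;
  means env / hovtest=bartlett; /* Bartlett's test of variance homogeneity across environments */
run;

Re-running the PROC GLM step with yield in place of sqyield allows the raw and transformed scales to be compared.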

Form of combined analysis

When data are combined over locations and years for analysis, the analysis of variance
structure may take different forms. There are three possibilities:

1. the same location and the same randomization are used each year;
2. the same location is used each year but a different randomization is adopted; and
3. different locations are used each year.
The structure of the ANOVA is slightly different for each of them. The maize data are
used to illustrate the issue.

The difference among the three ANOVAs is mainly in the relationship between replication
on the one hand, and year and location on the other. When the same location and
randomization are used each year, replication is said to be nested under location; but
when the same location is used with a different randomization, replication is nested
under the location-by-year interaction. In MSTATC this can be done easily from the
"FACTOR" menu using sub-menus number 16, 17 and 18. But in packages like SAS and
GENSTAT one has to write the model statement specifically, so that the software
understands and differentiates the nesting condition. In SAS, the following program (old
version of SAS) produces the ANOVA for the same location and randomization:

proc anova; /* or proc glm; */
  class rep loc year treat;
  model yield = loc rep(loc) year loc*year rep*year(loc)
                treat loc*treat year*treat loc*year*treat;
run;

For the other models slight modification is required. PROC MIXED, a more recent
development in SAS, is particularly useful where split-unit designs are used for
multi-location (year) experiments. The simplest example, for a split-plot design with
sub-plot treatment S and main-plot treatment M, may be given as follows:

proc mixed;
  class block m s;
  model y = m|s;          /* fixed effects: M, S and the M*S interaction */
  random block block*m;   /* block and main-plot error */
run;

Here the RANDOM statement declares the error terms to be used for testing the effects.
The output of the MIXED procedure differs considerably from that of the standard
approaches. This is one of the most powerful procedures recently incorporated in SAS;
for more information refer to the help system in SAS or its manual.

The MIXED procedure is used to fit a variety of mixed linear models. It is a
generalization of the standard linear model used in the GLM procedure. The MIXED
procedure handles data that exhibit correlation and non-constant variability; it could
therefore be safe to use the MIXED procedure in situations where we do not know much
about the data. This procedure, unlike the standard ones, models means, variances and
covariances. A mixed model is known for its two components, fixed and random effects;
the name "mixed" reflects the combination of the two in one model. The fixed effects are
those whose effects are to be determined, e.g. variety or breed, while the random
effects represent additional unknown random variability assumed to affect the
variability of the data, e.g. blocks. In the case of a fixed effect, the conclusions
apply only to those factor levels which

were used in the study, while in the random effect case the conclusions apply to the
population from which the factor levels used in the study have been drawn.

The MIXED procedure has several options which can be used for different tests of
hypotheses about a particular data set. Effects that are used to test other effects, or
that need to be estimated separately because of their additional randomness, are
included in the RANDOM statement. Whatever is not indicated in the RANDOM statement must
appear in the MODEL statement, where the fixed effects are declared.

Table 4.6.1: ANOVA for same location and randomization for each year.

Source df SS
L 8 64710.08
R(L) 18 4353.69
Y 1 12753.99
LY 8 25088.16
RY(L) 18 2427.50
T 19 37385.76
LT 152 31127.34
YT 19 5157.52
LYT 152 25028.99
Error 684 33580.46
Total 1079 467707.50

In the case of the same location but a different randomization, replication is nested
under the interaction of location and year [R(LY)], which is basically the sum of
replication nested under location and replication nested under the location-by-year
interaction. The sums of squares for location, year, treatment and the interactions
among these three factors, and hence the tests of hypotheses, remain the same whether
the same location and randomization or the same location and a different randomization
is used. But when the locations differ from year to year, the tests are also slightly
different (Table 4.6.2); for example, the tests for the location-by-year interaction
(LY) and for location-by-treatment (LT) differ. Therefore it is really important to
consider seriously which one of the above forms should be used to analyze
multi-location/year data.

Table 4.6.2: ANOVA for different Locations each year.

Source Df SS
Y 1 12753.99
L(Y) 16 315596.25
R(LY) 36 6781.20
T 19 37685.76
YT 19 5157.5

LT(Y) 304 56152.3
Error 684 33580.46

Menu items in MSTATC are not available for some multiple experiments. For example, a
four-factor experiment (say, variety, management system, weeding date and fertilizer
level) in a split-plot design, where variety and management are on the main plot and
fertilizer and weeding date are on the sub-plot, cannot be found on the MSTATC menu
directly, but can be generated using the last sub-menu, called "other designs". In
MSTATC, design generation works on the basis of "k-values". The k-values are identifiers
of the different effects, and the k-value of an interaction is the sum of the k-values
of its component factors. For example, suppose we have 3 factors (including rep)
designated as A, B, C, and we need to construct the ANOVA. Start with a k-value of 1,
which gives factor A (the replication); typing 2 gives factor B. Now, if we give a
k-value of 3 (= 1 + 2), it will automatically take the next logical component, which is
A*B. But we do not need A*B, because we do not want interactions of rep with the
others; therefore, to avoid A*B we skip 3 and type 4, which gives factor C, and so on
and so forth. A complete ANOVA may therefore be given as:

ANOVA for a 3-factor experiment (including rep).

K    Source    d.f.
1    A (rep)   a − 1
2    B         b − 1
4    C         c − 1
6    BC        (b − 1)(c − 1)
-7   Error     (a − 1)(bc − 1)

Note that to get an error term you use negative numbers. The best strategy to produce an
ANOVA for a design not available in the menu is first to list the sources of variation
in the ANOVA table; then one can easily select the corresponding k-values. Even if you
make a mistake in the selection you can always go back using the upward arrow and enter
the correct k-values.

We will give ANOVA structure for two cases in MSTATC as follows:

1. Four-factor split-plot experiment in RCBD.

Here we assume variety (V1) and management (V2) on the main plot, and fertilizer (V3)
and weeding date (V4) on the sub-plot.

k-value Source of Variation


1 rep
2 V1
4 V2
6 V1*V2
-7 Error (main plot)
8 V3
10 V1*V3
12 V2*V3
14 V1*V2*V3
16 V4
18 V1*V4
20 V2*V4
22 V1*V2*V4
24 V3*V4
26 V1*V3*V4
28 V2*V3*V4
30 V1*V2*V3*V4
-31 error

A SAS program for the above design is as follows:

proc mixed;
  class rep v1 v2 v3 v4;
  model y = v1|v2|v3|v4;              /* fixed: all main effects and interactions of the four factors */
  random rep rep*v1 rep*v2 rep*v1*v2; /* whole-plot (main-plot) error strata */
run;

2. A three-factor split-plot design in RCBD combined over locations and years. Two
factors are assigned to the main plot and one factor to the sub-plot.

k- value Source of Variation


1 Location (L)
3 R(L)
4 Year (Y)
5 LY
7 RY(L)
8 A
9 LA
12 YA
13 LYA
16 B
17 LB
20 YB
21 LYB
24 AB
25 LAB
28 YAB
29 LYAB
-31 Error
32 C
33 LC
36 YC
37 LYC
40 AC
41 LAC
44 YAC
45 LYAC
48 BC
49 LBC
52 YBC
53 LYBC
56 ABC
57 LABC
60 YABC
61 LYABC
-62 error

These designs can also be analyzed in SAS by specifying which term is used to test each
intended hypothesis; modifying the program written for the split-plot design in the
previous section could suffice. Equally, use of PROC MIXED can do the job, as sketched
below.
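A minimal PROC MIXED sketch for this combined analysis (the class-variable names loc, year, rep, a, b, c and the response y are assumptions, and treating environments and their interactions with treatments as fixed is only one possible choice):

proc mixed;
  class loc year rep a b c;
  model y = loc|year|a|b|c;               /* environments, treatments and their interactions as fixed */
  random rep(loc year) a*b*rep(loc year); /* replications within environments, and the main-plot error */
run;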

4.6.1. Yield stability
One of the objectives of multi-location trials is to investigate the overall performance
of varieties across locations. We normally expect some genotypes to show a relatively
strong dependence on a combination of the important site characteristics, while others
show a weaker dependence; the former are known as unstable and the latter as stable
genotypes. To identify these genotypes, a regression of genotype yield on site
characteristics is therefore important. However, the independent environmental
characteristics which would distinguish a stable genotype from an unstable one are
usually unknown or difficult to obtain. As an alternative to the unknown environmental
index, the mean yield of all genotypes at each site is used as a measure of site
performance, and genotype yield is regressed on the overall mean yield of all genotypes
at the site.

Returning to the analysis of variance model for the combined data, the observed mean
yield (Yij) of the ith genotype in the jth environment is expressed as:

Yij = µ + Gi + Ej + (GE)ij + eij

Since the problem lies with the interaction component (GE)ij, it is partitioned into a
component due to the linear regression (bi) of the ith genotype on the environmental
mean, and a deviation (dij):

(GE)ij = biEj + dij

Hence

Yij = µ + Gi + Ej + (biEj + dij) + eij

Thus part of a genotype's performance across environments, or genotype stability, is
expressed in terms of three empirical parameters: the mean performance, the slope
(regression coefficient) and the sum of squared deviations from regression. The
conventional interpretation of the regression coefficient is as a measure of genotypic
stability across environments: values of bi close to 1 indicate a relatively stable
genotype which tends to vary less with changing environment. A stable genotype is thus
characterized by b close to 1 and a deviation from regression close to zero.

Several authors have criticized the regression approach for its statistical and
biological limitations. The statistical limitations are:

1. the genotype mean is not independent of the overall mean of the environments;

2. the errors associated with the slopes of the genotypes are not statistically
   independent, since the sum of squares for deviations cannot be subdivided
   orthogonally among the G genotypes;
3. it assumes a linear relationship between the interaction and the environmental means.

The biological limitations are:

1. the relative stability of two genotypes depends not only on the particular sites, but
   also on the set of other genotypes included in the analysis;

2. when only a few very low or very high yielding sites are included in the analysis,
   the fit of the regression may be dominated by the genotypes' performance in those few
   extreme environments, which might lead to misleading conclusions.

Nevertheless, the regression approach of Eberhart and Russell for determining stable
genotypes is still extensively used and reported in journals.

Stability analysis by the Eberhart and Russell regression approach can be done in a
number of packages. MSTATC produces the slopes (bi), the standard error of bi and the
error mean square, but does not produce the deviation from regression as defined in the
literature.

To do stability analysis in MSTATC, first get the mean of each genotype for each
location, because the program does not require the raw data with replications; then
proceed as follows:

1. Select "STABIL" from the main menu.

2. MSTATC then asks some questions and displays the variables to be used in the
   analysis, so that you fill in their respective variable numbers:

Yield ________
Variety ________
Treat ________
Location ________

From the list of variables you can tell MSTATC in which variable number yield is stored
(e.g. 8), etc. By now you may recognize that MSTATC requires both variety and treatment
columns, but we may not have a treatment factor in our trial. Never worry about it: you
may give the same variable number for both variety and treatment. If the trial is done
over locations without repeating over years, then you may supply the variable number of
location; otherwise create another variable, called environment, which is a combination
of location and year. Once this is done, the next step is to compute the slopes by
giving a combination of variety and treatment. The combination must be variety 1 with
treatment 1, etc., since both of them come from the same variable.

To do the same analysis in Agrobase (a breeding and agronomy package developed in
Canada), you first need to prepare the averages (over replications) of yield for each
variety and environment; in this example we will have a 20 x 18 matrix of means. At the
bottom of the data Agrobase requires the following information for each environment:
GRAND MEAN, CHECK MEAN, CV, LSD, MSE, SED, ALPHA, R-SQUARED, REP-MS and REPS. Agrobase
produces the stability ANOVA, the partition of the deviations from linearity of response
by variety, and the slopes with the corresponding tests.

For the example considered from the maize program, the output is given below. The
estimates of the b's from the two packages are exactly the same, but the Agrobase output
is presented here for completeness.

Table 4.6.1.1: Eberhart/Russell stability regression ANOVA

Source Df SS MS F-value Pr>F


Total 1079 142447.944
Varieties 19 12561.636 661.139 11.38 0.000
Env. + (Var. x Env.) 340 129886.308 382.019
Env. (linear) 1 109448.816
Var.x Env. (linear) 19 1853.940 97.576 1.68 0.0381
Pooled deviation 320 18583.552 58.074
Residual 720 13458.487 18.692
Grand mean = 63.511 R- squared = 0.8569 C.V. = 11.79%

Partition of deviation From Linearity of response by variety


Var.# Sum of squares F-Ratio Pr.>F Beta Deviation Var.Name
1 853.2765 2.8530 0.000 1.0372 34.6374 Var –1
2 515.5678 1.7239 0.038 0.8470 13.5306 Var-2
3 1236.8334 4.1355 0.000 0.7073* 58.6097 Var-3
4 696.5973 2.3292 0.002 0.8933 24.8450 Var-4
5 957.4175 3.2012 0.000 0.9489 41.1463 Var-5
6 458.4271 1.5328 0.082 1.0207 9.9594 Var-6
7 578.3608 1.9338 0.015 1.0896 17.4552 Var-7
8 542.6947 1.8146 0.026 0.9755 15.2261 Var-8
9 1136.8739 3.8013 0.000 1.2129* 52.3623 Var-9
10 1926.3175 6.4409 0.000 1.1215 101.7025 Var-10
11 1019.3208 3.4082 0.000 1.0747 45.0152 Var-11
12 1076.6928 3.6000 0.000 0.8939 48.6010 Var-12
13 574.0826 1.9195 0.016 0.8253 17.1878 Var-13
14 508.4451 1.7000 0.042 1.0302 13.0855 Var-14
15 502.5376 1.6803 0.046 1.1223 12.7163 Var-15
16 791.6489 2.6470 0.000 1.2119* 30.7857 Var-16
17 808.3875 2.7029 0.000 1.0313 31.8319 Var-17
18 1143.8512 3.8246 0.000 0.8692 52.7984 Var-18
19 415.1948 1.3883 0.140 1.1307 7.2573 Var-19
20 2841.0244 9.4993 0.000 0.9566 158.8717 Var-20

Note: F-ratios are computed using the residual mean square of 18.6923 with 720 df;
varietal mean squares are computed as (sum of squares)/(number of sites − 2); the
standard error of each beta is 0.1030.

This is basically the beginning of investigating the G x E interaction. There are
several methods developed and available for explaining G x E and estimating adjusted
yields; this section may therefore serve readers as an introduction to G x E interaction
modeling.

Interpretation of results

In the ANOVA of Table 4.6.1.1, the SS due to variety x environment is further

partitioned into i) SS due to variety x environment (linear) and ii) SS due to deviations
from linearity. The first component (i) is the regression SS, which indicates the
existence of a linear relationship between variety performance and environment. The idea
is that if such a relationship exists, the variety response can be predicted.

The hypothesis that there are no genetic differences among varieties in their
regressions on the environmental index is tested by the F-value of 1.68, which is
significant. In the table above for the partition of the deviations from linearity, the
fourth column gives the test of the deviation from regression for each variety;
consequently, only genotypes 6 and 19 have deviations not significantly different from
zero. The desired entries are those with a slope of unity, Sd² near zero and high yield;
genotypes with b close to unity and small Sd² are considered stable. Here most varieties
have slopes not significantly different from unity, but only variety 6 combines this
with a deviation not significantly different from zero, so this genotype is the most
stable; genotypes 8, 7 and 15 are also relatively stable. Genotypes with slopes
significantly different from unity are sensitive to environment: 9 and 16 (b > 1)
perform relatively better in good environments, while 3 (b < 1) is relatively better
adapted to poor environments.

5. Incomplete Block Design
5.1. General Concept
So far we have been concerned with situations where all treatments can be accommodated
in a block, that is, where the number of treatments is equal to the number of
experimental units in a block. The next question is what we should do if we have a large
number of treatments, say 20 or 30, and the available resources may not accommodate such
a large number in one block. This brings in a new concept known as the incomplete block
design.

Suppose we have 9 treatments but only 3 blocks and 3 plots in a block.

Block I Block II Block III


T1 T4 T7
T2 T5 T8
T3 T6 T9

This type of arrangement is incomplete because not all treatments appear in the same
block; each incomplete block contains three treatments.

An incomplete block design may therefore be defined as a design in which all treatments
cannot appear together in a block, since the number of plots per block is less than the
number of treatments. Such circumstances can arise in two ways. One is when we have many
treatments and the plots per block cannot accommodate them all, as a block may lose
homogeneity if extended. Second, the practical application of treatments to blocks may
not match; that is, it may not be possible to get space for all treatments to be
compared within a block.

Let's consider two design structures and discuss the benefits and difficulties
associated with incomplete blocks.

(I)
Block 1:  a b b c c a b c
Block 2:  a a b c c a b c
Block 3:  a a b b c a b c

The design is symmetric, which implies that all pairwise mean differences, A−B, A−C and
B−C, are equally precise. However, this design is basically not an incomplete block
design but one with repeated appearances of a treatment in the same block (an extended
block design); it therefore gives a contrasting idea to that of the incomplete block.
The problem in the incomplete block design is exactly the opposite of this particular
design. It is important to note that although the treatments do not all occur an equal
number of times per block, the design turns out to be in balance because of the
symmetry.

The following design is not symmetric, but it is in total balance, because the variances
of the mean differences are all equal: var(A − B) = var(B − C) = var(A − C) = 2σ²/3.

(II)
Block 1:  a a b c
Block 2:  a a b c
Block 3:  b c

The above design is a combination of incomplete and extended blocks. The first two
blocks are extended because treatment A appears twice in them, while the third block is
incomplete as it could not accommodate treatment A. At first glance it looks as if it
cannot be balanced, particularly since A does not occur in block 3. This shows that the
idea of indirect information in treatment comparisons is so powerful that sometimes a
mathematical approach is required to test whether a particular design is in total
balance. This concept will be elaborated further below; a rough worked check follows.
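As a rough worked check for B − C (a sketch using within-block information only): block 3 estimates B − C directly with variance 2σ², and blocks 1 and 2 each estimate B − C with variance 2σ², so pooling these two gives variance σ². Combining the two independent sources gives variance (1/σ² + 1/(2σ²))⁻¹ = 2σ²/3, the value quoted above; a full least squares analysis shows the same holds for A − B and A − C.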

Resolvable Incomplete Block (IB)

The nice property of a complete block design is that the comparison of treatments within
a block is possible, since all treatments appear in the block; treatments are thus
compared free of block effects. In an incomplete block, however, this is not possible,
because some treatments do not occur together in a block. This takes us to two new
concepts:

• intra-block analysis (comparison of treatments within blocks);
• inter-block analysis (comparison of treatments between blocks).

The second option gives incomplete block designs the power to compare two treatments
which have not occurred together in a block, but which occurred with some of the same
treatments in blocks common to both.

Example: Suppose that we have five treatments to be compared in two blocks of 4 plots
each:

Block 1: T1 T2 T3 T4
Block 2: T2 T3 T4 T5

Although T1 has not occurred with T5 in a common block, they can be compared through T2,
T3 and T4, which occurred with both T1 and T5. Since the problem in comparing T1 and T5
is that they do not occur in the same block, a naive comparison between them also
carries the difference between the two blocks; technically we say the effect T1 − T5 is
confounded with the block effect. Therefore we can only compare T1 and T5 by first
estimating the block effect. The block effect can be roughly estimated by subtracting
the values of T2, T3 and T4 in block 1 from those in block 2; this difference can be
taken as the block difference and used as an adjustment factor for comparing T1 and T5.
Statistical packages in fact apply the least squares method to estimate the block
effects and adjust the comparisons.

The other family of incomplete block designs is the non-resolvable incomplete block (IB)
design. Let's consider the problem with an example: six treatments in two replications,
each consisting of 3 blocks of two plots.

Rep I              Rep II

T1  T3  T5         T1  T3  T5

T2  T4  T6         T2  T4  T6

This design has two replications, but the same randomization was used in both, so the
difficulty of analyzing the data remains the same despite the additional resource use.
Hence it is not possible to compare, say, T1 − T4, since there is no inter- or
intra-block information available. Estimation can be improved if the randomization in
the second replication is changed, possibly with a rule that prohibits any two
treatments from occurring together twice in the experiment; hence T1 and T2 should be
separated in the second replication.

A classical example of an incomplete block design is the balanced incomplete block (BIB)
design. BIB designs are defined in terms of the number of treatments, a, the number of
units per block, k, the number of blocks, b, the treatment replication, r, and the
number of times a pair of treatments occurs together in a block, λ. Balance occurs when
λ is the same for all pairs of treatments, that is, when all pairs of treatments occur
together an equal number of times in a block. This implies that

λ(a − 1) = r(k − 1)

Only combinations of a, k and r for which this gives an integer value of λ can form a
BIB design, and such combinations are uncommon. When the number of treatments in a trial
is large, the target of satisfying the requirements of a BIB design remains important,
though rarely achievable exactly; a worked example follows.
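For example, with a = 9 treatments in blocks of k = 3 units and r = 4 replications (the balanced lattice plan of section 5.2 below), λ = r(k − 1)/(a − 1) = 4(2)/8 = 1, an integer: every pair of treatments can occur together in exactly one block.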

There are several types of incomplete block designs available for use today. This is
also a relatively difficult area of experimental design, as the analysis is more
complicated.

5.2 Lattice (quasi-factorial) designs
These designs were initially introduced by Yates in 1936. They are a class of balanced
(BIB) and partially balanced (PBIB) incomplete block designs.

Three type of lattice designs may be identified:

❶ Square lattice

♦ The number of treatments must be a perfect square.

♦ There are r replications.
♦ If r = S + 1 (S = block size), the design is balanced (a balanced lattice, which is a
  BIB design).
♦ If r < S + 1, it is a partially balanced lattice (PBIB).

❷ Cubic lattice
♦ The number of treatments must be a perfect cube, such that the cube root of the number
  of treatments equals S (the block size).

♦ It is a PBIB design.

❸ Rectangular lattice

♦ The number of treatments is p(p − 1), indexed by ordered pairs (x, y) with x ≠ y.

♦ r = 2 for the simple rectangular lattice.

♦ The first coordinate indexes p blocks, while the second coordinate indexes another p
  blocks, each of size p − 1.

Simple, triple and rectangular lattice designs are effective when plots are LONG and
NARROW.

Suppose we are interested in testing 9 varieties but cannot get a block of size 9: the
experimental land is such that its width is only enough for three plots, and there is an
increasing fertility trend lengthwise. So we use a balanced lattice design with four
replications:

Rep I Rep II Rep III Rep IV

8 7 9 4 7 1 1 5 9 2 4 9
5 6 4 9 3 6 6 7 2 1 6 8
2 1 3 8 5 2 8 4 3 5 3 7

Note that each of the possible pairs of treatments appears together once, and only once,
in one of the incomplete blocks; the aim is that every treatment occurs with every other
treatment somewhere in the experiment. For example, treatment 8 occurs with 7 and 9 in
block 1 of Rep I, with 5 and 2 in block 3 of Rep II, with 4 and 3 in block 3 of Rep III,
and with 1 and 6 in block 2 of Rep IV.

To fulfil the requirements of a balanced lattice design we need a large number of
replications, which often cannot be met in practice. The alternative is to switch to a
partially balanced lattice design. The logic of the partially balanced lattice is to use
fewer replications than required by a balanced lattice. The obvious shortcoming is that
not all possible pairwise occurrences of treatments can be attained, which implies that
the comparisons of some pairs of treatments are more precise than others.

PROBLEMS:
1. Allocation of treatments to blocks
2. They are not available for certain numbers of treatments (due to restrictions)

∗ Square lattices are effective when plots are SQUARE (they control variability in two directions)

❹ Balanced Lattices

The number of treatments must be an exact square, and the size of the block is the square root of this number. The incomplete blocks are combined in groups to form separate replications. The special feature of the balanced lattice, as distinguished from other lattices, is that every pair of treatments occurs together once in the same incomplete block. Consequently, all pairs of treatments are compared with the same degree of precision.

Example

For pigs of a given breed, previous experience indicated that a considerable part of the variance in growth rate between animals can be ascribed to the litter. Hence the experiment was planned so that litter differences would not contribute to the intra-block error. The pigs were divided into sets of 3 litter-mates, and two sets of 3 were assigned to each block; within a block, each set received the three treatments of that block. Thus the experimental unit was composed of 2 pigs (one from each set) fed the same ration. The plan and growth rates are given in Table 5.2 (Cochran & Cox, 1957).

Table 5.2: Gains in weight (pounds/day); each figure is the total for 2 pigs. Treatment numbers are shown in parentheses.

Rep I
Block 1:  (1) 2.20   (2) 0.84   (3) 2.18
Block 2:  (4) 2.05   (5) 0.85   (6) 1.86
Block 3:  (7) 0.73   (8) 0.60   (9) 1.76

Rep II
Block 4:  (1) 1.19   (4) 1.20   (7) 1.15
Block 5:  (2) 2.26   (5) 1.07   (8) 1.45
Block 6:  (3) 2.12   (6) 2.03   (9) 1.63

Rep III
Block 7:  (1) 1.76   (5) 1.16   (9) 1.11
Block 8:  (2) 1.76   (6) 2.16   (7) 1.80
Block 9:  (3) 1.71   (4) 1.57   (8) 1.13

Rep IV
Block 10: (1) 1.77   (6) 1.57   (8) 1.43
Block 11: (2) 1.50   (4) 1.60   (9) 1.42
Block 12: (3) 2.04   (5) 0.93   (7) 1.78

The steps in the analysis are as follows (the algebraic formulas refer to a k × k lattice in blocks of k units with r = k + 1 replicates):

1) Calculate block totals, replication totals, the grand total G, and the treatment totals T.
2) For each treatment, calculate the total Bt for all blocks in which the treatment
appears. For treatment 4, this is

Bt = 4.76 +3.54 + 4.41 + 4.52 = 17.23

3) Compute the quantities

W = kT − (k + 1)Bt + G

The over all sum should be exactly zero.
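For treatment 4, for instance, W = 3(6.42) − 4(17.23) + 57.42 = +7.76, the value entered in Table 5.2.1 below.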

Table 5.2.1: Treatment totals and adjustment factors

Treatment    T       Bt       W = kT − (k+1)Bt + G    Adj. total T + µW    Mean per unit
1            6.97    18.61    +3.89                   7.21                 1.80
2            7.36    21.24    −5.46                   7.02                 1.76
3            8.05    21.16    −3.07                   7.86                 1.96
4            6.42    17.23    +7.76                   6.91                 1.73
5            4.01    18.37    −4.03                   3.76                 0.94
6            7.62    21.03    −3.84                   7.38                 1.84
7            5.46    18.10    +1.40                   5.55                 1.39
8            5.61    18.05    +2.05                   5.74                 1.44
9            5.92    18.47    +1.30                   6.00                 1.50
Total        57.42   172.26    0.00                  57.43

4. The analysis of variance is now obtained. The total SS and the SS for replications and treatments are found in the usual way. The sum of squares for blocks within replications, adjusted for treatment effects, is

∑W² / [k³(k + 1)] = 1.4206

5. Calculate the adjustment factor

µ = (Eb − Ee) / (k²Eb) = (0.1776 − 0.0773) / (9 × 0.1776) = 0.0628

where Eb and Ee are the adjusted-block and intra-block mean squares, respectively.


N.B.: If Eb is less than Ee, µ is taken as zero and no adjustment is applied.
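Using this µ, the adjusted totals in Table 5.2.1 are obtained as T + µW; for treatment 4, 6.42 + 0.0628(+7.76) = 6.91, giving an adjusted mean per unit of 6.91/4 = 1.73.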

ANOVA table

Source              General d.f.       d.f. (k = 3)    S.S.      M.S.
Replication         k                  3               0.0774
Treatments          k² − 1             8               3.2261
Blocks (adj.)       k² − 1             8               1.4206    0.1776
Intra-block error   (k − 1)(k² − 1)    16              1.2368    0.0773
Total               k³ + k² − 1        35              5.9609

6. For t-tests, calculate the effective error mean square as

Efe = Ee(1 + kµ) = 0.0773(1 + 3 × 0.0628) = 0.0919

Efficiency of the lattice compared to complete blocks = RCBD error MS / effective error MS = 0.1107/0.0919 = 1.20

Presentation of results and further discussions

In this analysis the treatment sum of squares was not adjusted for incomplete blocks; only the blocks within reps were adjusted for treatments, to help obtain the RCBD error equivalent for further comparison. The result, however, shows a significant treatment effect. Once the ANOVA is obtained, pairwise comparisons of treatments can also be done. Here we have two types of standard errors: one for entries in the same block and another for entries in different blocks.

Also note that we have two types of error estimates for incomplete blocks: the intra-block error and the effective error. The former is estimated from within-block entries, while the latter is given above. Some packages use the effective error for tests, while others use the intra-block error.

Another important output in a lattice design is the efficiency of the lattice. Here, the sum of squares for blocks within replications (adjusted for treatments) plus the intra-block error sum of squares constitute the RCBD error sum of squares. For this example, the RCBD error sum of squares (when replications are considered as blocks) is 1.4206 + 1.2368 = 2.6574, with 24 df (8 + 16). The efficiency of the lattice design compared to the RCBD is about 120%; therefore, the use of the lattice was justified. It means that the experimental plots within replications are heterogeneous, and the use of an RCBD (i.e., considering replications as complete blocks, since all entries appear in each replication, and ignoring the incomplete blocks within replications) is not supported.

The adjustment factor µ calculated in a lattice is normally an indicator of the level of balance in the design. For example, if µ approaches zero, then no adjustment is required. The adjustment factor is also useful for adjusting treatment means, since the raw entry means are affected by differences between the incomplete blocks.

The SAS program for computing the above problem is

proc glm;
  class rep bl tre;
  model res = rep bl(rep) tre;   /* tre fitted last, so its Type I SS is adjusted for blocks */
run;

The bl(rep) term indicates that block is nested within rep.
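If adjusted entry means are also wanted from the same run, an LSMEANS statement can be appended to this GLM program; a minimal sketch using the same variable names:

proc glm;
  class rep bl tre;
  model res = rep bl(rep) tre;
  lsmeans tre / stderr;   /* least-squares entry means, adjusted for the other model terms */
run;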

The ANOVA result for same data set using SAS is:

General Linear Models Procedure

Dependent Variable: RES

Source DF Sum of Squares F Value Pr > F

Model 19 4.72409259 3.22 0.0111


REP 3 0.07738889 0.33 0.8011
BL(REP) 8 1.42060370 2.30 0.0746
TRE (adj) 8 2.50192593 4.05 0.0084

Error 16 1.23680741

Corrected Total 35 5.96090000


R-Square C.V. RES Mean

0.792513 17.43132 1.59500000

In recent versions of SAS a procedure for lattice designs is available, and the following commands may be used as an alternative:

proc lattice data=dataset;
  var yield;
run;

Note that a program written this way makes an assumption about how the data are entered, since we do not specify the block and rep factors in the proc statement; proc lattice expects variables named Group, Block and Treatmnt in the input data set. Only the input statement is given here; the reader can read the data from a file or use cards to enter them directly as usual:

input Group Block Treatmnt yield… ;

The output from proc lattice is shown below:

Source df sum of squares


Group 3 0.077
Block(group) (adj.) 8 1.4206
Treatment (unadj,) 8 3.226
Intra block error 16 1.2368
RCBD error 24 2.6574

Variance of means in same block 0.04593


LSD at 0.01 level 0.6259
LSD at 0.05 level 0.4543
Efficiency of Lattice relative to RCBD 120.55

Adjusted means are also presented.

The following proc glm program gives the same result as proc lattice:

proc glm;
  class rep bl tre;
  model res = rep tre bl(rep);
run;

The difference between this GLM program and the previous one is the position of "tre": here "tre" is placed before "bl(rep)", so its (Type I) sum of squares is not adjusted for blocks.

The reason for using two procedures for the same analysis is to show users the power of the GLM procedure: with GLM we can set up a complete analysis for a lattice. The efficiency can be calculated as indicated in step 6 above.

A comparison of some important statistical packages using their output for the lattice design is necessary at this point, as there are some practical limitations in the use of outputs from such packages.

The example data were taken from Ato Gelana Soboka's MSc thesis work entitled "Heterosis and combining ability between CIMMYT's and Ethiopian MG". We are grateful to him for allowing us to use this data set as an example. The experiment was designed as a simple lattice with 2 replications and square incomplete blocks of 10 × 10 (100 entries). The experiment was done at 3 locations.

The data was analyzed using MSTATC, SAS and Agrobase and the results compared.

Result of Mstatc

To analyze a lattice in MSTATC, select ANOVALAT from the main menu. It is possible to analyze square or rectangular lattices, with the same or different arrangements, for a single environment only. Therefore, first select whether you have a square or rectangular lattice and whether you have different arrangements. The arrangement here refers to the randomization, and MSTATC allows up to 10 arrangements. For example, if you work out the plan for the first rep and repeat the same in the second, then we say the two reps have the same arrangement. MSTATC requires the variables to appear in the data file in the following order:

1 = replication, 2 = block, 3 = entry.

MSTATC produces treatment (adjusted and unadjusted) SS, blocks within reps (adjusted for treatments), three types of error, the efficiency, LSD, CV and adjusted entry means. The F-test for entries is the ratio of the treatment MS to the effective error MS. In the example below the efficiency of the lattice was found to be about 110.44%.

Data file : ALBK1_
Title : Heterosis and combining ability between CIMMYT's and
Ethiopian MG

Function : ANOVALAT
Data case no. 1 to 200

Variable number 4
yield kg/ha (12.5%m)

A N A L Y S I S O F V A R I A N C E T A B L E

For Square Lattice Design

Source of Degrees of Sum of


Variance Freedom Squares Mean Square F-value Prob
----------------------------------------------------------------------
Replications 1 61355672.695 61355672.695
Treatments
-Unadjusted 99 193736099.915 1956930.302 2.04 0.001
-Adjusted 99 191328871.032 1932614.859 2.23 0.000
Block within Rep(adj.)18 30837088.212 1713171.567
Error
-Effective 81 70225170.087 866977.408
-RCB Design 99 94793265.600 957507.733
-Intrablock 81 63956177.388 789582.437
----------------------------------------------------------------------
Total 199 349885038.210

Efficiency of Lattice: Compared with Randomized Complete Blocks


110.44

Grand Sum = 1434559.13 Grand Mean = 7172.7957 Total Count = 200

Coefficient of variation: 12.9812 percent.

Least Significant Differences


P = 0.05 lsd = 1852.6292
P = 0.01 lsd = 2456.1888

Variable number 4
yield kg/ha (12.5%m)

AGROBASE

Analysis of the lattice design is found in the "Analysis of variance" sub-menu. Agrobase analyzes both square and rectangular lattices with fewer restrictions than MSTATC. Lattice analysis can be more difficult in Agrobase, as you have to state the model and other necessary options. Agrobase also expects the variables in the order ENTRY, REP and BLOC (note the spelling of block).

Agrobase also gives treatment (adjusted and unadjusted), adjusted block (within rep) and intra-block error effects. It also gives the CV, R², RE, adjusted entry means and the different standard errors. A good feature of Agrobase is that it calculates the adjustment factor µ, which we saw in the hand-calculation example earlier, so that we can determine the level of balance of the design. Unlike MSTATC, Agrobase automatically determines the arrangement and reports the number of repetitions rather than requesting them from the user.

There is a wide difference in the output between the two packages. To start with, the adjusted treatment SS for MSTATC is 191328871.03 (with 99 df), while this value is 159135138.5 (with 99 df) for Agrobase. The difference between the adjusted and unadjusted treatment SS for MSTATC is very small, while for Agrobase it is large. This is possibly because the two packages base their principle of adjustment on different references. MSTATC seems to have used a rough approximation of the treatment sum of squares from the adjusted treatment totals, as adopted by Cochran and Cox in 1957. Cochran and Cox argued that since there was no perfect adjustment factor, and since unadjusted treatment means contain block effects, comparing them with the intrablock error is not correct. Thus they roughly adjusted the treatment SS from the adjusted totals and used the effective error as the divisor for F-tests. Agrobase, on the other hand, is based on more recent developments in incomplete blocks and uses the extra sum of squares principle (Mead, 1988) to adjust the treatment sum of squares. Thus no effective error SS is calculated there, and all the tests are done on the intrablock error. The relative efficiency of the lattice is likewise calculated as the ratio of the RCBD error to the intrablock error. For this and other reasons, the adjusted entry means are also different for the two packages.

ANALYSIS OF VARIANCE

Dependent variable: YIELD

Source df SS MS F- value Pr > F


Total 199 349884946.0
REP 1 61356610.9 61356610.9 35.81 0.0000
ENTRY[unadj.] 99 193735577.0 1956925.0
Block within REP[adj.] 18 30837756.5 1713208.7
Intrablock error 81 63955001.6 789567.9
ENTRY [adj.] 99 159135138.5 1607425.6 2.04 0.0005
Grand mean = 7172.800 R-squared = 0.8172 C.V. = 12.39%

Relative efficiency to A RCBD = 110.4%


The S.E.D. between entries in the same block = 912.2148
The S.E.D. between entries in different blocks = 935.2560
[Analyzed as a partially balanced lattice design, mu = 0.0359]
[Analyzed with 1 repetitions of the basic design]

LSD for ENTRY [adj.] = 1549.2571 S.E.D.= 931.1091


t (1-sided a=0.050, 81 df) = 1.6639 MSE = 866964.11977

SAS

SAS is a general-purpose program, and the output expected from SAS depends on how you supply the program to the package and which sum of squares you use, since it does not strictly follow a pre-determined output format for lattices, as MSTATC or Agrobase do, when proc GLM is used. When proc lattice is used, however, the output resembles that of the two packages we have seen. In an incomplete block design, the order of entering the factors in the model specification is most important when proc GLM is used. For example, the two orders (1) rep, treat, block(rep) and (2) rep, block(rep), treat do not produce the same result for Type I sums of squares, because in (1) treatment is not adjusted, while in (2) treatment is adjusted for blocks (within reps). If we consider the Type III sums of squares, however, (1) and (2) are the same, because in Type III SS each effect is adjusted for all other effects included in the model. For example, when (2) was fitted, the block(rep) SS (Type I) was 73120835.5, but the Type III SS was 30837048.1, because Type III SS adjusts the effect for all other factors included in the model irrespective of the order of entry.

Regarding the analysis results, SAS and Agrobase agree very closely, as the adjusted treatment and block(rep) values are very close for the two. The error sum of squares that SAS supplies is the intrablock error, and the tests can be done using it.

The next question is how to analyze a lattice for more than one environment (year or location). Obviously, neither MSTATC nor Agrobase can handle this, but SAS can, through similar steps. All 3 locations of this data set were combined using the following procedure in an old SAS version, and the results obtained.

proc glm;
  class rep block loc treat;
  model yield = loc rep(loc) block(rep) treat treat*loc;
run;

The corresponding output for the Type III SS is:

ANOVA
Source             Df    SS             MS           F       Prob
Location           1     108620.7       108620.7     0.011   <.0001
Rep(Loc)           2     75608772.3     37804386     37.03   <.0001
Block(Rep)         18    30006166.3     1667009.2    1.62    0.0561
Treat              99    316291936.8    3194868.0    3.13    <.0001
Treat*Loc          99    174673488.7    1764378.7    1.73    0.0008
Intrablock error   180   183768312      1020935.1

Note: Here rep is assumed to be nested under location and Block is also assumed to be
nested under replication.

5.3. Alpha Designs (Generalized Lattice)
Generalized lattices, or alpha designs (Patterson and Williams, 1976), provide lattice-type designs (with some loss of balance and efficiency) where conventional square or rectangular lattices cannot be generated, for example 29 treatments in two complete replications.

Alpha lattices are:

• A generalization of the lattice designs
• Flexible: this is the main advantage of the α-design, which exists for all combinations of varieties, reps and plots per block
• Able to have incomplete blocks of different sizes
• Able to include checks
• Of different types: α(0,1), α(0,1,2), . . .

The notation α(0,1), α(0,1,2), etc. refers to the number of times treatment pairs occur together in a block. For example, α(0,1) indicates that some treatment pairs occur together in a block exactly once, while the remaining pairs do not occur together in a block at all.

The development of alpha designs removed the restrictions on the number of treatments and its relation to the block size required for lattice designs. To demonstrate the randomization of the design by hand, consider the comparison of p = 28 treatments where the block size is r = 4, i.e., p is a multiple of r. Here p = sr, and we write down the four sets of s = 7 treatments:

Set 1 Set 2 Set 3 Set 4


1 8 15 22
2 9 16 23
3 10 17 24
4 11 18 25
5 12 19 26
6 13 20 27
7 14 21 28

Then as for a lattice design, we construct blocks in successive replicates so that (i) each
block includes one treatment from each set; and (ii) for each replicate the treatments in
each block have not occurred together in a block in any previous replicate.

For the first replicate, we choose the 7 blocks to be: I, the first treatment of each set; II, the second treatment of each set; …; VII, the seventh treatment of each set. For the second
and subsequent replicates we rotate the list of treatments in the several sets so that the
sets occur in different positions relative to each other. For example, the second
replication might be based on the rearrangement:

Set 1 Set 2 Set 3 Set 4
1 9 17 26
2 10 18 27
3 11 19 28
4 12 20 22
5 13 21 23
6 14 15 24
7 8 16 25

Set 1 is unchanged, set 2 is moved up one position, set 3 is moved up two positions and
set 4 is moved up four positions. The blocks for the second replicate are formed from
the first, second,..., seventh positions in the cycled sets.
For the third replicate, set 2 is moved up three places from the original position, set 3 six
places and set 4 five places:

Set 1 Set 2 Set3 Set 4


1 11 21 27
2 12 15 28
3 13 16 22
4 14 17 23
5 8 18 24
6 9 19 25
7 10 20 26

The number of places by which each set is rotated is chosen so that each pair of sets
occurs in different relative positions in each replicate (e.g. for sets 3 and 4 , in the first
replicate treatment 15 occurs with treatment 22, in the second replicate with treatment
24, and in the third with treatment 28). The resulting design is shown in table 5.3.

Table 5.3 Alpha design for comparing 28 treatments in blocks of 4 plots

Replicate 1
Block I 1 8 15 22
Block II 2 9 16 23
Block III 3 10 17 24
Block IV 4 11 18 25
Block V 5 12 19 26
Block VI 6 13 20 27
Block VII 7 14 21 28

Replicate 2
Block VIII 1 9 17 26
Block IX 2 10 18 27
Block X 3 11 19 28
Block XI 4 12 20 22
Block XII 5 13 21 23
Block XIII 6 14 15 24
Block XIV 7 8 16 25
Replicate 3
Block XV 1 11 21 27
Block XVI 2 12 15 28
Block XVII 3 13 16 22
Block XVIII 4 14 17 23
Block XIX 5 8 18 24
Block XX 6 9 19 25
Block XXI 7 10 20 26

Table 5.3.1. Alpha design for comparing 50 treatments in 3 replicates, each replicate
consisting of 10 blocks of 5 plots.

Replicate 1 Replicate 2 Replicate 3


Plots Block
I II XI XII XXI XXII

1 45 22 21 5 25 31
2 34 39 2 11 48 50
3 8 26 19 41 35 3
4 17 31 16 40 9 42
5 42 4 31 3 32 14

III IV XIII XIV XXIII XXIV

6 5 36 7 35 26 17
7 28 24 46 17 49 12
8 25 44 39 38 41 16
9 14 38 25 13 23 19
10 29 3 45 23 33 46

V VI XV XVI XXV XXVI

11 12 27 20 12 13 19
12 43 50 32 26 6 18
13 33 23 18 6 7 8
14 48 15 50 14 15 36
15 21 46 34 1 37 29

VII VIII XVII XVIII XXVII XXVIII

16 10 7 4 36 22 2
17 32 30 8 47 28 47
18 37 41 44 15 40 43
19 11 9 33 48 34 45
20 47 16 37 22 21 24

IX X XIX XX XXIX XXX

21 49 2 49 42 5 39
22 1 18 28 43 38 27
23 19 6 10 27 30 44
24 20 40 24 29 4 11
25 13 35 9 30 20 1

Alpha designs are most effective when the block size, r, is less than the square root of the number of treatments, p, and hence less than s. When r > s, some treatment pairs occur together in a block in more than one replicate, which can lead to unequal levels of precision for some comparisons (i.e., some treatment comparisons are made with greater precision than other treatment comparisons).

The particular set of rotations used for each replicate is important, as some rotation patterns provide more efficient designs than others. A computer program, ALPHA, provides optimal designs for trials with up to 500 treatments (the published account in the Journal of Agricultural Science (1978), Vol. 90, pp. 395-400, provides optimal designs for trials with up to 100 treatments).

If the number of treatments, p, is not an exact multiple of r, a modified method of construction is used. First we construct a design for the smallest value p′ which is greater than the number of treatments and which is an exact multiple of r (s = p′/r). The surplus (dummy) treatments are all included in the last set of s treatments. This ensures that these unwanted treatments never occur together in a block in the constructed design. For the actual design we simply omit the dummy treatments, so that in each replicate there is a mixture of block sizes r and r − 1. For alpha designs the estimation is done using the REML method (which is available in GENSTAT, SAS, BMDP, S-PLUS, etc.). Randomization and design layouts can also be generated easily with statistical packages such as GENSTAT, Alpha, SAS, Agrobase and CycDesign.

REML provides a fully efficient analysis, but two aspects of fitting the models are not explicit in the results. First, we should check that the degrees of freedom on which the variance components are estimated are sufficient for reliable estimates; a minimum of 8 d.f. is a safe limit. Second, the process of selecting a model itself introduces a small additional imprecision into the estimation procedure, which is not reflected in the standard errors of treatment differences from the REML analysis.

An example for an alpha design is given in Table 5.3; the data below are the yields obtained from the 28 treatments. Here we applied the GLM procedure to roughly estimate the ANOVA, so that everybody can use it, since earlier versions of SAS do not support REML estimation.

REPLICATION I
Block 1 30 19 26 18
Block 2 28 26 17 29
Block 3 31 27 30 20
Block 4 15 27 31 26
Block 5 22 16 19 27
Block 6 32 22 19 21
Block 7 22 30 31 20

REPLICATION II

Block 8 31 31 25 30
Block 9 33 22 26 19
Block 10 19 16 28 34
Block 11 23 24 27 14
Block 12 27 24 20 34
Block 13 28 28 19 30
Block 14 26 27 17 19

REPLICATION III

Block 15 31 20 16 21
Block 16 25 20 35 26
Block 17 21 29 25 30
Block 18 24 21 34 25
Block 19 18 25 28 32
Block 20 24 34 25 23
Block 21 25 29 31 37

The SAS program applied to the above data is as follows (omitting the data step):

proc glm ;
class blo repl trea;
model yi=repl blo(repl) trea;
Run;

General Linear Models Procedure

Dependent Variable: YI

Source DF Sum of Squares F Value Pr > F

Model 47 1428.59950249 0.99 0.5136


REPL 2 59.73809524 0.98 0.3864
BLO(REPL) 18 359.05188344 0.65 0.8326
TREA 27 961.28997868 1.16 0.3311
Error 36 1101.21002132

Corrected Total 83 2529.80952381

R-Square C.V. YI Mean

0.564706 22.08093 25.04761905

As in the lattice design, here the block-within-rep and treatment effects are both adjusted for each other, and the error term is the intrablock error. SAS analyzes most incomplete designs with the GLM procedure, as long as the incompleteness is not so serious that terms cannot be estimated. The REML method refines the analysis through variance-component techniques. For this example, none of the terms is significant at the 5% level.
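In later SAS releases the REML analysis itself can be obtained with PROC MIXED, treating replicates and incomplete blocks as random effects. A minimal sketch, assuming the data set is named alpha and uses the same variable names as the GLM program above:

proc mixed data=alpha;      /* 'alpha' is an assumed data set name */
  class repl blo trea;
  model yi = trea;          /* treatments as fixed effects */
  random repl blo(repl);    /* replicates and blocks within replicates as random */
  lsmeans trea;             /* adjusted entry means */
run;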

5.4. Cyclic Designs
They are a family of incomplete block designs based on the cyclic development of one or more initial blocks. They were developed basically for ease of allocation of treatments to blocks in incomplete block designs, where keeping balance is often a difficult exercise. All BIB and PBIB designs obtained through the method of differences are called cyclic designs. If there are v treatments in blocks of size k with treatments replicated r times, then the design C(v, k, r) can be generated.

Example

Let the number of treatments be v = 6, the block size k = 3, and let each treatment be replicated r = 3 times.

❶ If the initial block is (0,1,2), then the successive blocks are (1,2,3), (2,3,4), (3,4,5), (4,5,0) and (5,0,1); treatment pairs occur together 0, 1 or 2 times, i.e. an α(0,1,2) design.

❷ Suppose the initial block is (1,2,4). Then the subsequent blocks will be (2,3,5), (3,4,0), (4,5,1), (5,0,2) and (0,1,3).
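The cyclic development is simply addition of a constant modulo v, so the whole set of blocks can be generated in a few lines; a minimal SAS data-step sketch for example ❶:

data cyclic;
  /* develop the initial block (0,1,2) cyclically, mod v = 6 */
  do shift = 0 to 5;            /* one pass per block */
    t1 = mod(0 + shift, 6);
    t2 = mod(1 + shift, 6);
    t3 = mod(2 + shift, 6);
    output;                     /* one observation per block */
  end;
run;

proc print data=cyclic;
run;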

A problem with the alpha lattice is that extraneous effects, such as harvesting, lodging, slope and wind, can be confounded with rows and affect the yield estimates. To overcome such problems it is better to use a resolvable ROW-COLUMN DESIGN to control variability in two directions.

5.5. Row Column Design


• They are a generalization of the square lattices.
• The Latin square is a special case of the row-column design; but it is restrictive, as the number of treatments must equal the number of replications.
• The row-column design, however, is more flexible (fewer restrictions) and more economical than the Latin square or the square lattice.
• Non-orthogonal row and column designs:
 If the number of treatments is equal to the number of rows/columns, then treatment is orthogonal to rows/columns.
 Columns can also be non-orthogonal to rows, if all rows/columns do not have equal length.

Example

Consider seven treatments in seven blocks of three units each (the case of one-way blocking):

❶ Incomplete Block (IB) Design
Block

1 2 3 4 5 6 7

A A A B B C C

B D F D E D E

C E G F G G F

This is a balanced design because each treatment occurs with every other treatment in a block. For example, if you observe treatment A, you see that A occurs with B and C in block 1, with D and E in block 2, and with F and G in block 3.

❷ Now suppose we consider this to be a row-column design.

Since the number of treatments equals the number of columns, each treatment can appear once in every row, which implies that treatment is orthogonal to rows. But since the number of treatments is greater than the number of rows, treatments are non-orthogonal to columns. With row-column designs, the allocation of treatments to both rows and columns is usually difficult. The following row-column design may be obtained:

Column
1 2 3 4 5 6 7

1 A D F B G C E

Rows 2 B A G F E D C

3 C E A D B G F

Although the design was balanced when subjected to one-way blocking, it is now partially non-orthogonal and unbalanced when rows and columns are considered. The SAS program for computing the hypothetical data is given below (the data are shown in the 'cards' statement):

options nocenter ls=65 ps=65;

data yte;
  do row = 1 to 3;
    do column = 1 to 7;
      input tre $ res @@;   /* tre is a character variable: the $ follows its name */
      output;
    end;
  end;
cards;
a 2.3 d 2.5 f 1.9 b 1.8 g 2.2 c 0.9 e 2.8
b 3.1 a 2.7 g 2.0 f 2.9 e 2.3 d 2.6 c 2.3
c 2.9 e 1.8 a 2.7 d 2.6 b 2.8 g 2.1 f 3.0
;
proc glm;
  class row column tre;
  model res = row column tre;
run;

The SAS output is:

General Linear Models Procedure

Dependent Variable: RES

Source DF Sum of Squares F Value Pr > F

Model 14 3.90285714 1.06 0.5036


ROW 2 1.16666667 2.22 0.1896
COLUMN(adj.) 6 1.66476190 1.06 0.4741
TRE (adj.) 6 1.07142857 0.68 0.6742
Error 6 1.57523810

Corrected Total 20 5.47809524

R-Square C.V. RES Mean

0.712448 21.43449 2.39047619

Such designs cannot be analyzed with MSTATC or even with Agrobase; MSTATC can only manage the lattice design among the incomplete blocks. The SAS result shown above gives adjusted treatment and column effects. In SAS, the three types of sums of squares serve to simplify some of the difficulties related to adjustment. Type I sums of squares in SAS are produced in the order of fitting; that is, any effect following another effect is adjusted for the preceding effects. Type III SS, in contrast, adjusts each effect for all other effects irrespective of the order of fitting. Hence, when there is considerable non-orthogonality, the SS obtained when effects are fitted in different orders differ.
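Both types can be requested explicitly in a single run; a minimal sketch using the row-column model above:

proc glm;
  class row column tre;
  model res = row column tre / ss1 ss3;   /* print both Type I and Type III SS */
run;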

5.6 Unreplicated Design
Objectives in experimental design are:

❶ To estimate treatment contrasts with minimum variance


❷ To obtain a valid estimation of experimental error

But in breeding experiments the problems are:

1. In plant breeding the main objective is to maximize genetic gains

2. Broad assessment of the population to be improved is needed (with few resources, less precision)
3. A large number of lines must be screened

Thus, several materials may enter the initial stages of a breeding program, and there are not sufficient resources to replicate all entries adequately for statistical analysis. On the other hand, it is not possible to get land that is uniform enough to let us assume that all entries are applied on homogeneous plots and can be compared directly. Therefore, it is important to look for ways of either controlling soil heterogeneity or developing an appropriate design.

How to control soil heterogeneity?

There are several ways of overcoming the problem, the most useful ones are:

❶ Grid mass selection:

By using only test genotypes in the field, we compare only genotypes within hypothetical square blocks. This looks fair enough. The problem, however, is that it is not possible to compare all of them at once, and you may have to select several entries.

❷ Use site as a replicate:

One replicate of a design is used at each site. This is quite reasonable, but the problem often is that genotype-by-environment interaction can be a limiting factor for successful comparison.

❸ Use repeated check plots (common).

Since the difficulty is to estimate an error mean square, normally obtained from replicated plots, for use in the F-ratio, the most commonly used alternative among breeders is to replicate only the checks in each block and calculate the error mean square from them. This leads us to the family of augmented designs.

5.6.1 Augmented designs, developed by Federer (1956)
The principle followed in this design is as follows:

 All checks appear in each block


 New selections appear only once in the whole experiment (that means they are not replicated)
 Yields of new selections are adjusted for differences between blocks
 Block size is determined by the number of blocks (b), checks (c) and new selections (v)
 The number of blocks is determined by v and c

Example

24 new selections and 3 check varieties (represented by A, B and C) were considered in an experiment. The layout is shown below. Note that the checks are replicated so that each check appears once in each block, while each new entry appears only once in the whole experiment. Thus the error mean square is calculated from the checks.
Block
1 2 3 4 5 6
A C 21 A B A
13 17 A 2 C 19
8 9 C B 12 C
B A 15 10 16 20
C 11 B C 5 B
18 8 23 3 6 4
7 24 1 22 A 14
B1 B2 B3 B4 B5 B6

• Selection of the yield adjustment factor

Since the new selections do not all occur together in a block, a direct comparison of them is not possible; but since the checks occur together in every block, it is possible to get a reasonable estimate of the block effect to be used in adjusting the new entries. The adjustment index for block j is calculated as

Aj = (1/c)(Bj − M)

where c = number of checks, Bj = total of the check plots in block j, and M = the sum of the check means (the mean of the Bj over blocks), each check mean being c̄i = ∑j cij /6 over the b = 6 blocks.

• The adjusted yield of a new selection in block j is then

Y′ij = Yij − Aj
The analysis of variance is calculated just like an RCBD, with the c checks as treatments and b blocks, for a total of c × b observations.
ANOVA for checks

Source df
Block b-1
Checks c-1
Residuals (b-1)(c-1)

It should be noted that the precision of the residual mean square depends on the number of check plots included per block. Include as many checks as possible, given the resources and the material. In particular, fewer than 3 checks is not acceptable in trials with augmented designs.

Due to the non-balance created by each new entry occurring in only one block, different standard errors are used when comparing different terms. Consequently there are four types of comparisons, and hence four different standard errors (RMS = residual mean square):

I) Between two checks: √(2RMS/b)
II) Between Vi's in the same block: √(2RMS)
III) Between Vi's in different blocks: √(2RMS(1 + 1/c))
IV) Between a Vi and a check mean: √(RMS(1 + 1/b))

(see the example analysis below)

Example

Analysis of a standard augmented design using the Augment software.

Block    Check 1   Check 12   Check 13   Check 24   Total    Mean     Corr. factor
1        137.00    135.00     133.00     134.00     539.0    134.75   −0.44
2        137.00    135.00     137.00     133.00     542.0    135.50   +0.31
3        137.00    135.00     133.00     133.00     538.0    134.50   −0.69
4        138.00    135.00     137.00     134.00     544.0    136.00   +0.81
Total    549.00    540.00     540.00     534.00     2163.0
Mean     137.25    135.00     135.00     133.50     540.8    135.19

The four standard errors are:

1. between two checks = 1.222
2. between Vi's in the same block = 0.407
3. between Vi's in different blocks = 1.222
4. between a Vi and the check mean = 1.2942
Overall Grand total: 13097.00 Overall Grand Mean: 136.43

Variable: HD

ANALYSIS OF VARIANCE:

Source of variation   DF   Sum of squares   SS%     Mean square   F
TOTAL                 15   46.44            100.0
BLOCKS                 3    5.69             12.2   1.90          1.41
CHECKS                 3   28.69             61.8   9.56          7.13
ERROR                  9   12.06             26.0   1.34

The above ANOVA can also be constructed by hand calculation by considering only the checks as treatments in blocks. That means, since we have four check varieties, the data can be analyzed as an RCBD with four treatments.

5.6.2 Modified Augmented Design


Modified augmented designs types 1 and 2 (Lin and Poushinsky, 1983, 1985) provide row and column error control, with systematic and random placing of repeated checks, when a large number of non-replicated entries are tested. These designs are laid out like a split-plot design, with whole plots arranged either in 3 × 3 squares of nine sub-plots with the central sub-plot as a control plot (MAD-I), or in rows and columns as a Latin square (MAD-II). For MAD-II, each whole plot has 3, 5, 7, 9, 11, 13 or 15 rectangular sub-plots, with the central sub-plot being a control plot assigned to one or more check varieties. For MAD-II, therefore, there are r × c whole plots (the intersection of a row and a column is a whole plot), with s sub-plots per whole plot, producing a total of r × c × s plots. If n is the number of whole plots with two sub-control plots, then the number of test plots will be r × c × s − (r × c + 2n).
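In the worked MAD-II example below, for instance, r = c = 4, s = 5 and n = 8, so the test entries occupy 4 × 4 × 5 − (16 + 16) = 48 plots.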

For MAD-I the sub-plots should be square, or almost square, so that the sub-plots are equidistant from the central plot. The 3 × 3 squares are planted such that the central plots are laid out as a Latin square, with orthogonal rows and columns, to give a bi-directional estimate of soil heterogeneity. To estimate the sub-plot error, two whole plots are randomly chosen for each control variety, and each control variety is then randomly assigned within them.

Modified augmented design type 2 is more commonly used than MAD-I, because MAD-II is meant for field layouts with long, rectangular sub-plots, typical of small-grain breeding programs.

Example
∗ How should the check genotypes be distributed in the field?
∗ What should be the frequency of the checks?

Augmented row-columns

GOAL: to obtain the most accurate ESTIMATE of row and column effects, in order to make the most accurate estimate of entry mean yields.

Figure 5.6.2: Design plan for MAD-I with 40 check plots and 120 entry plots (only a portion of the plan is presented here). [Plan grid not reproduced: the two check varieties, A and B, are placed systematically among the numbered entry plots.]

ANOVA for MAD-I

Source                                        df
Total                                         40
Correction for mean                           1
Rows (ignoring checks and columns)            9
Columns (ignoring checks, eliminating rows)   15
Checks (eliminating rows and columns)         1
Error                                         14

The modified augmented design is well supported in the Agrobase package. However, it can also be analyzed with other packages.

To illustrate the use of modified augmented design II (MAD-II), data were taken from the AGROBASE training material and analyzed as follows. The data set is also attached, to help readers understand the design and apply it to their own field of research.

Note how the data is arranged. The central plot is always occupied by the control plot.

Cp = Control plot
Csp = Control sub-plot

The design plan for this MAD-II example is given below:

Fig. 5.6.2.1. Plan for MAD-II with 16 control plots and control sub-plot, and 48 test
entries.

Column
1 2 3 4
1 3 5 B
Row 1 2 A 6 9
C C C C
B B 7 A
A 4 8 10

23 19 15 11
2 24 20 16 12
C C C C
A 21 17 39
B 22 18 40

25 29 33 37
3 26 30 34 38
C C C C
27 31 35 39
28 32 36 40

A 45 A B
4 47 46 43 A
C C C C
48 B 44 41
B A B 42

Note that the intersection of a row and a column is known as a whole plot. There are therefore 16 whole plots, and each whole plot is split into 5 sub-plots. The central sub-plot is always assigned to the control. Here, since sub-control plots are not assigned to control varieties in every whole plot, the test entries occupy 48 of the 80 plots available. Had there been two control sub-plots in every whole plot, we could use only 32 entries for this experiment.

Table 5.6.2 : Data file for modified augmented design.
Entry Name Plot Row Col Cp Csp Yld
1 ENTRY -1 1 1 1 56.91
2 ENTRY -2 2 1 1 57.81
9010 CONTROL 3 1 1 1 58.25
9033 SUB-CONTROL B 4 1 1 2 57.7
9023 SUB-CONTROL A 5 1 1 1 55.84
3 ENTRY - 3 6 1 2 61.08
9022 SUB - CONTROL A 7 1 2 1 59.84
9010 CONTROL 8 1 2 1 61.78
9032 SUB- CONTROL B 9 1 2 2 60.46
4 ENTRY - 4 10 1 2 61.16
5 ENTRY - 5 11 1 3 66.69
6 ENTRY - 6 12 1 3 64.93
9010 CONTROL 13 1 3 1 65.57
7 ENTRY - 7 14 1 3 63.64
8 ENTRY - 8 15 1 3 61.93
9038 SUB - CONTROL B 16 1 4 2 69.15
9 ENTRY - 9 17 1 4 67.75
9010 CONTROL 18 1 4 1 67.7
9028 SUB-CONTROL A 19 1 4 1 69.2
10 ENTRY - 10 20 1 4 69.45
11 ENTRY - 11 21 2 4 72.76
12 ENTRY - 12 22 2 4 72.48
9010 CONTROL 23 2 4 1 72.44
13 ENTRY - 13 24 2 4 70.15
14 ENTRY - 14 25 2 4 72.50
15 ENTRY - 15 26 2 3 65.46
16 ENTRY - 16 27 2 3 64.87
9010 CONTROL 28 2 3 1 67.6
17 ENTRY - 17 29 2 3 67.73
18 ENTRY - 18 30 2 3 71.37
19 ENTRY - 19 31 2 2 63.86
20 ENTRY - 20 32 2 2 68.48
9010 CONTROL 33 2 2 1 66.03
21 ENTRY - 21 34 2 2 62.23
22 ENTRY - 22 35 2 2 67.36
23 ENTRY - 23 36 2 1 59.57
24 ENTRY - 24 37 2 1 57.08
9010 CONTROL 38 2 1 1 58.46
9027 SUB - CONTROL A 39 2 1 1 61.56
9037 SUB - CONTROL B 40 2 1 2 58.89
25 ENTRY - 25 41 3 1 65.29
26 ENTRY - 26 42 3 1 64.24
9010 CONTROL 43 3 1 1 60.5
27 ENTRY - 27 44 3 1 61.21
28 ENTRY - 28 45 3 1 61.78
29 ENTRY - 29 46 3 2 64.29
30 ENTRY - 30 47 3 2 66.35
9010 CONTROL 48 3 2 1 67.61
31 ENTRY - 31 49 3 2 68.33
32 ENTRY - 32 50 3 2 67.27

33 ENTRY - 33 51 3 3 67.46
34 ENTRY - 34 52 3 3 73.28
9010 CONTROL 53 3 3 1 72.77
35 ENTRY - 35 54 3 3 70.69
36 ENTRY - 36 55 3 3 70.64
37 ENTRY - 37 56 3 4 77.13
38 ENTRY - 38 57 3 4 79.02
9010 CONTROL 58 3 4 1 73.38
39 ENTRY - 39 59 3 4 76.2
40 ENTRY - 40 60 3 4 73.01
9031 SUB-CONTROL B 61 4 4 2 79.07
9021 SUB - CONTROL A 62 4 4 1 79.01
9010 CONTROL 63 4 4 1 78.2
41 ENTRY - 41 64 4 4 76.02
42 ENTRY - 42 65 4 4 79.2
9025 SUB - CONTROL A 66 4 3 1 75.18
43 ENTRY - 43 67 4 3 72
9010 CONTROL 68 4 3 1 72.33
44 ENTRY - 44 69 4 3 76.14
9035 SUB - CONTROL B 70 4 3 2 71.32
45 ENTRY - 45 71 4 2 69.04
46 ENTRY - 46 72 4 2 73.07
9010 CONTROL 73 4 2 1 67.7
9036 SUB - CONTROL B 74 4 2 2 73.31
9026 SUB - CONTROL A 75 4 2 1 72.59
9024 SUB - CONTROL A 76 4 1 1 64.61
47 ENTRY - 47 77 4 1 67.59
9010 CONTROL 78 4 1 1 69.75
48 ENTRY - 48 79 4 1 68.02
9034 SUB - CONTROL B 80 4 1 2 64.34

There are 16 control plots in the whole experiment, and the control-plot ANOVA below is based on those plots; the whole-plot error is therefore estimated from the control plots. The sub-plot error is estimated from the residual of an RCBD ANOVA on the whole plots containing sub-plot controls, with those whole plots considered as "blocks" and the control varieties as "treatments". Adjustment for differences between rows and columns is also required. There are two types of yield adjustment, but we will discuss what is called the method-3 adjustment. For this method the adjustment factor is a regression coefficient, b. Once this value is determined, yields can be adjusted as follows:

Y′ijk = Yijk − b(YijC − M)

where Yijk = value of sub-plot k in the ith row and jth column (the ijth whole plot), YijC = value of the control line C in the ijth whole plot, M = mean of the control plots, and b = the regression coefficient.

In the ANOVA below, the unadjusted mean of the control plots is given as 67.504 and the regression coefficient as 1.02.
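For example, entry 1 (row 1, column 1, yield 56.91) lies in a whole plot whose control plot yielded 58.25, so its method-3 adjusted yield would be 56.91 − 1.02(58.25 − 67.504) ≈ 66.35.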

Output from the Agrobase package is given as follows.

ANALYSIS OF VARIANCE
Modified augmented design (2): Control plots ANOVA

Dependent variable: YLD

Source df SS MS F-value Pr> F


Total 15 484.622
ROWS 3 162.559 54.186 11.48 0.0020
Columns 3 279.582 93.194 19.74 0.0003
Residual 9 42.481 4.720
Grand Mean = 67.504 R – squared = 0.9903 C.V. = 3.22%

File: C:\AGRO97\DEMO\DEMAUG22.DBF

Whole – plot error = 4.720


Sub-plot error = 4.061

____________________Control plot means (unadjusted)______________

No. 1: 67.504

Method 3 regression coeff. = 1.020

The same analysis can be obtained using other packages, but more steps may be required to rearrange the data set to suit the software used. The first activity is to collect the control-plot yields into a 4 × 4 row-by-column table (Table 5.6.2.2). The problem then turns into just a row-column design without a treatment factor, with yield as the response variable.
Table 5.6.2.2: Control plot yield
Column
1 2 3 4
Row 1 58.25 61.78 65.57 67.7
2 58.46 66.03 67.6 72.44
3 60.50 67.61 72.77 73.38
4 69.75 67.70 72.33 78.20

To enter the data into the computer, create 3 variables, ROW, COLUMN and YIELD, as follows:

ROW COLUMN YIELD


1 1 58.25
2 1 58.46
. . .
. . .
. . .
. . .
4 4 78.20

To analyze this using MSTATC, select the FACTOR menu, then the "CRD Two Factor" sub-menu. To analyze the same data in SAS, use the following program:

proc anova;
  class row column;
  model yield = row column;
run;

Both MSTATC and the SAS program (above) give the same result as the one obtained
from Agrobase.

6. Treatment structure
6.1 Factorial experiments
So far we have covered four families of experimental designs. These designs are applicable to any type of experiment, regardless of the structure of the treatments. Now we will look at two types of experiments: single-factor and multi-factor experiments.

The examples we have used so far concern single-factor experiments. A single-factor experiment means that the investigator is concerned with testing several levels of one factor while keeping all other factors at a constant level.

For example, in the variety trial we cited earlier, the aim was to compare the yielding ability of different varieties under uniform management and cultural practices. Variety is considered as one factor, while the different varieties are considered as different levels of that factor.

To widen the scope of an experiment, it is common practice these days to include two or more factors so that their effects and cross-effects are tested simultaneously: for example, testing five varieties under 3 nitrogen levels. This is commonly known as a factorial experiment.

The experimental designs discussed so far are useful for studying the effect of a single factor on the response variable. In some experiments it becomes necessary to study the effects of two or more factors. For example, a plant breeder might be interested in measuring the effect of variety and planting date on the yield of wheat. Such experiments, which consist of combinations of two or more factors, are referred to as factorial experiments. In factorial experiments the treatments are compared on the basis of the effects of the individual factors and their interactions.

There are advantages and disadvantages in using factorial experiments. If the purpose of the experiment is to investigate the effect of each factor, rather than the combination of factor levels that produces a maximum response, then separate experiments could be conducted, each dealing with a single factor.

Usually factorial experiments are more efficient than single-factor experiments, but they can lead to an increase in the complexity, size and cost of an experiment, especially if some of the factors included are of little importance. Therefore, it is preferable to use initial single-factor experiments when the number of potentially important factors is large.

Another disadvantage of factorial experiments lies in the less precise estimates that result from problems associated with heterogeneity as the number of treatments increases.

The advantage of the factorial structure over a single-factor experiment is that it enables us to test simultaneously the effects of several factors and their cross-effects, known as interactions. An interaction provides a measure of how the effect of one factor changes over the levels of the other factor.

Consider two varieties (V1 and V2) and two nitrogen levels (N1 and N2). Let a1 be the mean yield of V2 when N1 is applied and b1 when N2 is applied, and let a2 be the mean yield of V1 when N1 is applied and b2 when N2 is applied.

The first graph below shows nitrogen increasing the yield of both varieties, V1 and V2, but at different rates. This means that there is an interaction between variety and nitrogen level; consequently, the two lines would cross at some point if extended. The interaction is measured by the difference between (a1 − a2) and (b1 − b2). As we can see from the graph, the distance between b1 and b2 is much smaller than the distance between a1 and a2; hence there is an interaction between N and V. Thus we say that nitrogen increases the yield of both varieties, but N2 boosts the yield of V1, whereas N2 increases the yield of V2 only slightly over N1, since the difference in yield between the two varieties is small at N2.

[Graph: mean yields of V2 (a1 at N1, b1 at N2) and V1 (a2 at N1, b2 at N2); the two lines converge from N1 to N2.]

In the second graph the two lines are parallel, meaning that nitrogen increases the yield of both varieties at the same rate, which implies no interaction between them. Here, unlike in the graph above, the distance between a1 and a2 is nearly the same as that between b1 and b2. Thus we may say that the difference between the two varieties remains the same irrespective of the fertilizer applied; the variety effect does not depend on the fertilizer applied.

[Graph: the V1 and V2 yield lines from N1 to N2 are parallel.]
The following table shows the treatment structure of a 2-factor experiment on the effect of fiber (factor A) and protein (factor B) in a diet. Fiber has three levels, f0, f1 and f2 (low, medium and high), while protein has two levels, p0 and p1 (low and high). This can be represented by a two-way table where each level of one factor is cross-classified with each level of the other.

                 Factor B (protein)
                 p0       p1
Factor A    f0   f0p0     f0p1
(fiber)     f1   f1p0     f1p1
            f2   f2p0     f2p1

Factorial experiments may be laid out in any convenient design, such as a CRD, RCBD, Latin square, etc.

For example, if a CRD is to be applied in the above study, the individual treatment combinations f0p0, f1p0, f2p0, f0p1, f1p1 and f2p1 are regarded as treatments, and experimental units (animals) are randomly assigned to each of them. The resulting ANOVA provides tests for the significance of the differences among the protein levels, the differences among the fiber levels, and the protein × fiber cross-effect (interaction).

Example: The nutritive value of a certain edible fruit was measured on a total of 24 specimens: six specimens of each of two varieties grown in each of two geographic regions, taken at random. The results were as follows.

Geog. Region (A) Variety (B)


(i) (ii)
I 6.9 13.4
11.8 14.1
6.2 13.5
9.2 13.0
9.2 12.3
6.2 13.7

II 8.9 9.1
9.2 13.1
5.2 13.2
7.7 8.6
7.8 9.8
5.7 9.9

Test for significant differences between regions, between varieties, and for their interaction.

Here no blocking is used; hence the design adopted is a completely randomized design (CRD). The analysis of variance can be calculated in two ways: (i) by hand or (ii) using computer programs, as follows.

96
Manual calculation.

Summary table for the data (totals per region and variety)


Geog. Region Variety (b)
(a) (i) (ii)
I 49.5 80.0 129.5
II 44.5 63.7 108.2
Total 94.0 143.7 237.7

Multipliers for the treatment totals

Factorial   Multipliers for the totals   Factorial effect total                      Contribution to treatment SS
effect      (49.5  44.5  80.0  63.7)
A           −1  +1  −1  +1               −21.3 = (−1×49.5 + 44.5 − 80.0 + 63.7)      (−21.3)²/(4×6) = 18.904
B           −1  −1  +1  +1               +49.7                                       (49.7)²/(4×6) = 102.920
AB          +1  −1  −1  +1               −11.3                                       (−11.3)²/(4×6) = 5.320

TSS = (6.9)² + (11.8)² + . . . + (9.9)² − (237.7)²/24 = 187.57

ESS = TSS − TrSS = 187.57 − 127.144 = 60.43

Note that the multipliers given are the coefficients for the factorial effects. For example, the coefficients for A (−1 +1 −1 +1) take the difference between the two levels of A at each level of variety. Each factorial effect total is thus calculated easily by multiplying the treatment totals by their respective coefficients.

ANOVA table

Source of variation   d.f.   SS       MS       F
Geog. region (A)      1      18.90    18.90    6.254*
Variety (B)           1      102.92   102.92   34.057***
Interaction (AB)      1      5.32     5.32     1.760
Error                 20     60.43    3.022
Total                 23     187.57

Presentation of results:

There is evidence of differences between the regions and between the varieties, and the evidence for differences between varieties is very strong compared to that for regions. Since both factors have only two levels, it is already clear from this test which levels are higher, and no further contrasts are needed. On the other hand, the interaction effect is not significant at the 5% level. Hence varieties are not associated with regions; that is, the regions treat both varieties equally. Therefore, presenting the main-effect means alone is appropriate, since no cross-effect exists.

97
Table of means

           Level 1   Level 2
Region     10.79     9.02
Variety    7.83      11.98

S.e. of any main-factor mean = √(s²/(6 × 2)) = √(3.022/12) = 0.502

S.e. of the difference between two levels of a factor = √(2s²/12) = √(2 × 3.022/12) = 0.710

From the table above, region 1 yields more than region 2, and variety 2 performs better than variety 1.

The same analysis can be done by computer. For example, MSTATC does this analysis using the "CRD with 2 factors" sub-menu under the "Factor" menu. SAS has been used to analyze the same data set as follows:

The SAS program for analyzing the above problem is:

proc anova;
  class geo var;                   /* yield is the response, so it is not a class variable */
  model yield = geo var geo*var;
run;

SAS output for the above problem

SAS 1:17 Monday, January 22,2001

Analysis of Variance Procedure

Dependent Variable: YIELD

Source DF Sum of Squares F Value Pr > F

Model 3 127.14458333 14.03 0.0001


GEO 1 18.90375000 6.26 0.0212
VAR 1 102.92041667 34.07 0.0001
GEO*VAR 1 5.32041667 1.76 0.1995
Error 20 60.42500000

Corrected Total 23 187.56958333

R-Square C.V. YIELD Mean

0.677853 17.54993 9.90416667

Let us now consider another, more complicated, example and discuss presentation styles in detail.

Suppose that a factorial experiment involving 3 fertilizer levels, 4 varieties and 2 management systems in 3 replications was done, and the following ANOVA table was obtained after data analysis. Note that most statistical packages can be used to analyze this experiment (for example, in MSTATC the "RCBD three factor" sub-menu under the "Factor" menu could be used).

ANOVA table

Source           df   MS
Block            2    234
Fertilizer (F)   2    105.3*
Variety (V)      3    79.5*
Management (M)   1    128.5**
F × V            6    466.0***
F × M            2    93.6*
V × M            3    28.3
F × V × M        6    16.4
Error            46   26.2
Total            71

Now the major concern is which tables of means to present, because we have both main-effect and interaction means. As clearly depicted in the ANOVA table above, all main effects and two of the interactions are significant at the 5% level. If we consider F, V and F × V, all of them are significant; but since the differences between the responses of, say, the varieties are strongly affected by the level of fertilizer applied, the main-effect means do not give much information. That means we cannot conclude, for example, that variety A is different from the others, because the significant interaction indicates that variety A might differ from the other varieties at some fertilizer levels and not at others. Therefore, we need to present the interaction means (F × V) only. Regarding F and M, however, the main effects are significant but the interaction is not; therefore it is better to present the main-effect means alone. In situations where the main effects are highly significant (say at the 0.1% level) and the interactions are significant at the 5% level, it is a good idea to present both the main-effect and the interaction means.

6.1.1 The Split-plot or Nested Design


In 2-factor experiments where it is desired to favor one factor over the other, or where the practical situation dictates application of one factor on larger plots, the split-plot design is used. In this design each main plot is split into smaller subplots so as to accommodate a more precise measurement of the subplot factor and its interaction with the main-plot factor. This means that the precision achieved for the subplot factor is at the expense of the precision of the main-plot factor.

These designs are also called nested designs, because of the repeated sampling and sub-sampling that occur in the design. Since more independent measurements are taken for the subplot factor than for the main-plot factor, the subplot factor is measured more precisely than the main-plot factor; hence comparisons between subplot treatments are more precise.

Example

Suppose that we study 3 varieties of cotton bred for resistance to wilt as the subplot factor and 4 dates of sowing as the main-plot factor, and that the experiment is conducted in 4 replications.

The steps in the randomization and layout are given below.

1. Divide the experimental area into 4 blocks (r = 4). Further divide each block into 4 main plots, as shown in Figure 6.1.

[Figure 6.1: Layout of the experimental area into 4 blocks (Rep. I to Rep. IV), each divided into 4 main plots.]

2. Randomly assign the main-plot treatments (factor A, the 4 dates of sowing: D1, D2, D3 and D4) to the main plots within each of the 4 blocks or replications, as shown in Figure 6.2.

Rep I          Rep II         Rep III        Rep IV
D4 D2 D1 D3 | D1 D4 D2 D3 | D3 D2 D4 D1 | D1 D4 D3 D2

Figure 6.2: Random assignment of the 4 dates of sowing to the four main plots in each of the 4 replications.

3. Divide each of the main plots into 3 subplots and randomly assign the cotton varieties (factor B) to the subplots, as shown in Figure 6.3.

Rep. I         Rep. II        Rep. III       Rep. IV
V1 V3 V2 V2 | V3 V1 V2 V1 | V3 V1 V2 V1 | V2 V3 V3 V1
V3 V2 V1 V3 | V1 V2 V1 V3 | V2 V3 V3 V2 | V1 V2 V1 V3
V2 V1 V3 V1 | V2 V3 V3 V2 | V1 V2 V1 V3 | V3 V1 V2 V2
D4 D2 D1 D3 | D1 D4 D2 D3 | D3 D2 D4 D1 | D1 D4 D3 D2

Figure 6.3: Random assignment of the 3 subplot treatments (varieties) to the 3 subplots within each main plot of each replication.

We can see that the number of times the subplot factor is tested is much larger than that of the main-plot factor. Therefore, the subplot factor is estimated more precisely than the main-plot factor, and care should be taken when deciding which factor to assign to the subplots. Also note that the major difference between the RCBD and the split-plot lies in the randomization. The dates of sowing are randomized on the main plots only, while the varieties are randomized on the subplots within each main plot. In the case of an RCBD, it is the factorial combination of date by variety that is randomized within a block: since we have 12 treatment combinations of date by variety, we would normally randomize them onto 12 plots within each block. In the case of the split-plot, since we have two plot sizes and randomization is done in two stages, we have two different error terms. One of the problems in this area is that some researchers may initially design their experiment as a split-plot but analyze it as an RCBD. This automatically ignores the difference in randomization between the two; the subplot tests will be affected, while the main-plot treatment is falsely boosted.

Analysis of variance: In performing the analysis of variance of an experiment where the split-plot design is used, the experimenter must consider a separate analysis for the main-plot factor (factor A) and for the subplot factor (factor B).

Analysis
Model: Xijk = µ + αi + βj(i) + ε(ij)k

where βj(i) = the effect of the jth level of factor B within the ith level of factor A.

Using the least-squares estimators of the parameters, transposing X̄... to the left, squaring, and taking the sum over all the subscripts, we get

∑∑∑(Xijk − X̄...)² = bn∑(X̄i.. − X̄...)² + n∑∑(X̄ij. − X̄i..)² + ∑∑∑(Xijk − X̄ij.)²

The computational formulas are

CF = X².../(abn)

SSA = ∑X²i../(bn) − CF

SSB = ∑X².j./(an) − CF

SSAB = ∑∑X²ij./n − CF − SSA − SSB

SSTot = ∑∑∑X²ijk − CF

SSE = ∑∑∑X²ijk − ∑∑X²ij./n

The ANOVA table

Source        d.f.               SS     MS
Replication   r − 1              RSS
A             a − 1              SSA    SSA/(a − 1)
Error (a)     (r − 1)(a − 1)     SSEa   SSEa/[(r − 1)(a − 1)]
B             b − 1              SSB    SSB/(b − 1)
B × A         (a − 1)(b − 1)     SSAB   SSAB/[(a − 1)(b − 1)]
Error (b)     a(r − 1)(b − 1)    SSE    SSE/[a(r − 1)(b − 1)]
Total         rab − 1            TSS

Many statisticians and senior researchers, however, discourage the use of the split-plot for a number of reasons:

1. In the split-plot the subplot factor is more precise at the expense of the main-plot factor, and this will have a due influence on our conclusions regarding the main-plot factor.
2. The split-plot has about 5 standard errors for comparing the levels of the different factors, and people often make mistakes when supplying the error mean square for performing multiple range tests in MSTATC (see later).
3. The main-plot error is often associated with small degrees of freedom, which may be open to criticism. But to increase the df one may have to select several levels of a factor, which is neither economical nor practical. Therefore, there is no need to complicate things unless the practical application of the treatments requires such differential plot sizes; otherwise the RCBD will do an equally excellent job. Practical necessity occurs in situations like irrigation trials or methods of ploughing, where large plots are practically inevitable, as the treatments cannot be applied to small plots.

Example

In a study carried out by agronomists to determine whether major differences in yield
response to N fertilization exist among widely grown hybrids of maize, the subplot
treatments were 5 hybrids (H1, H2, H3, H4, and H5). The main plot treatments were N rates
of 0, 35, 70, and 105 kg/hectare, broadcast before planting. The study was replicated 2
times. Grain yield data of the 5 maize hybrids grown with 4 levels of nitrogen in a
split-plot experiment with 2 replications are given below.

                        N rate (kg/hectare)
Replication  Hybrid     0     35    70    105
                        Yield (bu/hectare)
I            H1         130   150   170   165
             H2         125   150   160   165
             H3         110   140   155   150
             H4         115   140   160   140
             H5         115   170   160   170

II           H1         135   170   190   185
             H2         150   160   180   200
             H3         135   155   165   175
             H4         130   150   175   170
             H5         145   180   195   200

The degrees of freedom


Replication d.f. = r - 1 = 2 - 1 = 1
Main plot factor (A) d.f. = a - 1 = 4 - 1 = 3
Error (a) d.f. = (r - 1) (a - 1) = (2 - 1) (4 - 1) = 3
Subplot factor (B) d.f. = b -1 = 5 - 1 = 4

A X B d.f. = (a - 1)(b -1) =(4 - 1) (5 -1) = 12


Error (b) d.f. = a(r -1) (b -1) = 4(2 -1) (5 -1) = 16
Total d.f. = rab - 1 = 2 X 4 X 5 - 1 = 39

Analysis of Variance of Hybrid X N response experiment in a split-plot


design

Source               d.f.   SS          MS         F
Total                39     20,244.38
Replication          1      4,100.63    4,100.63
Nitrogen (A)         3      12,051.88   4,017.29   42.76**
Error (a)            3      281.87      93.96
Hybrid (B)           4      2,466.25    616.56     20.56**
Nitrogen X Hybrid
(A X B)              12     863.75      71.98      2.40ns
Error (b)            16     479.97      30.00

The interaction between nitrogen rate and hybrid was not significant. The differences
between the N rates are significant at the 1% level, which is strong evidence that grain
yield was affected by nitrogen. Similarly, the results indicate that grain yield was
significantly affected by hybrid.

SAS procedure for analysis is as follows (omitting data step)

Proc GLM;
Class Replicat Nitrogen Hybrid;
Model Yield = Replicat Nitrogen Replicat*Nitrogen Hybrid Hybrid*Nitrogen;
Test H = Nitrogen E = Replicat*Nitrogen;
Test H = Replicat E = Replicat*Nitrogen;
Means Nitrogen Hybrid;
Run;

The SAS software produces the following output (after rearrangement of terms for
convenience).


Analysis of Variance Procedure Dependent Variable: YIELD

Source DF Sum of Squares Mean Square F Value Pr > F


Model 23 19764.3750000 859.3206 28.64 0.0001
REPLICAT 1 4100.6250000 4100.625 43.64 0.0001
NITROGEN 3 12051.8750000 4017.291 42.75 0.0001
Error (a) 3 281.8750000 93.958 0.0548
HYBRID 4 2466.2500000 616.562 20.55 0.0001
HYBRID*NITROGEN 12 863.7500000 71.979 2.4 0.0520
Error(b) 16 480.0000000 30.00
Corrected Total 39 20244.3750000 519.0865

R-Square C.V. YIELD Mean

0.976290 3.485903 157.12500000

Level of ------------YIELD------------
NITROGEN N Mean SD

0 10 129.000000 13.0809446
35 10 156.500000 13.3437459
70 10 171.000000 13.7032032
105 10 172.000000 19.3218357

Level of ------------YIELD------------
HYBRID N Mean SD

1 8 161.875000 21.8660696
2 8 161.250000 22.1601315
3 8 148.125000 19.9888362
4 8 147.500000 20.3540098
5 8 166.875000 27.5081157

One of the disadvantages of the split-plot design is that there are several standard
errors to use for comparing the levels of the different factors. One mistake that people
often make in applying the LSD or Duncan's multiple range test to a split-plot is to
supply error (b) to MSTATC and let the program compute and test mean differences for the
main plot, the split plot and their interactions. As shown below, the s.e. for comparing
two levels of the main plot factor at the same or different levels of the sub-plot factor
is even more complicated. Therefore, to do a multiple range test in MSTATC, supply the
relevant error mean square: to compare levels of the main plot factor, for example,
supply error (a) with n = r x b; to compare levels of the split-plot factor at one level
of the main plot factor, supply error (b) with n = r, etc.

1. s.e. for comparing two nitrogen means
   = √(2Sa²/(rb)) = √(2Sa²/10)
2. s.e. for comparing two hybrid means
   = √(2Sb²/(ra)) = √(2Sb²/8) = 2.739
3. s.e. for comparing two hybrids at a single nitrogen level
   = √(2Sb²/r) = √(2Sb²/2) = 5.477
4. s.e. for comparing the difference between two hybrids at two levels of nitrogen
   = √(4Sb²/r) = √(4Sb²/2) = 7.746
5. s.e. for comparing two nitrogen levels, either for the same variety or for different
   varieties
   = √{2[(b-1)Sb² + Sa²]/(rb)} = √{2[4Sb² + Sa²]/10} = 6.541

where Sa² and Sb² are the error (a) and error (b) mean squares.
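These five quantities can be computed directly from the two error mean squares; a minimal
SAS data-step sketch, using Sa² = 93.96 (error (a)) and Sb² = 30.00 (error (b)) from the
ANOVA above (the variable names are only illustrative):

data se;
  r = 2; a = 4; b = 5;                     /* replications and factor levels        */
  sa2 = 93.96;                             /* error (a) mean square                 */
  sb2 = 30.00;                             /* error (b) mean square                 */
  se1 = sqrt(2*sa2/(r*b));                 /* 1. two nitrogen (main plot) means     */
  se2 = sqrt(2*sb2/(r*a));                 /* 2. two hybrid (subplot) means         */
  se3 = sqrt(2*sb2/r);                     /* 3. two hybrids at one N level         */
  se4 = sqrt(4*sb2/r);                     /* 4. hybrid difference at two N levels  */
  se5 = sqrt(2*((b-1)*sb2 + sa2)/(r*b));   /* 5. two N levels, same or any hybrid   */
run;

proc print data=se;
run;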

When more than two factors are available, there are different possibilities for designing
a split-plot. When 3 factors are included, two factors may be applied on the main plot
and the third factor on the split-plot, or vice versa. Consider the case of 3 factors,
where two levels of fertilizer, two management techniques and two varieties were compared
in a split-plot design with 3 replications. It was desired to put both fertilizer and
management technique on the main plot and variety on the sub-plot. The experimental plan
for the first rep is given below. Four main plots are required because we have 2 x 2 = 4
fertilizer-management combinations, and under each main plot we require 2 sub-plots to
match the levels of variety.

Fig 6.1.1: Experimental plan for 3-factor split-plot (for first rep) with fertilizer and
management factors on main plot and variety on sub-plot.

          F1M1    F1M2    F2M1    F2M2

          V1      V1      V1      V1
Rep I
          V2      V2      V2      V2

Artificial data were created and analyzed using this design; the result from the MSTATC
output is given below.

A N A L Y S I S   O F   V A R I A N C E   T A B L E

K Value  Source        Degrees of Freedom  Sum of Squares  Mean Square  F Value  Prob
------------------------------------------------------------------
1        Replication   2                   597.970         298.985      1.8876   0.2312
2        Fertl (F)     1                   421.682         421.682      2.6622   0.1539
4        Managm (M)    1                   60.802          60.802       0.3839
6        FM            1                   927.527         927.527      5.8558   0.0519
-7       Error (a)     6                   950.370         158.395
8        Variab (V)    1                   245.760         245.760      1.5597   0.2470
10       FV            1                   1264.402        1264.402     8.0243   0.0221
12       MV            1                   93.615          93.615       0.5941
14       FMV           1                   430.107         430.107      2.7296   0.1371
-15      Error (b)     8                   1260.567        157.571
------------------------------------------------------------------

Coefficient of Variation: 31.34%

s_y for means group 1:  4.4496   Number of Observations: 8
s_y for means group 2:  3.6331   Number of Observations: 12
s_y for means group 4:  3.6331   Number of Observations: 12
s_y for means group 6:  5.1380   Number of Observations: 6
s_y for means group 8:  3.6237   Number of Observations: 12
s_y for means group 10: 5.1246   Number of Observations: 6
s_y for means group 12: 5.1246   Number of Observations: 6
s_y for means group 14: 7.2473   Number of Observations: 3

Note the analysis of variance structure: fertilizer, management and their interaction are
all in the main plot stratum, while all other effects are in the split-plot stratum. When
the standard errors are calculated, different error mean squares are used. For example,
s_y for means groups 2, 4 and 6 used error (a), while s_y for means groups 8, 10, 12 and
14 used error (b). If, for example, a comparison of the levels of variety at a fixed
level of fertilizer and management is required, then both error (a) and error (b) will be
used, as shown in the calculation of the different s.e.'s earlier.

6.1.2 Strip-Plot Design

It is sometimes desirable to get more precise information on the interaction between
factors than on their main effects. The strip-plot is a 2-factor design that allows for
greater precision in the measurement of the interaction effect while sacrificing
precision on the main effects.

In measuring the interaction effect between 2 factors, the experimental area is divided
into three types of plots. We allocate factors A and B to the vertical- and
horizontal-strip plots, respectively, and allow the intersection plot to accommodate the
interaction between these 2 factors. The vertical and the horizontal plots are
perpendicular to each other. However, in the strip-plot design the relationship between
the vertical and the horizontal plot sizes is not as distinct as that of main and
subplots in the split-plot design. Randomization in the strip-plot is also different from
that of the split-plot: here the vertical and horizontal factors are independently
randomized on the rows and columns of each block of the experimental area.

Example

Suppose that we study the properties of 4 wheat cultivars (horizontal factor) and 4
nitrogen rates (vertical factor), and the experiment is a strip-plot design with 4
replications.

The steps to be followed in preparing the layout are as follows:

1. Divide the experimental area into 4 blocks or replications (r = 4). Further divide
   each block into 4 horizontal strips and randomly assign the 4 horizontal-strip
   treatments (a = 4) to the strips.

Rep. I           Rep. II          Rep. III         Rep. IV

c3               c4               c1               c2
c1               c2               c3               c4
c4               c3               c4               c1
c2               c1               c2               c3
N2 N1 N4 N3      N1 N4 N2 N3      N3 N2 N4 N1      N4 N1 N3 N2

Figure 6.4 A strip-plot layout with random allocation of the 4 cultivars (C1, C2, C3 and
C4) to the horizontal strips and the 4 nitrogen rates (N1, N2, N3 and N4) to the vertical
strips in each of the 4 replications.

2. Divide each block or replication into 4 vertical strips and randomly assign the 4
   vertical-strip treatments (b = 4) to the vertical strips as shown in Figure 6.4.

Grain Yield of 4 Wheat Cultivars Grown at 4 Nitrogen Fertilizer Levels

Cultivar   N rate       Yield/acre
                        Rep I   Rep II   Rep III   Rep IV
C1         40           72      74       76        70
           80           76      75       74        78
           120          72      74       73        75
           160          74      76       82        86

C2         40           60      62       64        65
           80           61      63       69        68
           120          70      72       69        70
           160          72      70       82        86

C3         40           75      73       72        80
           80           77      78       77        82
           120          80      82       86        83
           160          84      82       84        89

C4         40           65      68       63        72
           80           68      72       74        76
           120          69      68       70        72
           160          73      75       74        76

Degrees of Freedom

Replication d.f. = r-1 = 4-1 =3


Horizontal factor (A) d.f. = a-1 = 4-1 = 3
Error (a) d.f. = (r-1)(a-1) = 9

Vertical factor (b) d.f. = b-1 = 4-1 = 3


Error (b) d.f. = (r-1) (b-1) = 9
A X B d.f. = (a-1)(b-1) = 9

Error (c) d.f. = (r-1)(a-1)(b-1) = 27


Total d.f. = rab-1 = 63

Analysis of Variance of Wheat Cultivar X Nitrogen Response
Experiment in a Strip-Plot Design

Source                d.f.   SS        MS       F
Total                 63     2865.90
Replication           3      257.52    85.84
Cultivar (A)          3      1282.15   427.38   76.59**
Error (a)             9      50.23     5.58
Nitrogen (B)          3      761.27    253.76   33.04**
Error (b)             9      69.11     7.68
Cultivar X Nitrogen
(A X B)               9      237.98    26.44    3.43**
Error (c)             27     207.64    7.69

• There is a significant effect of cultivar.
• There is a significant effect of nitrogen.
• There is a significant interaction between cultivar and nitrogen, which indicates that
  the N recommendation developed for one cultivar cannot be applied to the other
  cultivars tested in the experiment, i.e., further comparisons among the individual
  N X cultivar combinations are needed.

The same analysis could be done in SAS using the following program.

Proc GLM;
Class Rep Variety Nitrogen;
Model Yield = Rep Variety Rep*Variety Nitrogen Rep*Nitrogen Nitrogen*Variety;
Test H = Variety E = Rep*Variety;
Test H = Nitrogen E = Rep*Nitrogen;
Means Variety Nitrogen Variety*Nitrogen;
Run;

Strip-split plot design

This design is possible when we have a third factor to be applied on the interaction
plot. That means the interaction plot is sub-divided into split-plots to accommodate the
levels of the third factor. Hence, two factors are applied on the vertical and horizontal
strips and the third factor on the split-plot. In our strip-plot example, we applied
cultivar on the horizontal and nitrogen on the vertical strips. Let's now consider
planting date as an additional factor and apply it on the split-plot. The design for one
interaction plot is shown below.

Vertical factor = Nitrogen
Horizontal Sowing Sowing Sowing
(cultivar) date 1 date 2 date 3

The analysis can be done both in MSTATC and SAS. In MSTATC select the "RCBD 3
Factor strip plots" sub-menu.
The ANOVA layout looks as follows.

Source                    df
Rep                       r-1
Horizontal (cultivar)     c-1
Error (a)                 (r-1)(c-1)
Vertical (nitrogen)       n-1
Error (b)                 (r-1)(n-1)
C x N                     (c-1)(n-1)
Error (c)                 (r-1)(n-1)(c-1)
Sub-plot (sowing date)    s-1
C x S                     (c-1)(s-1)
N x S                     (n-1)(s-1)
C x N x S                 (c-1)(n-1)(s-1)
Error (d)                 cn(r-1)(s-1)

The same analysis could be done using the following SAS program.

Proc ANOVA;
Class Rep Cultivar Nitrogen Sowing;
Model Yld = Rep Cultivar Rep*Cultivar
            Nitrogen Rep*Nitrogen
            Cultivar*Nitrogen Rep*Cultivar*Nitrogen
            Sowing Sowing*Cultivar Sowing*Nitrogen
            Sowing*Cultivar*Nitrogen;
Test H = Cultivar E = Rep*Cultivar;
Test H = Nitrogen E = Rep*Nitrogen;
Test H = Cultivar*Nitrogen E = Rep*Cultivar*Nitrogen;
Run;

Note that there are four error terms and, if we leave the testing to SAS, all effects are
tested against the last (residual) error mean square. It is therefore important to use
the TEST statement with the appropriate error term declaration (E=) to obtain the proper
test statistics. The first TEST statement says that the hypothesis to be tested is
declared with H = Cultivar and that the error to be used for this test is E =
Rep*Cultivar. All effects not declared in a TEST statement (here the sowing terms) are
tested against the residual error mean square, error (d).

Remark

The range of experimental designs used in agriculture is very wide, and it is not
possible to present all of them in a training manual of this type. It is always good to
go through books and the literature to develop one's own capability in designing
experiments. In reality, you are not expected to know all kinds of experimental designs;
there are always people who can help you with this problem, so either see senior
researchers or consult biometricians.

6.2 Concepts of Confounding


The block size may often be less than the number of treatment combinations available for
the experiment because:

1) in practical situations it may be physically impossible to obtain experimental units
   that can accommodate all factorial combinations in a block;
2) in the Ethiopian context, if the block size is more than 18 experimental units and the
   length of the experimental units is more than two meters, the block may lose
   homogeneity; it is therefore not permissible to force all treatment combinations into
   one block.

This means that we cannot estimate all effects in the factorial structure; consequently,
we are going to lose information on certain high-order interactions.
Let's consider the following example to facilitate understanding of treatment allocation
to blocks.

An experiment with two wheat varieties, two fertility levels and two weeding frequencies
was conducted to evaluate variety-fertilizer-weed relations. Thus the experiment
constitutes a 2³ factorial structure (a total of 8 treatment combinations). However, only
4 plots are available per block, which is half of what is required for a complete block.
Now the question is: is it possible in the first place to continue experimentation with
such resources only? The answer is yes; it is possible to do the experiment and analyze
the data by using what is called confounding.

Obviously, all treatment combinations cannot go into one block (we can only accommodate 4
treatment combinations per block), so we have to split the treatment combinations into
two sets. The consequence of this decision is that not all effects can be estimated
fully. The next questions are which effect to ignore and how to allocate treatments to
blocks. The important steps are as follows:

Decide on the least useful effect and confound it. Higher order interactions are usually
less important than the main effects, and it is common practice to confound the highest
order interaction. Suppose, therefore, that the ABC interaction is confounded. The ABC
interaction is obtained from

(abc) + (a) + (b) + (c) - (ab) - (ac) - (bc) - (1) ………………………………(1)

where treatment combinations are represented by the letters of the factors present at
their higher level; for example, a₀b₀c₀ is represented as (1), a₁b₁c₁ = abc, a₀b₀c₁ = c,
etc. Treatments are now allocated to blocks such that all combinations having an odd
number of letters in common with abc go to one block and all others to the second block.
For example, the combination a₀b₀c₁ (represented by c) is allocated to block 1, since it
has 1 letter in common with abc, while the combination ab, which has an even number of
letters in common with abc (two: a and b), is allocated to the second block, etc. The
plan is therefore as follows.

Plan

Block 1 2
abc ab
a ac
b bc
c (1)
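The odd/even allocation rule can be checked mechanically; a minimal SAS data-step sketch
(the 0/1 columns simply indicate whether each of a, b and c is at its higher level):

data plan;
  input trt $ a b c;
  /* a + b + c counts the letters the combination shares with abc */
  if mod(a + b + c, 2) = 1 then block = 1;   /* odd  -> block 1 */
  else block = 2;                            /* even -> block 2 */
  datalines;
(1) 0 0 0
a 1 0 0
b 0 1 0
c 0 0 1
ab 1 1 0
ac 1 0 1
bc 0 1 1
abc 1 1 1
;
run;

proc print data=plan;
run;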

Comparing this plan with the interaction contrast given above (equation 1), it is clear
that all positive terms go to one block and all negative terms go to the other. For
example, (ab), which has a negative sign, is allocated to block 2, while (a), with a
positive sign, is allocated to block 1. Therefore the A x B x C interaction is completely
confounded with the block effect: when we try to estimate the interaction effect we are
also estimating the block effect, and it is practically impossible to separate the two.
That is why we say the block and interaction effects are confounded.

Now, the next question is: can we estimate the other effects? The answer is YES.
For example, consider A (effect) = (abc + ab + ac + a) - [(1) + (b) + (c) + (bc)]. In the
first block we have two positive terms (abc and a) and two negative terms (b and c);
similarly, in the other block we have two terms of each sign. Hence the effect of A is

{(abc + a) - (b + c)}  +  {(ab + ac) - (bc + (1))}
     from Block 1             from Block 2

so the estimate of the A effect is free from block differences.


Sums of squares for the different sources of variation are calculated as usual, and the
ANOVA structure is as follows.

ANOVA structure (assuming 6 blocks, i.e., 3 replications of 2 blocks each)

Source                   df
Blocks                   5
A, B, C, AB, AC, BC      6
Error                    12
Total                    23

In the above example we demonstrated the principle of confounding in two blocks only,
where ABC is confounded. With more than one replication, different effects can be
confounded in different replications.

Let us see a practical example of a 2⁴ factorial confounded in blocks of 8 units
(Cochran, 1983).

The factors were:

Dung (D): none, 10 tons per acre
Nitro chalk (N): none, 0.4 cwt N per acre
Superphosphate (P): none, 0.6 cwt P2O5 per acre
Muriate of potash (K): none, 1.0 cwt K2O per acre

Table 6.2.1: Plan and yields (beans, in pounds) of a 2⁴ factorial experiment

          Rep I                          Rep II
Block a   p    k    d    npk   Block a   npk  d    p    dnk
          45   55   53   36              43   42   39   34
          dnk  dnp  dpk  n               n    dnp  k    dpk
          41   48   55   42              47   52   50   44

Block b   dp   nk   dk   pk    Block b   nk   dp   (1)  np
          50   44   43   51              43   52   57   39
          dnpk (1)  dn   np              pk   dk   dnpk dn
          44   58   41   50              56   52   54   42

With 16 treatment combinations in blocks of 8 plots, only one factorial effect need be
confounded in each replication. From the plan it can be seen that DNPK, the highest order
interaction, was confounded in each of the two replicates. The computation proceeds as
follows:

Step 1. Calculate the total for each treatment combination, the block and replicate
totals, and the grand total.

Step 2. Calculate the factorial effect totals. The treatment combinations and their total
yields are placed in systematic order in the first two columns of the table below, and
the remaining columns give the factorial effects. The factorial effect totals are shown
at the bottom of the table. For example, the factorial effect total for D is calculated
as

-115 + 95 - 89 + ……. + 98 = -8

TABLE 6.2 CALCULATION OF FACTORIAL EFFECT TOTALS IN A 2⁴ EXPERIMENT

Treatment    Total                     Factorial effect
combination  yield  D  N   DN  P  DP  NP  DNP  K  DK  NK  DNK  PK  DPK  NPK  DNPK
(1)          115    -  -   +   -  +   +   -    -  +   +   -    +   -    -    +
d            95     +  -   -   -  -   +   +    -  -   +   +    +   +    -    -
n            89     -  +   -   -  +   -   +    -  +   -   +    +   -    +    -
dn           83     +  +   +   -  -   -   -    -  -   -   -    +   +    +    +

p            84     -  -   +   +  -   -   +    -  +   +   -    -   +    +    -
dp           102    +  -   -   +  +   -   -    -  -   +   +    -   -    +    +
np           89     -  +   -   +  -   +   -    -  +   -   +    -   +    -    +
dnp          100    +  +   +   +  +   +   +    -  -   -   -    -   -    -    -

k            105    -  -   +   -  +   +   -    +  -   -   +    -   +    +    -
dk           95     +  -   -   -  -   +   +    +  +   -   -    -   -    +    +
nk           87     -  +   -   -  +   -   +    +  -   +   -    -   +    -    +
dnk          75     +  +   +   -  -   -   -    +  +   +   +    -   -    -    -

pk           107    -  -   +   +  -   -   +    +  -   -   +    +   -    -    +
dpk          99     +  -   -   +  +   -   -    +  +   -   -    +   +    -    -
npk          79     -  +   -   +  -   +   -    +  -   +   -    +   -    +    -
dnpk         98     +  +   +   +  +   +   +    +  +   +   +    +   +    +    +

Total        1502   -8 -102 32 14 88  50  8    -12 -14 -32 18  28  -22  -32  50

The total for DNPK is not required since, owing to the confounding, DNPK does not appear
explicitly in the ANOVA.

Step 3. The contribution of each factorial effect to the treatment SS in the analysis of
variance is now obtained. Since there are 32 plots, the square of each effect total is
divided by 32. The ANOVA is shown below. For example, to find the sum of squares (SS) for
the D effect, square the factorial effect total of D (-8) and divide by 32:

SS for D = (-8)²/32 = 64/32 = 2
Source           d.f.   SS       MS
Replication      1      3.1
Blocks in reps   2      123.2    61.6
D                1      2.0
N                1      325.1
P                1      6.1
K                1      4.5
DN               1      32.0
DP               1      242.0
NP               1      78.1
DK               1      6.1
NK               1      32.0
PK               1      24.5
DNP              1      2.0
DNK              1      10.1
DPK              1      15.1
NPK              1      32.0
Error            14     340.0    24.29
Total            31     1277.9
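This step can also be scripted; a minimal SAS sketch using a few of the effect totals
above (each SS is the squared effect total divided by the 32 plots):

data effect_ss;
  input effect $ total;
  ss = total**2 / 32;   /* 32 plots in the experiment */
  datalines;
D -8
N -102
DN 32
DP 88
;
run;

proc print data=effect_ss;
run;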

Note that DNPK is omitted from ANOVA since its SS is included in the blocks SS.

The following SAS program may be used to analyze this example.

Proc ANOVA;
Class Rep Block D N P K;
Model Yield = Rep Block(Rep) D N P K D*N D*P N*P D*K N*K P*K
              D*N*P D*N*K D*P*K N*P*K;
Means D N P K D*N D*P N*P D*K N*K P*K D*N*P D*N*K D*P*K N*P*K;
Run;

The SAS output for this problem is:


Analysis of Variance Procedure

Dependent Variable: YLD

Source DF Sum of Squares F Value Pr > F

Model 17 938.12500000 2.27 0.0634

Error 14 339.75000000

Corrected Total 31 1277.87500000

R-Square C.V. YLD Mean

0.734129 10.49532 46.93750000

Source DF Anova SS F Value Pr > F

REP 1 3.12500000 0.13 0.7251


BLO(REP) 2 123.25000000 2.54 0.1146
D 1 2.00000000 0.08 0.7783
N 1 325.12500000 13.40 0.0026
P 1 6.12500000 0.25 0.6232
K 1 4.50000000 0.19 0.6733
D*N 1 32.00000000 1.32 0.2701
D*P 1 242.00000000 9.97 0.0070
N*P 1 78.12500000 3.22 0.0944
D*K 1 6.12500000 0.25 0.6232
N*K 1 32.00000000 1.32 0.2701
P*K 1 24.50000000 1.01 0.3321
D*N*P 1 2.00000000 0.08 0.7783
D*N*K 1 10.12500000 0.42 0.5288
D*P*K 1 15.12500000 0.62 0.4430
N*P*K 1 32.00000000 1.32 0.2701

Partial confounding

Partial confounding occurs when different effects are confounded in different
replications, so that in the end it is possible to estimate all effects included in the
analysis. That means, for example, that the ABC effect is confounded in the first
replication but can still be estimated from the last two replications.

Example. Consider a 2³ factorial in 3 reps, each with two blocks of size 4. Different
effects are confounded in different replications.

AC (effect) = {abc + ac + b + (1)} - {ab + bc + a + c}
BC (effect) = {abc + bc + a + (1)} - {ab + ac + b + c}

Replication
           Rep I           Rep II          Rep III
Block      1       2       1       2       1       2
           abc     ab      abc     ab      abc     ab
           a       ac      ac      bc      bc      ac
           b       bc      b       a       a       b
           c       (1)     (1)     c       (1)     c

confounded ABC             AC              BC

♦ Effects ABC, AC and BC are each confounded in one of the replications.
♦ The information on a confounded effect, relative to an unconfounded effect, is 2/3
  (i.e., the effect is unconfounded in 2 reps out of 3).
♦ For unconfounded effects, calculate the ANOVA as usual.
♦ For a confounded effect, use only the reps in which it is unconfounded.

ANOVA
Source   df
Blocks   5
A        1
B        1
C        1
AB       1
AC       1
BC       1
ABC      1
Error    11
Total    23

7. Cluster Analysis for Characterization and
Classification

7.1 Introduction
Cluster analysis is a multivariate statistical technique for grouping observations
(entries or environments) into classes based on characteristics measured prior to the
analysis; for example, classifying entries based on their site means. Elements within the
same group or cluster are relatively homogeneous, while elements from different clusters
are relatively heterogeneous. The technique was initially developed and used in taxonomy
over several decades and was eventually introduced to the breeding and agronomy areas in
the early 1970s.

Example 1: Taxonomy

Objective: to cluster accessions into groups.

Let's assume that we have m accessions and that n traits were measured on each. Further
assume that the experiment was done in an RCBD with 3 replications, so that the whole
data set has 3m observations. For cluster analysis, however, the replications are not
needed, because clustering classifies accessions based on their respective trait means.
Therefore, we take averages over replications and prepare the data with accessions as
cases (rows) and traits as variables (columns).

The data entry format would look like the following:

                          Traits
              1         2        3       4  .  .  .  n
Accessions    (plant    (hair    (seed
              length)   color)   size)
1             .         .        .       .  .  .  .
2             .         .        .       .  .  .  .
3             .         .        .       .  .  .  .
.             .         .        .       .  .  .  .
.             .         .        .       .  .  .  .
m             .         .        .       .  .  .  .

Example 2: Plant breeding

Objective: to cluster varieties/sites into groups.

Entry mean yield at different sites:

Variety    1       2       3      .  .  .  n (sites)
1          2030    1645    1216
2          1923    2020    1934
3          .       .       .
.          .       .       .
.          .       .       .
m          1833    2305    1563

Cluster analysis is important for answering specific questions such as:

1) How can a region be divided into sub-regions objectively?
2) In the face of a budget cut, which testing sites should be dropped?
3) If the number of replicates must be reduced but the number of testing sites increased
   for a particular trial, which additional sites should be considered?
4) Is the site where selection in the early generations is performed representative?
5) In cases where there are several landraces in an early stage of variety selection,
   entries are grouped and a number of landraces selected from each group to proceed to
   the next breeding trials.

Cluster analysis is also used to group genotypes into homogeneous sets based on their
response to the environments considered. This result normally complements the stability
parameters calculated using the regression techniques described earlier. Therefore,
clustering is used for characterization, classification and the study of G x E
interaction to help in variety development programs.

In terms of G x E interaction, therefore, the objective of cluster analysis is to remove
or minimize the effect of G x E interaction from the data so that mean yield is adjusted
for the interaction effect.

Example 3: Socio-economic studies

In technology adoption studies where a number of farmers are asked their opinion about a
particular variety, clustering may be used to group the respondents based on their
opinions so that differences in response to the adopted technology are fairly evaluated.

In developing new varieties, plant breeders are always faced with the choice of two
strategies:

(I) to breed widely adapted varieties within the region, or

(II) to breed for varieties each adapted to a specific sub-region within the region.

Whichever breeding strategy is chosen, stratification of environments is essential. It is
obvious that for the second option, contrasting sub-regions within which environments are
similar need to be identified. However, even for the first option, representative
environments from contrasting sub-regions are required to effectively screen for wide
adaptability during the intermediate and final stages of testing. Various approaches have
been suggested to stratify environments into sub-groups, but so far cluster analysis
based on grain yield and other characteristics has been most widely used.

Cluster analysis has been used to classify genotypes. Based on similarity of response to
environments, genotypes can be assigned to qualitatively homogeneous groups. This
information complements the stability parameters obtained from the regression analysis
made popular by Finlay & Wilkinson and Eberhart & Russell.

7.2 Types of Classification Available


Fig. 1 shows the types of classification available, with different computational and
mathematical considerations for different users' requirements. An exclusive
classification is one in which a given element occurs in one class and one class only. In
an intrinsic classification, all attributes used are regarded as equivalent.

CHOICE 1

    Nonexclusive        Exclusive

CHOICE 2

    Extrinsic           Intrinsic

CHOICE 3

    Hierarchical        Nonhierarchical

CHOICE 5 (hierarchical)             CHOICE 4 (nonhierarchical)

    Divisive   Agglomerative            Serial optimization   Simultaneous optimization

CHOICE 6

    Monothetic          Polythetic

Fig. 1. Types of classification.

All exclusive intrinsic classifications are used for the same sort of purpose, so the
strategies within the general class are largely interchangeable and the choice will
largely be determined by pragmatic considerations, mainly of a computational nature. The
most important choice is that between hierarchical and non-hierarchical strategies. A
hierarchical strategy always optimizes a route between the entire population and the sets
of individuals of which it is composed; the route may be defined by progressive fusion or
by progressive division.

With a non-hierarchical strategy, it is the structure of the individual groups which is
optimized, since these are made as homogeneous as possible. For cases in which
homogeneity of groups is of prime importance, the non-hierarchical strategies are very
attractive; however, their current state of development lags far behind the hierarchical
strategies.

An agglomerative strategy is one that proceeds by progressive fusion, beginning with the
individuals and ending with the complete population. All agglomerative classifications
are polythetic. This is the most commonly used strategy.

A polythetic system is one based on a measure of similarity applied over all attributes, so
that an individual is grouped with those individuals which, on the average, it most
resembles. It appears that the ideal hierarchical system is divisive polythetic, but only a
limited number of programs are available.

7.3 Measuring Similarity Between Two Items
Classification requires defining some measure of similarity, closeness or distance
between the elements to be classified. For continuous variables the simplest, oldest and
most commonly used is the Euclidean distance. In 2 dimensions, suppose that 2 points have
coordinates (x11, x21) and (x12, x22); then the Euclidean distance is

√[(x11 - x12)² + (x21 - x22)²]

For example, the distance between the points (5, 7) and (8, 5) is

√[(5 - 8)² + (7 - 5)²] = √13 = 3.61

Usually there is no need to take the square root, since the operation does not change the
ordering of how close the points are; the alternative measure is therefore known as the
squared Euclidean distance. When there are many variables, the distance can no longer be
drawn on paper but is given by a formula: if there are s variables, the squared Euclidean
distance between the ith and jth entries is

$d_{ij}^2 = \sum_{k=1}^{s} (x_{ik} - x_{jk})^2$
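As an illustration, the squared Euclidean distance between varieties 1 and 2 of Example 2
above can be computed with a short SAS data step (a minimal sketch; the three site means
are taken from the Example 2 table):

data dist;
  array e1{3} _temporary_ (2030 1645 1216);  /* variety 1 site means */
  array e2{3} _temporary_ (1923 2020 1934);  /* variety 2 site means */
  d2 = 0;
  do k = 1 to 3;
    d2 + (e1{k} - e2{k})**2;   /* accumulate the squared differences */
  end;
  d = sqrt(d2);                /* the ordinary Euclidean distance    */
run;

proc print data=dist;
run;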

When the variables have different units, one needs to standardize the data before
computing the distance; this is known as the standardized Euclidean distance. One can use
the range for this purpose, but more commonly the standard deviation is used: subtract
the variable mean from each observation and then divide by the phenotypic standard
deviation of that variable. After standardization, each variable has zero mean and unit
variance and becomes dimensionless (no units of measurement). If the term standardized
Euclidean distance is used without further explanation, it is this particular form of
standardization that is implied.

Another common measure of similarity is the simple correlation coefficient. It can be
shown mathematically that the standardized squared Euclidean distance between two
variables is related to the correlation coefficient, and thus the two provide identical
information.

For grouping environments based on the grain yield of a number of genotypes, the
Euclidean distance should also be standardized when differences in environmental mean
yield are large for recognizable reasons. Otherwise, there will be a tendency for groups
having similar yields to fuse together (Fig. 3). It should be noted that high and low
yielding environments which rank cultivars identically are more similar in a selection
sense than 2 environments of similar yield which rank cultivars differently.

7.4 Strategies for Forming Groups

The hierarchical agglomerative strategy is the most commonly used clustering technique.
Beyond the first fusion, we need a fusion strategy that measures the similarity between
an individual and a group, or between two groups, and thereby determines the subsequent
course of the analysis. There are many fusion strategies; only the common and simple ones
are presented here:

1. Single linkage (also called Minimum Distance or Nearest Neighbor):

The distance between 2 clusters is taken as the distance between the 2 nearest points,
one from each cluster. The strategy is called space-contracting: as a group grows, it
becomes progressively easier for further points to join it. Clustering is very weak or
non-existent; individuals tend to join a group one after another, a phenomenon known as
chaining. As such, it is undesirable for most biological applications.

2. Complete linkage (also called Maximum Distance or Furthest Neighbor):

The distance between 2 clusters is taken as the distance between the 2 furthest points,
one from each cluster. The strategy is called space-dilating; it sharpens any
discontinuities so that clustering is intensified, and the data are thus distorted. It is
possible that outliers may be swept into a 'non-conformist group' whose members share
only the property that they are rather unlike everybody else, including each other. Such
a strategy is useful when it is known or suspected that no sharp discontinuities exist
and it is wished to produce as 'clean' a classification as possible, i.e., to make the
best of a bad job.

3. Centroid:

The distance between 2 clusters is defined as the distance between the group centroids.
The centroid is the point whose coordinates are the means of all the observations in the
cluster; if the cluster has one observation, the centroid is that observation itself. The
method is not monotonic, i.e., later fusions may form at a smaller distance. When drawn
on paper there will be 'reversals', which cause a later fusion to lie below earlier ones
and make the dendrogram difficult to follow.

4. Group-average:

The distance between 2 clusters is defined as the average distance between all pairs of
observations, one from each cluster. It is nearly space-conserving and can be regarded as
such for most practical purposes. It is monotonic, without the problems of centroid
linkage, and it avoids the extremes of both the single and the complete linkage methods.
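In SAS, each of these fusion strategies corresponds to a METHOD= option of PROC CLUSTER;
a minimal sketch (the data set name is hypothetical, while the id and trait variables
follow the pulse-program example analyzed below):

proc cluster data=trial method=average outtree=tree;
  /* method=single, complete, centroid or ward may be substituted */
  var x2-x15;
  id x1;
run;

proc tree data=tree nclusters=3 out=part;
  id x1;
run;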

Several simulation studies comparing various methods of cluster analysis have been
performed. Although the results of such studies were inconsistent and confusing in many
respects, average linkage and Ward's minimum variance were found to perform best in
grouping cases.

It should therefore be noted that different fusion strategies can produce quite different
results, so the choice of an appropriate fusion strategy is extremely important.

7.5 Interpretation of results

An agglomerative strategy results in a dendrogram or tree diagram, a visual presentation
of the successive fusions as we pass from the individuals to the one big cluster (see
figure below).

[Dendrogram: seven individuals, with leaves in the order b, c, a, d, g, e, f, joined by
successive fusions at increasing heights]

Fig. A typical dendrogram showing alternative methods of display.
For the above figure, beginners can easily jump to the conclusion that b and f are the most
unlike, simply because they are at opposite ends. However, we should not forget that a
node simply represents a fusion between two equivalent entities, and any one of the nodes
can be rotated without altering the sequence of fusions (see the figure below)

[Dendrogram: the same fusions with the leaves re-ordered d, b, c, a, e, f, g]

Fig. The same dendrogram as in the previous figure; the original dendrogram has been
rotated round two of its nodes.

The final partition of the population is defined by cutting the dendrogram. Choosing the
level of the cut, and thus the number of classes, needs a lot of judgment, except when
real, well-defined groups exist. As a general rule, a large increase in distance between
two successive fusions should be a signal for examining those particular steps: a large
distance indicates that at least two very different observations exist, one from each of
the clusters just combined. The result should be inspected from a subject-matter
perspective.

In cluster analysis, determining the number of clusters is the most difficult job,
because there are no fully satisfactory methods for determining the number of population
clusters for any type of cluster analysis. Several attempts have been made through
simulation studies to search for the best methods of determining the number of classes. A
notable attempt was made by Milligan and Cooper (1985) and Cooper and Milligan (1984),
who compared thirty methods for estimating the number of population clusters using four
hierarchical clustering methods. They identified the three best performing criteria in
their simulations, namely the pseudo F statistic, the pseudo t² statistic and the cubic
clustering criterion (CCC). The procedure for selecting the appropriate number of
clusters is to look for consensus among the three statistics, that is, local peaks of the
CCC and pseudo F statistics combined with a small value of the pseudo t² statistic and a
large pseudo t² at the next cluster fusion.

Basically, these criteria locate all feasible cluster numbers, and some cluster numbers
can be equally appropriate. For example, the criteria may suggest 2 or 4 clusters from a
particular data set, but judgment should also consider the biological aspects of the
data. Therefore, once candidates are selected by cluster analysis, the practical aspects
of the research may be considered for the final decision. Cluster analysis uses different
clustering methods, and these methods use different measures of distance and fusion
strategies, which can lead to very different groupings. It is therefore important to make
several analyses using different methods on the same data to help in the final decision.

Part of the data was taken from the pulse program to illustrate cluster analysis in three
packages.

The SAS package (version 6.12) was used to produce the following.

The data set contains 50 entries, and for each entry 15 traits were measured. The traits
include plant height, days to heading, days to maturity, number of tillers, spikes,
kernel weight, grain yield and other quantitative characteristics. The SAS program used
to generate the output below is:

PROGRAM EDITOR Command = =>

00001 options ls=80;
00002 data gnt; infile 'c:\girma\gnt3';
00003 input x1-x15;
00004 run;
00005 proc cluster method=average pseudo ccc noeigen outtree=tree;
00006 id x1;
00007 var x2-x15;
00008 proc tree noprint sort height=n;
00009 proc tree noprint out=part nclusters=3;
00010 id x1;
00011 copy x2;
00012 proc sort;
00013 by cluster;
00014 proc print uniform;
00015 id x1;
00016 var x2;
00017 format x2 1.;
00018 by cluster;
00019 run;

Three clustering methods were used: average linkage, centroid and Ward's minimum
variance. In all of them the pseudo F reached its peak at cluster 3. The pseudo t²
reached its minimum at cluster 3 in both average linkage and centroid clustering, with
higher values below and above it. Therefore, it seems that all three methods agree on the
same number of clusters. Looking for a second candidate is a bit difficult because, if we
go for cluster 2, the pseudo F is smaller than that of cluster 3 and the t² is larger
than that of cluster 3. This particular data set seems to have a clear pattern in which
the entries can be unambiguously grouped. Hence we may judge that the entries can be
classified into 3 clusters based on the 15 characteristics.

The next step is to go back to line 9 of the SAS program, set nclusters=3 (as shown) and
re-run the program so that we obtain the list of entries that fall in each group. Old
versions of SAS can also give a dendrogram, but it is not presentable for reporting
purposes; the new version has shown tremendous improvement in this regard.

Average Linkage Cluster Analysis

Root-Mean-Square Total-Sample Standard Deviation = 14.08106


Root-Mean-Square Distance Between Observations = 44.52822

Number Frequency Normalized


of of New Pseudo Pseudo RMS
Clusters Clusters Joined Cluster F t**2 Distance Tie

49 6 7 2 1012.01 . 0.031760
48 9 10 2 317.98 . 0.074484
47 64 63 2 218.43 . 0.089831
46 14 15 2 179.83 . 0.097891 T
45 24 23 2 157.61 . 0.105336 T
44 59 60 2 147.21 . 0.105336
43 1 4 2 131.48 . 0.125039
42 CL46 12 3 119.28 1.88 0.126043
41 3 2 2 114.43 . 0.127040
40 25 CL45 3 108.69 1.73 0.130950 T
39 54 58 2 106.78 . 0.130950
38 66 65 2 103.78 . 0.142035
37 CL42 16 4 99.03 1.84 0.146693
36 CL49 5 3 94.16 29.00 0.148967
35 70 69 2 91.54 . 0.166551
34 57 56 2 89.24 . 0.172501
33 67 62 2 87.70 . 0.173956
32 19 20 2 86.75 . 0.175400
31 61 68 2 85.67 . 0.183824
30 CL40 21 4 82.86 2.97 0.186997
29 CL48 CL43 4 78.20 5.76 0.189232
28 26 27 2 78.59 . 0.193188
27 30 29 2 78.99 . 0.199608
26 CL34 51 3 79.19 1.49 0.201494
25 22 28 2 80.27 . 0.202119
24 52 CL44 3 79.55 5.30 0.216574
23 CL38 CL31 4 78.01 2.68 0.222886
22 CL36 CL29 7 73.11 5.28 0.223451
21 CL37 11 5 73.23 4.18 0.229574
20 17 18 2 75.38 . 0.232304
19 CL28 CL25 4 76.44 1.90 0.238200
18 CL26 53 4 77.56 2.20 0.258019
17 CL39 CL24 5 75.75 4.74 0.270737
16 CL21 CL32 7 72.52 5.92 0.292554
15 CL47 CL33 4 72.18 8.11 0.295385
14 CL23 55 5 74.11 2.96 0.306076
13 CL27 CL19 6 73.44 4.35 0.327951
12 CL22 CL41 9 71.27 8.24 0.337719
11 CL16 CL20 9 71.83 4.54 0.347861
10 CL30 CL13 10 69.41 7.14 0.362641
9 CL15 CL17 9 66.70 9.16 0.412084
8 CL35 CL14 7 67.20 9.34 0.462813
7 CL8 CL9 16 59.58 9.87 0.518391
6 CL12 8 10 66.63 6.77 0.531920
5 CL6 CL10 20 55.06 24.66 0.595437
4 CL7 CL18 20 56.60 12.12 0.668186
3 CL11 13 10 77.96 11.03 0.734190
2 CL5 CL3 30 71.74 38.35 0.901291
1 CL2 CL4 50 . 71.74 1.262827

Centroid Hierarchical Cluster Analysis

Root-Mean-Square Total-Sample Standard Deviation = 14.08106


Root-Mean-Square Distance Between Observations = 44.52822

Number Frequency Normalized


of of New Pseudo Pseudo Centroid
Clusters Clusters Joined Cluster F t**2 Distance Tie

49 6 7 2 1012.01 . 0.031760
48 9 10 2 317.98 . 0.074484
47 64 63 2 218.43 . 0.089831
46 14 15 2 179.83 . 0.097891 T
45 24 23 2 157.61 . 0.105336 T
44 59 60 2 147.21 . 0.105336
43 CL46 12 3 126.67 1.88 0.116152
42 25 CL45 3 114.23 1.73 0.119891
41 1 4 2 110.93 . 0.125039
40 3 2 2 108.69 . 0.127040
39 CL43 16 4 100.51 1.84 0.130091
38 54 58 2 100.35 . 0.130950
37 66 65 2 99.03 . 0.142035
36 CL49 5 3 94.16 29.00 0.148118
35 70 69 2 91.54 . 0.166551
34 57 56 2 89.24 . 0.172501
33 CL42 21 4 83.53 2.97 0.172987
32 67 62 2 83.12 . 0.173956
31 CL48 CL41 4 76.77 5.76 0.174680
30 19 20 2 77.59 . 0.175400
29 CL37 68 3 76.55 2.17 0.181060
28 CL34 51 3 76.02 1.49 0.182101
27 CL36 CL31 7 67.02 5.28 0.186040
26 26 27 2 68.66 . 0.193188
25 29 CL26 3 69.29 1.34 0.193840 T
24 22 28 2 71.13 . 0.202119
23 CL29 61 4 71.31 1.96 0.204051
22 52 CL44 3 72.25 5.30 0.210072
21 CL39 11 5 72.44 4.18 0.214673
20 17 18 2 74.60 . 0.232304
19 CL28 53 4 75.21 2.20 0.232906
18 30 CL25 4 76.33 1.88 0.234226
17 CL38 CL22 5 74.68 4.74 0.239490
16 CL21 CL30 7 71.62 5.92 0.255396
15 CL18 CL24 6 69.79 3.63 0.267791
14 CL47 CL32 4 70.58 8.11 0.278692
13 CL23 55 5 73.44 2.96 0.279088
12 CL16 CL20 9 72.37 4.54 0.288023
11 CL33 CL15 10 68.00 7.14 0.297711
10 CL27 CL40 9 69.41 8.24 0.305794
9 CL14 CL17 9 66.70 9.16 0.350901
8 CL35 CL13 7 67.20 9.34 0.426808
7 CL8 CL9 16 59.58 9.87 0.397489
6 CL10 8 10 66.63 6.77 0.503030
5 CL6 CL11 20 55.06 24.66 0.509651
4 CL7 CL19 20 56.60 12.12 0.577081
3 CL12 13 10 77.96 11.03 0.708941
2 CL5 CL3 30 71.74 38.35 0.788646
1 CL2 CL4 50 . 71.74 1.105995

Ward's Minimum Variance Cluster Analysis

Root-Mean-Square Total-Sample Standard Deviation = 14.08106


Root-Mean-Square Distance Between Observations = 44.52822

T
Pseudo Pseudo i
NCL Clusters Joined FREQ SPRSQ RSQ F t**2 e

49 6 7 2 0.000021 0.99998 1012.0 .


48 9 10 2 0.000113 0.99987 318.0 .
47 64 63 2 0.000165 0.99970 218.4 .
46 14 15 2 0.000196 0.99951 179.8 . T
45 24 23 2 0.000226 0.99928 157.6 . T
44 59 60 2 0.000226 0.99905 147.2 .
43 1 4 2 0.000319 0.99873 131.5 .
42 3 2 2 0.000329 0.99840 122.1 .
41 54 58 2 0.000350 0.99805 115.4 .
40 CL46 12 3 0.000367 0.99769 110.6 1.9
39 25 CL45 3 0.000391 0.99730 106.8 1.7
38 66 65 2 0.000412 0.99688 103.8 .
37 CL40 16 4 0.000518 0.99637 99.0 1.8
36 70 69 2 0.000566 0.99580 94.9 .
35 CL49 5 3 0.000597 0.99520 91.5 29.0
34 57 56 2 0.000607 0.99460 89.2 .
33 67 62 2 0.000618 0.99398 87.7 .
32 19 20 2 0.000628 0.99335 86.7 .
31 61 68 2 0.000690 0.99266 85.7 .
30 26 27 2 0.000762 0.99190 84.4 .
29 30 29 2 0.000813 0.99109 83.4 .
28 22 28 2 0.000834 0.99025 82.8 . T
27 51 53 2 0.000834 0.98942 82.7 .
26 CL39 21 4 0.000916 0.98850 82.5 3.0
25 17 18 2 0.001101 0.98740 81.6 .
24 52 CL44 3 0.001201 0.98620 80.8 5.3
23 CL48 CL43 4 0.001245 0.98496 80.3 5.8
22 CL38 CL31 4 0.001477 0.98348 79.4 2.7
21 CL37 11 5 0.001505 0.98197 79.0 4.2
20 CL30 CL28 4 0.001518 0.98046 79.2 1.9
19 CL34 CL27 4 0.001729 0.97873 79.2 2.4
18 CL35 CL23 7 0.002422 0.97630 77.6 5.3
17 CL22 55 5 0.002543 0.97376 76.5 3.0
16 CL41 CL24 5 0.002809 0.97095 75.8 4.7
15 CL25 CL32 4 0.002954 0.96800 75.6 3.4
14 CL47 CL33 4 0.003170 0.96483 76.0 8.1
13 CL29 CL20 6 0.004273 0.96055 75.1 4.4
12 CL18 CL42 9 0.005937 0.95462 72.7 8.2
11 CL21 CL15 9 0.006116 0.94850 71.8 5.9
10 8 CL26 5 0.007697 0.94080 70.6 15.1
9 CL36 CL14 6 0.009572 0.93123 69.4 8.5
8 CL16 CL19 9 0.016251 0.91498 64.6 14.7
7 CL10 CL13 11 0.016453 0.89853 63.5 8.5
6 CL11 13 10 0.018463 0.88006 64.6 11.0
5 CL9 CL17 11 0.021921 0.85814 68.1 10.3
4 CL5 CL8 20 0.042935 0.81521 67.6 11.9
3 CL12 CL7 20 0.046836 0.76837 78.0 18.8
2 CL3 CL6 30 0.169241 0.59913 71.7 38.4
1 CL2 CL4 50 0.599131 0.00000 . 71.7

After the three-cluster level is decided, the SAS program is modified as described above
to output the list of entries per cluster, so that we know which entry belongs to which
cluster:

---------------------------------- CLUSTER=1 ------------------------------

X1 X2

6 1
7 2
9 3
10 4
24 *
23 *
1 5
4 8
3 9
2 *
25 *
5 6
26 *
27 *
30 *
29 *
22 *
28 *
21 *
8 7

---------------------------------- CLUSTER=2 ------------------------------

X1 X2

64 *
63 *
59 *
60 *
54 *
58 *
66 *
65 *
70 *
69 *
57 *
56 *
67 *
62 *
61 *
68 *
51 *
53 *
52 *
55 *

---------------------------------- CLUSTER=3 ------------------------------

X1 X2

14 *
15 *
12 *
16 *
19 *
20 *
17 *
18 *
11 *
13 *

Cluster analysis can also be done using SPSS and CLUSTAN GRAPHICS (a small program
written by Dr. Wishart, dedicated to cluster analysis). SPSS version 10.1 produces a
number of outputs: the agglomeration schedule, which shows the history of how the
clusters were combined, the cluster membership, and a dendrogram for average linkage
clustering. To get the cluster membership you must supply the number of clusters you
think appropriate in the "statistics" sub-menu of the cluster dialog box. Thus SPSS and
SAS follow very similar procedures in their outputs, but SAS gives definite criteria for
judging the appropriate number of clusters. The dendrogram produced by SPSS is attached
for comparison.

CLUSTAN GRAPHICS produces only a dendrogram and the cluster means for each characteristic
measured. The difference between this package and the others is that CLUSTAN
automatically determines the number of clusters appropriate for a particular data set and
highlights the clusters with colours. This has both advantages and disadvantages: the
advantage is that users will not have much problem identifying the appropriate number of
clusters; the disadvantage is that users are not given alternatives from which to
consider the biological aspects and choose. The dendrogram from CLUSTAN is also included
for reference.

SPSS dendrogram

8. Establishing Field Experiments
Establishing field experiments is one of the most important areas requiring due attention
before beginning a trial. This is also the appropriate stage at which the homogeneity of
the land made available for the experiment and the number of treatments selected are
examined. In particular, the trade-off between cost and precision, that is, between the
availability of resources and the level of precision attainable with those resources,
should be thoroughly examined prior to execution.

8.1 Selection of experimental site

The first issue is size. The frequent and obvious question in this regard is: can the
site accommodate the whole experiment? This has several consequences. First, if you are
planning to conduct the experiment in complete blocks, then you must be sure that each
block can contain all treatment combinations; otherwise, you need to explore the
possibility of using incomplete block designs. Note, however, that the plots forming a
block need not be contiguous in the field: plots located apart can be considered elements
of the same block as long as homogeneity within the block is maintained.

Is there sufficient information on land and soil characteristics, so that the history of
the land is well known, to set up local control mechanisms such as the blocking structure
and direction and the possible use of covariates?

Is it representative of the environment? This is particularly important if conclusions
and inferences are required for the whole agro-ecology.

Is the soil uniform (texture, depth, . . .)? The soil needs to be uniform to make the
blocking structure efficient; otherwise there is a possibility that a
genotype-by-environment interaction component, which hinders yield estimation, appears in
the analysis. It is therefore advantageous to know the past cultural practices and to
ensure adequate access to roads for proper follow-up of the experiment.

To minimize soil heterogeneity:

♦ Avoid previous experimental areas where treatments may have residual effects on the
  soil, since these might create differential responses of varieties within the same
  blocking structure.
♦ Avoid areas used as unplanted alleys in previous works.
♦ Choose an experimental design that fits the pattern of soil heterogeneity. There are
  now a number of possibilities in this regard; one obvious alternative is first to
  observe the pattern of soil heterogeneity and then select a design that matches it.
  Recent developments in this area, also known as spatial statistics, are becoming a
  powerful tool that can detect spatial variation so that the information is used during
  analysis.
♦ Choose a proper plot size and shape. This is a serious problem in several research
  programs, because the plot sizes and shapes used for several commodities are based
  either on the experience of other countries or on personal judgement.
♦ Orient plots and blocks carefully, as they have to go with the pattern of soil
  heterogeneity.
♦ Increase replication. This helps to obtain more information and to estimate the
  experimental error more precisely.

8.2 Selecting experimental material and methods

Once the researchable problem is identified and the objectives set properly, the next
step is to select the experimental materials and the method of experimentation. Make sure
that the treatments selected are independent, and that the experimental units are equally
fair to all treatments. If the treatments have to be subjected to machines, or have to
pass through certain mechanisms, make sure that all treatments are handled in the same
way, are subjected to the system for the same period of time, under the same
meteorological conditions, etc. Also select appropriate statistical/mathematical analysis
tools that can correctly reveal the inherent patterns in the data.

8.3 Randomization
Perform the randomization that suits your experimental design; each experimental design
has its own randomization. For example, in an RCBD randomization is done within blocks.
In a split-plot, two-stage randomization is required: first randomize the main plot
factor on the main plots within each block, then randomize the split-plot factor on the
split-plots within each main plot. In particular, use a separate randomization for each
location if a multi-location trial is planned; otherwise the same treatments will sit
side by side at every location and problems of serial correlation may occur. Therefore,
do not send the same copy of the randomization plan to all cooperating centers. A
randomization plan can be generated as sketched below.
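For a split-plot such as the one in section 6.1, the two-stage randomization can be
generated with PROC PLAN; a minimal sketch (the seed and factor names are only
illustrative; the nested FACTORS statement permutes the 4 main plots within each rep and
the 3 subplots within each main plot):

proc plan seed=27371;
  factors rep=4 ordered mainplot=4 subplot=3;
  /* map mainplot 1-4 to the dates of sowing and subplot 1-3 to the varieties */
run;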

8.4 Preparation of planting plan, notebook and other records

Once the experimental design is selected and the treatments identified, the planting plan
has to be prepared ahead of time. First prepare a design protocol consisting of the
treatment codes, a description of their levels and all information required to explain a
particular treatment, and the planting plan. Sketch the plan carefully and show the
levels assigned to each experimental unit. The protocol should also include
planting/harvesting dates, humidity, and the average rainfall of the area. A data
collection format should also be prepared, with all the characteristics to be measured
indicated: start with the factor variables (rep, treatment, etc.) and continue with the
characteristics measured (height, seed size, yield, etc.). Most statistical and database
packages take in data in the same way, so it is not necessary to prepare the data format
with a particular package in mind. A sample data entry format is given below for a
two-factor RCBD design over 2 locations.

       Factor variables                                 Characteristics
Plot   Location  Rep  Variety  Nitrogen  Row  Column    Plant   Seed/  Seed   Yield
No                                                      height  pod    size
1      1         1    1        1         1    1         .       .      .      .
.      1         1    1        2         .    .         .       .      .      .
.      1         1    2        1         .    .         .       .      .      .
.      .         1    2        2         .    .         .       .      .      .
.      .         2    .        .         .    .         .       .      .      .
.      .         2    .        .         .    .         .       .      .      .
.      .         .    .        .         .    .         .       .      .      .
.      .         4    .        .         .    .         .       .      .      .
.      1         4    .        .         .    .         .       .      .      .
.      2         1    .        .         .    .         .       .      .      .
.      .         .    .        .         .    .         .       .      .      .
.      2         4    2        2         .    .         .       .      .      .

The plot number may be important for some packages and not for others; it is therefore a
good idea to keep it in the structure. Row and column are also included here even though
the design is not a row-column design. This is because recent developments in methodology
enable us to do what is called "nearest neighbour analysis", which takes into account the
spatial variability in the experimental field and adjusts the parameters estimated from
the data.

8.5 Other important aspects

♦ Care should be taken in applying chemicals and using equipment.

♦ Since treatments have to be compared on similar grounds, land preparation should be
  uniform, particularly within the blocking structure.

♦ For crops requiring transplanting, the appropriate time and conditions need to be
  ensured.

♦ Cultural management must be uniformly applied to all experimental units.
9. Collection of Experimental Data and Underlying Assumptions

9.1 Data Collection
The design protocol was discussed in a previous section, and we now assume that the data
to be collected have been exhaustively listed in the materials and methods, so that the
researcher already knows what data are to be collected. The concern here is therefore the
techniques for collecting data from the field. These depend on the type of information
required. For measuring grain yield, for example, we use a destructive method, i.e.,
harvesting the whole plot (for standard on-station plots). For large plots, and for some
plant characters, we normally sample plants, groups of plants, or part of a plant (e.g.,
a leaf) to take a measurement. This is identifiable at the beginning and should be
indicated in the materials and methods.

In sampling methods, there are 3 important components to be considered seriously:

1) the sampling unit;
2) the sample size;
3) the selection method.

Sampling is required because we cannot measure all plants in a plot for characters such
as leaf area, plant height, panicle weight, etc. One should therefore plan to achieve
fair representation by employing a reasonable sampling methodology. Use the same sample
plants at all stages of observation when measuring the same character; otherwise there is
no sense of repeated measurement, and the values obtained may not be comparable as they
are. A simple way to draw such samples is sketched below.
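Where a fixed number of plants is to be drawn at random from every plot, modern SAS
offers PROC SURVEYSELECT; a minimal sketch (the data set name, stratum variable and
sample size are hypothetical):

proc surveyselect data=plants method=srs n=5 seed=4321 out=sampled;
  strata plot;   /* draw 5 plants by simple random sampling from each plot */
run;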

Use proper measurement techniques and record the data in the proper manner. Take care in
selecting the unit of measurement (e.g., meter) and make sure to decide on the number of
decimal places. Decide what data are to be collected at the planning stage of the trial,
and do not gather data just because you can get them. The dangers of collecting too much
data are (i) creating confusion, because at a later stage it may be difficult to choose
which data to use, and (ii) if you cannot use all the data collected, they turn out to be
a waste of resources and time.

The date of recording must be specified for some characteristics (e.g. days after
seedling emergence). For qualitative data, codes should be used consistently. Care should
also be taken with the definition of measurements. For example, plant height for seedlings
or juvenile plants is the distance from ground level to the top of the tallest leaf; for
mature plants, it is the distance from the ground to the tallest panicle. Grain weight is
the weight of cleaned and dried grain harvested from a unit area. Moisture content should
also be determined accurately. Leaf area can be measured in different ways, and the choice
depends on your ability to manage the different methods, because leaves of different
plants have shapes that cannot be generalized with a single geometry.

Exceptionally large values should attract your attention, because they could be outliers.
If a value is a false observation, it will have an undue influence on the ANOVA or
regression results, often leading to misleading conclusions.

You may design your experiment very well, say by blocking or otherwise adjusting for
systematic patterns in the field, yet there can be unexpected damage (e.g. by rodents) in
some of the plots. You may register the plots with such problems, calculate an index for
them based on severity, and then apply what is called covariance analysis to adjust for
the damage (see 4.5 for detail).

The last activity in this regard is to prepare the data for the computer. The structure of
the data is nearly the same for all statistical packages (the structure is given in 7.4):
plots are treated as case numbers (rows) and the characteristics measured (known as
variables) occupy columns. These days there are ample facilities for transferring data
files from one package to another, so any statistical package, spreadsheet or database may
be used for data entry, since the data can be converted to a different format when
required.
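
As an illustration, this plot-per-row structure can be sketched as a SAS data step. The
file name and variable list below are hypothetical placeholders for an actual trial:

/* Read one plot per row; factor variables first, then the characters measured */
Data field;
  Infile 'trial.dat';          /* hypothetical raw data file */
  Input plot loc rep variety nitrogen row column height seedpod seedsize yield;
Run;

Proc Print Data=field (obs=5); /* inspect the first few cases */
Run;

Once the data are in this form they can be exported to, or read by, any of the packages
discussed in the next section.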

9.2. Assumptions Underlying Analysis of Variance


The construction of the analysis of variance is based on certain assumptions, just like
the assumptions we make in everyday life. Suppose, for example, that you are interested in
building a house on land leased from an individual without official permission, assuming
that the plot will eventually be yours some day. Let us further assume that you go ahead
and build the house on this assumption. Soon after you finish building, the government
issues a strict guideline for reclaiming illegally occupied plots, and your house is
demolished by officials since it is illegal. There was nothing wrong with the technical
quality of the building; but since the assumption you started with about the land was not
correct, you lose your investment. The same is true of data analysis. You may collect your
data carefully, and the data may even appear to have a clear structure; but whatever
analysis you do with such data, and however good the model you fit, the results are not
valid if the assumptions on which the analysis of variance is based do not hold.
Therefore, after data collection we should not rush into analysis; it is always good to
check the assumptions of ANOVA first.

There are four assumptions underlying ANOVA.

1) Assumption of Normality

The analysis of variance and several test procedures (F, t, z) require the data to be
approximately normally distributed. In agricultural and biological sciences, nature has
endowed most continuous measurements with an approximately normal distribution, so there
is usually no serious problem in this regard. Problems typically arise for count, scored
or proportion data.

To check this assumption:

♦ Plot the residuals against the expected values of the normal order statistics (SAS,
GENSTAT). If the plot is approximately a straight line, the data are normally distributed.

♦ A simple plot of the data will roughly show its approximate distribution.

♦ Use background information.
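
As a sketch of the first method, assuming an RCBD data set named FIELD with variables
REP, TREAT and YIELD (hypothetical names), the residuals can be saved and examined in
SAS as follows:

Proc GLM Data=field;
  Class rep treat;
  Model yield = rep treat;
  Output Out=res R=resid;               /* save the residuals */
Run;

Proc Univariate Data=res Normal Plot;   /* normality tests and plots */
  Var resid;
Run;

The NORMAL option prints formal tests (such as Shapiro-Wilk) in addition to the plots.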

2) Variance homogeneity assumption

When means from different populations are to be compared, an important assumption is that
the populations from which the means are obtained have roughly similar variances. If the
variances of the different populations are very different, it is not possible to pool them
and perform significance tests. This assumption is particularly important when working
with multi-location and multi-year trials, where a combined analysis of variance is
required.
Check in particular if:

1) the data are counts;
2) the treatment means are considerably different.

Methods of detection:

1) Use the F-ratio of the two variances. If the ratio is much greater than 2, the two
populations differ in variability.
2) Use a χ² test (compare the ratio of variances with the critical points of χ²).

Example:        Treatment         Mean

                Control           150
                Insecticide A      85
                Insecticide B       8
                Insecticide C       1.2
                s.e. of means       6.4

For this example homogeneity is suspect because the treatment means are very different.
The common s.e. of 6.4 underestimates the variability of the control and overestimates
that of insecticide C.
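
A formal check can be sketched in SAS with Levene's test, assuming a one-way layout with
variables TREAT and Y (hypothetical names):

Proc GLM Data=trial;
  Class treat;
  Model y = treat;
  Means treat / Hovtest=Levene;   /* homogeneity-of-variance test */
Run;

A significant result indicates that the variances differ and pooling is not appropriate.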

3) Additivity

For an RCBD, the difference in yield between treatment means should, on average, be the
same in each block (i.e. no block × treatment interaction).

Example (fictitious data):

Model          Additive              Multiplicative
Block          1        2            1        2
Treat 1        10       20           10       50
Treat 2        15       25           15       75

Additive: 15 − 10 = 25 − 20 ⇒ additive.   Multiplicative: 15 − 10 ≠ 75 − 50 ⇒ non-additive.

This is often a source of controversy in research review meetings, because no one knows
exactly what block size is uniform (optimum) enough to avoid the problem of
non-additivity. Obviously, as the number of treatments included in a complete block
increases, the block loses homogeneity. This gives rise to block-by-treatment interaction:
since plots within the block are not uniform, some treatments are favoured, and it is not
valid to compare treatments under such conditions. The solution to such problems is to
determine the optimum plot size for different crops, so that optimum block sizes are fixed
accordingly.

Example of insect counts (sites are blocks):

Type of trap    Site 1    Site 2    Site 3    Mean
A                 11        47       105        54
B                 13        62       170        82
C                 15        87       165        89
D                  7        43        86        45

Although the means in the above table appear different, they are not significantly
different when subjected to analysis of variance, because the data are multiplicative.
Transforming the data using logarithms is therefore suggested. After transformation the F
value was found to be 24, which shows a highly significant difference between the
treatments. This is evidence that failure of an assumption can create an illusion in the
conclusions.
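
The suggested remedy can be sketched in SAS by creating the transformed variate in a data
step and running the ANOVA on it (data set and variable names are hypothetical):

Data trap2;
  Set trap;
  logcount = Log(count + 1);      /* log(x+1) also handles zero counts */
Run;

Proc ANOVA Data=trap2;
  Class site type;
  Model logcount = site type;     /* sites act as blocks */
Run;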

4) Independence

An observation on one unit should not be influenced by an observation on another.
Independence can be achieved by randomization. The problem we often face at EARO is that
the same randomization is sent to collaborating research centres for multi-location
trials. Since the same treatments then occur side by side at all locations, serial
correlation develops, which causes the independence assumption to fail: treatments
allocated to adjacent plots are correlated. A fresh randomization per location, as
sketched below, avoids this.

Two Special Cases of Assumption Failure

When the data are:

1) Poisson
2) Binomial

In both cases the mean is related to the variance. To correct the problem (see the sketch
below):

1) use the arcsine transformation (for binomial proportions);
2) use the square root, √x or √(x + 1) (for Poisson counts).
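
Both transformations can be sketched in a SAS data step (data set and variable names are
hypothetical):

Data corrected;
  Set raw;
  p_trans = Arsin(Sqrt(p));   /* arcsine of root proportion, p between 0 and 1 */
  x_trans = Sqrt(x + 1);      /* square root for small Poisson counts */
Run;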

Detection and Choice of Transformation

Several textbooks and professional statisticians suggest a wide range of detection
methods, some of which are difficult for subject-matter experimenters. A crude method,
however, is to calculate the mean and range (or s.d.) for each treatment (and block).
This gives a rough idea of what is wrong with the data and how to select an appropriate
transformation:

1) if the range is approximately constant, use the original scale;
2) if range/mean is approximately constant, use the log scale;
3) if range/√mean is approximately constant, use the root scale.
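
This crude diagnostic can be sketched in SAS with PROC MEANS (data set and variable names
are hypothetical):

Proc Means Data=field Mean Range Std;
  Class treat;          /* one line of output per treatment */
  Var y;
Run;

The ratios range/mean and range/√mean can then be compared across treatments to choose the
scale.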

Presentation

Presentation is another problem area when an assumption fails and a remedy is adopted. The
analysis and conclusions are appropriate on the transformed scale, but presenting means on
the transformed scale does not make sense to users. There are two schools of thought: one
advocates presenting the raw means only, while drawing inference from the transformed
data; the other advocates presenting both the transformed and the raw means, in the hope
that readers will also see, by comparing the two scales side by side, why transformation
was required in the first place.

10. Statistical Analysis In Different Packages

10.1 Overview
There are several statistical packages available, but a few of them (SAS, GENSTAT and
SPSS) have largely monopolized the market. SAS is the major one in terms of the number of
modules (a module is a program devoted to a specific task, such as regression analysis),
the number of users and the variety of models incorporated. It also covers a wide range of
disciplines: statistics/econometrics, biometrics, bio-medicine, industry, socio-economics,
operations research, quality control, GIS and mathematical modelling. For this reason it
is used by several institutions all over the world. Such diversity in analytical
capability has pushed the price of this package so high that it is becoming unaffordable
for the third world.

Genstat, of British origin, is dedicated to agricultural and bio-medical analytical
methods. It is particularly strong in analyzing a variety of experimental designs, models
and graphics, and is most popular among agricultural, biological and medical/
pharmaceutical researchers. It also offers programming tools with which users can write
programs for new models and add them to the library. The price so far is reasonable and
affordable for most institutions. Genstat is still under development, and program writing
can be difficult for beginners.

SPSS (Statistical Package for the Social Sciences), of American origin, was initially
developed for the social sciences, where survey work is most common. These days, however,
its developers are trying to widen their market and are incorporating models and designs
from other areas. In particular, the latest version of SPSS (10.1) includes generalized
models, multivariate analysis and others. However, SPSS is not preferable for designed
experiments, since there is no design principle in this package.

There are also other packages, such as MSTATC, MINITAB, INSTAT, AGROBASE, SYSTAT,
STATISTICA, CLUSTAN GRAPHICS and others, which are limited in coverage and have smaller
data management capacity. Most of them were written for specific purposes. MSTATC, for
example, was written to handle agronomic and breeding experiments designed in complete
block or lattice designs, and it cannot analyze trials with an incomplete block structure.

As described earlier, SAS handles most activities well: data management, and analysis of
both designed and non-designed experiments and surveys. As is evident in the table below,
most statistical packages have data management problems; most of them allow users to enter
data only into a pre-determined, limited spreadsheet structure and are not flexible about
changing the format, data values or structure. In all such packages, merging data files,
concatenating, transposing, rearranging variables and grouping cases are difficult. In SAS
and SPSS these activities can be done easily.

10.2. AGROBASE versus MSTATC

AGROBASE is statistical software designed mainly for breeding and agronomic experiments.
It incorporates the four basic functions that good statistical software should provide:

1. data/file management facilities,
2. randomization and layout,
3. analysis, and
4. experimental procedures/management.

It also offers other small facilities, such as programming in AGROBASE. The first screen
of AGROBASE (the main window) is composed of three features: the AGROBASE MAIN MENU, the
BUTTON BAR, and the AGROBASE system status.

The basic difference between AGROBASE and MSTATC is that the former has better data
management facilities, more advanced statistical models and designs, randomization
procedures, pedigree management for some crops, and other useful utilities which are not
available in MSTATC.

AGROBASE is breeding-oriented while MSTATC is agronomy-oriented. However, AGROBASE is less
user-friendly than MSTATC: for example, database creation and management in AGROBASE are
not as simple as in MSTATC. Some of the procedures in AGROBASE are not clearly stated, and
some of its tools are copied directly from the dBASE IV program, which makes database
management difficult for someone who does not know dBASE IV. For example, the steps in
database creation are not stated clearly, and deletion of records in AGROBASE is a
two-stage process, just as in dBASE programs, which is confusing.

10.3. The SAS System

The SAS system is a software system for data analysis. It provides one system to meet all
data-analysis computing needs, with a programming language suited to all purposes of
analysis.

The SAS system is a comprehensive and flexible information delivery system that supports:

- information management and retrieval
- data modification and programming
- report generation
- analysis
- file handling

By taking advantage of features built into Windows, the SAS system has increased power and
flexibility, so that it can communicate and share data with other Windows applications
(like EXCEL, ACCESS, ...).

The SAS system also has a menu-driven interface, SAS/ASSIST, which allows most data
operations and analyses to be carried out by mouse selection rather than by writing
programs.

10.4. Genstat
Genstat is a comprehensive statistical system designed to summarize and analyze data on
computer. It provides standard statistical analyses through a menu system, which can
itself be modified and extended through the command language. It manages data entered in
Genstat's own spreadsheet or imported from other file formats. It performs standard tests,
summarizes, compares, fits distributions and draws graphs. It transforms data using
built-in calculation facilities. It analyzes data ranging from one-way analysis of
variance to complex multi-strata designs. It assesses relationships between variables
using regression and correlation, and performs multivariate analyses such as principal
component and cluster analysis. It can even go as far as time series and spectral
analysis. For advanced statistical methods, Genstat furnishes a programming language
which, if managed properly, can handle most of the modelling practised in agricultural
research.

10.5. Comparison of different packages

1. The major statistical packages available on the market may be compared as follows.

Major statistical packages

Software     Data          Designed exp.   Surveys   Non-designed
             management
SAS          E             E               E         E
GENSTAT      G             E               F         F
MINITAB      P             G               P         F
SPSS         E             P               E         G
SYSTAT       F             G               F         F
INSTAT       P             G               P         F
AGROBASE     F             G               P         P
MSTATC       F             F               P         P

KEY:
E = Excellent
G = Good
F = Fair
P = Poor

2. Output of different packages for the same data set

The following ANOVA tables are reproduced in the form in which they would be printed by
their respective packages. Only the ANOVA component is presented here, to save space; most
of these packages also produce other output, such as means, standard errors and the CV.
ANOVA of MSTATC

SOURCE D.F SS MS F Prob.


Rep 3 248.550 82.85 2.89 0.0792
Factor A 4 1068.300 267.075 9.32 0.0012
Error 12 343.700 28.6417

ANOVA OF MINITAB

Worksheet size: 100000 cells

MTB > ANOVA 'yield' = rep treatment;
SUBC> Restrict.

Factor       Type     Levels   Values
Rep          Fixed    4        1 2 3 4
Treatment    Fixed    5        1 2 3 4 5

Analysis of Variance for yield

Source       DF    SS         MS        F      P
Rep           3    248.550    82.850    2.89   0.0792
Treatment     4    1068.300   267.075   9.32   0.0012
Error        12    343.700    28.642
Total        19    1660.550

MTB >

The ANOVA tables of MSTATC and MINITAB have a similar structure. Both were basically
developed for designed experiments and understand what rep and treatment mean. MINITAB is
a small program but can handle most basic designs and analyses. It manages incomplete
blocks of modest non-orthogonality, fractional factorial generation and composite designs.
Among multivariate methods, MINITAB handles principal components, discriminant analysis
and others; unfortunately, cluster analysis is not available in release 9.2 or earlier. It
can read and export ASCII data. SYSTAT is a bit different because it considers the
rep × treatment interaction. SPSS does not suit designed experiments because its ANOVA
table is slightly different: it considers all the factors entered into the model as
treatments and produces interactions of all terms if the data are submitted directly. For
this example, the "Univariate" sub-menu under the "General Linear Model" menu was used in
SPSS 10.1 with unique sums of squares. To avoid the rep-by-treatment interaction one must
select the "Model" option, use a "Custom" declaration of terms, and select only main
effects so that the error can be calculated. SPSS does not produce outputs such as the CV,
grand mean and LSD, and can only analyze designs of a simple factorial nature. In
addition, the way it calculates degrees of freedom for interactions is odd and in
principle not correct for designed experiments: here rep and treatment have 3 and 4 d.f.
respectively, which gives 12 d.f. for their interaction, yet SPSS allocated only 11 and
assigned 1 d.f. to error. Similarly, SYSTAT did not produce the error; rather, it put the
12 d.f. as the rep × treatment interaction, just like SPSS.

The recent version (10.1) of SPSS produces more structured and more general ANOVA output.
It now operates under the "General Linear Model" menu, as opposed to the "ANOVA models"
menu of earlier versions, and gives a general ANOVA structure which serves a variety of
purposes. For example, it supports the principle of random and fixed effects models. A
factor is termed fixed if its levels are the only ones available in the whole population,
i.e. the levels were not sampled; random refers to levels chosen from a population
containing other similar levels, which accounts for the random variation in selection.
This principle is also used in SAS and GENSTAT. In the Model sub-menu the type of sums of
squares must be chosen, along with whether interactions should be included; SPSS uses four
types of sums of squares, just like SAS. As mentioned earlier, you must decide how to
obtain the error sum of squares, particularly in a several-factor factorial experiment,
where SPSS produces the interaction of rep with all other factors, which would normally be
pooled and termed error in other packages. One quality of SPSS is that it can handle
incomplete factorial combinations and adjust treatment means; but it cannot handle most
experimental designs used in breeding or agronomy.

GENSTAT, on the other hand, produces a general ANOVA structure which can serve all types
of designs. It designates a stratum for each level of random variation: here the REP
stratum contains only the rep, while treatment enters the next stratum. Note also that the
ANOVA results of most packages are the same except for rounding error, whereas those of
SPSS are completely different from the rest. It should therefore be noted that SPSS may
not be appropriate for designed experiments.

ANOVA OF GENSTAT

***** Analysis of variance ******


Variate: YIELD

Source of variation d.f. s.s m.s. v.r F.pr


REP stratum 3 248.550 82.85 2.89
REP *Units* stratum
TREAT 4 1068.300 267.075 9.32 <.001
Residual 12 343.700 28.6417
Total 19 1660.550

* MESSAGE: The following units have large residuals.

REP 3.00 *units* 5 -5.46 s.e. 2.7

***Standard errors of differences of means***

Table TREAT
Rep 4
D.f. 12
s.e.d. 1.894
ANOVA OF SYSTAT

Effects coding used for categorical variables in model.

Categorical values encountered during processing are:

REP (4 levels):   1, 2, 3, 4
TREAT (5 levels): 1, 2, 3, 4, 5

Dep var: YIELD   N: 20   Multiple R: 1.0000   Squared multiple R: 1.0000

Analysis of Variance
Source        Sum-of-Squares   DF   Mean-Square   F-Ratio    P
REP                248.550      3       82.850      2.89     .
TREAT             1068.300      4      267.075      9.32     .
REP*TREAT          343.700     12       28.642       .       .
ERROR                0.000      0        .           .       .
ANOVA OF SPSS
14 Aug 88 SPSS for WINDOWS Release 6.1

*** ANALYSIS OF VARIANCE ***

YIELD
by REP
   TREAT

UNIQUE sums of squares
All effects entered simultaneously

Source of Variation    Sum of Squares   DF   Mean Square      F     Sig of F
Main Effects               1681.913      7      240.273     7.145     .281
REP                          21.428      3        7.143      .212     .882
TREAT                      1641.529      4      410.382    12.203     .211
2-way Interactions           51.078     11        4.643      .138     .979
REP TREAT                    51.078     11        4.643      .138     .979
Explained                  1717.667     18       95.426     2.838     .440
Residual                     33.630      1       33.630
Total                      1751.297     19       92.174

20 cases were processed.
0 cases (.0 pct) were missing.

ANOVA of SAS

SAS 10:16 Tuesday, January 23,2001

Analysis of Variance Procedure


Class Level Information

Class Levels Values

REP 4 1 2 3 4

FACT 5 1 2 3 4 5

Analysis of Variance Procedure

Dependent Variable: YIELD

Source DF Sum of Squares F Value Pr > F

Model 7 1316.85000000 6.57 0.0024

Error 12 343.70000000

Corrected Total 19 1660.55000000

R-Square C.V. YIELD Mean

0.793020 11.42325 46.85000000

Source DF Anova SS F Value Pr > F

REP 3 248.55000000 2.89 0.0792


FACT 4 1068.30000000 9.32 0.0012

3. Analysis of combined data using SPSS and MSTATC

Analysis of multi-location or multi-year data is common practice in agricultural research,
but not many packages provide an appropriate structure for it. For example, MINITAB,
SYSTAT and INSTAT give no alternatives for nesting rep within location/year, or for new
locations each year versus the same locations each year; such differences between trials
bring considerable alteration to the structure of the ANOVA. Since MSTATC was developed
for breeding and agronomic research, it has all the possibilities for analyzing data from
such trials. The same analysis can also be done in Genstat and SAS; Genstat is stronger in
this respect if one knows how to write a program. From the output shown below, the major
difference between MSTATC and SPSS is that MSTATC calculates the sum of squares for rep
nested within location and does not include rep-by-factor interactions, while SPSS does.
The degrees of freedom and sum of squares for rep in MSTATC are completely different from
those produced by SPSS. It should therefore be noted that the analysis of most designed
experiments may not be appropriate in SPSS or SYSTAT; SPSS is most useful for survey data
or data collected without any prior design.

Combined analysis of 2 factors in MSTATC
(assuming RCBD)

Source        D.F     S.S        F         Prob.
Location       1      408.985    3.6997    0.0704
R(L)           6      752.569    1.1346    0.3823
Factor (A)     1     2547.986   23.0494    0.0001
LA             1        0.763    0.0069
Factor (B)     1      916.731    8.2929    0.0100
LB             1      721.822    6.5297    0.0199
AB             1      115.104    1.0412    0.3211
LAB            1      785.358    7.1045    0.0158
Error         18     1989.801

Combined analysis (assuming split-plot)

Source        D.F     S.S        F         Prob.
Loc            1      408.985    5.4086    0.0590
R(L)           6      752.569    1.6587    0.2771
Factor (A)     1     2547.986   33.6955    0.0011
LA             1        0.763    0.0101
Error (a)      6      453.708
Factor (B)     1      916.732    7.1615    0.0202
LB             1      721.822    5.6389    0.0351
AB             1      115.104    0.8992
LAB            1      785.358    6.1352    0.0291
Error (b)     12     1536.093

Combined analysis of the same data set in SPSS

* * * ANALYSIS OF VARIANCE * * *

YIELD
by LOC
   REP
   FACTA
   FACTB

UNIQUE sums of squares
All effects entered simultaneously

Source of variation       Sum of squares   DF   Mean square   F   Sig of F
Main effects                  2973.867      6      495.645
LOC                            222.609      1      222.609
REP                             86.828      3       28.943
FACTA                         2040.715      1     2040.715
FACTB                          623.715      1      623.715
2-way interactions            1565.641     12      130.470
LOC REP                        208.117      3       69.372
LOC FACTA                       19.624      1       19.624
LOC FACTB                      464.982      1      464.982
REP FACTA                      102.386      3       34.129
REP FACTB                      741.097      3      247.032
FACTA FACTB                     29.435      1       29.435
3-way interactions            1618.420     10      161.842
LOC REP FACTA                  164.916      3       54.972
LOC REP FACTB                  670.749      3      223.583
LOC FACTA FACTB                516.242      1      516.242
REP FACTA FACTB                266.512      3       88.837
4-way interaction              444.940      3      148.313
LOC REP FACTA FACTB            444.940      3      148.313
Explained                     6602.868     31      212.996
Residual                          .000      0         .000
Total                         6602.868     31      212.996

32 cases were processed.
0 cases (.0 pct) were missing.

11. Regression and Correlation Analysis

In previous sections the objective of the analysis was either to see whether there are
differences in mean values between groups (such as varieties or treatments) or to classify
observations (landraces, species, etc.). In this section we study the relationship between
two or more variables.

Response to treatments can be exhibited either by the crop or livestock, in terms of
changes in biological features such as grain yield or animal weight, or by the surrounding
environment, in terms of changes in features such as insect incidence in an entomological
trial or soil nutrients in a fertility trial.

To assess associations between traits or characters, a statistical procedure is needed
that can handle several variables simultaneously. If two plant characters are measured to
represent crop response, the ANOVA and mean comparison procedures can evaluate only one
character at a time, even though response in one character may affect the other, or
treatment effects may influence both characters. Regression and correlation analysis thus
allows a researcher to examine any one, or a combination, of the three types of
association, as long as the variables considered are measured quantitatively.

The three types of association are:

1. association between response variables;
2. association between response and treatment;
3. association between response and environment.

Regression analysis describes the effect of one or more variables, called independent
variables, on a single variable, called the dependent variable; the latter is expressed as
a function of the former.

Correlation analysis, on the other hand, provides a measure of the degree of association
between two or more variables, or of the goodness of fit of a prescribed relationship to
the data at hand.

Regression and correlation analysis can be classified into four types:

1. simple linear regression and correlation;
2. multiple linear regression and correlation;
3. simple non-linear regression and correlation;
4. multiple non-linear regression and correlation.

Linear Relationship
The functional form of the linear relationship between a dependent variable Y and an
independent variable X is represented by the equation:

Y = α + βX

where α is the intercept and β is the slope of the line.

When there is more than one independent variable, say k independent variables
(X1, X2, ..., Xk), the simple linear form Y = α + βX extends to the multiple linear form

Y = α + β1X1 + β2X2 + ... + βkXk

where βi is the amount of change in Y for each unit change in Xi.

The presence of βi (βi ≠ 0) indicates the dependence of Y on Xi. Thus, a test of
significance of each βi, to determine whether or not βi = 0, is an essential part of
regression analysis.

The hypotheses to be tested in regression are:

H0: β = 0, i.e. there is no linear relationship between the dependent (Y) and
independent (X) variables;
Ha: β ≠ 0, i.e. there is a linear relationship between the two.

11.1 Simple Linear Regression and Correlation

In a nitrogen fertilizer trial, for example, all other factors that affect yield, such as
phosphorus application, potassium application, plant population, variety and weed control,
must be controlled. If they are allowed to vary, the assumption of a single independent
variable is not satisfied and simple regression analysis is inappropriate.

The analysis deals with the estimation of, and tests of significance concerning, the two
parameters α and β. It does not provide any test of whether the best functional
relationship between X and Y is indeed linear, because the analysis is performed under the
assumption of a linear relationship between the two.

Steps

1) Compute the estimates of the regression parameters α and β as:

   a = ȳ − b·x̄
   b = ∑xy / ∑x²

   where x and y denote deviations from their respective means.

2) Plot the observed points and draw a graphical representation of the estimated equation.

3) Test the significance of β:

   • Compute the residual mean square as:

     s²yx = [∑y² − (∑xy)²/∑x²] / (n − 2)

   • Compute the tb value as:

     tb = b / √(s²yx/∑x²)

   • Compare the computed tb with the tabular t value with (n − 2) d.f.

4) Construct the (100 − α)% C.I. for β as:

   b ± tα·√(s²yx/∑x²)

Regression analysis can be done in several statistical packages, but results from the most
common ones are presented below.

MSTATC

There are two options in MSTATC for regression: REGR and MULTIREG. The former is designed
to do simple linear regression for different groups and compare the homogeneity of the
regression coefficients; the latter does simple and multiple linear regression.

The example data were taken from the soil compaction study available in the MSTATC example
directory. Two entries under two compaction methods were used in an RCBD with 3 reps. Pod
length, yield, seeds/pod, 1000-seed weight and pods/plant were measured.

For the case of simple linear regression, yield was taken as the dependent variable and
pod length as the independent. The output from MSTATC is shown below.

MSTATC gives a lot of regression output: uncorrected and corrected cross products,
variance-covariance and correlation matrices, and others. The cross-product matrix is used
as input for further analysis. The regression equation for these data is:

Yield = 182.78 + 0.825 (pod length)

The s.e. of the slope is 0.737, which is high relative to the estimate itself, an
indication that the estimated slope is not precise. In the ANOVA table, the regression sum
of squares is not significant at the 5% level, and the t value for b (1.119) indicates
that there is no strong evidence of linear dependence of yield on pod length (p = 0.265).

The coefficient of determination is calculated as the regression sum of squares divided by
the total sum of squares. For this example R² = 0.011 (4346.77/413722.59), which is very
low, again indicating a poor fit of the model. The regression d.f. is one because two
parameters are estimated: d.f. = 2 − 1 = 1.

In the residual table three important results are given: predicted, residual and standardized
residuals.

The predicted column is obtained by substituting pod length into the equation and
calculating the corresponding yield. For example, the predicted value for case 1 is:

Predicted yield = 182.78 + 0.825 × 74.0 = 243.83

The residual is defined as the difference between the observed value of yield and the
value predicted for a given pod length. For example, the residual for case 1 is:

Residual = observed − predicted = 227.0 − 243.83 = −16.83

The standardized residual, given in the last column of the residual table, is calculated
as residual/√(residual MS). For case 1, for example:

Std. res. = −16.83/√3469.29 = −0.2857

Large residuals are signals of problems: they simply mean that the fitted regression model
does not predict those particular values well, so there may be a problem with those
observations in the data set. The value could be an outlier, or mistyped, or it could even
be a true value. It is therefore good practice to go back to the data and examine
observations with large residuals. In these data, for example, observations 24, 44, 76, 84
and 94 have large residuals and need to be re-examined.

Data file : COMPACT
Title : SOIL COMPACTION

Function : MULTIREG
Data case no. 1 to 120

10-POD LENGTH
YIELD

4 5
4 6.38679e+003
5 5.26896e+003 4.13723e+005

Variance - Covariance Matrix


-------------------------------

4 5
4 5.36705e+001
5 4.42770e+001 3.47666e+003

Correlation Matrix
---------------------

4 5
4 1.000
5 0.103 1.000

Variable Regression Standard Std. Partial Std. Err. of Student


Number Coefficient Error Regr. Coeff. Partial Coef T Value Prob.
-------------------------------------------------------------------------------
4 8.2498e-001 7.3702e-001 1.0250e-001 9.1573e-002 1.119 0.265
-------------------------------------------------------------------------------

Intercept = 182.784280

Coefficient of Determination (R-Square) = 0.011


Adjusted R-Square = 0.002
Multiple R = 0.103
Standard Err of Est. = 58.901

ANALYSIS OF VARIANCE TABLE

Sum of Squares df Mean Square F Signif


------------------------------------------------------------------------------------
Regression 4346.771175 1 4346.77118 1.25 0.265
Residual 409375.820492 118 3469.28661
Total 413722.591667 119
-------------------------------------------------------------------------------------

RESIDUAL TABLE (sample)

Case# Observed Predicted Residual Std. Res


---------------------------------------------------------
1 227.000 243.833 -16.8326 -0.2858
2 355.000 246.308 108.6925 1.8454
3 229.000 249.607 -20.6074 -0.3499
4 257.000 246.308 10.6925 0.1815
5 218.000 251.257 -33.2574 -0.5646
6 217.000 251.257 -34.2574 -0.5816
7 240.000 253.732 -13.7323 -0.2331
8 347.000 257.032 89.9678 1.5274
9 241.000 251.257 -10.2574 -0.1741
10 272.000 253.732 18.2677 0.3101
11 192.000 251.257 -59.2574 -1.0061
12 206.000 253.732 -47.7323 -0.8104
13 184.000 257.857 -73.8572 -1.2539
14 318.000 257.032 60.9678 1.0351
15 291.000 261.157 29.8429 0.5067
16 310.000 259.507 50.4928 0.8573
17 176.000 253.732 -77.7323 -1.3197
18 220.000 252.082 -32.0824 -0.5447
19 221.000 250.432 -29.4324 -0.4997
20 241.000 253.732 -12.7323 -0.2162
21 178.000 258.682 -80.6822 -1.3698
22 241.000 257.032 -16.0322 -0.2722
23 217.000 261.157 -44.1571 -0.7497
24 426.000 256.207 169.7927 2.8827
25 262.000 246.308 15.6925 0.2664
26 252.000 258.682 -6.6822 -0.1134
27 232.000 251.257 -19.2574 -0.3269
28 248.000 252.907 -4.9074 -0.0833
29 233.000 251.257 -18.2574 -0.3100
30 192.000 250.432 -58.4324 -0.9921
31 221.000 255.382 -34.3823 -0.5837

SAS

The following SAS program (excluding the data step) produces the output that follows.

Proc Reg;
  Model yld = plgt / ss1 p r clm;
Run;

Note that in the model statement, the commands after the "/" sign denote options. For
example, "p" tells SAS to produce predicted values and residuals. To fit a regression
without an intercept, add the "noint" option after the slash.

SAS gives the ANOVA table, R², CV and parameter estimates, just like MSTATC. If all
options are used, it also gives output ranging from the ANOVA to model diagnostic tools.

SAS 16:22 Tuesday, September 5, 1989 1

Model: MODEL1
Dependent Variable: YLD

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Prob>F

Model 1 4346.77118 4346.77118 1.253 0.2653


Error 118 409375.82049 3469.28661
C Total 119 413722.59167

Root MSE 58.90065 R-square 0.0105


Dep Mean 255.14167 Adj R-sq 0.0021
C.V. 23.08547

Parameter Estimates

Parameter Standard T for H0:


Variable DF Estimate Error Parameter=0 Prob > |T|

INTERCEP 1 182.784280 64.86594668 2.818 0.0057


PLGT 1 0.824977 0.73701905 1.119 0.2653

Other packages
Regression can also be done in several other packages: SPSS, MINITAB, GENSTAT, INSTAT,
etc.

11.2 Simple Linear Correlation Analysis
Simple linear correlation analysis deals with the estimation and test of significance of
the simple linear correlation coefficient, r, which measures the degree of linear
association between two variables X and Y. The value of r lies within the range −1 to +1,
with the extreme values indicating perfect linear association and the mid value of zero
indicating no linear association. Intermediate values of r indicate the portion of the
variation in one variable that can be accounted for by a linear function of the other. The
negative and positive signs indicate the type of association: as one variable increases
the other decreases, or as one increases the other increases as well. Note that r = 0
indicates only the absence of a linear relationship; it does not show the absence of any
relationship, since we are concerned with linear relationships only.

Analysis

Steps:

1) Compute r = ∑xy / √[(∑x²)(∑y²)]

2) Test the significance of the simple linear correlation by comparing the computed r
value with the tabular r value. r is significant at the α level of significance if the
absolute value of the computed r is greater than the corresponding tabular r at the α
level.

Correlation analysis is one of the simplest analyses and can be fitted in a range of
statistical and other packages, such as Excel. It may be obtained directly, when a
correlation analysis is requested, or indirectly, when another model such as a regression
is fitted. In the MSTATC example above, the correlation between yield and pod length is
0.1, which is not significant at the 5% probability level.
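
In SAS, for example, the same correlation can be obtained with PROC CORR; the sketch below
uses the yield (YLD) and pod length (PLGT) variables of the soil compaction example, with
a hypothetical data set name:

Proc Corr Data=compact;
  Var yld plgt;   /* prints r and its p-value for each pair */
Run;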

11.3 Homogeneity of Regression Coefficients

In factorial experiments where more than one factor is used, the linear relationship
between the response and a given factor may have to be examined over the different levels
of the other factors. When several linear regressions are estimated, it is usually
important to determine whether the regression coefficients differ from each other.

Let us see for two regression lines:

Y1 = a1 + b1X
Y2 = a2 + b2X

Step 1) Calculate the residual mean squares s²yx(1) and s²yx(2).

Step 2) Compute the pooled residual mean square as:

   s²p = [(n1 − 2)·s²yx(1) + (n2 − 2)·s²yx(2)] / (n1 + n2 − 4)

Step 3) Compute the t value as:

   t = (b1 − b2) / √[s²p·(1/∑x1² + 1/∑x2²)]

Step 4) Compare the computed t with the tabular t value with (n1 + n2 − 4) d.f.
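
The same comparison can be sketched in SAS as an analysis of covariance: the group main
effect tests the difference in intercepts, and the group-by-regressor interaction tests
the equality of slopes. The variable names below follow the compaction example, and the
data set name is hypothetical:

Proc GLM Data=soil;
  Class comp;                          /* grouping factor: compaction method */
  Model yld = comp plgt comp*plgt;     /* comp*plgt tests slope homogeneity */
Run;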

The above example, the regression of yield on pod length, was re-analyzed using compaction
as a grouping variable in MSTATC (REGR option). The output below gives a lot of
information: mean values, variances and covariances of X and Y in one table, and test
statistics for the slope (b) of each group. The slope for compaction 1 is 2.37,
significant at 3%, and for compaction 2 it is −0.44, not significant at 5%. The ANOVA
tables test whether the regression coefficients are homogeneous. Here the difference
between levels (intercepts) is not significant, but the difference between angles (slopes)
is nearly significant, at 5.7%, which is evidence that the slopes are heterogeneous. It is
therefore reasonable to conclude that separate regression lines are appropriate for each
group:

1. Yield = 53.4 + 2.37 (pod length) .................. for compaction 1
2. Yield = 287.9 − 0.44 (pod length) ................. for compaction 2

A detailed discussion of the comparison of regressions is given later.

Data file : COMPACT


Title : SOIL COMPACTION

Function : REGR
Data case no. 1 to 120

REGRESSION
X-variable 4 10-POD LENGTH
Y-variable 5 YIELD
Group variables 2

From To DF X-BAR Y-BAR VAR.x VAR.y COVAR
----------------------------------------------------------------------------------------
1 60 58 87.43 260.85 49.74 3698.37 118.02
61 120 58 87.98 249.43 58.36 3247.61 -25.52
---------------------------------------------------------------------------------------
Total 118 87.71 255.14 53.67 3476.66 44.28
Within Gr 117 54.05 3472.99 46.25
Between Gr 0 9.08 3910.21 -188.38

From To DF r a b s.b t P%
------------------------------------------------------------------------------------
1 60 58 0.2752 53.4071 2.3726 1.0885 2.1796 0.033
61 120 58 -0.0586 287.9072 -0.4373 0.9779 -0.4472
---------------------------------------------------------------------------------------
Total 118 0.1025 182.7843 0.8250 0.7370 1.1193 0.265
Within 117 0.1067 0.8557 0.7368 1.1613 0.248
Between 0 -1.0000 -20.7576 0.0000 0.0000

TEST FOR DIFFERENCES BETWEEN LEVEL REGRESSIONS

ANALYSIS OF VARIANCE TABLE

Degrees of Sum of Mean F


Source Freedom Squares Square Value Prob
-----------------------------------------------------------------------------------------
Differences 2 16741.862 8370.931 2.47 0.089
Differences in level 1 4233.210 4233.210 1.22 0.271
Error 117 405142.611 3462.757
Differences in angle 1 12508.652 12508.652 3.70 0.057
Error 116 392633.959 3384.776
------------------------------------------------------------------------------------------

11.4 Multiple Linear Regression

The simple linear regression and correlation analysis described above has an obvious
limitation: it is applicable only to cases with one independent variable. However, with
the increasingly accepted perception of the interdependence between factors of production,
and the increasing availability of experimental procedures that can evaluate several
factors simultaneously, researchers are making increasing use of factorial experiments, so
the need to handle several independent variables is obvious.

When all independent variables are assumed to affect the dependent variable in a linear
fashion and independently of one another, the procedure is called multiple linear
regression analysis. The equation for k independent variables is

Y = α + β1X1 + β2X2 + ... + βkXk

The data consist of n cases of (k + 1) observations each, laid out as:

Case no.   Y     X1     X2    ...   Xk
1          Y1    X11    X21   ...   Xk1
2          Y2    X12    X22   ...   Xk2
3          Y3    X13    X23   ...   Xk3
.          .     .      .     ...   .
n          Yn    X1n    X2n   ...   Xkn

Analysis: let us illustrate with two independent variables (the 2nd-degree polynomial is
the special case X2 = X²):

Y = α + β1X1 + β2X2

Step 1) Solve the normal equations:

   b1∑x1² + b2∑x1x2 = ∑x1y
   b1∑x1x2 + b2∑x2² = ∑x2y

Solving for b1 and b2 gives:

   b1 = [(∑x2²)(∑x1y) − (∑x1x2)(∑x2y)] / [(∑x1²)(∑x2²) − (∑x1x2)²]

   b2 = [(∑x1²)(∑x2y) − (∑x1x2)(∑x1y)] / [(∑x1²)(∑x2²) − (∑x1x2)²]

Step 2) Compute the intercept α as:

   α = ȳ − b1·x̄1 − b2·x̄2

Step 3) Compute:

   • the SS due to regression: SSR = b1∑x1y + b2∑x2y

   • the residual SS: SSE = ∑y² − SSR

   • the coefficient of determination: R² = SSR/∑y²
Step 4) Test the significance of R²:

   • Compute the F value as:

     F = (SSR/k) / [SSE/(n − k − 1)]

   • Compare the computed F value with the tabular F value with (k, n − k − 1), here
     (2, n − 3), d.f.

The significance of the linear regression implies that some portion of the variability in
Y is indeed explained by the linear function of the independent variables. The size of the
R² value indicates how large that portion is: the larger R² is, the more important the
regression equation is in characterizing Y. However, this is not always the case,
particularly when there is multicollinearity between the independent variables.

MSTATC output for multiple regression is given below. From the table of parameter
estimates it is clear that seeds per pod, seed weight and pods per plant contribute
significantly to the variability in yield, whereas pod length does not. Normally, if the
s.e. of a slope is greater than the coefficient estimate itself, that variable should be
dropped even if it is significant: since the s.e. is a measure of precision, a coefficient
with a large s.e. estimates the population coefficient imprecisely and may not belong in
the model. Therefore, watch the standard errors in multiple regression.

The multiple regression model for yield is:

Yield = −423.4 + 36.7 (seeds/pod) + 0.90 (seed weight) + 10.3 (pods/plant),  R² = 0.89

One of the difficulties in regression analysis is the choice of variables to include in
the model. For example, in the above data, when pod length was included alone, its
coefficient was 0.825 with R² = 0.011. When seeds/pod was added, both the coefficient and
R² changed: the coefficient became 0.64 while R² rose to 0.02. When seed weight and
pods/plant were further added, there was a considerable change in both: the coefficient
for seeds per pod is now 36.7 and R² = 89%, showing that most of the variability in yield
is now explained by the independent variables. Note, therefore, that as the number of
variables in the model increases, R² also increases.

Data file : COMPACT


Title : SOIL COMPACTION

Function : MULTIREG
Data case no. 1 to 120

10-POD LENGTH
SEED PER POD
1000-SEED WEIGHT
PODS PER PLANT
YIELD

Uncorrected
Minimum Maximum Sum Mean Sum of Squares
-----------------------------------------------------------------------------------
4 64.00 106.00 10525.00 87.708 929517. 00
6 3.00 7.00 703.00 5.858 4173. 00
7 148.00 294.00 22659.00 188.825 4319683.00
8 14.00 39.00 2808.00 23.400 69640.00
5 169.00 446.00 30617.00 255.142 8225395.00
------------------------------------------------------------------------------------

120 Cases read 0 Missing cases discarded

Determinant of matrix = 0.467169

Variable Regression Standard Std. Partial Std. Err. of Student


Number Coefficient Error Regr. Coeff. Partial Coef T Value Prob.
------------------------------------------------------------ -----------------------------------------
4 5.9987e-001 3.2000e-001 7.4532e-002 3.9759e-002 1.875 0.063
6 3.6668e+001 3.3296e+000 4.2121e-001 3.8247e-002 11.013 0.000
7 8.9639e-001 1.3418e-001 2.8252e-001 4.2290e-002 6.680 0.000
8 1.0336e+001 3.4283e-001 1.0077e+000 3.3425e-002 30.149 0.000
-------------------------------------------------------------------------------------------------------

Intercept = -423.404368

Coefficient of Determination (R-Square) = 0.890


Adjusted R-Square = 0.886
Multiple R = 0.944
Standard Err of Est. = 19.869

ANALYSIS OF VARIANCE TABLE

Sum of Squares df Mean Square F Signif


---------------------------------------------------------------------------------------
Regression 368321.096939 4 92080.27423 233.24 0.000
Residual 45401.494728 115 394.79561
Total 413722.591667 119
----------------------------------------------------------------------------------------

Variable Selection

So far we have simply entered the four independent variables and estimated their
individual contributions; we do not know which combination gives the best regression
model. This matters because some of the independent variables may be highly correlated
(multicollinearity) and give different results depending on which others are in or out of
the model. Logically, if two variables are correlated, they contribute the same thing to
the dependent variable, and using both of them in the model may simply waste resources and
time.

Therefore, selection of the best contributing variables is necessary. There are several
possibilities for variable selection in regression: forward, backward and stepwise.
Forward selection uses a criterion, set in advance, for entering variables into the model:
at the first step it chooses the best performing variable and enters it; at the second
step it looks for the next best variable that satisfies the criterion, and so forth;
finally, only the variables that managed to enter are reported. Backward selection is the
reverse procedure: it starts with the full model (containing all variables) and drops
variables that fail the criterion for staying, until all remaining variables meet the
standard set. Stepwise regression, as the name indicates, uses both the forward and
backward approaches simultaneously.

Selection methods are not available in MSTATC; here SAS was used to run a stepwise
selection on the same data set. The model statement of the SAS program in the simple
regression section may be modified to handle multiple regression:

Proc Reg;
  Model yld = plgt pod seed_pd seedw / selection=stepwise sle=0.05;
Run;

In the first step, SAS selected POD (pods/plant) for entry, since it satisfied the
criterion: a significance level for entry (SLE) of 0.05. At the second step SEED_PD was
entered, and POD could not be dropped, and so on. At each step the selection procedure
checks which variables satisfy the criterion and contribute more than all others in the
pipeline for entry into the model. After a variable is entered, the procedure again looks
among the variables already in the model for one that now falls short of the criterion
because of the new entry, and excludes it. Thus at each step a double check occurs: a new
variable is entered into the model and its effect on those already in the model is
evaluated. This takes care of multicollinearity: if two variables are highly correlated
and both contribute strongly to the dependent variable, they obviously cannot both remain
in the model, because when one is added while the other is already in the model, the
contribution of the latter given the former is too small to make sense.

Let us explain the problem of variable selection with a simple example. Assume that the
Biodiversity Institute is collecting landraces that best contribute to the nation's
biodiversity collection. The landraces collected are expected to be unique, each an
independent asset of the nation. How are they selected? Suppose a botanist finds a new
landrace and wants to include it in the collection. He should first make sure that this
new landrace is not of the same type as any landrace already on the shelf, and for this he
must have mechanisms, which we call criteria. Otherwise the institute may fill all its
available space with two or more landraces of the same type that do not contribute
uniquely to the collection. Variable selection in regression analysis works in exactly the
same way: we do not want two variables of the same nature in the model, since they would
not contribute uniquely to the dependent variable.

SAS 11:42 Wednesday, September 6, 1989
4

Stepwise Procedure for Dependent Variable YLD

Step 1 Variable POD Entered R-square = 0.69613465 C(p) =202.43302320

DF Sum of Squares Mean Square F Prob>F

Regression 1 288006.63319772 288006.63319772 270.33 0.0001


Error 118 125715.95846894 1065.38947855
Total 119 413722.59166667

Parameter Standard Type II


Variable Estimate Error Sum of Squares F Prob>F

INTERCEP 54.89459588 12.53839737 20421.30676655 19.17 0.0001


POD 8.55756713 0.52047907 288006.63319772 270.33 0.0001

Bounds on condition number: 1, 1


--------------------------------------------------------------------------------

Step 2 Variable SEED_PD Entered R-square = 0.79838537 C(p) = 97.28027733

DF Sum of Squares Mean Square F Prob>F

Regression 2 330310.06647126 165155.03323563 231.66 0.0001


Error 117 83412.52519541 712.92756577
Total 119 413722.59166667

Parameter Standard Type II


Variable Estimate Error Sum of Squares F Prob>F

INTERCEP -130.71551128 26.18770601 17762.52677099 24.91 0.0001


SEED_PD 28.59843509 3.71259149 42303.43327354 59.34 0.0001
POD 9.32982959 0.43741071 324350.07868186 454.96 0.0001

Bounds on condition number: 1.055444, 4.221777


--------------------------------------------------------------------------------

Step 3 Variable SEEDW Entered R-square = 0.88690768 C(p) = 6.51410119

DF Sum of Squares Mean Square F Prob>F

Regression 3 366933.74522700 122311.24840900 303.24 0.0001


Error 116 46788.84643967 403.35212448
Total 119 413722.59166667

Parameter Standard Type II
Variable Estimate Error Sum of Squares F Prob>F

INTERCEP -415.54726410 35.79820428 54350.39682349 134.75 0.0001


SEED_PD 39.44003681 3.01541072 69002.69009415 171.07 0.0001
SEEDW 1.04451212 0.10961618 36623.67875573 90.80 0.0001
POD 10.35923279 0.34629190 360956.50524625 894.89 0.0001

SAS 11:42 Wednesday, September 6, 1989


5

Bounds on condition number: 1.230651, 10.87248


--------------------------------------------------------------------------------

All variables in the model are significant at the 0.0500 level.


No other variable met the 0.0500 significance level for entry into the model.

Summary of Stepwise Procedure for Dependent Variable YLD

Variable Number Partial Model


Step Entered Removed In R**2 R**2 C(p) F Prob>F

1 POD 1 0.6961 0.6961 202.4330 270.3299 0.0001


2 SEED_PD 2 0.1023 0.7984 97.2803 59.3376 0.0001
3 SEEDW 3 0.0885 0.8869 6.5141 90.7983 0.0001

11.5 Non-Linear Relationships

The functional relationship between two variables is non-linear if the rate of change in Y
associated with a unit change in X is not constant over a specified range of X values.
Non-linear relationships among variables are common in biological organisms, especially if
the range of values is wide. A typical example is the pattern of plant growth over time,
which usually starts slowly, increases at a fast rate at intermediate growth stages, and
slows towards the end of the life cycle.

Simple Non-Linear Regression

There are numerous functional forms that can describe a non-linear relationship between
two variables, and the choice of the appropriate regression and correlation technique
depends on the functional form involved.

The first technique is linearization, either through transformation or through the
creation of new variables. This has two advantages:

1) after linearization, the regression procedure for the linear relationship is
applicable;

2) the technique has wide applicability, because most of the non-linear relationships
found in agricultural research can be linearized through variable transformation or
variable creation.

A. Transformation of variables: a non-linear relationship between variables can often be
linearized by transforming one or both of them. Common forms encountered in agricultural
research that can be linearized through variable transformation are:

1) The non-linear form Y = αe^(βX) is transformed to Y' = α' + βX by taking the natural
logarithm, where Y' = ln Y and α' = ln α.

2) The non-linear form Y = αβ^X is transformed to Y' = α' + β'X by taking logarithms,
where Y' = log Y, α' = log α and β' = log β.

3) The non-linear form 1/Y = α + βX is transformed to Y' = α + βX, where Y' = 1/Y.

4) The non-linear form Y = α + β/X is transformed to Y = α + βX', where X' = 1/X.

5) The non-linear form Y = (α + β/X)^(−1) is transformed to Y' = α + βX', where Y' = 1/Y
and X' = 1/X.

After transformation, apply the simple linear regression technique. After estimating the
regression coefficients, use them to derive appropriate estimates of the original
regression parameters, based on the specific transformation used.

Let us consider the following fictitious data set as an example:

Y       X
100     8.0
216     8.5
193     9.2
 94     6.0
208     5.8
306    10.3
177     9.5
290    10.1
256    12.0
285     8.8

We will now fit a growth curve of the form

Y = αe^(βX)

to this data set. The first activity is to linearize the model to the form

ln(Y) = ln(α) + βX

which means that Y has to be transformed using the natural logarithm function:

ln(Y) = 4.60, 5.37, 5.26, 4.54, 5.34, 5.72, 5.18, 5.67, 5.55 and 5.65

Fitting a linear regression (with ln(Y) as the dependent variable and X as the
independent), the parameters are estimated as:

ln(α) = 3.46 with s.e. = 0.46
β = 0.19 with s.e. = 0.19

Taking the antilogarithm of ln(α), α is estimated as 31.8. The curve on its original scale
is therefore

Y = 31.8e^(0.19X)
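
The same fit can be sketched in SAS: transform, regress, then back-transform (data set and
variable names are hypothetical):

Data growth2;
  Set growth;
  lny = Log(y);        /* natural logarithm of the response */
Run;

Proc Reg Data=growth2;
  Model lny = x;       /* the intercept estimates ln(alpha) */
Run;

The estimate of α is then recovered as exp(intercept), here exp(3.46) = 31.8.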

B. Creation of new variables: in agricultural research this technique is most commonly
applied to the kth-degree polynomial:

Y = α + β1X + β2X² + ... + βkX^k

Such an equation can be linearized by creating k new variables Z1, Z2, ..., Zk to form a
multiple linear equation of the form:

Y = α + β1Z1 + β2Z2 + ... + βkZk,   where Zk = X^k

With the linearized form resulting from the creation of new variables, the procedure for
multiple linear regression and correlation analysis can be applied directly.

The most common use of the polynomial form is in response curve studies, for which the
quadratic polynomial

Y = α + β1X + β2X²

is often used to locate the optimum response to treatments. A fictitious fertilizer trial
(shown below), with three levels of DAP applied and yield Y harvested, is used as an
example. The objective is to find the fertilizer level that gives optimum yield, assuming
a quadratic relationship between yield and fertilizer.

Y      X      X²
8.5    1.2    1.44
9.3    1.2    1.44
8.6    1.2    1.44
18.4   2.4    5.76
23.8   2.4    5.76
19.5   2.4    5.76
10.3   3.6    12.96
11.1   3.6    12.96
10.9   3.6    12.96

It is good to inspect the pattern of the data in a graph to see whether the quadratic form
is valid. Here, yield increased up to the 2.4 application rate and then declined steadily,
which indicates a near-quadratic relationship. A multiple linear regression of Y on X (X1)
and X² (X2) was fitted to estimate b1 and b2. The result was:

Y = −24.53 + 36.76X − 7.49X²

The next step is to find the fertilizer level that gives optimum yield. This is done using
the differentiation principle of calculus: differentiate Y with respect to X and set the
result to zero. If the second derivative ∂²y/∂x² is negative, the solution is a maximum,
which is the case here since ∂²y/∂x² = −14.98.

∂y/∂x = 36.76 − (2 × 7.49)X = 36.76 − 14.98X = 0
X = 2.45

We may therefore conclude that the maximum yield is obtained at a fertilizer level of
2.45.
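
The whole calculation can be sketched in SAS (hypothetical names): create the squared
term, fit the quadratic, and locate the optimum as −b1/(2·b2):

Data fert2;
  Set fert;
  x2 = x * x;          /* create the quadratic term */
Run;

Proc Reg Data=fert2;
  Model y = x x2;
Run;

With the estimates above, the optimum is −36.76/(2 × (−7.49)) = 2.45, the same value
obtained by differentiation.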

Non-Lineari zed non-Linear Regressing

For functions that cannot be linearized, such as population growth curves and lactation curves
for cows, SAS provides the procedure NLIN. It can handle any non-linear model as long as the
functional form is known. For example, for the population growth model Y = A·exp[B(x - 1790)],
the following program estimates the required parameters.

proc nlin;
  parms A = 3.9
        B = 0.022;
  model pop = A*exp(B*(year - 1790));
run;

Non-linear regression usually requires initial values for the parameters to be estimated,
because the model is non-linear in the parameters and no closed-form solution exists. The
fitting is therefore iterative: starting from the supplied initial values, the procedure
repeatedly refines the parameter estimates until the system converges.

In the above example, A and B are the parameters to be estimated, so initial values have been
supplied for SAS to start from. Only "pop" is specified as the dependent variable in the
model.

For this example, SAS required 8 iterations for the estimates to converge. The first two and
the last two iterations are shown below.

Iter        A          B          SS
0           3.9        0.022      13963.7
1           5.55       0.9196     8577.3
...         ...        ...        ...
7           13.998     0.0147     1861.6
8           13.9979    0.0147     1861.6

Here, the iterative estimation of A converged at the 8th round: the difference between 13.998
and 13.9979 is so small that even if a 9th iteration were run, no further refinement in the
value of A would be expected.

The NLIN procedure has several options that can be specified to obtain different results.
Note, however, that non-linear models are more difficult to specify and estimate than linear
ones. Some complicated models may therefore still be difficult to fit in SAS, and you must
examine your results carefully before reporting them.
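
For instance (a sketch only; the data set name census, the option values and the output
variable names are illustrative), the fitting method, the iteration limit and an output data
set of fitted values can be requested as follows:

proc nlin data=census method=marquardt maxiter=200;
  parms A = 3.9  B = 0.022;
  model pop = A*exp(B*(year - 1790));
  output out=fitted predicted=phat residual=rhat;  /* saves fitted values and residuals */
run;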

11.6 Comparison of Regressions


Consider a response variate Y and a regressor (explanatory) variate X, and assume that the
observations are grouped by a factor A with k levels. The factor could be variety, treatment,
location, etc. The objective here is to fit the regression of Y on X for the different levels
of the factor and to test which regression model fits best.

There are four possibilities for the model:

1. Separate lines for each level of the factor; the lines differ from each other in both
   slope and intercept.
2. Parallel lines for each level; the lines have the same slope but different intercepts, so
   the rate of change remains constant across levels.
3. Concurrent lines; the lines have different slopes but the same intercept.
4. Coincident lines, i.e. a single line through all data points.

The models may be summarized as follows, in the order they appear above.

167
Model   Equation                 Parameters               Residual SS   d.f.
1       E(Yij) = αi + βiXij      α1,...,αk, β1,...,βk     RSS1          n - 2k
2       E(Yij) = αi + βXij       α1,...,αk, β             RSS2          n - k - 1
3       E(Yij) = α + βiXij       α, β1,...,βk             RSS3          n - k - 1
4       E(Yij) = α + βXij        α, β                     RSS4          n - 2

Note: i denotes the ith level of the factor; j = 1, ..., ni; and n = Σ ni, summed over the
k levels.

To analyze data with these models, the data must be structured appropriately in the computer.
In addition to the explanatory variable X, we require dummy variables, which take the values
"1" and "0" only. The general data-entry structure is shown below (in matrix form). It is
composed of four parts: the dependent variable Y, the k dummy variables, the independent
variable X, and the dummy-by-X interactions (D.X).

Y        D1    D2    ...   Dk      X       D1.X    D2.X    ...   Dk.X

y11      1     0     ...   0       x11     x11     0       ...   0
...      .     .           .       .       .       .             .
y1n1     1     0     ...   0       x1n1    x1n1    0       ...   0
y21      0     1     ...   0       x21     0       x21     ...   0
...      .     .           .       .       .       .             .
y2n2     0     1     ...   0       x2n2    0       x2n2    ...   0
...      .     .           .       .       .       .             .
yk1      0     0     ...   1       xk1     0       0       ...   xk1
...      .     .           .       .       .       .             .
yknk     0     0     ...   1       xknk    0       0       ...   xknk

Two ANOVA tables are possible, depending on whether parallel or concurrent lines are of
interest.

ANOVA 1

Source                                  df      SS
Single line                             1       Syy - RSS4
Deviation of parallel from single       k-1     RSS4 - RSS2
Deviation of separate from parallel     k-1     RSS2 - RSS1
Residual                                n-2k    RSS1
Total                                   n-1     Syy

ANOVA 2

Source                                    df      SS
Single line                               1       Syy - RSS4
Deviation of concurrent from single       k-1     RSS4 - RSS3
Deviation of separate from concurrent     k-1     RSS3 - RSS1
Residual                                  n-2k    RSS1
Total                                     n-1     Syy
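
In both tables, each deviation line is tested against the residual mean square of the full
(separate-lines) model. In general, for a reduced model R nested within the full model F, the
extra-sum-of-squares test statistic is

F = [(RSS_R - RSS_F) / (df_R - df_F)] / [RSS_F / (n - 2k)]

which is compared with the F distribution on (df_R - df_F) and (n - 2k) degrees of freedom.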

Let's consider the following hypothetical example.

It was suspected that the grain yield of potato (in kg per plot) depends linearly on the
number of tubers in a plot, because the potato research team believes that tuber size does
not differ significantly between varieties 1 and 2. However, it was not known whether the
pattern of linearity is the same for both varieties. Therefore, a comparison of regressions
was proposed.

Variety Yield Number of tubers


1 90.5 6
1 88.3 6
1 72.4 4
1 103.6 12
1 99.9 13
1 54.5 3
1 33.6 2
1 48.4 3
2 112.3 15
2 106.0 10
2 93.6 7
2 26.4 2
2 58.9 3
2 77.8 5
2 108.7 9
2 100.5 9

The next step is to enter the data into the computer. We already have the response and the
explanatory variable; what we additionally need are the dummy variables for the two varieties,
since the variable "variety" in the data set above will not help us any more. To generate a
separate explanatory variable (tuber number) for each variety, we multiply each dummy by tuber
number to obtain the interactions (D.X), as sketched in the data step below.
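
The following data step is a minimal sketch of this construction (the data set names potato
and compreg are illustrative; the variable names match those used in the analysis that
follows):

data compreg;
  set potato;              /* raw data: variety, yield, x (tuber number) */
  d1 = (variety = 1);      /* the comparison evaluates to 1 or 0 */
  d2 = (variety = 2);
  d1x = d1*x;              /* dummy-by-X interactions */
  d2x = d2*x;
run;

The expanded data set then has the following layout: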

Yield   D1 (variety 1)   D2 (variety 2)   D1.X   D2.X   X
90.5 1 0 6 0 6
88.3 1 0 6 0 6
72.4 1 0 4 0 4
103.6 1 0 12 0 12
99.9 1 0 13 0 13
54.5 1 0 3 0 3
33.6 1 0 2 0 2
48.4 1 0 3 0 3
112.3 0 1 0 15 15
106.0 0 1 0 10 10
93.6 0 1 0 7 7
26.4 0 1 0 2 2
58.9 0 1 0 3 3
77.8 0 1 0 5 5
108.7 0 1 0 9 9
100.5 0 1 0 9 9

The analysis can be done in any statistical package, as long as we know which variables to fit
for the different line types listed above. For example, the following SAS commands analyze the
data as required. Two alternative parameterizations are available for fitting these models.

Alternative one:

proc reg data = compreg;
  model yield = D1 D2 D1X D2X;   /* separate lines   */
  model yield = D1 D2 X;         /* parallel lines   */
  model yield = D1X D2X X;       /* concurrent lines */
  model yield = X;               /* single line      */
run;

In this case D2 will be set to zero in the first two models, since it is a linear combination
of the intercept and D1. Writing the coefficients explicitly, the model for separate lines is:

Y = β0 + β1D1 + β2(D1X) + β3(D2X)

Hence, for variety 1 (D1 = 1): Y = (β0 + β1) + β2X (note that the intercept is now β0 + β1)

for variety 2 (D1 = 0): Y = β0 + β3X

For parallel lines, the general model is

Y = β0 + β1D1 + βX

For variety 1: Y = (β0 + β1) + βX

For variety 2: Y = β0 + βX

The SAS program for the second alternative is the following:

proc reg data = compreg;
  model yield = D2 D2X X;        /* separate lines   */
  model yield = D2 X;            /* parallel lines   */
  model yield = D2X X;           /* concurrent lines */
  model yield = X;               /* single line      */
run;

For the separate lines, the general model is:

Y = β0 + β1D2 + β2(D2X) + β3X

for variety 1 (D2 = 0): Y = β0 + β3X

for variety 2 (D2 = 1): Y = (β0 + β1) + (β2 + β3)X

For parallel lines:

Y = β0 + β1D2 + βX

variety 1: Y = β0 + βX
variety 2: Y = (β0 + β1) + βX

SAS gives an ANOVA table and parameter estimates for each model fitted. For example, the
ANOVA table for the separate-lines analysis (first alternative) is as follows.


Model: MODEL1
Dependent Variable: YLD

Analysis of Variance

                        Sum of         Mean
Source        DF       Squares       Square    F Value   Prob>F

Model          3    8851.21081   2950.40360     13.563   0.0004
Error         12    2610.30669    217.52556
C Total       15   11461.51750

Root MSE    14.74875     R-square   0.7723
Dep Mean    79.71250     Adj R-sq   0.7153
C.V.        18.50243

The following parameters have been set to 0, since the variables are a linear combination of
other variables as shown:

D2 = +1.0000 * INTERCEP - 1.0000 * D1


Parameter Estimates

                Parameter       Standard    T for H0:
Variable   DF    Estimate          Error    Parameter=0   Prob > |T|

INTERCEP    B   38.982661    11.21902861      3.475         0.0046
D1          B    2.062913    14.81456014      0.139         0.8916
D2          0    0            0.00000000       .             .
D1X         1    5.363988     1.33052604      4.031         0.0017
D2X         1    6.205645     1.32447664      4.685         0.0005

The equations for the two varieties may be given as:

For variety 1: Yield = (38.983 + 2.063) + 5.364X => Y = 41.046 + 5.364X

For variety 2: Yield = 38.983 + 6.206X

Let's now construct the two ANOVA tables by subtracting the error sum of squares of one model
from that of another, to obtain the extra sums of squares.

ANOVA 1

Source                                 df   SS                          MS      F
Single line                            1    8755                        8755    40.3**
Deviation of parallel from single      1    2706.25 - 2654.03 = 52.22   52.22   0.24
Deviation of separate from parallel    1    2654.03 - 2610.3  = 43.73   43.73   0.20
Residual                               12   2610.3                      217.5

ANOVA 2

Source                                   df   SS                          MS      F
Single line                              1    8755                        8755    40.3**
Deviation of concurrent from single      1    2706.25 - 2614.52 = 91.73   91.73   0.42
Deviation of separate from concurrent    1    2614.52 - 2610.3  = 4.22    4.22    0.019
Residual                                 12   2610.3                      217.5

Once the deviations are obtained, the tests are conducted in the usual way, that is, by
comparing the mean square for a given deviation with the error mean square of the full model
(the model that contains the largest number of parameters). Consequently, none of the
parallel, separate or concurrent lines is appropriate here. Note that the question is which
regression line best represents the pattern in the data set: assuming a linear relationship
between yield and number of tubers, we are testing the additional contribution of parallel,
separate or concurrent lines as compared with a single line through both varieties. The
additional contribution of, say, the parallel lines is the deviation of the error SS for the
single line from that of the parallel lines. For this data set, a single line through both
varieties was found to be appropriate. Hence, the final model chosen is

Yield = 39.75 + 5.87X

The same analysis can be done in other statistical packages such as MSTATC, SPSS and GENSTAT,
as long as one follows the data-entry and analysis steps described earlier.

12. Exercises
1. Four treatments, A, B, C and D, are to be used in an experiment. You have 25 plots
   available in a rectangular area of the following nature:

[Layout sketch of the 25-plot rectangular area; only a column of plot labels (A, B, C, D, D)
survived extraction.]

Design an experiment!

2. Suppose you have 3 treatments, A, B and C, to be compared, and a field 9 units long and
   3 units wide. Design the experiment, give the ANOVA layout and the calculation of SS.
   What are the possible designs?

3. Suppose we have 9 varieties, labeled 1, 2, ..., 9, in three blocks of size 3, replicated
   twice. Assume that we are less interested in the pair-wise comparisons (3,9), (8,4) and
   (3,1), but more interested in the pair-wise comparisons (3,7), (8,9), (6,5) and (5,2).

   Produce a sensible design.

4. We have 2 varieties, 2 N levels and 2 seed rates, and we want to apply N and seed rate to
   the main plots and variety to the sub-plots. Produce an appropriate split-plot design and
   sketch the ANOVA layout.

5. Give a SAS program for question number 3.

6. If an experiment with a lattice design is to be replicated in 3 locations, give a SAS
   program for doing a combined analysis.

7. Consider 81 entries in a lattice design. Suppose that Mr. A used a 9x9 lattice with 10
   replications, Mr. B with 5 replications and Mr. C with 2 replications. What is the
   difference between the three designs?

8. Suppose that you want to conduct an experiment with 3 varieties and 4 fertilizer levels,
   but cannot replicate. Discuss the possible solutions for conducting this experiment. In
   particular, is it possible to test the main-plot effect with a single replication?

13. Further Reading
1. Afifi, A.A. and Clark, V. (1984). Computer-Aided Multivariate Analysis. Lifetime Learning
   Publications, Belmont, California, Chapter 16.
2. Jones, B. and Kenward, M.G. Design and Analysis of Cross-Over Trials. Chapman and Hall.
3. Yandell, B.S. Practical Data Analysis for Designed Experiments.
4. Campbell, L.G. and Lafever, H.N. (1980). Effects of locations and years upon relative
   yields in the soft red winter wheat region. Crop Sci. 20: 23-28.
5. Lindner, C.C. and Rodger, C.A. (1997). Design Theory. CRC Press.
6. Cochran, W.G. (1977). Sampling Techniques, 3rd ed. John Wiley & Sons Inc., New York.
7. Cochran, W.G. and Cox, G.M. (1957). Experimental Designs, 2nd ed. John Wiley & Sons,
   New York.
8. Cox, D.R. (1958). Planning of Experiments. Wiley, New York.
9. Boniface, D.K. (1994). Experimental Design and Statistical Methods. CRC Press, UK.
10. Weber, D. and Skillings, J.H. (1999). A First Course in the Design of Experiments.
    CRC Press.
11. Federer, W.T. (1955). Experimental Design: Theory and Application. Macmillan, New York.
12. Gomez, K.A. and Gomez, A.A. (1984). Statistical Procedures for Agricultural Research,
    2nd ed. John Wiley & Sons, New York.
13. Mead, R. (1988). The Design of Experiments: Statistical Principles for Practical
    Applications. Cambridge University Press.
14. Mead, R., Curnow, R.N. and Hasted, A.M. (1999). Statistical Methods in Agriculture and
    Experimental Biology, 2nd ed. Oxford University Press.
15. Watt, T.A. Introductory Statistics for Biology Students.
16. Federer, W.T. Statistical Design and Analysis for Intercropping Experiments, Volume 2:
    Crops. Springer-Verlag.
17. Williams, W.T. (1976). Pattern Analysis in Agricultural Science. CSIRO & Elsevier
    Scientific Publishing Co., Melbourne.
18. SAS Institute (1999). SAS Manual. SAS Institute Inc., Cary, NC, USA.
