
M346-201306 XXX

The document outlines the structure and requirements for the M346/E Module Examination in Linear Statistical Modelling, scheduled for June 12, 2013. The exam consists of two parts, with Part 1 requiring one essay question and Part 2 requiring three questions, each focusing on various statistical modeling concepts. Specific guidelines for answering questions, including the importance of clarity and structure, are provided, along with detailed instructions for submitting completed work.

Uploaded by

mariuszwarczak91

M346/E 

Module Examination 2013


Linear Statistical Modelling

Wednesday 12 June 2013 10.00 am – 1.00 pm

Time allowed: 3 hours

This examination is in TWO parts. Part 1 carries 25% of the total available marks
and Part 2 carries 75%.

You should attempt ONE question from Part 1: this question carries 25 marks. You
should attempt THREE questions from Part 2: each question in this part also carries
25 marks.

You are advised not to cross through any work until you have replaced it with another
solution to the same question (or part of question). Crossed through work will not be
marked.

In Part 1 of the paper, if you answer both questions, your better score will count
towards your result. In Part 2 of the paper, if you answer more than three questions,
your best three scores will count towards your final mark.

This question paper is rather long because of the inclusion of tranches of GenStat
output. Do not let its length put you off. In your initial reading of the paper,
you will be able to either ignore or pass over very quickly all such output.

Please start each question on a new page, and cross out rough working.

At the end of the examination


Check that you have written your personal identifier and examination number on
each answer book used. (You may well have used only one answer book.) Failure to
do so will mean that your work cannot be identified.

Put all your used answer books together with your signed desk record on top. Fasten
them in the top left corner with the round paper fastener. Attach this question paper
to the back of the answer books with the flat paper clip.

Copyright © 2013 The Open University

PART 1 (Questions 1 and 2)
You should attempt ONE question from this part of the examination,
which carries 25% of the total available marks. Each question carries
25 marks. A guide to mark allocation is shown beside each question
thus: [4].
In each question in Part 1 you are asked to write a short essay on a
topic from the course. By the word ‘essay’, we do not mean to imply
that your answer should be entirely text; formulae and mathematical
symbols, if appropriate, are allowed. However, you should think of
this as an essay question in the senses of structure and readability.
Indeed, 4 of the 25 marks will be awarded for putting the essay
together in a reasonably clear manner, including a reasonable
structure with beginning, middle and conclusion, and reasonably
concise use of language. References to specific data-based examples in
the course are not expected. However, it may be useful to illustrate
points by giving special cases, perhaps in mathematical form (e.g.
Y ∼ N(0, σ²) is a special case of a distributional assumption, and
α + β₁x₁ + β₂x₂ is a special case of a formula for a regression mean).

Question 1
Write an essay describing the role of blocking in the design and
analysis of experiments.
Your answer should include:
• a brief description of what a block is in this context; [2]
• an outline of the reasons for using blocks in the design of an
experiment; [4]
• a brief description of an experimental situation where more than
one kind of block is involved, making it clear why it is necessary
to have more than one kind; [6]
• a brief explanation of how linear models for experimental data
take account of blocks; [5]
• a brief explanation of the reason why the ANOVA commands in
software like GenStat do not routinely display p values for blocks. [4]
The remaining four marks are for the clarity and structure of your
essay. [4]

M346 June 2013 2


Question 2
Write an essay in which you describe the role of transformations in
linear and generalized linear modelling.
Your essay should include:
• a brief definition of what transformation means in this context; [2]
• a brief explanation of the main reasons for transforming either
the response variable or the explanatory variable(s) in normal
linear models, making it clear which type of variable (response,
explanatory or both) is transformed in each case; [9]
• a brief explanation of the circumstances in which fitting a
generalized linear model might be a better alternative than
transforming the data and fitting a normal model; [5]
• a brief explanation of the circumstances in which you would still
want to transform the data in fitting a non-normal generalized
linear model. [5]
The remaining four marks are for the clarity and structure of your
essay. [4]

PART 2 (Questions 3 to 7)
You should attempt THREE questions from this part of the
examination, which carries 75% of the total available marks. Each
question carries 25 marks. The mark allocation for each part of each
question is shown beside each part thus: [4].

Question 3
Toughness and fibrousness of asparagus are major determinants of
quality. A method for quantifying asparagus texture is based on
measuring the maximum shear force necessary to cut through the
spears. A study was carried out where each of 18 randomly selected
spears of asparagus from a local market had its maximum shear force
to cut and its percentage content of dry fibre weight measured. The
data were recorded in a GenStat file containing the variables force and
weight with the measured shear forces (in Kgf) and fibre dry weights
(in %) respectively.
(a) A scatterplot of the data is given in Figure 1.

Figure 1
(i) On the basis of this plot, would you say it is reasonable to fit
a simple linear regression model to the data? Briefly explain
your answer. [3]
(ii) The researcher who gathered the data considered
transforming the variables by taking the log of both force and
weight. Under what circumstances would that be helpful? [2]

(b) The following output is generated by GenStat from fitting a simple
linear regression model, Model A, to the (untransformed) data.
Model A
Regression analysis
Response variate: weight
Fitted terms: Constant, force
Summary of analysis
Source d.f. s.s. m.s. v.r. F pr.
Regression 1 2.8320 2.83200 224.64 <.001
Residual 16 0.2017 0.01261
Total 17 3.0337 0.17845
Percentage variance accounted for 92.9
Standard error of observations is estimated to be 0.112.
Estimates of parameters
Parameter estimate s.e. t(16) t pr.
Constant 1.7588 0.0658 26.72 <.001
force 0.008340 0.000556 14.99 <.001
(i) What is the regression equation for Model A? On the basis of
this model, calculate a point estimate for the percentage fibre
dry weight of an asparagus spear for which the maximum
cutting shear force is 120 Kgf. [3]
(ii) The output for Model A gives the required information to
test the hypothesis that the slope of the regression line is
zero. Give the value of the test statistic and report the
results of this test, stating your conclusions clearly. [3]
(iii) A composite residual plot for Model A is given in Figure 2.
Explain whether the plot indicates that the assumptions of
the simple linear model hold for this model. [5]

Figure 2

(c) The asparagus spears in the study were selected from two groups,
one with green stems and the other with purple stems. It was
thought that green colour was a contributing factor to a higher
percentage of fibre content in asparagus. The data were recorded
in a GenStat factor called green, in which those with purple colour
were coded “0” and those with green colour were coded “1”.
To investigate whether colour had an effect on percentage of fibre
dry weight, the following two models were fitted using GenStat:
Model B: Constant + force + green
Model C: Constant + force + green + force.green
The following output was obtained.
Model B
Regression Analysis
Response variate: weight
Fitted terms: Constant + force + green
Summary of analysis
Source d.f. s.s. m.s. v.r. F pr.
Regression 2 2.9199 1.459962 192.46 <.001
Residual 15 0.1138 0.007586
Total 17 3.0337 0.178454
Percentage variance accounted for 95.7
Standard error of observations is estimated to be 0.0871.
Message: the following units have large standardized residuals.
Unit Response Residual
17 3.4900 2.13
Estimates of parameters
Parameter estimate s.e. t(15) t pr.
Constant 1.7821 0.0515 34.59 <.001
force 0.007450 0.000505 14.77 <.001
green 1 0.1644 0.0483 3.40 0.004
Parameters for factors are differences compared with the reference
level:
Factor Reference level
green 0

Model C
Regression Analysis
Response variate: weight
Fitted terms: Constant + force + green + force.green
Summary of analysis
Source d.f. s.s. m.s. v.r. F pr.
Regression 3 2.9200 0.973341 119.86 <.001
Residual 14 0.1137 0.008121
Total 17 3.0337 0.178454
Change -1 -0.0001 0.000099 0.01 0.914

Percentage variance accounted for 95.4
Standard error of observations is estimated to be 0.0901.
Message: the following units have large standardized residuals.
Unit Response Residual
17 3.4900 2.12
Estimates of parameters
Parameter estimate s.e. t(14) t pr.
Constant 1.7874 0.0719 24.85 <.001
force 0.007388 0.000765 9.66 <.001
green 1 0.152 0.125 1.22 0.244
force.green 1 0.00012 0.00105 0.11 0.914
Parameters for factors are differences compared with the reference
level:
Factor Reference level
green 0
(i) Give a simple description of the size of the effect of having
green colour on the percentage of fibre dry weight of an
asparagus spear, as identified by Model B. [1]
(ii) What can you say about whether there is a difference
between the regression slopes for the purple and green stem
groups, on the basis of Model C? [1]
(iii) With Model C, unit 17 was flagged as having a large residual
and units 6 and 15 had the highest leverage, but an index
plot of Cook statistics shows that only unit 17 has a large
value. Explain why. [3]
(iv) Explain carefully which of the three models (Models A, B
and C) most appropriately describes the relationship between
the percentage of fibre dry weight of an asparagus spear and
the shear force required to cut it. [4]
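
The figures in the GenStat output for Models A, B and C are linked by simple arithmetic. The following Python sketch illustrates those links; it is not part of the original paper, the numbers are copied from the output above, and all function and variable names are illustrative assumptions rather than GenStat identifiers. Small discrepancies from the printed values arise because the output is rounded.

```python
# Values copied from the GenStat output for Models A and C (Question 3).
# All names below are hypothetical, chosen for this illustration only.

def variance_ratio(ms_regression, ms_residual):
    """F statistic (GenStat's v.r.): regression m.s. over residual m.s."""
    return ms_regression / ms_residual

def pct_variance_accounted(ms_residual, ms_total):
    """Adjusted R-squared as a percentage: 100 * (1 - residual m.s. / total m.s.)."""
    return 100 * (1 - ms_residual / ms_total)

def model_a_weight(force):
    """Fitted % fibre dry weight from Model A for a shear force in Kgf."""
    return 1.7588 + 0.008340 * force

# Model A: overall F, slope t statistic and percentage variance accounted for.
f_model_a = variance_ratio(2.83200, 0.01261)
t_slope_a = 0.008340 / 0.000556          # estimate divided by its s.e.
r2_adj_a = pct_variance_accounted(0.01261, 0.17845)

# Model C versus Model B: extra-sum-of-squares F for the force.green term,
# i.e. the drop in residual s.s. (on 1 d.f.) over Model C's residual m.s.
f_change_c = (0.1138 - 0.1137) / 1 / 0.008121

print(f"Model A v.r.: {f_model_a:.2f}; slope t: {t_slope_a:.2f}")
print(f"Model A percentage variance accounted for: {r2_adj_a:.1f}")
print(f"Model A fitted weight at 120 Kgf: {model_a_weight(120):.4f}")
print(f"Model C change v.r.: {f_change_c:.2f}")
```

In a simple linear regression the square of the slope t statistic equals the regression v.r., which is why the two tests in the Model A output give the same p value.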

Question 4
These are data based on a 5% sample of all births occurring in
Philadelphia, USA in 1990. The sample has 1115 observations on five
variables:
black Mother is black (a factor, with 1=yes, 0=no),
educ Mother’s years of education (whole years, ranging from 0 to 17),
smoke Whether mother smoked during pregnancy (a factor, with
1=yes, 0=no),
gestate Gestational age in weeks, and
grams Birth weight in grams.
The response variable is grams, and the rest are explanatory variables.
(a) On the basis of plots of the data, it was decided that gestate
should be transformed to lgest = log(gestate) (using base e for the
logs). Give a reason why this decision might have been made. [2]
(b) Explain why it does not matter whether black (or smoke) is
recorded as a factor or as a variate in GenStat. [2]
(c) Using lgest in place of gestate, the following GenStat output gives
the correlation matrix for the variables in the data set, the results
of a multiple regression analysis of the full four-variable model,
and the results of four individual simple linear regressions of
grams on each of the explanatory variables in turn. (The
correlation matrix and full multiple regression analysis are given
direct from GenStat, except that the details of the high leverage
observations have been omitted to save space; the individual
simple linear regression results have been edited into a single
table.)

Correlations

black
educ    −0.1458
lgest   −0.1627    0.0563
smoke    0.0524   −0.2257   −0.1426
grams   −0.2565    0.1187    0.7000   −0.2281
          black      educ     lgest     smoke    grams
Number of observations: 1115

Regression analysis
Response variate: grams
Fitted terms: Constant, black, educ, lgest, smoke
Summary of analysis
Source d.f. s.s. m.s. v.r. F pr.
Regression 4 235948058. 58987015. 310.09 <.001
Residual 1110 211150093. 190225.
Total 1114 447098151. 401345.
Percentage variance accounted for 52.6
Standard error of observations is estimated to be 436.
Message: the following units have large standardized residuals.
Unit Response Residual
748 1900. −3.35
1045 4830. 3.60
Message: the following units have high leverage.
(Here GenStat gave a list of 24 units, with leverages between 0.0156
and 0.0696.)
Estimates of parameters
Parameter estimate s.e. t(1110) t pr.
Constant −15929. 623. −25.55 <.001
black 1 −178.0 27.2 −6.54 <.001
educ 10.45 6.46 1.62 0.106
lgest 5242. 168. 31.21 <.001
smoke 1 −176.3 31.6 −5.58 <.001
Parameters for factors are differences compared with the reference
level:
Factor Reference level
smoke 0
black 0

Estimates of parameters for the individual simple regressions


Parameter estimate s.e. t(1113) t pr.
black 1 −330.7 37.4 −8.85 <.001
educ 35.87 8.99 3.99 <.001
lgest 5572. 170. 32.70 <.001
smoke 1 −337.7 43.2 −7.82 <.001

What does this output suggest about which of the four possible
explanatory variables should be included in a good regression
model based on a subset of the four available variables? Explain
your answer clearly, making explicit which parts of the output
relate to each of your conclusions. Comment on the presence of
observations with high leverage. [7]
(d) The stepwise search method provided by GenStat, with M346
default choices, was applied to the dataset. Both forward
selection and backward elimination methods resulted in a model
containing lgest, black and smoke, but excluding educ.
(i) What is meant by forward selection and backward
elimination in the stepwise method, and why do they
sometimes result in different models? [4]
(ii) Does the model finally obtained by GenStat seem reasonable
given the preliminary analysis you did in part (c)? Briefly
explain your answer. [2]
(e) GenStat output for Model 1, the regression of grams on lgest,
smoke and black only, is given next.
Model 1
Regression analysis
Response variate: grams
Fitted terms: Constant, lgest, smoke, black
Summary of analysis
Source d.f. s.s. m.s. v.r. F pr.
Regression 3 235450059. 78483353. 411.98 <.001
Residual 1111 211648092. 190502.
Total 1114 447098151. 401345.
Percentage variance accounted for 52.5
Standard error of observations is estimated to be 436.
Message: the following units have large standardized residuals.
Unit Response Residual
748 1900. −3.34
1045 4830. 3.68
Message: the following units have high leverage.
(Here GenStat gave a list of 21 units, with leverages between 0.0170
and 0.0692.)
Estimates of parameters
Parameter estimate s.e. t(1111) t pr.
Constant −15798. 619. −25.54 <.001
lgest 5243. 168. 31.19 <.001
smoke 1 −187.5 30.9 −6.07 <.001
black 1 −184.0 27.0 −6.82 <.001

Parameters for factors are differences compared with the reference
level:
Factor Reference level
smoke 0
black 0
Explain whether you consider Model 1 to be a good model. [2]
(f) Model 2 was found to be another sensible model. The resulting
GenStat output from fitting this model is as follows.
Model 2
Regression analysis
Response variate: grams
Fitted terms: Constant + lgest + smoke + black + smoke.black
Summary of analysis
Source d.f. s.s. m.s. v.r. F pr.
Regression 4 237440198. 59360049. 314.27 <.001
Residual 1110 209657954. 188881.
Total 1114 447098151. 401345.
Percentage variance accounted for 52.9
Standard error of observations is estimated to be 435.
Message: the following units have large standardized residuals.
Unit Response Residual
94 4830. 3.34
119 3438. 3.33
1045 4830. 3.74
Message: the following units have high leverage.
(Here GenStat gave a list of 19 units, with leverages between 0.0174
and 0.0762.)
Estimates of parameters
Parameter estimate s.e. t(1110) t pr.
Constant −15866. 616. −25.75 <.001
lgest 5269. 168. 31.45 <.001
smoke 1 −314.8 49.8 −6.32 <.001
black 1 −230.6 30.5 −7.57 <.001
smoke 1 .black 1 204.5 63.0 3.25 0.001
Parameters for factors are differences compared with the reference
level:
Factor Reference level
smoke 0
black 0
(i) Compare Model 2 with Model 1, the model considered in
part (e), and comment on their advantages and weaknesses. [2]
(ii) Write down the regression equation for Model 2. Explain this
equation in qualitative terms. [3]
(iii) Using Model 2, compute the expected birth weight in grams
for a white smoking woman with 39 weeks gestational age. [1]
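
The way the interaction term in Model 2 enters the fitted mean can be illustrated with a short Python sketch. It is not part of the original paper: the coefficients are copied from the Model 2 output, the 0/1 coding follows the factor definitions given at the start of the question, and the function and variable names are hypothetical.

```python
import math

# Coefficients copied from the GenStat output for Model 2 (Question 4(f)).
COEF = {
    "const": -15866.0,
    "lgest": 5269.0,
    "smoke": -314.8,
    "black": -230.6,
    "smoke_black": 204.5,
}

def model2_grams(gestate_weeks, smoke, black):
    """Fitted birth weight in grams; smoke and black are 0/1 indicators."""
    lgest = math.log(gestate_weeks)  # natural log, as in the question
    return (COEF["const"]
            + COEF["lgest"] * lgest
            + COEF["smoke"] * smoke
            + COEF["black"] * black
            + COEF["smoke_black"] * smoke * black)

# Fitted smoking effect at a fixed gestational age, separately by group.
# With an interaction, the effect of smoke depends on the level of black.
smoke_effect_nonblack = model2_grams(39, 1, 0) - model2_grams(39, 0, 0)
smoke_effect_black = model2_grams(39, 1, 1) - model2_grams(39, 0, 1)

# Extra-sum-of-squares F for adding smoke.black to Model 1: the drop in
# residual s.s. (on 1 d.f.) divided by the residual m.s. of Model 2.
f_interaction = (211648092 - 209657954) / 188881

print(f"Fitted weight, 39 weeks, smoker, black = 0: {model2_grams(39, 1, 0):.0f} g")
print(f"Smoking effect when black = 0: {smoke_effect_nonblack:.1f} g")
print(f"Smoking effect when black = 1: {smoke_effect_black:.1f} g")
print(f"F for smoke.black: {f_interaction:.2f}")
```

Up to rounding, the F value for the added term equals the square of the t statistic reported for smoke.black in the Model 2 output.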

Question 5
An agricultural experiment was carried out to investigate the effects
of date of cutting and of nitrogen fertiliser on the yield of forage. The
data were saved in a GenStat spreadsheet.
The experiment was laid out in four blocks (GenStat factor block),
each of which contained four plots (GenStat factor plot). Each plot
was divided into two subplots (GenStat factor subplot), one of which
was fertilised with a nitrogen fertiliser. The GenStat factor fertiliser
records whether each subplot was fertilised or not. The four plots
within each block were each harvested on a different date (11 June,
1 July, 22 July and 12 August), with these dates recorded in the
GenStat factor cutdate. The yield of forage (GenStat variate yield) on
each subplot was recorded. Unfortunately no data could be obtained
for three of the subplots in the experiment.
Preliminary analysis indicated that some transformation of the data
was necessary, and a log transform (base e) of the yields was carried
out (variate lyield).
(a) Explain why, in some experiments, it is necessary to split the
plots and to apply one treatment factor to whole plots and
another to subplots. [2]
(b) How should the experimenters have decided which cutting date is
used for which plot within a block, and which subplot in each
plot should have the fertiliser? Give reasons for your answer. [3]
(c) Give a possible reason why the experimenters may have chosen to
divide the field into four blocks rather than just using each of the
four cutting dates for four plots chosen from anywhere in the field. [2]
(d) The data were analysed using GenStat’s analysis of variance
commands. The resulting output, together with the (GenStat
default) residual plots (Figure 3) and a potentially useful plot of
means (Figure 4), are as follows.
Analysis of variance
Variate: lyield

Source of variation          d.f.  (m.v.)       s.s.       m.s.    v.r.  F pr.

block stratum                   3          0.0302010  0.0100670    1.26

block.plot stratum
cutdate                         3          9.8605161  3.2868387  412.57  <.001
Residual                        8     (1)  0.0637342  0.0079668    8.51

block.plot.subplot stratum
fertiliser                      1          0.0794758  0.0794758   84.86  <.001
fertiliser.cutdate              3          0.0153184  0.0051061    5.45  0.018
Residual                       10     (2)  0.0093659  0.0009366

Total                          28     (3)  9.3620650

Message: the following units have large residuals.
block 1 plot 5 subplot 1 0.0452 approx. s.e. 0.0171
block 1 plot 5 subplot 2 −0.0452 approx. s.e. 0.0171
Tables of means
Variate: lyield
Grand mean 4.0373
fertiliser        no      yes
              3.9874   4.0871

cutdate        Jun11    Jul01    Jul22    Aug12
              3.1120   4.1073   4.4105   4.5192

fertiliser  cutdate    Jun11    Jul01    Jul22    Aug12
no                    3.0287   4.0650   4.3595   4.4965
yes                   3.1953   4.1496   4.4616   4.5419

Standard errors of differences of means
Table       fertiliser   cutdate   fertiliser
                                      cutdate
rep.                16         8            4
s.e.d.         0.01082   0.04463      0.04718
d.f.                10         8         9.88
Except when comparing means with the same level(s) of
cutdate                               0.02164
d.f.                                       10
(Not adjusted for missing values)

Figure 3 Figure 4

(i) Explain briefly the meaning of the numbers in brackets in the
(m.v.) column of the analysis of variance table. [2]
(ii) Explain briefly why the rows labelled cutdate and fertiliser
appear in different strata in the analysis of variance table. [2]
(iii) Explain why there is a row labelled fertiliser.cutdate in the
analysis of variance table, and what hypothesis the figures in
this row can be used to test. [3]
(iv) Are there any aspects of the output or plots that give you
cause to suspect that the assumptions of the analysis of
variance model may not hold? Briefly explain your answer.
Suggest what further investigation or analysis you might
carry out in the light of any suspicions you may report. [5]
(v) Suppose you are asked to report on this experiment to
agriculturalists who want to know how cutting date and
nitrogen fertiliser might affect their own forage yields. Write
a brief summary of the main findings from this experiment. [6]
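
Two features of the split-plot output can be illustrated with a short Python sketch: each treatment term is tested against the residual of its own stratum, and a difference between means of lyield corresponds to a ratio on the original yield scale. The sketch is not part of the original paper; the numbers are copied from the analysis of variance table and the tables of means, and the variable names are hypothetical.

```python
import math

# Mean squares copied from the analysis of variance table (Question 5(d)).
MS_PLOT_RESIDUAL = 0.0079668      # block.plot stratum residual
MS_SUBPLOT_RESIDUAL = 0.0009366   # block.plot.subplot stratum residual

# Each variance ratio uses the residual m.s. of the stratum the term sits in.
vr_cutdate = 3.2868387 / MS_PLOT_RESIDUAL
vr_fertiliser = 0.0794758 / MS_SUBPLOT_RESIDUAL
vr_interaction = 0.0051061 / MS_SUBPLOT_RESIDUAL

# Means of lyield (natural log of yield) copied from the tables of means.
lyield_mean = {"no": 3.9874, "yes": 4.0871}

# A difference of log means back-transforms to a multiplicative effect.
fertiliser_ratio = math.exp(lyield_mean["yes"] - lyield_mean["no"])

print(f"v.r. cutdate: {vr_cutdate:.2f}")
print(f"v.r. fertiliser: {vr_fertiliser:.2f}")
print(f"v.r. fertiliser.cutdate: {vr_interaction:.2f}")
print(f"Yield ratio, fertilised to unfertilised: {fertiliser_ratio:.3f}")
```

Because the analysis is on the log scale, exponentiating a mean of lyield gives a geometric mean of yield rather than an arithmetic mean.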

Question 6
A GenStat data file contains records on 32 Tibetan skulls divided into
two groups (1-17 and 18-32), denoted on the file by Y = 0 for the first
group and Y = 1 for the second. On each skull five measurements (in
millimeters) were recorded: greatest length of skull (x1), greatest
horizontal breadth of skull (x2), height of skull (x3), upper face height
(x4), and face breadth, between outermost points of cheek bones (x5).
The goal of the analysis is to estimate the probability that a skull
belongs to one of the two groups, given the values of its five
measurements.
(a) Explain briefly why logistic regression is appropriate for analysing
these data. Explain briefly whether the scatter plot of x1 against
Y in Figure 5 supports this choice. Say, with a reason, whether
you could use log-linear modelling instead. [3]

Figure 5
(b) If a logistic regression model is fitted in GenStat using General
model within the Generalized Linear Models dialogue box, what
should be entered into the field Binomial Totals for this type of
data, and why? [2]
(c) Consider the following GenStat logistic regression output:
Model A
Regression analysis
Response variate: Y
Binomial totals: ????
Distribution: Binomial
Link function: Logit
Fitted terms: Constant + x1 + x2 + x3 + x4 + x5

Summary of analysis
                                mean  deviance  approx
Source      d.f.  deviance  deviance     ratio  chi pr
Regression     5     24.07    4.8149      4.81   <.001
Residual      26     20.16    0.7755
Total         31     44.24    1.4270
Dispersion parameter is fixed at 1.00.
Message: deviance ratios are based on dispersion parameter with
value 1.
Message: the following units have large standardized residuals.
Unit Response Residual
25 1.00 2.12
Message: the residuals do not appear to be random; for example,
fitted values in the range 0.00 to 0.22 are consistently larger than
observed values and fitted values in the range 0.70 to 1.00 are
consistently smaller than observed values.
Message: the error variance does not appear to be constant:
intermediate responses are more variable than small or large
responses.
Message: the following units have high leverage.
Unit Response Leverage
1 0.00 0.57
12 0.00 0.66
14 0.00 0.44
25 1.00 0.42
26 1.00 0.64
Estimates of parameters
                                              antilog of
Parameter   estimate     s.e.    t(*)   t pr.   estimate
Constant       −73.5     44.4   −1.66   0.098          *
x1            0.1289   0.0966    1.33   0.182      1.138
x2            −0.323    0.180   −1.80   0.072     0.7237
x3             0.104    0.122    0.85   0.394      1.110
x4             0.335    0.230    1.46   0.145      1.399
x5             0.424    0.270    1.57   0.116      1.528
Message: s.e.s are based on dispersion parameter with value 1.
Making use of the Model A output, answer the following
questions:
(i) Explain briefly why the regression degrees of freedom takes
the value 5. [2]
(ii) In the Model A output, what does approx chi pr <.001 mean
and what does it tell us about the usefulness of the terms in
the regression model? [2]
(iii) Does Model A fit the data well or not? Explain briefly what
evidence there is for your answer in the Model A output. [3]

(iv) Are there warnings in the Model A output that can be
disregarded, and, if so, why? [3]
(v) Does overdispersion appear to be a problem for Model A?
Briefly explain why or why not. [2]
(d) It was decided to discard the variable with the largest t pr. value
in Model A, x3, and consider the resulting reduced model. The
same process was then repeated. The models found and the
t pr. values of the non-constant terms involved in them are listed
in Table 1:
Table 1
model                           t pr. values
Constant + x1 + x2 + x4 + x5    0.144, 0.054, 0.172, 0.093
Constant + x1 + x2 + x5         0.036, 0.055, 0.041
Constant + x1 + x5              0.028, 0.164
Constant + x1                   0.006
Which of the listed models do you consider best, and why? [2]
(e) The GenStat output for the last model of Table 1 is as follows:
Model B
Regression analysis
Response variate: Y
Binomial totals: ????
Distribution: Binomial
Link function: Logit
Fitted terms: Constant, x1
Summary of analysis
                                mean  deviance  approx
Source      d.f.  deviance  deviance     ratio  chi pr
Regression     1     13.32    13.316     13.32   <.001
Residual      30     30.92     1.031
Total         31     44.24     1.427
Dispersion parameter is fixed at 1.00.
Message: deviance ratios are based on dispersion parameter with
value 1.
Message: the following units have large standardized residuals.
Unit Response Residual
1 0.00 −2.14
Message: the error variance does not appear to be constant:
intermediate responses are more variable than small or large
responses.
Estimates of parameters
                                              antilog of
Parameter   estimate     s.e.    t(*)   t pr.   estimate
Constant       −34.5     12.6   −2.74   0.006          *
x1            0.1914   0.0703    2.72   0.006      1.211
Message: s.e.s are based on dispersion parameter with value 1.

Using Model B, write down the fitted equation for the probability
that a skull belongs to the group for which Y = 1. [2]
(f) All five logistic regressions of Y on the five individual
measurements were performed in turn. The regression deviances,
regression d.f.s and p values are shown in Table 2:
Table 2
      d.f.  Deviance  p value
x1       1     13.32   < .001
x2       1      0.07    0.795
x3       1      1.74    0.188
x4       1     15.47   < .001
x5       1      8.84    0.003
Which of the explanatory variables appear to be important
predictors for Y , and why? [2]
(g) It was decided to perform backward elimination of variables
based on Table 2. The resulting models are listed in Table 3:
Table 3
model                      t pr. values
Constant + x1 + x4 + x5    0.218, 0.148, 0.371
Constant + x1 + x4         0.207, 0.073
Constant + x4              0.006
The GenStat output from the last model in Table 3 is as follows:
Model C
Regression analysis
Response variate: Y
Binomial totals: ????
Distribution: Binomial
Link function: Logit
Fitted terms: Constant, x4
Summary of analysis
                                mean  deviance  approx
Source      d.f.  deviance  deviance     ratio  chi pr
Regression     1     15.47   15.4734     15.47   <.001
Residual      30     28.76    0.9588
Total         31     44.24    1.4270
Dispersion parameter is fixed at 1.00.
Message: deviance ratios are based on dispersion parameter with
value 1.
Message: the following units have large standardized residuals.
Unit Response Residual
32 1.00 2.13
Message: the error variance does not appear to be constant:
intermediate responses are more variable than small or large
responses.

Estimates of parameters
                                              antilog of
Parameter   estimate     s.e.    t(*)   t pr.   estimate
Constant      −27.99     9.89   −2.83   0.005          *
x4             0.380    0.134    2.83   0.005      1.462
Message: s.e.s are based on dispersion parameter with value 1.
Which of Model B and Model C do you consider to be better, and
why? [2]
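
The relationship between the logit-scale estimates and the probability scale in Models B and C can be illustrated with a short Python sketch. It is not part of the original paper; the coefficients are copied from the GenStat output, and the function and variable names are hypothetical.

```python
import math

def inverse_logit(eta):
    """Map a linear predictor on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def model_b_prob(x1):
    """Fitted P(Y = 1) from Model B, given greatest skull length x1 (mm)."""
    return inverse_logit(-34.5 + 0.1914 * x1)

def model_c_prob(x4):
    """Fitted P(Y = 1) from Model C, given upper face height x4 (mm)."""
    return inverse_logit(-27.99 + 0.380 * x4)

# GenStat's 'antilog of estimate' column is exp(estimate): the factor by
# which the fitted odds are multiplied per 1 mm increase in the variable.
odds_factor_b = math.exp(0.1914)
odds_factor_c = math.exp(0.380)

# Value of each measurement at which the fitted probability equals 0.5,
# found by setting the linear predictor to zero.
x1_half = 34.5 / 0.1914
x4_half = 27.99 / 0.380

print(f"Odds factor per mm, Model B: {odds_factor_b:.3f}")
print(f"Odds factor per mm, Model C: {odds_factor_c:.3f}")
print(f"x1 with fitted probability 0.5: {x1_half:.2f} mm")
print(f"x4 with fitted probability 0.5: {x4_half:.2f} mm")
```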

Question 7
A study of coronary artery disease recorded the electrocardiogram
(ECG) conditions of 156 patients who voluntarily attended a clinic
and requested an evaluation. Table 4 contains data from the study
where classifications were made according to gender of the patient
(gender), the ST segment depression measure from the patient’s ECG
(STdep) and whether the patient was diagnosed with coronary artery
disease (coronary). The number of patients in each category is the
variable count. The interest in analysing these data focuses on how
other variables might affect the chance that a patient is diagnosed
with coronary artery disease.
Table 4

Gender   ST depression   Coronary
                         Yes   No
F        > 0.1            16   20
F        ≤ 0.1             8   22
M        > 0.1            42   12
M        ≤ 0.1            18   18
It was decided to analyse these data using log-linear analysis in
GenStat.
(a) Describe how these data should be entered into a GenStat
spreadsheet. Your answer should mention the number of rows and
columns that are required and describe the contents of each
column and each row. [4]
(b) The data were entered into a GenStat spreadsheet and then
analysed using Log-linear modelling in the Analysis field of the
Generalized Linear Models dialogue box. The following GenStat
output was obtained from fitting such a model (Model A).
Model A
Regression analysis
Response variate: count
Distribution: Poisson
Link function: Log
Fitted terms: Constant + gender + STdep + coronary
Summary of analysis
                                mean  deviance  approx
Source      d.f.  deviance  deviance     ratio  chi pr
Regression     3     17.79     5.929      5.93   <.001
Residual       4     14.98     3.746
Total          7     32.77     4.681
Dispersion parameter is fixed at 1.00.

Message: deviance ratios are based on dispersion parameter with
value 1.
Message: the following units have large standardized residuals.
Unit Response Residual
1 16.00 −2.61
3 42.00 2.62
7 12.00 −2.56
Estimates of parameters
                                                          antilog of
Parameter                 estimate    s.e.    t(*)   t pr.  estimate
Constant                     2.340   0.186   12.57   <.001     10.38
gender M                     0.310   0.162    1.92   0.055     1.364
STdep >0.1                   0.310   0.162    1.91   0.056     1.364
coronary Yes                 0.525   0.166    3.17   0.002     1.690
Message: s.e.s are based on dispersion parameter with value 1.
Parameters for factors are differences compared with the reference
level:
Factor Reference level
gender F
STdep <=0.1
coronary No
(i) What is the value of the test statistic for testing whether
Model A fits the data adequately? What distribution would
this test statistic be compared to for the test? The p value for
the test turns out to be 0.0047. What do you conclude? [3]
(ii) If Model A fitted the data adequately, what would that tell
you about the relationship between the main factors (gender,
STdep and coronary)? [2]
(c) Three further models, Model B, Model C and Model D were then
fitted to the data. The resulting output is as follows.
Model B
Regression analysis
Response variate: count
Distribution: Poisson
Link function: Log
Fitted terms: Constant + gender + STdep + coronary + gender.coronary
Summary of analysis
                                mean  deviance  approx
Source      d.f.  deviance  deviance     ratio  chi pr
Regression     4     19.13     4.782      4.78   <.001
Residual       3     13.64     4.547
Total          7     32.77     4.681
Dispersion parameter is fixed at 1.00.
Message: deviance ratios are based on dispersion parameter with
value 1.
Message: the following units have large standardized residuals.

Unit Response Residual
1 16.00 −2.35
3 42.00 2.38
4 18.00 2.05
6 22.00 2.11
7 12.00 −2.31
8 18.00 −2.59
Estimates of parameters
                                                          antilog of
Parameter                 estimate    s.e.    t(*)   t pr.  estimate
Constant                     2.472   0.211   11.73   <.001     11.85
gender M                     0.069   0.263    0.26   0.793     1.071
STdep >0.1                   0.310   0.162    1.91   0.056     1.364
coronary Yes                 0.305   0.249    1.23   0.220     1.357
gender M .coronary Yes       0.388   0.335    1.16   0.246     1.474
Message: s.e.s are based on dispersion parameter with value 1.
Parameters for factors are differences compared with the reference
level:
Factor Reference level
gender F
STdep <=0.1
coronary No

Model C
Regression analysis
Response variate: count
Distribution: Poisson
Link function: Log
Fitted terms: Constant + gender + STdep + coronary + STdep.coronary
Summary of analysis
                                mean  deviance  approx
Source      d.f.  deviance  deviance     ratio  chi pr
Regression     4     18.03     4.506      4.51   0.001
Residual       3     14.74     4.914
Total          7     32.77     4.681
Dispersion parameter is fixed at 1.00.
Message: deviance ratios are based on dispersion parameter with
value 1.
Message: the following units have large standardized residuals.
Unit Response Residual
1 16.00 −3.06
3 42.00 2.75
5 20.00 2.42
7 12.00 −2.77

Estimates of parameters
                                                          antilog of
Parameter                 estimate    s.e.    t(*)   t pr.  estimate
Constant                     2.398   0.217   11.04   <.001     11.00
gender M                     0.310   0.162    1.92   0.055     1.364
STdep >0.1                   0.208   0.264    0.79   0.431     1.231
coronary Yes                 0.431   0.252    1.71   0.087     1.538
STdep >0.1 .coronary Yes     0.164   0.334    0.49   0.624     1.178
Message: s.e.s are based on dispersion parameter with value 1.
Parameters for factors are differences compared with the reference
level:
Factor Reference level
gender F
STdep <=0.1
coronary No

Model D
Regression analysis
Response variate: count
Distribution: Poisson
Link function: Log
Fitted terms: Constant + gender + STdep + coronary + gender.STdep
Summary of analysis
                                mean  deviance  approx
Source      d.f.  deviance  deviance     ratio  chi pr
Regression     4     18.25     4.562      4.56   0.001
Residual       3     14.52     4.840
Total          7     32.77     4.681
Dispersion parameter is fixed at 1.00.
Message: deviance ratios are based on dispersion parameter with
value 1.
Message: the following units have large standardized residuals.
Unit Response Residual
1 16.00 −2.74
3 42.00 2.71
5 20.00 2.42
7 12.00 −3.04
Estimates of parameters
                                                          antilog of
Parameter                 estimate    s.e.    t(*)   t pr.  estimate
Constant                     2.412   0.210   11.48   <.001     11.15
gender M                     0.182   0.247    0.74   0.461     1.200
STdep >0.1                   0.182   0.247    0.74   0.461     1.200
coronary Yes                 0.525   0.166    3.17   0.002     1.690
gender M .STdep >0.1         0.223   0.328    0.68   0.496     1.250
Message: s.e.s are based on dispersion parameter with value 1.

Parameters for factors are differences compared with the reference
level:
Factor Reference level
gender F
STdep <=0.1
coronary No
(i) Explain why the regression degrees of freedom for Model B is
the same as for Model C and Model D. [2]
(ii) Model B, Model C and Model D can each be obtained from
Model A by adding one two-way interaction. By considering
how the deviance changes, explain which of Models B, C and
D seems to be the most appropriate model. [4]
(d) Which of the four Models A, B, C and D would you choose as the
most appropriate to assess the chance that a patient is diagnosed
with coronary artery disease? Briefly explain why. [3]
(e) Data which can be analysed by means of a log-linear model can
sometimes also be analysed by means of logistic regression.
(i) Give one advantage and one disadvantage of using logistic
regression rather than log-linear modelling. [2]
(ii) Explain why the data in this question could have been
analysed using logistic regression instead of fitting log-linear
models. [2]
(iii) Some log-linear models correspond exactly to a logistic
model. Explain which one of the four log-linear Models A, B,
C and D has a logistic model that corresponds to it exactly. [3]
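
The deviance-based tests referred to in parts (b) and (c) compare a deviance, or a change in deviance, with a chi-squared distribution. The following Python sketch shows the arithmetic using only the standard library; it is not part of the original paper, the residual deviances are copied from the GenStat output for Models A to D, and the function and variable names are hypothetical. The two tail-probability formulas are the standard closed forms for 1 degree of freedom and for an even number of degrees of freedom.

```python
import math

def chi2_upper_tail_1df(x):
    """P(X > x) for chi-squared on 1 d.f., using the normal tail via erfc."""
    return math.erfc(math.sqrt(x / 2.0))

def chi2_upper_tail_even_df(x, df):
    """P(X > x) for chi-squared on an even number of d.f. (finite Poisson sum)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    terms = (half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * sum(terms)

# Residual deviances copied from the output (Model A has 4 residual d.f.,
# Models B, C and D have 3 each).
residual_deviance = {"A": 14.98, "B": 13.64, "C": 14.74, "D": 14.52}

# Goodness of fit of Model A: residual deviance against chi-squared(4).
p_fit_model_a = chi2_upper_tail_even_df(residual_deviance["A"], 4)
print(f"Model A residual deviance p value: {p_fit_model_a:.4f}")

# Adding one two-way interaction to Model A uses 1 d.f., so the drop in
# residual deviance is compared with chi-squared(1).
deviance_drop = {}
p_drop = {}
for name in ("B", "C", "D"):
    deviance_drop[name] = residual_deviance["A"] - residual_deviance[name]
    p_drop[name] = chi2_upper_tail_1df(deviance_drop[name])
    print(f"Model {name}: drop {deviance_drop[name]:.2f}, p value {p_drop[name]:.3f}")
```

The p values for the deviance changes are close to the t pr. values reported for the corresponding interaction terms in the output for Models B, C and D.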
[END OF QUESTION PAPER]
