Poisson Models For Person-Years and Expected Rates
Elizabeth J. Atkinson
Cynthia S. Crowson
Rachel A. Pedersen
Terry M. Therneau
3 Additive models
  3.1 Motivation and basic models
  3.2 Excess risk regression
  3.3 Direct standardization
    3.3.1 Direct standardization to a cohort
    3.3.2 Confidence intervals for standardized rates
4 Summary
5 Appendix
  5.1 SAS code for the examples
    5.1.1 Code for section 1.6 Relating Cox and rate regression models
    5.1.2 Code for section 2.1 Relative risk regression - basic models
    5.1.3 Code for section 3.1 Additive models - basic models
  5.2 Expected rates
    5.2.1 Expected rates in S-Plus
    5.2.2 Creating expected rates in S-Plus
    5.2.3 Expected rates in SAS for survival analysis
    5.2.4 Expected rates in SAS for person-years analysis
  5.3 S-Plus functions
    5.3.1 pyears2html
    5.3.2 poisson.additive
    5.3.3 addglm
  5.4 A closer look at differences between gam and glm
1 Introduction
1.1 Motivation
In medical research we are often faced with the question of whether, in a specified cohort, the observed
number of events (such as death or fracture) is more than we would expect in the general population.
If there is an excess risk, we then wish to determine whether the excess varies based on factors such
as age, sex, and time since disease onset.
Statistical approaches to this problem have been studied for a long time and there is a well-
described set of methods available (see, for instance, overviews in the Encyclopedia of Biostatistics
[6]). An interesting aspect of the methods, and the primary subject of this report, is that many of the
computations are very closely related to Poisson regression models. Powerful modern software, such as
the generalized linear models functions of S-Plus (glm), SAS (genmod), or other packages, allow us to
do these “specialized” computations quite simply via creation of datasets in the appropriate format.
This report summarizes several of these computations, and is also a compendium of various tricks and
techniques that the authors have accumulated over the years.
At the outset, however, we need to distinguish two separate kinds of questions about event rates,
as the approaches herein deal only with one of them. Consider all of the subjects in a population with
a particular condition, e.g. exposure to Lyme disease as measured by the presence of an antibody.
Interest could center on two quite different questions:
• Prospective risk: a set of subjects with the condition has been identified; what does their future
hold?
• Population risk: what is the impact of the condition on the population as a whole? This includes
all subjects with the condition, whether currently identified or not.
A big difference between these questions is what to do with cases that are found post-hoc, for instance
patients who are found to have the antibody only when they develop an adverse outcome known to
be related to Lyme disease, or cases found at autopsy. Analysis for population questions is much
more difficult since it involves not only the cases found incidentally, but inference about the number
of subjects in the population with the condition of interest who have not been identified.
The methods discussed here only apply to the prospective case. We might think of prospective risk
as the therapist’s question, as they advise the patient currently in front of them about his/her future.
Interesting questions include
• Short term. What is the risk for the short term, say 30 days? This question is important, but we
must be quite careful about implying causality from any association that we find. We might find
more joint pain in Lyme disease patients simply because we look for it harder, or a particular
lab test might only be ordered on selected patients.
• Long term. What is the long term risk for the patient? Specific examples of questions are:
– Do patients with diabetes have a higher death rate than the general population? Has this
relationship changed over time? Are there different relationships for younger versus older
subjects? Does this relationship change with disease duration?
– Has the rheumatoid arthritis population experienced the same survival benefits as the gen-
eral population?
– Multiple myeloma subjects are known to experience a large number of fractures. Does this
excess fracture rate occur before or after the diagnosis of multiple myeloma? Is this excess
fracture rate constant after diagnosis?
– Do patients with MGUS (monoclonal gammopathy of undetermined significance) have
higher death rates than expected? Is the excess mortality rate constant, or does it rise
with age? How is the amount of excess related to gender, obesity, or other factors?
Each of these items is an actual medical question that has been addressed in studies at Mayo, worked
on by members of the Division of Biostatistics, and several are used as examples in this report.
                             Age
Time          <35   35–45   45–55   55–65   65–75   75+
0–1 month      0      0       0     .082      0      0
1–6 month      0      0       0     .416      0      0
6–12 month     0      0       0     .236    .266     0
1–2 yr         0      0       0       0       1      0
2–5 yr         0      0       0       0     2.49     0
5+ yr          0      0       0       0       0      0
Table 1: Person-years analysis for one person who entered the study at age 64.3 and died at age 68.8. Each
cell contains the number of person-years spent in that category.
1.3 Person-years
Tabular analyses of observed versus expected events, often called “person-years” summaries, are very
familiar in epidemiological studies. Table 1 is an example, and shows how a subject may contribute
to multiple cells within a person-years table. In this case, the results have been stratified both by
patient age and by the time interval that has transpired since detection of MGUS. This table includes
the follow-up for a female who is age 64.3 at diagnosis of MGUS and is followed for approximately 4.5
years. She contributes 0.082 years (1 month) to the “0-1 month” cell as a 55-65 year old. During the
“6-12 month” cell she contributes person-years to two different age groups.
This concept can be applied to all the females from the mgus data, as shown in Table 2. We can learn
a lot from this table, but it is not always easy to read. We immediately see that making those under
age 35 into a separate group was unnecessary; very few people are diagnosed with MGUS before age
45. Looking across the rows, we see considerable early mortality in these patients: in the first month
after detection there were 20 observed deaths, but only 2.0 expected events. This 10-fold increase is
likely an artifact of detection, i.e. it is likely attributable to people who came to Mayo for a serious or
life-threatening condition and were found to have MGUS incidentally. Such individuals are not dying
of MGUS, but of the causes that underlie their visit.
                                      Age
              <35     35–45    45–55    55–65    65–75     75+     Total

0–1 month
  n             3       13       44      100      201      270
  pyears       0.2      1.1      3.5      7.9     16.3     21.7     50.7
  observed      1        0        2        5        4        8      20
  expected    .0001    .0014    .011     .063     .314     1.58      2.0
  ratio      9675.1      0      179.1    79.1     12.7      5.1     10.1

1–6 month
  n             2       13       43       97      202      270
  pyears       0.8      5.1     17.4     38.8     78.9    106.8    247.8
  observed      0        0        0        2        5       18      25
  expected    .0004    .0065    .055     .312    1.525     7.81      9.7
  ratio         0        0        0       6.4      3.3      2.3      2.6

6–12 month
  n             2       12       42       94      195      272
  pyears       1.0      5.9     19.4     43.5     89.7    129.3    288.8
  observed      0        0        0        4        3        9      16
  expected    .0005    .0076    .061     .345    1.709     9.35     11.5
  ratio         0        0        0      11.6      1.8     0.96      1.4

1–2 yr
  n             2       11       40       86      182      288
  pyears       2.0     10.2     35.9     76.7    158.7    267.4    550.9
  observed      0        0        1        5        8       21      35
  expected    .0011    .0133    .112     .611    2.947    19.50     23.2
  ratio         0        0       8.9      8.2      2.7      1.1      1.5

2–5 yr
  n             2       10       39       85      178      318
  pyears       4.5     23.1     77.7    183.3    408.5    762.3   1459.4
  observed      0        0        2        3        8       70      83
  expected    .0029    .0303    .237    1.389    7.467    57.80     66.9
  ratio         0        0       8.4      2.2      1.1      1.2      1.2

5–10 yr
  n             1        6       26       68      140      294
  pyears       0.1     22.7     73.8    189.8    409.6    928.9   1624.9
  observed      0        0        0        4       19      119     142
  expected    .0001    .0294    .235    1.442    7.630    80.94     90.3
  ratio         0        0        0       2.8      2.5      1.5      1.6

10+ yr
  n             0        1        9       35       74      173
  pyears       0.0      5.1     42.3    146.3    269.4    712.4   1175.4
  observed      0        0        1        0       13       88     102
  expected      0      .0069    .127    1.083    4.946    62.15     68.3
  ratio         —        0       7.9       0       2.6      1.4      1.5

Total
  n             3       15       57      147      298      466     631
  pyears       8.7     73.1    270.1    686.2   1431.1   2928.8   5398.0
  observed      1        0        6       23       60      333     423
  expected    0.01     0.1      0.84     5.2     26.5    239.1    271.9
  ratio       196.1      0       7.2      4.4      2.3      1.4      1.6
Table 2: Rates analysis for female patients with MGUS. Each time-period block contains five rows: 1) the
number of subjects contributing, 2) the total person-years of observation, 3) the observed number of deaths,
4) the expected number of deaths based on the Minnesota population, and 5) the risk ratio (observed/expected).
A given patient will contribute to multiple cells during her follow-up.
The values in Table 2 were produced using the following code in S-Plus.
## Create desired grouping of the follow-up person-years
> cuttime <- tcut(rep(0, nrow(data.mgus)),
c(0, 30, 182, c(1, 2, 5, 10, 100)*365.25),
labels=c(’0-1 mon’, ’1-6 mon’, ’6-12 mon’, ’1-2 yr’,
’2-5 yr’, ’5-10 yr’, ’10+ yr’))
The tcut command creates time-dependent categories especially for the pyears computation. Its
first argument is the starting point for each person; for the cuttime variable (time since diagnosis of
MGUS) each person starts at 0 days. As a person is followed through time, they dynamically move to
another category of follow-up time at 30, 182, etc. days. The pyears function breaks the accumulated
person-years into cells based on cutage and cuttime. The created intervals are of the form (t1 , t2 ],
meaning that the cutpoints are included on the right end of each interval.
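For reference, the full pyears call behind Table 2 would look roughly like the sketch below; the
exact call was not reproduced here, so the cutage definition and the variable names futime and
status are assumptions.

## A sketch of the pyears call (assumed; ages and times in days to match the rate table)
> cutage <- tcut(data.mgus$age, c(0, 35, 45, 55, 65, 75, 110)*365.25,
                 labels=c('<35', '35-45', '45-55', '55-65', '65-75', '75+'))
> tmpfit <- pyears(Surv(futime, status) ~ cuttime + cutage,
                   data=data.mgus, ratetable=survexp.mn)
> tmpfit$summary      ## how subjects mapped into the rate table at entry
> tmpfit$offtable     ## person-years not falling into any cell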
The survexp.mn rate table is a special object containing the daily death rates for Minnesota res-
idents by age, sex, and calendar year; the variable names of ‘age’, ‘sex’, and ‘year’ can be found
by summary(survexp.mn). Other rate tables may include different dimensions. For instance, the
survexp.usr rate table is also divided by race. By default, the pyears function assumes that your
dataset, data.mgus in the above example, contains both the variables found in the model statement
and the variables found in the rate table, and that the latter are on the right scale (e.g. days ver-
sus years). If the variable names are not in your dataset you can still do the analysis by calling the
ratetable function as part of your formula. In this particular dataset ‘year’ is named ‘dtdiag’. Since
the Minnesota rate table contains daily hazards, this means that age should be in days, and that the
‘year’ argument should be a Julian date (number of days since 1/1/1960) at which the subject began
their follow-up. The variable sex should be coded as (“male”, “female”). Any unique abbreviation of
the character string, ignoring case, is acceptable, as is a numeric code of (1, 2). Correctly matching
the variables of the rate table is one of the more fussy aspects of the pyears and survexp routines, and
it is easy to make an error. Be very careful as well to use only the starting values of your variables
(follow-up time at baseline, baseline age, etc.) when using tcut. The pyears function returns two
components whose primary purpose is data checking, shown in the last 4 lines of the example. The
summary component shows how subjects mapped into the rate table at the start of their follow-up, and
offtable shows the amount of follow-up time that did not fall into any of the categories specified by
the formula.
An HTML version of the table can be produced using the pyears2html function. The local Mayo
SAS procedure personyrs produces such tables directly, but unfortunately the result is not compact
enough to fit nicely into this report. Additionally, the SAS procedure is not set up to use the standard
expected death rate tables and instead expects a dataset with user-defined expected event rates [2].
See the Appendix for the location of pyears2html code (section 5.3.1) and for information on how to
create rate tables in S-Plus and SAS (section 5.2).
Figure 1 shows the rates from the output table (pyrs.spe3) plotted on a log scale versus age; it
appears that the log of the rates is nearly linear in age, with a small upward curving component. Figure
2 shows these same rates plotted on the arithmetic scale versus age; here the relationship definitely
curves upward, and needs at least a quadratic component. Whatever the relationship, information
from the oldest age category will be highly influential on the fit. Depending on the dataset, analysis of
the log of the rates (multiplicative scale) or of the rates without a transformation (additive scale) may
lead to a better model. In this case, analysis of the log rates appears to provide a simpler description
of the age relationship.

[Figure 1 here: death rates versus age on a log scale (0.005 to 0.5), with plotting symbols
f = Negative female, m = Negative male, F = Positive (MGUS) female, M = Positive (MGUS) male.]

Figure 1: Death rates (plotted on a log scale) versus age for patients who had an SPE. Note that two points
with rates of zero do not appear on the logarithmic plot.

[Figure 2 here: the same death rates versus age on the arithmetic scale (0 to 0.5), same plotting symbols.]

Figure 2: Death rates (plotted on the arithmetic scale) versus age for patients who had an SPE.
Plot code for Figure 2 is similar to that for Figure 1 above, except with log='n' (not shown).
## assign a numeric value to each age group for plotting and modeling
> pyrs.spe4$age <- (34:100)[as.numeric(pyrs.spe4$cutage4)]
[Figure 3 here: four panels (Negative female, Negative male, MGUS female, MGUS male) of death
rate (0 to 0.8) versus age 40 to 100.]
Figure 3: Predicted death rates (solid line) and 95% confidence intervals (dashed lines) for each of the four
groups, along with the observed death rates (circles) in each cell of the table.
## Add points to the figure from the coarser fit, as shown in Figure 1
> points(tmp.age, tmp.y[,1], pch=1)
> mylung <- lung ## creates local version of the lung dataset
> mylung$event <- mylung$status - 1 ## so that it is coded as 0=censor/1=event
> coxph(Surv(time, event) ∼ age + ph.ecog, data=mylung)
coef exp(coef) se(coef) z p
age 0.0113 1.01 0.00932 1.21 0.23000
ph.ecog 0.4435 1.56 0.11583 3.83 0.00013
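The Poisson regression output that follows was presumably produced by a fit of the following
form (a sketch based on the offset of equation (1); the call itself was not shown):

> summary(glm(event ~ age + ph.ecog + offset(log(time)),
              data=mylung, family=poisson))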
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -7.10610011 0.575199157 -12.3542 0.0000
age 0.01097865 0.009242255 1.1879 0.2361
ph.ecog 0.38716948 0.114240374 3.3891 0.0008
Notice how closely the coefficients and standard errors for the Poisson regression, which uses the
number of events for each person as the y variable, match those of the Cox model, which is focused
on a censored time value as the response. In fact, if the baseline hazard of the Cox model λ0 (t) is
assumed to be constant over time, the Cox model is equivalent to Poisson regression.
One non-obvious feature of the Poisson fit is the use of an offset term. This is based on a clever
“sleight of hand”, which has its roots in the fact that a Poisson likelihood is based on the number of
events (y), but that we normally want to model not the number but rather the rate of events (λ).
Then
E(y_i) = λ_i t_i
       = e^{X_i β} t_i
       = e^{X_i β + log(t_i)}                    (1)
We see that treating the log of time as another covariate, with a known coefficient of 1, correctly
transforms from the hazard scale to the total number of events scale. An offset in glm models is
exactly this, a covariate with a fixed coefficient of 1.
The hazard rate in a Poisson model is traditionally modeled as exp(Xβ) (i.e. the inverse link
f (η) = eη ) rather than the linear form Xβ, for essentially the same reason that it is modeled that way
in the Cox model: it guarantees that the hazard rate (the expected value given the linear predictors)
is positive. The exponentiated coefficients from the Cox model are hazard ratios and those from the
Poisson model are known as standardized mortality ratios (SMR).
A second reason for modeling exp(Xβ), at least in the Cox model case, is that for acute diseases
(such as death following hip fractures or myocardial infarctions) the covariates often act in a multi-
plicative fashion, or at least approximately so, and the multiplicative model therefore provides a better
fit. Largely for these two reasons, that the underlying code works reliably and that the fit is usually
acceptable, the multiplicative form of both the Cox and rate regression (Poisson) models has become
the standard.
Recently there has been a growing appreciation that it is worthwhile to summarize a study not just
in terms of relative risk (hazard ratio or SMR) but in terms of absolute risk, the absolute amount of
excess hazard experienced by a subject. An example is provided by the well-known Women’s Health
Initiative (WHI) trial of combined estrogen and progestin therapy in healthy postmenopausal women
with an intact uterus. After 5 years, there was a 26% increase in the risk of invasive breast cancer
(hazard ratio 1.26, 95% CI 1.0 to 1.6) among women who were in the active treatment group as
compared to placebo [11]. It has been suggested that the results of the WHI trial should have been
reported in absolute as well as relative risk terms [10]. Thus, WHI investigators should also have
emphasized that the annual event rates in the two arms were 0.38% and 0.30%, respectively, leading
to an increased case incidence of only 8 per 10,000 patients per year. Given other benefits of the
treatment, such as a reduction in hip fracture, a patient might take a very different view of “26%
excess” and “< 1/1000 excess”.
Consequently, this report explores the fit of excess risk (additive) models as well as relative risk
(multiplicative) models. In many cases, excess risk models may provide information that is comple-
mentary to the relative risk models, in others they may provide a more succinct and superior summary.
Both types of models can be fit using Poisson regression, but the data setup and fitting process for
excess risk models is somewhat more involved and certainly far less well known.
E(y_i) = (λ_{age,sex} t_i) e^{X_i β}
       = Λ_{i,age,sex,t} e^{X_i β}
       = e_i e^{X_i β}
       = e^{X_i β + log(e_i)}
In the above formula, λage,sex is the appropriate population hazard rate for a particular age and sex
combination (that of the ith subject), and ei is the expected number of events over the time period
of observation for the subject, or, more accurately, the cumulative hazard Λi (ti ) for the person. In
reality, the baseline hazard changes over the follow-up time for a subject, as they move from one
age group to another, and computing it directly from the rate tables is a major bookkeeping chore.
However, keeping track of these changes and computing a correct total expected number of events for
each person is precisely what is done by the pyears and survexp functions in S-Plus and the %ltp macro
in SAS. See the Appendix (section 5.2) for more information about rate tables in S-Plus and SAS.
Per the above formulation, the coefficients β in this model describe the risk for each subject relative
to that of the standard population. Programming-wise, the only change from the usual Poisson
regression is the use of log(expected) instead of log(time) as the offset; the offset treats the log of
the expected count as another covariate with a known coefficient of 1.
For uncomplicated data, the S-Plus survexp and SAS %ltp (life table probability) functions are
the easiest to use. Each of these returns the survival probability Si (t) = exp[−Λi (t)], from which the
expected number of events Λi can easily be extracted. We will base our expected calculations on the
Minnesota life tables. See the Appendix (section 5.1.2) for SAS code for this example.
In this analysis we needed to confront an issue that is not uncommon in these studies: two of the
subjects have an expected number of events of 0. Upon further examination, these are two people
who were diagnosed with MGUS on the day of their death. Simple relative survival is not a valid
technique when such cases are included. The model is attempting to compare the mortality experience
of the enrolled subjects to that of a hypothetical control population, matched by age and sex, drawn
randomly from the total population of Minnesota. It recognizes, correctly, that the probability of such
a control’s demise at the instant of enrollment is zero, i.e., infinitely unlikely, which leads to infinite
values in the likelihood. The problem extends beyond day 0. In this dataset there are 16 subjects who
die within 1 week of MGUS detection; for all of these it is almost certain that MGUS was detected
because the patients were at an extreme risk of death. We must exclude those with futime=0, but
perhaps we should also exclude those with very small follow-up times.
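The fit discussed below would have had roughly this form (a sketch; the variable names death,
expected, and futime are assumptions):

> fit1 <- glm(death ~ sex - 1 + offset(log(expected)),
              data=data.mgus, family=poisson, subset=(futime > 0))
> summary(fit1)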
The analysis above shows that for both males and females, the death rate is significantly worse
than that for an age-, sex- and calendar-year matched population. Rates are 55% greater than the
Minnesota population at large (exp(0.44) = 1.55). Note that because we have removed the intercept
(using the -1 coding), we have coefficients for both males and females. This allows us to visually
compare the coefficients and also to obtain the correct standard error term for each gender. In the
age range of this study (mean age = 71) the population death rate for males is substantially higher
than that for females; it is interesting that the excess death rate associated with a MGUS diagnosis is
almost identical for the two groups.
To explore this further, we will look at a second dataset that allows an estimate of detection bias,
i.e., how much of this increase is actually due to MGUS, and how much might be due to the disease
condition that caused the patient to come to Mayo. We also want to look at time trends in the rate:
is the MGUS effect greater or less for older patients, for times near to or far from the diagnosis date,
and for different calendar years?
### First we create the pyrs.spe dataset
> cuttime <- tcut(rep(0, nrow(data.spe)), c(0:23 *30.5, 365.25*2:10, 36525),
labels=c(paste(0:23, ’-’, 1:24, ’ mon’, sep=’’),
paste(2:9, ’-’, 3:10, ’ yr’, sep=’’), ’10+ yr’))
## Double check that the pyears, ages, sex distribution, and dates all look ok.
> summary(tmpfit)
Total number of person-years tabulated: 225416
Total number of person-years off table: 0
Matches to the chosen rate table:
age ranges from 24 to 103.8 years
male: 10324 female: 12970
date of entry from 12/15/1960 to 11/29/1994
As before, we use the tcut command to create time-dependent cutpoints for the pyears routine.
We’ve created follow-up time intervals of zero to 1 month, 1 to 2 months, etc. for the first 2 years,
then yearly up to 10 years after the SPE test. For the age variable we have used 1 year age groupings
from age 40 up to age 95. Notice one other constraint of rate tables: since the Minnesota rate table
uses units of hazard/day, all time variables in the dataset must be in days. The default behavior for
the pyears function is to create a set of arrays; however, the data.frame=T argument instead produces
a dataset useful for further analysis. In the final data frame the ‘cuttime’ and ‘cutage’ variables are
categorical, a result of using tcut and pyears. The last 2 lines create a numeric value
for each category which will be more useful for subsequent models and plots.
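Those last two lines would be similar to the following sketch, mirroring the pyrs.spe4 example
shown earlier; the exact midpoint values are assumptions.

> pyrs.spe$age <- (40:95)[as.numeric(pyrs.spe$cutage)] - 0.5  ## age-group midpoints (assumed)
> pyrs.spe$dxtime <- dxtime.mid[as.numeric(pyrs.spe$cuttime)] ## dxtime.mid = hypothetical vector
                                                              ## of follow-up interval midpoints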
CAUTION - COMMON MISTAKES:
1) When using tcut, make sure that the input value reflects the beginning of your time
period or age period. For follow-up, you usually start at time 0. DO NOT use your final
follow-up time in tcut. If you have variables that reflect the start and stop time for each
individual, make sure the age listed is the age at the start time.
2) All time and age variables MUST be in the same units (in the previous example, days).
You will run into problems if you have age in years and follow-up time in days. Additionally,
these units need to match the units used in your rate table. For example, when the summary
shows that age ranged between 0 and 0.3 years, it is a good clue that you used years and
the program expected days.
We then fit generalized additive models (gam) using the gam function. Generalized additive models
are simply an extension of the generalized linear models that are often used for Poisson regression. Gam
models allow the fitting of nonparametric functions, in this case the smoother function s, to estimate
relationships between the event rate and the predictors (age and dxtime). Again we use log(expected)
as an offset to describe the risk for each subject relative to that for the standard population. Four
subsets are fit, broken up by male/female and MGUS/Negative.
See the Appendix (section 5.4) for a discussion of differences between gam and glm.
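The four subset fits referenced later (fit3.1 through fit3.4) would each have had roughly this
form (a sketch; the subset coding is an assumption):

> fit3.1 <- gam(event ~ offset(log(expected)) + s(age) + s(dxtime),
                data=pyrs.spe, family=poisson,
                subset=(sex=='female' & mgus==0))
## fit3.2 to fit3.4 repeat this for the other sex by MGUS combinations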
[Figure 4 here: relative death rate (log scale, 1.5 to 10) versus time since SPE (square root scale)
for the four groups.]
Figure 4: The estimated selection effect for male and female patients who are 67-68 years old (≈ 67.5 years)
with positive and negative SPE. To spread out the earlier times the x-axis is on a square root scale. Note that
the y-axis is on the log scale.
[Figure 5 here: relative death rate (log scale, 1.5 to 10) versus age 40 to 90 for the four groups.]
Figure 5: The estimated age effect for a patient with 17-18 months (≈ 1.375 years) of follow-up with positive
and negative SPE. The y-axis is on the log scale.
##### CODE TO CREATE FIGURE 4 #####
## Note: age=67.5 corresponds to the middle age group (67-68 year olds)
> newdata1 <- expand.grid(age = 67.5, dxtime=seq(0,10,length=50), expected=1)
> pred1 <- cbind(predict(fit3.1, newdata=newdata1, type=’response’),
predict(fit3.2, newdata=newdata1, type=’response’),
predict(fit3.3, newdata=newdata1, type=’response’),
predict(fit3.4, newdata=newdata1, type=’response’))
### FIT OVERALL AND SUBSETTED MODELS TO CHECK FOR SEX AND MGUS SIGNIFICANCE
> fit3.overall <- gam(event ∼ offset(log(expected)) + s(age) + s(dxtime) + sex + mgus,
data=pyrs.spe, family=poisson)
> fit3.drop1 <- gam(event ∼ offset(log(expected)) + s(age) + s(dxtime) + mgus,
data=pyrs.spe, family=poisson)
> fit3.drop2 <- gam(event ∼ offset(log(expected)) + s(age) + s(dxtime),
data=pyrs.spe, family=poisson)
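The deviance table below would then come from a nested-model comparison such as this
(an assumed call):

> anova(fit3.overall, fit3.drop1, fit3.drop2, test='Chi')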
Response: event
Terms Resid. Df Resid. Dev Test Df Dev Pr(Chi)
1 s(age) + s(dxtime) + sex + mgus 7442.252 5911.278
2 s(age) + s(dxtime) + mgus 7443.252 5913.008 -sex -1.00 -1.73 0.1883
3 s(age) + s(dxtime) 7444.202 5943.054 -mgus -0.95 -30.04 0.0000
Standardization method    Indirect                                Direct

Question                  How many events would my study          How many events would the reference
                          population have had if their event      population have had if their event
                          rate was the same as the reference      rate was the same as my study
                          population?                             population?

Procedure                 Event rates in the reference            Event rates in the study population
                          population are applied to the           are applied to the reference
                          study population.                       population.

Reference population      Age/sex stratified event rates          Age/sex stratified population sizes
data needed
It is also possible to test whether the smoother function is different from a simple linear or quadratic
fit for the term. The example below tests for the difference between a linear age term and the smoother.
In this case there is significant evidence that the smoother is better at explaining the age relationship.
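A sketch of the assumed comparison:

> fit3.linage <- gam(event ~ offset(log(expected)) + age + s(dxtime) + mgus,
                     data=pyrs.spe, family=poisson)
> anova(fit3.drop1, fit3.linage, test='Chi')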
Response: event
Terms Resid. Df Resid. Dev Test Df Dev Pr(Chi)
1 s(age) + s(dxtime) + mgus 7443.252 5913.008
2 age + s(dxtime) + mgus 7446.090 6044.867 1 vs. 2 -2.84 -131.86 0
If one group is younger than another, its overall death rate would be lower. By normalizing the groups
to a common population structure, the rates become comparable.
In direct standardization it is important to recognize the implication of using different standard
populations. For instance, if you want to make some statement about a diseased population, you may
want to standardize to the overall age and sex distribution of that diseased population. Often diseased
populations are weighted more heavily in the older ages, so standardizing to the US population would
give extra weight to the younger ages where there may not be as much information. It might be more
appropriate and informative to use the overall age and sex structure of diseased subjects as a reference.
Likewise, if you have an age- and sex- stratified sampling of the population and you want to make
generalizations to the entire US population, then you would want to standardize to the US population.
The expected number of events in the parent population is a simple sum

D = R_{F,35−39} N_{F,35−39} + R_{M,35−39} N_{M,35−39} + R_{F,40−44} N_{F,40−44} + · · · + R_{M,95−100} N_{M,95−100}

where R are the death rates estimated from the study and N the population sizes in the reference
population. Reference rates are usually expressed per 100,000, i.e. (100000 D) / Σ_{i,j} N_{i,j}, where i is
the sex and j is the age group.
One shortcoming of direct standardization is that covariates are limited – you can only include
in the model those variables that are known for the parent population, which is usually age and sex
groups, and sometimes race. Compare this to the examples, where test status and time since diagnosis
both played a role. An advantage to direct standardization is that it can often be calculated from a
published paper, without access to the raw data, allowing us to compare our work to other results.
When doing direct standardization, there are three advantages to using a model for the predicted
death rates rather than the table values:
• The values for certain age groups may be erratic due to small sample size. The smoothing
provided by the model stabilizes them.
• We can use finer age groupings. To see why coarse age groupings could be a problem, consider,
for example, that we had two samples to be compared, with 10 year age groupings. In one sample
the mean age within the 45-55 year age group might be 48, and in the other 52. This could bias
the comparison of the studies.
• Estimates of the direct age-adjusted value and its variance can be obtained from the fitted model,
as shown below.
There is also one major disadvantage to using a Poisson fit: the chosen model has to fit the data well.
The estimates in our example would not be particularly good if we had used only a linear term for age,
particularly in the tails. Figure 3, which is purposely plotted on arithmetic rather than logarithmic
scale, clearly shows that the direct adjusted value depends very heavily on the predictions in the right
hand tail.
The direct age-adjusted value and its variance can be computed as follows. Assume that we want
to standardize the rates for females with MGUS to the age 35 to 100 US population using a model
that includes age and age². From the Poisson regression fit (using glm) we have the coefficient vector β̂ and
variance-covariance matrix Σ (i.e. coef(pfit4c) and summary(pfit4c)$cov, respectively). If we let X
be the predictor matrix for the integer ages at which we need the prediction, each row of X being
the vector (1, age, age²), then r̂ = exp(Xβ̂) is the vector of predicted rates, and T = Σ_i w_i r̂_i is the
expected total number of events, where w_i are the population weights; T/N is then the direct-adjusted
estimate, where N is the total population size. (Alternatively, use the proportions w_i/N
as the weights.) The variance matrix of Xβ̂ is XΣX′, and a first-order Taylor series approximation
gives RXΣX′R as the variance matrix of r̂, where R is a diagonal matrix with R_ii = r̂_i. The variance of T
is then w′RXΣX′Rw.
The code below will calculate the direct age-adjusted estimate and its standard error, for the female
MGUS group. Note that this approach will not work using gam models, since in that case we do not
have an explicit variance matrix. See the appendix (section 5.4) for more details.
## As calculated earlier in section 1.5:
> pfit4c <- glm(event ~ offset(log(pyears)) + age + I(age^2), data=pyrs.spe4,
                family=poisson, subset= (sex=='female' & mgus==1))
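The computation itself, following the formulas above, might look like this sketch; the objects
USweights and ages are assumed to hold the US population weights and the corresponding
integer ages.

> X  <- cbind(1, ages, ages^2)               ## predictor matrix, one row per age
> r  <- exp(X %*% coef(pfit4c))              ## predicted rates
> V  <- X %*% summary(pfit4c)$cov %*% t(X)   ## variance of X beta-hat
> R  <- diag(as.vector(r))
> N  <- sum(USweights)
> T  <- sum(USweights * r)                   ## expected total events
> se <- sqrt(t(USweights) %*% R %*% V %*% R %*% USweights)
> round(100000 * c(T, se)/N)                 ## estimate and SE per 100,000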
The direct adjusted estimate is 2677 deaths per 100,000 +/- 256
## fit model
> agefit3.3 <- glm(event ~ offset(log(pyears)) + ns.age, family=poisson, data=pyrs.femaleMGUS)
The direct adjusted estimate is: 2874 deaths per 100,000 +/- 440
We could get the vector of predicted rates r̂_i directly as a prediction from the model, along with the
standard error of each element, but since predict does not return the full variance-covariance matrix,
this does not give the variance for T, the weighted sum of the vector.
> sum(USweights*predict(pfit4c, type=’response’,
newdata=data.frame(age = ages, pyears = 1)))
2677
One caution regarding direct standardizing to a population is that the resulting estimates often
represent a substantial extrapolation of the dataset. For instance, in the MGUS example above only
5/1384 subjects are aged < 30 years with none under 20 years. Standardization to the entire US pop-
ulation aged 20–100 years requires estimated rates at each age, many of which have no representatives
at all in the study subjects! Even in using the age 35–100 year subset that we chose for the examples,
14% of the US female population was in the 35–39 age group, and hence this age group contributed 14%
of the weight in the final estimate, but only 1.2% of the female study subjects were in this age and sex
group.

[Figure 6 here: relative death rate (log scale, 1.5 to 10) versus time since SPE (square root scale)
for the four groups.]

Figure 6: The estimated selection effect for male and female patients with positive and negative SPE, age-
adjusted to the population of subjects who had an SPE test. To spread out the earlier times the x-axis is on a
square root scale. Note that the y-axis is on the log scale.
[Figure 7 here: relative death rate (log scale, 1.5 to 10) versus time since SPE (square root scale),
with confidence bands.]
Figure 7: The estimated selection effect for female patients with negative SPE, age-adjusted to the population
of subjects who had an SPE test, with confidence intervals. To spread out the earlier times the x-axis is on a
square root scale. Note that the y-axis is on the log scale.
> age.range <- range(c(pyrs.spe$age, data.spe$age/365.25))
> dx.range <- range(pyrs.spe$dxtime)
## initialize storage space for final results (at each unique dxtime)
> finalRhat.vector <- rep(NA, N.dxtime)
> finalStd.vector <- rep(NA, N.dxtime)
3 Additive models
3.1 Motivation and basic models
There are many cases where an additive hazard model
E(y_i) = λ_i t_i = (X_i β) t_i                    (2)
makes more sense, from a medical or biological point of view, than a multiplicative model. For
instance, it is known that MGUS patients have about a 1%/year risk of conversion to overt plasma cell
malignancy. Since this event rate is constant over time, it may be reasonable to assume the covariates
of interest have a constant effect on the event rate. In the additive model, covariate effects are modeled
on the event rate scale (e.g. 1 year increase in age confers an additional 0.2 absolute increase in the
death rate per year). This model may not fit well if the event rate changes dramatically over time, such
as the death rate following myocardial infarction (MI) which is quite high in the first few days after
MI, but much lower later on. In this case it would not make sense to assume age has the same effect
on the event rate both early on and later following an MI, and a multiplicative model may provide a
better fit.
The main reason that the additive model is not commonly used is technical. Namely, for some
choices of β, equation 2 can predict a negative hazard for some subjects, e.g., the dead coming back
to life. The Poisson likelihood involves a log(λ) term, which is numerically undefined for a negative
value. Even if the true MLE estimates are positive, if the iterative procedure ever flirts with a bad
choice for β̂, a missing value is generated which quickly propagates, and the fitting program will fail.
Programs which regularly fail get little use. Luckily, failure can be almost completely avoided by use
of a modified link function.
We wish to use an identity function for the inverse link f (η) = η, but also ensure that f (η) > 0
for all values of η. A second consideration is that we would like f to be smooth, with a continuous
first derivative, since discontinuities tend to confuse the Newton-Raphson fitting algorithm used in the
underlying code for generalized linear models. We have found the following to work well in practice:
f(η) = 0.5 (η + √(η² + ε²))
η = f⁻¹(µ) = µ − ε²/(4µ)
dη/dµ = 1 + ε²/(4µ²)
This is a hyperbolic function whose asymptotes are the 45° line for η > 0 and the x axis f(η) = 0 for
η < 0. The value of ε controls how tightly the function hugs the corner; the default value for ε is 0.01.
Choosing ε is the only problematic part of the procedure: one wants a strictly additive model to
hold for as much of the data as possible, and thus a small value of ε, yet not so small a value as to
create round-off errors. In particular, small values of ε along with large negative values of the linear
predictor η can lead to some observations having an extremely large weight, and in turn a subsequent
lack of convergence. In this case, you may want to choose a slightly larger value for ε. For the lung
dataset the constrained link function is not necessary; death rates for all subsets of age and ECOG
score are far enough away from zero that no problems arise. See the Appendix (section 5.3.2) for more
details regarding the link function.
To fit this model, we must pre-multiply each element of the X matrix by time. This is done
automatically using the addglm function instead of the usual glm function. Details about the addglm
can be found in the Appendix (see 5.3.3). In this case, the final additive model for our earlier example
using the lung data (Section 1.5) becomes
> summary(addglm( event ∼ age + ph.ecog, time=time,
data=lung, family=poisson.additive))
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -4.846495e-04 1.153262e-03 -0.4202 0.6747
age 3.259046e-05 1.931485e-05 1.6873 0.0929
ph.ecog 9.345934e-04 2.675068e-04 3.4937 0.0006
See the Appendix (section 5.1.3) for SAS code for this example.
The coefficients of the fit correspond to the intercept, age, and physician ECOG score effects. The
value of the last coefficient indicates that each 1 point increase in ECOG score confers an additional
.00093 ∗ 365.25 = .34 absolute increase in the per year death rate. Note that death rates can exceed
1.0, for instance, when average survival is less than a year.
In rare cases better starting estimates may be required. The specification of initial values for
the glm function in S-Plus is unusual; rather than expecting starting guesses for the coefficients β, it
expects guesses for the vector of estimated predicted values ŷ. This makes choosing starting values
very easy: the default is to use the observed data y as the vector of starting values, an optimistic
assumption that the final fit will be perfect. However, if there are observations with 0 observed events,
this strategy must be modified since it would lead to log(0) in the likelihood. By default, the starting
value of 1/6 is used in this case, but for some datasets, this may not be good enough, e.g. many zeros
and many covariates. A solution in such a case is to first fit a multiplicative model, and then use the
predicted values from that model as starting estimates for the additive one.
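The fit fita2 summarized below was presumably created along these lines (a sketch; whether
addglm accepts a start argument of fitted values, as the S-Plus glm function does, is an
assumption):

> fitm  <- glm(event ~ age + ph.ecog + offset(log(time)),
               data=lung, family=poisson)
> fita2 <- addglm(event ~ age + ph.ecog, time=time, data=lung,
                  family=poisson.additive,
                  start=predict(fitm, type='response'))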
> summary(fita2)
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -4.846495e-04 1.153262e-03 -0.4202 0.6747
age 3.259046e-05 1.931485e-05 1.6873 0.0929
ph.ecog 9.345934e-04 2.675068e-04 3.4937 0.0006
Finally, while multiplicative models give consistent answers regardless of how finely or coarsely
the person-years are partitioned, this is not the case for additive models. One problem with the
hyperbolic link function is that it is not invariant to subdivision of the data. Predicted values for each
observation are computed near the expected number of events for the observation. When the data is
finely subdivided, the expected number of events is close to zero for each observation and the predicted
values lie on the curved region near the origin of the hyperbolic link function. An easy way to see
if there is a problem is to compare the total number of observed events in the dataset to the total
number of events predicted by the model. If these do not closely agree, try dividing the person-years
less finely or using a smaller ε in the link function. Another simple way to identify a problem with
the fit is to compare sum(predict(fit, type=’link’)) to sum(predict(fit, type=’response’)), which
would be the same for an exactly linear model.
E(y_i) = (λ_{age,sex} + X_i β) t_i
       = λ_{age,sex} t_i + (X_i β) t_i
       = Λ_{i,age,sex,t} + (X_i β) t_i
       = e_i + (t_i X_i) β                        (3)
As before, λage,sex is the appropriate population hazard rate for a particular age and sex combination
(that of the ith subject), and ei is the expected number of events over the time period of observation
for the subject, or, more accurately, the cumulative hazard Λi (ti ) for the person.
Let us return to the MGUS example of the prior section, and examine it in terms of excess risk.
Unfortunately, the pre-multiplication of each variable by time makes the smooth terms of gam models
less useful. The smoothness constraints should be based on the covariates, e.g. s(age) ∗ time. This is
not a form that the gam routine is designed to handle, and is not the same as s(age ∗ time). Because of
this issue, we will make use of natural splines. With natural splines you can either specify the degrees
of freedom or specific knots; the choice of knots was, in this case, based on trial and error.

[Figure 8 here: excess risk per year (0 to 0.25) versus time since SPE for the four groups.]

Figure 8: The additive model was used to estimate the selection effect for male and female patients who are
67-68 years old (≈ 67.5 years) with positive and negative SPE.
As shown in equation 3 the offset for the fit is the number of expected events. The addglm code
uses time as a multiplier for the covariates, which in this case is the person-years (called pyears in the
dataset). Because the time variable for the fit is in years, the coefficients of the fit represent excess
hazard per person-year.
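The fit behind Figures 8 and 9 would then be of roughly this form (a sketch; the knot choices,
the offset syntax, and the handling of the four groups are all assumptions):

> fita3 <- addglm(event ~ offset(expected) + ns(age, knots=c(60, 70, 80)) +
                  ns(dxtime, knots=c(0.5, 1, 5)),
                  time=pyears, data=pyrs.spe, family=poisson.additive,
                  subset=(sex=='female' & mgus==1))
## repeat for the other sex by MGUS combinations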
To draw the plots, we first create natural spline versions of age and follow-up time using the same
settings for knots and Boundary.knots, and then use these to obtain predicted values from the model.
[Figure 9 here: excess risk per year (0 to 0.04) versus age 40 to 90 for the groups.]
Figure 9: The additive model was used to estimate the age effect for a male and female patient with 17-18
months (≈ 1.375 years) of follow-up and with positive and negative SPE.
> key(corner=c(0,1), lines=list(lty=1:4), text=list(c("Negative, female", "Negative, male",
"Positive (MGUS), female", "Positive (MGUS), male")))
The story told by the additive model, as shown in Figures 8 and 9, is quite different than that from
the multiplicative model (Figures 4 and 5). Excess risk, as a function of time from diagnosis, is not the
same for males and females, nor for positive and negative SPE results, and the effect is essentially done
within one year instead of two. The age effect is nearly zero, except for age ≥ 80, as opposed to the
steady decline of the multiplicative model. These differences must be viewed with some caution, since
generalized additive models can sometimes be unstable with respect to the assignment of an effect to
a particular term, especially if the two terms are somewhat correlated, as age and follow-up time must
be.
If we return to Table 2, and look at the bottom margin, we see the same effect. Combining the
first two age groups, the risk ratios are 9.1, 7.2, 4.4, 2.3 and 1.4, similar to the pattern of Figure 5.
The yearly excess risks are .011, .019, .025, .023, and .032, respectively, which is instead a somewhat
upward trend. Note that risk ratio=events/expected and excess risk=(events - expected)/person-
years. The effect in Figure 9 for MGUS males is much sharper at the far right. There are several
possible explanations for this. For instance, the oldest age groups have a much smaller fraction of their
person-years during the first year after diagnosis and so miss the initial period effect.
Again, the major limitation with Figures 8 and 9 is that values are presented for specified values
of age and follow-up time (dxtime). Direct standardization is necessary to better understand the effect
on the cohort as a whole.
## Need to transform values back (instead of using exp as was done for the multiplicative)
> inv.linkFunction <- function(eta, a=.02) { .5*(eta + sqrt(eta^2 + a^2)) }
The direct adjusted estimate is: 3274 deaths per 100,000 +/- 431
The final answer is similar to the multiplicative model that uses the ns function to describe the age
relationship.
[Figure 10 here: excess risk per year (0 to 0.25) versus time since SPE for the four groups,
age-adjusted.]
Figure 10: The additive model was used to estimate the selection effect for male and female SPE positive and
negative patients age-adjusted to the population of subjects who had an SPE test.
3.3.2 Confidence intervals for standardized rates
The procedure for estimating confidence intervals for additive models is similar to that for multiplicative
models (see section 2.3.2). The appropriate variance-covariance matrix and link function are needed
in order to estimate the confidence intervals. The following example estimates a confidence interval
for females without MGUS.
## figure out baseline age distribution of cohort and the proportion in each age-group
> AgeWeights <- table(cut(data.spe$age/365, breaks=seq(20,105,5), left.include=T))/N
> N.age <- length(AgeWeights)
## initialize storage space for final results (at each unique dxtime)
> finalRhat.vector <- rep(NA, N.dxtime)
> finalStd.vector <- rep(NA, N.dxtime)
## Create the appropriate inverse link function for the additive model
## Note: for the multiplicative model, exp is the inverse link of log
> inv.linkFunction <- function(eta, a=.02) { .5*(eta + sqrt(eta^2 + a^2)) }
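Inside the loop over dxtime values, the delta-method computation might look like this sketch;
afit (the additive fit) and X (the predictor matrix for the age groups at the current dxtime)
are assumptions.

> eta <- as.vector(X %*% coef(afit))          ## linear predictors
> r   <- inv.linkFunction(eta)                ## estimated rates
> dr  <- 0.5*(1 + eta/sqrt(eta^2 + .02^2))    ## derivative of the inverse link
> V   <- X %*% summary(afit)$cov %*% t(X)
> D   <- diag(dr)
> finalRhat.vector[i] <- sum(AgeWeights * r)
> finalStd.vector[i]  <- sqrt(t(AgeWeights) %*% D %*% V %*% D %*% AgeWeights)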
4 Summary
The classic methods for event rate data based on person-years tables can all be cast into the framework
of Poisson regression models, using the appropriate offset terms and contrasts of the coefficients. The
standard methods, in fact, correspond to a regression where all predictors are categorical variables.
An advantage of the regression framework is the ability to use continuous predictor variables, and thus
to model the event rates in a smooth way. The resultant estimates may also be more stable in small
samples.
[Figure 11 here: death rates per 100,000 person-years for non-smokers and smokers; annotations
include RR=2.8 and RR=1.3 with Excess Risk=324.]
Figure 11: This two-panel figure shows the excess death risk due to smoking for lung cancer and heart disease.
The panel on the left uses the log scale and summarizes the risk between smokers and non-smokers using the
relative risk (as in the multiplicative model). The panel on the right uses the arithmetic scale and summarizes
the risk using excess risk (as in the additive model).
[Figure 12 here: predicted death rate (0 to 0.25) versus age 20 to 100 for an additive spline fit
and a multiplicative fit.]
Figure 12: Predicted death rates for female MGUS patients using an additive (thinner line) and a multiplicative
(thicker line) model, along with the observed death rates (circles) in each cell of the table.
Interactions of time period with covariates of interest can be used to allow for a change in covariate
effects during periods of high and low event rates in the additive model framework.
It may be true that neither model fits better than the other. For example, Figure 12 shows
predictions from an additive and a multiplicative model for the female MGUS patients. There is very
little difference between the predictions from these two models.
> lines(20:100, predict(pfit4c, type=’response’,
newdata=data.frame(age=20:100, pyears=1)),col=2,lwd=3)
5 Appendix
5.1 SAS code for the examples
5.1.1 Code for section 1.6 Relating Cox and rate regression models
In SAS, proc phreg is used to fit Cox models and proc genmod with log link and Poisson distribution
is used to fit Poisson regression models.
5.1.2 Code for section 2.1 Relative risk regression - basic models
In SAS, the %ltp macro returns the survival probability, from which the expected number of events
can easily be obtained. The use of noint provides separate estimates for males and females. To test
whether the ratio of observed to expected events is different for males and females, remove the noint
option.
Note that SAS and S-Plus differ slightly due to a minor difference in the expected calculation to adjust
for leap years.
model event= time ecog age_ch
/ dist = poisson
noint pscale;
ods output ParameterEstimates=newlink1;
This shows that the survival table for Minnesota is stratified by age, sex, and calendar year. The
table is based on decennial census data for 1970-2000, but is expanded to individual calendar years by
linear interpolation. When invoking one of the routines, it is assumed that the user’s dataset contains
these three variables, with exactly these names, and the correct definitions. In this case, age must
be in days and year must be a date coded as the number of days since 1/1/1960 (which is what SAS
automatically does). The variable ‘sex’ must be a character string or factor with the specified levels
(“male”, “female”).
Given an annual event rate r per 100,000 population, the corresponding daily hazard can be
computed as either

−log(1 − 10^{−5} r)/365.24

or

10^{−5} r/365.24 .
For rare events, these two forms will give nearly identical answers. For larger rates, the proper
choice depends on whether the rate is computed over a population that is static and therefore depleted
by the events in question, or a population that is dynamic and therefore remains approximately the
same size over the interval. The first formula applies to the standard population rate tables, the
second formula may more often apply in epidemiology. In this particular example we will use the
second formula.
There are two reasons for using 365.24 instead of 365.25 in our calculations. First, there are 24
leap years per century, not 25. Second, the use of 0.25 led to some confusing S-Plus results when we
Age Group Men Women
< 35 21 7
35–44 4 21
45–54 47 82
55–64 64 265
65–74 148 546
75–84 449 1067
85+ 1327 1214
Table 4: Incidence of clinically diagnosed vertebral compression fractures among Rochester, MN residents,
1985-1989. The age- and sex-specific rates are expressed per 100,000 person-years.
did detailed testing of the functions, because the S-Plus round function uses a nearest even number
rule, i.e., round(1.5) = round(2.5) = 2. In actual data, of course, this small detail won’t matter a bit.
Despite this, 365.25 is often used.
There are several other important pieces of information in a rate table, which are coded as a vector
of attributes. The most important are:
factor: identifies whether a dimension is time-varying (such as age) or fixed (such as sex or race)
dimid: the variable labels for each dimension
cutpoints: for the time-varying dimensions, the breakpoints between the rows.
The actual dimensions of a rate table are arbitrary, although age and sex are the most common.
Rate tables can have any number of dimensions: the survexp.usr table has age, sex, calendar year,
and race.
In this particular example there are age-groups (7 levels) and sex (2 levels). People need to move
through the age-groups throughout their follow-up whereas sex is fixed and people should not move
through the sexes. Therefore, the factor is set to 0 for age and 1 for sex. The cutpoints are in terms
of days instead of years. All dimensions that involve cutting need to be on the same scale (all in days
or all in years). The summary function for the pyears object is a quick way to see if age is coded
correctly. The call to is.ratetable checks to see if the created object meets some basic checks and is
considered a legal ratetable object.
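The rate matrix itself would have been built first, along these lines (a sketch; inc.rates is assumed
to hold the values of Table 4, and the second conversion formula of section 5.2 is used):

> inc.rates <- data.frame(age.gp=c('<35','35-44','45-54','55-64','65-74','75-84','85+'),
                          male  =c(21, 4, 47, 64, 148, 449, 1327),
                          female=c(7, 21, 82, 265, 546, 1067, 1214))
> exp.vertfx <- cbind(inc.rates$male, inc.rates$female) / (100000 * 365.24)
                                       ## annual rate per 100,000 -> daily hazard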
> attributes(exp.vertfx) <- list(
dim = c(7,2),
dimnames = list(inc.rates$age.gp,
c(’male’,’female’)),
dimid = c(’age’,’sex’),
factor = c(0,1),
cutpoints=list(c(0,seq(35,85,10))*365.24,
NULL),
summary = function(x) {
paste("age ranges from", format(round(min(x[,1])/365.24 ,1)),
"to", format(round(max(x[,1])/365.24 ,1)),
" male:", sum(x[,2]==1), " female:", sum(x[,2]==2)) },
class = ’ratetable’
)
Sometimes it is good to check that everything is working correctly. The following code creates
some fake data, then checks to see if the results match what is expected for the rate table. Note that
the data.frame option returns the results in terms of a data frame, which may be easier to use in
subsequent analyses.
## Test rate table to make sure it is correct - create some fake data with 10 days of follow-up
## Make sure that the variables age and sex are in the dataset
> fakedata <- data.frame(sex=c(1,1,2), days2event=c(10,10,10),
event=c(0,1,0), age=c(37,57,62)*365.24)
5.2.3 Expected rates in SAS for survival analysis
The SAS macro %survexp allows the user to access various rate tables, including MN T = Minnesota
Total for 1970-2000. As in S-Plus, the dataset includes one observation for each age 0-109 and for each
sex (M,F) for a given year. Only decade data can be entered and the macro does linear interpolation
between decades. The population dataset must contain the variables:
• POP = 3-5 character population name, as USER.
• YEAR = decade specification such as 1980 (maximum of 10 decades).
• SEX = sex recorded as (M,F)
• RACE = 2 character race (or other covariate) (must not be missing)
• AGE = numeric age from 0 to 109.
• Q = probability of dying before next birthday (may be set to missing)
• HAZARD = daily hazard for this age, sex, race, and year; that is (per the first formula above),
  −log(1 − Q)/365.24.

5.2.4 Expected rates in SAS for person-years analysis

The local personyrs procedure instead expects a dataset of user-defined expected event rates [2].
That dataset must contain the variables:

• CALYRB = 4 digit numeric variable containing the first calendar year to which the rate applies.
The unique CALYRB values must match those used later in the calyrint statement in the
personyrs procedure. Set this variable to missing if there are no calendar year restrictions.
• AGEB = integer numeric variable containing the first year of age to which the rate applies.
These values must be identical to those defined in the ageyrint statement in the personyrs
procedure.
• MRATE , FRATE = these two numeric variables contain the expected annual event rate per
ratemult for males and females.
data expected;
input ageb mrate frate ;
calyrb = .;
datalines;
00 21 7
35 4 21
45 47 82
55 64 265
65 148 546
75 449 1067
85 1327 1214
;
Now use the expected counts with the same fake data that is used in the S-Plus example. In SAS
we must have age in years, a start date (using dummy 1/1/1900 since we are just testing this and have
1900 entered in our expected counts as calendar year). We also need number of days to event for those
with an event (which is missing for those where no event occurred).
data fakedata;
input sex days2lfu event age;
start_dt=mdy(1,1,1900);
if event=1 then days2event=days2lfu;
if sex=1 then sex_char=’M’; else if sex=2 then sex_char=’F’;
datalines;
1 10 0 37
1 10 1 57
2 10 0 62
;
You need to check the correct follow-up subset and the age/sex categories to see that you get
the results that you expect. Note that sex must be coded M/F and age must be in years.
5.3.2 poisson.additive
Initially we created a Poisson link function for additive families that had a failsafe for predicted
values that are too small, but it didn’t have a continuous derivative function, which can cause some
convergence problems.
f(η) = η                     for η > ε
f(η) = ε e^{η/ε − 1}         for η ≤ ε
The improved Poisson link function has a shape that is quite similar to the initial one, but because
it is hyperbolic it has a continuous derivative. Both of these options are available in the function listed
below where the original is called “exponential” and the newer link is called “hyperbolic”. In addition,
there is a “positive” option which only uses positive values of η.
# Note that "a" is epsilon in the link function formulas
5.3.3 addglm
The addglm function is available from the Mayo website at: http://www.mayo.edu/biostatistics. Its
primary purpose is to simplify the model statement. The two fits listed below produce the same results.
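Based on the output shown below, the two equivalent calls would be approximately as follows
(a sketch; only the coefficient names are taken from the output):

> fit  <- addglm(event ~ age + ph.ecog, time=time,
                 data=lung, family=poisson.additive)
> fit2 <- glm(event ~ time + I(time*age) + I(time*ph.ecog) - 1,
              data=lung, family=poisson.additive)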
[Figure 13 here: two panels plotting the inverse link f(η) for η from −2 to 2; the left panel shows
the ‘exponential’ piecewise linear function, the right the hyperbolic one.]

Figure 13: The left plot depicts the inverse link function f(η) for the “exponential” solution and the right plot
depicts the “hyperbolic” solution. The latter has the added benefit that f is smooth with a continuous first
derivative. Unless η is near zero, there will be little difference in the results.
> summary(fit2)
Coefficients:
Value Std. Error t value Pr(>|t|)
time -4.846495e-04 1.153262e-03 -0.4202 0.6747
I(time * age) 3.259046e-05 1.931485e-05 1.6873 0.0929
I(time * ph.ecog) 9.345934e-04 2.675068e-04 3.4937 0.0006
> summary(fit)
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -4.846495e-04 1.153262e-03 -0.4202 0.6747
age 3.259046e-05 1.931485e-05 1.6873 0.0929
ph.ecog 9.345934e-04 2.675068e-04 3.4937 0.0006
5.4 A closer look at differences between gam and glm
The generalized linear model glm has the form

g(E(y)) = β₀ + x₁β₁ + · · · + x_pβ_p
Var(y) = Φ V(E(y))
where g is the link function, V is the variance function, and Φ is a constant. The error terms are
allowed to be non-Gaussian, and it is possible to have a non-constant variance. Different error models
are handled by a reparameterization to induce linearity. The variance of y is allowed to depend on the
expected value of y rather than remain constant.
The generalized additive model gam has the form

g(E(y)) = f₁(x₁) + · · · + f_p(x_p)
Var(y) = Φ V(E(y))
where g is the link function, V is the variance function, Φ is a constant and f1 , . . . , fp are functions.
Therefore g(E(y|x)) is modeled as a sum of functions of the predictors. The predictor functions
f1 , . . . , fp can be parametric functions (equivalent to a generalized linear model) or nonparametric
functions based on smoothers (e.g., loess (lo), spline smoothers (s)).
The advantage of using a nonparametric smoother is that the user doesn't need to specify knots
or degrees of freedom. These functions are great as an exploratory tool. The disadvantage is that it
is difficult to calculate confidence intervals for model predictions. In that case, it is easier to use the
generalized linear model and parametric splines (e.g. ns, bs) or polynomial fits (e.g. poly).
The S-Plus language automatically uses predict.gam when used with a gam model and predict.glm
when used with a glm model. Trying to use the inappropriate summary or prediction method can
cause a whole host of problems, as illustrated by the following examples.
## WRONG, only use predict.glm with glm fits and predict.gam with gam fits
> predict.glm(test, newdata=tmp, type=’response’)
1 2 3
51.23453 18.29896 8.555364
## Values match predict.gam with gam model
> predict.glm(test2, newdata=tmp, type=’response’)
1 2 3
3.936414 2.906844 2.146558
References
[1] P. K. Andersen, O. Borgan, R. D. Gill, and N. Keiding. Statistical Models Based on Counting
Processes. Springer-Verlag, New York, 1993.
[2] E. J. Bergstralh, K. P. Offord, J. L. Kosanke, and G. A. Augustine. Personyrs: A SAS procedure
for person year analyses. Technical Report 31, Department of Health Sciences Research, Mayo
Clinic, 1986.
[3] J. Berkson. The statistical study of association between smoking and lung cancer. Proceedings of
the Staff Meetings of the Mayo Clinic, 30:319–348, 1955.
[4] G. Berry. The analysis of mortality by the subject-years method. Biometrics, 39:173–184, 1983.
[5] C. Cooper, E. J. Atkinson, W. M. O’Fallon, and L. J. Melton. Incidence of clinically diagnosed
vertebral fractures: A population-based study in Rochester, Minnesota, 1985-1989. JBMR, 7:221–
227, 1992.
[6] H. Inskip. Standardization methods. In P. Armitage and T. Colton, editors, Encyclopedia of
Biostatistics, volume 6, pages 4237–50. Wiley, 1998.
[7] Robert A. Kyle, Terry M. Therneau, S. Vincent Rajkumar, Janice R. Offord, Dirk R. Larson,
Matthew F. Plevak, and L. Joseph Melton III. A long-term study of prognosis in monoclonal
gammopathy of undetermined significance. New England J Medicine, 346:564–569, 2002.
[8] C. L. Loprinzi, J. A. Laurie, H. S. Wieand, J. E. Krook, P. J. Novotny, J. W. Kugler, J. Bartel,
M. Law, M. Bateman, N. E. Klatt, A. M. Dose, P. S. Etzell, R. A. Nelimark, J. A. Mailliard, and
C. G. Moertel. Prospective evaluation of prognostic variables from patient-completed question-
naires. J Clin Oncology, 12:601–607, 1994.
[9] P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman and Hall, London, 1989.
[10] A. Patel, R. Norton, and S. MacMahon. The HRT furor: getting the message right. Med J Aust,
177:345–346, 2002.
[11] J.E. Rossouw, G.L. Anderson, R.L. Prentice, A.Z. LaCroix, C. Kooperberg, M.L. Stefanick, R.D.
Jackson, S.A. Beresford, B.V. Howard, K.C. Johnson, J.M. Kotchen, J. Ockene, and the Writing
Group for the Women’s Health Initiative Investigators. Risks and benefits of estrogen plus progestin in
healthy postmenopausal women: principal results from the Women’s Health Initiative randomized
controlled trial. J Amer Med Assoc, 288:321–333, 2002.