Modeling Count Data
Joseph M. Hilbe
Arizona State University
A key feature of the Poisson model is the equality of the mean and variance
functions. When the variance of a Poisson model exceeds its mean, the model is
termed overdispersed. Simulation studies have demonstrated that overdispersion is
indicated when the Pearson χ² dispersion statistic is greater than 1.0 (Hilbe,
2007). The dispersion statistic is defined as the Pearson χ² statistic divided by
the model residual degrees of freedom. Overdispersion, common to most Poisson
models, biases the parameter estimates and fitted values. When Poisson
overdispersion is real, and not merely apparent (Hilbe, 2007), a count model other
than Poisson is required.
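
The dispersion statistic defined above can be sketched in a few lines. The following is only an illustration: the function name `pearson_dispersion` and the toy counts are hypothetical, and the fitted means are assumed to come from an already estimated Poisson model (here an intercept-only fit, so every fitted mean equals the sample mean).

```python
def pearson_dispersion(y, mu, n_params):
    """Pearson chi-square statistic divided by the residual degrees of freedom.

    y        : observed counts
    mu       : fitted Poisson means (same length as y, all > 0)
    n_params : number of estimated parameters, including the intercept
    """
    if len(y) != len(mu):
        raise ValueError("y and mu must have the same length")
    resid_df = len(y) - n_params
    if resid_df <= 0:
        raise ValueError("need more observations than parameters")
    # The Poisson variance function is V(mu) = mu, so each term is (y - mu)^2 / mu.
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / resid_df


# Hypothetical counts; an intercept-only model fits the sample mean to every case.
counts = [0, 0, 1, 1, 2, 3, 5, 8, 13, 0]
sample_mean = sum(counts) / len(counts)
dispersion = pearson_dispersion(counts, [sample_mean] * len(counts), n_params=1)
```

A value well above 1.0, as these toy counts produce, is the signal of overdispersion discussed in the text; whether it is real or merely apparent still has to be judged separately.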
Several methods have been used to accommodate Poisson overdispersion.
Two common methods are quasi-Poisson and negative binomial regression.
Quasi-Poisson models have generally been understood in two distinct manners. The
traditional manner has the Poisson variance being multiplied by a constant term.
The second, employed in the glm() function that is installed by default with R,
is to multiply the standard errors by the square root of the Pearson dispersion
statistic. This method of adjustment to the variance has traditionally been
referred to as scaling. Using R's quasipoisson() family function with glm() is the
same as what is known in standard GLM terminology as the scaling of standard
errors.
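
Scaling amounts to a single multiplication, sketched below. The function name `scale_standard_errors` and the numeric inputs are hypothetical; the sketch assumes the model-based Poisson standard errors and the Pearson dispersion statistic have already been computed.

```python
import math


def scale_standard_errors(std_errors, dispersion):
    """Multiply model-based standard errors by sqrt(Pearson dispersion)."""
    if dispersion <= 0:
        raise ValueError("dispersion must be positive")
    factor = math.sqrt(dispersion)
    return [se * factor for se in std_errors]


# Hypothetical Poisson standard errors and a dispersion statistic of 4.0:
# every standard error doubles, while the coefficients themselves are unchanged.
scaled = scale_standard_errors([0.05, 0.10], dispersion=4.0)
```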
The traditional negative binomial model is a Poisson-gamma mixture model
with a second ancillary or heterogeneity parameter, $\alpha$. The mixture nature of
the variance is reflected in its form, $\mu_i + \alpha\mu_i^2$, or
$\mu_i(1 + \alpha\mu_i)$. The Poisson variance is $\mu_i$, and the two-parameter
gamma variance is $\mu_i^2/\nu$. $\nu$ is inverted so that $\alpha = 1/\nu$, which
allows for a direct relationship between $\mu_i$ and $\nu$. In a Poisson-gamma
mixture model, counts are gamma distributed as they enter into the model. $\alpha$
describes the shape of the distribution by which counts enter the model and also
measures the amount of Poisson overdispersion in the data.
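
The mixture construction can be checked by simulation. The sketch below is an illustration under stated assumptions, not a production sampler: each observation's Poisson rate is drawn from a gamma distribution with mean µ and variance αµ², and the Poisson draw uses Knuth's simple multiplication method, which is adequate only for modest rates. All function names and parameter values are hypothetical.

```python
import math
import random


def draw_poisson(lam, rng):
    """Draw one Poisson(lam) variate by Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def draw_nb2(mu, alpha, rng):
    """Draw a gamma-distributed rate with mean mu, then a Poisson count."""
    shape = 1.0 / alpha      # nu = 1 / alpha
    scale = alpha * mu       # shape * scale = mu, shape * scale^2 = alpha * mu^2
    rate = rng.gammavariate(shape, scale)
    return draw_poisson(rate, rng)


def sample_mean_var(values):
    n = len(values)
    m = sum(values) / n
    v = sum((x - m) ** 2 for x in values) / (n - 1)
    return m, v


rng = random.Random(12345)
mu, alpha = 2.0, 0.5
draws = [draw_nb2(mu, alpha, rng) for _ in range(20000)]
mean, var = sample_mean_var(draws)
# The sample mean should sit near mu = 2 and the sample variance
# near mu + alpha * mu**2 = 4, up to simulation error.
```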
The negative binomial probability mass function (see Geometric and negative
binomial distributions) may be formulated as

$$f(y_i; \mu_i, \alpha) = \binom{y_i + 1/\alpha - 1}{1/\alpha - 1} \left(\frac{1}{1 + \alpha\mu_i}\right)^{1/\alpha} \left(\frac{\alpha\mu_i}{1 + \alpha\mu_i}\right)^{y_i}, \tag{5}$$

with a log-likelihood function specified as:
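
$$\mathcal{L}(\mu_i; y_i, \alpha) = \sum_{i=1}^{n} \left\{ y_i \ln\left(\frac{\alpha\mu_i}{1 + \alpha\mu_i}\right) - \frac{1}{\alpha}\ln(1 + \alpha\mu_i) + \ln\Gamma\left(y_i + \frac{1}{\alpha}\right) - \ln\Gamma(y_i + 1) - \ln\Gamma\left(\frac{1}{\alpha}\right) \right\}. \tag{6}$$
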
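
Equations (5) and (6) translate directly into code once the binomial coefficient is written with gamma functions. The sketch below is a minimal illustration with hypothetical function names; it works on the log scale with `math.lgamma` to avoid overflow, which is a common numerical choice rather than anything prescribed by the text.

```python
import math


def nb2_log_pmf(y, mu, alpha):
    """Log of equation (5), with the binomial coefficient as log-gamma terms."""
    inv_alpha = 1.0 / alpha
    return (
        y * math.log(alpha * mu / (1.0 + alpha * mu))
        - inv_alpha * math.log(1.0 + alpha * mu)
        + math.lgamma(y + inv_alpha)
        - math.lgamma(y + 1.0)
        - math.lgamma(inv_alpha)
    )


def nb2_pmf(y, mu, alpha):
    return math.exp(nb2_log_pmf(y, mu, alpha))


def nb2_loglik(ys, mus, alpha):
    """Equation (6): the sum of the per-observation log-probabilities."""
    return sum(nb2_log_pmf(y, m, alpha) for y, m in zip(ys, mus))


# Sanity checks with hypothetical values mu = 2, alpha = 0.5:
# the probabilities should sum to 1 and reproduce the mean.
total = sum(nb2_pmf(y, 2.0, 0.5) for y in range(200))
mean = sum(y * nb2_pmf(y, 2.0, 0.5) for y in range(200))
```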
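In terms of $\mu_i = \exp(x_i'\beta)$, required for maximum likelihood estimation,
the negative binomial log-likelihood appears as

$$\mathcal{L}(\beta; y_i, \alpha) = \sum_{i=1}^{n} \left\{ y_i \ln\left(\frac{\alpha\exp(x_i'\beta)}{1 + \alpha\exp(x_i'\beta)}\right) - \frac{1}{\alpha}\ln\left(1 + \alpha\exp(x_i'\beta)\right) + \ln\Gamma\left(y_i + \frac{1}{\alpha}\right) - \ln\Gamma(y_i + 1) - \ln\Gamma\left(\frac{1}{\alpha}\right) \right\}. \tag{7}$$
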
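
As a rough illustration of equation (7), the function below evaluates the log-likelihood for a coefficient vector and a fixed α. It is a sketch only: the function name and the toy design matrix are hypothetical, α is treated as known, and no optimizer is included. In practice the function would be handed to a numerical maximizer that also estimates α.

```python
import math


def nb2_loglik_beta(beta, X, y, alpha):
    """Equation (7): NB2 log-likelihood with mu_i = exp(x_i' beta)."""
    inv_alpha = 1.0 / alpha
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, x_i))  # linear predictor x_i' beta
        mu = math.exp(eta)
        total += (
            y_i * math.log(alpha * mu / (1.0 + alpha * mu))
            - inv_alpha * math.log(1.0 + alpha * mu)
            + math.lgamma(y_i + inv_alpha)
            - math.lgamma(y_i + 1.0)
            - math.lgamma(inv_alpha)
        )
    return total


# Hypothetical toy data: an intercept column plus one binary predictor.
X = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
y = [1, 3, 4, 8]
# The group means are 2 and 6, so beta = (ln 2, ln 3) reproduces them exactly.
ll_at_means = nb2_loglik_beta([math.log(2.0), math.log(3.0)], X, y, alpha=0.5)
ll_elsewhere = nb2_loglik_beta([0.0, 0.0], X, y, alpha=0.5)
```

With α held fixed, the log-likelihood is largest where the fitted means match the group means, so `ll_at_means` exceeds `ll_elsewhere`.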
This form of negative binomial has been termed NB2, due to the quadratic nature
of its variance function. It should be noted that the NB2 model reduces to the
Poisson when α = 0. When α = 1, the model is geometric, taking the shape of the
discrete correlate of the continuous negative exponential distribution. Several fit
tests exist that evaluate whether data should be modeled as Poisson or NB2 based
on the degree to which α differs from 0.
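
The two special cases can be verified numerically. The sketch below restates the NB2 probability function so that it is self-contained, then compares it with the Poisson at a very small α (the Poisson is reached in the limit, since 1/α is undefined at exactly 0) and with the geometric distribution at α = 1. Function names and the chosen mean are hypothetical.

```python
import math


def nb2_pmf(y, mu, alpha):
    inv_alpha = 1.0 / alpha
    log_p = (
        y * math.log(alpha * mu / (1.0 + alpha * mu))
        - inv_alpha * math.log1p(alpha * mu)
        + math.lgamma(y + inv_alpha)
        - math.lgamma(y + 1.0)
        - math.lgamma(inv_alpha)
    )
    return math.exp(log_p)


def poisson_pmf(y, mu):
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1.0))


def geometric_pmf(y, mu):
    """Geometric distribution on 0, 1, 2, ... parameterized by its mean."""
    p = 1.0 / (1.0 + mu)
    return p * (1.0 - p) ** y


mu = 2.0
# Largest pointwise gap between NB2 and each special case over y = 0..10.
max_gap_poisson = max(abs(nb2_pmf(y, mu, 1e-6) - poisson_pmf(y, mu)) for y in range(11))
max_gap_geometric = max(abs(nb2_pmf(y, mu, 1.0) - geometric_pmf(y, mu)) for y in range(11))
```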
When exponentiated, Poisson and NB2 parameter estimates may be interpreted as
incidence rate ratios. For example, given a random sample of 1000 patient
observations from the German Health Survey for the year 1984, the following
Poisson model output explains the year's expected number of doctor visits on the
basis of gender and marital status, both recorded as binary (1/0) variables, and
the continuous predictor, age.
docvis     IRR        OIM Std. Err.   z        P>|z|   [95% Conf. Interval]
female     1.516855   .054906         11.51    0.000   1.41297    1.628378
married    .8418408   .0341971        -4.24    0.000   .7774145   .9116063
age        1.018807   .0016104        11.79    0.000   1.015656   1.021968
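
The columns of the table are linked by simple arithmetic, which the sketch below reproduces from the IRR and standard error alone. It assumes the usual convention for such output: the IRR standard error is obtained by the delta method, and the confidence limits are the exponentiated limits of the coefficient. That convention is an assumption here, although the z statistics and intervals it yields agree with the table. Read this way, an IRR of about 1.52 for female corresponds to an expected rate of visits roughly 52% higher than for males, with the other predictors held constant.

```python
import math

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution


def irr_summary(irr, irr_se):
    """Recover coefficient-scale quantities from an IRR and its standard error."""
    coef = math.log(irr)        # coefficient on the log scale
    coef_se = irr_se / irr      # delta method: se(exp(b)) = exp(b) * se(b)
    z = coef / coef_se
    lower = math.exp(coef - Z_975 * coef_se)
    upper = math.exp(coef + Z_975 * coef_se)
    return {"coef": coef, "z": z, "ci": (lower, upper)}


# IRR and standard error columns copied from the table above.
table = {
    "female": (1.516855, 0.054906),
    "married": (0.8418408, 0.0341971),
    "age": (1.018807, 0.0016104),
}
summaries = {name: irr_summary(*row) for name, row in table.items()}
```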
The NB-C model better fits certain types of count data than NB2, or any other
variety of count model. However, since its fitted values are not on the log scale,
comparisons cannot be made to Poisson or NB2.
The NB2 model, in a similar manner to the Poisson, can also be overdispersed
if the model variance exceeds its nominal variance. In such a case one must attempt
to determine the source of the extra correlation and model it accordingly.
The extra correlation that can exist in count data, but which cannot be
accommodated by simple adjustments to the Poisson and negative binomial
algorithms, has stimulated the creation of a number of enhancements to the two
base count models. The differences among these enhanced models relate to the
attempt to identify the various sources of overdispersion.
For instance, both the Poisson and negative binomial models assume that
there exists the possibility of having zero counts. If a given set of count data
excludes that possibility, the resultant Poisson or negative binomial model will likely
be overdispersed. Modifying the log-likelihood function of these two models in order
to adjust for the non-zero distribution of counts will eliminate the overdispersion, if
there are no other sources of extra correlation. Such models are called, respectively,
zero-truncated Poisson and zero-truncated negative binomial models.
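
For the Poisson case, the modification to the log-likelihood amounts to dividing each probability by the probability of a non-zero count, 1 − exp(−µ). The sketch below illustrates this with hypothetical function names; the negative binomial version would follow the same pattern with the NB2 probability of zero in place of exp(−µ).

```python
import math


def poisson_log_pmf(y, mu):
    return y * math.log(mu) - mu - math.lgamma(y + 1.0)


def ztp_log_pmf(y, mu):
    """Zero-truncated Poisson: Poisson probability rescaled by P(Y > 0)."""
    if y < 1:
        raise ValueError("zero-truncated counts start at 1")
    # -expm1(-mu) equals 1 - exp(-mu), computed accurately for small mu.
    return poisson_log_pmf(y, mu) - math.log(-math.expm1(-mu))


def ztp_mean(mu):
    """Mean of the truncated distribution, which exceeds the Poisson mean mu."""
    return mu / -math.expm1(-mu)


# Checks with a hypothetical mu = 1.5: probabilities over y >= 1 sum to 1.
total = sum(math.exp(ztp_log_pmf(y, 1.5)) for y in range(1, 101))
mean = sum(y * math.exp(ztp_log_pmf(y, 1.5)) for y in range(1, 101))
```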
Likewise, if the data consist of far more zero counts than allowed by the
distributional assumptions of the Poisson or negative binomial models, a
zero-inflated set of models may need to be designed. Zero-inflated models are
mixture models, with one part consisting of a 1/0 binary response model, usually a
logistic regression, where the probability of a zero count is estimated as opposed
to a non-zero count. A second component generally consists of a Poisson or negative
binomial model that estimates the full range of count data, adjusting for the
overlap in estimated zero counts. The point is to 1) determine the estimates that
account for zero counts, and 2) estimate the adjusted count model data.
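
A zero-inflated Poisson probability function makes the overlap in zero counts explicit: a zero can arise either from the binary component or from the count component. The sketch below is a minimal illustration in which the zero probability comes from a logistic function of a hypothetical linear predictor; the names and parameter values are assumptions made for the example.

```python
import math


def logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def poisson_pmf(y, mu):
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1.0))


def zip_pmf(y, mu, pi_zero):
    """Zero-inflated Poisson probability.

    pi_zero : probability, from the binary component, of an excess zero
    mu      : mean of the Poisson count component
    """
    if y == 0:
        # Zeros come from both components.
        return pi_zero + (1.0 - pi_zero) * poisson_pmf(0, mu)
    return (1.0 - pi_zero) * poisson_pmf(y, mu)


# Hypothetical parameter values for illustration.
pi_zero = logistic(-1.0)
mu = 2.0
total = sum(zip_pmf(y, mu, pi_zero) for y in range(100))
mean = sum(y * zip_pmf(y, mu, pi_zero) for y in range(100))
```

The resulting mean is (1 − π)µ, and the probability of a zero is larger than the Poisson component alone would allow.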
Hurdle models are another type of mixture model designed for excessive zero
counts. However, unlike the zero-inflated models, the hurdle-binary model
estimates the probability of being a non-zero count in comparison to a zero count;
the hurdle-count component is estimated on the basis of a zero-truncated count
model. Zero-truncated, zero-inflated, and hurdle models all address abnormal
zero-count situations, which violate essential Poisson and negative binomial
assumptions.
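
The contrast with the zero-inflated model is easiest to see in the probability function: in a hurdle model every zero comes from the binary part, and positive counts follow a zero-truncated distribution. The Poisson hurdle sketch below uses hypothetical names and parameter values.

```python
import math


def poisson_pmf(y, mu):
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1.0))


def hurdle_poisson_pmf(y, mu, p_positive):
    """Poisson hurdle probability.

    p_positive : probability, from the binary component, of a non-zero count
    mu         : mean parameter of the untruncated Poisson
    """
    if y == 0:
        return 1.0 - p_positive
    # Positive counts follow the zero-truncated Poisson.
    truncated = poisson_pmf(y, mu) / -math.expm1(-mu)
    return p_positive * truncated


# Hypothetical values: 60% of observations clear the hurdle.
total = sum(hurdle_poisson_pmf(y, 2.0, 0.6) for y in range(100))
```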
Some of the more recently developed count models include finite mixture
models and exact Poisson regression. Finite mixture models allow the count
response to have been created from two or more separate generating mechanisms.
For example, a portion of the counts may have a Poisson distribution with a mean
of 0.5, with another portion having a Poisson distribution with a mean of 4, so
that the response consists of two separate underlying distributions. Such a model
allows estimation of more complex structures of counts than do standard Poisson
and negative binomial models. Exact Poisson models are not based on the asymptotic
methods characteristic of maximum likelihood or generalized linear model
estimation; rather, they are based on the construction of a statistical
distribution that can be thoroughly enumerated. This highly iterative technique
allows appropriate estimation of parameters and confidence intervals for small and
unbalanced data that could not otherwise be modeled using conventional estimation
methods.
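
The two-component example above, with Poisson means of 0.5 and 4, can be written out as a finite mixture probability function. In the sketch below the mixing weights of 0.6 and 0.4 are hypothetical, chosen only for illustration; in an actual finite mixture model they would be estimated along with the component means.

```python
import math


def poisson_pmf(y, mu):
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1.0))


def mixture_pmf(y, weights, means):
    """Finite mixture of Poisson components with the given mixing weights."""
    return sum(w * poisson_pmf(y, m) for w, m in zip(weights, means))


weights = (0.6, 0.4)   # hypothetical mixing proportions, summing to 1
means = (0.5, 4.0)     # component means taken from the example in the text

support = range(100)
total = sum(mixture_pmf(y, weights, means) for y in support)
mean = sum(y * mixture_pmf(y, weights, means) for y in support)
var = sum((y - mean) ** 2 * mixture_pmf(y, weights, means) for y in support)
```

Although each component is equidispersed, the mixture is not: its variance exceeds its mean, which is why a single Poisson model fitted to such data would appear overdispersed.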
Other violations of the distributional assumptions of Poisson and negative
binomial probability distributions exist. The table below summarizes major types
of violations that have resulted in the creation of specialized count models.
The four texts listed in the References below are specifically devoted to describing
the theory and variety of count models, and are currently regarded as standard
resources on the subject. A number of journal articles and book chapters have
been written on the subject. Other texts dealing with discrete response models
in general, as well as texts on generalized linear models (see Generalized linear
models), also have descriptions of count models, although only a few go beyond
examining basic Poisson and negative binomial regression.
References
[ 1 ] Cameron, A. C. and P. K. Trivedi (1986). Econometric models based on count data:
Comparisons and applications of some estimators, Journal of Applied Econometrics, 1:
29-53.
[ 2 ] Cameron, A. C. and P. K. Trivedi (1998). Regression analysis of count data. New York:
Cambridge University Press.
[ 3 ] Hilbe, J. M. (1993). Log-negative binomial regression as a generalized linear model,
Technical report COS 93/94-5-26, Department of Sociology, Arizona State University.
[ 4 ] Hilbe, J. M. (2007). Negative binomial regression. Cambridge, UK: Cambridge
University Press.
[ 5 ] Hilbe, J. M. (2011). Negative binomial regression. 2nd edition. Cambridge, UK:
Cambridge University Press. In press.
[ 6 ] Hilbe, J. M. and W. H. Greene (2007). Count response regression models, in (eds)
C.R. Rao, J.P. Miller, and D.C. Rao, Epidemiology and Medical Statistics, Elsevier
Handbook of Statistics Series, London: Elsevier.
[ 7 ] Hinde, J. and C. G. B. Demetrio (1998). Overdispersion: models and estimation,
Computational Statistics and Data Analysis, Vol 27, 2: 151-170.
[ 8 ] Lawless, J. F. (1987). Negative binomial and mixed Poisson regression, Canadian
Journal of Statistics, 15, 3: 209-225.
[ 9 ] Long, J. S. (1997). Regression models for categorical and limited dependent variables.
Thousand Oaks, CA: Sage.
[ 10 ] Simon, L. J. (1960). The negative binomial and Poisson distributions compared.
Proceedings of the Casualty and Actuarial Society XLVII: 20-24.
[ 11 ] Winkelmann, R. (2008). Econometric Analysis of Count Data. 5th edition, Heidelberg,
Ger: Springer.