0% found this document useful (0 votes)
22 views24 pages

Shorten - Count Data Analysis

This document provides an overview of count data analysis methods including Poisson regression, negative binomial regression, and zero-inflated Poisson regression. It discusses the Poisson distribution and its properties. It also covers interpreting loglinear models, examples of count data, and fitting count data models in Stata. Key count data modeling techniques are explained including dealing with overdispersion using negative binomial regression and modeling excess zeros using zero-inflated Poisson regression.

Uploaded by

woretawendalew91
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
22 views24 pages

Shorten - Count Data Analysis

This document provides an overview of count data analysis methods including Poisson regression, negative binomial regression, and zero-inflated Poisson regression. It discusses the Poisson distribution and its properties. It also covers interpreting loglinear models, examples of count data, and fitting count data models in Stata. Key count data modeling techniques are explained including dealing with overdispersion using negative binomial regression and modeling excess zeros using zero-inflated Poisson regression.

Uploaded by

woretawendalew91
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 24

Count data analysis

By Lemma Derseh
Presentation outline
Poisson regression

Negative binomial Poisson regression

Zero-inflated Poisson regression


Poisson distribution
• The Poisson distribution assigns a positive probability to every
nonnegative integer 0, 1, 2, . . ., so that every nonnegative integer
becomes a mathematical possibility (albeit practically zero
possibility for most count values)

• The Poisson is different than the binomial, Bin(n, π), which takes on
numbers only up to some n, and leads to a proportion (out of n)

• But the Poisson is similar to the binomial in that it can be shown


that the Poisson is the limiting distribution of a Binomial for large n
and small π

• Furthermore, because of the simple form of the Poisson


distribution, it is often computationally preferred over the Binomial
Poisson distribution
 We cannot consider count data as continuous normal because:
 Count distribution is too skewed to satisfy normality
(incorrect test results)
 Normal model does not necessarily prevent negative
estimated counts

 If we just dichotomize count data (e.g. 0 vs >0) and analyze it


using logistic regression, then we:
 Loss information resulting in under-powered tests
 For instance 1 event is not really equal to 100 events
Poisson distribution
• Expected number of counts (per unit of time) is strictly
positive

• As mean increases, distribution approximates normal


Examples of count data
• Number of deaths due to HIV/AIDs in a region per month

• Number of car accidents in a city

• Number of malaria cases in a district


Interpretation of loglinear models
Let Yi be the observed count for experimental unit i
Yi|Xi ~ Poisson(μi)
log(μi) = Xi β

Assumption: the mean (i.e. μ) equals the variance (i.e. σ2)

• We think that the covariates influence the mean of the counts (μ)
in a multiplicative way, i.e. as a covariate increases by 1 unit, the
log of the mean increases by β units and this implies the mean
increases by a “fold-change” of or “scale factor” of exp(β).

• The log link is the canonical link in GLM for Poisson distribution
Interpretation of loglinear model

• Considering only one covariate, the model is:


log(μ) = α + βx

• Since the log of the expected value of Y is a linear function of


explanatory variable(s), and the expected value of Y is a
multiplicative function of x:
μ = eα + βx
= eαeβx

What does this mean for μ? How do we interpret ?


Interpretation: Example
 Let a Poisson process involves the event of myocardial infarction (MI)
encountered per month and is required to model it using a single factor
say age of study participants (in years). Suppose the estimates are:
Intercept or α= 0.3396 and β=0.02565

 Our estimated model will be:


log(μ) = 0.3396 + 0.02565x

 Therefore, for a 1-year increase in age, the expectation (or mean)


number of event of MI increases by a factor of e0.2565 = 1.026 per
month

 That is, the number of MIs is growing at the rate of 2.6% per month
due to a one year increase in age
Interpretation

• All the findings mentioned above are based on the assumption


that the mean equals to the variance or equi-dispersion.

• Empirically, however, we often find data that exhibit over-


dispersion, with a variance larger than the mean and rarely
under dispersion, when the variance less than the mean

• Thus we need to have models that accommodate the excess


variation.
Fitting count data

(Stata Oriented)
Data: We use data from the STATA log provided by Long (1990) on the number
of publications produced by Ph.D. biochemists to illustrate the application on
the models mentioned. The variables in the data set include art: articles in last
three years of Ph.D., fem: coded one for females, mar: coded one if married,
kid5: number of children under age five, phd: prestige of PhD program, and
ment: articles published by mentor in last three years.

It can be accessed from the web via: use http://www.stata-


press.com/data/lf2/couart2,clear
Example

Data assessment
• After running the data using Stata: sum art, one can see that the
mean number of articles is 1.69 and the variance is 3.71

• a bit more than twice the mean

• The data are over-dispersed, but of course we haven’t


considered any covariates
A Poisson Model
 We can fit Poisson model with the five covariates in two
ways:

i. Use the command: poison

ii. Use the command: glm

 We will also store the estimates for latter use using the
command: estimates store poison
A Poisson Model
. glm art fem mar kid5 phd ment, family(poisson) nolog

Generalized linear models No. of obs = 915


Optimization : ML Residual df = 909
Scale parameter = 1
Deviance = 1634.370984 (1/df) Deviance = 1.797988
Pearson = 1662.54655 (1/df) Pearson = 1.828984

Variance function: V(u) = u [Poisson]


Link function : g(u) = ln(u) [Log]

AIC = 3.621981
Log likelihood = -1651.056316 BIC = -4564.031

OIM
art Coef. Std. Err. z P>|z| [95% Conf. Interval]

fem -.2245942 .0546138 -4.11 0.000 -.3316352 -.1175532


mar .1552434 .0613747 2.53 0.011 .0349512 .2755356
kid5 -.1848827 .0401272 -4.61 0.000 -.2635305 -.1062349
phd .0128226 .0263972 0.49 0.627 -.038915 .0645601
ment .0255427 .0020061 12.73 0.000 .0216109 .0294746
_cons .3046168 .1029822 2.96 0.003 .1027755 .5064581
Negative Binomial approach
• In the context of count data, consider the assumption that the
variance is proportional to the mean. Specifically, var(Y ) = φE(Y ) =
φμ

• If φ= 1 then the variance equals the mean and we obtain the


Poisson mean-variance relationship. If φ > 1 then we have over-
dispersion, and if φ < 1 we would have under-dispersion, but this is
relatively rare

• φ= 1 when the outcome data is generated by pure Poisson process;


φ< 1 when outcome events are happening in a regulated manner;
φ> 1 when there are unobserved heterogeneities that should be
captured by random effect models like negative binomial or mixed
methods or both
Negative Binomial
• Modeling over-dispersion in count data should start from a Poisson
regression model and add a multiplicative random effect to represent
unobserved heterogeneity which leads into the negative binomial
regression model

• Suppose that the conditional distribution of the outcome Y given an


unobserved variable effect θ is Poisson with mean and variance μθ.
That is, Y|θ ~ P(θμ)

• In the context of the Ph.D. biochemists data, θ captures unobserved


factors effect that increase (if θ > 1) or decrease (if θ < 1) productivity
relative to what one would expect given the observed values of the
covariates, which is of course μ where logμ = x’β.

• For convenience we take E(θ) = 1, that represents the expected


outcome for the average individual given covariates x.
Negative Binomial
• Stata has a command called nbreg that can fit the negative
binomial model described here by maximum likelihood.

• The output uses alpha to label the variance of θ the unobserved


variable, which we call σ2.
Negative Binomial Regression

• Stata’s alpha is the variance of the multiplicative random effect and


corresponds to σ2
• It is estimated to be 0.44 and is statistically significant
Negative Binomial Regression
• Because the Poisson model is a special case of the negative binomial
when σ2 =0, we can use a likelihood ratio test to compare the two
models.

• Or to test the significance of Stata’s alpha we may think of computing


twice the difference in log-likelihoods between this model and the
Poisson model, 180.2, and treating it as chi-squared with one df.

• However, the usual asymptotic do not apply, because the null


hypothesis is on a boundary of the parameter space

• A better approximation is to treat the statistic as 50:50 mixture of zeros


and chi-squared with one df

• And Stata implements this procedure, reporting the statistics as


chi2bar.
Negative Binomial Regression

• Alternatively to compare the two models, we can treat the


statistics as a chi-square with one df which gives a conservative
test

• To test hypothesis about the regression coefficients, we can use


either the Wald or LR tests, which are possible because we have
made full distributional assumptions.
Zero-Inflated Poisson
• One way to model when there is a situation of inflated zeros is to
assume that the data come from a mixture of two populations,
one where the counts is always zero, and another where the
count has a Poisson distribution with mean μ
• In this model, zero counts can come from either population,
while positive counts come only from the second one
• In the context of publications by PhD biochemists data, we can
imagine that some had in mind jobs where publications wouldn’t
be important, while others were aiming for academic jobs where
a record of publications was expected.
• Members of the first group would publish zero articles, whereas
members of the second group would publish 0, 1, 2, …., a count
that may be assumed to have a Poisson distribution
Zero-Inflated Poisson
• The distribution of the outcome can be modeled in terms of two
parameters, π the probability of “always zero” and μ, the mean
number of publications for those not in the “always zero’ group

• A natural way to introduce covariates is to model the logit of the


probability π of always zero and the log of the mean μ for those
not in the “always zero” class
• Stata implements this combination in the zip command when the
counts are assumed Poisson

• A parallel development using a negative binomial model for the


counts in the second group leads to the zinb command.

• In both cases, the model for the probability of always zero is


specified in the inflate () option
Zero-Inflated Poisson
Zero-inflated Poisson model with all covariates in both equation
. zip art fem mar kid5 phd ment, inflate(fem mar kid5 phd ment) nolog

Zero-inflated Poisson regression Number of obs = 915


Nonzero obs = 640
Zero obs = 275

Inflation model = logit LR chi2(5) = 78.56


Log likelihood = -1604.773 Prob > chi2 = 0.0000

art Coef. Std. Err. z P>|z| [95% Conf. Interval]

art
fem -.2091446 .0634047 -3.30 0.001 -.3334155 -.0848737
mar .103751 .071111 1.46 0.145 -.035624 .243126
kid5 -.1433196 .0474293 -3.02 0.003 -.2362793 -.0503599
phd -.0061662 .0310086 -0.20 0.842 -.066942 .0546096
ment .0180977 .0022948 7.89 0.000 .0135999 .0225955
_cons .640839 .1213072 5.28 0.000 .4030814 .8785967

inflate
fem .1097465 .2800813 0.39 0.695 -.4392028 .6586958
mar -.3540107 .3176103 -1.11 0.265 -.9765155 .2684941
kid5 .2171001 .196481 1.10 0.269 -.1679956 .6021958
phd .0012702 .1452639 0.01 0.993 -.2834418 .2859821
ment -.134111 .0452461 -2.96 0.003 -.2227918 -.0454302
_cons -.5770618 .5093853 -1.13 0.257 -1.575439 .421315
Zero-Inflated Poisson
• Looking at the inflate equation we see that the only significant
predictor of being in the “always zero” class is the number of
articles published by the mentor, with each article by the mentor
associated with 13.4% lower odds of never publishing

• Looking at the equation for the mean number of articles among


those not in the always zero class, we find significant
disadvantages for females and scientists with children under five,
and

• A large positive effect of the number of publications by the


mentor, with each article associated with a 1.8% increase in the
expected number of publications

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy