0% found this document useful (0 votes)

22 views24 pages

Shorten - Count Data Analysis

This document provides an overview of count data analysis methods including Poisson regression, negative binomial regression, and zero-inflated Poisson regression. It discusses the Poisson distribution and its properties. It also covers interpreting loglinear models, examples of count data, and fitting count data models in Stata. Key count data modeling techniques are explained including dealing with overdispersion using negative binomial regression and modeling excess zeros using zero-inflated Poisson regression.

Uploaded by

woretawendalew91

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

22 views24 pages

Shorten - Count Data Analysis

Uploaded by

woretawendalew91

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 24

Count data analysis

By Lemma Derseh
Presentation outline
Poisson regression

Negative binomial Poisson regression

Zero-inflated Poisson regression

Poisson distribution
• The Poisson distribution assigns a positive probability to every
nonnegative integer 0, 1, 2, . . ., so that every nonnegative integer
becomes a mathematical possibility (albeit practically zero
possibility for most count values)

• The Poisson is different than the binomial, Bin(n, π), which takes on
numbers only up to some n, and leads to a proportion (out of n)

• But the Poisson is similar to the binomial in that it can be shown

that the Poisson is the limiting distribution of a Binomial for large n
and small π

• Furthermore, because of the simple form of the Poisson

distribution, it is often computationally preferred over the Binomial
Poisson distribution
 We cannot consider count data as continuous normal because:
 Count distribution is too skewed to satisfy normality
(incorrect test results)
 Normal model does not necessarily prevent negative
estimated counts

 If we just dichotomize count data (e.g. 0 vs >0) and analyze it

using logistic regression, then we:
 Loss information resulting in under-powered tests
 For instance 1 event is not really equal to 100 events
Poisson distribution
• Expected number of counts (per unit of time) is strictly
positive

• As mean increases, distribution approximates normal

Examples of count data
• Number of deaths due to HIV/AIDs in a region per month

• Number of car accidents in a city

• Number of malaria cases in a district

Interpretation of loglinear models
Let Yi be the observed count for experimental unit i
Yi|Xi ~ Poisson(μi)
log(μi) = Xi β

Assumption: the mean (i.e. μ) equals the variance (i.e. σ2)

• We think that the covariates influence the mean of the counts (μ)
in a multiplicative way, i.e. as a covariate increases by 1 unit, the
log of the mean increases by β units and this implies the mean
increases by a “fold-change” of or “scale factor” of exp(β).

• The log link is the canonical link in GLM for Poisson distribution
Interpretation of loglinear model

• Considering only one covariate, the model is:

log(μ) = α + βx

• Since the log of the expected value of Y is a linear function of

explanatory variable(s), and the expected value of Y is a
multiplicative function of x:
μ = eα + βx
= eαeβx

What does this mean for μ? How do we interpret ?

Interpretation: Example
 Let a Poisson process involves the event of myocardial infarction (MI)
encountered per month and is required to model it using a single factor
say age of study participants (in years). Suppose the estimates are:
Intercept or α= 0.3396 and β=0.02565

 Our estimated model will be:

log(μ) = 0.3396 + 0.02565x

 Therefore, for a 1-year increase in age, the expectation (or mean)

number of event of MI increases by a factor of e0.2565 = 1.026 per
month

 That is, the number of MIs is growing at the rate of 2.6% per month
due to a one year increase in age
Interpretation

• All the findings mentioned above are based on the assumption

that the mean equals to the variance or equi-dispersion.

• Empirically, however, we often find data that exhibit over-

dispersion, with a variance larger than the mean and rarely
under dispersion, when the variance less than the mean

• Thus we need to have models that accommodate the excess

variation.
Fitting count data

(Stata Oriented)
Data: We use data from the STATA log provided by Long (1990) on the number
of publications produced by Ph.D. biochemists to illustrate the application on
the models mentioned. The variables in the data set include art: articles in last
three years of Ph.D., fem: coded one for females, mar: coded one if married,
kid5: number of children under age five, phd: prestige of PhD program, and
ment: articles published by mentor in last three years.

It can be accessed from the web via: use http://www.stata-

press.com/data/lf2/couart2,clear
Example

Data assessment
• After running the data using Stata: sum art, one can see that the
mean number of articles is 1.69 and the variance is 3.71

• a bit more than twice the mean

• The data are over-dispersed, but of course we haven’t

considered any covariates
A Poisson Model
 We can fit Poisson model with the five covariates in two
ways:

i. Use the command: poison

ii. Use the command: glm

 We will also store the estimates for latter use using the
command: estimates store poison
A Poisson Model
. glm art fem mar kid5 phd ment, family(poisson) nolog

Generalized linear models No. of obs = 915

Optimization : ML Residual df = 909
Scale parameter = 1
Deviance = 1634.370984 (1/df) Deviance = 1.797988
Pearson = 1662.54655 (1/df) Pearson = 1.828984

Variance function: V(u) = u [Poisson]

Link function : g(u) = ln(u) [Log]

AIC = 3.621981
Log likelihood = -1651.056316 BIC = -4564.031

OIM
art Coef. Std. Err. z P>|z| [95% Conf. Interval]

fem -.2245942 .0546138 -4.11 0.000 -.3316352 -.1175532

mar .1552434 .0613747 2.53 0.011 .0349512 .2755356
kid5 -.1848827 .0401272 -4.61 0.000 -.2635305 -.1062349
phd .0128226 .0263972 0.49 0.627 -.038915 .0645601
ment .0255427 .0020061 12.73 0.000 .0216109 .0294746
_cons .3046168 .1029822 2.96 0.003 .1027755 .5064581
Negative Binomial approach
• In the context of count data, consider the assumption that the
variance is proportional to the mean. Specifically, var(Y ) = φE(Y ) =
φμ

• If φ= 1 then the variance equals the mean and we obtain the

Poisson mean-variance relationship. If φ > 1 then we have over-
dispersion, and if φ < 1 we would have under-dispersion, but this is
relatively rare

• φ= 1 when the outcome data is generated by pure Poisson process;

φ< 1 when outcome events are happening in a regulated manner;
φ> 1 when there are unobserved heterogeneities that should be
captured by random effect models like negative binomial or mixed
methods or both
Negative Binomial
• Modeling over-dispersion in count data should start from a Poisson
regression model and add a multiplicative random effect to represent
unobserved heterogeneity which leads into the negative binomial
regression model

• Suppose that the conditional distribution of the outcome Y given an

unobserved variable effect θ is Poisson with mean and variance μθ.
That is, Y|θ ~ P(θμ)

• In the context of the Ph.D. biochemists data, θ captures unobserved

factors effect that increase (if θ > 1) or decrease (if θ < 1) productivity
relative to what one would expect given the observed values of the
covariates, which is of course μ where logμ = x’β.

• For convenience we take E(θ) = 1, that represents the expected

outcome for the average individual given covariates x.
Negative Binomial
• Stata has a command called nbreg that can fit the negative
binomial model described here by maximum likelihood.

• The output uses alpha to label the variance of θ the unobserved

variable, which we call σ2.
Negative Binomial Regression

• Stata’s alpha is the variance of the multiplicative random effect and

corresponds to σ2
• It is estimated to be 0.44 and is statistically significant
Negative Binomial Regression
• Because the Poisson model is a special case of the negative binomial
when σ2 =0, we can use a likelihood ratio test to compare the two
models.

• Or to test the significance of Stata’s alpha we may think of computing

twice the difference in log-likelihoods between this model and the
Poisson model, 180.2, and treating it as chi-squared with one df.

• However, the usual asymptotic do not apply, because the null

hypothesis is on a boundary of the parameter space

• A better approximation is to treat the statistic as 50:50 mixture of zeros

and chi-squared with one df

• And Stata implements this procedure, reporting the statistics as

chi2bar.
Negative Binomial Regression

• Alternatively to compare the two models, we can treat the

statistics as a chi-square with one df which gives a conservative
test

• To test hypothesis about the regression coefficients, we can use

either the Wald or LR tests, which are possible because we have
made full distributional assumptions.
Zero-Inflated Poisson
• One way to model when there is a situation of inflated zeros is to
assume that the data come from a mixture of two populations,
one where the counts is always zero, and another where the
count has a Poisson distribution with mean μ
• In this model, zero counts can come from either population,
while positive counts come only from the second one
• In the context of publications by PhD biochemists data, we can
imagine that some had in mind jobs where publications wouldn’t
be important, while others were aiming for academic jobs where
a record of publications was expected.
• Members of the first group would publish zero articles, whereas
members of the second group would publish 0, 1, 2, …., a count
that may be assumed to have a Poisson distribution
Zero-Inflated Poisson
• The distribution of the outcome can be modeled in terms of two
parameters, π the probability of “always zero” and μ, the mean
number of publications for those not in the “always zero’ group

• A natural way to introduce covariates is to model the logit of the

probability π of always zero and the log of the mean μ for those
not in the “always zero” class
• Stata implements this combination in the zip command when the
counts are assumed Poisson

• A parallel development using a negative binomial model for the

counts in the second group leads to the zinb command.

• In both cases, the model for the probability of always zero is

specified in the inflate () option
Zero-Inflated Poisson
Zero-inflated Poisson model with all covariates in both equation
. zip art fem mar kid5 phd ment, inflate(fem mar kid5 phd ment) nolog

Zero-inflated Poisson regression Number of obs = 915

Nonzero obs = 640
Zero obs = 275

Inflation model = logit LR chi2(5) = 78.56

Log likelihood = -1604.773 Prob > chi2 = 0.0000

art Coef. Std. Err. z P>|z| [95% Conf. Interval]

art
fem -.2091446 .0634047 -3.30 0.001 -.3334155 -.0848737
mar .103751 .071111 1.46 0.145 -.035624 .243126
kid5 -.1433196 .0474293 -3.02 0.003 -.2362793 -.0503599
phd -.0061662 .0310086 -0.20 0.842 -.066942 .0546096
ment .0180977 .0022948 7.89 0.000 .0135999 .0225955
_cons .640839 .1213072 5.28 0.000 .4030814 .8785967

inflate
fem .1097465 .2800813 0.39 0.695 -.4392028 .6586958
mar -.3540107 .3176103 -1.11 0.265 -.9765155 .2684941
kid5 .2171001 .196481 1.10 0.269 -.1679956 .6021958
phd .0012702 .1452639 0.01 0.993 -.2834418 .2859821
ment -.134111 .0452461 -2.96 0.003 -.2227918 -.0454302
_cons -.5770618 .5093853 -1.13 0.257 -1.575439 .421315
Zero-Inflated Poisson
• Looking at the inflate equation we see that the only significant
predictor of being in the “always zero” class is the number of
articles published by the mentor, with each article by the mentor
associated with 13.4% lower odds of never publishing

• Looking at the equation for the mean number of articles among

those not in the always zero class, we find significant
disadvantages for females and scientists with children under five,
and

• A large positive effect of the number of publications by the

mentor, with each article associated with a 1.8% increase in the
expected number of publications

2.2 The Poisson Distribution
No ratings yet
2.2 The Poisson Distribution
22 pages
Modeling Count Data. ISBN 1107611253, 978-1107611252
100% (27)
Modeling Count Data. ISBN 1107611253, 978-1107611252
23 pages
Modeling Count Data (Joseph M. Hilbe)
No ratings yet
Modeling Count Data (Joseph M. Hilbe)
304 pages
Poisson
No ratings yet
Poisson
54 pages
L1.1 Distributions
No ratings yet
L1.1 Distributions
36 pages
Section 8 P
No ratings yet
Section 8 P
43 pages
Tutorial 106b - Poisson Regression and Log-Linear Models (Bayesian)
No ratings yet
Tutorial 106b - Poisson Regression and Log-Linear Models (Bayesian)
122 pages
TCRM CountData
No ratings yet
TCRM CountData
43 pages
L19 CountDataModels v2
No ratings yet
L19 CountDataModels v2
36 pages
Lect 12
No ratings yet
Lect 12
36 pages
PSQF6270 Example4a Binomial
No ratings yet
PSQF6270 Example4a Binomial
13 pages
Poisson Regression
No ratings yet
Poisson Regression
60 pages
Modeling
100% (1)
Modeling
300 pages
Chapter-09 Count Data, Poisson Regression, and Log-Linear Model
No ratings yet
Chapter-09 Count Data, Poisson Regression, and Log-Linear Model
15 pages
Internship Report On AKRSP
67% (3)
Internship Report On AKRSP
83 pages
2017may 02323 02402 Solution en
No ratings yet
2017may 02323 02402 Solution en
43 pages
BS 4 Years Past Paper STAT 2
No ratings yet
BS 4 Years Past Paper STAT 2
15 pages
Chapter 4 QM (PC)
No ratings yet
Chapter 4 QM (PC)
26 pages
Count Models Poisson NB
No ratings yet
Count Models Poisson NB
10 pages
CRIMINAL LAW - Estrada Problems and Answers
67% (3)
CRIMINAL LAW - Estrada Problems and Answers
3 pages
Cópia de Aula5 - Contagem
No ratings yet
Cópia de Aula5 - Contagem
28 pages
Models For Counts
No ratings yet
Models For Counts
59 pages
Count Models
No ratings yet
Count Models
38 pages
Count Data 2012
No ratings yet
Count Data 2012
20 pages
Poisson Regression in Stata
No ratings yet
Poisson Regression in Stata
23 pages
Modeling Count Data
No ratings yet
Modeling Count Data
6 pages
Countdata2018 2
No ratings yet
Countdata2018 2
23 pages
GLM Slides 11 ZTP GP
No ratings yet
GLM Slides 11 ZTP GP
26 pages
4glm3 Ha Online
No ratings yet
4glm3 Ha Online
51 pages
Poisson Regression Models
No ratings yet
Poisson Regression Models
14 pages
Chapter 6
No ratings yet
Chapter 6
24 pages
Genlog Poisson
No ratings yet
Genlog Poisson
16 pages
An Introduction To Generalized Linear Models (Third Edition, 2008) by Annette Dobson & Adrian Barnett Outline of Solutions For Selected Exercises
No ratings yet
An Introduction To Generalized Linear Models (Third Edition, 2008) by Annette Dobson & Adrian Barnett Outline of Solutions For Selected Exercises
23 pages
Binomial Distribution: ,.... 2, 1, 0 Where) 1 (
No ratings yet
Binomial Distribution: ,.... 2, 1, 0 Where) 1 (
15 pages
Glmcourse 003
No ratings yet
Glmcourse 003
49 pages
SSPSS Data Analysis Examples Poisson Regression
No ratings yet
SSPSS Data Analysis Examples Poisson Regression
34 pages
Lecture 10 Spring 2017
No ratings yet
Lecture 10 Spring 2017
15 pages
Cambridge International AS & A Level: HISTORY 9489/41
No ratings yet
Cambridge International AS & A Level: HISTORY 9489/41
4 pages
Poisson Regression
No ratings yet
Poisson Regression
3 pages
Chap1 Introduction 2may24
No ratings yet
Chap1 Introduction 2may24
21 pages
The Poisson Regression Model
No ratings yet
The Poisson Regression Model
6 pages
Models For Count Data With Overdispersion: 1 Extra-Poisson Variation
No ratings yet
Models For Count Data With Overdispersion: 1 Extra-Poisson Variation
7 pages
P NB ProbitE
No ratings yet
P NB ProbitE
21 pages
PS With R Lab Record Exp PDF
No ratings yet
PS With R Lab Record Exp PDF
25 pages
Count Data Models in SAS
No ratings yet
Count Data Models in SAS
12 pages
STAT PRELIM - Hemal
No ratings yet
STAT PRELIM - Hemal
11 pages
PBHS32700 Lecture12
No ratings yet
PBHS32700 Lecture12
18 pages
Y e y Y P: (A) Poisson Distribution and Poisson Process
No ratings yet
Y e y Y P: (A) Poisson Distribution and Poisson Process
6 pages
Count Data Analysis Using Poisson Regression: University of Southeastern Philippines Advanced Studies Mintal, Davao City
No ratings yet
Count Data Analysis Using Poisson Regression: University of Southeastern Philippines Advanced Studies Mintal, Davao City
19 pages
Countreg
No ratings yet
Countreg
11 pages
Poisson & Negative Binomial Regressions: Notes
No ratings yet
Poisson & Negative Binomial Regressions: Notes
30 pages
Logit Probit
No ratings yet
Logit Probit
66 pages
21mab204t Unit II Lecture Notes
No ratings yet
21mab204t Unit II Lecture Notes
19 pages
Poisson Regression - Stata Data Analysis Examples
No ratings yet
Poisson Regression - Stata Data Analysis Examples
12 pages
Assessment Questions PHI 815 Sem One 2022 2023
No ratings yet
Assessment Questions PHI 815 Sem One 2022 2023
24 pages
The Ailing Planet Notes
100% (1)
The Ailing Planet Notes
4 pages
Simovert 106035
No ratings yet
Simovert 106035
4 pages
Poisson Distribution: Submitted By: 2014-UETR-CS-04 Submitted To: Sir Ihsan-Ul-Ghafoor
No ratings yet
Poisson Distribution: Submitted By: 2014-UETR-CS-04 Submitted To: Sir Ihsan-Ul-Ghafoor
27 pages
10E-Poisson Regression
No ratings yet
10E-Poisson Regression
19 pages
Negro Who's Who in California (1948)
100% (2)
Negro Who's Who in California (1948)
154 pages
Ultralight Shape Load Table PDF
No ratings yet
Ultralight Shape Load Table PDF
13 pages
Obj - Que. PBG 4.4
No ratings yet
Obj - Que. PBG 4.4
11 pages
HV Link Boxe Epp 1665 5-13
No ratings yet
HV Link Boxe Epp 1665 5-13
8 pages
Generalized Linear Models-1
No ratings yet
Generalized Linear Models-1
29 pages
Avg. Market Capitalization of Listed Companies During Jul-Dec 2018
No ratings yet
Avg. Market Capitalization of Listed Companies During Jul-Dec 2018
294 pages
Gold Chemistry 2
100% (3)
Gold Chemistry 2
189 pages
Oiv Ma As1 12
No ratings yet
Oiv Ma As1 12
92 pages
Probability Review Questions
No ratings yet
Probability Review Questions
6 pages
Poisson Regression
No ratings yet
Poisson Regression
12 pages
637768232285587483ce 20ce33pt W3 S3 Sy
No ratings yet
637768232285587483ce 20ce33pt W3 S3 Sy
7 pages
GBV Monthly Work Plan
No ratings yet
GBV Monthly Work Plan
20 pages
2 Plugins Changelog
No ratings yet
2 Plugins Changelog
3 pages
Important Reminders: Step 1
No ratings yet
Important Reminders: Step 1
4 pages
5.3 Disintegration Test For Tablets and Capsules: Final Text For Revision of
No ratings yet
5.3 Disintegration Test For Tablets and Capsules: Final Text For Revision of
4 pages
How To Pay Profession Tax Online
No ratings yet
How To Pay Profession Tax Online
22 pages
Transportation Calculations
No ratings yet
Transportation Calculations
11 pages
DemoProject2Project Report
No ratings yet
DemoProject2Project Report
9 pages
Concurrence of Big Data Analytics and Healthcare
No ratings yet
Concurrence of Big Data Analytics and Healthcare
10 pages
D R L F L S I G S: EEP Einforcement Earning For Urniture Ayout Imulation in Ndoor Raphics Cenes
No ratings yet
D R L F L S I G S: EEP Einforcement Earning For Urniture Ayout Imulation in Ndoor Raphics Cenes
6 pages
Classics: Invention of The Integrated Circuit
No ratings yet
Classics: Invention of The Integrated Circuit
16 pages
Chickering Seven Principles
No ratings yet
Chickering Seven Principles
6 pages
Mark Jimuel M. Gabrillo: Job Objective
No ratings yet
Mark Jimuel M. Gabrillo: Job Objective
3 pages
Chavez vs. CA
No ratings yet
Chavez vs. CA
1 page
C1 Reading Political Manifestos
No ratings yet
C1 Reading Political Manifestos
3 pages
Fleximax Series Power System: All The Power You Need
No ratings yet
Fleximax Series Power System: All The Power You Need
2 pages
BBS Implementation Process - Matrix
No ratings yet
BBS Implementation Process - Matrix
2 pages
Endorsement Letter Honda
No ratings yet
Endorsement Letter Honda
1 page
Multi-dimensional Monte Carlo Integrations Utilizing Mathematica
From Everand
Multi-dimensional Monte Carlo Integrations Utilizing Mathematica
SUJAUL CHOWDHURY
No ratings yet
GCSE Maths Revision: Cheeky Revision Shortcuts
From Everand
GCSE Maths Revision: Cheeky Revision Shortcuts
Scool Revision
3.5/5 (2)
Correlation and Regression: Six Sigma Thinking, #8
From Everand
Correlation and Regression: Six Sigma Thinking, #8
Sumeet Savant
5/5 (1)

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Shorten - Count Data Analysis

Uploaded by

Shorten - Count Data Analysis

Uploaded by

Count data analysis

Negative binomial Poisson regression

Zero-inflated Poisson regression

• But the Poisson is similar to the binomial in that it can be shown

• Furthermore, because of the simple form of the Poisson

 If we just dichotomize count data (e.g. 0 vs >0) and analyze it

• As mean increases, distribution approximates normal

• Number of car accidents in a city

• Number of malaria cases in a district

Assumption: the mean (i.e. μ) equals the variance (i.e. σ2)

• Considering only one covariate, the model is:

• Since the log of the expected value of Y is a linear function of

What does this mean for μ? How do we interpret ?

 Our estimated model will be:

 Therefore, for a 1-year increase in age, the expectation (or mean)

• All the findings mentioned above are based on the assumption

• Empirically, however, we often find data that exhibit over-

• Thus we need to have models that accommodate the excess

It can be accessed from the web via: use http://www.stata-

• a bit more than twice the mean

• The data are over-dispersed, but of course we haven’t

i. Use the command: poison

ii. Use the command: glm

Generalized linear models No. of obs = 915

Variance function: V(u) = u [Poisson]

fem -.2245942 .0546138 -4.11 0.000 -.3316352 -.1175532

• If φ= 1 then the variance equals the mean and we obtain the

• φ= 1 when the outcome data is generated by pure Poisson process;

• Suppose that the conditional distribution of the outcome Y given an

• In the context of the Ph.D. biochemists data, θ captures unobserved

• For convenience we take E(θ) = 1, that represents the expected

• The output uses alpha to label the variance of θ the unobserved

• Stata’s alpha is the variance of the multiplicative random effect and

• Or to test the significance of Stata’s alpha we may think of computing

• However, the usual asymptotic do not apply, because the null

• A better approximation is to treat the statistic as 50:50 mixture of zeros

• And Stata implements this procedure, reporting the statistics as

• Alternatively to compare the two models, we can treat the

• To test hypothesis about the regression coefficients, we can use

• A natural way to introduce covariates is to model the logit of the

• A parallel development using a negative binomial model for the

• In both cases, the model for the probability of always zero is

Zero-inflated Poisson regression Number of obs = 915

Inflation model = logit LR chi2(5) = 78.56

art Coef. Std. Err. z P>|z| [95% Conf. Interval]

• Looking at the equation for the mean number of articles among

• A large positive effect of the number of publications by the

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.