The Problem of Modeling Rare Events in ML-based Logistic Regression - Assessing Potential Remedies Via MC Simulations

The document summarizes a study that used Monte Carlo simulations to compare conventional maximum likelihood estimation, King and Zeng's bias correction method, and Firth's penalized maximum likelihood estimation when modeling rare events in logistic regression. The simulations varied the sample size (n) and the probability of the rare event (p) and assessed bias in the estimated intercepts and slopes. Maximum likelihood estimates became increasingly biased as n decreased and p became smaller. King and Zeng's correction reduced the bias but overcorrected somewhat when the number of events was very small, while Firth's penalized maximum likelihood estimates stayed close to the true values even with very few events.

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/269708531

The Problem of Modeling Rare Events in ML-based Logistic Regression - Assessing Potential Remedies via MC Simulations

Conference Paper · July 2013


1 author:

Heinz Leitgöb
Katholische Universität Eichstätt-Ingolstadt (KU)

All content following this page was uploaded by Heinz Leitgöb on 19 December 2014.



The Problem of Modeling Rare Events in ML-based Logistic Regression

Assessing Potential Remedies via MC Simulations

Heinz Leitgöb
University of Linz, Austria
Problem
• In logistic regression, MLEs are consistent but only asymptotically unbiased -> in finite samples, MLEs may be heavily biased away from 0

• McCullagh and Nelder (1989) determine the bias as

$$\mathbf{b} = (\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{W}\boldsymbol{\xi} \tag{1}$$

• Rare events (= a very small number of events, #e) exacerbate $\mathbf{b}$

• This phenomenon is well known in the statistical literature (for an overview see Gao and Shen (2007)) but, thus far, not adequately communicated to applied researchers by the available textbooks on logistic regression
ESRA 2013, Ljubljana 2
Conventional ML-based logistic regression

Logistic regression model


$$y \sim B(1, \pi) \quad \text{with} \quad \pi = \frac{\exp(\eta)}{1 + \exp(\eta)} \quad \text{with} \quad \eta = \sum_{j=1}^{k} x_j \beta_j \tag{2}$$

(log-)Likelihood functions and score vector


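$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} \tag{3}$$

$$\log L(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log \pi(\mathbf{x}_i, \boldsymbol{\beta}) + (1 - y_i) \log\left(1 - \pi(\mathbf{x}_i, \boldsymbol{\beta})\right) \right] \tag{4}$$

$$\mathbf{q} = \frac{\partial \log L(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \mathbf{0} \tag{5}$$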
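
To make Eqs. (4) and (5) concrete, the following is a minimal sketch (not the Stata routine used for the simulations) that maximizes the log-likelihood by Newton-Raphson. The function names fit_logit_ml and log_likelihood are hypothetical, and the design matrix X is assumed to contain a column of ones for the intercept.

```python
import numpy as np


def log_likelihood(X, y, beta):
    """Log-likelihood of Eq. (4), written in terms of eta = X @ beta."""
    eta = X @ beta
    # y*log(pi) + (1-y)*log(1-pi) simplifies to y*eta - log(1 + exp(eta))
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))


def fit_logit_ml(X, y, max_iter=50, tol=1e-10):
    """Solve the score equations q = 0 of Eq. (5) by Newton-Raphson."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        q = X.T @ (y - pi)                            # score vector
        info = X.T @ (X * (pi * (1.0 - pi))[:, None])  # information X'WX
        step = np.linalg.solve(info, q)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Usage with simulated data (illustrative values only)
rng = np.random.default_rng(0)
x1 = rng.standard_normal(2000)
X_demo = np.column_stack([np.ones_like(x1), x1])
y_demo = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * x1))))
beta_demo = fit_logit_ml(X_demo, y_demo)
```

With perfectly separated data or no events at all the ML estimate does not exist, and this sketch would fail or diverge rather than report a finite estimate.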



Potential remedies

• Exact logistic regression (Stata command: exlogistic)

• Bias correction method proposed by King and Zeng (2001a, 2001b) (Stata command: relogit)

• Penalized maximum likelihood estimation (PMLE) proposed by Firth (1993) (Stata command: firthlogit)



Exact logistic regression

Principle: exact computation of parameter estimates -> does not rely on the asymptotic properties of the estimates, as MLE does

First result:
Exact logistic regression is only applicable when
• n is (very) small (<200)
• covariates are discrete (best: dichotomous)
• # of covariates is small

otherwise: working memory will blow up



Bias correction method proposed by King and Zeng (2001a, 2001b)
Principle: 2-step correction procedure

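• Step 1: correction of finite sample bias -> generation of an unbiased vector of parameter estimates $\tilde{\boldsymbol{\beta}}$ (for $\operatorname{bias}(\hat{\boldsymbol{\beta}})$ see Eq. (1))

$$\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}} - \operatorname{bias}(\hat{\boldsymbol{\beta}}) \tag{6}$$

• Step 2: changes in $\tilde{\boldsymbol{\beta}}$ usually do not affect $\pi$ symmetrically -> approximate correction can be realized by

$$\Pr(y_i = 1 \mid \mathbf{x}_i) \approx \tilde{\pi}_i + C_i \tag{7}$$

with

$$C_i = (0.5 - \tilde{\pi}_i)\, \tilde{\pi}_i (1 - \tilde{\pi}_i)\, \mathbf{x}_0 V(\tilde{\boldsymbol{\beta}}) \mathbf{x}_0' \tag{8}$$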
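
The two steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the relogit implementation: the slides do not define ξ, so the sketch assumes the unweighted form ξ_i = Q_ii (π_i - 0.5) with Q = X (X'WX)^{-1} X' as given by King and Zeng (2001a), and the covariance matrix V of the corrected estimates is taken as an input supplied by the caller. The function names are hypothetical.

```python
import numpy as np


def kz_bias(X, pi_hat):
    """Approximate bias of the MLE, Eq. (1): (X'WX)^{-1} X'W xi."""
    w = pi_hat * (1.0 - pi_hat)                  # diagonal of W
    info_inv = np.linalg.inv(X.T @ (X * w[:, None]))
    # Diagonal of Q = X (X'WX)^{-1} X'
    q_diag = np.einsum("ij,jk,ik->i", X, info_inv, X)
    xi = q_diag * (pi_hat - 0.5)                 # ASSUMED form of xi (unweighted case)
    return info_inv @ (X.T @ (w * xi))


def kz_correct(X, beta_hat):
    """Step 1, Eq. (6): subtract the estimated bias from the MLE."""
    pi_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
    return beta_hat - kz_bias(X, pi_hat)


def kz_probability(x0, beta_tilde, V_tilde):
    """Step 2, Eqs. (7) and (8): corrected probability at covariate vector x0."""
    pi = 1.0 / (1.0 + np.exp(-(x0 @ beta_tilde)))
    C = (0.5 - pi) * pi * (1.0 - pi) * (x0 @ V_tilde @ x0)
    return pi + C
```

For an intercept-only model the bias term reduces to (π - 0.5) / (n π (1 - π)), which is negative for rare events, so the corrected intercept moves toward 0.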


PMLE proposed by Firth (1993)

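Principle: extending the $\log L_{ML}$ function (and thus the elements of the score vector) by a penalization term which is sensitive to decreasing n and #e

$$L_{PML}(\boldsymbol{\beta}) = L_{ML}(\boldsymbol{\beta})\, |\mathbf{i}(\boldsymbol{\beta})|^{1/2} \tag{9}$$

$$\log L_{PML}(\boldsymbol{\beta}) = \log L_{ML}(\boldsymbol{\beta}) + \tfrac{1}{2} \log |\mathbf{i}(\boldsymbol{\beta})| \tag{10}$$

$$\mathbf{q}_{PML} = \mathbf{q}_{ML} + \tfrac{1}{2} \operatorname{tr}\left( \mathbf{i}^{-1} \frac{\partial \mathbf{i}}{\partial \boldsymbol{\beta}} \right) \tag{11}$$

$|\mathbf{i}(\boldsymbol{\beta})|^{1/2}$ … Jeffreys (1946) invariant prior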
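
A minimal sketch of how the penalized score in Eq. (11) can be solved numerically is given below. It is not the firthlogit implementation. It relies on the form of the penalized score for the logistic model described by Heinze/Schemper (2002), in which each observation contributes (y_i - π_i + h_i (0.5 - π_i)) x_i, with h_i the diagonal elements of the hat matrix W^{1/2} X (X'WX)^{-1} X' W^{1/2}. The function name and the step cap max_step are assumptions made for this illustration.

```python
import numpy as np


def fit_logit_firth(X, y, max_iter=100, tol=1e-10, max_step=5.0):
    """Solve the penalized score equations q_PML = 0 by Newton-type iteration."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = pi * (1.0 - pi)
        info_inv = np.linalg.inv(X.T @ (X * w[:, None]))   # i(beta)^{-1}
        # h_i: diagonal of W^{1/2} X (X'WX)^{-1} X' W^{1/2}
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        q_pml = X.T @ (y - pi + h * (0.5 - pi))            # penalized score
        step = info_inv @ q_pml
        biggest = np.max(np.abs(step))
        if biggest > max_step:                             # cap very large steps
            step = step * (max_step / biggest)
            biggest = max_step
        beta = beta + step
        if biggest < tol:
            break
    return beta


# Usage: perfectly separated data, where the conventional MLE does not exist
x_sep = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y_sep = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
beta_sep = fit_logit_firth(np.column_stack([np.ones(6), x_sep]), y_sep)
```

In the intercept-only case the penalized estimate has the closed form π = (#e + 0.5) / (n + 1), which stays finite even when #e = 0. This illustrates why the penalty matters most when n and #e are small.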



Research questions

• How biased are MLEs in small samples with rare events?

• How do the alternative estimation procedures perform under these conditions (focusing on unbiasedness)?



Design of Monte Carlo (MC) simulation
Table 1: Simulation design (number of events #e by p and n)
p        n = 5,000    1,000      500      250      100
0.5          2,500      500      250      125       50
0.1            500      100       50       25       10
0.05           250       50       25      ≈13        5
0.01            50       10        5        -        -

Linear predictors ($\eta_p$), 10,000 replications:

$\eta_{0.5} = 2x_1$; $x_1 \sim N(0,1)$
$\eta_{0.1} = -3.3 + 2x_1$; $x_1 \sim N(0,1)$
$\eta_{0.05} = -4.3 + 2x_1$; $x_1 \sim N(0,1)$
$\eta_{0.01} = -6.6 + 2x_1$; $x_1 \sim N(0,1)$

Variation in n and p (-> in $\beta_0$) while $\beta_1 = 2$ and the # of covariates are kept constant
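
The data-generating step of this design can be sketched as follows. This is an illustration only: the names DESIGN, simulate_dataset and mean_events, the seed, and the reduced number of replications are assumptions, and the original simulations were presumably run in Stata.

```python
import numpy as np

# Intercepts beta_0 for each target event probability p; beta_1 = 2 throughout
DESIGN = {0.5: 0.0, 0.1: -3.3, 0.05: -4.3, 0.01: -6.6}


def simulate_dataset(n, beta0, beta1=2.0, rng=None):
    """Draw x1 ~ N(0, 1) and y ~ B(1, pi) with logit(pi) = beta0 + beta1 * x1."""
    rng = np.random.default_rng() if rng is None else rng
    x1 = rng.standard_normal(n)
    pi = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x1)))
    y = rng.binomial(1, pi)
    return x1, y


def mean_events(n, beta0, replications=200, seed=12345):
    """Average number of events (#e) over repeated draws, for comparison with Table 1."""
    rng = np.random.default_rng(seed)
    counts = [simulate_dataset(n, beta0, rng=rng)[1].sum() for _ in range(replications)]
    return float(np.mean(counts))


# Usage: the average event count for n = 1,000 and p = 0.1 should be roughly 100
events_p01 = mean_events(1000, DESIGN[0.1])
```

Each simulated data set would then be passed to the estimators under comparison, and the estimates averaged over replications.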
MC simulation results: conventional MLE
Table 2a: Conventional MLE, mean intercepts
p (β0)        n = 5,000        1,000          500          250          100
0.5 (0)      -0.0001084    0.0004301    0.0018214    0.0001666    0.0028939
0.1 (-3.3)   -3.3041540   -3.3235550   -3.3491030   -3.4005300   -3.6099880
0.05 (-4.3)  -4.3109700   -4.3456310   -4.3998010   -4.5071530   -5.0832930
0.01 (-6.6)  -6.6438070   -6.9020720   -7.4014120            -            -

Table 2b: Conventional MLE, mean slopes
p (β1 = 2)    n = 5,000        1,000          500          250          100
0.5           2.0016170    2.0102410    2.0197690    2.0364510    2.1058500
0.1           2.0027970    2.0175180    2.0381940    2.0789060    2.2306880
0.05          2.0069420    2.0261880    2.0563780    2.1174040    2.4500680
0.01          2.0149170    2.1007360    2.2920260            -            -

Italics indicate that the 95% CI does not contain the true value
MC simulation results: King/Zeng correction
Table 3a: King/Zeng correction, mean intercepts
p (β0)        n = 5,000        1,000          500          250          100
0.5 (0)      -0.0001084    0.0004286    0.0018100    0.0001658    0.0027876
0.1 (-3.3)   -3.2996170   -3.2995520   -3.3017820   -3.3001730   -3.2840120
0.05 (-4.3)  -4.3016320   -4.2998200   -4.3032600   -4.2905530   -4.0773970
0.01 (-6.6)  -6.5957210   -6.5828270   -6.3526430            -            -

Table 3b: King/Zeng correction, mean slopes
p (β1 = 2)    n = 5,000        1,000          500          250          100
0.5           1.9997340    2.0007440    2.0005830    1.9973300    2.0005910
0.1           1.9992970    1.9969960    2.0025320    2.0043180    1.9892110
0.05          2.0017520    2.0001370    2.0016380    1.9953640    1.8951280
0.01          1.9992300    1.9949910    1.9447270            -            -

Italics indicate that the 95% CI does not contain the true value
MC simulation results: Firth’s PMLE
Table 4a: Firth’s PMLE, mean intercepts
p (β0)        n = 5,000        1,000          500          250          100
0.5 (0)      -0.0001084    0.0004286    0.0018100    0.0001657    0.0027907
0.1 (-3.3)   -3.2996230   -3.2996900   -3.3023610   -3.3027510   -3.3129370
0.05 (-4.3)  -4.3016500   -4.3002850   -4.3053100   -4.3010580   -4.3168090
0.01 (-6.6)  -6.5961770   -6.6058880   -6.5950600            -            -

Table 4b: Firth’s PMLE, mean slopes
p (β1 = 2)    n = 5,000        1,000          500          250          100
0.5           1.9997350    2.0007690    2.0006860    1.9977570    2.0035800
0.1           1.9993010    1.9970950    2.0029460    2.0061450    2.0088150
0.05          2.0017610    2.0003810    2.0027020    2.0006890    2.0012150
0.01          1.9993380    1.9998850    1.9713260            -            -

Italics indicate that the 95% CI does not contain the true value
Comparison of mean intercepts
Graph 1: Comparison of mean intercepts (p = 0.05 -> β0 = -4.3)
[Line chart. x-axis: n (5,000; 1,000; 500; 250; 100). y-axis: β0 (-5.5 to -3.5). Series: ML_β0, King/Zeng_β0, PML_β0, true_β0. Labeled values: ML -4.399801 (n = 500), -4.507153 (n = 250), -5.083293 (n = 100); King/Zeng -4.077397 (n = 100); PML -4.316809 (n = 100).]
Comparison of mean slopes
Graph 2: Comparison of mean slopes (p = 0.05; β1 = 2)
[Line chart. x-axis: n (5,000; 1,000; 500; 250; 100). y-axis: β1 (1.5 to 2.5). Series: ML_β1, King/Zeng_β1, PML_β1, true_β1. Labeled values: ML 2.056378 (n = 500), 2.117404 (n = 250), 2.450068 (n = 100); King/Zeng 1.895128 (n = 100); PML 2.001215 (n = 100).]
Comparison of probabilities

Table 5: Comparison of Pr(y = 1 | x1 = 1) for n = 100 and p = 0.05
                    β0          β1           η         Pr       Pr*100
true              -4.3           2        -2.3   0.091122     9.112296
MLE          -5.083293    2.450068   -2.633225   0.067030     6.703049
King/Zeng    -4.077397    1.895128   -2.182269   0.101354    10.135408
PMLE         -4.316809    2.001215   -2.315594   0.089839     8.983968
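
The probabilities in Table 5 follow from Eq. (2) with x1 = 1, that is, Pr = exp(η) / (1 + exp(η)) with η = β0 + β1. A short sketch that recomputes them from the mean estimates reported above (the function and variable names are illustrative):

```python
import math


def predicted_probability(beta0, beta1, x1=1.0):
    """Pr(y = 1 | x1) from the logistic model in Eq. (2)."""
    eta = beta0 + beta1 * x1
    return 1.0 / (1.0 + math.exp(-eta))


# Mean estimates for n = 100 and p = 0.05, taken from Tables 2 to 4
estimates = {
    "true": (-4.3, 2.0),
    "MLE": (-5.083293, 2.450068),
    "King/Zeng": (-4.077397, 1.895128),
    "PMLE": (-4.316809, 2.001215),
}

table5 = {name: predicted_probability(b0, b1) for name, (b0, b1) in estimates.items()}

for name, pr in table5.items():
    print(f"{name:<10} Pr = {pr:.6f}  Pr*100 = {100 * pr:.6f}")
```

The comparison shows the direction of the errors: the MLE understates the probability, the King/Zeng estimate overstates it, and the PMLE lies closest to the true value.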



Summary
• MLEs are systematically biased away from 0 as n and #e get small -> underestimation of the “true” Pr(y = 1 | x)

• In samples with n > 200 and/or in cases with “many” covariates and/or non-discrete covariates, exact logistic regression will exhaust working memory

• The correction method proposed by King/Zeng somewhat overcorrects the bias in MLEs as n gets small (<200)

• PMLEs seem unbiased, even in cases with small n and very few #e. Further advantages: PMLE always converges and solves the “problem of separation” (Heinze/Schemper 2002)

Recommendations: Try to keep n large and apply PMLE when estimating logistic regression models (with rare events data)!
Future research

• Additional investigation of other desirable properties of estimates (e.g. consistency, efficiency)

• Testing the respective models for systematic bias in standard errors

• Testing the performance of the models for non-normal and discrete covariates (e.g. Poisson distributed covariates)

• Testing for a decreasing number of events per variable by including more than one covariate into the model (Peduzzi et al. 1996)


Literature
Firth, D. (1993): Bias reduction of maximum likelihood estimates. In: Biometrika 80: 27-38.
Gao, S./Shen, J. (2007): Asymptotic properties of a double penalized maximum likelihood estimator in logistic regression. In: Statistics and Probability Letters 77: 925-930.
Heinze, G./Schemper, M. (2002): A solution to the problem of separation in logistic regression. In: Statistics in Medicine 21: 2409-2419.
Jeffreys, H. (1946): An invariant form for the prior probability in estimation problems.
King, G./Zeng, L. (2001a): Logistic Regression in Rare Events Data. In: Political Analysis 9: 137-163.
King, G./Zeng, L. (2001b): Explaining Rare Events in International Relations. In: International Organization 55: 693-715.
McCullagh, P./Nelder, J. A. (1989): Generalized Linear Models. Chapman & Hall: Boca Raton.
Peduzzi, P./Concato, J./Kemper, E./Holford, T. R./Feinstein, A. R. (1996): A Simulation Study of the Number of Events per Variable in Logistic Regression Analysis. In: Journal of Clinical Epidemiology 49: 1373-1379.
Thanks for your attention!

heinz.leitgoeb@jku.at
