
Paper MS Statistics

This document discusses different models for analyzing correlated ordinal data: Multivariate Ordinal Probit (MVOP), Ordered Probit (OP), Multivariate Probit (MVP), and Binary Probit (BP). It notes issues with the MVOP model in SAS, including long run times and convergence problems. It proposes simplifying the model by discounting cross-equation correlations (OP model) or by making the ordinal variables binary and using MVP. A simulation study is used to evaluate the performance of these models.


MULTI-ORDINAL PROBIT/LOGIT MODELS THAT TAKE CORRELATION INTO ACCOUNT

Ludwig P. Linares

Summary

Correlated ordinal data often arise from surveys. One of the available techniques for analyzing
this type of data is PROC QLIM in SAS using the MVOP (Multivariate Ordinal Probit) option.
While this works in general, the technique is memory intensive, often takes a long time to run,
and often has difficulty converging. In this study we use Monte Carlo simulation to investigate
the consequences of aggregating ordinal response variables and/or discounting cross-equation
correlations when analyzing models with ordinal response variables under multiple correlation
structures. Our results indicate that while MVOP remains the best model for this kind of data,
Ordered Probit (which discounts the cross-equation correlation) performs competitively as well.
However, the performance of Multivariate Probit (which discounts the ordinal nature of the data)
is less competitive. Hence the take-home message is that, if we do want to simplify the problem,
we obtain better results by discounting the cross-equation correlation than by discounting the
ordinal nature of the data.

I. Introduction

Discrete choice models are now used in a wide variety of situations in applied statistics.
Limited dependent variable and discrete choice models have many applications in areas
including economics, finance, marketing, political science, and sociology. Data sets collected
from surveys often result in cross-correlated ordinal responses. For example, many quality
characteristics of products or services are commonly evaluated on ordinal scales with a finite
number of categories. A systematic analysis of categorical variables collected over time may be
useful for a profitable management strategy. In practice, to measure customer satisfaction or
quality improvement in a process, two or more quality characteristics are often measured jointly
and summarized by suitable indexes. These examples result in multivariate ordinal data.

In theory, the best option is the Multivariate Ordinal Probit (MVOP) model, which accounts for
both the cross-correlation and the ordinal nature of this type of data. An advantage
of the estimator is that, even for high-dimensional models, the estimation procedure requires the
evaluation of bivariate normal integrals only. Numerical integration is needed because the
cumulative normal distribution function cannot be expressed in closed form. The exact estimation
of a Multivariate Probit (MVP) model with more than two dependent variables is computationally
infeasible. This explains why most applications of the MVP and Ordered Probit (OP) models
are limited to at most two dependent variables. Most software uses a practical solution
to the integration problem, namely simulation methods that approximate the integrals
(McFadden, 1989). However, these methods are still very computationally intensive.
SAS provides PROC QLIM (qualitative and limited dependent variable model) for the analysis
of this kind of data. Specifically, it provides options for MVOP (Multivariate Ordered Probit), OP
(Ordered Probit) and MVP (Multivariate Probit), as well as the simple case of Binary Probit (BP).
Our goal in this paper is to compare the performance of these four models in terms of bias, RMSE,
standard error and Type I error using Monte Carlo simulation.

Multivariate Ordinal & Qlim


The MVOP model is an extension of the MVP model in which the response variable is ordinal
instead of binary. The behavioral structure of the models is based on the concept of threshold
crossing, where an individual is assumed to derive utility from a choice and the intensity (or level)
of that utility determines which alternative is picked.

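More specifically, for some individual $i$ ($i = 1, 2, 3, \dots$), choice $j$ ($j = 1, 2, 3, \dots, J$) yields a
level of utility $Y_{ji}$ made up of a systematic component $X_{ji}\beta_j$ and a random component $\varepsilon_{ji}$:

$$Y_{ji} = X_{ji}\beta_j + \varepsilon_{ji}, \qquad i = 1, 2, 3, \dots \ \text{and} \ j = 1, 2, 3, \dots, J$$

where $X_{ji}$ represents our explanatory variables, $\beta_j$ is a vector of unknown parameters and
$\varepsilon_{ji}$ is a standard normally distributed random variable, independent and identically
distributed across individuals but correlated between choices, with covariance matrix $\Sigma$ that has a
unit diagonal:

$$\varepsilon_i \sim N(0, \Sigma)$$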

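Individual $i$ is observed to pick some alternative $j$ if $Y_{ji}$ falls between two thresholds
$\mu_i(I-1)$ and $\mu_i(I)$:

$$d_{ji}(I) = 1 \quad \text{if} \quad \mu_i(I-1) < Y_{ji} \le \mu_i(I)$$

The marginal probability of the event $d_{ji}(I) = 1$ can be written as

$$P(I) = \Pr\big(d_{ji}(I) = 1\big) = \Pr\big(\mu_i(I-1) < Y_{ji} \le \mu_i(I)\big) = \Phi\big(\mu_i(I) - X_{ji}\beta_j\big) - \Phi\big(\mu_i(I-1) - X_{ji}\beta_j\big)$$

where $\Phi(\cdot)$ represents the standard univariate normal cumulative distribution function.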

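The multivariate joint probability, written here for equations 1 and 2, can be defined as:

$$P(I) = \Pr\big(d_{1i}(I) = 1,\ d_{2i}(I) = 1\big) = \int_{\mu_i(I-1) - X_{1i}\beta_1}^{\mu_i(I) - X_{1i}\beta_1} \int_{\mu_i(I-1) - X_{2i}\beta_2}^{\mu_i(I) - X_{2i}\beta_2} \phi(\varepsilon_1, \varepsilon_2; \rho_{12}) \, d\varepsilon_2 \, d\varepsilon_1$$

where $\phi(\varepsilon_1, \varepsilon_2; \rho_{12})$ is a standard multivariate normal density and the last
argument, $\rho_{12}$, is the correlation coefficient between the two error terms.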
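As a worked illustration of the marginal probability, the category probabilities of a single ordered-probit equation can be computed as differences of the standard normal CDF evaluated at the thresholds. The sketch below is only an illustration in Python (it is not PROC QLIM code); the function name `category_probabilities` is hypothetical, and the thresholds used in the example are borrowed from the cutoffs of the simulation section.

```python
from statistics import NormalDist

def category_probabilities(xb, cutoffs):
    """Marginal ordered-probit probabilities for one observation.

    xb      -- systematic component (the linear index X * beta)
    cutoffs -- increasing interior thresholds; the outer thresholds are
               taken to be -infinity and +infinity
    """
    cdf = NormalDist().cdf  # standard normal CDF, Phi
    # Cumulative probabilities Phi(threshold - xb), padded with 0 and 1
    # for the outer thresholds at -infinity and +infinity.
    cumulative = [0.0] + [cdf(c - xb) for c in cutoffs] + [1.0]
    # Each category probability is the difference of two adjacent CDF values.
    return [upper - lower for lower, upper in zip(cumulative, cumulative[1:])]

# Example: five categories, linear index equal to zero.
probs = category_probabilities(0.0, [-0.84, -0.35, 0.35, 0.84])
print([round(p, 4) for p in probs])
```

The joint probability has no comparable one-line form, because it requires integrating the bivariate normal density numerically or by simulation.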

The QLIM (qualitative and limited dependent variable model) procedure analyzes
univariate and multivariate limited dependent variable models where dependent variables take
discrete values or dependent variables are observed only in a limited range of values. This
procedure includes logit, probit, tobit, selection, and multivariate models. The multivariate model
can contain discrete choice and limited endogenous variables as well as continuous endogenous
variables (SAS, User Guide).

Most of the models in the QLIM procedure can be extended to accommodate bivariate and
multivariate scenarios. The assumption that one variable is observed only if another variable takes
on certain values leads to the introduction of sample selection models. If the dependent variables
are mutually exclusive and observed only for certain ranges of the selection variable, the sample
selection model can be extended to include cases of switching regression. The QLIM procedure uses
maximum likelihood methods, where the starting values for the nonlinear optimization are
typically calculated by OLS.

Issues with QLIM

A frequent problem in estimating ordinal regression models is the failure of the likelihood
maximization algorithm to converge. The key question is why numerical algorithms for
maximum likelihood estimation of the ordinal regression model sometimes fail to converge.
A common problem in maximizing a function is the presence of local maxima. Fortunately, such
problems cannot occur with logistic/probit regressions because the log-likelihood is globally
concave. Unfortunately, there are many situations in which the likelihood function has no
maximum, in which case we say that the maximum likelihood estimate does not exist (Allison,
2008). Under the Newton-Raphson algorithm, when no maximum is present the iterates fail to
settle, and the gap between the estimates at steps t and t+1 does not shrink toward zero.
When such issues are present, SAS may report a false convergence for PROC QLIM, because
its iteration algorithms have been modified to improve convergence. In practice,
the most widely used method for dealing with complete or quasi-complete separation (when
convergence is not achieved) is simply to delete from the model any variables whose coefficients
did not converge.
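The non-existence of a maximum can be seen on a toy example. The sketch below is an assumption-laden illustration in Python, not the algorithm used by PROC QLIM: it runs plain Newton-Raphson on a binary logit with a single coefficient and no intercept, on four hypothetical data points that are completely separated. The names `newton_path` and `log_likelihood` are made up for this example.

```python
import math

# Toy data (hypothetical): y = 1 exactly when x > 0, so x separates the outcomes completely.
X = [-2.0, -1.0, 1.0, 2.0]
Y = [0, 0, 1, 1]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_likelihood(b):
    # log P(y | x) = -log(1 + exp(-s * x * b)), with s = +1 for y = 1 and s = -1 for y = 0
    return sum(-math.log1p(math.exp(-(2 * y - 1) * x * b)) for x, y in zip(X, Y))

def newton_path(iterations, start=0.0):
    """Return the coefficient after each Newton-Raphson step of a no-intercept logit."""
    b, path = start, [start]
    for _ in range(iterations):
        p = [sigmoid(x * b) for x in X]
        gradient = sum((y - pi) * x for x, y, pi in zip(X, Y, p))
        curvature = sum(pi * (1.0 - pi) * x * x for x, pi in zip(X, p))  # minus the Hessian
        b += gradient / curvature  # Newton-Raphson update
        path.append(b)
    return path

path = newton_path(12)
steps = [b1 - b0 for b0, b1 in zip(path, path[1:])]
for t, (b, step) in enumerate(zip(path[1:], steps), start=1):
    print(f"iteration {t:2d}: b = {b:7.3f}  step = {step:.3f}  logL = {log_likelihood(b):.6f}")
```

On these data the log-likelihood keeps rising toward zero as the coefficient grows, so the step between consecutive iterates never shrinks to zero and a fixed convergence tolerance on the parameter change is never met.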

II. How we can simplify


In order to avoid computational problems, researchers have the option of discounting
the cross-equation correlations. Past studies have done so by estimating separate ordered
probit models (see e.g. Orazem, Otto and Edelman, 1989; Barkley and Flinchbaugh, 1990). The issue
is analogous to multiple-equation models with continuous dependent variables. However, when the
responses or dependent variables are potentially correlated or interdependent, an efficiency loss
may be incurred if the correlation is not taken into account in the estimation (Fu et al., 2000).
One possible alternative to MVOP is to construct binary variables from the ordered
categorical variables and estimate a much simpler MVP. That approach is valid only if the research
objective allows the researcher to make inferences from binary response variables. The
multivariate probit is an appealing model of choice behavior because it allows a flexible correlation
structure for the unobservable variables. Zhao and Harris (2003) show by simulation that the
estimates of the six coefficients of their three univariate probit equations carry an average error
of 4.1% when the correlations across equations are ignored, whereas the average error falls by
about a third, to 2.7%, when the correlations are recognized using a multivariate probit. They then
apply a ranking in order to jointly study the changes in frequencies for the response variable.

Finally, a researcher can decide both to discount the correlations and to aggregate the ordered
variables into binary variables, and estimate a simple binary probit (BP).

III. Study based on Simulations


In order to evaluate the performance of various discrete choice models, we conduct a
simulation experiment. The data are generated by the following model:

(1) $Z_{ij} = \beta_{i0} + \beta_{i1}X_{ij1} + \beta_{i2}X_{ij2} + \beta_{i3}X_{ij3} + \beta_{i4}X_{ij4} + \beta_{i5}X_{ij5} + \beta_{i6}X_{ij6} + \beta_{i7}X_{ij7} + \nu_{ij}$;

where $i \in \{1, 2, 3\}$ and $j \in \{1, \dots, n\}$. The matrix of exogenous variables, $X = [X_{ij1}, \dots, X_{ij7}]$, was
generated as i.i.d. standard normal variables. The true parameter matrix, $\beta = [\beta_{i0}, \dots, \beta_{i7}]$, was set to the
following value, which is arbitrary:

(2) $\beta = [2,\ 0.8,\ 1.5,\ 0,\ 3,\ 0,\ 2,\ 0]$

The matrix $\nu_{ij}$ is the error term, generated as a multivariate standard normal variable from the $N(0, \Sigma)$
distribution using a linear transformation. We consider six different structures for the error term, with the
following correlations:

Correlation 1: $\Sigma_1$: $[\rho_{12}, \rho_{13}, \rho_{23}] = [0.5, 0.5, 0.5]$;

Correlation 2: $\Sigma_2$: $[\rho_{12}, \rho_{13}, \rho_{23}] = [0.1, 0.1, 0.1]$;

Correlation 3: $\Sigma_3$: $[\rho_{12}, \rho_{13}, \rho_{23}] = [0.8, 0.8, 0.8]$;

Correlation 4: $\Sigma_4$: $[\rho_{12}, \rho_{13}, \rho_{23}] = [0.2, 0.9, 0.5]$;

Correlation 5: $\Sigma_5$: $[\rho_{12}, \rho_{13}, \rho_{23}] = [0.2, 0.2, 0.2]$;

Correlation 6: $\Sigma_6$: $[\rho_{12}, \rho_{13}, \rho_{23}] = [0.4, 0.5, 0.6]$.
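One common way to obtain correlated normal errors by a linear transformation is to multiply independent standard normal draws by a Cholesky factor of the target correlation matrix. The text does not say which transformation was used, so the Python sketch below is an assumption; the helper names (`cholesky`, `correlation_matrix`, `draw_errors`, `sample_correlation`) are hypothetical.

```python
import math
import random

def correlation_matrix(rho12, rho13, rho23):
    """3 x 3 correlation matrix with a unit diagonal."""
    return [[1.0, rho12, rho13],
            [rho12, 1.0, rho23],
            [rho13, rho23, 1.0]]

def cholesky(matrix):
    """Lower-triangular L with L * L' = matrix (matrix must be positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = sum(lower[r][k] * lower[c][k] for k in range(c))
            if r == c:
                lower[r][c] = math.sqrt(matrix[r][r] - s)
            else:
                lower[r][c] = (matrix[r][c] - s) / lower[c][c]
    return lower

def draw_errors(sigma, n, rng):
    """Draw n error vectors from N(0, sigma) as L * z, with z i.i.d. standard normal."""
    lower = cholesky(sigma)
    size = len(sigma)
    draws = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(size)]
        draws.append([sum(lower[r][k] * z[k] for k in range(r + 1)) for r in range(size)])
    return draws

def sample_correlation(draws, a, b):
    """Pearson correlation between components a and b of the draws."""
    n = len(draws)
    mean_a = sum(d[a] for d in draws) / n
    mean_b = sum(d[b] for d in draws) / n
    cov = sum((d[a] - mean_a) * (d[b] - mean_b) for d in draws)
    var_a = sum((d[a] - mean_a) ** 2 for d in draws)
    var_b = sum((d[b] - mean_b) ** 2 for d in draws)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sigma4 = correlation_matrix(0.2, 0.9, 0.5)  # "Correlation 4" in the text
errors = draw_errors(sigma4, 5000, rng)
print(round(sample_correlation(errors, 0, 2), 2))  # sample estimate of rho_13
```

The Cholesky factor exists only when the correlation matrix is positive definite, which holds for all six structures listed above.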

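Next, we generate ordered categorical variables using the true latent left-hand side variable $Z$ as follows
(the choice of cutoffs is relatively arbitrary as long as they include positive and negative values):

(3) $$Y_{ij} = \begin{cases} 1 & \text{if } -\infty < Z_{ij} \le -0.84 \\ 2 & \text{if } -0.84 < Z_{ij} \le -0.35 \\ 3 & \text{if } -0.35 < Z_{ij} \le 0.35 \\ 4 & \text{if } 0.35 < Z_{ij} \le 0.84 \\ 5 & \text{if } 0.84 < Z_{ij} \end{cases}$$

From $Y$, we generate a set of binary variables by aggregating the categories in $Y$ into two groups
(the choice of groups is arbitrary). The new binary variable is generated as:

(4) $$D_{ij} = \begin{cases} 0 & \text{if } Y_{ij} \in \{1, 2, 3\} \\ 1 & \text{if } Y_{ij} \in \{4, 5\} \end{cases}$$

Subsequently, we estimate the following four discrete choice models:

Model 1: Multivariate ordered probit

(5) $Y_{ij} = \beta_{i0} + \sum_{k=1}^{7} \beta_{ik} X_{ijk} + \nu_{ij}$, where $\nu \sim N(0, \Sigma_l)$, $l \in \{1, \dots, 6\}$

Model 2: Ordered probit

(6) $Y_{ij} = \beta_{i0} + \sum_{k=1}^{7} \beta_{ik} X_{ijk} + \nu_{ij}$, where $\nu \sim N(0, I)$

Model 3: Multivariate binary probit

(7) $D_{ij} = \beta_{i0} + \sum_{k=1}^{7} \beta_{ik} X_{ijk} + \nu_{ij}$, where $\nu \sim N(0, \Sigma_l)$, $l \in \{1, \dots, 6\}$

Model 4: Binary probit

(8) $D_{ij} = \beta_{i0} + \sum_{k=1}^{7} \beta_{ik} X_{ijk} + \nu_{ij}$, where $\nu \sim N(0, I)$.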
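Equations (3) and (4) amount to a simple binning step on the latent variable. A minimal Python sketch of that step follows, assuming the cutoffs -0.84, -0.35, 0.35 and 0.84 and intervals that are closed on the right; the function names `to_ordinal` and `to_binary` are hypothetical.

```python
from bisect import bisect_left

CUTOFFS = [-0.84, -0.35, 0.35, 0.84]  # thresholds of equation (3)

def to_ordinal(z, cutoffs=CUTOFFS):
    """Map a latent value z to a category 1..5 as in equation (3).

    bisect_left counts the cutoffs strictly below z, so a value exactly
    equal to a cutoff falls in the lower category (intervals closed on the right).
    """
    return bisect_left(cutoffs, z) + 1

def to_binary(y):
    """Aggregate the ordinal categories as in equation (4): {1, 2, 3} -> 0, {4, 5} -> 1."""
    return 1 if y >= 4 else 0

latent = [-1.2, -0.84, -0.5, 0.0, 0.6, 1.7]
ordinal = [to_ordinal(z) for z in latent]
binary = [to_binary(y) for y in ordinal]
print(ordinal, binary)
```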

We perform 1000 simulations of 1000 observations each, then evaluate the performance of each model
using average bias, root mean squared error (RMSE), standard deviation, and Type I error.
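The four performance statistics can be computed from the replicated estimates of a single coefficient. The text does not spell out the exact formulas, so the Python sketch below rests on assumptions: bias is taken as the mean estimate minus the true value, and a Type I error is counted when a coefficient whose true value is zero is rejected by a two-sided test at the 5% level (critical value 1.96). All function names are hypothetical.

```python
import math

def bias(estimates, true_value):
    """Mean of the replicated estimates minus the true parameter value."""
    return sum(estimates) / len(estimates) - true_value

def rmse(estimates, true_value):
    """Root mean squared error of the replicated estimates."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))

def std_dev(estimates):
    """Sample standard deviation of the replicated estimates."""
    mean = sum(estimates) / len(estimates)
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / (len(estimates) - 1))

def type1_error_rate(estimates, std_errors, critical=1.96):
    """Share of replications rejecting H0: beta = 0 when the true beta is zero."""
    rejections = sum(1 for e, s in zip(estimates, std_errors) if abs(e / s) > critical)
    return rejections / len(estimates)

# Tiny made-up example: four replications of a coefficient whose true value is 1.0.
estimates = [0.9, 1.1, 1.0, 1.2]
print(bias(estimates, 1.0), rmse(estimates, 1.0), std_dev(estimates))
```

Under these definitions the squared RMSE equals the squared bias plus the variance of the estimates (computed with divisor n), which is why the bias and RMSE columns tend to move together across models.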

IV. Results:
This section presents overall statistics as well as the results for our four models under
different correlation levels. I considered three levels of correlation (high, medium and low)
between the three equations, in addition to the six different structures for the error term.
Tables 1, 3 and 5 report the mean bias, RMSE, standard error and Type I error of the estimates.

6.1 High Correlation

High correlation levels are forced between the three equations by using the independent
variables of equation one as the independent variables of equations two and three as well.

Table 1 shows that the means of the parameter estimates are quite close to their true values in terms
of both bias and RMSE. Even where a difference exists, as under the third correlation structure, the
difference in bias between models is quite small (less than $5 \times 10^{-3}$), and the difference in
RMSE is smaller still.

Also, the models that take order into account show smaller standard errors, while their Type I error
rates are close to those of the other models.

I expect bias and RMSE to fall as the sample size increases, which would indicate that the
estimators converge to the true values as the sample grows.

Finally, Table 1 gives a good picture of which model performs best: when order is taken into
account, MVOP performs better than MVP and the other two models. It is also clear that when
order is not considered, MVP performs better than BP, which is what was expected.

Table 2 shows the Pearson correlations between our latent variables when high correlation is present.

Table 1: Average Bias, RMSE, Standard Error and Type I Error of Regression Coefficients
for 1000 observations and 1000 simulations, all X values equal
Performance Statistics    Model 1: MVOP    Model 2: OP    Model 3: MVP    Model 4: BP
Correlation 1:
Bias 0.07414 0.08349 0.11112 0.11507
RMSE 0.11718 0.12422 0.17369 0.17697
Std. Error 0.08253 0.08258 0.12004 0.12036
Type I Error 0.07778 0.06667 0.10000 0.10000

Correlation 2:
Bias 0.07991 0.09180 0.12070 0.13319
RMSE 0.12402 0.13361 0.18719 0.19848
Std. Error 0.08445 0.08591 0.12685 0.12922
Type I Error 0.11111 0.10000 0.10000 0.10000

Correlation 3:
Bias 0.06063 0.05495 0.08870 0.08632
RMSE 0.10532 0.10223 0.15149 0.15296
Std. Error 0.07912 0.08068 0.11301 0.11632
Type I Error 0.06667 0.07778 0.10000 0.07778

Correlation 4:
Bias 0.06313 0.06216 0.09365 0.08075
RMSE 0.10956 0.11065 0.16303 0.15196
Std. Error 0.08181 0.08324 0.12084 0.12004
Type I Error 0.04444 0.06667 0.00000 0.04444

Correlation 5:
Bias 0.06256 0.06965 0.09594 0.10483
RMSE 0.10859 0.11355 0.16235 0.16892
Std. Error 0.08216 0.08343 0.11964 0.12135
Type I Error 0.08889 0.07778 0.07778 0.07778

Correlation 6:
Bias 0.07056 0.07226 0.11413 0.11131
RMSE 0.11582 0.11607 0.17729 0.17047
Std. Error 0.08487 0.08508 0.12327 0.12410
Type I Error 0.05556 0.05556 0.06667 0.05556

Table 2: Correlation Latent Variables
Correlation 1: Correlation 2:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 0.876 0.868 Y1 1.000 0.925 0.924

Y2 1.000 0.872 Y2 1.000 0.919

Y3 1.000 Y3 1.000

Correlation 3: Correlation 4:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 0.957 0.960 Y1 1.000 0.863 0.976

Y2 1.000 0.959 Y2 1.000 0.907

Y3 1.000 Y3 1.000

Correlation 5: Correlation 6:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 0.897 0.814 Y1 1.000 0.863 0.859

Y2 1.000 0.806 Y2 1.000 0.872

Y3 1.000 Y3 1.000

6.2 Medium Correlation

Medium correlation levels are also forced between the three equations by setting
$X_{11} = X_{21} = X_{31}$, $X_{12} = X_{22} = X_{32}$ and $X_{13} = X_{23} = X_{33}$, with the remaining
variables different across equations.

Table 3 shows that the means of the parameter estimates are also close to their true values.
Error structures 1, 4 and 6 show an unexpected result, but the difference between models is less
than 0.02 in both bias and RMSE, which can be considered low, and it should become
smaller as the sample size increases.

The results for standard deviation and Type I error show that order needs to be considered
as part of the model, since the MVOP values are lower than the others.

Finally, Table 3 still shows that MVOP performs better than MVP and the other two models. It is also
clear that when order is not considered part of the structure, MVP performs better than BP, which
is what was expected.

Table 3: Average Bias, RMSE, Standard Error and Type I Error of Regression Coefficients
for 1000 observations and 1000 simulations where Xi1 to Xi7 have the same values
Performance Statistics    Model 1: MVOP    Model 2: OP    Model 3: MVP    Model 4: BP
Correlation 1:
Bias 0.07111 0.07018 0.10895 0.09950
RMSE 0.11612 0.11439 0.17173 0.16432
Std. Error 0.08320 0.08334 0.11949 0.11990
Type I Error 0.06667 0.06667 0.07778 0.07778

Correlation 2:
Bias 0.06544 0.06829 0.10538 0.10063
RMSE 0.10893 0.11344 0.16513 0.16289
Std. Error 0.07903 0.08202 0.11518 0.11749
Type I Error 0.10000 0.10000 0.04444 0.06667

Correlation 3:
Bias 0.06826 0.08804 0.10166 0.11011
RMSE 0.10854 0.13099 0.17948 0.17768
Std. Error 0.07561 0.08570 0.12913 0.12370
Type I Error 0.06667 0.05556 0.06667 0.04444

Correlation 4:
Bias 0.07741 0.07589 0.10550 0.10470
RMSE 0.11591 0.11987 0.17889 0.17282
Std. Error 0.07626 0.08292 0.12269 0.12289
Type I Error 0.10000 0.06667 0.14444 0.03333

Correlation 5:
Bias 0.06273 0.06348 0.09652 0.11973
RMSE 0.10548 0.10797 0.15970 0.18034
Std. Error 0.07784 0.08079 0.11623 0.11888
Type I Error 0.07778 0.06667 0.07778 0.07778

Correlation 6:
Bias 0.07117 0.05743 0.12203 0.13092
RMSE 0.11748 0.11042 0.18680 0.19133
Std. Error 0.08373 0.08427 0.12468 0.12514
Type I Error 0.02222 0.02222 0.06667 0.05556

Table 4, like Table 2, presents the Pearson correlations between the latent variables, here when a
medium correlation level is present. The difference between Tables 2 and 4 is clear: in Table 4 the
correlations run from about 0.07 to 0.15, whereas under high correlation (Table 2) the Pearson
correlations lie between about 0.81 and 0.98.

Table 4: Correlation Latent Variables
Correlation 1: Correlation 2:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 0.110 0.110 Y1 1.000 0.078 0.107

Y2 1.000 0.117 Y2 1.000 0.084

Y3 1.000 Y3 1.000

Correlation 3: Correlation 4:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 0.134 0.125 Y1 1.000 0.148 0.153

Y2 1.000 0.153 Y2 1.000 0.152

Y3 1.000 Y3 1.000

Correlation 5: Correlation 6:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 0.103 0.119 Y1 1.000 0.108 0.068

Y2 1.000 0.104 Y2 1.000 0.133

Y3 1.000 Y3 1.000

6.3 Low Correlation

The ideal scenario is forced by making all independent variables different between
equations (i.e. $X_{1j} \neq X_{2j} \neq X_{3j}$ for $j = 1$ to $7$). All results behave as expected:
MVOP shows the lowest values for all estimators, followed by OP, MVP and BP.

Table 5: Average Bias, RMSE, Standard Error and Type I Error of Regression Coefficients
for 1000 observations and 1000 simulations, all X values different
Performance Statistics    Model 1: MVOP    Model 2: OP    Model 3: MVP    Model 4: BP
Correlation 1:
Bias 0.07725 0.08004 0.11447 0.11481
RMSE 0.12015 0.12402 0.17564 0.17799
Std. Error 0.08325 0.08345 0.11959 0.12025
Type I Error 0.05556 0.05556 0.06667 0.07778

Correlation 2:
Bias 0.06386 0.08288 0.10884 0.15059
RMSE 0.10856 0.12469 0.19659 0.21016
Std. Error 0.08027 0.08379 0.14564 0.12414
Type I Error 0.03333 0.04444 0.05556 0.05556

Correlation 3:
Bias 0.06195 0.08064 0.09756 0.10837
RMSE 0.10042 0.12081 0.16726 0.16676
Std. Error 0.07160 0.08339 0.12211 0.11950
Type I Error 0.03333 0.03333 0.07778 0.04444

Correlation 4:
Bias 0.06484 0.08082 0.11454 0.11448
RMSE 0.10495 0.12268 0.57118 0.17393
Std. Error 0.07580 0.08410 0.52446 0.12450
Type I Error 0.03333 0.04444 0.06667 0.03333

Correlation 5:
Bias 0.06057 0.04633 0.10961 0.09439
RMSE 0.10555 0.09808 0.25311 0.16356
Std. Error 0.07746 0.08152 0.20071 0.12046
Type I Error 0.05556 0.03333 0.06667 0.05556

Correlation 6:
Bias 0.06026 0.06500 0.09875 0.08809
RMSE 0.10782 0.11376 0.16226 0.15373
Std. Error 0.08157 0.08227 0.11782 0.11881
Type I Error 0.03333 0.02222 0.05556 0.05556

Furthermore, there exists a slight tendency for some estimators to deviate less from the true value
as the level of correlation coefficient increases.

Finally, Table 6 shows the correlations when our latent variables are forced to have small
correlation levels; all values are below 0.08 in absolute value.

Table 6: Correlation Latent Variables
Correlation 1: Correlation 2:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 -0.035 -0.026 Y1 1.000 0.013 0.039

Y2 1.000 0.078 Y2 1.000 0.002

Y3 1.000 Y3 1.000

Correlation 3: Correlation 4:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 0.054 -0.021 Y1 1.000 -0.013 0.050

Y2 1.000 0.078 Y2 1.000 0.029

Y3 1.000 Y3 1.000

Correlation 5: Correlation 6:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 -0.003 -0.063 Y1 1.000 -0.012 -0.027

Y2 1.000 0.022 Y2 1.000 0.003

Y3 1.000 Y3 1.000

V. Conclusions:

Overall, our results suggest that MVOP is the best model to use for models with three
ordered categorical variables and correlated error terms. If this model cannot be used or is too
computationally intensive, OP is the second most appropriate model in all of the scenarios considered here.

To conclude, some deficiencies in the study should be pointed out. While running the
simulations, a portion of the MVOP and MVP iterations did not converge when the correlation
between error terms was 0.8. These observations were thrown out. Extra simulations were run until
our sample size reached 1000. Non-convergence is an important criterion to consider when
choosing a model. Future simulations should investigate the convergence rates of the MVOP in
relation to MVP in models with high correlation between error terms. It should also be noted that
models with more than three equations and five categories were not tested. Adding more equations
or categories may also increase the likelihood of non-convergence.

Bibliography

Allison (2008) Convergence Failures in Logistic Regression, SAS Global Forum 2008, Paper 360.

Brondino, Franceschini, Galetto and Vicario (2006). Synthesis maps for multivariate ordinal
variables in manufacturing, International Journal of Production Research, Vol.44, 20, 4241-4255.

Fu, Li, Lin and Kan (2000) A limited information estimator for the multivariate ordinal probit model,
Applied Economics, 32, 1841-1851.

Greene, W (2003). Econometric Analysis, 5th Edition, Englewood Cliffs, Prentice Hall.

Huguenin, Pelgrin and Holly (2009) Estimation of Multivariate Probit Models by Exact Maximum
Likelihood. Working Paper n 09

McFadden, D. (1989) A method of simulated moments for estimation of discrete choice response
models without numerical integration, Econometrica, 57, 995- 1027

Orazem, Otto and Edelman (1989) An Analysis of Farmers' Agricultural Policy Preferences
American Journal of Agricultural Economics, Vol. 71, No. 4, 837-846.

Zhao and Harris (2003) Demand for Marijuana, Alcohol and Tobacco: Participation, Frequency
and Cross-Equation Correlations. Department of Econometrics and Business Statistics, Monash
University, Australia
