Paper MS Statistics
CORRELATION
Ludwig P. Linares
Summary
Correlated ordinal data often arise from surveys. One of the available techniques to
analyze this type of data is PROC QLIM in SAS using the MVOP (Multivariate Ordinal Probit)
option. While this works in general, the technique is memory intensive, often takes a long time
to run, and often has trouble converging. In this study we use Monte Carlo simulation
to investigate the consequences of aggregating ordinal response variables and/or ignoring
cross-equation correlations when analyzing models with ordinal response variables under multiple
correlation structures. Our results indicate that while MVOP remains the best model to fit this
data, Ordered Probit (which ignores the cross-equation correlation) performs competitively as well.
However, the Multivariate Probit (which ignores the ordinal nature of the data) is
less competitive. Hence, the take-home message is that if we want to simplify the problem, we
get better results by ignoring the cross-equation correlation than by ignoring the ordinal nature
of the data.
I. Introduction
Discrete choice models are now used in a wide variety of situations in applied statistics.
There are many applications of limited dependent variable and discrete choice models in
areas including economics, finance, marketing, political science, and sociology. Data
sets collected from surveys often result in cross-correlated ordinal responses. For example,
many quality characteristics of products or services are commonly evaluated on ordinal scales
with a finite number of categories. A systematic analysis of categorical variables collected over
time may be useful for a profitable management strategy. In practice, to measure customer
satisfaction or quality improvement in a process, two or more quality characteristics are often
jointly measured and summarized by suitable indexes. These examples result in multivariate
ordinal data.
In theory, the best option is the Multivariate Ordinal Probit (MVOP). This technique
accounts for both the cross-correlation and the ordinal nature of this type of data. An advantage
of the estimator is that even for high dimensional models, the estimation procedure requires the
evaluation of bivariate normal integrals only. Numerical integration is needed because the
normal cumulative distribution function cannot be expressed in closed form. The exact estimation of a
Multivariate Probit (MVP) model with more than two dependent variables is computationally
infeasible. This explains why most applications of the MVP and Ordinal Probit (OP) models
are limited to at most two dependent variables. Most software uses a practical solution
to the problem of integration, namely simulation methods that approximate the integrals
(McFadden, 1989). However, these methods are still very computationally intensive.
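As a rough illustration of what approximating a normal integral by simulation means, the sketch below estimates a bivariate normal rectangle probability with a crude frequency simulator. This is deliberately simpler than the simulators used in practice and is not the method of McFadden (1989) or of PROC QLIM; the function name, the number of draws and the seed are all assumptions made for the example.

```python
import math
import random

def simulate_bvn_rectangle(lo1, hi1, lo2, hi2, rho, draws=200_000, seed=0):
    """Crude frequency simulator for P(lo1 < e1 <= hi1, lo2 < e2 <= hi2),
    where (e1, e2) is standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho * rho)
    hits = 0
    for _ in range(draws):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        e1 = z1
        e2 = rho * z1 + scale * z2  # linear transformation giving corr(e1, e2) = rho
        if lo1 < e1 <= hi1 and lo2 < e2 <= hi2:
            hits += 1
    return hits / draws

# Check against a case with a known closed form:
# P(e1 <= 0, e2 <= 0) = 1/4 + arcsin(rho) / (2 * pi)
rho = 0.5
approx = simulate_bvn_rectangle(-math.inf, 0.0, -math.inf, 0.0, rho)
exact = 0.25 + math.asin(rho) / (2.0 * math.pi)
print(f"simulated = {approx:.4f}, closed form = {exact:.4f}")
```

The cost grows with the number of draws and with the number of probabilities that must be evaluated at every likelihood iteration, which is one way to see why simulation-based estimation remains expensive.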
SAS provides PROC QLIM (qualitative and limited dependent variable model) for the analysis
of this kind of data. Specifically, it provides options for MVOP (Multivariate Ordered Probit), OP
(Ordered Probit), MVP (Multivariate Probit), as well as the simple case of Binary Probit (BP). Our goal
in this paper is to compare the performance of these four models in terms of Bias, RMSE, Standard
Error and Type I error using Monte Carlo simulation.
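More specifically, for some individual $i$ ($i = 1, 2, 3$), choice $j$ ($j = 1, 2, 3$) yields a level of utility $U_{ij}$ made up of a
systematic component $X_{ij}'\beta_j$ and a random component $\varepsilon_{ij}$:
$$U_{ij} = X_{ij}'\beta_j + \varepsilon_{ij},$$
where $X_{ij}$ represents our explanatory variables, $\beta_j$ is a vector of unknown parameters, and $\varepsilon_{ij}$ is a
standard normally distributed random variable, independent and identically distributed across individuals but
correlated between choices with covariance matrix $\Sigma$, which has a unit diagonal:
$$\varepsilon_i \sim N(0, \Sigma).$$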
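Individual $i$ is observed to pick category $l$ of choice $j$ if the latent utility $U_{ij}$ falls between two thresholds:
$$\mu_{j(l)} < U_{ij} \le \mu_{j(l+1)},$$
so that
$$P(Y_{ij} = l) = \Phi\left(\mu_{j(l+1)} - X_{ij}'\beta_j\right) - \Phi\left(\mu_{j(l)} - X_{ij}'\beta_j\right).$$
For two choices, the joint probability of the observed categories is the bivariate normal integral
$$P(Y_{i1} = l_1, Y_{i2} = l_2) = \int_{\mu_{1(l_1)} - X_{i1}'\beta_1}^{\mu_{1(l_1+1)} - X_{i1}'\beta_1} \int_{\mu_{2(l_2)} - X_{i2}'\beta_2}^{\mu_{2(l_2+1)} - X_{i2}'\beta_2} \phi(\varepsilon_1, \varepsilon_2; \rho_{12}) \, d\varepsilon_2 \, d\varepsilon_1,$$
and the likelihood is the product of these probabilities over individuals.
Here $\phi(\varepsilon_1, \varepsilon_2; \rho_{12})$ is the standard multivariate normal density, and the last argument
is the correlation coefficient.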
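To make the threshold rule concrete, the following minimal sketch computes the univariate ordered probit category probabilities as differences of normal CDF values. The function name `ordered_probit_probs` is hypothetical, and the cutoffs are the ones used for the simulated data later in the paper; the linear index `xb` stands in for the systematic component.

```python
from statistics import NormalDist

def ordered_probit_probs(xb, cutoffs):
    """Return P(Y = 1), ..., P(Y = K) for a linear index xb and sorted cutoffs.

    P(Y = k) = Phi(c_k - xb) - Phi(c_{k-1} - xb), with c_0 = -inf and c_K = +inf.
    """
    cdf = NormalDist().cdf
    # Pad with 0 and 1 for the cutoffs at minus and plus infinity.
    cumulative = [0.0] + [cdf(c - xb) for c in cutoffs] + [1.0]
    return [cumulative[k + 1] - cumulative[k] for k in range(len(cumulative) - 1)]

# Five categories from four cutoffs, evaluated at a linear index of zero.
probs = ordered_probit_probs(xb=0.0, cutoffs=[-0.84, -0.35, 0.35, 0.84])
print([round(p, 3) for p in probs])
```

With these cutoffs and a zero index, the outer categories each carry a probability of about 0.2, and the five probabilities sum to one by construction.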
The QLIM (qualitative and limited dependent variable model) procedure analyzes
univariate and multivariate limited dependent variable models where dependent variables take
discrete values or dependent variables are observed only in a limited range of values. This
procedure includes logit, probit, tobit, selection, and multivariate models. The multivariate model
can contain discrete choice and limited endogenous variables as well as continuous endogenous
variables (SAS, User Guide).
Most of the models in the QLIM procedure can be extended to accommodate bivariate and
multivariate scenarios. The assumption that one variable is observed only if another variable takes
on certain values leads to the introduction of sample selection models. If the dependent variables
are mutually exclusive and observed only for certain ranges of the selection variable, the sample
selection can be extended to include cases of switching regression. The QLIM procedure uses
maximum likelihood methods where the initial starting values for the non-linear optimizations are
typically calculated by OLS.
error is present for the estimates of six coefficients of their three Univariate Probit equations when
the correlations across equations are ignored; the average error falls by almost half, to
2.7%, when the correlations are recognized using a Multivariate Probit under simulation. Later on
they apply a rank in order to jointly study the changes in frequencies for the response variable.
Finally, a researcher can decide both to ignore the correlations and to aggregate the ordered
variables into binary variables, and then estimate a simple binary probit (BP).
$$Z_{ij} = \beta_{i0} + \beta_{i1} X_{ij1} + \beta_{i2} X_{ij2} + \beta_{i3} X_{ij3} + \beta_{i4} X_{ij4} + \beta_{i5} X_{ij5} + \beta_{i6} X_{ij6} + \beta_{i7} X_{ij7} + v_{ij}, \tag{1}$$
where $i \in \{1, 2, 3\}$ and $j \in \{1, \dots, n\}$. The matrix of exogenous variables, $X = [X_{ij1}, \dots, X_{ij7}]$, was
generated as i.i.d. standard normal variables. The true parameter matrix, $\beta = [\beta_{i0}, \dots, \beta_{i7}]$, was set to the
following value, which is arbitrary:
The matrix $v_{ij}$ is the error term, generated as a multivariate standard normal variable from the $N(0, \Sigma)$
distribution.¹ We consider six different structures for the error term with the following correlations:
¹ Using a linear transformation.
Correlation 5: $\Sigma_5$: $[\rho_{12}, \rho_{13}, \rho_{23}] = [0.2, 0.2, 0.2]$;
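The following sketch shows one way to draw error terms with a given correlation structure through a linear transformation of independent standard normals. The paper does not say which transformation was used, so the Cholesky factor here is an assumption, as are the helper names; the example uses the Correlation 5 values.

```python
import math
import random

def correlation_matrix(r12, r13, r23):
    """3 x 3 correlation matrix with unit diagonal."""
    return [[1.0, r12, r13], [r12, 1.0, r23], [r13, r23, 1.0]]

def cholesky(a):
    """Lower-triangular factor low with low * low' = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def draw_errors(low, rng):
    """One draw of (v1, v2, v3) = low * z, with z i.i.d. standard normal."""
    z = [rng.gauss(0.0, 1.0) for _ in low]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(low))]

sigma5 = correlation_matrix(0.2, 0.2, 0.2)  # Correlation 5
low5 = cholesky(sigma5)
rng = random.Random(1)  # arbitrary seed, for reproducibility only
sample = [draw_errors(low5, rng) for _ in range(20_000)]
```

Because each component keeps unit variance, the probit normalization of the error scale is preserved in every equation.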
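Next, we generate ordered categorical variables using the true latent left-hand side variable $Z$ as:²
$$Y_{ij} = \begin{cases} 1 & \text{if } -\infty < Z_{ij} \le -0.84 \\ 2 & \text{if } -0.84 < Z_{ij} \le -0.35 \\ 3 & \text{if } -0.35 < Z_{ij} \le 0.35 \\ 4 & \text{if } 0.35 < Z_{ij} \le 0.84 \\ 5 & \text{if } 0.84 < Z_{ij} \end{cases} \tag{3}$$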
From $Y$, we generate a set of binary variables by aggregating the categories in $Y$ into two groups.
The new binary variable is generated as:³
$$D_{ij} = \begin{cases} 0 & \text{if } Y_{ij} \in \{1, 2, 3\} \\ 1 & \text{if } Y_{ij} \in \{4, 5\} \end{cases} \tag{4}$$
² The choice of cutoffs is relatively arbitrary as long as they include both positive and negative values.
³ The choice of groups is arbitrary.
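The two mappings in equations (3) and (4) can be sketched as follows. The function names are hypothetical; `bisect_left` counts the cutoffs strictly below the latent value, which reproduces intervals that are open on the left and closed on the right.

```python
from bisect import bisect_left

CUTOFFS = [-0.84, -0.35, 0.35, 0.84]  # thresholds from equation (3)

def to_ordinal(z, cutoffs=CUTOFFS):
    """Map a latent value z to a category 1..5 as in equation (3)."""
    return bisect_left(cutoffs, z) + 1

def to_binary(y):
    """Aggregate categories {1, 2, 3} to 0 and {4, 5} to 1 as in equation (4)."""
    return 1 if y >= 4 else 0

latent = [-1.2, -0.84, -0.5, 0.0, 0.35, 0.6, 0.9]
ordinal = [to_ordinal(z) for z in latent]
binary = [to_binary(y) for y in ordinal]
print(ordinal)
print(binary)
```

The binary variable discards the distinction between categories within each group, which is the information loss that the MVP and BP models are subject to.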
$$D_{ij} = \beta_{i0} + \sum_{k=1}^{7} \beta_{ik} X_{ijk} + v_{ij}, \quad \text{where } v \sim N(0, I). \tag{8}$$
We perform 1000 simulations of 1000 observations each, then evaluate the performance of each model
using average bias, root mean squared error (RMSE), standard deviation, and Type I error.
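A minimal sketch of how these four statistics could be computed for a single coefficient across replications is given below. The paper does not state its exact formulas or how it averages over coefficients, so everything here is an assumption: bias is taken as the mean estimate minus the true value, and Type I error as the share of replications in which a two-sided z-test of the true value rejects at the 5% level.

```python
import math
from statistics import NormalDist, mean, stdev

def performance(estimates, std_errors, true_value, alpha=0.05):
    """Summarize replicated estimates of one coefficient (assumed definitions)."""
    n = len(estimates)
    bias = mean(estimates) - true_value
    rmse = math.sqrt(sum((b - true_value) ** 2 for b in estimates) / n)
    std_dev = stdev(estimates)  # spread of the estimates across replications
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # A rejection of the true value is a Type I error by construction.
    rejected = sum(abs(b - true_value) / se > crit
                   for b, se in zip(estimates, std_errors))
    return {"bias": bias, "rmse": rmse, "std_dev": std_dev, "type1": rejected / n}

# Toy input with four replications; the numbers are made up for illustration.
demo = performance([0.9, 1.1, 1.0, 1.2], [0.1, 0.1, 0.1, 0.05], true_value=1.0)
print(demo)
```

In a full study the same function would be applied to each coefficient and each model, and the results averaged to fill one cell of a summary table.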
IV. Results:
This section presents overall statistics as well as the results for our four models under
different correlation levels. We considered three levels of correlation (high, medium and low)
between the three equations, in addition to six different structures for the error term.
Tables 1, 3 and 5 provide a summary report on the mean Bias, RMSE, Standard Error and Type I
error of the estimates.
High correlation levels are forced between the three equations by using the independent
variables from equation one as the independent variables for equations two and three as well.
Table 1 shows that the means of the parameter estimates are quite close to their true values in terms
of both Bias and RMSE. Even under the third correlation structure, where OP shows slightly lower
Bias and RMSE than MVOP, the difference in bias is small (about $5 \times 10^{-3}$) and the difference
in RMSE is smaller still. Also, the models that account for order (MVOP and OP) show smaller
standard errors, whereas the Type I errors are closer across models.
We would expect Bias and RMSE to fall as the sample size increases, which would indicate that the
estimators converge to the true values.
Finally, Table 1 gives a good picture of which model performs best: when order is taken into
account, MVOP performed better than MVP or the other two models. It is also clear that when
order is not taken into account, MVP performed better than BP, which is what was expected.
Table 2 shows the Pearson correlations between our latent variables when high correlation is present.
Table 1: Average Bias, RMSE, Standard Error and Type I Error of Regression Coefficients
for 1000 observations and 1000 simulations, all X values equal
Performance Statistics   Model 1: MVOP   Model 2: OP   Model 3: MVP   Model 4: BP
Correlation 1:
Bias 0.07414 0.08349 0.11112 0.11507
RMSE 0.11718 0.12422 0.17369 0.17697
Std. Error 0.08253 0.08258 0.12004 0.12036
Type I Error 0.07778 0.06667 0.10000 0.10000
Correlation 2:
Bias 0.07991 0.09180 0.12070 0.13319
RMSE 0.12402 0.13361 0.18719 0.19848
Std. Error 0.08445 0.08591 0.12685 0.12922
Type I Error 0.11111 0.10000 0.10000 0.10000
Correlation 3:
Bias 0.06063 0.05495 0.08870 0.08632
RMSE 0.10532 0.10223 0.15149 0.15296
Std. Error 0.07912 0.08068 0.11301 0.11632
Type I Error 0.06667 0.07778 0.10000 0.07778
Correlation 4:
Bias 0.06313 0.06216 0.09365 0.08075
RMSE 0.10956 0.11065 0.16303 0.15196
Std. Error 0.08181 0.08324 0.12084 0.12004
Type I Error 0.04444 0.06667 0.00000 0.04444
Correlation 5:
Bias 0.06256 0.06965 0.09594 0.10483
RMSE 0.10859 0.11355 0.16235 0.16892
Std. Error 0.08216 0.08343 0.11964 0.12135
Type I Error 0.08889 0.07778 0.07778 0.07778
Correlation 6:
Bias 0.07056 0.07226 0.11413 0.11131
RMSE 0.11582 0.11607 0.17729 0.17047
Std. Error 0.08487 0.08508 0.12327 0.12410
Type I Error 0.05556 0.05556 0.06667 0.05556
Table 2: Correlation Latent Variables
Correlation 1: Correlation 2:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 0.876 0.868 Y1 1.000 0.925 0.924
Y3 1.000 Y3 1.000
Correlation 3: Correlation 4:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 0.957 0.960 Y1 1.000 0.863 0.976
Y3 1.000 Y3 1.000
Correlation 5: Correlation 6:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 0.897 0.814 Y1 1.000 0.863 0.859
Y3 1.000 Y3 1.000
Medium correlation levels are also forced between the three equations by using the same
independent variables $X_{11} = X_{21} = X_{31}$, $X_{12} = X_{22} = X_{32}$ and $X_{13} = X_{23} = X_{33}$, with the rest of the variables
different.
Table 3 shows that the means of the parameter estimates are also close to their true values.
Here error structures 1, 4 and 6 show an unexpected result, but the difference between the models is less
than 0.02 in terms of Bias and RMSE, which can be considered low and should
shrink as the sample size increases.
The results for standard deviation and Type I error show that order needs to be considered
as part of the model, since the MVOP values are lower than the others.
Finally, Table 3 still shows that MVOP performed better than MVP or the other two models. It is clear
that when order is not considered part of the structure, MVP performs better than BP, which
is what was expected.
Table 3: Average Bias, RMSE, Standard Error and Type I Error of Regression Coefficients
for 1000 observations and 1000 simulations where Xi1 to Xi7 have the same values
Performance Statistics   Model 1: MVOP   Model 2: OP   Model 3: MVP   Model 4: BP
Correlation 1:
Bias 0.07111 0.07018 0.10895 0.09950
RMSE 0.11612 0.11439 0.17173 0.16432
Std. Error 0.08320 0.08334 0.11949 0.11990
Type I Error 0.06667 0.06667 0.07778 0.07778
Correlation 2:
Bias 0.06544 0.06829 0.10538 0.10063
RMSE 0.10893 0.11344 0.16513 0.16289
Std. Error 0.07903 0.08202 0.11518 0.11749
Type I Error 0.10000 0.10000 0.04444 0.06667
Correlation 3:
Bias 0.06826 0.08804 0.10166 0.11011
RMSE 0.10854 0.13099 0.17948 0.17768
Std. Error 0.07561 0.08570 0.12913 0.12370
Type I Error 0.06667 0.05556 0.06667 0.04444
Correlation 4:
Bias 0.07741 0.07589 0.10550 0.10470
RMSE 0.11591 0.11987 0.17889 0.17282
Std. Error 0.07626 0.08292 0.12269 0.12289
Type I Error 0.10000 0.06667 0.14444 0.03333
Correlation 5:
Bias 0.06273 0.06348 0.09652 0.11973
RMSE 0.10548 0.10797 0.15970 0.18034
Std. Error 0.07784 0.08079 0.11623 0.11888
Type I Error 0.07778 0.06667 0.07778 0.07778
Correlation 6:
Bias 0.07117 0.05743 0.12203 0.13092
RMSE 0.11748 0.11042 0.18680 0.19133
Std. Error 0.08373 0.08427 0.12468 0.12514
Type I Error 0.02222 0.02222 0.06667 0.05556
Table 4, like Table 2, presents the Pearson correlations between latent variables, here when a medium
correlation level is present. The difference between Tables 2 and 4 is clear: in Table 4 the
correlations run from roughly 0.08 to 0.15, whereas under high correlation the Pearson correlations
are roughly between 0.85 and 0.93.
Table 4: Correlation Latent Variables
Correlation 1: Correlation 2:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 0.110 0.110 Y1 1.000 0.078 0.107
Y3 1.000 Y3 1.000
Correlation 3: Correlation 4:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 0.134 0.125 Y1 1.000 0.148 0.153
Y3 1.000 Y3 1.000
Correlation 5: Correlation 6:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 0.103 0.119 Y1 1.000 0.108 0.068
Y3 1.000 Y3 1.000
The ideal scenario is forced where all independent variables differ between
equations (i.e. $X_{1j} \neq X_{2j} \neq X_{3j}$ for $j = 1$ to $7$). All results behave as expected: MVOP
generally shows the lowest values for all performance statistics, followed by OP, MVP and BP.
Table 5: Average Bias, RMSE, Standard Error and Type I Error of Regression Coefficients
for 1000 observations and 1000 simulations, all X values different
Performance Statistics   Model 1: MVOP   Model 2: OP   Model 3: MVP   Model 4: BP
Correlation 1:
Bias 0.07725 0.08004 0.11447 0.11481
RMSE 0.12015 0.12402 0.17564 0.17799
Std. Error 0.08325 0.08345 0.11959 0.12025
Type I Error 0.05556 0.05556 0.06667 0.07778
Correlation 2:
Bias 0.06386 0.08288 0.10884 0.15059
RMSE 0.10856 0.12469 0.19659 0.21016
Std. Error 0.08027 0.08379 0.14564 0.12414
Type I Error 0.03333 0.04444 0.05556 0.05556
Correlation 3:
Bias 0.06195 0.08064 0.09756 0.10837
RMSE 0.10042 0.12081 0.16726 0.16676
Std. Error 0.07160 0.08339 0.12211 0.11950
Type I Error 0.03333 0.03333 0.07778 0.04444
Correlation 4:
Bias 0.06484 0.08082 0.11454 0.11448
RMSE 0.10495 0.12268 0.57118 0.17393
Std. Error 0.07580 0.08410 0.52446 0.12450
Type I Error 0.03333 0.04444 0.06667 0.03333
Correlation 5:
Bias 0.06057 0.04633 0.10961 0.09439
RMSE 0.10555 0.09808 0.25311 0.16356
Std. Error 0.07746 0.08152 0.20071 0.12046
Type I Error 0.05556 0.03333 0.06667 0.05556
Correlation 6:
Bias 0.06026 0.06500 0.09875 0.08809
RMSE 0.10782 0.11376 0.16226 0.15373
Std. Error 0.08157 0.08227 0.11782 0.11881
Type I Error 0.03333 0.02222 0.05556 0.05556
Furthermore, there is a slight tendency for some estimators to deviate less from the true value
as the correlation coefficient increases.
Finally, Table 6 shows the case where our latent variables are forced to have low correlation levels,
with correlations roughly between 0.00 and 0.04.
Table 6: Correlation Latent Variables
Correlation 1: Correlation 2:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 -0.035 -0.026 Y1 1.000 0.013 0.039
Y3 1.000 Y3 1.000
Correlation 3: Correlation 4:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 0.054 -0.021 Y1 1.000 -0.013 0.050
Y3 1.000 Y3 1.000
Correlation 5: Correlation 6:
Y1 Y2 Y3 Y1 Y2 Y3
Y1 1.000 -0.003 -0.063 Y1 1.000 -0.012 -0.027
Y3 1.000 Y3 1.000
V. Conclusions:
Overall, our results suggest that MVOP is the best model to use for models with three
ordered categorical variables and correlated error terms. If this model cannot be used or is too
computationally intensive, OP is the second most appropriate model in any of the scenarios considered here.
To conclude, some deficiencies in the study should be pointed out. While running the
simulations, a portion of the MVOP and MVP iterations did not converge when the correlation
between error terms was 0.8. These observations were thrown out. Extra simulations were run until
our sample size reached 1000. Non-convergence is an important criterion to consider when
choosing a model. Future simulations should investigate the convergence rates of the MVOP in
relation to MVP in models with high correlation between error terms. It should also be noted that
models with more than three equations and five categories were not tested. Adding more equations
or categories may also increase the likelihood of non-convergence.
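The replacement scheme described above, and the convergence rate it implies, can be sketched as below. This is only an illustration: `fake_fit` is a hypothetical stand-in for one simulated model fit, and its 10% failure rate is an arbitrary assumption, not a figure from this study.

```python
import random

def run_until_converged(fit_once, target=1000, max_attempts=10_000):
    """Keep fitting new replications until `target` of them have converged."""
    kept, attempts = [], 0
    while len(kept) < target and attempts < max_attempts:
        attempts += 1
        result = fit_once(attempts)  # the attempt number doubles as the seed
        if result["converged"]:
            kept.append(result)
    return kept, attempts

def fake_fit(seed, fail_rate=0.1):
    """Hypothetical stand-in for simulating one data set and fitting one model."""
    rng = random.Random(seed)
    return {"converged": rng.random() >= fail_rate,
            "estimate": rng.gauss(1.0, 0.1)}

kept, attempts = run_until_converged(fake_fit, target=1000)
convergence_rate = len(kept) / attempts
print(f"kept {len(kept)} of {attempts} attempts, rate = {convergence_rate:.3f}")
```

Recording the number of attempts alongside the retained replications makes the convergence rate available as a fifth performance statistic when comparing models.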
Bibliography
Allison (2008). Convergence Failures in Logistic Regression, SAS Global Forum 2008, Paper 360.
Brondino, Franceschini, Galetto and Vicario (2006). Synthesis maps for multivariate ordinal
variables in manufacturing, International Journal of Production Research, Vol. 44, No. 20, 4241-4255.
Fu, Li, Lin and Kan (2000). A limit information estimator for the multivariate ordinal probit model,
Applied Econometrics, 32, 1841-1851.
Greene, W. (2003). Econometric Analysis, 5th Edition, Englewood Cliffs, Prentice Hall.
Huguenin, Pelgrin and Holly (2009). Estimation of Multivariate Probit Models by Exact Maximum
Likelihood. Working Paper n 09
McFadden, D. (1989). A method of simulated moments for estimation of discrete choice response
models without numerical integration, Econometrica, 57, 995-1027.
Orazem, Otto and Edelman (1989). An Analysis of Farmers' Agricultural Policy Preferences,
American Journal of Agricultural Economics, Vol. 71, No. 4, 837-846.
Zhao and Harris (2003). Demand for Marijuana, Alcohol and Tobacco: Participation, Frequency
and Cross-Equation Correlations. Department of Econometrics and Business Statistics, Monash
University, Australia.