Latent Class Cluster Analysis - Vermunt, Magidson
INTRODUCTION
Kaufman and Rousseeuw (1990) define cluster analysis as the classification of similar objects into groups, where the number of groups, as well as their forms, are unknown. The form of a group refers to the parameters of the cluster; that is, to its cluster-specific means, variances, and covariances, which also have a geometrical interpretation. A similar definition is given by Everitt (1993), who speaks about deriving a useful division into a number of classes, where both the number of classes and the properties of the classes are to be determined. These could also be definitions of exploratory LC analysis, in which objects are assumed to belong to one of a set of K latent classes, with the number of classes and their sizes not known a priori. In addition, objects belonging to the same class are similar with respect to the observed variables in the sense that their observed scores are assumed to come from the same probability distributions, whose parameters are, however, unknown quantities to be estimated. Because of the similarity between cluster and exploratory LC analysis, it is not surprising that the latter method is becoming an increasingly popular clustering tool. In this paper, we describe the state of the art in the field of LC cluster analysis. Most of the work in this field involves continuous indicators, assuming (restricted) multivariate normal distributions within classes. Although authors seldom refer to the work of Gibson (1959) and Lazarsfeld and Henry (1968), they are actually using what these authors called latent profile analysis: that is, latent structure models with a single categorical latent variable and a set of continuous indicators. Wolfe (1970) was the first to make an explicit connection between LC and cluster analysis. Over the last decade, there has been renewed interest in the application of LC analysis as a cluster analysis method.
Labels that are used to describe such a use of LC analysis are: mixture likelihood approach to clustering (McLachlan and Basford 1988; Everitt 1993), model-based clustering (Banfield and Raftery 1993; Bensmail et al. 1997; Fraley and Raftery 1998a, 1998b), mixture-model clustering (Jorgensen and Hunt 1996; McLachlan et al. 1999), Bayesian classification (Cheeseman and Stutz 1995), unsupervised learning (McLachlan and Peel 1996), and latent class cluster analysis (Vermunt and Magidson 2000). Probably the most important reason for the increased popularity of LC analysis as a statistical tool for cluster analysis is that high-speed computers now make these computationally intensive methods practically applicable. Several software packages are available for the estimation of LC cluster models.
An important difference between standard cluster analysis techniques and LC clustering is that the latter is a model-based clustering approach. This means that a statistical model is postulated for the population from which the sample under study is drawn. More precisely, it is assumed that the data are generated by a mixture of underlying probability distributions. When using the maximum likelihood method for parameter estimation, the clustering problem involves maximizing a log-likelihood function. This is similar to standard non-hierarchical cluster techniques, in which the allocation of objects to clusters should be optimal according to some criterion. These criteria typically involve minimizing the within-cluster variation and/or maximizing the between-cluster variation. An advantage of using a statistical model is, however, that the choice of the cluster criterion is less arbitrary. Nevertheless, the log-likelihood functions corresponding to LC cluster models may be similar to the criteria used by certain non-hierarchical cluster techniques like k-means.
LC clustering is very flexible in the sense that both simple and complicated distributional forms can be used for the observed variables within clusters. As in any statistical model, restrictions can be imposed on the parameters to obtain more parsimony, and formal tests can be used to check their validity. Another advantage of the model-based clustering approach is that no decisions have to be made about the scaling of the observed variables: for instance, when working with normal distributions with unknown variances, the results will be the same irrespective of whether the variables are normalized or not. This is very different from standard non-hierarchical cluster methods, where scaling is always an issue. Other advantages are that it is relatively easy to deal with variables of mixed measurement levels (different scale types) and that there are more formal criteria to make decisions about the number of clusters and other model features.
LC analysis yields a probabilistic clustering approach. This means that although each object is assumed to belong to one class or cluster, it is taken into account that there is uncertainty about an object's class membership.
This makes LC clustering conceptually similar to fuzzy clustering techniques. An important difference between these two approaches is, however, that in fuzzy clustering an object's grades of membership are the parameters to be estimated (Kaufman and Rousseeuw 1990), while in LC clustering an individual's posterior class-membership probabilities are computed from the estimated model parameters and his or her observed scores. This makes it possible to classify other objects belonging to the population from which the sample is taken, which is not possible with standard fuzzy cluster techniques.
The remainder of this paper is organized as follows. The next section discusses the LC cluster model for continuous variables. Subsequently, attention is paid to models for sets of indicators of different measurement levels, also known as mixed-mode data. Then we explain how to include covariates in an LC cluster model. After discussing estimation and testing, two empirical examples are presented. The paper ends with a short discussion. An appendix describes computer programs that implement the various kinds of LC clustering methods presented in this paper.
$$f(y_i \mid \theta) = \sum_{k=1}^{K} \pi_k \, f_k(y_i \mid \theta_k) .$$
Here, $y_i$ denotes an object's scores on a set of observed variables, $K$ is the number of clusters, and $\pi_k$ denotes the prior probability of belonging to latent class or cluster $k$ or, equivalently, the size of cluster $k$. Alternative labels for the $y$'s are indicators, dependent variables, outcome variables, outputs, endogenous variables, or items. As can be seen, the distribution of $y_i$ given the model parameters $\theta$, $f(y_i \mid \theta)$, is assumed to be a mixture of class-specific densities, $f_k(y_i \mid \theta_k)$.
Most of the work on LC cluster analysis has been done for continuous variables. Generally, these continuous variables are assumed to be normally distributed within latent classes, possibly after applying an appropriate non-linear transformation (Lazarsfeld and Henry 1968; Banfield and Raftery 1993; McLachlan 1988; McLachlan et al. 1999; Cheeseman and Stutz 1995). Alternatives to the normal distribution are Student, Gompertz, or gamma distributions (see, for instance, McLachlan et al. 1999).
The most general Gaussian distribution, of which all restricted versions discussed below are special cases, is the multivariate normal model with parameters $\mu_k$ and $\Sigma_k$. If no further restrictions are imposed, the LC clustering problem involves estimating a separate set of means, variances, and covariances for each latent class. In most applications, the main objective is finding classes that differ with respect to their means or locations. The fact that the model allows classes to have different variances implies that classes may also differ with respect to the homogeneity of the responses to the observed variables. In standard LC models with categorical variables, it is generally assumed that the observed variables are mutually independent within clusters. This is, however, not necessary here. The fact that each class has its own set of covariances means that the $y$ variables may be correlated within clusters, as well as that these correlations may be cluster specific. So, the clusters differ not only with respect to their means and variances, but also with respect to the correlations between the observed variables.
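As an illustration of the mixture density with unrestricted multivariate normal class densities, the following minimal Python sketch evaluates $f(y_i \mid \theta)$ for a single object. The function names and all parameter values are hypothetical and chosen for illustration only; they do not come from any data set discussed in this paper.

```python
import numpy as np

def mvn_pdf(y, mean, cov):
    """Density of a multivariate normal distribution evaluated at y."""
    y = np.asarray(y, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = y.shape[0]
    diff = y - mean
    # Solve a linear system rather than inverting the covariance matrix.
    quad = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    return float(np.exp(-0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)))

def mixture_pdf(y, weights, means, covs):
    """f(y | theta) = sum_k pi_k * f_k(y | mu_k, Sigma_k)."""
    return sum(w * mvn_pdf(y, m, c) for w, m, c in zip(weights, means, covs))

# Hypothetical two-class model with two indicators (illustrative values).
weights = [0.6, 0.4]                          # class sizes pi_k
means = [[0.0, 0.0], [3.0, 3.0]]              # class-specific means mu_k
covs = [np.eye(2), [[1.0, 0.5], [0.5, 2.0]]]  # class-specific Sigma_k
density = mixture_pdf([0.0, 0.0], weights, means, covs)
```

Because each class has its own covariance matrix in this sketch, the second class has both a different homogeneity and a non-zero within-class correlation.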
It will be clear that as the number of indicators and/or the number of latent classes increases, the number of parameters to be estimated increases rapidly, especially the number of free parameters in the variance-covariance matrices, $\Sigma_k$. Therefore, it is not surprising that restrictions which are imposed to obtain more parsimony and stability typically involve constraining the class-specific variance-covariance matrices. An important constrained model is the local independence model, obtained by assuming that all within-cluster covariances are equal to zero or, equivalently, by assuming that the variance-covariance matrices, $\Sigma_k$, are diagonal matrices. Models that are less restrictive than the local independence model can be obtained by fixing some but not all covariances to zero or, equivalently, by assuming certain pairs of $y$'s to be mutually dependent within latent classes. Another interesting type of constraint is the equality or homogeneity of variance-covariance matrices across latent classes, i.e., $\Sigma_k = \Sigma$. Such a homogeneous or class-independent error structure yields clusters having the same forms but different locations. Note that these kinds of equality constraints can be applied in combination with any structure for $\Sigma$. Banfield and Raftery (1993) proposed reparameterizing the class-specific variance-covariance matrices by an eigenvalue decomposition:
$$\Sigma_k = \lambda_k D_k A_k D_k^T .$$
The parameter $\lambda_k$ is a scalar, $D_k$ is a matrix with eigenvectors, and $A_k$ is a diagonal matrix whose elements are proportional to the eigenvalues of $\Sigma_k$. More precisely, $\lambda_k = |\Sigma_k|^{1/d}$, where $d$ is the number of observed variables, and $A_k$ is scaled such that $|A_k| = 1$. A nice feature of the above decomposition is that each of the three sets of parameters has a geometrical interpretation: $\lambda_k$ indicates what can be called the volume of cluster $k$, $D_k$ its orientation, and $A_k$ its shape. If we think of a cluster as a clutter of points in a multidimensional space, the volume is the size of the clutter, while the orientation and shape parameters indicate whether the clutter is spherical or ellipsoidal. Thus, restrictions imposed on these matrices can directly be interpreted in terms of the geometrical form of the clusters. Typically, matrices are assumed to be class-independent and/or simpler structures (diagonal or identity) are used for certain matrices. See Bensmail et al. (1997) and Fraley and Raftery (1998b) for overviews of the many possible specifications.
Rather than by a restricted eigenvalue decomposition, the structure of the $\Sigma_k$ matrices can also be simplified by means of a covariance-structure model. Several authors have proposed using latent class models for dealing with unobserved heterogeneity in covariance-structure analysis (Arminger and Stein 1997; Dolan and Van der Maas 1997; Jedidi et al. 1997). The same methodology can be used to restrict the error structure in LC cluster analysis with continuous indicators. An interesting structure for $\Sigma_k$ that is related to the eigenvalue decomposition described above is a factor analytic model (Yung 1997; McLachlan and Peel 1998); that is,
$$\Sigma_k = \Lambda_k \Phi_k \Lambda_k^T + U_k . \qquad (1)$$
Here, $\Lambda_k$ is a matrix with factor loadings, $\Phi_k$ is the variance-covariance matrix of the factors, and $U_k$ is a diagonal matrix with unique variances. Restricted versions can be obtained by limiting the number of factors (for instance, to one) and/or fixing some factor loadings to zero. Such specifications make it possible to describe the correlations between the $y$ variables within clusters or, equivalently, the structure of local dependencies, by means of a small number of parameters.
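The geometrical reading of the eigenvalue decomposition of Banfield and Raftery (1993) can be made concrete with a small numerical sketch. The Python code below, which is an illustration and not the implementation used in any of the cited programs, splits a covariance matrix into a volume scalar, an orientation matrix, and a shape matrix with unit determinant, and then rebuilds the original matrix. The function name, the ordering of the eigenvalues, and the example matrix are assumptions made for this illustration.

```python
import numpy as np

def decompose_covariance(sigma):
    """Split Sigma into volume (lambda), orientation (D), and shape (A)."""
    sigma = np.asarray(sigma, dtype=float)
    d = sigma.shape[0]
    eigvals, eigvecs = np.linalg.eigh(sigma)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # largest first (a convention chosen here)
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    volume = np.prod(eigvals) ** (1.0 / d)      # lambda = |Sigma|^(1/d)
    shape = np.diag(eigvals / volume)           # scaled so that |A| = 1
    return volume, eigvecs, shape

# Hypothetical 2 x 2 class-specific covariance matrix.
sigma = np.array([[4.0, 1.5], [1.5, 1.0]])
lam, D, A = decompose_covariance(sigma)
rebuilt = lam * D @ A @ D.T                     # lambda * D * A * D^T
```

Fixing `lam`, `D`, or `A` to be equal across classes, or replacing `A` by an identity matrix, corresponds to the kinds of restrictions on volume, orientation, and shape described above.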
$$f(y_i \mid \theta) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{J} f_k(y_{ij} \mid \theta_{jk}) , \qquad (2)$$
where $J$ denotes the total number of indicators and $j$ a particular indicator. Rather than specifying the joint distribution of $y_i$ given class membership using a single multivariate distribution, we now have to specify the appropriate univariate distribution function for each element $y_{ij}$ of $y_i$. Possible choices for continuous $y_{ij}$ are univariate normal, Student, gamma, and log-normal distributions. A natural choice for discrete nominal or ordinal variables is the (restricted) multinomial distribution. Suitable distributions for counts are, for instance, Poisson, binomial, or negative binomial.
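To make the structure of equation (2) concrete, the following Python sketch evaluates the mixture density for one object with three indicators of different scale types: one continuous (normal), one nominal (multinomial), and one count (Poisson). The choice of these three distributions, the dictionary layout, and all numerical values are assumptions made for illustration only.

```python
import math

def normal_logpdf(y, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)

def poisson_logpmf(y, rate):
    """Log probability of a Poisson count."""
    return y * math.log(rate) - rate - math.lgamma(y + 1)

def categorical_logpmf(y, probs):
    """Log probability of category y (0-based index) of a nominal variable."""
    return math.log(probs[y])

def class_logdensity(obj, params):
    """log prod_j f_k(y_ij | theta_jk): local independence within one class."""
    return (normal_logpdf(obj["cont"], *params["cont"])
            + categorical_logpmf(obj["nominal"], params["nominal"])
            + poisson_logpmf(obj["count"], params["count"]))

def mixture_density(obj, weights, class_params):
    """Equation (2): sum over classes of pi_k times the product of densities."""
    return sum(w * math.exp(class_logdensity(obj, p))
               for w, p in zip(weights, class_params))

# Hypothetical two-class model; (mean, variance), category probabilities, rate.
class_params = [
    {"cont": (0.0, 1.0), "nominal": [0.2, 0.8], "count": 2.0},
    {"cont": (4.0, 2.0), "nominal": [0.7, 0.3], "count": 6.0},
]
weights = [0.5, 0.5]
obj = {"cont": 0.0, "nominal": 1, "count": 0}
density = mixture_density(obj, weights, class_params)
```

Each indicator contributes one factor to the within-class product, so adding an indicator of another scale type only requires adding its univariate log density to `class_logdensity`.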
In the above specification, we assumed that the $y$'s are conditionally independent within latent classes. This assumption can easily be relaxed by using the appropriate multivariate rather than univariate distributions for sets of locally dependent $y$ variables. It is not necessary to present a separate formula for this situation. We can just think of the index $j$ in equation (2) as denoting a set of indicators rather than a single indicator. For sets of continuous variables, we can again work with a multivariate normal distribution. A set of nominal/ordinal variables can be combined into a (restricted) joint multinomial distribution. Correlated counts could be modeled with a multivariate Poisson model. More difficult is the specification of mixed multivariate distributions. Krzanowski (1983) described two possible ways of modeling the relationship between a nominal/ordinal and a continuous $y$: via a conditional Gaussian or via a conditional multinomial distribution, which means either using the categorical variable as a covariate in the normal model or the continuous one as a covariate in the multinomial model. Lawrence and Krzanowski (1996) and Hunt and Jorgensen (1999) used the conditional Gaussian distribution in LC clustering with combinations of categorical and continuous variables. Local dependencies with a Poisson variable could be dealt with in the same way, i.e., by allowing its mean to depend on the relevant continuous or categorical variable(s).
The possibility to include local dependencies between indicators is very important when using LC analysis as a clustering tool. First, it prevents one from ending up with a solution that contains too many clusters. Often, a simpler solution with fewer clusters is obtained by including a few direct effects between $y$ variables. It should be stressed that there is also a risk in allowing for within-cluster associations: direct effects may hide relevant clusters.
A second reason for relaxing the local independence assumption is that it may yield a better classification of objects into clusters. Saying that two variables are locally dependent is conceptually the same as saying that they contain some overlapping information that should not be used when determining to which class an object belongs. Consequently, if we omit a significant bivariate dependency from an LC cluster model, the corresponding locally dependent indicators get too high a weight in the classification formula (see equation (3)) compared to the other indicators.
COVARIATES
The LC cluster modeling approach described above is quite general: it deals with mixed-mode data and it allows for many different specifications of the (correlated) error structure. An important extension of this model is the inclusion of covariates to predict class membership. Conceptually, it makes a great deal of sense to distinguish (endogenous) variables that serve as indicators of the latent variable from (exogenous) variables that are used to predict to which cluster an object belongs. This idea is, in fact, the same as in Clogg's (1981) LCM with external variables. Note that in certain situations we may want to use the latent cluster variable as a predictor of an observed response variable rather than as a dependent variable. For such situations, we do not need special arrangements like the ones needed with covariates. A model in which the cluster variable serves as predictor can be obtained by using the response variable as one of the $y$ variables.
Using the same basic structure as in equation (2), this yields the following LC cluster model:
$$f(y_i \mid z_i, \theta) = \sum_{k=1}^{K} \pi_{k \mid z_i} \prod_{j=1}^{J} f_k(y_{ij} \mid \theta_{jk}) .$$
Here, $z_i$ denotes object $i$'s covariate values. Alternative terms for the $z$'s are concomitant variables, grouping variables, external variables, exogenous variables, and inputs. To reduce the number of parameters, the probability of belonging to class $k$ given covariate values $z_i$, $\pi_{k \mid z_i}$, will generally be restricted by a multinomial logit model; that is, a logit model with linear effects and no higher-order interactions. An even more general specification is obtained by allowing covariates to have direct effects on the indicators, which yields
$$f(y_i \mid z_i, \theta) = \sum_{k=1}^{K} \pi_{k \mid z_i} \prod_{j=1}^{J} f_k(y_{ij} \mid z_i, \theta_{jk}) .$$
The conditional mean of the $y$ variables can now be directly related to the covariates. This makes it possible to relax the implicit assumption in the previous specification that the influence of the $z$'s on the $y$'s goes completely via the latent variable. For an example, see Vermunt and Magidson (2000: 155). The possibility to have direct effects of $z$'s on $y$'s can also be used to specify direct effects between indicators of different scale types by means of a simple trick: one of the two variables involved should be used both as covariate (not influencing class membership) and as indicator. We will use this trick below in our second example.
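The multinomial logit restriction on the class-membership probabilities $\pi_{k \mid z_i}$ can be sketched as follows. This is a minimal Python illustration under stated assumptions: each class has an intercept and one linear effect per covariate, and the coefficients of the first class are fixed to zero for identification. That identification constraint, the function name, and the coefficient values are choices made here and are not taken from the paper.

```python
import numpy as np

def class_probabilities(z, intercepts, slopes):
    """pi_{k|z}: multinomial logit with linear covariate effects only."""
    z = np.asarray(z, dtype=float)
    intercepts = np.asarray(intercepts, dtype=float)   # shape (K,)
    slopes = np.asarray(slopes, dtype=float)           # shape (K, number of covariates)
    eta = intercepts + slopes @ z                      # one linear predictor per class
    eta -= eta.max()                                   # guard against overflow in exp
    expo = np.exp(eta)
    return expo / expo.sum()

# Hypothetical three-class model with two covariates; class 1 is the reference.
intercepts = [0.0, 0.5, -1.0]
slopes = [[0.0, 0.0], [1.0, -0.5], [0.2, 0.3]]
probs = class_probabilities([1.0, 2.0], intercepts, slopes)
```

With $K$ classes and $P$ covariates this parameterization uses $(K-1)(P+1)$ free coefficients instead of a separate set of class sizes for every covariate pattern.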
ESTIMATION
The two main methods to estimate the parameters of the various types of LC cluster models are maximum likelihood (ML) and maximum posterior (MAP). Wallace and Dowe (forthcoming) proposed a minimum message length (MML) estimator, which in most situations is similar to MAP. The log-likelihood function required in ML and MAP approaches can be derived from the probability density function defining the model. Bayesian MAP estimation involves maximizing the log-posterior distribution, which is the sum of the log-likelihood function and the logs of the priors for the parameters. Although generally there is not much difference between ML and MAP estimates, an important advantage of the latter method is that it prevents the occurrence of boundary or terminal solutions: probabilities and variances cannot become zero. With a very small amount of prior information, the parameter estimates are forced to stay within the interior of the parameter space. Typical priors are Dirichlet priors for multinomial probabilities and inverted-Wishart priors for the variance-covariance matrices in multivariate normal models. For more details on these priors, see Vermunt and Magidson (2000: 164-165).
Most software packages use the EM algorithm or some modification of it to find the ML or MAP estimates. In our opinion, the ideal algorithm starts with a number of EM iterations and, when close enough to the final solution, switches to Newton-Raphson. This is a way to combine the advantages of both algorithms, that is, the stability of EM even when far away from the optimum and the speed of Newton-Raphson when close to the optimum.
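The EM iterations mentioned above can be sketched for the simplest continuous case, a mixture of normals with class-dependent diagonal variance-covariance matrices (local independence). The Python code below is a bare ML illustration only: it does not include the switch to Newton-Raphson, and the small variance floor is an ad hoc numerical guard assumed here, not the Dirichlet or inverted-Wishart priors of MAP estimation. The function name, the simulated data, and the random starting values are all hypothetical.

```python
import numpy as np

def em_diagonal_normal_mixture(y, n_classes, n_iter=200, seed=0):
    """Plain EM for a normal mixture with class-dependent diagonal covariances."""
    rng = np.random.default_rng(seed)
    n, d = y.shape
    weights = np.full(n_classes, 1.0 / n_classes)
    means = y[rng.choice(n, size=n_classes, replace=False)]   # random start
    variances = np.tile(y.var(axis=0), (n_classes, 1))
    loglik_trace = []
    for _ in range(n_iter):
        # E step: posterior class-membership probabilities for every object.
        log_dens = -0.5 * (np.log(2.0 * np.pi * variances)[None, :, :]
                           + (y[:, None, :] - means[None, :, :]) ** 2
                           / variances[None, :, :]).sum(axis=2)
        log_joint = np.log(weights)[None, :] + log_dens
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        post = np.exp(log_joint - log_norm[:, None])
        loglik_trace.append(log_norm.sum())
        # M step: weighted updates of class sizes, means, and variances.
        nk = post.sum(axis=0)
        weights = nk / n
        means = (post.T @ y) / nk[:, None]
        variances = (post.T @ y ** 2) / nk[:, None] - means ** 2
        variances = np.maximum(variances, 1e-6)   # ad hoc guard, not a MAP prior
    return weights, means, variances, np.array(loglik_trace)

# Simulated two-class data with two indicators (illustrative only).
rng_demo = np.random.default_rng(1)
y_demo = np.vstack([rng_demo.normal(0.0, 1.0, (100, 2)),
                    rng_demo.normal(5.0, 1.0, (100, 2))])
w, m, v, trace = em_diagonal_normal_mixture(y_demo, n_classes=2, n_iter=50)
```

The recorded log-likelihood values should not decrease from one iteration to the next, which reflects the stability of EM referred to above. Running the function with several values of `seed` is one simple way to use multiple sets of starting values.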
A well-known problem in LC analysis is the occurrence of local solutions. The best way to prevent ending with a local solution is to use multiple sets of starting values. Some computer programs for LC clustering have automated the search for good starting values using several sets of random starting values, as well as solutions obtained with other cluster methods.
In the application of LC analysis to clustering, we are not only interested in the estimation of the model parameters. Another important estimation problem is the classification of objects into clusters. This can be based on the posterior class membership probabilities
$$\pi_{k \mid y_i, z_i, \theta} = \frac{\pi_{k \mid z_i} \prod_j f_k(y_{ij} \mid z_i, \theta_{jk})}{\sum_{k'} \pi_{k' \mid z_i} \prod_j f_{k'}(y_{ij} \mid z_i, \theta_{jk'})} . \qquad (3)$$
The standard classification method is modal allocation, which amounts to assigning each object to the class with the highest posterior probability.
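The posterior probabilities of equation (3) and modal allocation can be sketched in a few lines of Python. The sketch assumes that the prior class probabilities and the class-specific log densities of each object have already been computed; the function names and the numbers in the example are hypothetical.

```python
import numpy as np

def posterior_probabilities(prior, log_dens):
    """Equation (3): rows are objects, columns are classes."""
    log_joint = np.log(prior) + log_dens
    log_joint -= log_joint.max(axis=1, keepdims=True)   # numerical stability
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=1, keepdims=True)

def modal_allocation(post):
    """Assign each object to the class with the highest posterior probability."""
    return post.argmax(axis=1)

# Hypothetical values for two objects and two classes.
prior = np.array([[0.5, 0.5], [0.5, 0.5]])               # pi_{k|z_i}
log_dens = np.log(np.array([[0.2, 0.1], [0.01, 0.3]]))   # log prod_j f_k(y_ij | ...)
post = posterior_probabilities(prior, log_dens)
labels = modal_allocation(post)
```

Because the posterior probabilities depend only on the estimated parameters and an object's observed scores, the same two functions can be applied to objects that were not part of the estimation sample.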
MODEL SELECTION
The model selection issue is one of the main research topics in LC clustering. Actually, there are two issues: the first one concerns the decision about the number of clusters, the second one concerns the form of the model given the number of clusters. For an overview of this topic, see Celeux et al. (1997).
Assumptions with respect to the forms of the clusters given their number can be tested using standard likelihood-ratio tests between nested models, for instance, between a model with an unrestricted covariance matrix and a model with a restricted covariance matrix. Wald tests and Lagrange multiplier tests can be used to assess the significance of certain included or excluded terms, respectively. It is well known that these kinds of chi-squared tests cannot be used to determine the number of clusters.
The most popular set of model selection tools in LC cluster analysis are information criteria like AIC, BIC, and CAIC (Fraley and Raftery 1998b). The most recent development is the use of computationally intensive techniques like parametric bootstrapping (McLachlan et al. 1999) and Markov Chain Monte Carlo methods (Bensmail et al. 1997) to determine the number of clusters and their forms. Cheeseman and Stutz (1995) proposed a fully automated model selection method using approximate Bayes factors (different from BIC).
Another set of methods for evaluating LC cluster models is based on the uncertainty of classification or, equivalently, the separation of the clusters. Besides the estimated total number of misclassifications, Goodman-Kruskal lambda, Goodman-Kruskal tau, or entropy-based measures can be used to indicate how well the indicators predict class membership. Celeux et al. (1997) described various indices that combine information on model fit and information on classification errors; two of them are the classification likelihood (C) and the approximate weight of evidence (AWE).
was extensively used in the analyses described below is the possibility to add local dependencies using information on bivariate residuals. Model selection was based on BIC, where it should be noted that the BIC we use is computed using the log-likelihood value and the number of parameters rather than using the $L^2$ value and the number of degrees of freedom.
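A BIC based on the log-likelihood and the number of parameters can be computed as in the following Python sketch. The formulas are the usual textbook definitions of AIC, BIC, and CAIC, which we assume here because the text does not spell them out; the function name and the input values are hypothetical and do not correspond to any model reported in the tables.

```python
import math

def information_criteria(loglik, n_params, n_obs):
    """AIC, BIC, and CAIC computed from the log-likelihood (lower is better)."""
    return {
        "AIC": -2.0 * loglik + 2.0 * n_params,
        "BIC": -2.0 * loglik + math.log(n_obs) * n_params,
        "CAIC": -2.0 * loglik + (math.log(n_obs) + 1.0) * n_params,
    }

# Hypothetical log-likelihood, parameter count, and sample size.
criteria = information_criteria(loglik=-2300.0, n_params=23, n_obs=145)
```

Under these definitions CAIC exceeds BIC by exactly the number of parameters, so the two criteria penalize model complexity in a very similar way.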
Diabetes data
The first empirical example concerns a three-dimensional data set involving 145 observations used for diabetes diagnosis (Reaven and Miller 1979). The three continuous variables are labeled glucose ($y_1$), insulin ($y_2$), and sspg ($y_3$). The data set also contains information on the clinical classification in three groups (normal, chemical diabetes, and overt diabetes), which makes it possible to compare the clinical classification with the classification obtained from the cluster model. The substantive question of interest is whether the three indirect diagnostic measures yield a reliable diagnosis; that is, whether they yield a classification that is close to the clinical classification.
This data set comes with the MCLUST program and is also used by Fraley and Raftery (1998a, 1998b) to illustrate their model-based cluster analysis based on the eigenvalue decomposition described above. The final model they selected on the basis of the BIC criterion was the unrestricted three-class model, which means that none of the restrictions that can be specified with their approach holds for this data set.
We used six different specifications for the variance-covariance matrices: class-dependent and class-independent unrestricted, class-dependent and class-independent diagonal, as well as class-dependent and class-independent with only the $y_1$-$y_2$ error covariance free. With unrestricted we mean that all covariances are free and with diagonal that all covariances are assumed to be zero. The models with only the $y_1$-$y_2$ error covariance free were used because the bivariate residuals of both diagonal models indicated that there was only a local dependency between these two variables. Moreover, the results from the unrestricted models indicated that the $y_1$-$y_3$ and $y_2$-$y_3$ covariances did not differ significantly from zero.
[INSERT TABLE 1 ABOUT HERE]
Table 1 reports the BIC values for the estimated one- to five-class models.
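To see how much the six error structures differ in parsimony, one can count free parameters under the usual parameterization: $K - 1$ class sizes, $K \times J$ means, and $J$ variances plus the free covariances, either per class or once for all classes. The Python sketch below applies this count to three classes and three indicators. The counting rule and the function name are assumptions made for illustration; the resulting numbers are not copied from Table 1.

```python
def n_free_parameters(n_classes, n_indicators, n_free_cov, class_dependent):
    """Free parameters of a normal LC cluster model with a given error structure."""
    sizes = n_classes - 1                       # class sizes sum to one
    means = n_classes * n_indicators            # class-specific means
    per_matrix = n_indicators + n_free_cov      # variances plus free covariances
    error = per_matrix * (n_classes if class_dependent else 1)
    return sizes + means + error

K, J = 3, 3
structures = {"unrestricted": J * (J - 1) // 2, "y1-y2 only": 1, "diagonal": 0}
counts = {
    (name, dep): n_free_parameters(K, J, n_cov, dep)
    for name, n_cov in structures.items()
    for dep in (True, False)
}
```

Under these assumptions the class-dependent unrestricted three-class model has 29 free parameters, the class-dependent model with only the $y_1$-$y_2$ covariance has 23, and the class-independent diagonal model has 14.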
The three-class model that only includes the error covariance between $y_1$ and $y_2$ and has class-dependent variances and covariances has the lowest BIC value. Its BIC value is slightly lower than that of the class-dependent unrestricted three-class model, Fraley and Raftery's final model for this data set. The BIC values in Table 1 show clearly that models with too restrictive error structures for a particular data set overestimate the number of clusters. Here, this applies to the models with class-independent error variances and the class-dependent diagonal model. Therefore, it is important to be able to work with different types of error structures. Note that the most restrictive model that we used, the model with class-independent diagonal error structure, can be seen as a probabilistic variant of k-means cluster analysis (McLachlan and Basford 1988).
[INSERT TABLE 2 ABOUT HERE]
Table 2 reports the parameter estimates for the three-class model with class-dependent variance-covariance matrices and with only a local dependence between $y_1$ and $y_2$. These parameters are the cluster sizes ($\pi_k$), the cluster-specific means ($\mu_{jk}$), the cluster-specific variances ($\sigma^2_{jk}$), as well as the cluster-specific covariance between $y_1$ and $y_2$ ($\sigma_{12k}$). The overt diabetes group (cluster 3) has much higher means on glucose and insulin and a much lower mean on sspg than the normal group (cluster 1). The chemical diabetes group (cluster 2) has somewhat lower means on glucose and insulin and a much lower mean on sspg than the normal group. The reported error variances show that the overt diabetes cluster is much more heterogeneous with respect to glucose and insulin and much more homogeneous with respect to sspg than the normal cluster. The chemical diabetes group is the most homogeneous cluster on all three measures. The error covariances are somewhat easier to interpret if we transform them to correlations. Their values are .69, .21, and .93 for clusters 1, 2, and 3, respectively. This indicates that in the overt diabetes group there is a very strong association between glucose and insulin, while in the chemical diabetes group this association is very low, and not even significantly different from zero ($\sigma_{12k}/SE_{\sigma_{12k}} = 1.60$). Note that the within-cluster correlation of .93 is very high, which indicates that, in fact, the two measures are equivalent in cluster 3.
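The transformation of an error covariance into a within-cluster correlation divides the covariance by the product of the two standard deviations. A minimal Python sketch follows; the function name and the input numbers are hypothetical and are not the estimates reported in Table 2.

```python
import math

def covariance_to_correlation(cov, var_a, var_b):
    """Correlation implied by a covariance and the two corresponding variances."""
    return cov / math.sqrt(var_a * var_b)

# Hypothetical within-cluster estimates: covariance 3, variances 4 and 9.
rho = covariance_to_correlation(3.0, 4.0, 9.0)
```

Applied per cluster to the estimated $\sigma_{12k}$, $\sigma^2_{1k}$, and $\sigma^2_{2k}$, this puts the local dependence between the two indicators on a common scale across clusters.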
[INSERT TABLE 3 ABOUT HERE]
Not only is the BIC of our final model somewhat better than Fraley and Raftery's, our classification is also more in agreement with the clinical classification: our model misclassifies 13.1 percent of the patients, while the unrestricted model misclassifies 14.5 percent. Table 3 reports the cross-tabulation of the clinical and the LC cluster classification based on the posterior class-membership probabilities. As can be seen, some normal patients are classified as cases with chemical diabetes and vice versa. The other type of error is that some overt diabetes cases are classified as normal.
and 12 indicators. The estimation time increases linearly with the number of cases and, as long as we do not include too many local dependencies, also almost linearly with the number of indicators.
[INSERT TABLE 4 ABOUT HERE]
Table 4 presents the BIC values for the estimated models. As can be seen, the two-class model that includes all four direct relationships has the lowest BIC. Comparison of the various models given a certain number of classes shows that inclusion of the direct relationship between $y_5$ and $y_6$ (the two blood pressure measures) improves the fit in all situations. The other bivariate terms improve the fit in the one-, two-, and three-class models, but not in the four-class model. If we compare the models with different numbers of classes for a given error structure, the four-class model performs best when assuming local independence, the three-class model when including the $y_5$-$y_6$ covariance, and the two-class model when including additional bivariate terms. Thus, if we are willing to include the $y_5$-$y_6$ effect, a model with no more than three classes should be selected. If we are willing to include more direct effects, the two-class model is the preferred one. This shows again that the possibility to work with more local dependencies may yield a simpler final model.
[INSERT TABLE 5 ABOUT HERE]
Table 5 reports the parameter estimates for the two-class model containing all four direct effects. Wald tests for the difference of the means and probabilities between classes indicate that only the mean ages ($\mu_{1k}$) are not significantly different between classes. Cluster 2 turns out to have somewhat higher means on weight ($\mu_{2k}$), blood pressure ($\mu_{5k}$ and $\mu_{6k}$), and serum haemoglobin ($\mu_{8k}$), and lower means on size of tumor ($\mu_{9k}$), index of tumor stage ($\mu_{10k}$), and serum prostatic acid phosphatase ($\mu_{11k}$).
If we look at the nominal indicators, we see a large difference between the two classes in the distribution of bone metastases ($y_{12}$), somewhat smaller differences in performance rating ($y_3$) and cardiovascular disease history ($y_4$), and a very small difference in electrocardiogram code ($y_7$). The direct effects between the indicators are quite strong. They all have a positive sign except for the effect of $y_{12}$ on $y_{11}$.
To investigate the usefulness of the applied technique, Jorgensen and Hunt (1996) and Hunt and Jorgensen (1999) investigated the strength of the relationship between the obtained classification and the outcome of the medical trial. They showed that their two-class solution, which is similar to the two-class model with local dependencies obtained here, predicted the success of the medical treatment very well.
CONCLUSIONS
This paper described the state of the art in the field of cluster analysis using LC models. Two important recent developments are the possibility to use various kinds of meaningful restrictions on the covariance structure in mixtures of multivariate normal distributions and the possibility to work with mixed-mode data. The first example demonstrated the use of different types of specifications for the covariance structure. It showed that too restrictive models may yield too many latent classes. The second example illustrated LC clustering with mixed-mode data using models with and without local dependencies.
REFERENCES
Arminger, G., and Stein, P. 1997. Finite mixture of covariance structure models with regressors: loglikelihood function, distance estimation, fit indices, and a complex example. Sociological Methods and Research 26: 148-182.
Banfield, J.D., and Raftery, A.E. 1993. Model-based Gaussian and non-Gaussian clustering. Biometrics 49: 803-821.
Bensmail, H., Celeux, G., Raftery, A.E., and Robert, C.P. 1997. Inference in model based clustering. Statistics and Computing 7: 1-10.
Byar, D.P., and Green, S.B. 1980. The choice of treatment for cancer patients based on covariate information: Application to prostate cancer. Bulletin of Cancer 67: 477-490.
Böckenholt, U. 1993. A latent class regression approach for the analysis of recurrent choices. British Journal of Mathematical and Statistical Psychology 46: 95-118.
Celeux, G., Biernacki, C., and Govaert, G. 1997. Choosing models in model-based clustering and discriminant analysis. Technical Report. Rhone-Alpes: INRIA.
Cheeseman, P., and Stutz, J. 1995. Bayesian classification (Autoclass): Theory and results. In Advances in knowledge discovery and data mining, edited by U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Menlo Park: The AAAI Press.
Clogg, C.C. 1981. New developments in latent structure analysis. Pp. 215-246, in Factor analysis and measurement in sociological research, edited by D.J. Jackson and E.F. Borgatta. Beverly Hills: Sage Publications.
Clogg, C.C. 1995. Latent class models. Pp. 311-359, in Handbook of statistical modeling for the social and behavioral sciences, edited by G. Arminger, C.C. Clogg, and M.E. Sobel. New York: Plenum Press.
Dolan, C.V., and Van der Maas, H.L.J. 1997. Fitting multivariate normal finite mixtures subject to structural equation modeling. Psychometrika 63: 227-253.
Everitt, B.S. 1988. A finite mixture model for the clustering of mixed-mode data. Statistics and Probability Letters 6: 305-309.
Everitt, B.S. 1993. Cluster analysis. London: Edward Arnold.
Fraley, C., and Raftery, A.E. 1998a. MCLUST: Software for model-based cluster and discriminant analysis. Department of Statistics, University of Washington: Technical Report No. 342.
Fraley, C., and Raftery, A.E. 1998b. How many clusters? Which clustering method? - Answers via model-based cluster analysis. Department of Statistics, University of Washington: Technical Report No. 329.
Gibson, W.A. 1959. Three multivariate models: Factor analysis, latent structure analysis, and latent profile analysis. Psychometrika 24: 229-252.
Goodman, L.A. 1974. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61: 215-231.
Hunt, L., and Jorgensen, M. 1999. Mixture model clustering using the MULTIMIX program. Australian and New Zealand Journal of Statistics 41: 153-172.
Jedidi, K., Jagpal, H.S., and DeSarbo, W.S. 1997. Finite-mixture structural equation models for response-based segmentation and unobserved heterogeneity. Marketing Science 16: 39-59.
Jorgensen, M., and Hunt, L. 1996. Mixture model clustering of data sets with categorical and continuous variables. Pp. 375-384, in Proceedings of the Conference ISIS 96, Australia 1996.
Kaufman, L., and Rousseeuw, P.J. 1990. Finding groups in data: An introduction to cluster analysis. New York: John Wiley and Sons, Inc.
Krzanowski, W.J. 1983. Distance between populations using mixed continuous and categorical variables. Biometrika 70: 235-243.
Lawrence, C.J., and Krzanowski, W.J. 1996. Mixture separation for mixed-mode data. Statistics and Computing 6: 85-92.
Lazarsfeld, P.F., and Henry, N.W. 1968. Latent structure analysis. Boston: Houghton Mifflin.
McLachlan, G.J., and Basford, K.E. 1988. Mixture models: inference and application to clustering. New York: Marcel Dekker.
McLachlan, G.J., and Peel, D. 1996. An algorithm for unsupervised learning via normal mixture models. Pp. 354-363, in Information, statistics and induction in science, edited by D.L. Dowe, K.B. Korb, and J.J. Oliver. Singapore: World Scientific Publishing.
McLachlan, G.J., and Peel, D. 1999. Modelling nonlinearity by mixtures of factor analysers via extension of the EM algorithm. Technical Report. Australia: Center for Statistics, University of Queensland.
McLachlan, G.J., Peel, D., Basford, K.E., and Adams, P. 1999. The EMMIX software for the fitting of mixtures of normal and t-components. Journal of Statistical Software 4, No. 2.
Moustaki, I. 1996. A latent trait and a latent class model for mixed observed variables. The British Journal of Mathematical and Statistical Psychology 49: 313-334.
Muthen, B., and Muthen, L. 1998. Mplus: User's manual. Los Angeles: Muthen and Muthen.
Reaven, G.M., and Miller, R.G. 1979. An attempt to define the nature of chemical diabetes using multidimensional analysis. Diabetologia 16: 17-24.
Vermunt, J.K. 1997. LEM: A general program for the analysis of categorical data. User's manual. Tilburg University, The Netherlands.
Vermunt, J.K., and Magidson, J. 2000. Latent GOLD's User's Guide. Boston: Statistical Innovations Inc.
Wallace, C.S., and Dowe, D.L. Forthcoming. MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. Statistics and Computing.
Wedel, M., DeSarbo, W.S., Bult, J.R., and Ramaswamy, V. 1993. A latent class Poisson regression model for heterogeneous count data with an application to direct mail. Journal of Applied Econometrics 8: 397-411.
Wolfe, J.H. 1970. Pattern clustering by multivariate cluster analysis. Multivariate Behavioral Research 5: 329-350.
Yung, Y.F. 1997. Finite mixtures in confirmatory factor-analysis models. Psychometrika 62: 297-330.
Table 1: BIC values for diabetes example

                                                   Number of clusters
Model                                               2       3       4
1. Class-dependent unrestricted Σ_k              4819    4762    4788
2. Class-independent unrestricted Σ_k            5014    4923    4869
3. Class-dependent diagonal Σ_k                  4957    4833    4805
4. Class-independent diagonal Σ_k                5170    4999    4938
5. Class-dependent Σ_k with only σ_12k free      4835    4756    4761
6. Class-independent Σ_k with only σ_12k free    5008    4920    4862
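As a small illustration of how Table 1 can be read, the sketch below transcribes the BIC values and looks up the smallest one. It assumes the usual convention that a lower BIC indicates the preferred model; the helper name best_by_bic is hypothetical and is not part of any of the programs discussed in this paper.

```python
# BIC values transcribed from Table 1.
# Outer keys: model number (1-6); inner keys: number of clusters (2, 3, 4).
bic = {
    1: {2: 4819, 3: 4762, 4: 4788},
    2: {2: 5014, 3: 4923, 4: 4869},
    3: {2: 4957, 3: 4833, 4: 4805},
    4: {2: 5170, 3: 4999, 4: 4938},
    5: {2: 4835, 3: 4756, 4: 4761},
    6: {2: 5008, 3: 4920, 4: 4862},
}


def best_by_bic(table):
    """Return (model, n_clusters, bic) for the cell with the smallest BIC.

    Hypothetical helper; assumes that lower BIC is better.
    """
    cells = (
        (model, k, value)
        for model, row in table.items()
        for k, value in row.items()
    )
    return min(cells, key=lambda cell: cell[2])


model, n_clusters, value = best_by_bic(bic)
print(model, n_clusters, value)  # 5 3 4756
```

Under that convention, the lowest value in the table belongs to model 5 with three clusters.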
Table 2: Parameter estimates for diabetes example

                                     Cluster
               1 = Normal            2 = Chemical           3 = Overt
Parameter    Estimate      S.E.    Estimate      S.E.    Estimate       S.E.
π_k              0.27      0.05        0.54      0.05        0.19       0.03
μ_1k           104.00      2.85       91.23      1.06      234.76      14.87
μ_2k           495.06     22.74      359.22      6.63     1121.09      58.70
μ_3k           309.43     28.06      163.13      6.37       76.98       9.47
σ²_1k          230.09     62.96       76.48     12.93     5005.91    1414.43
σ²_2k        14844.55   3708.65     2669.75    506.55    73551.09   22176.29
σ²_3k        22966.52   5395.90     2421.45    476.65     2224.50     616.43
σ_12k         1279.92    420.93       96.46     60.30    17910.71    5423.37
Table 3: Clinical versus LC cluster classification in diabetes example

Clinical              LC cluster classification
classification    normal   chemical    overt    total
normal                26         10        0       36
chemical               4         72        0       76
overt                  5          0       28       33
total                 35         82       28      145
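The agreement between the two classifications in Table 3 can be checked with a few lines of arithmetic. The sketch below sums the diagonal cells of the cross-classification and divides by the total number of cases; the function name agreement is hypothetical, and the counts are transcribed from the table.

```python
# Counts from Table 3.
# Rows: clinical classification; columns: LC cluster classification.
# Order of categories in both directions: normal, chemical, overt.
table3 = [
    [26, 10, 0],
    [4, 72, 0],
    [5, 0, 28],
]


def agreement(matrix):
    """Return (agreeing cases, total cases, share of agreeing cases)."""
    total = sum(sum(row) for row in matrix)
    agree = sum(matrix[i][i] for i in range(len(matrix)))
    return agree, total, agree / total


agree, total, share = agreement(table3)
print(agree, total, round(share, 3))  # 126 145 0.869
```

With these counts, 126 of the 145 cases (about 87 percent) receive the same label under both classifications.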
Table 4: BIC values for cancer example

                                  Number of clusters
Model                         1        2        3        4
1. Local independence     23762    23112    23089    23088
2. Model 1 + σ_56k        23529    22889    22883    22887
3. Model 2 + σ_28k        23502    22872    22875    22893
4. Model 3 + 8.12         23473    22861    22866    22895
5. Model 4 + 11.12        23322    22845    22855    22888
Table 5: Parameter estimates for prostate cancer example

                  Cluster 1             Cluster 2
Parameter    Estimate     S.E.     Estimate     S.E.
π_k              0.45     0.03         0.55     0.03
μ_1k            71.38     0.51        71.70     0.43
μ_2k            97.51     0.98       100.26     0.83
π_1,3k           0.85     0.02         0.94     0.02
π_2,3k           0.09     0.02         0.05     0.01
π_3,3k           0.05     0.02         0.01     0.01
π_4,3k           0.01     0.01         0.00     0.00
π_1,4k           0.65     0.03         0.49     0.03
π_2,4k           0.35     0.03         0.51     0.03
μ_5k            14.18     0.16        14.54     0.16
μ_6k             8.00     0.09         8.29     0.10
π_1,7k           0.35     0.03         0.33     0.03
π_2,7k           0.05     0.02         0.05     0.01
π_3,7k           0.14     0.02         0.07     0.02
π_4,7k           0.04     0.01         0.06     0.02
π_5,7k           0.30     0.03         0.31     0.03
π_6,7k           0.12     0.02         0.17     0.02
π_7,7k           0.00     0.00         0.00     0.00
μ_8k           128.01     1.38       132.21     1.80
μ_9k             4.11     0.12         2.88     0.08
μ_10k           12.02     0.11         8.88     0.08
μ_11k            4.00     0.12         2.11     0.11
π_1,12k          0.65     0.03         0.99     0.01
π_2,12k          0.35     0.03         0.01     0.01
σ²_1k           52.35     5.36        43.97     4.15
σ²_2k          186.60    19.82       166.73    15.89
σ²_5k            4.98     0.50         6.60     0.59
σ²_6k            1.79     0.18         2.40     0.21
σ²_8k          355.82    35.44       325.52    29.47
σ²_9k            2.91     0.29         1.40     0.14
σ²_10k           2.05     0.21         1.25     0.13
σ²_11k           2.56     0.25         0.25     0.03
σ_28k           61.98    19.14        47.56    15.12
σ_56k            1.82     0.25         2.52     0.30
8.12             5.76     1.35         5.76     1.35
11.12           -0.49     0.11        -0.49     0.11
Table 6

Name           Algorithm    System / source
NORMIX         EM           DOS
EMMIX          EM           DOS + Fortran code
MCLUST         EM           S-plus
LEM            EM + NR      DOS + Windows
Classmix       EM           unknown
Autoclass      EM           DOS + C code
MULTIMIX       EM           Fortran code
Mplus          EM           DOS
Latent GOLD    EM + NR      Windows
SOFTWARE
Several computer programs are available for estimating the various types of LC cluster models discussed in this paper. Table 6 lists the most important packages and gives information on the types of cluster models they implement (multivariate normal distributions and/or mixed-mode data); whether they allow users to include covariates in the model; the estimation method they use; the algorithm (EM or NR = Newton-Raphson) they use; and the system for which an executable version, and/or the type of source code, is available.

[INSERT TABLE 6 ABOUT HERE]

We will not repeat all the information listed in Table 6 but describe the main special features of some of the programs. NORMIX (Wolfe, 1970), EMMIX (McLachlan et al., 1999), and MCLUST (Fraley and Raftery, 1998a) are programs for LC clustering with continuous variables using multivariate normal distributions. Special features of EMMIX are that it uses multiple sets of starting values to avoid local solutions and that it performs likelihood-ratio tests for the number of clusters using parametric bootstrapping. MCLUST allows users to restrict the class-specific variance-covariance matrices using the eigenvalue decomposition described in equation (1).

LEM (Vermunt, 1997) and Classmix (Moustaki, 1996) are LC analysis programs that can be used for clustering with mixed-mode data. LEM can deal not only with (ordinal) categorical and continuous variables, but also with Poisson counts. In LEM, it is possible to include local dependencies between categorical variables.

MULTIMIX (Hunt and Jorgensen, 1999), Mplus (Muthen and Muthen, 1998), Autoclass (Cheeseman and Stutz, 1995), and Latent GOLD (Vermunt and Magidson, 2000) can deal with multivariate normal distributions as well as with mixed-mode data. MULTIMIX allows users to specify local dependencies between categorical and continuous variables using conditional Gaussian distributions. Both Mplus and Latent GOLD are very flexible with respect to the specification of the structure of the error-covariance matrices: any covariance can be included in or excluded from the model. Two weak points of Mplus are that the categorical variables must be dichotomous and that the user has to provide starting values for all parameters. Autoclass automates model selection using multiple sets of starting values (also for the number of classes). Latent GOLD is the only fully Windows-based program, which makes it very easy to use. Like LEM, it can deal not only with (ordinal) categorical and continuous variables, but also with Poisson counts. Its multiple sets of random starting values help users avoid ending up with a local solution, and its bivariate residual measures make it easy to detect local dependencies to be included in the model.
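The role of multiple sets of starting values can be illustrated with a minimal sketch. It is not the implementation of any program listed above: it assumes a single continuous indicator with class-specific means and variances, uses simulated data, and all function names (log_joint, em_gmm_1d, fit_multistart, bic) are hypothetical. Each EM run starts from randomly chosen means, the run with the highest log-likelihood is kept, and BIC is computed as -2 log L plus the number of free parameters times log n.

```python
import numpy as np


def log_joint(y, pi, mu, var):
    """log(pi_k) + log N(y_i | mu_k, var_k) for every case i and class k."""
    log_dens = -0.5 * (np.log(2 * np.pi * var) + (y[:, None] - mu) ** 2 / var)
    return np.log(pi) + log_dens


def em_gmm_1d(y, k, rng, n_iter=500, tol=1e-8):
    """One EM run for a k-class univariate normal mixture (illustrative only)."""
    pi = np.full(k, 1.0 / k)                    # equal starting class sizes
    mu = rng.choice(y, size=k, replace=False)   # random starting means
    var = np.full(k, y.var())                   # common starting variance
    loglik = -np.inf
    for _ in range(n_iter):
        # E-step: posterior class-membership probabilities.
        lj = log_joint(y, pi, mu, var)
        log_mix = np.logaddexp.reduce(lj, axis=1)
        previous, loglik = loglik, log_mix.sum()
        if loglik - previous < tol:
            break
        post = np.exp(lj - log_mix[:, None])
        # M-step: weighted class sizes, means, and variances.
        nk = post.sum(axis=0) + 1e-12           # guard against empty classes
        pi = nk / nk.sum()
        mu = (post * y[:, None]).sum(axis=0) / nk
        var = np.maximum((post * (y[:, None] - mu) ** 2).sum(axis=0) / nk, 1e-6)
    else:
        # Iteration limit reached: evaluate the final parameter values.
        loglik = np.logaddexp.reduce(log_joint(y, pi, mu, var), axis=1).sum()
    return loglik, pi, mu, var


def fit_multistart(y, k, n_starts=10, seed=0):
    """Run EM from several random starts and keep the best log-likelihood."""
    rng = np.random.default_rng(seed)
    runs = [em_gmm_1d(y, k, rng) for _ in range(n_starts)]
    return max(runs, key=lambda run: run[0])


def bic(loglik, k, n):
    """BIC with (k - 1) class sizes, k means, and k variances as parameters."""
    return -2.0 * loglik + (3 * k - 1) * np.log(n)


# Simulated example data: two well-separated classes (an assumption).
data_rng = np.random.default_rng(1)
y = np.concatenate([data_rng.normal(0.0, 1.0, 150), data_rng.normal(6.0, 1.0, 150)])

for k in (1, 2, 3):
    loglik, pi, mu, var = fit_multistart(y, k)
    print(k, round(bic(loglik, k, y.size), 1))
```

Keeping the best of several runs does not guarantee the global maximum, but it lowers the chance of reporting a local solution. The small floors on class sizes and variances are a pragmatic choice of this sketch, not something taken from the programs described here.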
SYMBOLS
K         number of classes or clusters
J         number of indicator variables
i         index to denote a particular case
j         index to denote a particular indicator variable
k         index to denote a particular class or cluster
y         vector of indicator variables
y         value of an indicator variable
z         covariate vector
f(.|.)    density function
π         probability
θ         parameter vector
μ         mean vector
Σ         variance-covariance matrix
σ²_j      variance of variable j
σ_j…      covariance between variables j and …
FURTHER READING
For further reading on cluster analysis by means of latent class or finite mixture models, see McLachlan and Basford (1988) and Everitt (1993).