Cohen Sanborn Shiffrin 2008
Analyzing the data of individuals has several advantages over analyzing the data combined across the individuals (the latter we term group analysis): Grouping can distort the form of data, and different individuals might perform the task using different processes and parameters. These factors notwithstanding, we demonstrate conditions in which group analysis outperforms individual analysis. Such conditions include those in which there are relatively few trials per subject per condition, a situation that sometimes introduces distortions and biases when models are fit and parameters are estimated. We employed a simulation technique in which data were generated from each of two known models, each with parameter variation across simulated individuals. We examined how well the generating model and its competitor each fared in fitting (both sets of) the data, using both individual and group analysis. We examined the accuracy of model selection (the probability that the correct model would be selected by the analysis method). Trials per condition and individuals per experiment were varied systematically. Three pairs of cognitive models were compared: exponential versus power models of forgetting, generalized context versus prototype models of categorization, and the fuzzy logical model of perception versus the linear integration model of information integration. We show that there are situations in which small numbers of trials per condition cause group analysis to outperform individual analysis. Additional tables and figures may be downloaded from the Psychonomic Society Archive of Norms, Stimuli, and Data, www.psychonomic.org/archive.
(Correspondence: A. L. Cohen, acohen@psych.umass.edu)

To determine the processes underlying human performance, researchers often compare the relative fit of competing quantitative models when applied to experimental data. It is sometimes explicit and more often implicit that it is the processes of individuals that are of interest. Nonetheless, many researchers apply models not to data from individuals, but rather to data combined across multiple subjects. The hope is that combining the data will provide a clearer view of the underlying processes by reducing error and/or distortions. Such a group analysis is most common when individuals provide very few data per condition. Of course, a group analysis is based on the assumption that the processes used by the individuals are qualitatively similar and that combining across the individuals will not in itself distort the analysis. This assumption is never fully realized. The processes that individuals use may be similar but differ quantitatively, or they may be dissimilar, partitioning the individuals into groups. In the present research, we examine the trade-offs between some of the factors that might favor individual or group analysis when quantitative models of psychological processes are compared. The effort required to carry out both analyses is hardly greater than that required to carry out either alone, so our intent is not to recommend use of one approach instead of the other. We therefore explore the relative utility of the two approaches under different experimental conditions.

A number of researchers (e.g., R. B. Anderson & Tweney, 1997; Ashby, Maddox, & Lee, 1994; Estes, 1956; Estes & Maddox, 2005; Sidman, 1952; Siegler, 1987) have shown that various forms of bias due to data averaging can and do occur. Hayes (1953) gave a compelling example of such a distortion. Assume that a subject is given successive 30-sec opportunities to open a puzzle box. An individual's data will typically be well represented by a step function—successive failures followed by a run of successes. That is, for each individual, learning is all-or-none. But because subjects tend to learn on different trials, the average proportion correct for the group will rise gradually. An analysis of the group data would lead to the biased inference that learning is gradual rather than all-or-none. Just this sort of bias due to grouping was observed and then corrected through more sophisticated analysis, in studies in the 1950s and 1960s of simple learning tasks (e.g., Bower, 1961).

Although there are classes of models for which grouping across individuals does not cause distortions (see, e.g., Estes, 1956), the data from most influential contemporary models (models that tend to be complex, nonlinear, and interactive) change form when combined over individuals. For this and other reasons, the field has seen a trend toward modeling individual data. In many cases, individual analysis is well justified—for example, when researchers collect a large number of data from each individual for each condition. Individual data analysis not only reduces distortions caused by the potentially large differences across individuals,1 but also allows one to discover groups of individuals using different processes or different parameters of the same process.2

However, group analysis may be warranted in certain situations. For example, there are experimental situations in which it is difficult or impossible to collect much data from an individual for each condition. Such situations occur, for example, when infants or clinical populations are studied. They also occur when the design prevents multiple observations (e.g., studies of inattentional blindness sometimes obtain one observation per subject; Mack & Rock, 1998), when the number of possible stimuli is limited, or when the critical issue is performance on a particular preasymptotic trial during a long learning process. In such data-poor situations, inference based on individual data may well be so error prone that accepting the distortions produced by grouping is a worthwhile trade-off.

The article is organized as follows. We first briefly review a number of model selection techniques and discuss their appropriateness when there are few data points per individual. We then suggest a simple-to-use simulation procedure that allows us to determine which quantitative analysis choices are best for any particular number of trials per condition and individuals per experiment. Extension of the simulation procedure to model selection is discussed, and a number of model selection techniques are used to compare group and individual analyses in a specific example. To make the results as general as possible, this simulation procedure is then applied to pairs of models from three experimental settings: forgetting, categorization, and information integration. An enormous number of simulation results were obtained, producing information overload for any potential reader. We therefore have placed the complete results of the simulations in the Psychonomic Society Archive of Norms, Stimuli, and Data, giving only particularly useful, diagnostic, and representative results in the body of the article, along with a summary of the main findings.

Model Selection Measures

There are many criteria by which one model may be favored over another, including many that are not quantifiable.3 Here, we will focus solely on numerical techniques of model comparison. But even with this restriction, comparing models is a complex affair. For example, selecting a model solely on the basis of its relative goodness of fit (the level of agreement between the model's predictions and the observed data) is often inadequate. One problem is that goodness of fit ignores the relative flexibility or complexity of the two models.

The complexity of a model increases with the number of different data sets it can describe. As a simple example, the cubic model (y = a + bx + cx² + dx³) is more complex than the quadratic model (y = a + bx + cx²), which is, in turn, more complex than the linear model (y = a + bx), on the basis of the number of curves each can fit. In fact, these models are nested: In each instance, the more complex model contains the simpler model as a special case (when d and/or c are set to zero). For nested models, the more complex model can always fit a data set at least as well as the simpler model. Furthermore, if error is added to data produced by a simpler nested model (e.g., the quadratic), the extra terms in the more complex model (e.g., the cubic) will allow the more complex model to fit better than even the simpler model that produced the data.

When used in isolation, the standard goodness-of-fit measures for comparing models, such as sum of squared error, percentage of variance accounted for, or maximum likelihood, do not deal adequately with model complexity. In recent years, a number of sophisticated methods have been developed that do take complexity into account. The normalized maximum likelihood (NML; Barron, Rissanen, & Yu, 1998; Grünwald, 2005; Rissanen, 2001) and Fisher information approximation (FIA; Rissanen, 1996) implementations of minimum description length, Bayesian model selection (BMS; e.g., the Bayes factor; Kass & Raftery, 1995), Bayesian nonparametric model selection (BNPMS; Karabatsos, 2006), various forms of cross-validation (CV; e.g., Berger & Pericchi, 1996; Browne, 2000), and generalization (Busemeyer & Wang, 2000) fall into this category. The difficulty in computing and/or applying the current state-of-the-art methods places these techniques out of the reach of many researchers, so simpler-to-use approximations (which sometimes make unrealistic assumptions about the models) are commonly used. These approximations include the Akaike information criterion (AIC; Akaike, 1973) and the Bayesian information criterion (BIC; Schwarz, 1978). We will consider both categories of methods, although the state-of-the-art methods will be explored only for one special case, since the main target of this article will be the simpler methods that are, or could be, in standard use. Our findings should thereby have considerable practical utility for a large number of practitioners.

In minimum description length, the data set can be considered a message that needs to be encoded for transmission to a receiver. Each model is a method used to encode the message. The message (data and model) with the shortest code is the more parsimonious model and is selected (Grünwald, 2000; Hastie, Tibshirani, & Friedman, 2001; Rissanen, 1978, 1987, 1996). It is not generally possible to determine the shortest code exactly, so it must be approximated. Two methods for approximating this code have achieved recent prominence. The older of these, which we will refer to as FIA (Rissanen, 1996), involves easier computations. Although still perhaps too technical for general use in the field, we will give detailed results for FIA for each of our simulated conditions.

The newer and better-justified method for approximating minimum code length is NML (Grünwald, 2007; Rissanen, 2001; described briefly later). NML requires too many computational resources for us to investigate systematically in all our simulations. A similar computational problem exists for another favored method for model selection known as BMS. BMS is based on the Bayes factor (Jeffreys, 1935, 1961), the ratio of posterior to prior odds. Other computationally intensive methods include various
forms of CV, which represent an empirical attempt to carry out model selection through assessment of predictive accuracy. In CV, model parameters are selected on the basis of a portion of the data (the training data), and prediction accuracy is assessed by employing these parameters to predict the remainder of the data (the testing data). If a model possesses unneeded complexity, it will account for noise in the training data and, therefore, perform poorly on the testing data. The generalization criterion follows a principle similar to that for CV, but instead of selecting some of the data across all conditions for training and testing on the remaining data, training takes place on all of the data from a subset of the conditions, with the remaining conditions used for testing. Our examination of NML, CV, and generalization will be restricted to one special case. (BMS is considered in Appendix A. BNPMS [Karabatsos, 2006] has been proposed too recently for us to assess in this article.)

The AIC and BIC methods are easy to implement and are in common use, so we will report the simulation results for these methods for each of our simulated conditions. The basis for AIC is the idea that the models under consideration are only approximations to the true generating model for a particular set of data (Burnham & Anderson, 2002). AIC selects the model that is closest to the generating model by using the Kullback–Leibler (K–L) distance (Cover & Thomas, 1991), which gives the amount of information lost when the generating model is approximated. It turns out that AIC values are based solely on the number of model parameters, and therefore, AIC fails to take into account such important factors as the correlations among the parameters. BIC is an approximation to the Bayes factor used in BMS and adjusts for both the number of parameters in a model and the number of observations in the data but, like AIC, fails to take into account the correlations among parameters.

Methods such as FIA, NML, BMS, BNPMS, CV, and generalization (and AIC and BIC under certain conditions) have excellent justifications. It is important to note, however, that all of these methods could run into difficulty when applied to data generated from small numbers of trials per condition, exactly the situation in which using group data may be best justified. AIC, for example, was derived as an approximation to the K–L distance between the generating and the candidate models, and it is valid only for large numbers of trials (Burnham & Anderson, 2002). BIC also is a large-sample approximation. For large numbers of trials, BIC approximates the Bayes factor (Kass & Raftery, 1995), but when there are only a few trials per condition, the approximation may not be very accurate. FIA is also a large-sample technique. The behavior of NML, BMS, CV, and generalization with small numbers of observations has not been studied in depth. We now will turn to a model selection procedure that may not suffer from such shortcomings.

The Simulation Procedure

The majority of the present research utilizes the parametric bootstrapping cross-fitting method (PBCM; Wagenmakers, Ratcliff, Gomez, & Iverson, 2004; see also Navarro, Pitt, & Myung, 2004) that is easy to use and allows us to examine model selection and mimicry in a way that applies when the number of data observations is small. Suppose a researcher is interested in inferring which of two models provides a better account of data and needs to decide whether or not to group individuals prior to analysis. To be concrete, suppose that the two models are exponential and power law descriptions of forgetting curves (a case considered in this article). On each trial of a forgetting experiment, a subject studies a list of words and then, after a varying time delay, is asked to recall as many words from the list as possible. This procedure repeats a number of times for each retention interval. The data are the average proportion of words recalled at each retention interval t, P(Recall | t). The exponential model assumes that recall drops off according to

P(Recall | t) = ae^(−bt),  (1)

and, for the power model, recall decreases according to

P(Recall | t) = at^(−b).  (2)

For the PBCM procedure, first assume that the experimental data were generated by one of the models—say, the exponential model.4 Second, an experiment is simulated assuming that the generating model (in this case, the exponential model) produces data for a particular number of individuals and a particular number of trials per condition. To produce variability across individuals, the parameters of the generating model are assumed to differ across individuals. Two methods are used to select parameters to generate data and to select variability in those parameters across individuals. These will be described in detail later, but for now we say only that the informed method selects parameters for each individual by sampling from actual fits of parameters to real-life data sets and that the uninformed method chooses a parameter value from a range of possible values for that model and then introduces variability across individuals by adding random Gaussian noise to that value. To produce variability within an individual, appropriate measurement error is added to the data generated with a particular set of parameters. This error, in general, will depend on both the form of the true model and the number of trials per condition, but in this article, the form of the error will always be binomial. For example, if the exponential model with a fixed set of parameters predicts that the probability of recall after 2.5 sec is .78 and there are five trials per condition, the number of simulated successful recalls for this condition is binomially distributed, with p = .78 and n = 5.

Third, both models being compared (in this case, the exponential and the power models) are fit to the data just produced. For each model, this is done in two ways: (1) by fitting the model parameters to best account for the data from each simulated individual in the experiment and (2) by fitting the model parameters to best account for the data combined over all of the simulated individuals in the experiment.5 We find the parameters that maximize the likelihood of the data, given the parameters. The maximum likelihood provides an appropriate method for combining the fits to individual data into a summary measure across all individuals: The summary measure is the product of the individual likelihoods.6 In an individual analysis, the negative of the log maximum likelihood (−log L) was summed across individuals, but for the group data, a single −log L fit is obtained to the data combined across individuals.

A fit value of zero means that the model perfectly accounts for the data. The higher the fit value is, the greater the discrepancy between the data and the model predictions. The difference between the two negative log maximum likelihoods for the two models indicates the difference in fit for the two models. For example, if the difference is −log L for the power model minus −log L for the exponential model, zero indicates that both models fit the data equally well; for this measure, the exponential and power models fit better when the values are positive and negative, respectively. Because it is difficult to keep track of all the compounded negatives and changes in directions of effects, we will simply indicate in our tables and figures which model is favored by a larger difference score.

To summarize the first three steps: One of the competing models was designated as the generating model; the model was used to simulate experimental data, given a fixed number of trials per condition per individual, a fixed number of individuals in the experiment, and a specific method for varying parameters across individuals; both competing models were then fit to both the individual and the grouped simulated data, using a −log L fit measure. Thus far, the products of the simulations were two difference-of-fit values, one for the individual data and one for the group data. For the same specific combination of factors, we repeated this procedure for 500 experiments. Each of the 500 simulations produced two difference scores that we could tabulate and graph.

Fourth, we repeated the entire process when the other model under consideration was used to generate the data. Thus, first we carried out the procedure above when one of the two models under examination was used to generate the data (the exponential in this example). Then the entire process was repeated when the other model was used to generate the data (in this example, the power model). To this point, there were now two sets of 1,000 data points (difference-of-fit values), 500 for each generating model. The first set was for the individual data, and the second set was for the group data. Fifth, this entire process was repeated for all the parametric variations we considered. All of the simulations were run as described above, with all combinations of 13 levels of individuals per experiment and 9 levels of trials per condition.

The factors above were also crossed with the two methods of selecting parameters to generate data—that is, two methods of selecting parameter variation across individuals. Both of these methods produced parameter ranges across individuals that were smaller than the range of uncertainty concerning the parameters themselves. The idea was that individuals obeying the same model should vary in their parameters, but not wildly around their joint mean. In addition, we did not wish to explore parameters that were very implausible for humans in a given task. With this in mind, we began by obtaining parameter estimates for individuals from the actual studies on which our simulations were based. (In the actual studies, the numbers of data per individual were large enough to make it plausible that these best-fitting parameter values provided sensible and reasonably stable estimates of parameters for that model used by humans in that task.) Let us suppose that a model has several component parameters that, together, make up its specification (e.g., a and b for a model y = ax + b). In the uninformed method, a central value was chosen for each component parameter by drawing from a uniform distribution. The range of values included those from past experimental work. Then parameters were chosen for individuals by adding Gaussian random noise with zero mean and a specified variance to each component central value (the variance was small, in comparison with the entire range).7 The specifics of this sampling procedure will be discussed in Appendix B.

This method does not reflect typical parameter variations between individuals, because component parameter values for any individual are often correlated and not independent. Furthermore, if the level of parameter variation selected for the uninformed method is too low, the differences between individuals may be artificially small, producing little distortion in the data. Thus, we used a second method termed informed: The individual's parameter values were sampled with replacement from those parameter sets actually produced by fitting the model to real data for individuals (the real data were obtained in the study we used as a reference point for the simulations). We do not at present know how much the choice of methods for choosing individual differences will affect the pattern of results, but note that some robustness is suggested by the fact that for each of the applications studied in this article, the results of interest were essentially the same for the two different methods for choosing parameters.

In all, the variations of number of observations, number of individuals, and method of parameter selection and variation produced 117,000 simulations8 that repeated Steps 1–4 above for a pair of compared models (e.g., the exponential and power laws of forgetting).

Sixth, this entire process was repeated for four pairs of models (each case based on some study chosen from the literature): exponential versus power models of forgetting, GCM (two versions) versus prototype models of categorization, and FLMP versus LIM models of information integration (see p. 702 below). In all, 468,000 simulated experiments were carried out.

Use of the Simulations for Model Selection and Mimicry

The previous section described how to generate the PBCM simulation results. This section illustrates how to use the simulation results for model selection. Recall that two sets of 1,000 data points were generated for a given set of experimental factors. Each set consisted of 500 simulations in which the first model (the exponential model) was used to generate the data and 500 simulations in which the second model (the power model) was used to generate the data. The first set was generated using average data, and the second set was generated using individual data.

For the purposes of understanding the results, it is perhaps most helpful to provide a graph giving the histogram
696 Cohen, Sanborn, and Shiffrin
of these 500 difference-of-fit scores when one model generated the data and, on the same graph, the histogram of these 500 difference-of-fit scores when the other model generated the data. This histogram comparison graph is produced for a given method of data analysis (say, group), and then another such graph is produced for the other method of data analysis (individual). It is important to note that, at this point, the simulated results are described only in terms of differences in −log L, a measure that does not account for model complexity.

We will illustrate the results and demonstrate how complexity adjustments are introduced by discussing just one simulated case: the comparison of exponential versus power laws of forgetting, with two observations per retention interval per individual, for 34 individuals, with parameter variation produced by sampling from actual data fits (i.e., the informed method). Note that the number of observations per condition is very low, the situation in which we have argued that group analysis might be warranted. The upper graph in Figure 1 shows the pair of histograms for the case in which data are analyzed by group. The lower graph shows the pair of histograms for the case in which the data are analyzed by individual. The horizontal axis gives the difference in −log L fit for the two models; the white bars give the results when the exponential model generated the data; and the gray bars give the results when the power model generated the data. The difference is arranged so that higher numbers represent better fits for the power model. That is, the gray bars, representing the case in which the true model was the power model, tend to lie further to the right, as they should if the power model generated the data. (Note that the individual and the group histograms have different axes.)

In many ways, the histograms graphed together on one plot invite familiar signal detection analyses (e.g., Macmillan & Creelman, 2005). In particular, to select one of the two models, the researcher needs to determine a decision criterion. Given data from an experiment, if the difference in fit values is above the decision criterion, the power model is selected; otherwise, the exponential model is selected. When the two distributions overlap, as in our example, the accuracy of correct model selection is limited by the degree of overlap (overlap may be thought of as a measure of the extent to which the two models mimic each other). As in signal detection analysis, one may analyze the results to produce a measure of sensitivity that will represent how distinguishable the distributions are from one another, regardless of the placement of the decision criterion. We used the overall probability of correct model selection as a measure and chose the decision criterion that maximized this probability (assuming that both models are equally likely a priori). For each of the histogram-of-difference graphs that we produced, we show this criterion, and label it optimal.9
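The optimal-criterion step just described can be sketched numerically. The following is a minimal illustration, not the authors' code: the two sets of difference-of-fit scores are stand-in Gaussian samples, and the criterion is found by brute force over candidate cutoffs, maximizing the overall probability of correct model selection under equal priors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for PBCM output: difference-of-fit scores (log L Pow - log L Exp
# direction) for 500 simulated experiments under each generating model.
diff_exp_generated = rng.normal(-1.0, 1.5, 500)  # exponential generated the data
diff_pow_generated = rng.normal(1.0, 1.5, 500)   # power generated the data

def optimal_criterion(d_exp, d_pow):
    """Return the cutoff maximizing overall selection accuracy (equal priors).

    Scores above the cutoff select the power model; at or below, the
    exponential model.
    """
    candidates = np.sort(np.concatenate([d_exp, d_pow]))
    best_c, best_acc = candidates[0], -1.0
    for c in candidates:
        acc = 0.5 * np.mean(d_pow > c) + 0.5 * np.mean(d_exp <= c)
        if acc > best_acc:
            best_c, best_acc = c, acc
    return best_c, best_acc

criterion, accuracy = optimal_criterion(diff_exp_generated, diff_pow_generated)
print(criterion, accuracy)
```

With overlapping distributions like these, the achievable accuracy is bounded well below 1, which is the sense in which overlap measures model mimicry.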
[Figure 1 appears here: two panels of paired histograms (Group, top; Individual, bottom), with counts on the vertical axis and log L Pow − log L Exp on the horizontal axis; the legend distinguishes the Exp and Pow histograms and the log L, optimal (Opt), and FIA criterion lines.]
Figure 1. A sample histogram showing the difference of goodness of fit (log likelihood) for the exponential (Exp) and power (Pow) models of forgetting simulated with two trials per condition and 34 individuals per experiment, using informed parameters. The dotted, dashed, and mixed lines are the zero (log L), optimal (Opt), and Fisher information approximation (FIA) criteria, respectively. The FIA criterion for the individual analysis was −38.00 and does not show in the figure. The models are fit to the group and individual data in the top and bottom panels, respectively.
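For readers who want to reproduce the flavor of Figure 1, the generate-and-fit loop behind one simulated experiment can be sketched as follows. This is only an illustration: the retention intervals, generating parameters, and jitter levels are invented for the example, the Gaussian jitter is in the spirit of the uninformed method rather than the informed sampling actually used for Figure 1, and a coarse grid search stands in for proper maximum likelihood optimization.

```python
import numpy as np

rng = np.random.default_rng(1)
delays = np.array([1.0, 2.5, 5.0, 10.0, 20.0])  # hypothetical retention intervals (sec)
n_trials, n_subjects = 2, 34                     # the example case in the text

def p_exp(t, a, b):   # Equation 1: P(Recall | t) = a * exp(-b t)
    return np.clip(a * np.exp(-b * t), 1e-6, 1 - 1e-6)

def p_pow(t, a, b):   # Equation 2: P(Recall | t) = a * t^(-b)
    return np.clip(a * t ** (-b), 1e-6, 1 - 1e-6)

def neg_log_like(model, a, b, successes, n):
    # Binomial -log likelihood (constant binomial coefficients omitted,
    # since they cancel in model comparisons)
    p = model(delays, a, b)
    return -np.sum(successes * np.log(p) + (n - successes) * np.log(1 - p))

def best_fit(model, successes, n):
    # Crude grid search over (a, b); stand-in for an ML optimizer
    grid = [(a, b) for a in np.linspace(0.3, 1.0, 15)
                   for b in np.linspace(0.01, 1.0, 30)]
    return min(neg_log_like(model, a, b, successes, n) for a, b in grid)

# Generate: exponential model with parameters jittered across individuals,
# binomial measurement noise within individuals (counts per retention interval)
data = np.array([rng.binomial(n_trials,
                              p_exp(delays, 0.9 + rng.normal(0, 0.03),
                                            0.12 + rng.normal(0, 0.02)))
                 for _ in range(n_subjects)])

# Individual analysis: sum of per-subject -log L differences.
ind_diff = sum(best_fit(p_pow, s, n_trials) - best_fit(p_exp, s, n_trials)
               for s in data)
# Group analysis: one fit to the counts pooled over all simulated individuals.
grp_diff = (best_fit(p_pow, data.sum(0), n_trials * n_subjects)
            - best_fit(p_exp, data.sum(0), n_trials * n_subjects))
print(ind_diff, grp_diff)
```

Repeating this 500 times per generating model, for both analysis methods, yields the two pairs of histograms plotted in Figure 1.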
The upper graph in Figure 1 illustrates these points for the analysis of group data. Note that, in this case, the optimal decision criterion is close to zero difference. The group analysis histograms overlap quite a lot, limiting accuracy of model selection (perhaps not surprising with only two observations per individual per condition). The lower histograms illustrate the situation in which individual analyses are carried out. These histograms overlap even more than those for group analysis, showing that optimal criterion placement leads to superior model selection for the group analysis: Indeed, using the optimal decision criterion, the correct generating model was selected on 70% of the simulations when group analysis was used and on only 56% of the simulations when individual analysis was used. (Nor is this result exclusive to our method; we will show below that group analysis is also superior for AIC, BIC, FIA, NML, and other methods). These results show, for group and individual analyses, the best possible accuracy of model selection (defined as the probability of selecting the actual generating model) that could be achieved on the basis of the fitting of maximum likelihood estimates.

One might wonder why the group method is superior, especially given previous demonstrations that a mixture of exponentials with different parameters can produce data very similar to those produced by a power law (e.g., R. B. Anderson & Tweney, 1997; Myung, Kim, & Pitt, 2000): This result should bias group analysis to favor the power law. This bias was, in fact, observed in our simulations: Almost all of both individual analysis histograms (lower panel) are to the left of zero difference, showing that the exponential model fit both its own data and the data generated from the power model better. However, the group analysis histograms (top panel) have shifted considerably to the right, showing that group analysis produces a bias favoring the power law. Thus, the superiority of the group analysis occurs despite the bias caused by grouping: The group histograms may have shifted to the right, but they also have separated from each other, producing better performance.

We have identified at least two of the factors that make group analysis superior in the present situation. First is the bias to treat all of the data as coming from 1 subject when the data are noisy as a result of very few observations per condition. The individual analysis assumes (as a result of our combination rule) that the individuals are independent of each other. Our simulation produced subjects that tended to cluster in a region of the parameter space and, as a result, were somewhat dependent. With very noisy data, each individual estimate is poor, and the individual analysis does not take advantage of this dependence between [...] this bias than is individual analysis. Whatever biases exist for or against group analyses and for or against individual analyses, they are, of course, present in the methods used to produce the histogram plots, so that the PBCM method of choosing an optimal criterion to optimize model selection accuracy does not get rid of these biases but produces the best answer that can be achieved in the face of these biases.

It is also unclear how the between-subjects parameter variability will affect the results. For all the simulations in this article, the level of individual variability for the uninformed parameters was fixed. That is, after the mean parameter value was determined, the parameters from each individual were selected from a distribution with a fixed standard deviation. It is a logical step, however, to assume that the advantage of the group analysis will disappear when the dependence between subjects is reduced. In particular, the greater the standard deviation between subjects, the more the distribution over subjects will resemble a uniform distribution, resulting in independent subjects as assumed by the individual analysis. Pilot simulations have shown, however, that increasing or decreasing individual variability produces results that are qualitatively (and often quantitatively) very similar.

Model Selection by Fit Adjusted for Complexity—AIC, BIC, FIA, and NML

AIC, BIC, FIA, and NML select between models by comparing the log-likelihood fit difference with a criterion that adjusts for differences in model complexity. It turns out that the complexity adjustments of these model selection techniques produce a decision criterion that can be directly compared with the PBCM decision criterion. As we will describe shortly, in our example, when there are few data, AIC, BIC, and FIA do not produce a criterion adjustment that approximates the optimal criterion placement as determined by PBCM simulation. The distortion is quite severe, greatly lowering the potential accuracy of model selection and limiting any inference the investigator might draw concerning whether it is better to analyze data by individual or group. Because AIC and BIC base their complexity adjustments on the number of model parameters, which are equal for the exponential and power models, the adjustment leaves the decision criterion at zero. This is true for both group and individual analyses. For the group histograms, the zero criterion is close to optimal. For the individual histograms, however, the criterion of zero lies almost entirely to the right of both histograms. The exponential model will almost always be selected, and the probability of cor-
subjects. The group analysis assumes that each subject is rect model selection falls to chance.
the same, so with very noisy data and some subject depen- As was noted earlier, there is bias favoring the exponen-
dency, the bias of the group analysis is more useful than tial law when small amounts of data are analyzed. It might
the flexibility of the individual analysis. Second, there is a be thought that FIA could correct for this bias. For the
separate bias favoring the exponential law when there are group analysis, this criterion slightly misses the optimal
few data: A small number of all-or-none trials often results criterion placement and only mildly lowers model selec-
in zero correct scores for the longer delays. Because the ex- tion performance to 65% correct model selection. For the
ponential distribution has a lower tail than does the power individual analysis, however, the FIA complexity adjust-
law (at the extreme end of the parameter ranges used), the ment overcorrects to an enormous degree, so that both
zero observations produce a better maximum likelihood fit histograms lie far to the right of the criterion. (Indeed,
for the exponential. Group analysis is less likely to show the FIA decision criterion was 238.00 for the individual
698 Cohen, Sanborn, and Shiffrin
analysis, far to the left of any of the simulated data, and so was omitted from the graph.) Now the power model would always be favored, again producing chance performance. Given that the rationale justifying FIA applies for large amounts of data, this failure is perhaps not a complete surprise.

Next, consider NML. NML calculates the maximum likelihood for the observed data and scales this by the sum of the maximum likelihoods over all possible data sets that could have been observed in the experimental setting. Given that it considers all possible data sets of the size specified by the experimental design, NML is, in principle, applicable to small data sets. The results were simple: For both individual analysis and group analysis, the NML criterion was very close to optimal as determined by PBCM (71% and 58% correct model selection for group and individual analyses, respectively). Thus, for individual analysis, NML did much better than AIC, BIC, and FIA. Nonetheless, NML produced results that were the same as those for the optimal PBCM criterion placement: better performance for the group analysis.

Model Selection by Cross-Validation and Generalization

There are many forms of CV, all based on the principle that a model is fit to part of the data and then validated by using the results to predict the remaining data. A model is preferred if it predicts better. Each model was fit to half the data (for individuals, this meant that we fit one of the two observations per delay interval), and then the best-fitting parameters were used to predict the remaining data. For group analysis, a model was preferred if it fit the validation set better. For individual analysis, a model was preferred if the product of likelihoods of the individual validation sets was higher. CV selected the generating model on 69% of the simulations for group data but selected the generating model on only 50% of the simulations (chance) for the individual analysis.

The generalization criterion allows the researcher to choose which conditions should be used for training and which for testing the models under consideration. We chose to train the models on the first three retention intervals and to test using the last two retention intervals. Because the important theoretical difference between the exponential and the power law models lies in the longer retention intervals, the extreme retention intervals were used for testing. Like CV, the model that predicted the testing data better was selected. Generalization performed well for the group analysis (66%) but poorly for the individual analysis (52%). As was found using PBCM, model selection using CV and generalization was better for group data.

Validation of Choice

The comparison of model selection techniques above has equated performance with the accuracy of selecting the generating model. This validation criterion has the merits of being simple and easy to understand. Under the validation criterion of selecting the generating model, the PBCM solution can be considered optimal. However, there are reasons to worry that this validation criterion is not fully adequate, partly on the grounds that, according to some theories, its penalty for complexity may be insufficiently large.

In order to generalize our results, we explore another selection criterion, predictive validation (PV), an empirical method based on predictive accuracy. In this approach, instead of evaluating how well a method selects the generating model, we measure performance by how well a fitted model will predict new data from the same process that generated the original data. This method is a generalization of the method of comparing the recovered parameters with the generating parameters (e.g., Lee & Webb, 2005), but instead of looking at the parameters directly, we examine the probability distribution over all possible data sets that can be produced by a model and a specific set of parameters. Thus, we can compare the generating model and its generating parameters with any candidate model with its best-fitting parameters by examining the correspondence of the probability distributions produced by the two processes. Note that, according to this criterion, it is logically possible that, even if Model A generated the original data, Model B might still be selected because it provides superior predictions of new data generated by Model A. We will present an intuitive overview of the PV technique here; details will be given in Appendix C.

Say we are interested in comparing the exponential and power models of forgetting. To simplify the example, assume for now that the experiment involves only a single subject. (This restriction will be lifted below.) First, a set of parameters is selected for a generating model—say, the exponential model. The manner of parameter selection is identical to that of the informed approach discussed above. Note that, for any one experimental condition, a model with a fixed set of parameters defines a distribution over the possible data. Following the example from above, if the exponential model with a particular parameterization predicts a .78 probability of recall on each of five trials after a 2.5-sec retention interval, the predicted probability of 0, 1, 2, 3, 4, or 5 recalls after 2.5 sec is .00, .01, .06, .23, .40, and .29, respectively. Each experimental condition (in this case, each retention interval) has a comparable distribution. Just as in the PBCM, a simulated data set is produced from the generating model.

Second, the maximum likelihood parameters for these simulated data are found for one of the candidate models—say, the power model. The power model with maximum likelihood parameters will also generate a distribution for each condition. It is unlikely that the predictions of the generating and candidate models will be identical. For any one experimental condition, the differences between the generating distributions and the candidate distributions can be assessed using K–L divergence. K–L divergence produces a distance measure from a generating distribution to a candidate distribution; the smaller the distance, the more similar the candidate and generating distributions will be. To produce an overall measure for an individual, the K–L divergence is found for each condition and summed.

This procedure is then repeated with the exponential model as the candidate model. Because a smaller K–L divergence indicates a greater degree of overlap in the distributions that generate the data, the candidate model with the smaller K–L divergence is the model that predicts new data from the generating model better.

Now consider the more realistic situation with multiple subjects that can be analyzed as individuals or as a group. For the individual analysis, there are separate generating parameters for each individual, so the distributions produced from a candidate model for each individual can be directly compared with that individual's generating model and parameters. Then the total K–L divergence is found by summing across individuals. For the group analysis, there is no set of best-fitting parameters for each subject, so in order to equate this situation with the individual analysis, each subject is assumed to use the best-fitting group parameters. Thus, each individual subject in the generating process is compared with the distribution defined by the best-fitting group parameters to produce a K–L divergence for each subject. The total K–L divergence is again summed over subjects. Using this procedure for the group analysis allows these K–L divergences to be directly compared with the K–L divergences produced by the individual analysis.

For a given generating model, the result of this methodology is four K–L divergences, one for each of the two candidate models crossed with the two types of analysis (group and individual). Because K–L divergence is a measure of the distance between the data produced by the candidate and the generating models, the combination of candidate model and analysis method that produces the smallest K–L divergence from the generating process is preferred. That is, the four K–L divergences are compared, and the smallest value wins. The entire analysis is repeated numerous times with different generating parameters. Then the analysis is repeated with the other model, the power model, as the generating model.

There is a disadvantage, however, to using the results of PV as a standard for determining whether to use group or individual analysis as a particular model selection method. Recall that the PBCM defines success as selection of the generating model. For any data set, individual or group, from a simulated experiment, a model selection method is accurate if it chooses the generating model. In contrast, PV defines success as selection of the model that predicts new data better, regardless of which model generated the data. This goal could create a situation in which it is not clear which model to prefer. For example, it might be that for data from a simulated experiment, the exponential model applied to group data is the best predictor of new data. Using this result as a standard, a model selection technique would be correct if it selects the exponential model. However, although the exponential model applied to group data may be the best overall predictor, the power model may be the better predictor when only individual data are considered. In this case, it is unclear whether it is appropriate to use the exponential model as the standard for analyzing individual data when the group and the individual PV results disagree.

Given these considerations, we decided to focus on two statistics. First, we ask directly whether PV favors group or individual analysis: How often does group PV analysis outperform individual PV analysis? Second, we use PV to at least partially support the generating model as a selection criterion. We ask, On what proportion of simulated experiments will the group or individual PV criterion match the generating model? If these analyses match each other reasonably well, it should increase our confidence in the generating model criterion that we adopted for the majority of simulations in this article. In this article, PV was computed for only a few interesting model comparisons (in particular, for 34 subjects and two trials per condition with the informed generating method).

Table 1
Predictive Validation Results for 34 Subjects and Two Trials per Condition

Generating Model    Comparison Model    Individual    Group
Exponential         Exponential              .000      .928
                    Power                    .000      .072
Power               Exponential              .000      .026
                    Power                    .000      .974
FLMP                FLMP                     .436      .564
                    LIM                      .000      .000
LIM                 FLMP                     .000      .014
                    LIM                      .028      .958
GCM–γ               GCM–γ                    .668      .008
                    Prototype                .324      .000
Prototype           GCM–γ                    .000      .000
                    Prototype               1.000      .000

Note—FLMP, fuzzy logical model of perception; LIM, linear integration model; GCM, generalized context model.

The results from the PV method are presented in Tables 1 and 2. The tables list the results from each of the model pairs compared. For example, the first section of each table gives the exponential and power law comparison. The generating column indicates which model produced the data. For Table 1, each entry in the generating model column indexes the results for four methods of analysis: the outcome of crossing the two comparison models by individual or group analysis. The values are the proportions of simulated experiments for which a particular analysis method resulted in the best PV score. For example, when the exponential model generated the data, fitting the exponential model to group data outperformed the other three methods on 92.8% of the simulated experiments. For the exponential versus power law of forgetting with few data per individual, the PV results were clear: Regardless of the generating model, group analysis was favored.

Table 2 shows how often the PV result matched the generating model criterion. In Table 2, the same raw PV scores as those in Table 1 are used, but the winning method is chosen differently. Instead of the best PV score being chosen from the four competing methods, in Table 2 the individual and group analyses are considered separately. For both the individual and the group analyses, the proportion of matches between the model with the better PV score and the generating model is reported. Table 2 shows that the group analysis matches well with the generating model criterion. Combined with the results of Table 1, showing that the group analysis produced better PV scores than did the individual analysis, the good match between PV and the generating model criterion provides extra support for our use of that criterion.
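The PV machinery described above is easy to sketch in code. Everything below is an illustration under stated assumptions, not the authors' implementation: the per-condition distributions are binomials (as in the five-trial recall example in the text), while the candidate retention curve and the four final scores are made-up placeholder values.

```python
import math
from math import comb

def binom_dist(n, p):
    """Predicted distribution over 0..n recalls for per-trial probability p."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

# The text's worked example: p = .78 recall probability on each of five
# trials; agrees with the .00, .01, .06, .23, .40, .29 values to within rounding.
example = binom_dist(5, 0.78)

def kl(gen, cand, eps=1e-12):
    """K-L divergence from a generating distribution to a candidate one."""
    return sum(g * math.log(g / max(c, eps)) for g, c in zip(gen, cand) if g > 0)

# One K-L divergence per condition (retention interval), then summed.
n_trials = 2
gen_p  = [0.80, 0.55, 0.35, 0.20, 0.10]   # generating curve (illustrative)
cand_p = [0.75, 0.60, 0.40, 0.22, 0.15]   # candidate's fitted curve (illustrative)
total_kl = sum(kl(binom_dist(n_trials, g), binom_dist(n_trials, c))
               for g, c in zip(gen_p, cand_p))

# Four such totals (two candidate models crossed with group/individual
# analysis) are then compared, and the smallest K-L divergence wins:
scores = {"exp-individual": 0.41, "exp-group": 0.12,   # placeholder totals
          "pow-individual": 0.55, "pow-group": 0.30}
winner = min(scores, key=scores.get)
```

The summed K–L divergence plays the role of the "overall measure for an individual" in the text; the `scores` dictionary stands in for the four-way comparison across candidate models and analysis methods.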
[Figure: surface plots of the proportion of incorrect model selections as a function of trials per condition and individuals per experiment, in panel rows labeled Average − Individual, Average, and Individual; vertical axis, Proportion Incorrect.]
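The zero-count bias noted earlier—the exponential's lower tail fits the all-or-none zeros at long delays better than the power law's—can be shown numerically. The curves and parameter values below are invented for the sketch; they are not the ranges used in the simulations.

```python
import math

def binom_loglik(successes, n, p):
    """Binomial log-likelihood of `successes` out of `n` trials at rate p."""
    p = min(max(p, 1e-12), 1 - 1e-12)          # guard against log(0)
    return (math.lgamma(n + 1) - math.lgamma(successes + 1)
            - math.lgamma(n - successes + 1)
            + successes * math.log(p) + (n - successes) * math.log(1 - p))

delays = [1.0, 2.5, 5.0, 10.0, 20.0]           # retention intervals (sec)
counts = [2, 2, 1, 0, 0]                        # correct out of n = 2 trials
n = 2

# Hypothetical curves: the exponential p(t) = exp(-0.4 t) has a much lower
# tail at long delays than the power law p(t) = (1 + t) ** -1.
p_exp = [math.exp(-0.4 * t) for t in delays]
p_pow = [(1.0 + t) ** -1.0 for t in delays]

ll_exp = sum(binom_loglik(k, n, p) for k, p in zip(counts, p_exp))
ll_pow = sum(binom_loglik(k, n, p) for k, p in zip(counts, p_pow))
# At the zero-count delays the exponential's smaller predicted p gives a
# larger log(1 - p) contribution, pulling the total likelihood its way.
```

With these particular values the exponential wins the overall likelihood comparison, driven by the zero-count conditions, which mirrors the bias described in the text.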
class of stimuli is constructed by factorially combining the different levels of two or more stimulus dimensions (e.g., visual and auditory presentation of a person speaking one of a range of phonemes). The second class consists of each level of each stimulus dimension in isolation (e.g., an auditory presentation of a phoneme without the visual presentation). On each trial, the individual is asked to assign the stimulus to a response class. The data are the proportions of trials for which each stimulus was assigned to a response class.

The present work is based on an experimental design from Massaro (1998). There are two stimulus dimensions: a five-step continuum between the phonemes /ba/ and /da/ for visual (i.e., lip and jaw movement) and auditory speech. As was described above, the 5 stimuli from each of the two dimensions were presented both in isolation (5 auditory and 5 visual stimuli) and factorially combined (25 audiovisual stimuli). On each trial, the subject was asked to categorize the stimulus as /ba/ or /da/. The data are the proportion of times the individual responded /da/, p(/da/). A condition is one of the 35 stimuli.

The two models of information integration under consideration are the fuzzy logical model of perception (FLMP; Oden & Massaro, 1978) and the linear integration model (LIM; N. H. Anderson, 1981). Let Ai and Vj be the ith level of the auditory dimension and the jth level of the visual dimension, respectively. The psychological evaluations of Ai and Vj, ai and vj, respectively, are assumed to lie between 0, a definite mismatch, and 1, a definite match. Whereas the FLMP assumes that the probability of a /da/ response is given by a Bayesian combination of ai and vj,

p(/da/ | Ai, Vj) = ai vj / [ai vj + (1 − ai)(1 − vj)],  (3)

the LIM assumes that these two sources of information are averaged,

p(/da/ | Ai, Vj) = (ai + vj) / 2.  (4)

Details of the uninformed and informed parameter sampling methods are given in Appendix B. Again, because the data of interest when uninformed and informed parameters were used were very similar, only the results using the uninformed parameters are discussed.

For individual data, in general, model mimicry is low; both models favor their own data, and the optimal criterion is well approximated by a zero criterion. The exception to this pattern is at very low trials per condition, where much of the LIM data are better accounted for by the FLMP. The FLMP advantage in this situation is due to its greater flexibility in fitting extreme data. For the FLMP, setting either parameter to one or zero produces a response probability of one or zero, but for the LIM, both parameters need to be set to extreme values to produce extreme response probabilities. For all-or-none data with few trials per condition, the FLMP is much better at fitting the resulting data. When average data are used, the advantage of the FLMP with low trials all but disappears.

The proportion-incorrect summary graphs for uninformed parameters are displayed in Figure 3. The left and right columns of Figure 3 represent the proportion of incorrect model selections when the optimal criterion and the zero criterion (straightforward log likelihood), respectively, are used. The model selection errors generated by the greater complexity of the FLMP when low trials per condition are used are reflected in the zero-criterion, individual-data graph. If a zero criterion is used, model mimicry is greatly reduced at low trials per condition by using average data. Utilizing the optimal cutoff greatly reduces the errors associated with using individual data at low trials per condition and, indeed, produces a slight advantage for using individual data at very low trials per condition and individuals per experiment (where model selection performance is particularly bad). For moderate to high trials per condition and individuals per experiment, however, model mimicry is similar (and very low) when either individual or average data are used.

PV was used only for the case with two observations per individual and 34 subjects. In Table 1, the group analysis showed an advantage over the individual analysis in predicting new data from the same set of simulated subjects, giving more evidence that group analysis is useful when there are few trials per condition. In addition, there was nearly perfect agreement between the models chosen by PV for both group and individual analysis and the generating models (see Table 2). This agreement lends more weight to using recovery of the generating model as a measure of model selection accuracy.

In summary, the use of both average and individual data produces good model selection results, with the following exceptions. When the zero criterion is used, average data should be used for low trials per condition. When the optimal criterion is used, individual data should be used for low trials per condition and individuals per experiment.

Models of Categorization: Generalized Context Model and Prototype Model

In a typical categorization task, an individual is asked to classify a series of stimuli that vary along multiple feature dimensions. The individual receives corrective feedback after each trial. Two categorization models are considered here: the generalized context model (GCM; Nosofsky, 1986) and the prototype model (Reed, 1972). Both models assume that the stimuli are represented by points in a psychological M-dimensional space. Let xim be the psychological value of stimulus i on dimension m. Then, the psychological distance between stimuli i and j is

dij = Σm∈M wm(xim − xjm)^2,  (5)

where the wm are parameters representing the attention weight given to dimension m. Each wm ≥ 0, and the wm sum to 1. The similarity between stimuli i and j is an exponentially decreasing function of distance in the space (Shepard, 1987),

sij = exp(−c dij),  (6)

where c is an overall sensitivity parameter. In a two-category experiment, the GCM assumes that the probability that stimulus i is classified into Category A is given by the summed similarity of stimulus i to all stored exemplars of Category A divided by the summed similarity of stimulus i to all stored exemplars from both Categories A and B,

P(A | i) = Σa∈A sia / (Σa∈A sia + Σb∈B sib).  (7)

The prototype model assumes that category decisions are based on the relative similarity of the test stimulus to the single central category prototype from each category. Let PA and PB be the prototypes (usually the mean category member) for Categories A and B, respectively. Then,

P(A | i) = siPA / (siPA + siPB).  (8)

To account for the finding that subjects tended to respond more deterministically than is predicted by the GCM (Ashby & Gott, 1988), Ashby and Maddox (1993) proposed a version of the GCM that allows the model to predict more or less response determinism than is predicted by the base-
[Figure: surface plots of the proportion of incorrect model selections as a function of trials per condition and individuals per experiment, in panel rows labeled Average − Individual, Average, and Individual; vertical axis, Proportion Incorrect.]
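The two integration rules in Equations 3 and 4 above are one-liners in code. The sketch below (with arbitrary support values, not fitted parameters) also shows the flexibility difference discussed in the text: one near-extreme parameter drives the FLMP to a near-extreme response probability, whereas the LIM needs both parameters to be extreme.

```python
def flmp(a, v):
    """Fuzzy logical model of perception (Equation 3)."""
    return (a * v) / (a * v + (1 - a) * (1 - v))

def lim(a, v):
    """Linear integration model (Equation 4): average the two supports."""
    return (a + v) / 2

# One near-extreme source and one neutral source (values made up):
p_flmp = flmp(0.99, 0.5)   # close to .99
p_lim  = lim(0.99, 0.5)    # only .745
```

This is the property the text invokes to explain why the FLMP is so much better at fitting all-or-none data when there are few trials per condition.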
line version given in Equation 7. This version of the GCM adds a response-scaling parameter, γ, to Equation 7,

P(A | i) = (Σa∈A sia)^γ / [(Σa∈A sia)^γ + (Σb∈B sib)^γ].  (9)

As γ increases, responses become more deterministic; that is, the more probable response category is more likely to be selected. When γ = 0, all categories are selected with equal probability. The GCM without and with the response-scaling parameter will be referred to as GCM–R (GCM–restricted) and GCM–γ, respectively. The same response-scaling parameter can be added to the prototype model. However, Nosofsky and Zaki (2002) showed that the addition of a response-scaling parameter to the prototype model formally trades off with the sensitivity parameter and, so, is redundant. The category members, classifications, and prototypes used as stimuli in the simulated categorization task are taken from the 5/4 category structure of Medin and Schaffer (1978, their Figure 4).

GCM–γ versus prototype. The analyses will start by comparing the prototype model with the GCM–γ. The
Figure 4. The bottom two rows give the proportions of simulated experiments for which the incorrect model was selected as a function of trials per condition and individuals per experiment when comparing the generalized context model–γ and prototype model of categorization fit to average (middle row) and individual (bottom row) data and using the optimal (left column) and zero (right column) criteria and the uninformed parameter method. The top row is the difference of the two bottom rows.
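As a toy illustration of Equations 5–9, the sketch below computes GCM–γ and prototype predictions for invented binary stimuli with illustrative weights and sensitivity (this is not the Medin–Schaffer 5/4 structure or the simulation's parameter ranges).

```python
import math

def distance(x, y, w):
    """Attention-weighted squared distance (Equation 5)."""
    return sum(wm * (xm - ym) ** 2 for wm, xm, ym in zip(w, x, y))

def similarity(x, y, w, c):
    """Exponentially decaying similarity (Equation 6)."""
    return math.exp(-c * distance(x, y, w))

def gcm_gamma(i, cat_a, cat_b, w, c, gamma):
    """GCM with response scaling (Equation 9); gamma = 1 recovers Equation 7."""
    sa = sum(similarity(i, a, w, c) for a in cat_a)
    sb = sum(similarity(i, b, w, c) for b in cat_b)
    return sa**gamma / (sa**gamma + sb**gamma)

def prototype(i, proto_a, proto_b, w, c):
    """Prototype model (Equation 8): similarity to the two central prototypes."""
    sa = similarity(i, proto_a, w, c)
    sb = similarity(i, proto_b, w, c)
    return sa / (sa + sb)

cat_a = [(1, 1, 0), (1, 0, 1)]          # hypothetical Category A exemplars
cat_b = [(0, 0, 1), (0, 1, 1)]          # hypothetical Category B exemplars
w, c = (0.5, 0.25, 0.25), 2.0           # illustrative weights and sensitivity
stim = (1, 1, 1)

p1 = gcm_gamma(stim, cat_a, cat_b, w, c, gamma=1.0)
p4 = gcm_gamma(stim, cat_a, cat_b, w, c, gamma=4.0)
# gamma > 1 pushes the favored category's probability toward 1 (determinism);
# gamma = 0 gives .5 for any stimulus, since both summed similarities are
# raised to the 0th power.
p_proto = prototype(stim, (1, 0.5, 0.5), (0, 0.5, 1), w, c)   # mean prototypes
```

Raising γ sharpens the same preference that Equation 7 expresses, which is the extra flexibility at issue when the GCM–γ fits extreme, few-trial data.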
parameter-sampling scheme is detailed in Appendix B. Again, although the distributions for the data generated using the uninformed and informed parameter selection methods have different shapes, the results were otherwise very similar. The discussion will focus on the results generated from uninformed parameters.

The GCM–γ is far more complex than the prototype model when fitting individual data for low trials per condition and individuals per experiment. This advantage decreases as the number of trials increases. For group data, the GCM–γ is again more complex than the prototype model for low numbers of trials and individuals. The relative complexity of the two models is reduced with increases in both trials and individuals. Indeed, with 25 trials and 34 individuals, the two models rarely mimicked each other.

A summary of the selection errors is given in Figure 4 for both the optimal (left column) and the zero (right column) criteria when uninformed parameters are used. When the zero criterion is used, it is typically safer to analyze average data, especially when the number of trials per condition is low. For moderate to high trials per condition, both averaged and individual data give approximately the same result. When the optimal criterion is used, both averaged and individual
data essentially produce the same results for all levels of the factors explored here.

Both AIC and BIC penalize the GCM–γ for its extra parameter (BIC's penalty depending on the number of observations), implemented as a shift of the zero criterion to another specified position. For individual analysis, AIC does no better than the zero cutoff, and BIC does even worse than the zero cutoff for very small numbers of trials per condition. We do not recommend the use of AIC and BIC with individual analysis.

PV was used for the case of 34 subjects and two trials per condition. Tables 1 and 2 show an interesting result for this particular situation. In Table 1, the individual analysis was more effective than the group analysis at predicting new data from the same simulated subjects. However, in Table 2, the group PV analysis picked the generating model more often than did the individual PV analysis. Combined, these results mean that the group analysis produced worse PV scores than did the individual analysis, yet was able to determine the generating model better. Because the group and individual analyses almost always agreed on the prototype model when it generated the data, this strange result is due to the GCM–γ-generated data. Of the 500 simulations in which the GCM–γ generated the data, there were 159 cases in which the group analysis picked the GCM–γ model and the individual analysis picked the prototype model. For 157 of these 159 cases, the prototype model had a better PV score than did the GCM–γ model. This result is likely due to the enhanced flexibility of the GCM–γ model with small amounts of data, as compared with the prototype model. With few trials, the GCM–γ can fit the extreme values produced by all-or-none trials better than the prototype model can. However, new data generated from the same process are unlikely to have the same extreme values, so the prototype model (although not the generating model) is a better predictor of new data.

In summary, regardless of the parameter-sampling scheme or criterion choice, similar results are given when average and individual data are used, except when the number of trials per condition is low. In this case, it is safer to use averaged data.

GCM–R versus prototype. An additional set of simulations was run with the response-scaling parameter, γ, of the GCM–γ fixed at one, so that the response probability always matched the probability of Equation 7. We will refer to this restricted model as the GCM–R. Data were generated from the prototype model as in the previous simulation. Details of the parameter-sampling procedure for the GCM–R are given in Appendix B.

As in the GCM–γ and prototype model simulations, the results using the zero and optimal criteria do differ, but because the differences here do not have a large impact on how often the correct generating model is selected, the analysis will focus on the optimal criterion. There are, however, some important differences between the results when uninformed and informed parameter selection methods are used. When uninformed parameters are used, model mimicry is relatively low for both individual and average data, except for low trials per condition and individuals per experiment, where the GCM–R was slightly more complex. Similar results were found when informed parameters were used, but, interestingly, the prototype model was the more general model.

The probability of selecting the incorrect model for the uninformed (left column) and informed (right column) parameter selection methods is shown in Figure 5. For moderate to high trials per condition, model selection was excellent regardless of parameter selection method or the use of average and individual data. The average analysis outperformed the individual analysis for low trials per condition when the uninformed parameters were used, but a reversal was found when informed parameters were used. This result can be taken as a demonstration that one cannot always expect that a reduction of data will produce an increasing benefit for group analysis.

Conclusions

The results of our simulations illustrate some of the complexities of model selection and model mimicry. They afford a few reliable conclusions and some suggestions for action and provide some pointers toward further research. It must be kept in mind that our main concern was the great body of researchers who need to draw conclusions on the basis of fairly simple and easy-to-apply techniques, which we implemented as maximum likelihood parameter estimation for each model. Our primary goal was investigation of the relative merits of analysis by individuals or analysis by group (data combined across individuals).

First, and most clearly, we have seen that neither analysis by individual nor analysis by group can be recommended as a universal practice. The field began in the 1800s with experiments utilizing just one or only a few subjects and, therefore, using individual analysis. As studies began using larger groups of subjects, typically with fewer data per subject, group analysis became common. More recently, practice has moved toward individual analysis, followed by combination of the results across individuals. We have seen, most clearly in the case of the power/exponential model comparison, that the last practice is not always justified. Furthermore, we have seen cases in which group analysis is superior, whether one's goal is selecting the better model or obtaining accurate parameter estimates (although our main focus was on the former).

Our simulations show a tendency for the relative advantage of group analysis to increase as the number of data per subject drops. There are, of course, many potential benefits of individual data analysis that we have not explored in our simulations, such as the possibility of partitioning individuals into groups with similar parameters or into groups obeying different models. We note that such refinements of individual analysis will likely be ineffective in the situations that we have found in which group analysis is superior: experiments with few data per individual. On the other hand, one cannot unconditionally recommend group analysis, because such analysis is subject to well-known distortions in many of the settings in which individuals operate with different parameters. These factors often operate in opposition. Our findings demonstrate the perhaps surprising result that when there are very few data per individual, individual analysis is subject to noise and bias that sometimes produce distortions
706 Cohen, Sanborn, and Shiffrin
Average � Individual
Proportion Incorrect
Proportion Incorrect
.05 .15
.1
0
.05
–.05
0
–.1 –.05
10 5 10 5
10 10
20 15 20 15
30 25 20 30 25 20
Average Average
Proportion Incorrect
Proportion Incorrect
.3 .3
.2 .2
.1 .1
0 0
10 5 10 5
10 10
20 15 20 15
30 25 20 30 25 20
Individual Individual
Proportion Incorrect
Proportion Incorrect
.3
.3
.2 .2
.1 .1
0 0
10 5 10 5
10 10
20 15 20 15
30 25 20 30 25 20
Ind./Exp. Trials/Cond. Ind./Exp. Trials/Cond.
Figure 5. The bottom two rows give the proportions of simulated experiments for which the incorrect
model was selected as a function of trials per condition and individuals per experiment when comparing the
generalized context model–restricted and prototype model of categorization fit to average (middle row) and
individual (bottom row) data and using the uninformed (left column) and informed (right column) param-
eter selection methods and the optimal criterion. The top row is the difference of the two bottom rows.
even more serious than those produced by grouping. In such cases, group analysis is the least bad strategy.

These findings do not produce a clear recommendation for practice, because there does not seem to exist a generally optimal approach, at least for the cases we analyzed in detail in which an investigator is limited to the use of maximum likelihood parameter estimation. Superficially, it appears tempting to recommend more advanced methods of model selection, such as FIA, NML, BMS, and so forth. Our results, however, lead to a cautionary note: The use of a single advanced method could produce poor results, because we have seen different answers produced by different methods. In contrast with a number of the theoretically based methods for balancing simplicity and fit, PBCM provides a good deal of extra information, giving the distributions of expected outcomes when each model is in fact "true." The PBCM simulation method has the advantages of relatively easy implementation, ease of comparison of individual and group analysis for small data sets, and a measure of how well an analysis will work for a particular experimental design, but it requires the distribution of subject parameters to be specified. In addition, PBCM requires one to accept the goal of selection of the actual generating model. Experts in the field of model selection will often prefer other goals (ones underlying such techniques as NML and BMS). In our opinion, there are multiple goals of model selection that cannot all be satisfied simultaneously, even when focus is restricted primarily to the goal of balancing simplicity and fit. Thus the "best" goal for model selection remains an elusive target, a matter for debate, and a subject for ongoing research.

Model Evaluation 707

[Figure 6 appears here: for each model pair (Exp/Pow, FLMP/LIM, GCM–R/Proto, GCM–γ/Proto), heatmaps of Prop. Agree; Agree: Prop. Incorr.; Disagree: Prop. Incorr., Avg.; and Disagree: Prop. Incorr., Ind., plotted against Ind./Exp. and Trials/Cond.]

Figure 6. Proportions of simulations for which average and individual analyses agree, proportions of incorrect model selections when average and individual analyses agree, and proportions of incorrect model selections using average and individual analyses, respectively, when average and individual model analyses do not agree for the four pairs of models and levels of trials per condition and individuals per experiment discussed in the text. Xs indicate conditions for which average and individual analyses always agreed. Exp, exponential; Pow, power; FLMP, fuzzy logical model of perception; LIM, linear integration model; GCM, generalized context model; Proto, prototype.

There is, however, an easy-to-use heuristic that can increase researchers' confidence in their model selection results. Because the computer program capable of analyzing the combined group data will be capable of analyzing individual data without extra programming, and vice versa, both methods should be used for each data set. When both methods lead to the same conclusion, it is a reasonable decision strategy to accept this conclusion. There is, of course, no guarantee that the conclusion so reached will be correct. In fact, the accuracy of such joint
acceptance decisions could well be significantly lower than the decision reached with the better method by itself. This was seen, for example, in the case of the power law versus exponential, with few data per individual, where individual analysis was essentially at chance. Applying a joint agreement heuristic in such a situation is equivalent to randomly selecting one half of the group analysis cases. It is clear that such a procedure can only reduce the efficacy of group analysis. Nonetheless, the joint acceptance heuristic may be the best strategy available in practice.

We tested this joint acceptance heuristic on the simulated data discussed previously with uninformed parameter selection. The results are given in Figure 6 for the four model comparisons. The x- and y-axes of each panel are the different levels of individuals per experiment and trials per condition, respectively, discussed above. Simulations with one individual per experiment are not included in the figures, because average and individual analyses always agree. For each model comparison, the upper left panel gives the proportion of simulations for which the average and individual analyses agreed. The upper right panel indicates the proportion of simulations for which the incorrect model was selected when average and individual analyses agreed. The lower left and lower right panels give the proportion of simulations in which average and individual data, respectively, were used for which the incorrect model was selected when the average and individual analyses disagreed. Light and dark shadings indicate high and low proportions, respectively.

First, note that, in general, agreement between group and individual analyses was high. The only exception was for the exponential and power simulations, especially for a large number of subjects with few trials. Second, when average and individual analyses agreed, model selection was nearly perfect for most model pairs. The accuracy for the exponential and power simulations was less than that for the other model pairs but was still good. Third, the simulations do not suggest a clear choice for which analyses to favor when average and individual analyses disagree. For the exponential and power simulations, average analysis clearly performs better, and interestingly, for few subjects and trials, group analysis is even better when it disagrees with individual analysis. For the FLMP and LIM, individual analysis is preferred when there is disagreement. There is no obvious recommendation for the GCM and prototype simulations. In our simulations, the joint acceptance heuristic tends to outperform even the best single method of analysis.

The field of model selection and data analysis methodology continues to evolve, of course, and although there may be no single best approach, the methods continue to improve. Researchers have, for example, already begun to explore hierarchical analyses. Hierarchical analyses hold the potential of combining the best elements of both individual and group methods (see, e.g., Navarro, Griffiths, Steyvers, & Lee, 2006). The technical methods of model selection also continue to evolve in other ways. We can, for example, expect to see methods combining elements of BMS and NML (e.g., Grünwald, 2007). We end by noting that although advanced research in model selection techniques tends not to focus on the issues of accessibility and ease of use, it is not unreasonable to expect the eventual production of packaged programs placing the best state-of-the-art methods within reach of the average researcher.

Author Note

All the authors contributed equally to this article. A.N.S. was supported by a National Defense Science and Engineering Graduate Fellowship and an NSF Graduate Research Fellowship. R.M.S. was supported by NIMH Grants 1 R01 MH12717 and 1 R01 MH63993. The authors thank Woojae Kim, Jay Myung, Michael Ross, Eric-Jan Wagenmakers, Trisha Van Zandt, Michael Lee, and Daniel Navarro for their help in the preparation of the manuscript and Dominic Massaro, John Paul Minda, David Rubin, and John Wixted for the use of their data. Correspondence concerning this article should be addressed to A. L. Cohen, Department of Psychology, University of Massachusetts, Amherst, MA 01003 (e-mail: acohen@psych.umass.edu).

References

Akaike, H. (1973). Information theory as an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), Second International Symposium on Information Theory (pp. 267-281). Budapest: Akademiai Kiado.
Anderson, N. H. (1981). Foundations of information integration theory. New York: Academic Press.
Anderson, R. B., & Tweney, R. D. (1997). Artifactual power curves in forgetting. Memory & Cognition, 25, 724-730.
Ashby, F. G., & Gott, R. E. (1988). Decision rules in the perception and categorization of multidimensional stimuli. Journal of Experimental Psychology: Learning, Memory, & Cognition, 14, 33-53.
Ashby, F. G., & Maddox, W. T. (1993). Relations between prototype, exemplar, and decision bound models of categorization. Journal of Mathematical Psychology, 37, 372-400.
Ashby, F. G., Maddox, W. T., & Lee, W. W. (1994). On the dangers of averaging across subjects when using multidimensional scaling or the similarity-choice model. Psychological Science, 5, 144-151.
Barron, A. R., Rissanen, J., & Yu, B. (1998). The MDL principle in modeling and coding. IEEE Transactions on Information Theory, 44, 2743-2760.
Berger, J. O., & Pericchi, L. R. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91, 109-122.
Bower, G. H. (1961). Application of a model to paired-associate learning. Psychometrika, 26, 255-280.
Browne, M. W. (2000). Cross-validation methods. Journal of Mathematical Psychology, 44, 108-132.
Burnham, K. P., & Anderson, D. R. (1998). Model selection and inference: A practical information-theoretic approach. New York: Springer.
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York: Springer.
Busemeyer, J. R., & Wang, Y.-M. (2000). Model comparisons and model selections based on generalization criterion methodology. Journal of Mathematical Psychology, 44, 171-189.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Estes, W. K. (1956). The problem of inference from curves based on group data. Psychological Bulletin, 53, 134-140.
Estes, W. K., & Maddox, W. T. (2005). Risks of drawing inferences about cognitive processes from model fits to individual versus average performance. Psychonomic Bulletin & Review, 12, 403-408.
Grünwald, P. [D.] (2000). Model selection based on minimum description length. Journal of Mathematical Psychology, 44, 133-152.
Grünwald, P. D. (2005). Minimum description length tutorial. In P. D. Grünwald, I. J. Myung, & M. A. Pitt (Eds.), Advances in minimum description length: Theory and applications (pp. 23-80). Cambridge, MA: MIT Press.
Grünwald, P. D. (2007). The minimum description length principle. Cambridge, MA: MIT Press.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.
Hayes, K. J. (1953). The backward curve: A method for the study of learning. Psychological Review, 60, 269-275.
Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability. Proceedings of the Cambridge Philosophical Society, 31, 203-222.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford: Oxford University Press.
Karabatsos, G. (2006). Bayesian nonparametric model selection and model testing. Journal of Mathematical Psychology, 50, 123-148.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Lagarias, J. C., Reeds, J. A., Wright, M. H., & Wright, P. E. (1998). Convergence properties of the Nelder–Mead simplex method in low dimensions. SIAM Journal on Optimization, 9, 112-147.
Lee, M. D., & Webb, M. R. (2005). Modeling individual differences in cognition. Psychonomic Bulletin & Review, 12, 605-621.
Mack, A., & Rock, I. (1998). Inattentional blindness. Cambridge, MA: MIT Press.
Macmillan, N. A., & Creelman, C. D. (2005). Detection theory: A user's guide (2nd ed.). Mahwah, NJ: Erlbaum.
Massaro, D. W. (1998). Perceiving talking faces: From speech perception to a behavioral principle. Cambridge, MA: MIT Press.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.
Minda, J. P., & Smith, J. D. (2002). Comparing prototype-based and exemplar-based accounts of category learning and attentional allocation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 275-292.
Myung, I. J., Kim, C., & Pitt, M. A. (2000). Toward an explanation of the power law artifact: Insights from response surface analysis. Memory & Cognition, 28, 832-840.
Navarro, D. J., Griffiths, T. L., Steyvers, M., & Lee, M. D. (2006). Modeling individual differences using Dirichlet processes. Journal of Mathematical Psychology, 50, 101-122.
Navarro, D. J., Pitt, M. A., & Myung, I. J. (2004). Assessing the distinguishability of models and the informativeness of data. Cognitive Psychology, 49, 47-84.
Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.
Nosofsky, R. M., & Zaki, S. R. (2002). Exemplar and prototype models revisited: Response strategies, selective attention, and stimulus generalization. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 924-940.
Oden, G. C., & Massaro, D. W. (1978). Integration of featural information in speech perception. Psychological Review, 85, 172-191.
Reed, S. K. (1972). Pattern recognition and categorization. Cognitive Psychology, 3, 382-407.
Rissanen, J. (1978). Modeling by the shortest data description. Automatica, 14, 465-471.
Rissanen, J. (1987). Stochastic complexity. Journal of the Royal Statistical Society B, 49, 223-239, 252-265.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42, 40-47.
Rissanen, J. (2001). Strong optimality of the normalized ML models as universal codes and information in data. IEEE Transactions on Information Theory, 47, 1712-1717.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237, 1317-1323.
Sidman, M. (1952). A note on functional relations obtained from group data. Psychological Bulletin, 49, 263-269.
Siegler, R. S. (1987). The perils of averaging data over strategies: An example from children's addition. Journal of Experimental Psychology: General, 116, 250-264.
Wagenmakers, E.-J., Ratcliff, R., Gomez, P., & Iverson, G. J. (2004). Assessing model mimicry using the parametric bootstrap. Journal of Mathematical Psychology, 48, 28-50.
Wixted, J. T., & Ebbesen, E. B. (1991). On the form of forgetting. Psychological Science, 2, 409-415.

Notes

1. Recent work, such as Bayesian hierarchical modeling (e.g., Lee & Webb, 2005), allows model fitting with parameters that vary within or across individuals.
2. Of course, even in the case of well-justified individual analyses, there is still the possibility of other potential distortions, such as the switching of processes or changing of parameters across trials (Estes, 1956). Those possibilities go beyond the present research.
3. We recognize that quantitative modeling is carried out for a variety of purposes, such as describing the functional form of data, testing a hypothesis, testing the truth of a formal model, and estimating model parameters. Although it is possible that the preferred method of analysis could depend on the purpose, we believe that the lessons learned here will also be of use when other goals of data analysis are pursued.
4. This assumption underlies other common model selection techniques, such as BIC (Burnham & Anderson, 1998).
5. It may be best to use a common set of parameters to fit the data from each individual. That is, rather than fitting the models either to a single set of averaged data or to individuals, using different parameters for each individual, it is possible to find the single set of model parameters that maximizes the goodness of fit for each of the individuals. This analysis was repeated for each pair of models discussed below, but the results from this third type of analysis were almost identical to those when average data were used. They will not be discussed further.
6. It is unclear how to combine measures such as sum of squared error across individuals in a principled manner.
7. Note that this is a very simple form of uninformed prior on the parameters. To be consistent with past work (Wagenmakers et al., 2004), the priors would have to be selected from a more complex prior, such as the Jeffreys prior. Because of its ease of use, we opted for the uniform prior. Furthermore, as will be seen, the results, for the most part, seem to be unchanged by the choice of prior.
8. 13 individuals/experiment × 9 trials/condition × 2 parameter selection methods × 500 simulations.
9. Note that we use the term optimal guardedly. The optimality depends not only on the details of our simulation procedure, but also, more critically, on the goal of selecting the model that actually generated the data. There are many goals that can be used as targets for model selection, and these often conflict with each other. If FIA, for example, does not produce a decision criterion at the "optimal" point, one could argue that FIA is nonetheless a better goal for model selection. The field has not yet converged on a best method for model selection, or even on a best approximation. In our present judgment, the existence of many mutually inconsistent goals for model selection makes it unlikely that a single best method exists.

ARCHIVED MATERIALS

The following materials associated with this article may be accessed through the Psychonomic Society's Norms, Stimuli, and Data archive, www.psychonomic.org/archive.
To access these files, search the archive for this article using the journal name (Psychonomic Bulletin & Review), the first author's name (Cohen), and the publication year (2008).
File: Cohen-PB&R-2008.doc. Description: Microsoft Word document, containing Tables A1–A16 and Figures A1–A12.
File: Cohen-PB&R-2008.rtf. Description: .rtf file, containing Tables A1–A16 and Figures A1–A12 in .rtf format.
File: Cohen-PB&R-2008.pdf. Description: Acrobat .pdf file, containing Figures A1–A12.
File: Cohen2-PB&R-2008.pdf. Description: Acrobat .pdf file, containing Tables A1–A16.
Author's e-mail address: acohen@psych.umass.edu.
Appendix A
Bayesian Model Selection
The use of BMS for model selection raises many deep issues and is a topic we hope to take up in detail in
future research. We report here only a few interesting preliminary results. Let Model A with parameters θ be
denoted $A_\theta$, with associated prior probability $A_{\theta,0}$. In the simplest approach, the posterior odds for Model A over Model B are given by

$$\sum_{\theta} \frac{A_{\theta,0}}{B_{\theta,0}}\,\frac{P(D \mid A_\theta)}{P(D \mid B_\theta)},$$
where D is the data. The sum is replaced by an integral for continuous parameter spaces. Because BMS inte-
grates likelihood ratios over the parameter space, the simulated difference of log maximum likelihoods is not an
appropriate axis for exhibiting results. Our plots for BMS show differences of log (integrated) likelihoods.
As is usual in BMS applications, the answers will depend on the choice of priors (see Note A1). Consider first a flat uniform prior for each parameter in each model, ranging from 0.001 to 1.0 for both coefficients and 0.001 to 1.5
for both decay parameters (these ranges were chosen to encompass the range of plausible possibilities). This
approach produces the smoothed histogram graphs in Figure A1. The natural decision statistic is zero on the log
likelihood difference axis. For these priors, performance was terrible: The probability of correct model selection
was .52 for group analysis and .50 (chance) for individual analysis. It seems clear that the choice of priors overly
penalized the exponential model, relative to the power law model, for the data sets to which they were applied.
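The grid-based computation just described can be sketched as follows. Only the uniform prior ranges (0.001–1.0 for the coefficients, 0.001–1.5 for the decay parameters) come from the text; the retention-function forms and all names here are illustrative assumptions, not the paper's exact equations.

```python
import numpy as np
from math import lgamma, log

def binom_logpmf(k, n, p):
    """Log binomial probability of k successes in n trials."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def log_marginal_likelihood(ks, n, ts, pred, coef_grid, decay_grid):
    """Log of the likelihood averaged over a uniform grid prior on (a, b)."""
    logs = []
    for a in coef_grid:
        for b in decay_grid:
            # predicted recall probabilities, clipped as in the simulations
            ps = np.clip(pred(a, b, np.asarray(ts, dtype=float)), 1e-3, 1 - 1e-3)
            logs.append(sum(binom_logpmf(k, n, p) for k, p in zip(ks, ps)))
    logs = np.array(logs)
    m = logs.max()
    return float(m + np.log(np.mean(np.exp(logs - m))))  # log-mean-exp

# Illustrative retention functions (assumed parameterizations).
def exp_model(a, b, t):
    return a * np.exp(-b * t)

def pow_model(a, b, t):
    return a * (t + 1.0) ** (-b)
```

The decision statistic plotted in Figure A1 is then the difference of the two log integrated likelihoods, with zero as the natural criterion.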
[Figure A1 appears here: two smoothed histograms of density against log p(Pow | D) − log p(Exp | D).]

Figure A1. Histograms showing the distribution of Bayesian model selection results for the exponential and power law models with 34 subjects and two trials per condition. The group data results are shown in the top panel, and the individual data results are shown in the bottom panel. A uniform prior was used in this simulation, with the same parameter range for both models.
By changing the priors, we can greatly change the results of BMS. For example, employing priors that
matched the parameter distributions used to generate the data increased model selection accuracy to levels
comparable to those of PBCM and NML. We did not pursue these approaches further because, in most cases,
in practice one would not have the advance knowledge allowing selection of such “tailored” priors. In future
research, we intend to explore other Bayesian approaches that could circumvent this problem. In particular,
various forms of hierarchical Bayesian analysis look promising. For example, one could assume that individual
parameter estimates are drawn from a Gaussian distribution with some mean and variance (and covariance).
It seems likely that such an assumption would produce a tendency for individual parameter estimates to move
closer to the group mean estimates—in a sense, interpolating between individual and group analysis. Other
possible approaches are the BNPMS method recently proposed by Karabatsos (2006) and modeling individual
subjects with a Dirichlet process (Navarro, Griffiths, Steyvers, & Lee, 2006).
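As a toy illustration of this interpolation idea (entirely our construction; the text proposes, but does not implement, such an analysis), the Gaussian-prior posterior mean pulls each individual's estimate toward the group mean:

```python
import numpy as np

def shrink(estimates, sampling_var, group_var):
    """Posterior means when individual estimates (with sampling variance
    sampling_var) are assumed drawn from a Gaussian with variance group_var
    around the group mean: each value becomes a weighted average of the
    individual estimate and the group mean."""
    x = np.asarray(estimates, dtype=float)
    w = group_var / (group_var + sampling_var)  # weight on the individual
    return w * x + (1 - w) * x.mean()
```

A very large `group_var` recovers pure individual analysis, while `group_var` near zero collapses everyone onto the group mean, in the sense of interpolating between the two analyses.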
Note
A1. Note that all inference implicitly involves prior assumptions. That this assumption is made explicit is one of the
strengths of Bayesian analysis.
Appendix B
Simulation Details
General Methodology
All of the simulations were run as described in the main text, with all combinations of 1, 2, 3, 4, 5, 6, 7, 8, 10,
14, 17, 20, and 34 individuals per experiment and for 1, 2, 3, 5, 7, 10, 13, 17, and 25 trials per condition. This
range spans the range of likely experimental values and invites analyses in which trials per condition and indi-
viduals per experiment are varied but the total number of trials per experiment is held roughly constant. All of the
simulations reported below were run in Matlab, using the simplex search method (Lagarias, Reeds, Wright, & Wright, 1998) to find the maximum likelihood parameters (see Note B1). To reduce potential problems with local minima, each fit was repeated three times with different, randomly selected starting parameters. The ranges of possible
parameters for both the generating and the fitting models were constrained to the model-specific values given below. Each of the two distributions (the distributions of the decision statistic when each model generates the data) was transformed into an empirical cumulative density function (cdf); these cdfs were, in turn, used to find the optimal criterion, the criterion that minimizes the overall proportion of model misclassifications. If the two distributions are widely separated, there may be a range of points over which the proportion of misclassifications does not vary. In this case, the optimal criterion was the mean of this range.
To get a feel for the stability of the optimal criterion, this procedure was repeated five times. On each repetition,
400 of the points were used to generate the optimal criterion, and the remaining 100 points (0–100, 100–200,
etc.) were used to determine the number of misclassifications. In general, the standard deviation across these
five repetitions was well below 0.05, suggesting very stable estimates, and will not be discussed further.
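The optimal-criterion search can be sketched as follows (names are ours; we assume, as in PBCM applications, that the decision statistic tends to be larger when Model A generated the data):

```python
import numpy as np

def optimal_criterion(stats_a, stats_b):
    """Cutoff minimizing total misclassifications, given decision-statistic
    samples from simulations in which Model A (classified as A when the
    statistic >= cutoff) or Model B generated the data. If several candidate
    cutoffs tie, the midpoint of the tied set is returned, mirroring the
    'mean of this range' rule in the text."""
    a = np.asarray(stats_a, dtype=float)
    b = np.asarray(stats_b, dtype=float)
    cands = np.sort(np.concatenate([a, b]))
    # misclassifications at each candidate cutoff
    errors = np.array([(a < c).sum() + (b >= c).sum() for c in cands])
    tied = cands[errors == errors.min()]
    return float((tied.min() + tied.max()) / 2)
```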
As was mentioned earlier, the overall fit measure when the models were applied to individual data was generated by adding the differences of the log likelihood fits (or, equivalently, multiplying the likelihood ratios) across the individuals. Assuming that the individuals are independent of one another, this statistic is the logarithm of the ratio of the probability of the data given the first model and its best-fitting parameters to the probability of the data given the second model and its best-fitting parameters.
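In code, the combined individual-analysis statistic is simply the sum of the per-individual log likelihood differences (function and argument names are ours):

```python
import numpy as np

def individual_decision_statistic(loglik_a, loglik_b):
    """Sum over individuals of (max log likelihood under Model A minus
    max log likelihood under Model B); equivalently, the log of the
    product of the individuals' likelihood ratios."""
    return float(np.sum(np.asarray(loglik_a) - np.asarray(loglik_b)))
```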
Models of Categorization: GCM–R, GCM–γ, and Prototype
The proportion of incorrect model selections for the GCM–γ and prototype models is given in the online
supplemental material (www.psychonomic.org/archive). To prevent infinite log values when the maximum
likelihood parameters were found, all probabilities were restricted to lie in the range of 0.001–0.999.
Uninformed. For the GCM–R, GCM–γ, and prototype models, data when the uninformed prior was used
were generated as follows. For both the GCM–γ and the prototype models, c and γ were restricted to be in the
range of 0.1–15. Each of the four raw (i.e., before they were constrained to sum to 1) wms began in the range
of 0.001–0.999. For each simulated experiment, single “means” were randomly selected for the c, γ, and four
wm parameter values from each of these ranges. Each wm was then normalized by the sum of all of the wms.
For c and γ, the parameter values for each individual were selected from a normal distribution with mean equal to the "mean" parameter value and standard deviation equal to 5% of the overall parameter range. The individual wm parameters were selected using the same method but, because of the extremely restricted range and normalizing procedure, with a standard deviation of 1% of the overall parameter range. If an individual's
parameter value fell outside of the acceptable parameter range, the value was resampled. The weights were then
renormalized. Because of the normalization and resampling procedures, the standard deviation of the wm will
not be exactly 0.01. For example, the average standard deviation for a single wm in the one trial per condition
and 34 individuals per experiment simulation was 0.0097.
For the GCM–R, selections of the c and the wm parameters are unchanged from the GCM–γ, but γ is fixed
at 1.
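The uninformed generating scheme can be sketched for the GCM–R (γ fixed at 1) as follows. The ranges and jitter percentages follow the text; the function and variable names, and the exact order of normalization steps, are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C_RANGE = (0.1, 15.0)      # range for the sensitivity parameter c
W_RANGE = (0.001, 0.999)   # range for each raw attention weight w_m

def sample_individual(mean, lo, hi, sd_frac):
    """Normal draw around the experiment-level 'mean', SD equal to sd_frac
    of the parameter range, resampling until the draw lies in range."""
    sd = sd_frac * (hi - lo)
    while True:
        x = rng.normal(mean, sd)
        if lo <= x <= hi:
            return x

def generate_experiment(n_individuals):
    """One simulated experiment: draw experiment-level 'means', then
    jittered individual parameters around them."""
    c_mean = rng.uniform(*C_RANGE)
    w_means = rng.uniform(W_RANGE[0], W_RANGE[1], size=4)
    w_means = w_means / w_means.sum()          # normalize the mean weights
    out = []
    for _ in range(n_individuals):
        c = sample_individual(c_mean, C_RANGE[0], C_RANGE[1], 0.05)
        w = np.array([sample_individual(m, W_RANGE[0], W_RANGE[1], 0.01)
                      for m in w_means])
        out.append((c, w / w.sum()))           # renormalize after jitter
    return out
```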
Informed. The best-fitting parameters for each model were also found for each of the 48 subjects in Minda
and Smith’s (2002) Experiment 2, who participated in eight trials per condition. This set of parameters was used
as the pool from which informed parameters were drawn with replacement.
Notes
B1. For some of the models below, closed form solutions exist for the maximum likelihood parameters, but we opted to
perform a search for the parameters.
B2. Although, in this context, the order of the parameters does not matter, they were sorted from lowest to highest value
for each dimension.
Appendix C
Predictive Validation
For all of the models examined in this article, the distributions discussed in the text are binomial. For two binomial distributions, the K–L divergence between the probability of a success under the generating model, $p_G$, and the probability of success under the comparison model, $p_C$, is

$$\sum_{i=0}^{n} \frac{n!}{i!\,(n-i)!}\, p_G^{\,i} (1-p_G)^{\,n-i} \left[ i \log\frac{p_G}{p_C} + (n-i)\log\frac{1-p_G}{1-p_C} \right], \tag{C1}$$
where i is the number of successes. The quantity in Equation C1 is summed over all conditions and subjects to
produce the total K–L divergence between the generating and the comparison models. The generating models
in this article used different parameters for each individual. To make as close a comparison as possible, the data
were taken from the informed PBCM simulations described in the main text. For the individually fitted models, the K–L divergence between the generating and the fitted parameters was summed over individuals. To make the models fitted to the group data comparable, the K–L divergence was summed over each generating individual against the single set of group parameters.
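A direct sketch of the Equation C1 summation (our naming). Because the divergence between two binomials with the same n reduces to n times the divergence between the corresponding Bernoulli distributions, the closed form $n[p_G \log(p_G/p_C) + (1-p_G)\log((1-p_G)/(1-p_C))]$ provides a check:

```python
from math import lgamma, log, exp

def binomial_kl(p_g, p_c, n):
    """K-L divergence of Eq. C1: Binomial(n, p_g) vs. Binomial(n, p_c)."""
    total = 0.0
    for i in range(n + 1):
        # log of the binomial weight C(n, i) * p_g**i * (1 - p_g)**(n - i)
        log_w = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                 + i * log(p_g) + (n - i) * log(1 - p_g))
        total += exp(log_w) * (i * log(p_g / p_c)
                               + (n - i) * log((1 - p_g) / (1 - p_c)))
    return total
```

Summing `binomial_kl` over conditions and individuals gives the total divergence between the generating and the comparison models described above.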