Big Data in Cognitive Science
1
MINING NATURALISTIC DATA
M. N. Jones
Abstract
Cognitive research is increasingly coming out of the laboratory. It is becoming much
more common to see research that repurposes large-scale and naturalistic data sources to
develop and evaluate cognitive theories at a scale not previously possible. We now have
unprecedented availability of massive digital data sources that are the product of human
behavior and offer clues to understand basic principles of cognition. A key challenge for
the field is to properly interrogate these data in a theory-driven way to reverse engineer
the cognitive forces that generated them; this necessitates advances in both our theoretical
models and our methodological techniques. The arrival of Big Data has been met with
healthy skepticism by the field, but has also been seen as a genuine opportunity to advance
our understanding of cognition. In addition, theoretical advancements from Big Data are
heavily intertwined with new methodological developments—new techniques to answer
questions from Big Data also give us new questions that could not previously have been
asked. The goal of this volume is to present emerging examples from across the field that
use large and naturalistic data to advance theories of cognition that would not be possible in
the traditional laboratory setting.
cognitive principles; but we have to be able to put all those pieces together in a
reasonable way. This approach necessitates both advances in our theoretical models
and development of new methodological techniques adapted from the information
sciences.
Big Data sources are now allowing cognitive scientists to evaluate theoretical
models and make new discoveries at a resolution not previously possible. For
example, we can now use online services like Netflix, Amazon, and Yelp to
evaluate theories of decision-making in the real world and at an unprecedented
scale. Wikipedia edit histories can be analyzed to explore information transmission
and problem solving across groups. Linguistic corpora allow us to quantitatively
evaluate theories of language adaptation over time and generations (Lupyan &
Dale, 2010) and models of linguistic entrainment (Fusaroli, Perlman, Mislove,
Paxton, Matlock, & Dale, 2015). Massive image repositories are being used to
advance models of vision and perception based on natural scene statistics (Griffiths,
Abbott, & Hsu, 2016; Khosla, Raju, Torralba, & Oliva, 2015). Twitter and
Google search trends can be used to track the outbreak and spread of “infectious”
ideas, memory contagion, and information transmission (Chen & Sakamoto, 2013;
Masicampo & Ambady, 2014; Wu, Hofman, Mason, & Watts, 2011). Facebook
feeds can be manipulated2 to explore information diffusion in social networks
(Bakshy, Rosenn, Marlow, & Adamic, 2012; Kramer, Guillory, & Hancock,
2014). Theories of learning can be tested at large scales and in real classroom
settings (Carvalho, Braithwaite, de Leeuw, Motz, & Goldstone, 2016; Fox,
Hearst, & Chi, 2014). Speech logs afford both theoretical advancements in auditory
speech processing, and practical advancements in automatic speech comprehension
systems.
The primary goal of this volume is to present cutting-edge examples that use
large and naturalistic data to uncover fundamental principles of cognition and
evaluate theories that would not be possible without such scale. A more general
aim of the volume is to take a very careful and critical look at the role of Big Data
in our field. Hence contributions to this volume were handpicked to be examples
of advancing theory development with large and naturalistic data.
diverge beyond that. The issue is now almost humorous, with Dan Ariely’s popular
quip comparing Big Data to teenage sex, in that “everyone talks about it, nobody
really knows how to do it, everyone thinks everyone else is doing it, so everyone
claims they are doing it.”
As scientists, we are quite fond of careful operational definitions. However,
Big Data and data science are still-evolving concepts, and are moving targets for
formal definition. Definitions tend to vary depending on the field of study. A
strict interpretation of Big Data from the computational sciences typically refers to
datasets that are so massive and rapidly changing that our current data processing
methods are inadequate. Hence, it is a drive for the development of distributed
storage platforms and algorithms to analyze datasets that are currently out of
reach. The term extends to challenges inherent in data capture, storage, transfer,
and predictive analytics. As a loose quantification, data under this interpretation
currently become “big” at scales north of the exabyte.
Under this strict interpretation, work with true Big Data is by definition quite
rare in the sciences; it is more development of architectures and algorithms to
manage these rapidly approaching scale challenges that are still for the most part on
the horizon (NIST Big Data Working Group, 2014). At this scale, it isn’t clear that
there are any problems in cognitive science that are true Big Data problems yet.
Perhaps the largest data project in the cognitive and neural sciences is the Human
Connectome Project (Van Essen et al., 2012), an ambitious project aiming to
construct a network map of anatomical and functional connectivity in the human
brain, linked with batteries of behavioral task performance. Currently, the project
is approaching a petabyte of data. By comparison, the Large Hadron Collider
project at CERN records and stores over 30 petabytes of data from experiments
each year.3
More commonly, the Gartner 3 Vs definition of Big Data is used across multiple
fields: “Big data is high volume, high velocity, and/or high variety information
assets that require new forms of processing to enable enhanced decision-making,
insight discovery and process optimization” (Laney, 2012). Volume is often
indicative of the fact that Big Data records and observes everything within a
recording register, in contrast to our commonly used methods of sampling in the
behavioral sciences. Velocity refers to the characteristic that Big Data is often a
real-time stream of rapidly captured data. The final characteristic, variety, denotes
that Big Data draws from multiple qualitatively different information sources (text,
audio, images, GPS, etc.), and uses joint inference or fusion to answer questions
that are not possible by any source alone. But far from being expensive to collect,
Big Data is usually a natural byproduct of digital interaction.
So while a strict interpretation of Big Data puts it currently out of reach, it
is simultaneously everywhere by more liberal interpretations. Predictive analytics
based on machine learning has been hugely successful in many applied settings
(see Hu, Wen, & Chua, 2014, for a review). Newer definitions of Big Data
I have observed over the years that there is a tendency for even the best
cognitive scientists to lose sight of large issues in their devotion to particular
methodologies, their pursuit of the null hypothesis, and their rigorous efforts
to reduce anything that seems interesting to something else that is not. An
occasional reminder of why we flash those stimuli and measure those reaction
times is sometimes useful.
(Miller, 1990: 7)
Furthermore, we are now discovering that much of the behavior we want
to use to make inferences about cognitive mechanisms is heavy-tail distributed
(exponential and power-law distributions are very common in cognitive research).
Sampling behavior in a one-hour lab setting is simply insufficient to ever observe
the rare events that allow us to discriminate among competing theoretical accounts.
And building a model from the center of a behavioral distribution may fail horribly
to generalize if the tail of the distribution is the important characteristic that the
cognitive mechanism evolved to deal with.
So while skepticism about Big Data in cognitive science is both welcome and
warranted, the above points are just a few reasons why Big Data could be a genuine
opportunity to advance our understanding of human cognition. If dealt with in
a careful and theoretically driven way, Big Data offers us a completely new set
of eyes to understand cognitive phenomena, to constrain among theories that
are currently deadlocked with laboratory data, to evaluate generalizability of our
models, and to have an impact on the real-world situations that our models are
meant to explain (e.g. by optimizing medical and consumer decisions, information
discovery, education, etc.). And embracing Big Data brings with it development of
new analytic tools that also allow us to ask new theoretical questions that we had
not even considered previously.
informatics, and the scale of data allows the theorist to paint a more complete
and realistic picture of cognitive mechanisms. Furthermore, online labor markets
such as Amazon’s Mechanical Turk have accelerated the pace of experiments
by allowing us to conduct studies that might take years in the laboratory in
a single day online (Crump, McDonnell, & Gureckis, 2013; Gureckis et al.,
2015).
Examples of new data capture technologies advancing our theoretical inno-
vations are emerging all over the cognitive sciences. Cognitive development is a
prime example. While development unfolds over time, the field has traditionally
been reliant on evaluating infants and toddlers in the laboratory for short studies
at regular intervals across development. Careful experimental and stimulus control
is essential, and young children can only provide us with a rather limited range
of response variables (e.g., preferential looking and habituation paradigms are very
common with infants).
While this approach has yielded very useful information about basic cognitive
processes and how they change, we get only a small snapshot of development.
In addition, the small scale is potentially problematic because many theoretical
models behave in a qualitatively different way depending on the amount and
complexity of data (Frank, Tenenbaum, & Gibson, 2013; McClelland, 2009;
Qian & Aslin, 2014; Shiffrin, 2010). Aslin (2014) has also noted that stimulus
control in developmental studies may actually be problematic. We may be
underestimating what children can learn by using oversimplified experimental
stimuli: These controlled stimuli deconfound potential sources of statistical
information in learning, allowing causal conclusions to be drawn, but this may
make the task much more difficult than it is in the real world where multiple
correlated factors offer complementary cues for children to learn the structure of
the world (see Shukla, White, & Aslin, 2011). The result is that we may well
endorse the wrong learning model because it explains the laboratory data well,
but is more complex than is needed to explain learning in the statistically rich real
world.
A considerable amount of developmental research has now come out of the
laboratory. Infants are now wired with cameras to take regular snapshots of
the visual information available to them across development in their real world
experiences (Aslin, 2009; Fausey, Jayaraman, & Smith, 2016; Pereira, Smith, & Yu,
2014). LENA™ recording devices are attached to children to record the richness
of their linguistic environments and to evaluate the effect of linguistic environment
on vocabulary growth (VanDam et al., 2016; Weisleder & Fernald, 2013). In one
prominent early example, the Speechome Project, an entire house was wired
to record 200,000+ hours of audio and video from one child’s first three years of
life (Roy, Frank, DeCamp, Miller, & Roy, 2015). Tablet-based learning games are
now being designed to collect theoretically constraining data as children are playing
them all over the world (e.g. Frank, Sugarman, Horowitz, Lewis, & Yurovsky,
2016; Pelz, Yung, & Kidd, 2015).
A second prime example of both new data capture methods and data scale
advancing theory is in visual attention. A core theoretical issue surrounds
identification performance as a function of target rarity in visual search, but the
number of trials required to get stable estimates in the laboratory is unrealistic.
Mitroff et al. (2015) opted instead to take a Big Data approach to the problem by
turning visual search into a mobile phone game called “Airport Scanner.” In the
game, participants act the part of a TSA baggage screener searching for prohibited
items as simulated luggage passes through an x-ray scanner. Participants respond
on the touchscreen, and the list of allowed and prohibited items grows as they
continue to play.
Mitroff et al. (2015) analyzed data from the first billion trials of visual search
from the game, making new discoveries about how rare targets are processed
when they are presented with common foils, something that would never have
been possible in the laboratory. Wolfe (1998) had previously analyzed 1 million
visual search trials from across 2,500 experimental sessions which took over
10 years to collect. In contrast, Airport Scanner collects over 1 million trials
each day, and the rate is increasing as the game gains popularity. In addition
to answering theoretically important questions in visual attention and memory,
Mitroff et al.’s example has practical implications for visual detection of rare
targets in applied settings, such as radiologists searching for malignant tumors on
mammograms. Furthermore, data from the game have the potential to give very
detailed information about how people become expert in detection tasks
been adopting more techniques from machine learning and network sciences.4
One concern that accompanies this adoption is that the bulk of current machine
learning approaches to Big Data are primarily concerned with detecting and
predicting patterns, but they tend not to explain why patterns exist. Our ultimate
goal in cognitive science is to produce explanatory models. Predictive models
certainly benefit from more data, but it is questionable whether more data helps to
achieve explanatory understanding of a phenomenon more than a well-controlled
laboratory experiment.
Hence, development of new methods of inquiry from Big Data based on cogni-
tive theory is a priority area of research, and has already seen considerable progress
leading to new tools. Liberman (2014) has compared the advent of such tools in
this century to the inventions of the telescope and microscope in the seventeenth
century. But Big Data and data mining tools on their own are of limited use for
establishing explanatory theories; Picasso had famously noted the same issue about
computers: “But they are useless. They can only give answers.” Big Data in no way
obviates the need for foundational theories based on careful laboratory experimen-
tation. Data mining and experimentation in cognitive science will continue to be
iteratively reinforcing one another, allowing us to generate and answer hypotheses
at a greater resolution, and to draw conclusions at a greater scope.
Acknowledgments
This work was supported by NSF BCS-1056744 and IES R305A150546.
Notes
1 And I don’t use the term “exponential” here simply for emphasis—the amount
of digital information available currently doubles every two years, following
Moore’s Law (Gantz & Reinsel, 2012).
2 However, both the Facebook and OKCupid massive experiments resulted in
significant backlash and ethical complaints.
3 The Large Hadron Collider generates roughly two petabytes of data per second,
but only a small amount is captured and stored.
4 “Drawing Causal Inference from Big Data” was the 2015 Sackler Symposium
organized by the National Academy of Sciences.
References
Aslin, R. N. (2009). How infants view natural scenes gathered from a head-mounted
camera. Optometry and Vision Science: Official publication of the American Academy of
Optometry, 86(6), 561.
Aslin, R. N. (2014). Infant learning: Historical, conceptual, and methodological
challenges. Infancy, 19(1), 2–27.
Bakshy, E., Rosenn, I., Marlow, C., & Adamic, L. (2012). The role of social networks
in information diffusion. In Proceedings of the 21st International Conference on World Wide
Web (pp. 519–528). ACM.
Balota, D. A., Yap, M. J., Hutchison, K. A., & Cortese, M. J. (2012). Megastudies. Visual
word recognition volume 1: Models and methods, orthography and phonology. New York, NY:
Psychology Press, 90–115.
Carvalho, P. F., Braithwaite, D. W., de Leeuw, J. R., Motz, B. A., & Goldstone, R. L.
(2016). An in vivo study of self-regulated study sequencing in introductory psychology
courses. PLoS One 11(3): e0152115.
Chen, R., & Sakamoto, Y. (2013). Perspective matters: Sharing of crisis information
in social media. In System Sciences (HICSS), 2013 46th Hawaii International Conference
(pp. 2033–2041). IEEE.
Crump, M. J., McDonnell, J. V., & Gureckis, T. M. (2013). Evaluating Amazon’s
Mechanical Turk as a tool for experimental behavioral research. PLoS One, 8(3), e57410.
Dufau, S., Duñabeitia, J. A., Moret-Tatay, C., McGonigal, A., Peeters, D., Alario, F. X.,
... & Ktori, M. (2011). Smart phone, smart science: How the use of smartphones can
revolutionize research in cognitive science. PLoS One, 6(9), e24974.
Estes, W. K. (1975). Some targets for mathematical psychology. Journal of Mathematical
Psychology, 12(3), 263–282.
Fausey, C. M., Jayaraman, S., & Smith, L. B. (2016). From faces to hands: Changing visual
input in the first two years. Cognition, 152, 101–107.
Fox, A., Hearst, M. A., & Chi, M. T. H. (Eds.) Proceedings of the First ACM Conference on
Learning At Scale, L@S 2014, March 2014.
Frank, M. C., Sugarman, E., Horowitz, A. C., Lewis, M. L., & Yurovsky, D. (2016). Using
tablets to collect data from young children. Journal of Cognition and Development, 17(1),
1–17.
Frank, M. C., Tenenbaum, J. B., & Gibson, E. (2013). Learning and long-term retention
of large-scale artificial languages. PLoS One, 8(1), e52500.
Fusaroli, R., Perlman, M., Mislove, A., Paxton, A., Matlock, T., & Dale, R. (2015).
Timescales of massive human entrainment. PLoS One, 10(4), e0122742.
Gantz, J., & Reinsel, D. (2012). The digital universe in 2020: Big data, bigger digital
shadows, and biggest growth in the far east. IDC iView: IDC analyze the future, 2007,
1–16.
Goldstone, R. L., & Lupyan, G. (2016). Harvesting naturally occurring data to reveal
principles of cognition. Topics in Cognitive Science., 8(3), 548–568.
Griffiths, T. L. (2015). Manifesto for a new (computational) cognitive revolution. Cogni-
tion, 135, 21–23.
Griffiths, T. L., Abbott, J. T., & Hsu, A. S. (2016). Exploring human cognition using large
image databases. Topics in Cognitive Science, 8(3), 569–588.
Gureckis, T. M., Martin, J., McDonnell, J., Rich, A. S., Markant, D., Coenen,
A., ... & Chan, P. (2015). psiTurk: An open-source framework for conducting
replicable behavioral experiments online. Behavior Research Methods, 1–14. doi:
10.3758/s13428-015-0642-8.
Hu, H., Wen, Y., Chua, T. S., & Li, X. (2014). Toward scalable systems for big data analytics:
A technology tutorial. IEEE Access, 2, 652–687.
Khosla, A., Raju, A. S., Torralba, A., & Oliva, A. (2015). Understanding and predicting
image memorability at a large scale. In Proceedings of the IEEE International Conference on
Computer Vision (pp. 2390–2398).
Van Essen, D. C., Ugurbil, K., Auerbach, E., Barch, D., Behrens, T. E. J., Bucholz, R.,
... Della Penna, S. (2012). The Human Connectome Project: A data acquisition
perspective. Neuroimage, 62(4), 2222–2231.
Weisleder, A., & Fernald, A. (2013). Talking to children matters: Early language experience
strengthens processing and builds vocabulary. Psychological Science, 24(11), 2143–2152.
Wolfe, J. M. (1998). What can 1 million trials tell us about visual search? Psychological
Science, 9(1), 33–39.
Wu, S., Hofman, J. M., Mason, W. A., & Watts, D. J. (2011). Who says what to
whom on Twitter. In Proceedings of the 20th International Conference on World Wide
Web (pp. 705–714). ACM.
2
SEQUENTIAL BAYESIAN UPDATING
FOR BIG DATA
Zita Oravecz,
Matt Huentelman,
and Joachim Vandekerckhove
Abstract
The velocity, volume, and variety of Big Data present both challenges and opportunities
for cognitive science. We introduce sequential Bayesian updating as a tool to mine these
three core properties. In the Bayesian approach, we summarize the current state of
knowledge regarding parameters in terms of their posterior distributions, and use these
as prior distributions when new data become available. Crucially, we construct posterior
distributions in such a way that we avoid having to repeat computing the likelihood of old
data as new data become available, allowing the propagation of information without great
computational demand. As a result, these Bayesian methods allow continuous inference
on voluminous information streams in a timely manner. We illustrate the advantages of
sequential Bayesian updating with data from the MindCrowd project, in which crowd-sourced
data are used to study Alzheimer’s dementia. We fit an extended LATER (“Linear Approach
to Threshold with Ergodic Rate”) model to reaction time data from the project in order
to separate two distinct aspects of cognitive functioning: speed of information accumulation
and caution.
Introduction
The Big Data era offers multiple sources of data, with measurements that contain
a variety of information in large volumes. For example, neuroimaging data from
a participant might be complemented with a battery of personality tests and a set
of cognitive-behavioral data. At the same time, with brain imaging equipment
more widely accessible the number of participants is unlikely to remain limited
to a handful per study. These advancements allow us to investigate cognitive
phenomena from various angles, and the synthesis of these perspectives requires
highly complex models. Cognitive science is slated to update its set of methods to
foster a more sophisticated, systematic study of human cognition.
Carlin, Stern, Dunson, Vehtari, & Rubin, 2014; Kruschke, 2014; Lee &
Wagenmakers, 2013), which is rising in popularity, holds particular promise for
the future of Big Data.
The most fundamental difference between the frequentist and the Bayesian
schools lies in the use and interpretation of uncertainty—possibly the most central
issue in statistical inference. In classical statistics (null hypothesis significance
testing/NHST, α and p-values, and confidence intervals), the thought process of
inference starts with an existing hypothesis—usually, the null hypothesis H0 . The
classical reasoning goes: “assuming that the null hypothesis is true, how surprising
are the data I have observed?” The word “surprising” in this context has a very
specific meaning. It means “the probability of a set of observations that is at least as
extreme as the real data”. In the case of a common t-test where the null hypothesis
is that a difference truly is zero but the observation is td , the surprise is given by the
probability of observing a t statistic that is at least as far away from zero as td (i.e.
larger than td if td was positive, and smaller if it was negative). If this probability is
small, then the data are considered to be very surprising, or unlikely, “under the
null,” and the null hypothesis is rejected in favor of the alternative hypothesis H A .
This conditional probability of certain constellations of data given a specific model
(H0 ) is commonly known as the p-value.
One common counterargument to this line of reasoning is that just because the
data are unlikely under H0 does not imply that they are likely under any other hypothesis—it
is possible for data to simply be unlikely under all hypotheses that are being consid-
ered. This argument is somewhat counterintuitive because it is tempting to think
that the probabilities under consideration should sum up to one. A counterexample
is easy to construct. Consider a fictional person K who plays the lottery:
same settings. The observed data are then interpreted against the backdrop of this
population of hypothetical data in order to determine how surprising the outcome
was. The inferred hypothesis itself does not bear any probabilistic meaning: In
the classical sense parameters and hypotheses are fixed, meaning that there exists
a “true” parameter value, an exact value for a parameter that is waiting to be
found. The only probabilistic statements made are about data: How likely were
the data, and if we collect more data and compute confidence intervals, what
are the probabilistic properties of our conclusions?2 It is tempting to invert the
probabilistic statement and make it about the underlying truth rather than about
the data (e.g. “what is the probability that H A is true,” or “what is the probability
that the results are due to chance,” or “what is the probability these results will
reappear in a replication?”); however, such statements can only be evaluated with
the use of Bayes’ rule (see below).
Big Data applications in some sense preempt thoughts of hypothetical
datasets—we have a large amount of data at hand and the size of the sample often
approaches that of the population. Therefore in these settings it is more coherent
to assume that the data are fixed and we compute the probability distributions of
parameter values based on the information contained by all data available at present.
Moreover, a common goal in Big Data analysis is to make predictions about
future trends. Frequentist inference can only assign probabilities to random
events and to long-run frequencies, and is not equipped to make statements
that are conditioned on past data. In fact, by relying on frequentist inference
“one would not be allowed to model business forecast, industrial processes,
demographics patterns, or for that matter real-life sample surveys, all of which
involve uncertainties that cannot simply be represented by physical randomization”
(Gelman, 2006: 149). To summarize, with Bayesian modeling uncertainty can
be directly addressed in terms of probability statements. To further illustrate the
advantages of Bayesian modeling, we first review some of its basic principles.
\[
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}, \tag{1}
\]
where θ stands for the vector of all parameters in the model and D denotes the data.
The left-hand side is referred to as the posterior distribution. p(D|θ ) is the likelihood
of the data D given θ . The second factor p(θ) in the numerator is the prior
distribution on θ , which incorporates prior information on the parameter of interest
and formalizes the current state of our knowledge of the parameters (before having
seen the current data, but after having seen all past data). The denominator, p(D),
is the probability of the data averaged over all models under consideration. It does
not depend on the model parameters and serves as a normalization constant in the
equation above. The posterior distribution can often be obtained using only the
repeated application of Bayes’ rule (Equation 1) and the law of total probability:
\[
p(a) = \int_{B} p(a \mid b)\, p(b)\, db, \tag{2}
\]
where B is the domain of the random variable b. For example, Equation 2 can be
used to obtain $p(D) = \int_{\Theta} p(D \mid \theta)\, p(\theta)\, d\theta$.
FIGURE 2.1 Sequential updating of the conditional posterior distribution of a
parameter µ. The parameter µ was simulated to be 5, and the probability density
function of the parameter given all the available data is updated with some number of
participants at a time (the cumulative total, from 9 to 300, is given on the horizontal axis).
The distribution concentrates around the true value as N increases.
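As a concrete companion to Figure 2.1, a minimal sketch of this kind of updating is given below, assuming Python/NumPy and an observation variance of 1 (both choices are ours, for illustration only): a conjugate Normal prior on the mean µ is updated one wave of simulated participants at a time, with wave sizes chosen so that the cumulative totals match the ticks on the figure's horizontal axis.

```python
import numpy as np

def update_normal_mean(m, v, batch, obs_var=1.0):
    """Conjugate update of a Normal(m, v) prior on the mean mu, given a batch of
    observations with known observation variance (precisions simply add)."""
    post_prec = 1.0 / v + len(batch) / obs_var
    post_mean = (m / v + batch.sum() / obs_var) / post_prec
    return post_mean, 1.0 / post_prec          # posterior mean and variance

rng = np.random.default_rng(0)
true_mu = 5.0                                  # the simulated value in Figure 2.1
m, v = 0.0, 100.0                              # vague initial prior
waves = [9, 6, 15, 45, 75, 75, 75]             # cumulative totals: 9, 15, 30, 75, 150, 225, 300
for n in waves:
    batch = rng.normal(true_mu, 1.0, size=n)   # a new wave of simulated participants
    m, v = update_normal_mean(m, v, batch)     # yesterday's posterior is today's prior
    print(f"posterior for mu: N({m:.3f}, sd = {np.sqrt(v):.3f})")
```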
Finally, the prior can be set to a Beta distribution with shape parameters α and β:
\[
p(\pi) = \mathrm{Beta}(\alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, \pi^{\alpha-1} (1 - \pi)^{\beta-1}. \tag{3}
\]
The mean of this prior distribution is $\alpha/(\alpha+\beta)$. In order to allow all possible values of
the rate π to be a priori equally likely, set α = β = 1, implying a prior mean of 0.5.
These elements can be combined to compute the posterior distribution of
π given the data. To simplify this calculation, isolate all factors that contain the
parameter π and collect the rest in a scale factor S that is independent of the rate π:
\[
\begin{aligned}
p(\pi \mid C, N) &= \frac{\binom{N}{C}\,\pi^{C}(1-\pi)^{N-C}\left[\frac{1}{B(\alpha,\beta)}\,\pi^{\alpha-1}(1-\pi)^{\beta-1}\right]}{\int_0^1 P(C, N \mid \pi)\, p(\pi)\, d\pi} \\
&= S\,\pi^{C}(1-\pi)^{N-C}\,\pi^{\alpha-1}(1-\pi)^{\beta-1} \\
&= S\,\pi^{C+\alpha-1}(1-\pi)^{N-C+\beta-1}.
\end{aligned}
\]
Now use the knowledge that the posterior distribution must be a proper
distribution (i.e. it must integrate to 1), so that S can be determined as that unique
value that ensures propriety. We exploit the similarity to the binomial distribution
to obtain:
\[
p(\pi \mid C, N) = \binom{N+\alpha+\beta-2}{C+\alpha-1}\,\pi^{C+\alpha-1}(1-\pi)^{N-C+\beta-1},
\]
data (C + C ′ , N + N ′ ) as if there had only ever been one round of tastings. The
Bayesian method of sequential updating is coherent in this sense: Datasets can be
partitioned into smaller parts and yet contribute to the posterior distribution with
equal validity.
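A minimal sketch of this bookkeeping, assuming Python and made-up counts C and N, illustrates the coherence property: updating on all the data at once, or in two successive batches with the first posterior serving as the prior for the second, yields exactly the same Beta posterior.

```python
def update_beta(alpha, beta, C, N):
    """Conjugate Beta-binomial update: C successes out of N new trials."""
    return alpha + C, beta + N - C

a, b = 1, 1                                    # uniform Beta(1, 1) prior, prior mean 0.5

# All data at once ...
a_all, b_all = update_beta(a, b, C=7, N=10)

# ... or the same data split into two batches, with the first posterior
# serving as the prior for the second batch.
a1, b1 = update_beta(a, b, C=3, N=4)
a2, b2 = update_beta(a1, b1, C=4, N=6)

assert (a_all, b_all) == (a2, b2)              # identical posterior: Beta(8, 4)
print("posterior mean:", a2 / (a2 + b2))       # (C + alpha) / (N + alpha + beta)
```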
We also note here that sequential updating does not always lead to an analytically
tractable solution. The example above has the special property that the prior
distribution of the parameter of interest (the Beta prior for the rate parameter π )
is of the same distributional form as the posterior distribution. This property is
called conjugacy; information from the data enters into the Beta distribution by
changing the parameters of the Beta prior, but not its parametric form. Many
simple and moderately complex problems can be described in terms of conjugate
priors and likelihoods. For models where the conjugacy property is not met,
non-parametric techniques have to be applied to summarize information in the
posterior distribution. Our example application will have conjugate properties,
and we provide further information on non-parametric modeling in the Discussion
section.
Volume
One can think of Big Data simply as a large dataset that is infeasible to analyze
at once on the available hardware. Through SBU, one can partition a large
dataset into smaller, more manageable batches, perform model fitting on those
sequentially, using each batch’s posterior distribution as a prior for the next batch.
This procedure avoids having to store large datasets in memory at any given time.
Velocity
Bayesian decision rules are by default sequential in nature, which makes them
suitable for handling Big Data streams. Unlike the frequentist paradigm, Bayesian
methods allow for inferences and decisions to be made at any arbitrary point in the
data stream, without loss of consistency. Information about past data is kept online
by means of the posterior distributions of the model parameters that sufficiently
summarize the data generation process. The likelihood only needs to be calculated
for the new data point to update the model parameters’ posteriors. We will focus
on cases where data are streaming continuously and a relatively complex model is
fit to the data. These principles scale seamlessly and can be applied where a large
volume of data is analyzed with complex models.
Variety
Big Data means a lot of information coming in from different sources. One
needs complex models to combine different sources of information (see van der
Linden, 2007, for a general method for combining information across sources). For
example, often not only neuroimaging data are collected, but several behavioral
measures are available (e.g. the Human Connectome Project). In such a case,
one could combine a neural model describing functional magnetic resonance
imaging data with a cognitive model describing behavioral data (see Turner,
Forstmann, Wagenmakers, Sederberg, & Steyvers, 2013, for an application in
cognitive neuroscience). Off-the-shelf software packages are not ready to make
inferences with such novel, complex models, whereas Bayesian tools make it
possible to fit practically any model regardless of complexity.
goal is to collect data from one million people and select various profiles. Then in a
second phase more intensive cognitive testing will be carried out, complemented by
DNA sampling and additional demographic questions. MindCrowd was launched
in April of 2013 and has recruited over 40,000 test takers who have completed
both tasks and answered at least 80 percent of the demographic questions. The
analyses presented here are based on 22,246 participants whose data were available
at the time of writing. With sequential Bayesian updating, inference regarding
substantively interesting parameters can be kept up to date in a continuous fashion,
adding only newly arriving data to a prior that is itself based on previous data.
This means, for example, that when the last responses (the ones closer to the one
million mark) arrive, computing the posterior distribution will be fast.
\[
z_{pi} \sim N(\nu_p, 1), \tag{4}
\]
where ∼ is the common notation used to indicate that the variable on the left
side is a draw from the distribution on the right side. The predicted response time
at trial i is then $t_{pi} = \theta_p / z_{pi}$; that is, the person-specific caution $\theta_p$ divided by the
person-specific rate on the ith trial, $z_{pi}$.
Rearranging this expression yields $z_{pi} = \theta_p / t_{pi}$, which by Equation 4 follows a
Gaussian distribution with mean $\nu_p$ and variance 1. It further follows that
\[
\frac{z_{pi}}{\theta_p} = \frac{1}{t_{pi}} \sim N\!\left(\frac{\nu_p}{\theta_p}, \frac{1}{\theta_p^2}\right), \tag{5}
\]
where $\nu_p$ remains the accretion rate parameter for person p, capturing their
information processing speed, and $\theta_p$ is the threshold parameter implying their
caution in responding.
In what follows, we will apply a regression structure to the accretion rate
parameter in order to quantify between-person differences in speed of information
processing (see, e.g. Vandekerckhove, Tuerlinckx, & Lee, 2011, on hierarchical
Bayesian approaches to cognitive models). To the best of our knowledge this is the
first application of a hierarchical Bayesian LATER model. The caution parameter,
$\theta$, is positive by definition—and is closely related to the precision of the accretion
distribution—so we choose a gamma distribution on the square of caution, $\theta_p^2$, to
inherit the conjugacy of that distribution:
FIGURE 2.2 Graphical illustration of the LATER model: evidence rises along the
accumulation dimension at accretion rate $\nu_p$ until it reaches the threshold $\theta_p$, which
determines the latency $t_{pi}$ (horizontal axis: time in seconds).
Study Design
We analyzed the reaction time data of N = 21,947 participants (each providing at
most five valid trials) from the MindCrowd project. While the original sample size
was slightly larger (22,246) we discarded data from participants whose covariate
information was missing. We also omitted reaction times above 5 seconds or below
180 ms, which are unrealistic for a simple vigilance task. As part of the project
several covariates are collected. From this pool, we chose the following variables
for inclusion in our analysis: Age, gender, and whether the participant or a family
member5 had been diagnosed with AD. Our interest is in the effect of the presence
of AD on the speed of information processing, and its possible interaction with
age. The hierarchical LATER model we construct for this purpose is very similar
to a classical regression model, with the main difference being that the predicted
distribution of the data is not a normal distribution, but rather the distribution of
RTs as predicted by a LATER model. The “target” of the regression analysis is
therefore not the mean of a normal distribution but the accretion rate of a LATER
process. For illustration, we write out the mean of the person-specific information
accumulation rates (ν p ) from Equation 7 as a function of age, sex, AD diagnosis
and the interaction of age and AD diagnosis, and the corresponding regression
terms:
\[
x_p \beta = \beta_0 + \beta_1\,\mathrm{AGE}_p + \beta_2\,\mathrm{SEX}_p + \beta_3\,\mathrm{ALZ}_p + \beta_4\,\mathrm{AGE}_p\,\mathrm{ALZ}_p. \tag{8}
\]
The key regression equation (the mean in Equation 7; worked out in Equation 8),
together with Equations 5, 6, and 7 completes our model.
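To make the generative structure of Equations 4–8 concrete, the sketch below simulates reaction times from a model of this form; it is an illustration under assumed settings (the regression weights, covariate ranges, residual SD, and Gamma parameters for $\theta_p^2$ are placeholders, not values reported in the chapter).

```python
import numpy as np

rng = np.random.default_rng(1)

# All numerical settings below (weights, covariate ranges, residual SD, Gamma
# parameters for theta^2) are placeholders for illustration, not fitted values.
beta = np.array([3.0, -0.01, 0.1, -0.2, -0.005])   # beta_0..beta_4: intercept, AGE, SEX, ALZ, AGE*ALZ
sigma = 0.5                                        # residual SD of the accretion rates
n_people, n_trials = 5, 5                          # each participant gives at most five trials

age = rng.uniform(18, 80, n_people)
sex = rng.integers(0, 2, n_people)
alz = rng.integers(0, 2, n_people)
X = np.column_stack([np.ones(n_people), age, sex, alz, age * alz])   # design matrix of Equation 8

nu = rng.normal(X @ beta, sigma)                   # person-specific accretion rates
theta = np.sqrt(rng.gamma(20.0, 0.05, n_people))   # caution: gamma prior placed on theta^2

rt = np.empty((n_people, n_trials))
for p in range(n_people):
    z = rng.normal(nu[p], 1.0, n_trials)           # trial-wise accretion draws (Equation 4)
    z = np.abs(z)                                  # crude guard: with realistic nu, non-positive draws are rare
    rt[p] = theta[p] / z                           # predicted latency t_pi = theta_p / z_pi (seconds)
print(np.round(rt, 3))
```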
For carrying out the analysis, we specify prior distributions on the parameters
in Equations 6, 7, and 8 (i.e. for the regression terms β, for the inverse of the
residual variance σ −2 , and for the person-specific caution θ p ). The parametric
forms of these priors (namely, the multivariate normal distribution and the gamma
distribution) are chosen to be conjugate with the Gaussian likelihood of the data.
The sequential Bayesian updating then proceeds as follows: As described above,
we specify standard non-informative prior distributions for the first batch of data.
We then obtain posterior samples from JAGS. Once JAGS returns the results, we
summarize these samples in terms of the conditional posterior distributions of the
parameters of interest. More specifically, for the regression terms, we calculate the
mean vector and the covariance matrix of the multivariate normal distribution
based on the posterior samples. The mean vector expresses our best current state
of knowledge on the regression terms, the variances on the diagonal quantify
the uncertainty in these, and the covariances in the off-diagonal positions capture
possible trade-offs due to correlation in the covariates. These posterior summary
statistics sufficiently summarize our knowledge on the parameter given the data, up
to a small computational error due to deriving these posterior summaries through
sampling with JAGS, instead of deriving them analytically. The same principle
applies for the residual precision parameter, $\sigma^{-2}$, in terms of the shape and rate
parameters $(s_\sigma, r_\sigma)$ of its Gamma distribution. Finally, we plug these estimated distributions in
as priors for the next batch of data.
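A sketch of that summarization step is given below, assuming the sampler output is available as NumPy arrays (the function name and the stand-in samples are ours): the regression-weight draws are collapsed into a posterior mean vector and covariance matrix for the multivariate normal prior, and the residual-precision draws are moment-matched to a Gamma with shape and rate $(s_\sigma, r_\sigma)$.

```python
import numpy as np

def summarize_for_next_batch(beta_samples, prec_samples):
    """Collapse MCMC output into sufficient statistics of conjugate priors for the
    next batch: a multivariate normal for the regression weights and a gamma
    (shape s_sigma, rate r_sigma) for the residual precision, by moment matching."""
    mu = beta_samples.mean(axis=0)              # posterior mean vector of the weights
    cov = np.cov(beta_samples, rowvar=False)    # covariances capture trade-offs between covariates
    m, v = prec_samples.mean(), prec_samples.var()
    s_sigma, r_sigma = m**2 / v, m / v          # gamma with mean m and variance v
    return (mu, cov), (s_sigma, r_sigma)

# Stand-ins for sampler output: 1,500 retained draws of five regression weights
# and of the residual precision (the real draws would come from JAGS).
rng = np.random.default_rng(2)
beta_samples = rng.normal([0.5, -0.02, 0.1, 0.0, 0.0], 0.05, size=(1500, 5))
prec_samples = rng.gamma(50.0, 0.1, size=1500)
(prior_mu, prior_cov), (s_sigma, r_sigma) = summarize_for_next_batch(beta_samples, prec_samples)
```

Only these few summary statistics need to be carried between batches, which is what keeps the memory footprint small.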
In the current analysis we use exclusively conjugate priors (i.e. where the
parametric form of the prior on the parameter combined with the likelihood of
the model results in a conditional posterior distribution of the same parametric
form but with updated parameters based on the data). However, not all models
can be formulated by relying only on conjugate priors. In these cases, conjugacy
can be forced with the use of non-parametric methods, but this is beyond
the scope of the current chapter (but see the Discussion section for further
guidelines).
The analyses presented here were run on a desktop computer with a 3.40 GHz
CPU and 16 GB RAM. While in principle, in this phase of the project (with
N = 21,947), we could have analyzed the entire dataset on this machine, for
the purposes of demonstration we divided the full dataset into 40 batches. In a
later phase of the MindCrowd project the sample size will increase substantially,
to an expected one million participants, in which case—due to the desktop
computer’s RAM limitations—batch processing will be required rather than
optional.
We implemented the model in JAGS using a homegrown MATLAB interface.6
The analysis took approximately 10 minutes to run. From the first batch of data,
parameters were estimated by running five chains with 1,500 iterations each,
discarding the first 1,000 samples as burnin.7
From the second batch until the last, we ran five chains with 800 iterations
each, from which 500 were discarded as burnin. The reason why we chose a
shorter adaptation for the second batch was that the algorithm was now better
“informed” by the prior distributions of the parameters inferred from the first
batch, so that we expect faster convergence to the highest posterior density area.
The final sample size was 1,500 samples. Convergence of the five chains was tested
with the R̂ statistic. R̂ was lower than 1.01 for all parameters (with the standard
criterion being R̂ < 1.1).
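For reference, a basic (non-split) version of the R̂ computation can be sketched as follows; this is an illustrative implementation, not the diagnostic code used for the analysis, and the simulated chains stand in for real sampler output.

```python
import numpy as np

def rhat(chains):
    """Basic (non-split) Gelman-Rubin potential scale reduction factor.
    `chains` has shape (n_chains, n_samples) for a single parameter."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled estimate of the posterior variance
    return np.sqrt(var_hat / W)

# Five chains of 300 retained samples each (800 iterations minus 500 burnin),
# here simulated from a common distribution, so R-hat should be close to 1.
chains = np.random.default_rng(4).normal(0.0, 1.0, size=(5, 300))
print(rhat(chains))    # values below about 1.01 indicate good mixing; the standard cutoff is 1.1
```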
FIGURE 2.3 Sequence of conditional posterior distributions for the regression
coefficient parameter $\beta_4$—the weight of the AD-by-age interaction regressed on the
speed of information processing parameter—plotted as a function of the number of
available batches of 549 participants (vertical axis: regression weight $\beta_4$). As each
batch of participants is added to the analysis, our knowledge of $\beta_4$ is updated and the
posterior standard deviation decreases (e.g. from $\sigma = 0.187$ with only the first few
batches) while the posterior mean converges to a stable value (in this case, near 0).
TABLE 2.1 Summary of the regression weights where response speed was modeled
with the LATER model and the information accumulation rate was regressed on
age, gender, AD in the family, and the interaction of age and AD in the family.
Mean and SD are the posterior mean and standard deviation. CrI stands for “credibility interval.”
$\mathrm{BF}_{\mathrm{ALT}}$ is the Savage–Dickey approximation (Verdinelli & Wasserman, 1995) to the Bayes factor
in favor of the (alternative) hypothesis that $\beta \neq 0$.
presence of AD—in fact, the Bayes factor for β4 shows moderate evidence in favor
of the null hypothesis of no effect (1/0.21 = 4.76).
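For readers unfamiliar with the Savage–Dickey device mentioned in the table notes, the sketch below shows how such a Bayes factor can be read off posterior samples; the N(0, 1) prior on the weight and the posterior draws are hypothetical stand-ins, not the MindCrowd results.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

def savage_dickey_bf_alt(post_samples, prior_sd, at=0.0):
    """Savage-Dickey estimate of the Bayes factor in favor of beta != 0:
    the ratio of prior to posterior density at the null value."""
    post_density = gaussian_kde(post_samples)(at)[0]   # posterior density at zero, via a KDE of the samples
    prior_density = norm.pdf(at, loc=0.0, scale=prior_sd)
    return prior_density / post_density

# Fabricated posterior draws for a weight concentrated near zero, with a
# hypothetical N(0, 1) prior on that weight; values well below 1 favor the null.
post = np.random.default_rng(5).normal(0.01, 0.07, size=1500)
print(savage_dickey_bf_alt(post, prior_sd=1.0))
```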
Especially in the case of Big Data, it is important to draw a distinction
between statistical significance—the ability of the data to help us distinguish effects
from non-effects—and practical significance—the degree to which an extant effect
influences people. In the current dataset, the difference in (mean) predicted RT
between a male participant (group mean accretion rate $\bar{\nu}_m$) and a female participant
(group mean accretion rate $\bar{\nu}_f$) is approximately $\bar{\theta}\,(1/\bar{\nu}_f - 1/\bar{\nu}_m)$, which with our
results works out to about 10 ms. Hence, while the difference between these two
groups is detectable (the Bayes factor against the null is more than 1000:1), it is small
enough that any daily-life consequences are difficult to imagine.
To summarize, our cognitive model allows us to cast light on the information
processing system that is assumed to underlie the simple RT measures. The
process model identifies a parameter of interest—in this case, a rate of information
accumulation—and inferences can then be drawn in terms of this parameter.
Caution in the responding is factored into the inference, treated as a nuisance
variable, and separated from the accumulation rate.
model for PAL and the model for the RTs (e.g. Pe, Vandekerckhove, & Kuppens,
2013; Vandekerckhove, 2014). Combining these joint modeling techniques that
were originally developed in psychometrics (e.g. van der Linden, 2007) with
Bayesian modeling can offer a flexible unified framework for drawing inference
from data that would classically be analyzed separately, thereby partially addressing
the “variety” aspect of Big Data challenges.
Discussion
In this chapter, we discussed one way in which Bayesian methods can contribute
to the challenges introduced by Big Data. A core aspect of Bayesian inference—the
sequential updating that is at the heart of the Bayesian paradigm—allows researchers
to partition large datasets so that they become more manageable under hardware
constraints. We have focused on one specific method for exploiting the sequential
updating property, namely using conjugate priors, which lead to closed-form
posterior distributions that can be characterized with only a few sufficient statistics,
and in turn serve as priors for future data. This particular method is limited because
it requires conjugacy of the focal parameters. However, we were able to apply it to a
non-trivial cognitive model (the hierarchical LATER model) and draw interesting
process-level conclusions. For more complex models, priors and posteriors could
be expressed in non-parametric ways (Gershman & Blei, 2012). This method solves
the need for conjugacy, but will itself introduce new computational challenges.
The sequential updating method is computationally efficient because it collapses
posterior samples into sufficient statistics, but also because the informative priors
that are generated from the first batches of data speed up convergence of later
batches.
Our method has also assumed a certain stationarity of data; that is, it was
assumed that as the data came in, the true parameters of the model did not
change. However, there are many real-world scenarios—ranging from negotiation
theory, learning psychology, and EEG analysis, through epidemiology, ecology, and
climate change, to industrial process control, fraud detection, and stock market
predictions—where the stationarity assumption would clearly be violated and the
academic interest would be in change point detection (e.g. Adams & MacKay, 2007).
Within our current approach, a change point detection model would require
that the parameters relevant to the regime switches are explicitly included, and
posteriors over these parameters can be updated as data become available.
Moving beyond sequential updating, there exist other methods for obtaining
samples of a posterior distribution using large datasets. For example, the Consensus
Monte Carlo Algorithm (Scott, Blocker, & Bonassi, 2013) or the Embarrassingly
Parallel, Asymptotically Exact MCMC algorithm (Neiswanger, Wang, & Xing,
2014) both rely on distributing the computational load across a larger hardware
infrastructure and reducing the total “wall time” required for an analysis. The
method we present here has the advantage of not requiring a large dedicated
computation infrastructure and can be run on a regular desktop computer, with
the size of the data affecting only the computation time.
All of these methods rely on Bayesian inference. As we have argued extensively,
we believe that Bayesian methods are not only useful and feasible in a Big Data
context, but are in fact superior from a philosophical point of view. Classical
inference is well known to generate bias against the null hypothesis, and this bias
increases with increasing data size. Recent attempts to reform statistical practice in
the psychological sciences (Cumming, 2014) shift the focus of statistical analysis
to parameter estimation, but with this there remain several major issues. First,
the estimation framework is still based in classical statistics and does not take into
account the prior distribution of parameters of interest. Second, it is not clear
if inference is possible at all in this framework, and “dichotomous thinking”
is discouraged entirely (though it is tempting to wrongly interpret confidence
intervals as posterior distributions, and to decide that an effect is present if the
interval does not contain zero). These recent recommendations seem to us to
throw the dichotomous baby away with the NHST bathwater, while a Bayesian
approach (as we and many others have demonstrated) is logically consistent, does
allow for inferential statements, and allows one to collect evidence in favor of a null
hypothesis. Especially in the case of Big Data, these are highly desirable qualities
that are not shared by classical methods, and we recommend Bayesian inference as
a default method.
Acknowledgments
JV was supported by grant #48192 from the John Templeton Foundation and by
NSF grant #1230118 from the Methods, Measurements, and Statistics panel.
Notes
1 If our example seems far fetched, consider that the existence of a counterex-
ample means one of two things. Either (a) p-values are never a logically valid
method of inference, or (b) p-values are sometimes a logically valid method
of inference, but there exist necessary boundary conditions on the use of
p-values that must be tested whenever p-values are applied. No such boundary
conditions are known to the authors.
2 These long-run guarantees of classical methods have issues in their own
right, which we will not discuss here. More on problematic interpretation
of confidence intervals can be found in Hoekstra, Morey, Rouder, &
Wagenmakers (2014).
3 This expression is due to Lindley (2004).
4 The variance of the Beta distribution is defined as $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$, which becomes
$\frac{(\alpha+C)(\beta+N-C)}{(\alpha+\beta+N)^2(\alpha+\beta+N+1)}$ for the posterior. The posterior uncertainty regarding the parameter is hence
a strictly decreasing function of the added sample size N.
5 The phrasing of the item was: “Have you, a sibling, or one of your parents
been diagnosed with Alzheimer’s disease? Yes, No, NA.” The variable only
took two values in the current dataset: 1—a first degree family member has AD
(including respondent, around 4,000 respondents); 0—there is no first degree
relative with AD in the family.
6 All scripts are available from https://git.psu.edu/zzo1/ChapterSBU. MindCrowd’s
data are proprietary.
7 These burnin samples serve two purposes. First, when a model is initialized,
JAGS enters an adaptive mode during which the sampling algorithm modifies
its behaviour for increased efficiency. These changes in the algorithm violate
the detailed balance requirement of Markov chains, so that there is no guarantee
that the samples generated during this phase converge to the desired stationary distribution.
Second, to ensure that the samplers are exploring the posterior parameter space
sufficiently, the sampling algorithm is restarted several times with dispersed
starting values and it is checked whether all these solutions converge into
the same area (as opposed to being stuck in a local optimum, for example).
Posterior inference should be based on samples that form a Markov chain and
are converged into the same area and have “forgotten” their initial values. In
the current analysis the samplers are run independently five times (i.e. we run
five chains). The independence of these MCMC chains implies that they can
be run in parallel, which we do.
References
Adams, R. P., & MacKay, D. J. (2007). Bayesian online changepoint detection. arXiv
preprint arXiv:0710.3742.
Clarke, C. (1991). Invited commentary on R. A. Fisher. American Journal of Epi-
demiology, 134(12), 1371–1374. Retrieved from http://aje.oxfordjournals.org/content/
134/12/1371.short.
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7–29.
Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver and Boyd.
Fox, C., & Roberts, S. (2012). A tutorial on variational Bayes. Artificial Intelligence Review,
38, 85–95.
Gelman, A. (2006). The boxer, the wrestler, and the coin flip: A paradox of robust Bayesian
inference and belief functions. American Statistician, 60, 146–150.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014).
Bayesian Data Analysis (3rd edn.). Boca Raton, FL.: Chapman & Hall/CRC.
Gershman, S. J., & Blei, D. M. (2012). A tutorial on Bayesian nonparametric models.
Journal of Mathematical Psychology, 56, 1–12.
Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (2014). Robust
misinterpretation of confidence intervals. Psychological Bulletin and Review, 21(5),
1157–1164.
Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge, UK: Cambridge
University Press.
Kruschke, J. K. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS and Stan (2nd
edn.). London: Academic Press/Elsevier.
Lee, M. D., & Wagenmakers, E.-J. (2013). Bayesian cognitive modeling: A practical course. New York: Cambridge University Press.
Lindley, D. (1972). Bayesian statistics: A review. Philadelphia: Society for Industrial and
Applied Mathematics.
Lindley, D. (1993). The analysis of experimental data: The appreciation of tea and wine.
Teaching Statistics, 15(1), 22–25.
Lindley, D. (2004). That wretched prior. Significance, 1(2), 85–87.
Lunn, D., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS—a Bayesian
modelling framework: concepts, structure, and extensibility. Statistics and Computing, 10,
325–337.
Neiswanger, W., Wang, C., & Xing, E. A. (2014). Asymptotically exact, embarrassingly
parallel MCMC. Retrieved from http://arxiv.org/pdf/1311.4780v2.pdf, 1311.4780.
Ostwald, D., Kirilina, E., Starke, L., & Blankenburg, F. (2014). A tutorial on variational
Bayes for latent linear stochastic time-series models. Journal of Mathematical Psychology, 60,
1–19.
Pe, M. L., Vandekerckhove, J., & Kuppens, P. (2013). A diffusion model account of the
relationship between the emotional flanker task and rumination and depression. Emotion,
13(4), 739.
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using
Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical
Computing (DSC 2003) (pp. 20–22).
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108.
Ratcliff, R., Carpenter, R. H. S., & Reddi, B. A. J. (2001). Putting noise into
neurophysiological models of simple decision making. Nature Neuroscience, 6, 336–337.
Reddi, B. A., & Carpenter, R. H. S. (2000). The influence of urgency on decision time.
Nature Neuroscience, 3, 827–830.
Rouder, J. N., & Batchelder, W. H. (1998). Multinomial models for measuring storage
and retrieval processes in paired associate learning. In C. E. Dowling, F. S. Roberts,
& P. Theuns (Eds.), Recent progress in mathematical psychology (pp. 195–226). New York:
Psychology Press.
Scott, S. L., Blocker, A. W., & Bonassi, F. V. (2013). Bayes and Big Data: The consensus
Monte Carlo algorithm. In Paper presented at the 2013 EFab@Bayes 250 Workshop.
Stan Development Team. (2013). Stan: A C++ Library for Probability and Sampling, Version
1.3. Retrieved from http://mc-stan.org/.
Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, 37(1),
1–2.
Turner, B. M., Forstmann, B. U., Wagenmakers, E. J., Brown, S. D., Sederberg, P. B., &
Steyvers, M. (2013). A Bayesian framework for simultaneously modeling neural and
behavioral data. NeuroImage, 72, 193–206.
Vandekerckhove, J. (2014). A cognitive latent variable model for the simultaneous analysis
of behavioral and personality data. Journal of Mathematical Psychology, 60, 58–71.
Vandekerckhove, J., Matzke, D., & Wagenmakers, E.-J. (in press). Model comparison and the
principle of parsimony. Oxford: Oxford University Press.
Vandekerckhove, J., Tuerlinckx, F., & Lee, M. (2011). Hierarchical diffusion models for
two-choice response times. Psychological Methods, 16, 44–62.
van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy
on test items. Psychometrika, 72(3), 287–308.
Verdinelli, I., & Wasserman, L. (1995). Computing Bayes factors using a generalization
of the Savage-Dickey density ratio. Journal of the American Statistical Association, 90(430),
614–618.
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values.
Psychonomic Bulletin & Review, 14, 779–804.
Zhu, J., Chen, J., & Hu, W. (2014). Big learning with Bayesian methods. Retrieved from
http://arxiv.org/pdf/1411.6370.pdf, 1411.6370v1.
3
PREDICTING AND IMPROVING
MEMORY RETENTION
Psychological Theory Matters in the Big Data Era
M. C. Mozer and R. V. Lindsey
Abstract
Cognitive psychology has long had the aim of understanding mechanisms of human
memory, with the expectation that such an understanding will yield practical techniques
that support learning and retention. Although research insights have given rise to
qualitative advice for students and educators, we present a complementary approach that
offers quantitative, individualized guidance. Our approach synthesizes theory-driven and
data-driven methodologies. Psychological theory characterizes basic mechanisms of human
memory shared among members of a population, whereas machine-learning techniques use
observations from a population to make inferences about individuals. We argue that despite
the power of Big Data, psychological theory provides essential constraints on models. We
present models of forgetting and spaced practice that predict the dynamic time-varying
knowledge state of an individual student for specific material. We incorporate these models
into retrieval-practice software to assist students in reviewing previously mastered material.
In an ambitious year-long intervention in a middle-school foreign language course, we
demonstrate the value of systematic review on long-term educational outcomes, but more
specifically, the value of adaptive review that leverages data from a population of learners to
personalize recommendations based on an individual’s study history and past performance.
Introduction
Human memory is fragile. The initial acquisition of knowledge is slow and
effortful. And once mastery is achieved, the knowledge must be exercised
periodically to mitigate forgetting. Understanding the cognitive mechanisms of
memory has been a longstanding goal of modern experimental psychology, with
the hope that such an understanding will lead to practical techniques that support
learning and retention. Our specific aim is to go beyond the traditional qualitative
Knowledge State
In traditional electronic tutors (e.g. Anderson, Conrad, & Corbett, 1989;
Koedinger & Corbett, 2006; Martin & VanLehn, 1995), the modeling of a
student’s knowledge state has depended on extensive handcrafted analysis of the
teaching domain and a process of iterative evaluation and refinement. We present
a complementary approach to inferring knowledge state that is fully automatic
and independent of the content domain. We hope to apply this approach in any
domain whose mastery can be decomposed into distinct, separable components of
knowledge or items to be learned (VanLehn, Jordan, & Litman, 2007). Applicable domains range from the concrete to the abstract, and from the perceptual to the conceptual. Inferring a student's time-varying knowledge state is challenging for at least three reasons:
1. Observations of human behavior provide only weak clues about the knowledge state.
Consider fact learning, the domain which will be a focus of this chapter. If
a student performs cued recall trials, as when flashcards are used for drilling,
each retrieval attempt provides one bit of information: whether it is successful
or not. From this meager signal, we hope to infer quantitative properties
of the memory trace, such as its strength, which we can then use to predict
whether the memory will be accessible in an hour, a week, or a month. Other
behavioral indicators can be diagnostic, including response latency (Lindsey,
Lewis, Pashler, & Mozer, 2010; Mettler & Kellman, 2014; Mettler, Massey,
& Kellman, 2011) and confidence (Metcalfe & Finn, 2011), but they are also
weak predictors.
2. Knowledge state is a consequence of the entire study history, i.e. when in the past the
specific item and related items were studied, the manner and duration of study,
and previous performance indicators. Study history is particularly relevant
because all forms of learning show forgetting over time, and unfamiliar and
newly acquired information is particularly vulnerable (Rohrer & Taylor, 2006;
Wixted, 2004). Further, the temporal distribution of practice has an impact
on the durability of learning for various types of material (Cepeda, Pashler,
Vul, & Wixted, 2006; Rickard, Lau, & Pashler, 2008).
3. Individual differences are ubiquitous in every form of learning. Taking an example
from fact learning (Kang, Lindsey, Mozer, & Pashler, 2014), Figure 3.1(a)
shows extreme variability in a population of 60 participants. Foreign-language
vocabulary was studied at four precisely scheduled times over a four-week
period. A cued-recall exam was administered after an eight-week retention
period. The exam scores are highly dispersed despite the uniformity in
materials and training schedules. In addition to inter-student variability,
inter-item variability is a consideration. Learning a foreign vocabulary word
may be easy if it is similar to its English equivalent, but hard if it is similar
to a different English word. Figure 3.1(b) shows the distribution of recall
accuracy for 120 Lithuanian-English vocabulary items averaged over a set of
students (Grimaldi, Pyc, & Rawson, 2010). With a single round of study, an
exam administered several minutes later suggests that items show a tremendous range in difficulty (krantas→shore was learned by only 3 percent of students; lova→bed was learned by 76 percent of students).
FIGURE 3.1 (a) Histogram of proportion of items reported correctly on a cued recall task for a population of 60 students learning 32 Japanese-English vocabulary pairs (Kang et al., 2014); (b) Histogram of proportion of subjects correctly reporting an item on a cued recall task for a population of 120 Lithuanian-English vocabulary pairs being learned by roughly 80 students (Grimaldi et al., 2010)
Forgetting can be dramatic. In medical education, for example, roughly a quarter to a third of basic science knowledge is estimated to be lost within one year, more than 50 percent by the next year (Custers, 2010), and 80–85 percent after 25 years (Custers & ten Cate, 2011).
Forgetting is often assessed by teaching participants some material in a single
session and then assessing cued-recall accuracy following some lag t. The
probability of recalling the studied material decays according to a generalized
power-law as a function of t (Wixted & Carpenter, 2007),
$$\Pr(\text{recall}) = m(1 + ht)^{-f},$$
where m, h, and f are constants interpreted as the degree of initial learning
(0 ≤ m ≤ 1), a scaling factor on time (h > 0), and the memory decay exponent
( f > 0), respectively. Figure 3.2(a) shows recall accuracy at increasing study-test
lags from an experiment by Cepeda, Vul, Rohrer, Wixted, & Pashler (2008) in
which participants were taught a set of obscure facts. The solid line in the figure is
the best fitting power-law forgetting curve.
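To make the forgetting curve concrete, the following minimal Python sketch evaluates the generalized power-law at a range of study-test lags. The parameter values for m, h, and f are arbitrary placeholders chosen for illustration, not the fits shown in Figure 3.2(a).

```python
import numpy as np

def p_recall(t, m=0.9, h=0.2, f=0.6):
    """Generalized power-law forgetting curve: Pr(recall) = m * (1 + h*t)**(-f).

    The parameter values are illustrative placeholders, not fitted values.
    """
    return m * (1.0 + h * np.asarray(t, dtype=float)) ** (-f)

# Predicted recall probability at increasing study-test lags (days)
lags = np.array([1, 7, 14, 21, 35, 70, 105])
print(dict(zip(lags.tolist(), np.round(p_recall(lags), 3).tolist())))
```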
When material is studied over several sessions, the temporal distribution of study
influences the durability of memory. This phenomenon, known as the spacing effect,
is observed for a variety materials—skills and concepts as well as facts (Carpenter,
Cepeda, Rohrer, Kang, & Pashler, 2012)—and has been identified as showing
great promise for improving educational outcomes (Dunlosky, Rawson, Marsh,
Nathan, & Willingham, 2013).
The spacing effect is typically studied via a controlled experimental paradigm in
which participants are asked to study unfamiliar paired associates in two sessions.
The time between sessions, known as the intersession interval or ISI, is manipulated
across participants. Some time after the second study session, a cued-recall test is
administered to the participants. The lag between the second session and the test
is known as the retention interval or RI. Cepeda et al. (2008) conducted a study in
which RIs were varied from seven to 350 days and ISIs were varied from minutes
to 105 days. Their results are depicted as circles connected with dashed lines in
Figure 3.2(b). (The solid lines are model fits, which we discuss shortly.) For each
RI, Cepeda et al. (2008) find an inverted-U relationship between ISI and retention.
The left edge of the graph corresponds to massed practice, the situation in which
session two immediately follows session one. Recall accuracy rises dramatically
as the ISI increases, reaching a peak and then falling off gradually. The optimal
ISI—the peak of each curve—increases with the RI. Note that for educationally
relevant RIs on the order of weeks and months, the Cepeda et al. (2009) result
indicates that the effect of spacing can be tremendous: Optimal spacing can double
retention over massed practice. Cepeda, Pashler, Vul, Wixted & Rohrer (2006)
conducted a meta-analysis of the literature to determine the functional relationship
between RI and optimal ISI. We augmented their dataset with the more recent
results of Cepeda et al. (2008) and observed an approximately power-function
relationship between RI and optimal ISI (both in days):
$$\text{Optimal ISI} = 0.097\,\text{RI}^{\,0.812}.$$
FIGURE 3.2 (a) Recall accuracy as a function of lag between study and test for a set of obscure facts; circles represent data provided by Cepeda et al. (2008) and the solid line is the best power-law fit. (b) Recall accuracy as a function of the temporal spacing between two study sessions (on the abscissa) and the retention period between the second study session and a final test (curves shown for 7-, 35-, and 70-day retention intervals). Circles represent data provided by Cepeda et al. (2008), and solid lines are fits of the model MCM, as described in the text.
This relationship suggests that as material becomes more durable with practice,
ISIs should increase, supporting even longer ISIs in the future, consistent with
an expanding-spacing schedule as qualitatively embodied in the Leitner method
(Leitner, 1972) and SuperMemo (Woźniak, 1990).
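The fitted power-function relationship above is straightforward to apply directly. The sketch below computes the approximately optimal ISI for several retention intervals; the RIs are chosen only for illustration.

```python
def optimal_isi(retention_interval_days):
    """Approximate optimal intersession interval (days) as a power function
    of the retention interval (days), per the relationship given in the text."""
    return 0.097 * retention_interval_days ** 0.812

for ri in (7, 35, 70, 350):
    print(f"RI = {ri:4d} days -> optimal ISI ~ {optimal_isi(ri):5.1f} days")
```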
Many models have been proposed to explain the mechanisms of the spacing
effect (e.g. Benjamin & Tullis, 2010; Kording, Tenenbaum, & Shadmehr, 2007;
Mozer, Pashler, Cepeda, Lindsey, & Vul, 2009; Pavlik & Anderson, 2005a;
Raaijmakers, 2003; Staddon, Chelaru, & Higa, 2002). These models have been
validated through their ability to account for experimental results, such as those
in Figure 3.2, which represent mean performance of a population of individuals
studying a set of items. Although the models can readily be fit to an individual’s
performance for a set of items (e.g. Figure 3.1(a)) or a population’s performance
for a specific item (e.g. Figure 3.1(b)), it is a serious challenge in practice to use
these models to predict an individual’s memory retention for a specific item.
We will shortly describe an approach to making such individualized predictions.
Our approach incorporates key insights from two computational models, ACT-R
(Pavlik & Anderson, 2005a) and MCM (Mozer et al., 2009), into a Big Data
technique that leverages population data to make individualized predictions. First,
we present a brief overview of the two models.
ACT-R
ACT-R (Anderson et al., 2004) is an influential cognitive architecture whose declarative memory module is often used to account for explicit recall following study. ACT-R assumes that a separate trace is laid down each time an item is studied, and that the trace decays according to a power law, $t^{-d}$, where $t$ is the age of the memory and $d$ is the power-law decay for that trace. Following $n$ study episodes, the activation for an item, $m_n$, combines the trace strengths of the individual study episodes according to:
$$m_n = \ln\!\left(\sum_{k=1}^{n} b_k\, t_k^{-d_k}\right) + \beta, \qquad (1)$$
where $t_k$ and $d_k$ refer to the age and decay associated with trace $k$, and $\beta$ is a student- and/or item-specific parameter that influences memory strength. The variable $b_k$ reflects the salience of the $k$th study session (Pavlik, 2007): larger values of $b_k$ correspond to cases where, for example, the participant self-tested and therefore exerted more effort.
To explain spacing effects, Pavlik and Anderson (2005b, 2008) made an
additional assumption: The decay for the trace formed on study trial k depends
on the item’s activation at the point when study occurs:
$$d_k = c\,e^{m_{k-1}} + \alpha,$$
where $c$ and $\alpha$ are constants. If study trial $k$ occurs shortly after the previous trial, the item’s activation, $m_{k-1}$, is large, which causes trace $k$ to decay rapidly. Increasing spacing therefore benefits memory by slowing the decay of trace $k$.
However, this benefit is traded off against a cost incurred by the aging of traces $1, \ldots, k-1$, which causes them to decay further.
The probability of recall is monotonically related to activation, m:
$$\Pr(\text{recall}) = \frac{1}{1 + e^{(\tau - m)/s}},$$
where τ and s are additional parameters. In total, the variant of the model
described here has six free parameters.
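The ACT-R activation and recall equations above can be sketched in a few lines of Python. The function name and all numeric values below are illustrative assumptions, not parameters reported in the literature; the decay-updating rule is omitted for brevity.

```python
import numpy as np

def actr_recall(ages, decays, saliences, beta=0.0, tau=0.0, s=1.0):
    """Sketch of the ACT-R declarative memory equations described above:
    activation m_n = ln(sum_k b_k * t_k**(-d_k)) + beta, and
    Pr(recall) = 1 / (1 + exp((tau - m_n) / s)).

    ages, decays, and saliences hold one entry per past study episode.
    """
    ages = np.asarray(ages, dtype=float)
    m_n = np.log(np.sum(np.asarray(saliences) * ages ** (-np.asarray(decays)))) + beta
    return 1.0 / (1.0 + np.exp((tau - m_n) / s))

# Two study episodes, 10 days and 1 day ago, with equal salience
print(actr_recall(ages=[10.0, 1.0], decays=[0.3, 0.5], saliences=[1.0, 1.0]))
```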
Pavlik and Anderson (2008) use ACT-R activation predictions in a heuristic
algorithm for within-session scheduling of trial order and trial type (i.e. whether an
item is merely studied, or whether it is first tested and then studied). They assume
a fixed spacing between initial study and subsequent review. Thus, their algorithm
reduces to determining how to best allocate a finite amount of time within a
session. Although they show an effect of the algorithm used for within-session
scheduling, between-session manipulation has a greater impact on long-term
retention (Cepeda, Pashler, Vul, & Wixted, 2006).
MCM
ACT-R is predicated on the assumption that memory decay follows a power function.
We developed an alternative model, the Multiscale Context Model or MCM (Mozer
et al., 2009), which provides a mechanistic basis for the power function. Adopting
key ideas from previous models of the spacing effect (Kording et al., 2007;
Raaijmakers, 2003; Staddon et al., 2002) MCM proposes that each time an item is
studied, it is stored in multiple item-specific memory traces that decay at different
rates. Although each trace has an exponential decay, the sum of the traces decays
approximately as a power function of time. Specifically, trace i, denoted xi , decays
over time according to
$$x_i(t) = x_i(0)\, e^{-t/\tau_i},$$
where $t$ is the time elapsed since the trace was last updated and $\tau_i$ is the decay time constant, ordered such that successive traces have slower decays, i.e. $\tau_i < \tau_{i+1}$. Traces $1, \ldots, k$ are combined to form a net trace strength, $s_k$, via a weighted average:
$$s_k = \frac{1}{\Gamma_k} \sum_{i=1}^{k} \gamma_i x_i, \qquad \text{where } \Gamma_k = \sum_{i=1}^{k} \gamma_i.$$
When an item is studied, each trace is incremented in proportion to the shortfall of the current net strength,
$$\Delta x_i = \epsilon\,(1 - s_i).$$
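The bookkeeping just described can be sketched as follows. This is a toy illustration of exponentially decaying traces, the weighted-average net strength, and the study-time increment, not the full MCM; the gamma weights, time constants, and learning rate are placeholders.

```python
import numpy as np

def mcm_step(x, gammas, taus, gap, eps=0.4):
    """Toy sketch of the MCM bookkeeping described above (not the full model).
    Traces decay exponentially at timescale tau_i over the gap since the last
    session, net strength s_i is a gamma-weighted average of traces 1..i, and a
    study event increments each trace by eps * (1 - s_i)."""
    x = np.asarray(x, dtype=float) * np.exp(-gap / np.asarray(taus, dtype=float))
    s = np.cumsum(gammas * x) / np.cumsum(gammas)   # s_i over traces 1..i
    x = x + eps * (1.0 - s)                         # delta x_i at study time
    return x, s[-1]

gammas = np.array([1.0, 0.5, 0.25])      # placeholder trace weights
taus = np.array([1.0, 7.0, 49.0])        # fast, medium, slow timescales (days)
x = np.zeros(3)
for gap in (0.0, 1.0, 6.0):              # sessions on days 0, 1, and 7
    x, strength = mcm_step(x, gammas, taus, gap)
print(np.round(x, 3), round(float(strength), 3))
```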
Collaborative Filtering
In the last several years, an alternative approach to predicting learners’ performance
has emerged from the machine-learning community. This approach essentially sets
psychological theory aside in favor of mining large datasets collected as students
solve problems. To give a sense of the size of these datasets, we note that the
Khan Academy had over 10 million unique users per month and delivered over
300 million lessons at the end of 2013 (Mullany, 2013). Figure 3.3(a) visualizes
a dataset in which students solve problems over time. Each cell in the tensor
corresponds to a specific student solving a particular problem at a given moment
in time. The contents of a cell indicate whether an attempt was made and if so
whether it was successful. Most of the cells in the tensor are empty. A collaborative
filtering approach involves filling in those missing cells. While the tensor may have
no data about student S solving problem P given a particular study history, the
tensor will have data about other similar students solving P, or about S solving
problems similar to P. Filling in the tensor also serves to make predictions about
future points in time.
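In code, the observed entries of such a tensor can be held sparsely, for example keyed by (student, item, time). Everything below, including the identifiers and outcomes, is invented purely for illustration.

```python
# Sparse representation of the student x item x time tensor in Figure 3.3(a):
# observed cells only, keyed by (student, item, timestamp in days),
# with value 1/0 for a correct/incorrect retrieval attempt.
observations = {
    ("s001", "hablar->to speak", 1.5): 1,
    ("s001", "hablar->to speak", 9.0): 0,
    ("s002", "hablar->to speak", 2.0): 1,
}

def study_history(student, item):
    """Return this student-item's observed attempts, sorted by time."""
    return sorted((t, r) for (s, i, t), r in observations.items()
                  if s == student and i == item)

print(study_history("s001", "hablar->to speak"))
```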
Collaborative filtering has a long history in e-commerce recommender systems;
for example, Amazon wishes to recommend products to customers and Netflix
wishes to recommend movies to its subscribers. The problems are all formally equivalent: simply replace “student” in Figure 3.3(a) with “customer” or “subscriber,” and replace “problem” with “product” or “movie.” The twist
that distinguishes memory prediction from product or movie prediction is our
understanding of the temporal dynamics of human memory. These dynamics are precisely what psychological theories of learning and forgetting characterize.
FIGURE 3.3 (a) A tensor representing students × problems × time. Each cell describes
a student’s attempt to solve a problem at a particular moment in time. (b) A naive
graphical model representing a teaching paradigm. The nodes represent random
variables and the arrows indicate conditional dependencies among the variables. Given
a student with knowledge state K t at time t, and a problem Pt posed to that student,
Rt denotes the response the student will produce. The evolution of the student’s
knowledge state will depend on the problem that was just posed. This framework can
be used to predict student responses or to determine an optimal sequence of problems
for a particular student given a specific learning objective.
Item response theory (IRT) specifies the probabilistic relationship between the predicted response $R_{si}$ of student $s$ to item $i$, the student’s latent ability $a_s$, and the item’s latent difficulty $d_i$. The simplest instantiation of IRT, called the one-parameter logistic (1PL) model because it has one item-associated parameter, is:
$$\Pr(R_{si} = 1) = \frac{1}{1 + \exp(d_i - a_s)}. \qquad (2)$$
A more elaborate version of IRT, called the 3PL model, includes an item-associated
parameter for guessing, but that is mostly useful for multiple-choice questions
where the probability of correctly guessing is non-negligible. Another variant,
called the 2PL model, includes parameters that allow for student ability to have
a non-uniform influence across items. (In simulations we shortly describe, we
explored the 2PL model but found that it provided no benefit over the 1PL
model.) Finally, there are more sophisticated latent-trait models that characterize
each student and item not as a scalar but as a feature vector (Koren, Bell, & Volinsky,
2009).
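The 1PL model of Equation 2 is a one-liner; the ability and difficulty values below are illustrative placeholders.

```python
import numpy as np

def irt_1pl(ability, difficulty):
    """One-parameter logistic (1PL) model: Pr(R_si = 1) = 1 / (1 + exp(d_i - a_s))."""
    return 1.0 / (1.0 + np.exp(difficulty - ability))

# A strong and a weak student on an easy and a hard item (made-up traits)
for a in (1.0, -0.5):
    for d in (-1.0, 1.5):
        print(f"a_s={a:+.1f}, d_i={d:+.1f} -> Pr(correct) = {irt_1pl(a, d):.2f}")
```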
Candidate Models
The forgetting curve we described earlier, based on the generalized power law,
is supported by data from populations of students and/or populations of items.
The forgetting curve cannot be measured for an individual item and a particular
student—which we’ll refer to as a student-item—due to the observer effect and the
all-or-none nature of forgetting. Regardless, we will assume the functional form
of the curve for a student-item is the same, yielding:
$$\Pr(R_{si} = 1) = m(1 + h t_{si})^{-f}, \qquad (3)$$
where $R_{si}$ is the response of student $s$ to item $i$ following retention interval $t_{si}$. This model has free parameters $m$, $h$, and $f$, as described earlier.
We would like to incorporate the notion that forgetting depends on latent
IRT-like traits that characterize student ability and item difficulty. Because the
critical parameter of forgetting is the memory decay exponent, f , and because
f changes as a function of skill and practice (Pavlik & Anderson, 2005a), we can
individuate forgetting for each student-item by determining the decay exponent
in Equation 3 from latent IRT-like traits:
$$f = \exp(\tilde{d}_i - \tilde{a}_s). \qquad (4)$$
We add the tilde to $\tilde{a}_s$ and $\tilde{d}_i$ to indicate that these ability and difficulty parameters are not the same as those in Equation 2. Using the exponential function ensures that $f$ is non-negative.
Another alternative we consider is individuating the degree-of-learning
parameter in Equation 3 as follows:
$$m = \frac{1}{1 + \exp(d_i - a_s)}. \qquad (5)$$
With this definition of m, Equation 3 simplifies to 1PL IRT (Equation 2) at t = 0.
For t > 0, recall probability decays as a power-law function of time.
We explored five models that predict recall accuracy of specific student-items:
(1) IRT, the 1PL IRT model (Equation 2); (2) MEMORY, a power-law forgetting
model with population-wide parameters (Equation 3); (3) HYBRID DECAY, a
power-law forgetting model with decay rates based on latent student and item
traits (Equations 3 and 4); (4) HYBRID SCALE, a power-law forgetting model with
the degree-of-learning based on latent student and item traits (Equations 3 and 5);
and (5) HYBRID BOTH, a power-law forgetting model that individuates both the
decay rate and degree-of-learning (Equations 3, 4, and 5). The Appendix describes
a hierarchical Bayesian inference method for parameter estimation and obtaining
model predictions.
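As a rough sketch of how the hybrid predictions combine these pieces, the function below plugs a 1PL-style degree of learning (Equation 5) and a trait-based decay exponent (Equation 4) into the power-law forgetting curve (Equation 3). The trait values and scaling factor are placeholders, and the exact parameterization of Equation 4 shown here is an assumption rather than a quotation of the fitted model.

```python
import numpy as np

def hybrid_both(t, a, d, a_tilde, d_tilde, h=0.1):
    """Sketch of a HYBRID BOTH-style prediction: degree of learning m from
    1PL-style traits (Equation 5) and decay exponent f from a second set of
    traits (Equation 4, form assumed), plugged into Equation 3."""
    m = 1.0 / (1.0 + np.exp(d - a))          # Equation 5
    f = np.exp(d_tilde - a_tilde)            # Equation 4 (assumed form)
    return m * (1.0 + h * t) ** (-f)         # Equation 3

# Recall predictions for one student-item at several retention intervals (days)
print([round(float(hybrid_both(t, a=0.5, d=-0.2, a_tilde=0.3, d_tilde=-0.4)), 3)
       for t in (0, 1, 7, 28)])
```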
Simulation Results
We present simulations of our models using data from two previously published
psychological experiments exploring how people learn and forget facts, summa-
rized in Table 3.1. In both experiments, students were trained on a set of items
(cue–response pairs) over multiple rounds of practice. In the first round, the cue
and response were both shown. In subsequent rounds, retrieval practice was given:
Students were asked to produce the appropriate response to each cue. Whether
successful or not, the correct response was then displayed. Following this training
procedure was a retention interval $t_{si}$ specific to each student and each item, after which an exam was administered. The exam yielded the binary recall outcome $r_{si}$ for that student-item.

TABLE 3.1 Summary of the two experiments whose data are used in the simulations

Study name            S1                              S2
Source                Kang et al. (2014)              Cepeda et al. (2008)
Materials             Japanese-English vocabulary     Interesting but obscure facts
# Students            32                              1,354
# Items               60                              32
Rounds of practice    3                               1
Retention intervals   3 min–27 days                   7 sec–53 min
To evaluate the models, we performed 50-fold validation. In each fold, a
random 80 percent of elements of R were used for training and the remaining
20 percent were used for evaluation. Each model generates a prediction,
conditioned on the training data, of recall probability at the exam time tsi , which
can be compared against the observed recall accuracy in the held-out data.
Each model’s capability of discriminating successful from unsuccessful recall
trials was assessed with a signal-detection analysis (Green & Swets, 1966). For
each model, we compute the mean area under the receiver operating characteristic
curve (hereafter, AUC) across validation folds as a measure of the model’s predictive
ability. The measure ranges from 0.5 for random guesses to 1.0 for perfect
predictions. The greater the AUC, the better the model is at predicting a particular
student’s recall success on a specific item after a given lag.
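The validation scheme amounts to repeated random 80/20 splits scored by AUC. The sketch below uses scikit-learn utilities with fake data and a dummy predictor standing in for any of the five models; it illustrates the bookkeeping only, not the models themselves.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)          # fake binary recall outcomes
X = rng.normal(size=(500, 3))             # fake student/item/lag features

def predict(train_idx, test_idx):
    """Placeholder for a fitted model's predicted recall probabilities."""
    return rng.random(len(test_idx))

aucs = []
for train_idx, test_idx in ShuffleSplit(n_splits=50, test_size=0.2,
                                        random_state=0).split(X):
    aucs.append(roc_auc_score(y[test_idx], predict(train_idx, test_idx)))
print(f"mean AUC over folds: {np.mean(aucs):.3f}")
```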
Figures 3.4(a) and (b) summarize the AUC values for Studies S1 and S2,
respectively. The baseline MEMORY model performs poorly ( p < 0.01 for all
pairwise comparisons by a two-tailed t test unless otherwise noted), suggesting
that the other models have succeeded in recovering latent student and item traits
that facilitate inference about the knowledge state of a particular student-item.
The baseline IRT model, which ignores the lag between study and test, does
not perform as well as the latent-state models that incorporate forgetting. The
HYBRID BOTH model does best in S1 and ties for best in S2 , suggesting that
allowing for individual differences both in degree of learning and rate of forgetting
is appropriate. The consistency of results between the two studies is nontrivial considering the vastly different retention intervals examined in the two studies (see Table 3.1).
FIGURE 3.4 Mean AUC values for the five models trained and evaluated on (a) Study
S1 and (b) Study S2 . The error bars indicate a 95 percent confidence interval on
the AUC value over multiple validation folds. Note that the error bars are not useful
for comparing statistical significance of the differences across models, because the
validation folds are matched across models, and the variability due to the fold must be
removed from the error bars.
FIGURE 3.5 Mean AUC values when random items are held out during validation
folds, Study S1
A practical question is whether the models generalize to material they have not yet encountered: when Spanish 1 students begin Spanish 2, can we benefit from the data acquired in the fall to predict their performance on new material?
To model this situation, we conducted a further validation test in which,
instead of holding out random student-item pairs, we held out random items
for all students. Figure 3.5 shows mean AUC values for Study S1 data for the
various models. Performance in this item-generalization task is slightly worse than
performance when the model has familiarity with both the students and the items.
Nonetheless, it appears that the models can make predictions with high accuracy
for new material based on inferences about latent student traits and about other
items.1
To summarize, in this section we demonstrated that systematic individual
(student and item) differences can be discovered and exploited to better predict
a particular student’s retention of a specific item. A model that combines a
psychological theory of forgetting with a collaborative filtering approach to
latent-trait inference yields better predictions than models based purely on
psychological theory or purely on collaborative filtering. However, the datasets
we explored are relatively small—1,920 and 43,328 exam questions. Ridgeway,
Mozer, and Bowles (2016) explore a much larger dataset consisting of 46.3 million
observations collected from 125K students learning foreign language skills with
online training software. Even in this much larger dataset, memory retention is better predicted by a hybrid model than by a purely data-driven approach.2
Furthermore, in naturalistic learning scenarios, students are exposed to material
multiple times, in various contexts, and over arbitrary temporal distributions of
study. The necessity for mining a large dataset becomes clear in such a situation,
but so does the role of psychological theory, as we hope to convince the reader in
the next section.
The simulations just described assume a constrained study schedule: a brief training phase followed by a single retention interval. To predict knowledge state in a more naturalistic setting, we must relax this assumption and allow for an arbitrary study history, defined as zero or more previous exposures at particular points in time.
Extending our modeling approach, we posit that knowledge state is jointly
dependent on factors relating to (1) an item’s latent difficulty, (2) a student’s latent
ability, and (3) the amount, timing, and outcome of past study. We refer to the model by the acronym DASH, which summarizes these three factors: difficulty, ability, and study history.
DASH predicts the likelihood of student $s$ making a correct response on the $k$th trial for item $i$, conditioned on that student-item’s specific study history:
$$\Pr(R_{sik} = 1 \mid a_s, d_i, \mathbf{t}_{si,1:k}, \mathbf{r}_{si,1:k-1}) = \sigma\!\bigl(a_s - d_i + h_\theta(\mathbf{t}_{si,1:k}, \mathbf{r}_{si,1:k-1})\bigr), \qquad (6)$$
$$h_\theta(\mathbf{t}_{si,1:k}, \mathbf{r}_{si,1:k-1}) = \sum_{w=1}^{W} \theta_{2w-1}\log(1 + c_{siw}) + \theta_{2w}\log(1 + n_{siw}), \qquad (7)$$
where $\sigma(\cdot)$ is the logistic function, $w \in \{1, \ldots, W\}$ is an index over time windows, $c_{siw}$ is the number of times student $s$ correctly recalled item $i$ in window $w$ out of $n_{siw}$ attempts, and $\theta$ are window-specific weightings.
statistics of study history that span increasing windows of time. These windows
allow the model to modulate its predictions based on the temporal distribution
of study. Motivated by the diminishing benefit of additional study in ACT-R
(Equation 1), we include a similar log transform in Equation 7.3 Both MCM
and ACT-R modulate the effect of past study based on response outcomes, i.e.
whether the student performed correctly or not on a given trial. This property is
incorporated into Equation 7 via the separation of parameters for counts of total
and correct attempts.
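A minimal sketch of a DASH-style prediction appears below. The window widths and θ values are placeholders, and the parameterization follows the reconstruction in Equations 6–7 rather than the fitted model.

```python
import numpy as np

def dash_predict(a_s, d_i, history, now, windows, theta):
    """Sketch of a DASH-style prediction: a logistic function of student ability,
    item difficulty, and log-transformed counts of correct and total attempts
    within expanding time windows (widths and theta are placeholders)."""
    h = 0.0
    for w, width in enumerate(windows):
        in_w = [(t, r) for t, r in history if now - t <= width]
        n_w = len(in_w)
        c_w = sum(r for _, r in in_w)
        h += theta[2 * w] * np.log(1 + c_w) + theta[2 * w + 1] * np.log(1 + n_w)
    return 1.0 / (1.0 + np.exp(-(a_s - d_i + h)))

history = [(0.0, 1), (2.0, 0), (9.0, 1)]          # (time in days, correct?)
print(dash_predict(a_s=0.2, d_i=0.1, history=history, now=10.0,
                   windows=(1, 7, 30, 365), theta=[0.3] * 8))
```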
Because the memory dynamics of MCM and ACT-R provide only loose inspiration for DASH, we designed two additional variants of DASH that adopt the dynamics of MCM and ACT-R more strictly. The variant we call DASH[MCM] replaces expanding time windows with expanding time constants, which determine the rate of exponential decay of memory traces. The model
assumes that the counts $n_{siw}$ and $c_{siw}$ are incremented at each trial and then decay over time at a timescale-specific exponential rate $\tau_w$. Formally, we use Equation 7 with the counts redefined as:
$$n_{siw} = \sum_{\kappa=1}^{k-1} e^{-(t_{sik} - t_{si\kappa})/\tau_w}, \qquad c_{siw} = \sum_{\kappa=1}^{k-1} r_{si\kappa}\, e^{-(t_{sik} - t_{si\kappa})/\tau_w}. \qquad (8)$$
The variant we call DASH[ACT-R] does not have a fixed number of time windows,
but instead—like ACT-R—allows for the influence of past trials to continuously
decay according to a power-law. DASH[ACT-R] formalizes the effect of study
history to be identical to the memory trace strength of ACT-R (Equation 1):
$$h_\theta = \theta_1 \log\!\left(1 + \sum_{\kappa=1}^{k-1} \theta_{3 + r_{si\kappa}}\,(t_{sik} - t_{si\kappa})^{-\theta_2}\right). \qquad (9)$$
Further details of the modeling and a hierarchical Bayesian scheme for inferring
model parameters are given in Lindsey (2014).
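The DASH[MCM] counts of Equation 8 are simple to compute from a study history; the timescales below are illustrative.

```python
import numpy as np

def decayed_counts(history, now, taus):
    """Sketch of the DASH[MCM] bookkeeping (Equation 8): at each timescale tau_w,
    total and correct attempt counts decay exponentially with the time elapsed
    since each prior trial."""
    times = np.array([t for t, _ in history])
    correct = np.array([r for _, r in history], dtype=float)
    n_w = {tau: float(np.sum(np.exp(-(now - times) / tau))) for tau in taus}
    c_w = {tau: float(np.sum(correct * np.exp(-(now - times) / tau))) for tau in taus}
    return n_w, c_w

history = [(0.0, 1), (2.0, 0), (9.0, 1)]          # (time in days, correct?)
print(decayed_counts(history, now=10.0, taus=(1.0, 7.0, 30.0)))
```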
Experiment 1
Experiment 1 involved 179 third-semester Spanish students, split over six class
periods. The semester covered ten lessons of material. COLT, our retrieval-practice software, incorporated three different schedulers to select material from these lessons for review. The goal of each
scheduler was to make selections that maximize long-term knowledge preservation
given the limited time available for review. The scheduler was varied within
participant by randomly assigning one third of a lesson’s items to each scheduler,
counterbalanced across participants. During review, the schedulers alternated in
selecting items for retrieval practice. Each scheduler selected from among the items
assigned to it, ensuring that all items had equal opportunity and that all schedulers
administered an equal number of review trials.
A massed scheduler selected material from the current lesson. It presented the
item in the current lesson that students had least recently studied. This scheduler
reflects recent educational practice: Prior to the introduction of COLT, alternative
software was used that allowed students to select the lesson they wished to study.
Not surprisingly, given a choice, students focused their effort on preparing for
the imminent end-of-lesson quiz, consistent with the preference for massed study
found by Cohen, Yan, Halamish, and Bjork (2013).
A generic-spaced scheduler selected one previous lesson to review at a spacing
deemed to be optimal for a range of students and a variety of material according
to both empirical studies (Cepeda et al., 2006, 2008) and computational models
(Khajah, Lindsey, & Mozer, 2013; Mozer et al., 2009). On the time frame
of a semester—where material must be retained for one to three months—a
one-week lag between initial study and review obtains near-peak performance for
a range of declarative materials. To achieve this lag, the generic-spaced scheduler
selected review items from the previous lesson, giving priority to the least recently
studied.
A personalized-spaced scheduler used our knowledge-state model, DASH, to
determine the specific item a particular student would most benefit from
reviewing. DASH infers the instantaneous memory strength of each item the
student has studied. Although a knowledge-state model is required to schedule
review optimally, optimal scheduling is computationally intractable because it
requires planning over all possible futures (when and how much a student studies,
including learning that takes place outside the context of COLT, and within
the context of COLT, whether or not retrieval attempts are successful, etc.).
Consequently, a heuristic policy is required for selecting review material. We
chose a threshold-based policy that prioritizes items whose recall probability is
closest to a threshold θ . This heuristic policy is justified by simulation studies
as being close to optimal under a variety of circumstances (Khajah et al.,
2013) and by Bjork’s (1994) notion of desirable difficulty, which suggests that
memory is best served by reviewing material as it is on the verge of being
forgotten.
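The heuristic policy can be sketched in a few lines: among the items a student has studied, choose the one whose model-predicted recall probability lies closest to the threshold. The vocabulary items, probabilities, and threshold value below are invented for illustration.

```python
def select_review_item(recall_probs, threshold=0.5):
    """Threshold-based review policy described above: prioritize the item whose
    predicted recall probability is closest to the threshold, i.e. the item on
    the verge of being forgotten."""
    return min(recall_probs, key=lambda item: abs(recall_probs[item] - threshold))

# Predicted recall probabilities from the knowledge-state model (made up)
recall_probs = {"gato->cat": 0.92, "izquierda->left": 0.48, "lapiz->pencil": 0.15}
print(select_review_item(recall_probs))   # -> "izquierda->left"
```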
As the semester progressed, COLT continually collected data and DASH was
retrained with the complete dataset at regular intervals. The retraining was
sufficiently quick and automatic that the model could use data from students in the
first class period of the day to improve predictions for students in the second class
period. This updating was particularly useful when new material was introduced
and DASH needed to estimate item difficulty. By the semester’s end, COLT had
amassed data from about 600,000 retrieval-practice trials.
FIGURE 3.6 COLT experiment 1. (a) Mean scores on the two cumulative
end-of-semester exams, taken 28 days apart. All error bars indicate ±1 within-student
standard error (Masson & Loftus, 2003). (b) Accumulative prediction error of six
models using the data from the experiment. The models are as follows: A baseline
model that predicts performance from the proportion of correct responses made by
each student, a model based on item-response theory (IRT), a model based on Pavlik
& Anderson’s (2005a, 2005b) ACT-R model, DASH, and two variants of DASH that
adhere more strictly to the tenets of MCM and ACT-R. Error bars indicate ±1 SEM.
Experiment 2
Experiment 1 took place in the fall semester with third-semester Spanish students.
We conducted a follow-up experiment in the next (spring) semester with the
same students, then in their fourth semester of Spanish. (One student of the 179
in experiment 1 did not participate in experiment 2 because of a transfer.) The
semester was organized around eight lessons, followed by two cumulative exams
administered 28 days apart. The two cumulative exams each tested half the course
material, with a randomized split by student.
The key motivations for experiment 2 are as follows.
• In experiment 1, the personalized-review scheduler differed from the other
two schedulers both in its personalization and in its ability to select material
from early in the semester. Because personalized review and long-term review
were conflated, we wished to include a condition in experiment 2 that
involved long-term review but without personalization. We thus incorporated
a random scheduler that drew items uniformly from the set of items that
had been introduced in the course to date. Because the massed scheduler of
experiment 1 performed so poorly, we replaced it with the random scheduler.
• Because the same students participated in experiments 1 and 2, we had
the opportunity to initialize students’ models based on all the data from
experiment 1. The old data provided DASH with fairly strong evidence from
the beginning of the semester about individual student abilities and about the
relationship of study schedule to retention. Given that experiment 2 covered
only eight lessons, versus the ten in experiment 1, this bootstrapping helped
DASH to perform well out of the gate.
FIGURE 3.7 COLT experiment 2. Mean scores on the two cumulative end-of-semester exams. All error bars indicate ±1 within-student standard error (Masson & Loftus,
2003).
Figure 3.7 summarizes the experiment outcome. The bars represent scores in
the three review conditions on the initial and delayed exams. The differences
among conditions are not as stark as we observed in experiment 1, in part because
we eliminated the weak massed condition and in part due to an unanticipated issue
which we address shortly. Nonetheless, on the first exam, the personalized-spaced
scheduler improved retention by 4.8 percent over the generic-spaced scheduler
(t (167) = 3.04, p < 0.01, Cohen’s d = 0.23) and by 3.4 percent over the random
scheduler (t (167) = 2.29, p = 0.02, d = 0.18). Between the two exams, the
forgetting rate is roughly the same in all conditions: 16.7 percent, 16.5 percent, and
16.5 percent for the generic, random, and personalized conditions, respectively.
On the second exam, personalized review boosted retention by 4.6 percent over
generic review (t (166) = 2.27, p = 0.024, d = 0.18) and by 3.1 percent over
random review, although this difference was not statistically reliable (t (166) = 1.64,
p = 0.10, d = 0.13).
At about the time we obtained these results, we discovered a significant problem
with the experimental software. Students did not like to review. In fact, at the
end of experiment 1, an informal survey indicated concern among students that
mandatory review interfered with their weekly quiz performance because they
were not able to spend all their time practicing the new lesson that was the subject
of their weekly quiz. Students wished to mass their study due to the incentive
structure of the course, and they requested a means of opting out of review. We
did not accede to their request; instead, the teacher explained the value of review to
long-term retention. Nonetheless, devious students found a way to avoid review:
Upon logging in, COLT began each session with material from the new lesson.
Students realized that if they regularly closed and reopened their browser windows,
they could avoid review. Word spread throughout the student population, and most
students took advantage of this unintended feature of COLT. The total number of
review trials performed in experiment 2 was a small fraction of the number of
review trials in experiment 1. Consequently, our failure to find large and reliable
differences among the schedulers is mostly due to the fact that students simply did
not review.
One solution might be to analyze the data from only those students who
engaged in a significant number of review trials during the semester. We opted
instead to use data from all students and to examine the relative benefit of the
different review schedulers as a function of the amount of review performed. The
amount of review is quantified as the total number of review trials performed by
a student divided by the total number of items, i.e. the mean number of review
trials. Note, however, that this statistic does not imply that each item was reviewed
the same number of times. For each student, we computed the difference of exam
scores between personalized and generic conditions, and between personalized and
random conditions. We performed a regression on these two measures given the
amount of review. Figure 3.8 shows the regression curves that represent the exam
score differences as a function of mean review trials per item. The regressions were
constrained to have an intercept at 0.0 because the conditions are identical when
no review is included. The data points plotted in Figure 3.8 are averages based
on groups of about ten students who performed similar amounts of review. These
groupings make it easier to interpret the scatterplot, but the raw data were used for
the regression analysis.
Figure 3.8 shows a positive slope for all four regression lines (all reliable
by t tests with p < 0.01), indicating that with more time devoted to review,
the personalized-review scheduler increasingly outperforms the random and
generic-review schedulers. If, for example, students had studied on COLT for an
average of one more review trial per item for each of the 13 weeks in the semester
leading up to exam 1, Figure 3.8 predicts an (absolute) improvement on exam 1
scores of 10.2 percent with personalized-spaced review over generic-spaced review
and 7.2 percent with personalized-spaced review over random review. We wish to
emphasize that we are not simply describing an advantage of review over no review.
Our result suggests that students will score a letter grade higher (7–10 points out
of 100) with time-matched personalized review over the other forms of review.
FIGURE 3.8 COLT experiment 2. Scatterplot for exams 1 and 2 ((a) and (b),
respectively) showing the advantage of personalized-spaced review over random and
generic-spaced review, as a function of the amount of review that a student performed.
The amount of review is summarized in terms of the total number of review trials
during the semester divided by the number of items. Long-dash regression line
indicates the benefit of personalized over random review; short-dash line indicates
the benefit of personalized over generic review.
Discussion
Whereas previous studies offer in-principle evidence that human learning can
be improved by the inclusion and timing of review, our results demonstrate in
practice that integrating personalized-review software into the classroom yields
appreciable improvements in long-term educational outcomes. Our experiment
goes beyond past efforts in its scope: It spans the time frame of a semester, covers
the content of an entire course, and introduces material in a staggered fashion
and in coordination with other course activities. We find it remarkable that the
review manipulation had as large an effect as it did, considering that the duration
of roughly 30 minutes a week was only about 10 percent of the time students were
engaged with the course. The additional, uncontrolled exposure to material from
classroom instruction, homework, and the textbook might well have washed out
the effect of the experimental manipulation. Our experiments go beyond showing
that spaced practice is superior to massed practice: Taken together, experiments 1
and 2 provide strong evidence that personalization of review is superior to other
forms of spaced practice.
Although the outcome of experiment 2 was less impressive than the outcome
of experiment 1, the mere fact that students went out of their way to avoid a
review activity that would promote long-term retention indicates the great need
for encouraging review of previously learned material. One can hardly fault the
students for wishing to avoid an activity they intuited to be detrimental to their
grades. The solution is to better align the students’ goals with the goal of long-term
learning. One method of alignment is to administer only cumulative quizzes. In
principle, there’s no reason to distinguish the quizzes from the retrieval practice
that students perform using COLT, achieving the sort of integration of testing and
learning that educators often seek.
Conclusions
Theory-driven approaches in psychology and cognitive science excel at charac-
terizing the laws and mechanisms of human cognition. Data-driven approaches
from machine learning excel at inferring statistical regularities that describe how
individuals vary within a population. In this chapter, we have argued that in the
domain of learning and memory, a synthesis of theory and data-driven approaches
inherits the strengths of each. Theory-driven approaches characterize the temporal
dynamics of learning and forgetting based on study history and past performance.
Data-driven approaches use data from a population of students learning a collection
of items to make inferences concerning the knowledge state of individual students
for specific items.
The models described in this chapter offer more than qualitative guidance to
students about how to study. In one respect, they go beyond what even a skilled
classroom teacher can offer: They are able to keep track of student knowledge state
at a granularity that is impossible for a teacher who encounters hundreds of students
over the course of a day. A system such as COLT provides an efficient housekeeping
function to ensure that knowledge, once mastered, remains accessible and a part of
each student’s core competency. COLT allows educators to do what they do best:
to motivate and encourage; to help students to acquire facts, concepts, and skills;
and to offer creative tutoring to those who face difficulty. To achieve this sort of
complementarity between electronic tools and educators, a Big Data approach is
essential.
Appendix
TABLE 3.2 Distributional assumptions of the generative Bayesian response models. The HYBRID BOTH model shares the same distributional assumptions as the HYBRID DECAY and HYBRID SCALE models.
where the first term is given by Equations 3 and 5. The effect of the
marginalization of the precision parameters is to tie the traits of different students
together so that they are no longer conditionally independent.
Hyperparameters ψ of the Bayesian models were set so that all the Gamma distributions had shape parameter 1 and scale parameter 0.1. For each run of each model, we combined predictions from across three Markov chains, each with a random starting location. Each chain was run for a burn-in of 1,000 iterations, and then 2,000 more iterations were recorded. To reduce autocorrelation among the samples, we thinned them, keeping every tenth sample.
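The post-processing of each chain reduces to simple slicing, sketched below with a stand-in list of recorded iterations.

```python
def burn_and_thin(samples, burn_in=1000, thin=10):
    """Discard the first `burn_in` iterations of a chain and keep every
    `thin`-th sample of the remainder."""
    return samples[burn_in::thin]

chain = list(range(3000))                 # stand-in for 3,000 chain iterations
print(len(burn_and_thin(chain)))          # -> 200 retained samples
```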
Why did we choose to fit models with hierarchical Bayesian (HB) inference
instead of the more standard maximum likelihood (ML) estimation? The difference
between HB and ML is that HB imposes an additional bias that, in the absence
of strong evidence about a parameter value—say, a student’s ability or an item’s
difficulty—the parameter should be typical of those for other students or other
items. ML does not incorporate this prior belief, and as a result, it is more
susceptible to overfitting a training set. For this reason, we were not surprised
when we tried training models with ML and found they did not perform as well
as with HB.
Acknowledgments
The research was supported by NSF grants SBE-0542013, SMA-1041755, and
SES-1461535 and an NSF Graduate Research Fellowship to R. Lindsey. We thank
Jeff Shroyer for his support in conducting the classroom studies, and Melody
Wisehart and Harold Pashler for providing raw data from their published work
and for their generous guidance in interpreting the spacing literature.
Notes
1 Note that making predictions for new items or new students is principled
within the hierarchical Bayesian modeling framework. From training data, the
models infer not only student or item-specific parameters, but also hyper-
parameters that characterize the population distributions. These population
distributions are used to make predictions for new items and new students.
2 In contrast to the present results, Ridgeway et al. (2016) found no improvement
with the HYBRID BOTH over the HYBRID SCALE model.
3 The counts csiw and n siw are regularized by add-one smoothing, which ensures
that the logarithm terms are finite.
References
Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., & Qin, Y. (2004).
An integrated theory of the mind. Psychological Review, 111, 1036–1060.
Anderson, J. R., Conrad, F. G., & Corbett, A. T. (1989). Skill acquisition and the LISP
tutor. Cognitive Science, 13, 467–506.
Andrade, D. F., & Tavares, H. R. (2005). Item response theory for longitudinal data:
Population parameter estimation. Journal of Multivariate Analysis, 95, 1–22.
Benjamin, A. S., & Tullis, J. (2010). What makes distributed practice effective? Cognitive
Psychology, 61, 228–247.
Bjork, R. (1994). Memory and metamemory considerations in the training of human
beings. In J. Metcalfe & A. Shimamura (Eds.), Metacognition: Knowing about knowing
(pp. 185–205). Cambridge, MA: MIT Press.
Carpenter, S. K., Cepeda, N. J., Rohrer, D., Kang, S. H. K., & Pashler, H. (2012). Using
spacing to enhance diverse forms of learning: Review of recent research and implications
for instruction. Educational Psychology Review, 24, 369–378.
Carpenter, S. K., Pashler, H., & Cepeda, N. (2009). Using tests to enhance 8th grade
students’ retention of U. S. history facts. Applied Cognitive Psychology, 23, 760–771.
Cen, H., Koedinger, K., & Junker, B. (2006). Learning factors analysis—a general method
for cognitive model evaluation and improvement. In Proceedings of the Eighth International
Conference on Intelligent Tutoring Systems.
Cen, H., Koedinger, K., & Junker, B. (2008). Comparing two IRT models for conjunctive
skills. In Woolf, B., Aimeur, E., Njambou, R, & Lajoie, S. (Eds.), Proceedings of the Ninth
International Conference on Intelligent Tutoring Systems.
Cepeda, N. J., Coburn, N., Rohrer, D., Wixted, J. T., Mozer, M. C., & Pashler, H. (2009). Optimizing distributed practice: Theoretical analysis and practical implications. Experimental Psychology, 56, 236–246.
Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed
practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin,
132, 354–380.
Cepeda, N. J., Vul, E., Rohrer, D., Wixted, J. T., & Pashler, H. (2008). Spacing effects in
learning: A temporal ridgeline of optimal retention. Psychological Science, 19, 1095–1102.
Cohen, M. S., Yan, V. X., Halamish, V., & Bjork, R. A. (2013). Do students think
that difficult or valuable materials should be restudied sooner rather than later? Journal of
Experimental Psychology: Learning, Memory, and Cognition, 39(6), 1682–1696.
Metcalfe, J., & Finn, B. (2011). People’s hypercorrection of high confidence errors: Did
they know it all along? Journal of Experimental Psychology: Learning, Memory, and Cognition,
37, 437–448.
Mettler, E., & Kellman, P. J. (2014). Adaptive response-time-based category sequencing in
perceptual learning. Vision Research, 99, 111–123.
Mettler, E., Massey, C., & Kellman, P. J. (2011). Improving adaptive learning technology
through the use of response times. In L. Carlson, C. Holscher, & T. Shipley (Eds.),
Proceedings of the 33rd Annual Conference of the Cognitive Science Society (pp. 2532–2537).
Austin, TX: Cognitive Science Society.
Mozer, M. C., Pashler, H., Cepeda, N., Lindsey, R. V., & Vul, E. (2009). Predicting
the optimal spacing of study: A multiscale context model of memory. In Y. Bengio,
D. Schuurmans, J. Lafferty, C. Williams, & A. Culotta (Eds.), Advances in Neural
Information Processing Systems (Vol. 22, pp. 1321–1329). Boston, MA: MIT Press.
Mullany, A. (2013). A Q&A with Salman Khan. Retrieved December 23, 2014, from
http://live.fastcompany.com/Event/A_QA_With_Salman_Khan.
Patz, R. J., & Junker, B. W. (1999). A straightforward approach to Markov chain Monte
Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24,
146–178.
Pavlik, P. I. (2007). Understanding and applying the dynamics of test practice and study
practice. Instructional Science, 35, 407–441.
Pavlik, P. I., & Anderson, J. R. (2005a). Practice and forgetting effects on vocabulary
memory: An activation-based model of the spacing effect. Cognitive Science, 29(4),
559–586.
Pavlik, P. I., & Anderson, J. (2005b). Practice and forgetting effects on vocabulary memory:
An activation-based model of the spacing effect. Cognitive Science, 29, 559–586.
Pavlik, P. I., & Anderson, J. R. (2008). Using a model to compute the optimal schedule of
practice. Journal of Experimental Psychology: Applied, 14, 101–117.
Pavlik, P. I., Cen, H., & Koedinger, K. (2009). Performance factors analysis—a new
alternative to knowledge tracing. In V. Dimitrova & R. Mizoguchi (Eds.), Proceeding of
the Fourteenth International Conference on Artificial Intelligence in Education. Brighton, UK.
Raaijmakers, J. G. W. (2003). Spacing and repetition effects in human memory: Application
of the SAM model. Cognitive Science, 27, 431–452.
Rickard, T., Lau, J., & Pashler, H. (2008). Spacing and the transition from calculation to
retrieval. Psychonomic Bulletin & Review, 15, 656–661.
Ridgeway, K., Mozer, M. C., & Bowles, A. (2016). Forgetting of foreign-language skills:
A corpus-based analysis of online tutoring software. Cognitive Science Journal. (Accepted
for publication).
Rohrer, D., & Taylor, K. (2006). The effects of overlearning and distributed practice on
the retention of mathematics knowledge. Applied Cognitive Psychology, 20, 1209–1224.
Roussos, L. A., Templin, J. L., & Henson, R. A. (2007). Skills diagnosis using IRT-based
latent class models. Journal of Educational Measurement, 44, 293–311.
Seabrook, R., Brown, G., & Solity, J. (2005). Distributed and massed practice: From
laboratory to classroom. Applied Cognitive Psychology, 19, 107–122.
Sobel, H., Cepeda, N., & Kapler, I. (2011). Spacing effects in real-world classroom
vocabulary learning. Applied Cognitive Psychology, 25, 763–767.
TRACTABLE BAYESIAN TEACHING
B. S. Eaves Jr., A. M. Schweinhart, and P. Shafto
Abstract
The goal of cognitive science is to understand human cognition in the real world. However,
Bayesian theories of cognition are often unable to account for anything beyond the
schematic situations whose simplicity is typical only of experiments in psychology labs. For
example, teaching to others is commonplace, but under recent Bayesian accounts of human
social learning, teaching is, in all but the simplest of scenarios, intractable because teaching
requires considering all choices of data and how each choice of data will affect learners’
inferences about each possible hypothesis. In practice, teaching often involves computing
quantities that are either combinatorially implausible or that have no closed-form solution.
In this chapter we integrate recent advances in Markov chain Monte Carlo approximation
with recent computational work in teaching to develop a framework for tractable Bayesian
teaching of arbitrary probabilistic models. We demonstrate the framework on two complex
scenarios inspired by perceptual category learning: phonetic category models and visual
scene categorization. In both cases, we find that the predicted teaching data exhibit
surprising behavior. In order to convey the number of categories, the data for teaching
phonetic category models exhibit hypo-articulation and increased within-category variance.
And in order to represent the range of scene categories, the optimal examples for teaching
visual scenes are distant from the category means. This work offers the potential to scale
computational models of teaching to situations that begin to approximate the richness of
people’s experience.
Pedagogy is arguably humankind’s greatest adaptation and perhaps the reason for
our success as a species (Gergely, Egyed, & Kiraly, 2007). Teachers produce data to
efficiently convey specific information to learners and learners learn with this in
mind (Shafto and Goodman, 2008; Shafto, Goodman, & Frank, 2012; Shafto,
Goodman, & Griffiths, 2014). This choice not only ensures that information
lives on after its discoverer, but also ensures that information is disseminated
quickly and effectively. Shafto and Goodman (2008) introduced a Bayesian model
of pedagogical data selection and learning, and used a simple teaching game to
demonstrate that human teachers choose data consistently with the model and that
human learners make stronger inferences from pedagogically sampled data than
from randomly sampled data (data generated according to the true distribution).
Subsequent work, using the same model, demonstrated that preschoolers learn
differently from pedagogically selected data (Bonawitz et al., 2011).
Under the model, a teacher, T , chooses data, x ∗ , to induce a specific belief
(hypothesis, θ ∗ ) in the learner, L. Mathematically, this means choosing data with
probability in proportion with the induced posterior probability of the target
hypothesis,
$$p_T(x^* \mid \theta^*) = \frac{p_L(\theta^* \mid x^*)}{\int_x p_L(\theta^* \mid x)\,dx} \qquad (1)$$
$$= \frac{\dfrac{p_L(x^* \mid \theta^*)\, p_L(\theta^*)}{p_L(x^*)}}{\displaystyle\int_x \dfrac{p_L(x \mid \theta^*)\, p_L(\theta^*)}{p_L(x)}\,dx} \qquad (2)$$
$$\propto \frac{p_L(x^* \mid \theta^*)\, p_L(\theta^*)}{p_L(x^*)}. \qquad (3)$$
Thus Bayesian teaching includes Bayesian learning as a sub-problem—because it
requires considering all possible inferences given all possible data. At the outer
layer the teacher considers (integrates; marginalizes) over all possible alternative data choices, $\int_x p_L(\theta^* \mid x)\,dx$; at the inner level, the learner considers all alternative hypotheses in the marginal likelihood, $p_L(x^*)$. The teacher considers how each
possible dataset will affect learning of the target hypothesis and the learner considers
how well the data chosen by the teacher communicate each possible hypothesis.
Pedagogy works because learners and teachers have an implicit understanding of
each other’s behavior. A learner can quickly dismiss many alternatives using the
reasoning that had the teacher meant to convey one of those alternatives, she
would have chosen data differently. The teacher chooses data with this in mind.
Computationally, Bayesian teaching is a complex problem. Producing data that
lead a learner to a specific inference about the world requires the teacher to make
choices between different data. Choosing requires weighing one choice against all
others, which requires computing large, often intractable sums or integrals (Luce,
1977). The complexity of the teacher’s marginalization over alternative data can,
to some extent, be mitigated by standard approximation methods (e.g. Metropolis,
Rosenbluth, Rosenbluth, Teller, & Teller, 1953; Geman & Geman, 1984), but for
teaching, this is not enough. For each choice of dataset, the teacher must consider
how those data will cause the learner to weigh the target hypothesis against all other
hypotheses. As we shall see, this inner marginalization is not one that we can easily
make go away. And as the hypothesis becomes more complex, the marginalization
becomes more complex; often, as is the case in categorization, the size of the
set of alternative hypotheses increases ever faster as the number of data increases.
For example, if a category learner does not know the number of categories, he
or she must assume there can be as few as one category or as many categories
as there are data. Learning complex concepts that are reminiscent of real-world
scenarios often introduces marginalizations that have no closed form solution or
are combinatorially intractable. Because of this, existing work that models teaching
has done so using necessarily simple, typically discrete, hypothesis spaces.
A Bayesian method of eliciting a specific inference in learners has applications
beyond furthering our understanding of social learning, to education, perception,
and machine learning; thus it is in our interest to make Bayesian teaching tractable.
It is our goal in this chapter to leverage approximation methods that allow us to
scale beyond the simple scenarios used in previous research. We employ recent
advances in Monte Carlo approximation to facilitate tractable Bayesian teaching.
We proceed as follows: In the first section we discuss some of the sources of
complexity that arise in Bayesian statistics, such as marginalized probabilities, and
discuss standard methods of combating complexity. In the second section we
briefly discuss new methods from the Bayesian Big Data literature, paying special
attention to one particular method, pseudo-marginal sampling, which affords the
same theoretical guarantees as standard Monte Carlo approximation methods while
mitigating the effects of model complexity through further approximation, and
outline the procedure for simulating teaching data. In the third section we apply
the teaching model to the debate within developmental psychology of whether
infant-directed speech is for teaching, which amounts to teaching category models.
Lastly, in the fourth section we apply the teaching model to a more complex
problem of teaching natural scene categories, which we model as categories of
category models. We conclude with a brief recapitulation and meditation on future
directions.
$$\pi(\theta' \mid x) = \frac{f(x \mid \theta')\,\pi(\theta')}{m(x)} \qquad (4)$$
$$= \frac{f(x \mid \theta')\,\pi(\theta')}{\sum_{\theta \in \Theta} f(x \mid \theta)\,\pi(\theta)}. \qquad (5)$$
Importance Sampling
Importance sampling (IS) is a Monte Carlo method used to approximate
integrals that are analytically intractable or not suitable for quadrature (numerical
integration).1 IS involves re-framing the integral of a function p with respect to θ,
as an expectation with respect to an importance function, w(·) = p(·)/q(·), under q,
such that q(·) > 0 whenever p(·) > 0. One draws a number, M, of independent
samples $\bar\theta_1, \ldots, \bar\theta_M$ from $q$, and takes the arithmetic average of $w(\bar\theta_1), \ldots, w(\bar\theta_M)$.
By the law of large numbers,
$$\lim_{M\to\infty} \frac{1}{M}\sum_{i=1}^{M} w(\bar\theta_i) = \int_\theta p(\theta)\,d\theta, \qquad (6)$$
that is, as M → ∞ the average converges to the true value of the target integral. Moreover, IS produces an unbiased estimate: the expected value of the estimate equals the true value for any finite M.
If we wish to estimate m(x), we set w(θ) = f (x | θ)π(θ)/q(θ),
$$m(x) = \int_\theta f(x \mid \theta)\,\pi(\theta)\,d\theta = \int_\theta \frac{f(x \mid \theta)\,\pi(\theta)}{q(\theta)}\,q(\theta)\,d\theta = E_q\!\left[\frac{f(x \mid \theta)\,\pi(\theta)}{q(\theta)}\right] \approx \frac{1}{M}\sum_{i=1}^{M}\frac{f(x \mid \bar\theta_i)\,\pi(\bar\theta_i)}{q(\bar\theta_i)}. \qquad (7)$$
When we approximate the integral with the sum, we no longer consider the
differential, dθ, but consider only individual realizations, θ̄, drawn from q. As
we shall see in the third section, the choice of q influences the efficiency of IS. A
straightforward, although usually inefficient, choice is to draw θ̄ from the prior,
q(θ) = π(θ), in which case,
$$m(x) \approx \frac{1}{M}\sum_{i=1}^{M} f(x \mid \bar\theta_i). \qquad (8)$$
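As a concrete illustration of Equations 7 and 8, the sketch below estimates the marginal likelihood of a single observation under a conjugate Normal-Normal model, chosen because the exact answer is available for comparison. The particular numbers are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Conjugate Normal model with known variance, so m(x) has a closed form to check against.
x = 1.3                  # a single observation (illustrative)
sigma = 1.0              # known likelihood standard deviation
mu0, tau0 = 0.0, 2.0     # prior: theta ~ N(mu0, tau0^2)

# Equation 8: draw theta from the prior and average the likelihoods.
M = 100_000
theta_bar = rng.normal(mu0, tau0, size=M)
m_hat = stats.norm.pdf(x, loc=theta_bar, scale=sigma).mean()

# Exact marginal likelihood: x ~ N(mu0, sigma^2 + tau0^2).
m_exact = stats.norm.pdf(x, loc=mu0, scale=np.sqrt(sigma**2 + tau0**2))
print(m_hat, m_exact)  # the two values agree to within Monte Carlo error
```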
2014) and stochastic gradient methods (Patterson & Teh, 2013). Firefly Monte
Carlo (Maclaurin & Adams, 2014), which uses a clever proposal density to activate
(light up) certain data points, is the first exact MCMC algorithm to use subsets of
data. Other proposals employ multiprocessing strategies such as averaging results
from independent Monte Carlo simulations run on subsets of data (Scott et al.,
2013) and dividing computations and computing parts of the MH acceptance ratio
on multiple processors (Banterle, Grazian, & Robert, 2014).
$$p_T(x^* \mid \theta^*) = \frac{p_L(\theta^* \mid x^*)}{m(\theta^*)} \propto \frac{p_L(x^* \mid \theta^*)}{p_L(x^*)}.$$
The teacher marginalizes over datasets, $m(\theta^*) = \int_x p_L(\theta^* \mid x)\,dx$, and for each
dataset marginalizes over all possible learning inferences, $p_L(x^*) = \int_\theta p_L(x^* \mid \theta)\,p(\theta)\,d\theta$. To generate teaching data, we must simulate data according to this
probability distribution while navigating these marginalizations.
In order to simulate teaching data, we use PM-MCMC by embedding
importance sampling within the Metropolis-Hastings algorithm. We use MH to
avoid calculating the integral over alternative data, $\int_x p_L(\theta^* \mid x)\,dx$, leaving the
MH acceptance ratio,
$$A = \frac{p_L(x' \mid \theta)\,p_L(x^*)}{p_L(x^* \mid \theta)\,p_L(x')}, \qquad (15)$$
where x ′ is the proposed (continuous) data and it is assumed that the proposal
density, q, is a symmetric, Gaussian perturbation of the data. Equation 15 indicates
that we must calculate the marginal likelihoods of data in order to use MH for
teaching. This marginalization is inescapable, so we replace it with an importance
sampling approximation, p̂ L (x).
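The sketch below shows one way the acceptance ratio in Equation 15 might be embedded in a pseudo-marginal MH loop. The functions `log_lik` (log p_L(x | θ*)) and `log_marg_hat` (the log of a stochastic, for example importance-sampling, estimate of p_L(x)) are caller-supplied placeholders, and the Gaussian proposal scale is an arbitrary assumption; this is a schematic of the approach, not the authors' implementation.

```python
import numpy as np

def pseudo_marginal_mh(x0, log_lik, log_marg_hat, n_steps=5000,
                       prop_scale=0.5, seed=0):
    """MH over datasets x using Equation 15, with the intractable marginal
    likelihood p_L(x) replaced by a stochastic estimate (PM-MCMC).
    log_lik(x): log p_L(x | theta*); log_marg_hat(x): log of an
    importance-sampling estimate of p_L(x). Both are supplied by the caller."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    # The estimate for the current state is cached and reused across iterations,
    # following standard pseudo-marginal practice.
    ll_x, lm_x = log_lik(x), log_marg_hat(x)
    samples = []
    for _ in range(n_steps):
        x_prop = x + rng.normal(0.0, prop_scale, size=x.shape)  # symmetric Gaussian proposal
        ll_p, lm_p = log_lik(x_prop), log_marg_hat(x_prop)
        log_A = (ll_p + lm_x) - (ll_x + lm_p)                   # Equation 15 in log space
        if np.log(rng.uniform()) < log_A:
            x, ll_x, lm_x = x_prop, ll_p, lm_p
        samples.append(x.copy())
    return np.array(samples)
```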
Teaching necessarily depends on the content to be taught, and different
problems require different formalizations of learning. In the following two sections
we apply the teaching model to generate data to teach in two distinct perceptual
learning problems involving categorization models: phonetics and visual scenes.
Categorization is well studied psychologically (see Anderson, 1991; Feldman,
1997; Markman & Ross, 2003) and computationally (see Jain, Murty, & Flynn,
1999; Neal, 2000; Rasmussen, 2000), and it presents a particularly challenging
marginalization problem; it is thus an ideal testbed.
hyper-articulation is good for teaching because clusters that are farther apart should
be easier to discriminate.
To date, the IDS research lacks a formal account of teaching vowel phonemes
to infants; rather, arguments are built around intuitions, which conceivably are the
source of much of the contention regarding this topic.2 This sort of question has
previously been unapproachable because languages contain many phonemes,
and the set of possible categorizations of even a small number of examples rapidly
becomes an intractable quantity to sum over. Here we show how the teaching
model can be applied to such a problem. We first describe a model of learning
Gaussian category models and then describe the relevant teaching framework. We
then generate teaching data and explore their qualitative properties.
where $\delta_{i,j}$ is the Kronecker delta function, which assumes value 1 if $i = j$ and value 0 otherwise; $\delta_{z_i,k}$ equals 1 if, and only if, the $i$th datum is a member of phoneme $k$.
$$G \sim \mathrm{DP}(\alpha H), \qquad (19)$$
$$\phi_k \sim H, \qquad (20)$$
$$x_k \sim \mathcal{N}(x_k \mid \phi_k), \qquad (21)$$
$$\mu_k, \Sigma_k \sim \mathrm{NIW}(\mu_0, \Lambda_0, \kappa_0, \nu_0), \qquad (22)$$
which implies
$$\Sigma_k \sim \text{Inverse-Wishart}_{\nu_0}(\Lambda_0^{-1}), \qquad (23)$$
$$\mu_k \mid \Sigma_k \sim \mathcal{N}(\mu_0, \Sigma_k/\kappa_0), \qquad (24)$$
where $\Lambda_0$ is the prior scale matrix, $\mu_0$ is the prior mean, $\nu_0 \ge d$ is the prior degrees of freedom, and $\kappa_0$ is the number of prior observations.
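A short sketch of the generative steps in Equations 21–24 using SciPy's distributions; the hyperparameter values are placeholders. Note that SciPy's `invwishart` is parameterized by a scale matrix, which we take to be the prior scale matrix under the usual NIW convention (conventions for writing the inverse-Wishart argument differ).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder NIW hyperparameters in d = 2 dimensions; nu0 must be >= d.
d = 2
mu0 = np.zeros(d)
lambda0 = np.eye(d)
kappa0, nu0 = 1.0, 4.0

# Equation 23: Sigma_k drawn from an inverse-Wishart with nu0 degrees of freedom.
sigma_k = stats.invwishart(df=nu0, scale=lambda0).rvs(random_state=rng)

# Equation 24: mu_k | Sigma_k ~ N(mu0, Sigma_k / kappa0).
mu_k = stats.multivariate_normal(mean=mu0, cov=sigma_k / kappa0).rvs(random_state=rng)

# Equation 21: data for component k are drawn from N(mu_k, Sigma_k).
x_k = stats.multivariate_normal(mean=mu_k, cov=sigma_k).rvs(size=10, random_state=rng)
print(mu_k, sigma_k, x_k.shape)
```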
To formalize inference over z, we introduce a prior, π(z | α), via the Chinese
Restaurant Process (Teh, Jordan, Beal, & Blei, 2006), denoted CRP(α), where the
parameter α affects the probability of new components. Higher α creates a higher
bias toward new components. Data points are assigned to components as follows:
$$P(z_i = k \mid z^{(-i)}, \alpha) = \begin{cases} \dfrac{n_k}{N - 1 + \alpha} & \text{if } k \in 1 \ldots K \\[4pt] \dfrac{\alpha}{N - 1 + \alpha} & \text{if } k = K + 1 \end{cases}, \qquad n_k = \sum_{i=1}^{N} \delta_{z_i,k}, \qquad (25)$$
where $z^{(-i)} = z \setminus z_i$.
Teaching DPGMMs
Recall that the probability of the teacher choosing data is proportional to the
induced posterior. The posterior for the DPGMM is,
$$\pi(\Phi, z \mid x) = \frac{f(x \mid \Phi, z)\,\pi(\Phi \mid \mu_0, \Lambda_0, \kappa_0, \nu_0)\,\pi(z \mid \alpha)}{m(x \mid \mu_0, \Lambda_0, \kappa_0, \nu_0, \alpha)}. \qquad (26)$$
where
$$\pi(z \mid \alpha) = \frac{\Gamma(\alpha)\prod_{k=1}^{K_z}\Gamma(n_k)}{\Gamma(N + \alpha)}\,\alpha^{K_z}, \qquad (29)$$
$\mathcal{Z}$ is the set of all possible partitions of $N$ data points into 1 to $N$ categories, $K_z$ is
the number of categories in assignment vector $z$, and $f(x_k \mid \mu_0, \Lambda_0, \kappa_0, \nu_0)$ is the
marginal likelihood of the data assigned to category $k$ under NIW (which can be
calculated analytically). The size of $\mathcal{Z}$ has its own named combinatorial quantity:
the Bell number, or $B_n$. If we have sufficiently little data or ample patience, we can
calculate the quantity in Equation 28 by enumerating $\mathcal{Z}$. However, Bell numbers
grow quickly: $B_1 = 1$, $B_5 = 52$, $B_{12} = 4{,}213{,}597$, and so on. We can produce an
importance sampling approximation by setting q(z) := π(z|α),
$$\hat{m}(x \mid \mu_0, \Lambda_0, \kappa_0, \nu_0, \alpha) = \frac{1}{M}\sum_{i=1}^{M}\prod_{k=1}^{K_{\bar{z}_i}} f(x_k \mid \mu_0, \Lambda_0, \kappa_0, \nu_0). \qquad (30)$$
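The sketch below implements this prior importance sampler: partitions are drawn from the CRP (Equation 25) and the per-category NIW marginal likelihoods are multiplied, working in log space for numerical stability. The NIW marginal uses the standard conjugate closed-form expression; the chapter's own implementation was in C++, so this is only an illustrative re-creation.

```python
import numpy as np
from scipy.special import multigammaln

def crp_partition(n, alpha, rng):
    """Draw a partition z of n items from CRP(alpha), i.e. q(z) = pi(z | alpha)."""
    z, counts = np.zeros(n, dtype=int), [1]
    for i in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
        z[i] = k
    return z

def niw_log_marginal(X, mu0, lam0, kappa0, nu0):
    """Log marginal likelihood of the rows of X under a Normal likelihood with a
    Normal-inverse-Wishart prior (standard conjugate closed form)."""
    X = np.atleast_2d(X)
    n, d = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)
    kappan, nun = kappa0 + n, nu0 + n
    diff = (xbar - mu0).reshape(-1, 1)
    lamn = lam0 + S + (kappa0 * n / kappan) * (diff @ diff.T)
    return (-0.5 * n * d * np.log(np.pi)
            + multigammaln(nun / 2.0, d) - multigammaln(nu0 / 2.0, d)
            + 0.5 * nu0 * np.linalg.slogdet(lam0)[1]
            - 0.5 * nun * np.linalg.slogdet(lamn)[1]
            + 0.5 * d * (np.log(kappa0) - np.log(kappan)))

def prior_is_log_marginal(X, mu0, lam0, kappa0, nu0, alpha, M=1000, seed=0):
    """Equation 30: average, over M partitions drawn from the CRP prior, of the
    product over categories of NIW marginal likelihoods (done in log space)."""
    rng = np.random.default_rng(seed)
    logw = np.empty(M)
    for m in range(M):
        z = crp_partition(len(X), alpha, rng)
        logw[m] = sum(niw_log_marginal(X[z == k], mu0, lam0, kappa0, nu0)
                      for k in np.unique(z))
    return np.logaddexp.reduce(logw) - np.log(M)  # log of the Monte Carlo mean
```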
The approach of drawing from the prior by setting q(θ) := π(θ) is usually
inefficient. Areas of high posterior density contribute most to the marginal
likelihood, thus the optimal q is close to the posterior. Several approaches have
been proposed for estimating the marginal likelihood in finite mixture models
(Chib, 1995; Marin & Robert, 2008; Rufo, Martin, & Pérez, 2010; Fiorentini,
Planas, & Rossi, 2012); here we propose a Gibbs-initialization importance
sampling scheme suited to the infinite case.4 Each sample, $\bar{z}_1, \ldots, \bar{z}_M$, is drawn
by sequentially assigning the data to categories based on the standard collapsed
Gibbs sampling scheme (Algorithm 1),
$$q(z) = \prod_{i=2}^{N} p(z_i \mid \{z_1, \ldots, z_{i-1}\}, \{x_1, \ldots, x_{i-1}\}, \Lambda_0, \mu_0, \kappa_0, \nu_0, \alpha), \qquad (31)$$
FIGURE 4.1 Performance comparison between prior and partial Gibbs importance
sampling. (a) Mean relative error, over 2,500 random datasets (y-axis), of the prior
importance sampling approximation (dark) and the partial Gibbs importance sampling
approximation (light; Equation 31) by number of samples (x-axis) for six (solid), eight
(dashed), and ten data points (dotted). (b) Runtime performance (seconds; y-axis)
of algorithms for calculating/approximating m(x | µ0 , 30 , κ0 , ν0 , α) by number of
data points (N; x-axis): exact calculation via enumeration (black), 1,000 samples of
prior importance sampling (dark gray), and 1,000 samples of partial Gibbs importance
sampling (light gray). (c) Separate view of runtime of the importance samplers.
Figure 4.1(a) shows that estimator error decreases as the number of samples increases, that there is generally more error for higher
N, and that the partial Gibbs importance sampling scheme produces a third of the
error of the prior importance sampling scheme. We compared the runtime performance
of C++ implementations of exact calculation via enumeration and importance
sampling (using M = 1, 000 samples) for n = 1, . . . , 13. The results can be seen in
Figure 4.1(b) and (c). Enumeration is faster than IS until N = 10, after which the importance samplers become faster.
Experiments
We first conduct small-scale experiments to demonstrate that x̂ simulated using Â
(the pseudo-marginal acceptance ratio) is equivalent to x simulated using A (the
exact acceptance ratio) while demonstrating the basic behavior of the model. We
then scale up and conduct experiments to determine what type of behavior (e.g.
hyper- or hypo-articulation, within-category variance increase) can be expected in data designed to
teach complex Gaussian category models to naive learners.
To ensure the exact MH samples and pseudo-marginal MH samples are
identically distributed we used a three-category model, which was to be taught
with two data points assigned to each category. We collected 1,000 samples across
five independent Markov chains, ignoring the first 200 samples from each chain
and thereafter collecting every twentieth sample.5 The prior parameters were set as
in the previous section. Figure 4.2(a) and (b) show the result. Both datasets exhibit
similar behavior including hyper-articulation, denoted by the increased distance
between the category means of the teaching data, and within-category variance
increase. A two-sample, permutation-based, Gaussian Kernel test (Gretton,
Fukumizu, Harchaoui, & Sriperumbudur, 2009; Gretton, Borgwardt, Rasch,
Scholkopf, & Smola, 2012) using 10,000 permutations indicates that the exact
and pseudo-marginal data are identically distributed ( p = 0.9990).
FIGURE 4.2 Behavior of the teaching model with exact and pseudo-marginal samples.
(a) Three-category Gaussian mixture model. The gray points are drawn directly from
the target model and the black points are drawn from the teaching model using the
exact acceptance ratio with N = 3. (b) Three-category Gaussian mixture model. The
gray points are drawn directly from the target model and the black points are drawn
from the teaching model using the pseudo-marginal acceptance ratio with N = 3.
(c) Pseudo-marginal samples for a two-category model where both categories have
the same mean.
FIGURE 4.3 Scale experiment. (Top) Scatter plot of random samples from the target
model (gray) and the teaching data (black). The numbered circles represent the means
of each of the 20 categories. (Bottom) Change in distance between category pairs from
random to teaching samples. Negative values indicate hypo-articulation and positive
values indicate hyper-articulation.
Discussion
In this section we sought to teach Gaussian category models using a
non-parametric categorization framework, inspired by a debate from the
infant-directed speech literature. We demonstrated how standard MH sampling in
the teaching model becomes intractable at a small number of datapoints/categories
(Figure 4.1(b)) and showed how PM-MCMC using a novel importance sampling
scheme (Algorithm 1) allows for tractable teaching in complex models. We
then conducted experiments demonstrating that PM-MCMC produces results
indistinguishable from standard MH, while demonstrating that, like IDS, the
teaching model produces hyper-articulation and within-category variance increase
(Figure 4.2). We then scaled up and created a random target model with
roughly the same complexity as an English phonetic category model, finding
that hypo-articulation, hyper-articulation, and variance increase are all features
consistent with teaching.
The results suggest that these features are consistent with teaching in general,
but do not indicate that they are consistent specifically with teaching phonetic
category models. To that end, one would need to apply the teaching model to
category models derived from empirical phonetics data. We have demonstrated
that, using PM-MCMC, the teaching model is capable of contributing to this and
other theoretical debates in teaching complex categories such as those in natural
language.
there are substantial differences between the distributions for carpentered and
non-carpentered environments (Girshick, Landy, & Simoncelli, 2011). The
distribution of oriented contours in an office environment has substantially greater
peaks at the cardinal orientations than the distribution in a national park, for
instance. Here we generalize the teaching model described in the previous section
to determine optimal examples for “teaching” the visual system the distribution of
natural perceptual scenes from different categories (e.g. nature versus “carpentered”
environments).
Given data in the form of the amplitudes of various, discrete orientations, scene
categories can themselves be multi-modal. For example, the oriented content
in forest scenes is different from the oriented content in desert scenes, but both
desert and forest scenes fall into the category of natural scenes. In order to begin
quantifying different types of scene categories, we employ a nested categorization
model in which outer categories are composed of inner categories. (For a similar
but more restrictive model see Yerebakan, Rajwa, & Dundar, 2014). More
specifically, we implement a Dirichlet process mixture model where the outer
Dirichlet process emits a Dirichlet process that emits Gaussians according to NIW.
This is a generalization of the DPGMM model outlined in the previous section.
The generative process of this Dirichlet process mixture model of Dirichlet
process Gaussian mixture models (DP-DPGMM) is outlined in Algorithm 2. A
CRP parameter for the outer categories, $\gamma$, is drawn from $H$; and the assignment
of data to outer categories, $z$, is drawn from $\mathrm{CRP}_N(\gamma)$. For each outer category,
$k = 1, \ldots, K$, an inner CRP parameter, $\alpha_k$, is drawn from $\Lambda$; a set of NIW
parameters, $G_k$, is drawn from $G$; and an assignment of data in outer category
$k$ to inner categories, $v_k$, is drawn from $\mathrm{CRP}_{n_k}(\alpha_k)$. For each inner category,
$j = 1, \ldots, J_k$, a mean and covariance, $\mu_{kj}$ and $\Sigma_{kj}$, are drawn from $G_k$; and data
points are drawn from those $\mu_{kj}$ and $\Sigma_{kj}$. The full joint density is,
$$p(\gamma \mid H)\,p(z \mid \gamma)\prod_{k=1}^{K}\Biggl(p(\alpha_k \mid \Lambda)\,p(v_k \mid \alpha_k)\,p(G_k \mid G)\prod_{j=1}^{J_k} p(\mu_{kj}, \Sigma_{kj} \mid G_k)\prod_{x \in x_{kj}} p(x \mid \mu_{kj}, \Sigma_{kj})\Biggr). \qquad (34)$$
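Algorithm 2 is not reproduced here, but the sketch below gives one reading of the DP-DPGMM generative process just described: an outer CRP over data, an inner CRP within each outer category, and Gaussian parameters for each inner category. The Inverse-Gamma(1, 1) draws for the CRP parameters follow the priors reported later in the chapter, while the per-outer-category base measure G_k is a simplified placeholder rather than the full set of NIW hyperparameters.

```python
import numpy as np
from scipy import stats

def crp(n, alpha, rng):
    """Sequential CRP(alpha) assignment of n items."""
    z, counts = np.zeros(n, dtype=int), []
    for i in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
        z[i] = k
    return z

def generate_dp_dpgmm(n, d=2, seed=0):
    """Generative sketch of the DP-DPGMM; the hyperpriors H, Lambda, and G are
    simplified placeholders."""
    rng = np.random.default_rng(seed)
    gamma = stats.invgamma(1, scale=1).rvs(random_state=rng)          # gamma ~ H (assumed)
    z = crp(n, gamma, rng)                                            # outer assignments
    X, v = np.zeros((n, d)), np.zeros(n, dtype=int)
    for k in np.unique(z):
        idx = np.where(z == k)[0]
        alpha_k = stats.invgamma(1, scale=1).rvs(random_state=rng)    # alpha_k ~ Lambda (assumed)
        mu_gk, scale_gk = rng.normal(0, 5, size=d), np.eye(d)         # G_k ~ G (placeholder)
        v[idx] = crp(len(idx), alpha_k, rng)                          # inner assignments
        for j in np.unique(v[idx]):
            sigma_kj = stats.invwishart(df=d + 2, scale=scale_gk).rvs(random_state=rng)
            mu_kj = stats.multivariate_normal(mu_gk, sigma_kj).rvs(random_state=rng)
            members = idx[v[idx] == j]
            X[members] = stats.multivariate_normal(mu_kj, sigma_kj).rvs(
                size=len(members), random_state=rng).reshape(len(members), d)
    return X, z, v
```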
Teaching DP-DPGMMs
Given data x = x1 , . . . , x N we wish to teach the assignment of data to outer
categories, z, the assignment of data to inner categories, v, and the means
and covariance matrices that make up the inner categories. The DP-DPGMM
framework assumes that $G$, $H$, and $\Lambda$ (the base distribution on $G_k$ and the priors on the outer
and inner CRP parameters) are known, and that all other quantities are unknown.
To compute the marginal likelihood $m(x \mid G, H, \Lambda)$, we must integrate and sum
over all unknowns. The resulting quantity is far more complex than the DPGMM
marginal likelihood (Equation 28). We approximate $m(x \mid G, H, \Lambda)$ via importance
sampling by drawing parameters from the generative process and calculating the
likelihood of the data,
$$m(x \mid G, H, \Lambda) \approx \frac{1}{M}\sum_{i=1}^{M}\prod_{k=1}^{K_{\bar{z}_i}}\prod_{j=1}^{J_{\bar{v}_{ki}}} f(x_{k,j} \mid \bar{G}_k), \qquad (35)$$
where $K_{\bar{z}_i}$ is the number of outer categories in the $i$th outer category assignment,
$\bar{z}_i$, and $J_{\bar{v}_{ki}}$ is the number of inner categories in the $k$th outer category according
to the $i$th inner category assignment, $\bar{v}_i$. The MH acceptance ratio is then,
$$A = \frac{\hat{m}(x \mid G, H, \Lambda)\prod_{k=1}^{K_{z^*}}\prod_{j=1}^{J_{v_k^*}} \mathcal{N}(x'_{kj} \mid \mu^*_{kj}, \Sigma^*_{kj})}{\hat{m}(x' \mid G, H, \Lambda)\prod_{k=1}^{K_{z^*}}\prod_{j=1}^{J_{v_k^*}} \mathcal{N}(x_{kj} \mid \mu^*_{kj}, \Sigma^*_{kj})}. \qquad (36)$$
Notice that all factors of the full joint distribution that do not rely on the data
cancel from $A$, leaving only the likelihood of the data under the inner-category
parameters ($\mu^*_{kj}, \Sigma^*_{kj}$) and the marginal likelihood.
Experiments
Our model can be used to choose the images that would be most efficient data for
teaching the true distribution within and across scene categories. In this vein, we
shall use the teaching model to choose images, from among some set of empirical
data, that are best for teaching scene categories given their orientation distribution.
Different types of visual experience were collected by wearing a head-mounted
camera, which sent an outgoing video feed to a laptop that was stored in
a backpack. The videos were recorded during typical human environmental
interaction as observers walked around different types of environments (a nature
preserve, inside a house, downtown in a city, around a university, etc.).
Subsequently, every thousandth frame of the videos was taken as a representative
sample and sample images were sorted into two outer categories: purely natural (no
man-made structure) or outdoor, but containing carpentered content. Then, the
structural information was extracted using a previously developed image rotation
method (see Schweinhart & Essock, 2013). Briefly, each frame was fast Fourier
transformed, rotated to the orientation of interest and the amplitude of the cardinal
orientations (horizontal and vertical) was extracted and stored. Repeating this
process every 15 degrees allowed each video frame to be condensed into a series of
12 data points representing the amount of oriented structure in the image. In this
work, we focus on amplitudes at 0, 45, 90, and 135 degrees and on natural and
carpentered scenes.
To derive a target distribution (means and covariance matrices of inner cate-
gories), we applied expectation-maximization (EM; Dempster, Laird, & Rubin,
1977) to the orientation data from each setting.6 To facilitate cross-referencing
existing images, rather than generating a distribution over datasets, we searched for
the single best dataset, xopt , for teaching the scene categories by searching for the
dataset that maximized the quantity in Equation 3, i.e.
$$x_{\mathrm{opt}} = \operatorname*{argmax}_x\, p_T(x \mid \theta^*). \qquad (37)$$
$$\mu_k, \lambda_k, \kappa_k, \nu_k \sim G,$$
$$\mu_k \sim \mathcal{N}(\bar{x}, \mathrm{cov}(x)),$$
$$\lambda_k \sim \text{Inverse-Wishart}_{d+1}(\mathrm{cov}(x)),$$
$$\kappa_k \sim \mathrm{Gamma}(2, 2),$$
$$\nu_k \sim \mathrm{Gamma}_d(2, 2),$$
with the intention of being sufficiently vague, where $\mathrm{Gamma}_d(\cdot, \cdot)$ denotes the
gamma distribution with lower bound $d$, $\bar{x}$ is the mean of $x$, and $\mathrm{cov}(x)$ is the
covariance of $x$.7 All $\alpha$ and $\gamma$ were drawn from Inverse-Gamma(1, 1).
We note that PM-MCMC does not offer the same theoretical guarantees for
optimization problems that it does for simulation because PM-MCMC relies on
approximate scores; thus the maximum score may be inflated to some degree by
estimator error. Pilot simulations revealed that at 1,000 IS samples, the variance
of m̂ for this problem is acceptable for current purposes. If estimator error is a
concern, one may verify the top few optimal datasets post hoc by re-evaluating their
scores a number of times.
The optimal teaching data are plotted along with the data from the original
model, with the target model means superimposed in Figure 4.4. The images
closest to the teaching data and the empirical means in Euclidean space are
displayed in Figure 4.5. The results demonstrate that the images closest to
the mean in terms of their orientation content are not the best examples to
teach the inner categories; the algorithm instead chose images that contrast the
category distributions. This is especially true for the natural images and when the
distribution of the inner category has higher variance (Figure 4.5, bottom row;
gray data).
Although the teaching model was only given information about the amplitude
of oriented structure in the global image, there are qualitative visual implications
of the choice of images used for teaching. Whereas images near the mean for
both “natural” categories have predominant horizon lines and ground planes, the
teaching model makes a clearer distinction between the two categories by choosing
images with and without a strong horizontal gradient. The teaching model also
more readily distinguishes urban (inner A) from rural-type (inner B) environments
for the carpentered scenes as indicated by the inclusion of cars and buildings in
inner category A (see Figure 4.5). Overall, the teaching model included a wider
variety of vantage points (including looking at the ground) for teaching images of
all categories, better capturing the variability of the image set. This contrasts with
the mean images, whose centers sit at relatively equal heights in the visual field.
Discussion
In this section we sought to select optimal images for teaching categories of
natural scenes. We employed a nested categories model to generalize the DPGMM
FIGURE 4.4 Image data associated with the mean empirical data and optimal teaching
data. (Top) The two closest images, in Euclidean space, to the optimal teaching datum
for each inner category for natural (left) and carpentered (right) scenes. (Bottom) The
two closest images, in Euclidean space, to the empirical means for each inner category
for natural (left) and carpentered (right) scenes.
model used in IDS categorization. Unlike the DPGMM, the DP-DPGMM had
no closed-form posterior (due to use of non-conjugate models) and therefore
computing the MH ratio required approximation. The results of the simulation
indicate that the best examples for teaching the inner categories of purely natural
and carpentered scenes are not the means of the respective categories.
The images that are best for teaching different visual categories under the model
exhibit surprising features; the teaching model emphasizes data away from the
mean in order to contrast the categories and represent the variation. Although we
have not directly tested the effectiveness of the teaching images in visual category
learning, the results of this model have potential implications for fields in which
visual training and image categorization are important.
Conclusion
The goal of cognitive science is to understand human cognition in common
scenarios; however, a valid complaint against Bayesian theoretical accounts of
FIGURE 4.5 Scene category teaching results. Orientation–orientation scatter plots of random samples from the target model. Different
marker colors denote different inner categories. Circles represent the target model category means and triangles represent the optimal
teaching data. The top row shows data from carpentered scenes; the bottom shows data from natural scenes.
cognition is that they are often unable to account for anything more than schematic
scenarios. Although we have focused on the problem of teaching categories,
we have demonstrated how recent advances in the so-called Bayesian Big Data
literature allow Bayesian cognitive modelers, in general, to build more compelling
models that are applicable to real-world problems of interest to experimentalists.
We began the chapter by briefly discussing the complexity concerns of the
Bayesian cognitive modeler, especially in the domain of teaching, and outlined
some standard methods of dealing with them.
sampling and applied it to the problem of teaching complex concepts. We applied
the PM-MCMC-augmented teaching model to teaching phonetic category
models, demonstrating how the framework could be used to contribute to an
active debate in linguistics: whether infant-directed speech is for teaching. The
results suggested that some of the unintuitive properties of IDS are consistent with
teaching, although further work is needed before the results are directly applicable to IDS. We then
applied the teaching model to the far more complex problem of teaching nested
category models. Specifically, we outlined a framework for learning and teaching
scene categories from orientation spectrum data extracted from images. We found
that the optimal data for teaching these categories captured a more descriptive
picture of the nested category than the mean data. Specifically, the teaching data
seek to convey the ranges of the categories.
This work represents a first step toward a general framework for teaching
arbitrary concepts. In the future, we hope to extend the model to teach in richer
domains and under non-probabilistic learning frameworks by creating a symbiosis
between Bayesian and non-Bayesian methods such as artificial neural networks and
convex optimization.
Acknowledgments
This work was supported in part by NSF award DRL-1149116 to P.S.
Notes
1 In general, quadrature is a more precise, computationally efficient solution than
Monte Carlo integration in the situations in which it can be applied.
2 We refer those interested in reading more about this debate to Burnham,
Kitamura, & Vollmer-Conna (2002), de Boer & Kuhl (2003), Uther, Knoll,
& Burnham (2007), McMurray, Kovack-Lesh, Goodwin, & McEchron (2013),
and Cristia & Seidl (2013).
3 The term non-parametric is used to indicate that the number of parameters is
unknown (that we must infer the number of parameters), not that there are no
parameters.
4 For an overview of methods for controlling Monte Carlo variance, see Robert
and Casella (2013, Chapter 4).
5 When the target distribution is multi-modal, Markov chain samplers often
become stuck in a single mode. To mitigate this, it is common practice to
sample from multiple independent Markov chains.
6 We used the implementation of EM in the scikit-learn (Pedregosa, Varoquaux,
Gramfort, Michel, Thirion, Grisel, & Duchesnay, 2011) python module’s
DPGMM class.
7 The degrees of freedom of NIW cannot be less than the number of dimensions,
thus the lower bound on νk must be d.
References
Anderson, J. (1991). The adaptive nature of human categorization. Psychological Review,
98(3), 409.
Andrieu, C., & Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte
Carlo computations. Annals of Statistics, 37(2), 697–725. arXiv: 0903.5480.
Andrieu, C., & Vihola, M. (2012). Convergence properties of pseudo-marginal Markov
chain Monte Carlo algorithms. 25(2), 43. arXiv: 1210.1484.
Banterle, M., Grazian, C., & Robert, C. P. (2014). Accelerating Metropolis-Hastings
algorithms: Delayed acceptance with prefetching, 20. arXiv: 1406.2660.
Bardenet, R., Doucet, A., & Holmes, C. (2014). Towards scaling up Markov chain Monte
Carlo: An adaptive subsampling approach. Proceedings of the 31st International Conference on
Machine Learning, 4, 405–413.
Bonawitz, E., Shafto, P., Gweon, H., Goodman, N. D., Spelke, E., & Schulz, L. (2011).
The double-edged sword of pedagogy: Instruction limits spontaneous exploration and
discovery. Cognition, 120(3), 322–330.
Burnham, D., Kitamura, C., & Vollmer-Conna, U. (2002). What’s new, pussycat? On
talking to babies and animals. Science, 296(5572), 1435.
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical
Association, 90(432), 1313–1321.
Coppola, D. M., Purves, H. R., McCoy, A. N., & Purves, D. (1998). The distribution of
oriented contours in the real world. Proceedings of the National Academy of Sciences of the
United States of America, 95(7), 4002–4006.
Cristia, A., & Seidl, A. (2013). The hyperarticulation hypothesis of infant-directed speech.
Journal of Child Language, 41, 1–22.
de Boer, B., & Kuhl, P. K. (2003). Investigating the role of infant-directed speech with a
computer model. Acoustics Research Letters Online, 4(4), 129.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B
(Methodological), 39(1), 1–38.
Essock, E. A., DeFord, J. K., Hansen, B. C., & Sinai, M. J. (2003). Oblique stimuli are seen
best (not worst!) in naturalistic broad-band stimuli: A horizontal effect. Vision Research,
43(12), 1329–1335.
Neal, R. M. (1999). Erroneous results in “marginal likelihood from the Gibbs output”. University
of Toronto.
Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models.
Journal of Computational and Graphical Statistics, 9(2), 249–265.
Patterson, S., & Teh, Y. W. (2013). Stochastic gradient Riemannian Langevin dynamics on
the probability simplex. Advances in Neural Information Processing Systems, 1–10.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
. . . Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine
Learning Research, 12, 2825–2830.
Rasmussen, C. (2000). The infinite Gaussian mixture model. Advances in Neural Information
Processing Systems, 11, 554–560.
Robert, C. P., & Casella, G. (2013). Monte Carlo statistical methods. New York: Springer
Science & Business Media.
Rufo, M., Martin, J., & Pérez, C. (2010). New approaches to compute Bayes factor in finite
mixture models. Computational Statistics & Data Analysis, 54(12), 3324–3335.
Schweinhart, A. M., & Essock, E. A. (2013). Structural content in paintings: Artists
overregularize oriented content of paintings relative to the typical natural scene bias.
Perception, 42(12), 1311–1332.
Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., & McCulloch,
R. E. (2013). Bayes and big data: The consensus Monte Carlo algorithm. International
Journal of Management Science and Engineering Management, 11(2), 78–88.
Shafto, P., & Goodman, N. D. (2008). Teaching games: Statistical sampling assumptions
for learning in pedagogical situations. In Proceedings of the 13th Annual Conference of the
Cognitive Science Society.
Shafto, P., Goodman, N. D., & Frank, M. C. (2012). Learning from others: The
consequences of psychological reasoning for human learning. Perspectives on Psychological
Science, 7(4), 341–351.
Shafto, P., Goodman, N. D., & Griffiths, T. L. (2014). A rational account of pedagogical
reasoning: Teaching by, and learning from, examples. Cognitive Psychology, 71C, 55–89.
Sherlock, C., Thiery, A., Roberts, G., & Rosenthal, J. (2013). On the efficiency
of pseudo-marginal random walk Metropolis algorithms. Annals of Statistics, 43(1),
238–275. arXiv:1309.7209v1.
Switkes, E., Mayer, M. J., & Sloan, J. A. (1978). Spatial frequency analysis of the visual
environment: Anisotropy and the carpentered environment hypothesis. Vision Research,
18(10), 1393–1399.
Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes.
Journal of the American Statistical Association, 101(476), 1566–1581.
Uther, M., Knoll, M., & Burnham, D. (2007). Do you speak E-NG-L-I-SH? A comparison
of foreigner- and infant-directed speech. Speech Communication, 49(1), 2–7.
Vallabha, G. K., McClelland, J. L., Pons, F., Werker, J. F., & Amano, S. (2007). Unsupervised
learning of vowel categories from infant-directed speech. Proceedings of the National
Academy of Sciences of the United States of America, 104(33), 13273–13278.
Wainwright, M. J. (1999). Visual adaptation as optimal information transmission. Vision
Research, 39(23), 3960–3974.
Yerebakan, H. Z., Rajwa, B., & Dundar, M. (2014). The infinite mixture of infinite
Gaussian mixtures. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q.
Weinberger (Eds) Advances in neural information processing systems 27 (pp. 28–36). Curran
Associates, Inc.
5
SOCIAL STRUCTURE RELATES TO
LINGUISTIC INFORMATION DENSITY
David W. Vinson and
Rick Dale
Abstract
Some recent theories of language see it as a complex and highly adaptive system, adjusting
to factors at various time scales. For example, at a longer time scale, language may
adapt to certain social or demographic variables of a linguistic community. At a shorter
time scale, patterns of language use may be adjusted by social structures in real time.
Until recently, datasets large enough to test how socio-cultural properties—spanning vast
amounts of time and space—influence language change have been difficult to obtain. The
emergence of digital computing and storage have brought about an unprecedented ability
to collect and classify massive amounts of data. By harnessing the power of Big Data we
can explore what socio-cultural properties influence language use. This chapter explores
how social-network structures, in general, contribute to differences in language use. We
analyzed over one million online business reviews using network analyses and information
theory to quantify social connectivity and language use. Results indicate that perhaps a
surprising proportion of variance in individual language use can be accounted for by subtle
differences in social-network structures, even after fairly aggressive covariates have been
added to regression models. The benefits of utilizing Big Data as a tool for testing classic
theories in cognitive science and as a method toward guiding future research are discussed.
Introduction
Language is a complex behavioral repertoire in a cognitively advanced species. The
sounds, words, and syntactic patterns of language vary quite widely across human
groups, who have developed different linguistic patterns over a long stretch of time
and physical separation (Sapir, 1921). Explanations for this variation derive from
two very different traditions. In the first, many language scientists have sought
to abstract away from this observed variability to discern core characteristics of
language, which are universal and perhaps genetically fixed across people (from
Chomsky, 1957 to Hauser, Chomsky, & Fitch, 2001). The second tradition
sees variability as the mark of an intrinsically adaptive system. For example,
Beckner et al. (2009) argue that language should be treated as being responsive to
socio-cultural change in real time. Instead of abstracting away apparently superficial
variability in languages, this variability may be an echo of pervasive adaptation,
from subtle modulation of real-time language use, to substantial linguistic change
over longer stretches of time. This second tradition places language in the broader
sphere of human behavior and cultural products in a time when environmental
constraints have well-known effects on many aspects of human behavior (see
Triandis, 1994 for review).1
Given these explanatory tendencies, theorists of language can have starkly
divergent ideas of it. An important next step in adjudicating between these traditions will be new
tools and broad data samples so that, perhaps at last, analyses can match theory in
extent and significance.
a sufficient linguistic corpus would have taken years, if not an entire lifetime,
to acquire. Indeed, some projects on the topic of linguistic diversity have this
property of impressive timescale and rigor. Some examples include the Philadelphia
Neighborhood Corpus, compiled by William Labov in the early 1970s, the
Ethnologue, first compiled by Richard Pittman dating back to the early 1950s,
and the World Atlas of Language Structures (WALS), a collection of data and
research from 55 authors on language structures available online, produced in 2008.
Digitally stored language, much of it accessible for analysis, now exceeds
several exabytes generated every day online (Kudyba & Kwatinetz, 2014).2
One way this profound new capability can be harnessed is by recasting current
theoretical foundations, generalized from earlier small-scale laboratory studies, into
a Big Data framework.
If language is pervasively adaptive, and is thus shaped by socio-cultural
constraints, then this influence must be acting somehow in the present day, in
real-time language use. Broader linguistic diversity and its socio-cultural factors
reflect a culmination of many smaller, local changes in the incremental choices of
language users. These local changes would likely be quite small, and not easily
discerned by simple observation, and certainly not without massive amounts of
data. In this chapter, we use a large source of language data, Yelp, Inc. business
reviews, to test whether social-network structures relate in systematic ways to the
language used in these reviews. We frame social-network variables in terms of
well-known network measures, such as centrality and transitivity (Bullmore &
Sporns, 2009), and relate these measures to language measures derived from
information theory, such as information density and uniformity (Aylett, 1999;
Jaeger, 2010; Levy & Jaeger, 2007). In general, we find subtle but detectable
relationships between these two groups of variables. In what follows, we first
motivate the broad theoretical framing of our Big Data question: What shapes
linguistic diversity and language change in the broad historical context? Following
this we describe information theory and its use in quantifying language use. Then,
we explain how social structure may influence language structure. We consider this
a first step in understanding how theories in cognitive and computational social
science can be used to harness the power of Big Data in important and meaningful
ways (see Griffiths, 2015).
The theory of uniform information density (UID; Levy & Jaeger 2007; Jaeger,
2010) states that speakers will aim to present the highest amount of information
across a message at a uniform rate, so as to efficiently communicate the most
content without violating a comprehender’s channel capacity. Support for this
theory comes from Aylett (1999), in an early expression of this account, who
found that speech is slower when a message is informationally dense, and from Jaeger
(2010), who found that information-dense messages are more susceptible to optional
word insertions, diluting their overall density over time. Indeed, even word length
may be adapted for its informational qualities (Piantadosi, Tily, & Gibson, 2011).
In a recent paper, we investigated how a simple contextual influence, the
intended valence of a message, influences information density and uniformity.
While it is obvious that positive and negative emotions influence what words
individuals use (see Vinson & Dale, 2014a for review), it is less obvious that the
probability structure of language use is also influenced by one’s intentions. Using a
corpus of over 200,000 online customer business reviews from Yelp, Inc., findings
suggest that the information density of a message increases as the valence of that
message becomes more extreme (positive or negative). It also becomes less uniform
(more variable) as message valence becomes more positive (Vinson & Dale, 2014b).
The results are commensurate with theories that suggest language use adapts to a
variety of socio-cultural factors in real time.
In this chapter, we look to information-theoretic measures of these kinds to
quantify aspects of language use, with the expectation that they will also relate in
interesting ways to social structure.
Social-Network Structure
Another key motivation of our proposed analyses involves the use of network
theory to quantify the intricate structural properties that connect a community
of speakers (Christakis & Fowler, 2009; Lazer et al., 2009). Understanding
how specific socio-cultural properties influence language can provide insight
into the behavior of the language user herself (Baronchelli, Ferrer-i-Cancho,
Pastor-Satorras, Chater, & Christiansen, 2013). For instance, Kramer, Guilliory,
and Hancock (2014) having analyzed over 600,000 Facebook users, reported that
when a user’s newsfeed was manipulated to show only those posts that were either
positive or negative, a reader’s own posts aligned with the emotional valence of
their friends’ messages. Understanding what a language looks like when under
certain socio-cultural pressures can provide valuable insight into what societal
pressures that help shape a language. Indeed, global changes to one’s socio-cultural
context, such as changes in the classification of severity of crime and punishment
over time, are marked by linguistic change (Klingenstien, Hitchcock, & DeDeo,
2014) while differences in the distance between socio-cultural niches are marked
by differences in language use (Vilhena et al., 2014).
Current Study
In the current study, we utilize the Yelp database as an arena to test how
population-level differences might relate to language use. While previous work
suggests online business reviews may provide insight into the psychological states
of its individual reviewers (Jurafsky, Chahuneau, Routledge, & Smith, 2014), we
expect that structural differences in one’s social community as a whole, where
language is crucial to conveying ideas, will affect language use. We focus on
how a language user’s social niche influences the amount and rate of information
transferred across a message. Agent-based simulations (Chater et al., 2006; Dale &
Lupyan, 2012; Reali et al., 2014) and recent studies on the influences of interaction
in social networks (Bond et al., 2012; Choi, Blumen, Congleton, & Rajaram,
2014) indicate that the structure of language use may be influenced by structural
aspects of a language user’s social interactions. From an exploratory standpoint, we
aim to determine if one’s social-network structure predicts the probability structure
of language use.
Method
Corpus
We used the Yelp Challenge Dataset (www.yelp.com/dataset.challenge), which,
at the time of this analysis, contained reviews from businesses in Phoenix, Las
Vegas, Madison, and Edinburgh. This includes 1,125,458 reviews from 252,898
users who reviewed businesses in these cities. The field entries for reviews include
almost all the information that is supplied on the Yelp website itself, including the
content of the review, whether the review was useful or funny, the star rating that
was conferred upon the business, and so on. It omits a user’s public username, but
includes an array of other useful information, in particular a list of user ID codes
that point to friends of a given user. Yelp users are free to ask any other Yelp user to
be their friend. Friendship connections are driven by users’ mutual agreement to
become friends. These user ID codes allow us to iteratively build social networks
by randomly choosing a user, and expanding the network by connecting friends
and friends of friends, which we further detail below.
Linguistic Measures
The first and simplest measure we explore in our analysis is the number of words in a
given review, its word length. This surface measure is used as a basic but important
covariate for regression analyses. Word length will define the bin count for entropy
and other information analyses, and so directly impacts these measures.
The second measure we use is the reviewer-internal entropy (RI-Ent) of a
reviewer’s word use. This marks the discrete Shannon entropy of a reviewer’s overall
word distribution. If reviewers use many different words, entropy would be high. If
a reviewer reuses a smaller subset of words, the entropy of word distribution would
be low, as this would represent a less uniform distribution over word types.
A third measure is the average information encoded in the reviewer’s word
use, which we’ll call average unigram information (AUI). Information, as described
above, is a measure of the number of bits a word encodes given its frequency in
the overall corpus. Reviews with higher information use less frequent words, thus
offering more specific and less common verbiage in describing a business.
A fourth measure is one more commonly used in studies of informational
structure of language, which we’ll call the average conditional information (ACI). This
is a bit-based measure of a word based on its probability conditioned on the prior
word in the text. In other words, it is a measure of the bits encoded in a given
bigram of the text. We compute the average bits across bigrams of a review, which
reflect the uniqueness in word combinations.3
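A rough sketch of how the three measures just described (RI-Ent, AUI, and ACI) might be computed for one reviewer, assuming the reviews have already been tokenized and that corpus-wide unigram and bigram counts are available; the tokenization scheme and any smoothing choices are not specified in the chapter, so none are applied here.

```python
import math
from collections import Counter

def reviewer_measures(tokens, corpus_unigrams, corpus_bigrams):
    """tokens: one reviewer's concatenated reviews as a list of words.
    corpus_unigrams / corpus_bigrams: Counters over the whole corpus, assumed to
    contain every word and bigram the reviewer used."""
    n = len(tokens)
    total = sum(corpus_unigrams.values())

    # Reviewer-internal entropy (RI-Ent): Shannon entropy of the reviewer's own
    # word distribution.
    counts = Counter(tokens)
    ri_ent = -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Average unigram information (AUI): mean of -log2 p(w), with p(w) taken
    # from corpus-wide relative frequencies.
    aui = sum(-math.log2(corpus_unigrams[w] / total) for w in tokens) / n

    # Average conditional information (ACI): mean of -log2 p(w2 | w1) over the
    # reviewer's bigrams.
    bigrams = list(zip(tokens, tokens[1:]))
    aci = sum(-math.log2(corpus_bigrams[b] / corpus_unigrams[b[0]])
              for b in bigrams) / len(bigrams)
    return ri_ent, aui, aci
```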
Finally, we extract two crude measures of information variability by calculating
the standard deviation over AUI and ACI, which we call unigram informational
variability (UIV) and conditional informational variability (CIV), respectively. Both
measures are a reflection of how stable the distribution is over a reviewer’s
average unigram and bigram bit values. These measures relate directly to uniform
information density (see Jaeger, 2010; Levy & Jaeger, 2007). A very uniform
distribution of information is represented by a stable mean and lower UIV/CIV; a less uniform distribution, by higher UIV/CIV.
Social Networks
One benefit of the Big Data approach we take in this chapter is that we can pose
our questions about language and social structure using the targeted unit of analysis
of social networks themselves. In other words, we can sample networks from the Yelp
dataset directly, with each network exhibiting network scores that reflect a certain
aspect of local social structure. We can then explore relationships between the
information-theoretic measures and these social-network scores.
We sampled 962 unique social networks from the Yelp dataset, which amounted
to approximately 38,000 unique users and 450,000 unique reviews. Users represent
nodes in social networks and were selected using a selection and connection
algorithm also shown in Figure 5.1. We start by choosing a random user who has
between 11 and 20 friends in the dataset (we chose this range to obtain networks
that were neither too small nor so large as to be computationally cumbersome). After
we chose that user, we connected all his or her friends and then expanded the social
network by one degree; randomly selecting ten of his or her friends and connecting
up to ten of his or her friend’s friends to the network. We then interconnected all
users in this set (shown as the first-degree nodes and connections in Figure 5.1).
We conducted this same process of finding friends of these first-degree nodes, and
then interconnected those new nodes of the second degree. In order to make sure
networks did not become too large, we randomly sampled up to ten friends of
each node only. Fifty percent of all networks fell between 89 and 108 reviewers
in size, and the resulting nets reveal a relatively normal distribution of network
metrics described in the next section.
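A rough sketch of the selection-and-connection procedure as we read it, assuming a `friends` dictionary mapping user IDs to lists of friend IDs built from the Yelp data; the exact expansion and interconnection details of the original pipeline may differ.

```python
import random
import networkx as nx

def sample_network(friends, rng=None):
    """friends: dict mapping user id -> list of friend ids (from the Yelp data)."""
    rng = rng or random.Random(0)

    # Seed: a random user with between 11 and 20 friends in the dataset.
    seed = rng.choice([u for u, fs in friends.items() if 11 <= len(fs) <= 20])

    G = nx.Graph()
    first_degree = list(friends[seed])
    G.add_edges_from((seed, f) for f in first_degree)

    # Expand by one degree: up to ten friends of up to ten first-degree nodes.
    for node in rng.sample(first_degree, min(10, len(first_degree))):
        node_friends = friends.get(node, [])
        for friend in rng.sample(node_friends, min(10, len(node_friends))):
            G.add_edge(node, friend)

    # Interconnect: add every friendship edge that exists between sampled users.
    for node in list(G.nodes):
        G.add_edges_from((node, f) for f in friends.get(node, []) if f in G)
    return G
```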
Network Measures
A variety of different network measures were used to quantify the structure of each
network. We consider two different categories of network structures: simple and
complex. A summary of all seven (two simple, five complex) network measures
appears in Table 5.2. We used two simple network measures: The number of
reviewers in a network, or nodes, and the number of friendship connections
between reviewers, or edges.
We considered five complex network measures. The first measure, network
degree, is the ratio of edges to nodes. This provides a measure of connectivity across
a network. High-degree networks have a higher edge-to-node ratio than lower
degree networks.
The second measure, network transitivity, determines the probability that
two adjacent nodes are themselves connected (sometimes termed the “clustering
coefficient”). Groups of three nodes, or triples, can either be closed (e.g. fully
connected) or open (e.g. two of the three nodes are not connected). The
ratio of closed triples to total triples provides a measure of the probability that
adjacent nodes are themselves connected. High transitivity occurs when the ratio
of closed-to-open triples is close to one.
A third measure, network betweenness, determines the average number of
shortest paths in a network that pass through some other node. The shortest path
of two nodes is the one that connects both nodes with the fewest edges. A pair of
nodes can have many shortest paths. A node’s betweenness value is the sum of the
ratio of a pair of node’s shortest paths that pass through the node, over the total
number of shortest paths in the network. We determined network betweenness by
taking the average node betweenness for all nodes in the network. Effectively, this
provides a measure of network efficiency. The higher a network’s betweenness, the
faster some bit of new information can travel throughout the network.
TABLE 5.2 Summary of the measures quantifying a network’s structure.
A fourth measure stems from node centrality, which determines the number of
connections a single node has with other nodes. Centrality can also be determined
for the whole network, known as graph centrality. Graph centrality is the ratio of
the sum of the absolute value of the centrality of each node, over the maximum
possible centrality of each node (Freeman, 1979). Node centrality is greatest when
a single node is connected to all other nodes, whereas graph centrality is greatest
when all nodes are connected to all other nodes. Information is thought to travel
faster in high-centrality networks. Here we use graph centrality only. From this
point on we will refer to graph centrality simply as centrality. Network betweenness
and network centrality share common theoretical assumptions, but quantify different
structural properties of a network.
Our fifth and final measure determines whether the connections between nodes
in a network share connectivity at both local and global scales. Scale-free networks
display connectivity at all scales, local and global, simultaneously (Dodds, Watts, &
Sabel, 2003). A network is scale free when its degree distribution (i.e., the number
of edge connections per node) fits a power law distribution. Networks that are
less scale free are typically dominated by either local connectivity (a tightly connected set
of nodes) or global connectivity (randomly connected nodes). Networks that
exemplify differences in complex structures are presented in Figure 5.2.
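The sketch below computes the simple and complex measures described above for one sampled network using networkx. The graph-centrality line follows Freeman's standard degree-centralization formula, which may differ in detail from the exact computation used in the chapter, and the power-law fit for scale-free structure is omitted (the chapter does not specify its fitting procedure).

```python
import numpy as np
import networkx as nx

def network_measures(G):
    n, m = G.number_of_nodes(), G.number_of_edges()
    deg = dict(G.degree())
    d_max = max(deg.values())
    return {
        "nodes": n,
        "edges": m,
        "degree": m / n,                                    # edge-to-node ratio
        "transitivity": nx.transitivity(G),                 # closed / total triples
        "betweenness": float(np.mean(list(nx.betweenness_centrality(G).values()))),
        # Freeman (degree) centralization: how much the most connected node
        # dominates, relative to the maximum possible (a star graph).
        "centrality": sum(d_max - d for d in deg.values()) / ((n - 1) * (n - 2)),
        # A scale-free check would additionally fit a power law to the degree
        # distribution (omitted here).
    }
```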
Additional Measures
Individual reviews were not quantified independently. Instead, all reviews from
a single individual were concatenated into one document. This allowed for
information-theoretic measures to be performed over a single user’s total set of
reviews. The average information of a network was then computed by taking
the average across all individuals (nodes) in the network. Such an analysis affords
testing how the structure of an individual’s social network impacts that individual’s
overall language use. However, due to the nature of how our information-theoretic
measures were determined, individuals who wrote well over one hundred reviews
were treated the same as those who wrote merely one. This introduces a
possible bias since information measures are typically underestimated when using
non-infinite sample sizes (as in the case of our information measures). While we
control for certain measures such as the average reviewer’s total review length and
network size, additional biases may occur due to the nature of how each measure
was determined (e.g. averaging across reviewers with unequal length reviews).
To address these concerns, two additional measures were used to assess the reliability of our analyses: (1) a Gini coefficient and (2) a random review baseline. They are described below.
Gini Coefficient. The Gini coefficient (range = [0,1]) was originally developed
to assess the distribution of wealth across a nation. As the coefficient approaches
zero, wealth is thought to approach greater equality. As the coefficient approaches
one, more of the nation’s wealth is thought to be shared among only a handful
of its residents. We use the Gini coefficient to assess the distribution of reviews
across a network. Since each node’s reviews were concatenated, given only one
value for each information-theoretic measure, certain reviewers’ measures will be
more representative of the linguistic distributions. The Gini coefficient provides a summary of how evenly reviews are distributed across the reviewers in a network.
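A minimal sketch of the Gini coefficient as applied here, computed over the number of reviews contributed by each reviewer in a network; the per-reviewer counts shown are made up.

```python
import numpy as np

def gini(review_counts):
    """Gini coefficient of the distribution of review counts across reviewers:
    0 = every reviewer contributes equally, values near 1 = reviews concentrated
    in a handful of reviewers."""
    x = np.sort(np.asarray(review_counts, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    # Standard Lorenz-curve-based formula.
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

print(gini([10, 10, 10, 10]))   # 0.0: perfectly even
print(gini([0, 0, 0, 40]))      # approaches 1 as reviews concentrate
```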
information content of these words, rendering them less unique and thus show the
lowest entropy, AUI, and so on. One may expect something similar for transitivity,
which would reflect the intensity of local mutual interconnections (closed triples).
However, the reverse is also possible. If language users are more densely
connected it may be more likely that they have established a richer common
ground overall. If so, language use may contain more information-dense words
specific to a shared context (Doyle & Frank, submitted). A fourth prediction is
that more network connectivity over a smaller group (higher-network degree) may
afford more complex language use, and so lead to higher AUI and ACI.
A final prediction comes from the use of information theory to measure the rate
of information transmission. When a speaker’s message is more information-dense,
it is more likely that it will also be more uniform. Previous findings show speakers
increase their speech rate when presenting low information-dense messages, but
slow their speech rate for information-dense messages (Pellegrino, Coupé, &
Marsico, 2011). It may be that any social structure that leads to increases in
information density simultaneously decreases information variability.
Results
Simple Measures
The confidence intervals (99.9 percent CI) of five multiple regression models,
where nodes, edges, and the Gini coefficient were used to predict each
information-theoretic measure, are presented in Table 5.3. We set a conservative
criterion for significance ( p < 0.001) for all analyses. Only those analyses that were
significant are presented. Crucially, all significant effects of independent variables
were individually compared to their effects on the random review baseline. To
do this, we treated the random review baseline and true network reviews as two
levels of the same variable: “true_baseline.” Using linear regression we added an
interaction term between independent network variables and the true_baseline
variable. A significant interaction is demarcated by “†” in Tables 5.3 and 5.4.
The effects of these network variables on information-theoretic measures are
significantly different in true networks compared to baseline networks. This helps
to ensure that our findings are not simply an artifact of our methodology.
All variables were standardized (scaled and shifted to have M = 0 and SD
= 1). Additionally, the number of words (length) was log transformed due to a
heavy-tailed distribution. All other variables were normally distributed. Because
length correlates with all information-theoretic measures and UIV and CIV
correlate with the mean of AUI and ACI, respectively (due to the presence of
a true zero), the mean of each information measure was first predicted by length
while UIV and CIV were also predicted by AUI and ACI. The residual variability
of these linear regression models was then predicted by nodes, edges, and the
Gini coefficient. The purpose of residualization is to further ensure that observed
interactions are not due to trivial collinearity between simpler variables (length, nodes) and ones that may be subtler and more interesting (CIV, centrality, etc.).5

TABLE 5.3 Lexical measures as predicted by nodes, edges, and Gini coefficient. Only the mean and 99.9 percent confidence intervals for each IV with p < 0.001 are presented. The "†" symbol denotes all network effects that were significantly different from baseline network effects (p < .001). Multiple linear regressions were performed in R: lm(DV∼Nodes+Edges+Gini).
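The two-step residualization can be illustrated with a minimal R sketch. The object and column names below (net_df, aui, length, nodes, edges, gini) are assumptions for the purpose of illustration, not the original code.

    # Step 1: factor review length out of an information measure (here AUI).
    net_df$log_length <- as.numeric(scale(log(net_df$length)))  # length was log transformed
    step1 <- lm(aui ~ log_length, data = net_df)
    net_df$aui_resid <- resid(step1)

    # Step 2: predict the residual variability from the simple network variables.
    step2 <- lm(aui_resid ~ nodes + edges + gini, data = net_df)
    confint(step2, level = 0.999)  # 99.9 percent CIs, as reported in Table 5.3

UIV and CIV would additionally be residualized on AUI and ACI, respectively, before the second step, as described above.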
The number of nodes provides a crude measure of network size, the number of edges a measure of network density, and the Gini coefficient a measure of the distribution of reviews across the network. Importantly, the Gini coefficient is not correlated with either edges or nodes. And although a strong correlation exists between nodes and edges (r = 0.67, t(960) = 28.23, p < 0.0001), in only two instances, AUI and UIV, did
nodes account for some portion of variance. As nodes increased, average unigram
information, along with average unigram variability, decreased. However, only
the relationship between nodes and average unigram information was significantly
different between the true network and the random review baseline. In all cases,
save conditional information variability (CIV), a significant proportion of variance
in information measures was accounted for by edges, and in all but length and CIV,
the relationship between information measures and edges was significantly different
between the true network and the random review baseline (Figure 5.3(a) presents
an example interaction plot between ACI, edges and true_baseline measures).
Finally, the Gini coefficient accounted for a significant portion of variance for
all information measures, but only for RI-Ent and AUI did it have a significantly
different relationship between the true network and the random review baseline.
TABLE 5.4 Information-theoretic measures predicted by complex network measures. Only the mean and 99.9 percent confidence intervals for each IV with p < 0.001 are presented. All reported values were significant (p < 0.001). The "†" symbol denotes all network effects significantly different from baseline network effects.

FIGURE 5.3 Network measures for true and baseline networks across conditional information density. Panels plot residual ACI against edges, residual degree, and residual transitivity for baseline and true networks. All plots show significant interactions for variables in true networks compared to baseline networks. Linear regression models with interaction terms were used in R: lm(DV∼IV+true_baseline+IV*true_baseline).

One explanation is that more unique language use may naturally occur when more individuals contribute more evenly to the conversation. Another possibility is that networks with less even review distributions are more likely to have more reviews, suggesting that a larger number of reviewers' language use is
more representative of the overall linguistic distribution of reviews. A simple linear regression analysis reveals the Gini coefficient accounts for a significant portion of variance in the total number of reviews in each network (adjusted R2 = 0.21, F[1, 960] = 249.6, p < 0.001, 99.9 percent CI [0.25, 0.38]), increasing as the number of reviews increases.
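One way to compute the Gini coefficient over per-reviewer review counts and relate it to a network's total number of reviews is sketched below; review_counts and net_df are hypothetical objects, and the implementation in the original analysis may differ in detail.

    # Gini coefficient of review counts across reviewers in one network:
    # mean absolute difference between all pairs, divided by twice the mean.
    gini_coef <- function(counts) {
      n <- length(counts)
      sum(outer(counts, counts, function(x, y) abs(x - y))) / (2 * n^2 * mean(counts))
    }

    # `review_counts` is assumed to be a list with one numeric vector of
    # per-reviewer review counts for each of the 962 sampled networks.
    net_df$gini          <- sapply(review_counts, gini_coef)
    net_df$total_reviews <- sapply(review_counts, sum)
    summary(lm(total_reviews ~ gini, data = net_df))  # cf. adjusted R2 = 0.21 above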
We interpret these results cautiously, given that this is a first step toward understanding what aspects of a network relate to language use. The results
suggest changes in population size and connectivity occur alongside changes in
the structure of language use. Speculatively, the individual language user may be influenced by the size and connectivity of his or her network. When the size of the network increases, the words used may be more frequent. However, when connectivity increases, the words used may be of lower frequency, and therefore more information dense. This supports current work that shows how
a shared common ground may lead to an increase in information-dense word use
(Doyle & Frank, submitted). This is further explored in the discussion.
Although we find significant effects, it remains unclear how network size and connectivity influence information density and channel capacity, and how different ways of interpreting information (as we have done here) interact with simple network measures. Generally, these results suggest that word choice may relate to social-network parameters.
Complex Measures
The complex network measures (centrality, degree, and scale-free) were log transformed for all analyses due to heavy-tailed distributions. Given the larger number of
network variables and their use of similar network properties such as number
of nodes or edges, it is possible that some complex network measures will be
correlated. To avoid any variance inflation that may occur when using multiple predictors, we determined which variables were collinear using a variance inflation factor (VIF) analysis in R. We first used nodes and edges to predict the variance of each complex network measure, factored this variance out by taking the residuals of each model, and then used the vif function from the R library car to determine which complex network measures exhibited collinearity. Using a conservative VIF threshold of five or less (see Craney & Surles, 2002; Stine, 1995 for review), we determined that no complex network measure used in our model was at risk of collinearity that would have seriously inflated the variance.6 All VIF scores
were under the conservative threshold for all complex network variables and are
therefore not reported. Residuals of complex network measures, having factored
out any variance accounted for by nodes and edges, were used to predict each
information-theoretic measure presented in Table 5.4.
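The collinearity check can be sketched as follows, again with illustrative column names of our own choosing; the vif function is provided by the car package cited in the text.

    library(car)  # provides vif()

    # Residualize each complex measure on nodes and edges, as described above.
    net_df$degree_r       <- resid(lm(log(degree)     ~ nodes + edges, data = net_df))
    net_df$centrality_r   <- resid(lm(log(centrality) ~ nodes + edges, data = net_df))
    net_df$transitivity_r <- resid(lm(transitivity    ~ nodes + edges, data = net_df))

    # Variance inflation factors for the residualized complex measures entered
    # together; values below the conservative threshold of 5 are acceptable.
    vif(lm(aui_resid ~ degree_r + centrality_r + transitivity_r, data = net_df))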
One or more complex measures accounted for a significant proportion of
variance in each information density measure. Certain trends are readily observed
across these models. Specifically, word length increased as network transitivity and
centrality increased and decreased as network betweenness increased; however, no
network measure effects were significantly different from random review baseline
effects (significance marked by the "†" symbol in Table 5.4). Additionally, RI-Ent, AUI, and ACI increased as network degree increased, with network degree accounting for approximately 5–10 percent of the variance in each measure. The relationship between network degree and the corresponding information measures in true networks was significantly different from baseline. This was also the case for network centrality for AUI and for network transitivity for ACI. Figure 5.3 presents interaction plots for residual ACI by degree (b) and residual transitivity (c) between true and random review baseline networks. Complex network measures did not share a significant
relationship with UIV or CIV.
It is clear that certain network structures predict differences in information
density measures even after stringent controls were applied to both information
and network measures. Specifically, support for higher information-dense messages
may be the result of networks that exhibit high global connectivity driven by
increases in specific network properties, namely, network degree and centrality.
This further supports previous work showing that a shared common ground may
bring about higher information-dense language use. Specifically, networks that
exhibit a more centralized nucleus and are more densely connected (higher degree)
may be more likely to share a similar common ground among many members of
the group. If so, a shared common ground may result in more unique language
use. Yet, networks that exhibit close, niche-like groupings, exemplified by high network transitivity, may infect their members with the same vocabulary, decreasing the overall variability in language use. Further analysis is necessary to unpack the
relationship that different social-network structures have with language use.
Discussion
We find pervasive relationships between language patterns, as expressed in
information-theoretic measures of review content, and social-network variables,
even after taking care to control for collinearity. The main findings are listed here:
The predictions laid out at the end of the Methods section are somewhat borne out. Scale-free networks do not appear to have a strong relationship with information-theoretic scores, but networks that exhibit higher transitivity do lead to lower information-dense bigrams (though not significant for any other
General Discussion
In this chapter we explored how language use might relate to social structure. We
built 962 social networks from over 200,000 individuals who collectively wrote
over one million online customer business reviews. This massive, structured dataset
allowed testing how language use might adapt to structural differences in social
networks. Utilizing Big Data in this way affords assessing how differences in one’s
local social environment might relate to language use and communication more generally.

FIGURE 5.4 Yelp networks occurring at the tails of certain complex network measure distributions (as specified in the text), presenting ideal conditions for language use exhibiting high (a), middle (b), and low (c) average conditional information (ACI).

Our findings suggest that as the connectivity of a population increases,
speakers use words that are less common. Complex network variables such as the
edge-to-node ratio, network centrality, and local connectivity (e.g. transitivity) also
predict changes in the properties of words used. The variability of word use was also related to simple network structures. As a first exploration, our findings suggest local social interactions may contribute in interesting ways to
language use. A key strength of using a Big Data approach is in uncovering new
ways to test theoretical questions about cognitive science, and science in general.
Below we discuss how our results fit into the broader theoretical framework of
understanding what shapes language.
When controlling for nodes, edges, and review length, many R2 values in our regression models decreased. However, finding that some variability in language use is accounted for by population connectivity suggests language use may be partly a function of the interactions among individuals. Network degree, centrality, and transitivity all varied in predictable ways with information measures.
Mainly, as the number of connections between nodes increased and as the network
became more centralized the use of less frequent unigrams (AUI) increased.
Interestingly, networks that exhibit high connectivity and greater centrality may
have more long-range connections. A growing number of long-range connections
may lead to the inclusion of individuals that would normally be farther away from
the center of the network. Individuals in a network with these structural properties
may be communicating more collectively, having more readily established a richer
common ground. If so, higher information density is more probable, as the
communication that is taking place can become less generic and more complex.
Additionally, networks with higher local connectivity, or high transitivity, tend
to use more common language, specifically bigrams. This again may be seen
as supporting a theory of common ground, that individuals with more local
connectivity are more likely to communicate using similar terminology, in this
case, bigrams. Using a Big Data approach it is possible to further explore other
structural aspects of one’s network that might influence language use.
While we merely speculate about potential conclusions, it is possible to obtain rough measures of the likelihood of including more individuals at longer ranges. Specifically, a network's diameter, the longest shortest path between any two nodes in the network, may serve as a measure of the distance that a network occupies in socio-cultural space. This may be taken as a rough index of how many strangers are in a network, with longer diameters being commensurate with the inclusion of more strangers.
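If the sampled networks are available as graph objects, the diameter measure suggested here is straightforward to compute; the sketch below uses the igraph package, which is our assumption rather than part of the original pipeline.

    library(igraph)

    # Diameter of each sampled network: the longest shortest path between any
    # two nodes, used here as a rough proxy for socio-cultural reach.
    # `networks` is assumed to be a list of igraph objects, one per sampled network.
    net_df$diameter <- sapply(networks, function(g) diameter(g, directed = FALSE))

    # One could then ask, for example, whether diameter relates to the
    # residualized information measures (aci_resid is assumed to exist, as above):
    summary(lm(aci_resid ~ diameter, data = net_df))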
It may be fruitful to explore the impact of a single individual on a network’s
language use. We do not yet explore processes at the individual level, opting
instead to sample networks and explore their aggregate linguistic tendencies.
Understanding the specifics of individual interaction may be crucial toward
understanding how and why languages adapt. We took an exploratory approach
and found general support for the idea that network structure influences certain
aspects of language use, but we did not look for phonological or syntactic patterns;
in fact our analysis could be regarded as a preliminary lexical distribution analysis. However, information theory finds fruitful application in quantifying massive text-based datasets and has been touted as foundational in an emerging
understanding of language as an adaptive and efficient communicative system
(Jaeger, 2010; Moscoso del Prado Martín, Kostić, & Baayen, 2004). In addition, previous work investigating the role of individual differences in structuring one's network is important to consider. For instance, differences in personality, such as being extroverted or introverted, are related to specific network-level differences (Kalish & Robins, 2006). It is open to further exploration as to how information
flows take place in networks, such as through hubs and other social processes.
Perhaps tracing the origin of the network by determining the oldest reviews of
the network and comparing these to the network’s average age may provide insight
into the importance of how certain individuals or personalities contribute to the
network’s current language use.
We see the current results as suggestive of an approach toward language as an
adaptive and complex system (Beckner et al., 2009; Lupyan & Dale, in press). Our
findings stand alongside previous research that reveals some aspect of the structure
of language adapts to changes in one’s socio-cultural context (Klingenstein et al.,
2014; Kramer et al., 2014; Lupyan & Dale, 2010; Vilhena et al., 2014). Since
evolution can be thought of as the aggregation of smaller adaptive changes taking
place from one generation to the next, finding differences in language within
social networks suggests languages are adaptive, more in line with shifts in social
and cultural structure than genetic change (Gray & Atkinson, 2003; cf. Chater
et al., 2008). The results of this study suggest that general language adaptation may
occur over shorter time scales, in specific social contexts, that could be detected
in accessible Big Data repositories (see, e.g., Stoll, Zakharko, Moran, Schikowski, & Bickel, 2015). The space of communicable ideas may be more
dynamic, adapting to both local and global constraints at multiple scales of time.
A deeper understanding of why language use changes may help elucidate what
ideas can be communicated when and why. The application of sampling local
social networks provides one method toward understanding what properties of a
population of speakers may relate to language change over time—at the very least,
as shown here, in terms of general usage patterns.
Testing how real network structures influence language use is not possible
without large amounts of data. The network sampling technique used here allows
smaller networks to be sampled within a much larger social-network structure. The
use of Big Data in this way provides an opportunity to measure subtle and intricate
features whose impacts may go unnoticed in smaller-scale experimental datasets.
Still, we would of course recommend interpreting initial results cautiously. The
use of Big Data can provide further insight into the cognitive factors contributing
to behavior, but can only rarely be used to test for causation. To this point, one
major role the use of Big Data plays in cognitive science, and one we emphasize
here, is its ability to provide a sense of direction and a series of new hypotheses.
Acknowledgments
We would like to thank reviewers for their helpful and insightful commentary. This
work was supported in part by NSF grant INSPIRE-1344279 and an IBM PhD
fellowship awarded to David W. Vinson for the 2015-16 academic year.
Notes
1 This description involves some convenient simplification. Some abstract and
genetic notions of language also embrace ideas of adaptation (Pinker &
Bloom, 1990), and other sources of theoretical subtlety render our description
of the two traditions an admittedly expository approximation. However,
the distinction between these traditions is stark enough to warrant the
approximation: The adaptive approach sees all levels of language as adaptive
across multiple time scales, whereas more fixed, abstract notions of language see
it as only adaptive in a restricted range of linguistic characteristics.
2 Massive online sites capable of collecting terabytes of metadata per day have only
emerged in the last 10 years: Google started in 1998; Myspace 2003; Facebook
2004; Yelp 2004; Google+ 2011. Volume, velocity, and variety of incoming
data are thought to be the biggest challenges toward understanding Big Data
today (McAfee, Brynjolfsson, Davenport, Patil, & Barton, 2012).
3 Previous research calls this Information Density and uses this as a measure
of Uniform Information Density. We use the name Average Conditional
Information given the breadth of information-theoretic measures used in this
study.
4 Note: AUI and ACI were calculated by taking only the unique n-grams.
5 Our approach toward controlling for collinearity by residualizing variables
follows that of previous work (Jaeger, 2010). However, it is important to note
the process of residualizing to control for collinearity is currently in debate (see
Wurm & Fisicaro, 2014 for review). It is our understanding that the current
stringent use of this method is warranted provided it stands as a first pass toward
understanding how language use is influenced by network structures.
6 The variance inflation acceptable for a given model is thought to be somewhere
between five and ten (Craney & Surles, 2002). After the variance predicted by
nodes and edges was removed from our analyses, no complex network measure
reached the variance inflation threshold of five.
References
Abrams, D. M., & Strogatz, S. H. (2003). Linguistics: Modelling the dynamics of language
death. Nature, 424(6951), 900.
Aylett, M. P. (1999). Stochastic suprasegmentals: Relationships between redundancy,
prosodic structure and syllabic duration. Proceedings of ICPhS–99, San Francisco.
Baronchelli, A., Ferrer-i-Cancho, R., Pastor-Satorras, R., Chater, N., & Christiansen,
M. H. (2013). Networks in cognitive science. Trends in Cognitive Sciences, 17(7), 348–360.
Beckner, C., Blythe, R., Bybee, J., Christiansen, M. H., Croft, W., Ellis, N. C.,
. . . Schoenemann, T. (2009). Language is a complex adaptive system: Position paper.
Language learning, 59(s1), 1–26.
Bentz, C., Verkerk, A., Kiela, D., Hill, F., & Buttery, P. (2015). Adaptive communication:
Languages with more non-native speakers tend to have fewer word forms. PLoS One,
10(6), e0128254.
Bentz, C., & Winter, B. (2013). Languages with more second language speakers tend to
lose nominal case. Language Dynamics and Change, 3, 1–27.
Bond, R. M., Fariss, C. J., Jones, J. J., Kramer, A. D., Marlow, C., Settle, J. E., &
Fowler, J. H. (2012). A 61-million-person experiment in social influence and political
mobilization. Nature, 489(7415), 295–298.
Bullmore, E., & Sporns, O. (2009). Complex brain networks: Graph theoretical analysis of
structural and functional systems. Nature Reviews Neuroscience, 10(3), 186–198.
Chater, N., Reali, F., & Christiansen, M. H. (2009). Restrictions on biological adaptation
in language evolution. Proceedings of the National Academy of Sciences, 106(4), 1015–1020.
Choi, H. Y., Blumen, H. M., Congleton, A. R., & Rajaram, S. (2014). The role of
group configuration in the social transmission of memory: Evidence from identical and
reconfigured groups. Journal of Cognitive Psychology, 26(1), 65–80.
Chomsky, N. (1957). Syntactic structures. The Hague: Mouton.
Christiansen, M. H., & Chater, N. (2008). Language as shaped by the brain. Behavioral and
Brain Sciences, 31(5), 489–509.
Christakis, N. A., & Fowler, J. H. (2009). Connected: The surprising power of our social networks
and how they shape our lives. New York, NY: Little, Brown.
Craney, T. A., & Surles, J. G. (2002). Model-dependent variance inflation factor cutoff
values. Quality Engineering, 14(3), 391–403.
Dale, R., & Lupyan, G. (2012). Understanding the origins of morphological diversity: The
linguistic niche hypothesis. Advances in Complex Systems, 15, 1150017/1–1150017/16.
Dale, R., & Vinson, D. W. (2013). The observer’s observer’s paradox. Journal of
Experimental & Theoretical Artificial Intelligence, 25(3), 303–322.
Dodds, P. S., Watts, D. J., & Sabel, C. F. (2003). Information exchange and the robustness
of organizational networks. Proceedings of the National Academy of Sciences, 100(21),
12516–12521.
Doyle, G., & Frank, M. C. (2015). Shared common ground influences information density
in microblog texts. In Proceedings of NAACL-HLT (pp. 1587–1596).
Ember, C. R., & Ember, M. (2007). Climate, econiche, and sexuality: Influences on
sonority in language. American Anthropologist, 109(1), 180–185.
Everett, C. (2013). Evidence for direct geographic influences on linguistic sounds: The case
of ejectives. PLoS One, 8(6), e65275.
Freeman, L. C. (1979). Centrality in social networks: Conceptual clarification. Social
Networks, 1(3), 215–239.
Gordon, R. G. (2005). Ethnologue: Languages of the World, 15th Edition. Dallas, TX: SIL
International.
Gray, R. D., & Atkinson, Q. D. (2003). Language-tree divergence times support the
Anatolian theory of Indo-European origin. Nature, 426(6965), 435–439.
Griffiths, T. L. (2015). Manifesto for a new (computational) cognitive revolution. Cognition,
135, 21–23.
Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The faculty of language: What is it,
who has it, and how did it evolve? Science, 298(5598), 1569–1579.
Jaeger, F. T. (2010). Redundancy and reduction: Speakers manage syntactic information
density. Cognitive Psychology, 61(1), 23–62.
Jurafsky, D., Chahuneau, V., Routledge, B. R., & Smith, N. A. (2014). Narrative framing
of consumer sentiment in online restaurant reviews. First Monday, 19(4).
Kalish, Y., & Robins, G. (2006). Psychological predispositions and network structure: The
relationship between individual predispositions, structural holes and network closure.
Social Networks, 28(1), 56–84.
Klingenstein, S., Hitchcock, T., & DeDeo, S. (2014). The civilizing process in London’s
Old Bailey. Proceedings of the National Academy of Sciences, 111(26), 9419–9424.
Kramer, A. D., Guillory, J. E., & Hancock, J. T. (2014). Experimental evidence of
massive-scale emotional contagion through social networks. Proceedings of the National
Academy of Sciences, 111(14), 8788–8790.
Kudyba, S., & Kwatinetz, M. (2014). Introduction to the Big Data era. In S. Kudyba (Ed.),
Big Data, Mining, and Analytics: Components of Strategic Decision Making (pp. 1–15). Boca
Raton, FL: CRC Press.
Labov, W. (1972a). Language in the inner city: Studies in the Black English vernacular
(Vol. 3). Philadelphia, PA: University of Pennsylvania Press.
Labov, W. (1972b). Sociolinguistic patterns (No. 4). Philadelphia, PA: University of
Pennsylvania Press.
Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., & Van Alstyne,
M. (2009). Life in the network: The coming age of computational social science. Science,
323(5915), 721.
Levy, R., & Jaeger, T. F. (2007). Speakers optimize information density through syntactic
reduction. In B. Schlökopf, J. Platt, and T. Hoffman (Eds.), Advances in neural information
processing systems (NIPS) 19, pp. 849–856. Cambridge, MA: MIT Press.
Lieberman, E., Michel, J. B., Jackson, J., Tang, T., & Nowak, M. A. (2007). Quantifying
the evolutionary dynamics of language. Nature, 449(7163), 713–716.
Lupyan, G., & Dale, R. (2010). Language structure is partly determined by social structure.
PLoS One, 5(1), e8559.
Lupyan, G., & Dale, R. (2015). The role of adaptation in understanding linguistic
diversity. In R. LaPolla, & R. de Busser (Eds.), The shaping of language: The relationship
between the structures of languages and their social, cultural, historical, and natural environments
(pp. 289–316). Amsterdam, The Netherlands: John Benjamins Publishing Company.
McAfee, A., Brynjolfsson, E., Davenport, T. H., Patil, D. J., & Barton, D. (2012). Big Data:
The management revolution. Harvard Business Review, 90(10), 61–67.
Moscoso del Prado Martín, F., Kostić, A., & Baayen, R. H. (2004). Putting the bits together:
An information theoretical perspective on morphological processing. Cognition, 94(1),
1–18.
Nettle, D. (1998). Explaining global patterns of language diversity. Journal of Anthropological
Archaeology, 17(4), 354–374.
Nichols, J. (1992). Linguistic diversity in space and time. Chicago, IL: University of Chicago
Press.
Nowak, M. A., Komarova, N. L., & Niyogi, P. (2002). Computational and evolutionary
aspects of language. Nature, 417(6889), 611–617.
Pellegrino, F., Coupé, C., & Marsico, E. (2011). Across-language perspective on speech
information rate. Language, 87(3), 539–558.
Piantadosi, S. T., Tily, H., & Gibson, E. (2011). Word lengths are optimized for efficient
communication. Proceedings of the National Academy of Sciences, 108(9), 3526–3529.
Pinker, S., & Bloom, P. (1990). Natural language and natural selection. Behavioral and Brain
Sciences, 13(4), 707–727.
Qian, T., & Jaeger, T. F. (2012). Cue effectiveness in communicatively efficient discourse
production. Cognitive Science, 36(7), 1312–1336.
Ramscar, M. (2013). Suffixing, prefixing, and the functional order of regularities in
meaningful strings. Psihologija, 46(4), 377–396.
Reali, F., Chater, N., & Christiansen, M. H. (2014). The paradox of linguistic complexity
and community size. In E. A. Cartmill, S. Roberts, H. Lyn & H. Cornish (Eds.),
The evolution of language. Proceedings of the 10th International Conference (pp. 270–277).
Singapore: World Scientific.
Sapir, E. (1921). Language: An introduction to the study of speech. New York: Harcourt, Brace and Company.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(4), 623–656.
Stine, R. A. (1995). Graphical interpretation of variance inflation factors. The American
Statistician, 49(1), 53–56.
Stoll, S., Zakharko, T., Moran, S., Schikowski, R., & Bickel, B. (2015). Syntactic
mixing across generations in an environment of community-wide bilingualism. Frontiers
in Psychology, 6, 82.
Triandis, H. C. (1994). Culture and social behavior. New York, NY: McGraw-Hill Book
Company.
Trudgill, P. (1989). Contact and isolation in linguistic change. In L. Breivik & E. Jahr (Eds.),
Language change: Contribution to the study of its causes (pp. 227–237). Berlin: Mouton de
Gruyter.
Trudgill, P. (2011). Sociolinguistic typology: Social determinants of linguistic complexity. Oxford,
UK: Oxford University Press.
Vilhena, D. A., Foster, J. G., Rosvall, M., West, J. D., Evans, J., & Bergstrom, C. T.
(2014). Finding cultural holes: How structure and culture diverge in networks of scholarly
communication. Sociological Science, 1, 221–238.
Vinson, D. W., & Dale, R. (2014a). An exploration of semantic tendencies in word of
mouth business reviews. In Proceedings of the Science and Information Conference (SAI), 2014
(pp. 803–809). IEEE.
Vinson, D. W., & Dale, R. (2014b). Valence weakly constrains the information density of
messages. In Proceedings of the 36th Annual Conference of the Cognitive Science Society (pp.
1682–1687). Austin, TX: Cognitive Science Society.
Wray, A., & Grace, G. W. (2007). The consequences of talking to strangers: Evolutionary
corollaries of socio-cultural influences on linguistic form. Lingua, 117(3), 543–578.
Wurm, L. H., & Fisicaro, S. A. (2014). What residualizing predictors in regression analyses
does (and what it does not do). Journal of Memory and Language, 72, 37–48.
6
MUSIC TAGGING AND LISTENING
Testing the Memory Cue Hypothesis
in a Collaborative Tagging System
Abstract
As an example of exploring human memory cue use in an ecologically valid context, we
present ongoing work to examine the “memory cue hypothesis” in collaborative tagging. In
collaborative tagging systems, which allow users to assign freeform textual labels to digital
resources, it is generally assumed that tags function as memory cues that facilitate future
retrieval of the resources to which they are assigned. There is, however, little empirical
evidence demonstrating that this is in fact the case. Employing large-scale music listening
and tagging data from the social music website Last.fm as a case study, we present a set of
time series and information theoretic analytic methods we are using to explore how patterns
of content tagging and interaction support or refute the hypothesis that tags function as
retrieval cues. Early results are, on average, consistent with the hypothesis. There is an
immediate practical application of this work to those working with collaborative tagging
systems (are user motivations what we think they are?), but our work also comprises
contributions of interest to the cognitive science community: First, we are expanding our
understanding of how people generate and use memory cues “in the wild.” Second, we
are enriching the “toolbox” available to cognitive scientists for studying cognition using
large-scale, ecologically valid data that is latent in the logged activity of web users.
Introduction
Humans possess a unique capacity to manipulate the environment in the pursuit
of goals. These goals can be physical (building shelter, creating tools, etc.), but
also informational, such as when we create markers to point the way along a path
or leave a note to ourselves as a reminder to pick up eggs from the market. In
the informational case, the creation of reminders or pointers in the environment
functions as a kind of cognitive offloading, enriching our modes of interaction with
the environment while requiring reduced internal management of information.
Background
What is Collaborative Tagging?
In collaborative tagging, many individuals assign freeform metadata in the form
of arbitrary strings (tags) to resources in a shared information space. These
resources can, in principle, be any digital object, and web services across a wide
variety of domains implement tagging features. Examples include web bookmarks
(Delicious.com, Pinboard.in), music (Last.fm), photos (Flickr.com, 500px.com),
academic papers (academia.edu, mendeley.com), books (LibraryThing.com), and
many others. When many users engage in tagging of a shared corpus of content,
the emergent semantic structure is known as a folksonomy, a term defined by
Thomas Vander Wal as a “user-created bottom-up categorical structure . . . with
an emergent thesaurus” (Vander Wal, 2007). Under his terminology, a folksonomy
can either be broad, meaning many users tag the same, shared resources, or narrow,
in which any given resource tends to be tagged by only one user (usually the
content creator or uploader). Last.fm, on which we are performing our analyses,
is a canonical example of the former, and Flickr, where users upload and tag their
own photos, is a good example of the latter.
Folksonomies have been lauded as a radical new approach to content
classification (Heckner, Mühlbacher, & Wolff, 2008; Shirky, 2005; Sterling,
2005; Weinberger, 2008). In principle, they leverage the “wisdom of the
crowds” to generate metadata both more flexibly (multiple classification of content
is built in to the system) and at lower economic cost (individual users are,
generally, self-motivated and uncompensated) than in traditional, expert, or
computer-generated taxonomies, as one might find in a library. The approach
is not uncontroversial, however, with critics from library science in particular
(Macgregor & McCulloch, 2006) pointing out the difficulties that the wholly
uncontrolled vocabularies of folksonomies can introduce (especially poor handling
of homonyms and hierarchical relationships between tags). In broad folksonomies,
the existence of social imitation effects (Floeck, Putzke, Steinfels, Fischbach, &
Schoder, 2011; Lorince & Todd, 2013) can also cast doubt on whether agreement
as to how an item ought to be tagged reflects true consensus, or instead bandwagon
effects that do not “correctly” categorize the item. Given our current focus
on individuals’ tagging motivations, the level of efficacy of tagging systems for
collective classification is not something we address here.
Hotho, Jäschke, Schmitz, & Stumme (2006a) formally define a folksonomy as a tuple F := (U, T, R, Y),1 where U, T, and R are finite sets representing, respectively, the set of all unique users, tags, and resources in the tagging system. Y is a ternary relation between them (Y ⊆ U × T × R), representing the set of tag assignments (or, equivalently, annotations) in the folksonomy (i.e. instances of a particular user assigning a particular tag to a particular resource). They also define the personomy of a particular user u, P_u := (T_u, R_u, Y_u), which is simply the subset of F corresponding to the tagging activity of a single user.
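To make the tuple structure concrete, a folksonomy can be stored as a simple table of annotations, from which a single user's personomy is just a row subset; the toy values below are purely illustrative.

    # Y: the set of tag assignments, one row per (user, tag, resource) triple.
    annotations <- data.frame(
      user     = c("u1", "u1", "u2", "u3"),
      tag      = c("rock", "favorites", "rock", "jazz"),
      resource = c("artistA", "artistA", "artistB", "artistC"),
      stringsAsFactors = FALSE
    )

    # U, T, and R: the finite sets of unique users, tags, and resources.
    U     <- unique(annotations$user)
    T_set <- unique(annotations$tag)       # named T_set to avoid masking R's T
    R_set <- unique(annotations$resource)

    # The personomy of user u: the subset of the folksonomy involving only u.
    personomy <- function(u) annotations[annotations$user == u, ]
    personomy("u1")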
Collaborative tagging systems began to be developed in the early 2000s, with
the launch of the social bookmarking tool Delicious in 2003 marking the first
to gain widespread popularity. Three years later, Golder and Huberman’s (2006)
seminal paper on the stabilization of tag distributions on Delicious sparked interest
in tagging as an object of academic interest. In the years since, a substantial
literature on the dynamics of tagging behavior has developed. Research has
covered topics as diverse as the relationship between social ties and tagging
habits (Schifanella, Barrat, Cattuto, Markines, & Menczer, 2010), vocabulary
evolution (Cattuto, Baldassarri, Servedio, & Loreto, 2007), mathematical and
multi-agent modeling of tagging behaviors (Cattuto, Loreto, & Pietronero, 2007;
Lorince & Todd, 2013), identification of expert taggers (Noll, Au Yeung, Gibbins,
Meinel, & Shadbolt, 2009; Yeung, Noll, Gibbins, Meinel, & Shadbolt, 2011),
emergence of consensus among taggers (Halpin, Robu, & Shepherd, 2007; Robu,
Halpin, & Shepherd, 2009), and tag recommendation (Jäschke, Marinho, Hotho,
Schmidt-Thieme, & Stumme, 2007; Seitlinger, Ley, & Albert, 2013), among
others.
This small sample of representative work is indicative of the fact that, at least
at the aggregate level, researchers have a fairly good idea of how people tag. What
is comparatively poorly understood (and relevant to our purposes here) is exactly
why users tag.
and Golder & Huberman, 2006), and it is one generally in line with the design
goals of tagging systems. Although tagged content can be used in various ways
beyond retrieval, such as resource discovery and sharing, the immediate motivation
for a user to tag a given item is most often assumed (not illogically) to be an information organization and retrieval goal. This is not to imply
that other tagging objectives, such as social sharing, are necessarily illogical, only
that they are less often considered primary motivators of tagging choices. Such
retrieval goals are implemented in tagging systems by allowing users to use tags
as search keywords (returning items labeled with a particular tag from among a
user’s own tagged content, or the global folksonomy) and by allowing them to
directly browse the tags they or others have generated. On Last.fm, for example,
a user can click on the tag “rock” on the tag listing accessible from his or her
profile page, and view all the music to which he or she has assigned the tag
“rock.”
While our current goal is to test whether this assumption holds when
considering users’ histories of item tagging and interaction, it is important to
recognize that alternative motivations for tagging can exist. Gupta, Li, Yin, & Han
(2010), for instance, posit no fewer than nine possible reasons, beyond future retrieval,
for which a user might tag: Contribution and sharing, attracting attention to one’s
own resources, play and competition, self-presentation, opinion expression, task
organization, social signaling, earning money, and “technological ease” (i.e. when
software greatly reduces the effort required to tag content). We will not analyze
each of these motivational factors in depth, but present the list in its entirety
to make clear that tagging motivation can extend well beyond a pure retrieval
function. We do, however, briefly review the most well-developed theories of tag
motivation in the literature.
What is likely to be the most critical distinction in a user’s decision to tag
a resource is the intended audience of the tag, namely whether it is self- or
other-directed. This distinction maps onto what Heckner et al. (2008) refer to
as PIM (personal information management) and resource sharing. The sort of
self-generated retrieval cues that interest us here fall under the umbrella of PIM,
while tags generated for the purposes of resource sharing are intended to help
other people find tagged content. For example, a user may apply tags to his or her
Flickr photos that serve no personal organizational purpose, but are intended to
make it easier for others to discover his or her photos. Social motivations can be
more varied, however. Zollers (2007), for instance, argues that opinion expression,
performance, and activism are all possible motivations for tagging. Some systems
also implement game-like features to encourage tagging (Weng & Menczer, 2010;
Weng, Schifanella, & Menczer, 2011; Von Ahn & Dabbish, 2004) that can invoke
socially directed motivations.
Ames and Naaman (2007) present a two-dimensional taxonomy of tagging
motivation, dividing motivation not only along dimensions of sociality (like
Heckner et al. 2008), but also a second, functional dimension. Under their terminology, tags can be either organizational or communicative. When self-directed,
organizational tags are those used for future retrieval, while communicative tags
provide contextual information about a tagged resource, but are not intended to
aid in retrieval. Analogously, social tags can either be intended to help other users
find a resource (organizational) or communicate information about a resource once
it is found (communicative).
While all of these theories of tagging motivation appear reasonable (to varying
degrees), there is little in the way of empirically rigorous work demonstrating that
user tagging patterns actually align with them. The most common methods for
arriving at such taxonomies are examining the interface and features of tagging
systems to infer how and why users might tag (e.g. in a system where a user can
only see his or her own tags, social factors are likely not to be at play, see Marlow,
Naaman, Boyd, & Davis, 2006), semantic analysis and categorization of tags (e.g.
“to read” is likely to be a self-directed organizational tag, while tagging a photo
with one’s own username is likely to be a socially directed tag suggesting a variety
of self-advertisement, see Sen et al., 2006; Zollers, 2007), and qualitative studies in
which researchers explicitly ask users why they tag (e.g. Ames & Naaman, 2007;
Nov, Naaman, & Ye, 2008). All of these methods certainly provide useful insights
into why people tag, but none directly measure quantitative signals of any proposed
motivational factor. One notable exception to this trend is the work of Körner
and colleagues (Körner, Benz, Hotho, Strohmaier, & Stumme, 2010; Körner,
Kern, Grahsl, & Strohmaier, 2010; Zubiaga, Körner, & Strohmaier, 2011), who
propose that taggers can be classified as either categorizers (who use constrained tag
vocabularies to facilitate later browsing of resources) or describers (who use broad,
varied vocabularies to facilitate later keyword-based search over resources). They
then develop and test quantitative measures that, they hypothesize, should indicate
that a user is either a categorizer or describer. Although Körner and colleagues
are able to classify users along the dimensions they define, they cannot know whether describers actually use their tags for search, or whether categorizers use them for browsing. This is a problem pervasive in work on tagging motivation (for lack of
the necessary data, as we will discuss below); there is typically no way to verify that
users actually use the tags they have applied in a manner consistent with a given
motivational framework.
memory aids that gained popularity in the 1970s (Higbee, 1979). Although most
work has focused on internal memory aids (e.g. rhyming, rehearsal strategies, and
other mnemonics), some researchers have explored the use of external aids, which
are typically defined as “physical, tangible memory prompts external to the person,
such as writing lists, writing on one’s hand, and putting notes on a calendar” (Block
& Morwitz, 1999, p. 346). We of course take the position that digital objects, too,
can serve as memory cues, and some early work (Harris, 1980; Hunter, 1979;
Intons-Peterson & Fournier, 1986) was sensitive to this possibility long before
tagging and related technologies were developed.
The work summarized above, although relevant, provides little in the way
of testable hypotheses with respect to how people use tags. Classic research on
human memory—specifically on so-called cued recall—can offer such concrete
hypotheses. If the conceptualization of tags as memory cues is a valid one,
we would expect users’ interaction with them to conform, to at least some
degree, with established findings on cued retrieval of memories. The literature
on cued recall is too expansive and varied to succinctly summarize here (see
Kausler & Kausler, 1974 for a review of classic work), but broadly speaking
describes scenarios in which an individual is presented with target items (most
typically words presented on a screen) and associated cues (also words, generally
speaking), and is later tested on his or her ability to remember the target items
when presented with the previously learned cues. The analog to tagging is that
tags themselves function as cues, and are associated with particular resources that
the user wishes to retrieve (recall) at a later time. The scenarios, of course, are
not perfectly isomorphic. While in a cued-recall context, a subject is presented
with the cue, and must retrieve from memory the associated item(s), in a tagging
context the user may often do the opposite, recalling the cue, which triggers
automatic retrieval (by the tagging system) of the associated items “for free” with
no memory cost to the user. Furthermore, it is likely to be true in many cases
that a user may not remember the specific items they have tagged with a given
term at all. Instead, a tag might capture some relevant aspect of the item it is
assigned to, such that it can serve to retrieve a set of items sharing that attribute
(with no particular resource being sought). As an example, a user might tag upbeat,
high-energy songs with the word “happy,” and then later use that tag to listen to
upbeat, happy songs. In such a case, the user may have no particular song in mind
when using the tag for retrieval, as would be expected in a typical cued-recall
scenario.
These observations reveal that, even when assuming tags serve a retrieval
function, how exactly that function plays out in user behavior can take various
forms. Nonetheless, we take the position that an effective tag—if and when that
tag serves as retrieval cue—should share attributes of memory cues shown to be
effective in the cued recall literature. In particular, we echo Earhard’s (1967) claim
that “the efficiency of a cue for retrieval is dependent upon the number of items
for which it must act, and that an efficient strategy for remembering must be some
compromise between the number of cues used and the number of items assigned
to each cue” (p. 257). We base this on the assumption that tags, whether used for
search, browsing, or any other retrieval-centric purpose, still serve as cue-resource
associates in much the same way as in cued recall research; useful tags should
connect a user with desired resources in a way that is efficient and does not impose
unreasonable cognitive load.
In cases of tagging for future retrieval, this should manifest as a balance between
the number of unique tags (cues) a user employs, and the number of items which
are labeled with each of those tags. Some classic research on cued recall would
argue against such a balancing act, with various studies suggesting that recall
performance reliably increases as a function of cue distinctiveness (Moscovitch &
Craik, 1976). This phenomenon is sometimes explained by the cue-overload effect
(Rutherford, 2004; Watkins & Watkins, 1975), under which increasing numbers
of targets associated with a cue will “overload” the cue such that its effectiveness
for recalling those items declines. In other words, the more distinctive a cue is (in
terms of being associated with fewer items), the better. But when researchers have
considered not only the number of items associated with a cue, but also the total
number of cues a subject must remember, results have demonstrated that at both
extremes—too many distinct cues or too many items per cue—recall performance
suffers. Various studies support this perspective (e.g. Hunt & Seta, 1984; Weist,
1970), with two particularly notable cued recall studies being those by Earhard
(1967), who found recall performance to be an increasing function of the number
of items per cue, but a decreasing function of the total number of cues, and Tulving
& Pearlstone (1966), who found that subjects were able to remember a larger
proportion of a set of cues, but fewer targets per cue, as the number of targets
associated with each cue increased.
Two aspects of tagging for future retrieval that are not well captured by existing
work are (a) the fact that, in tagging, cues are self-generated and (b) differences in
scale (the number of items to be remembered and tags used far exceed, in many
cases by orders of magnitude, the number of cues and items utilized in cued recall
studies). Tullis & Benjamin (2014) have recently begun to explore the question of
self-generated cues in experiments where subjects are explicitly asked to generate
cues for later recall of associated items, and their findings are generally consistent
with the account of cued recall described here. Results suggest that people are
sensitive to the set of items to be remembered in their choice of cues, and that
their choices generally support the view that cue distinctiveness aids in recall. The
issue of scale remains unaddressed, however.
In sum, the case of online tagging has important distinctions from the paradigms
used in cued recall research, but we nonetheless find the cued recall framework to
be a useful one for generating the specific hypotheses we explore below.
The Challenge
As discussed above, there is no shortage of ideas as to why people tag, but actually
finding empirical evidence supporting the prevalent memory cue hypothesis—or
any other possible tagging motivation, for that matter—is difficult. The simple fact
of the matter is that there are plenty of data logging what, when, and with which
terms people tag content in social tagging systems, but to our knowledge there are
no publicly available datasets that reveal how those tags are subsequently used for
item retrieval (or for any other reason). Of the various ways a user might interact
with or be exposed to a tag after he or she has assigned it to an item (either by
using it as a search term, clicking it in a list, simply seeing it onscreen, etc.), none
are open to direct study. This is not impossible in principle, as a web service could
log such information, but such data are not present in publicly available datasets or
possible to scrape from any existing tagging systems.
Thus, we face the problem of making inferences about why a user tagged an
item based only on the history of what, how, and when that user has tagged,
without any ability to test if future use of the tag matches our inferences. It may
seem, then, that survey approaches that directly ask users why they tag might
necessarily be our best option, but we find this especially problematic. Not only
are such self-reported motivations not wholly reliable, but we are also more interested in
whether tags actually function as memory cues than whether users intend to use
them as such. With all this in mind, we now turn to describing the dataset with
which we are currently working, and why we believe it provides a partial resolution
to these challenges.
Dataset
Our current work revolves around data crawled over the course of 2013 and 2014
from the social music website Last.fm. The core functionality of the site (a free
service) is tracking listening habits in a process known as “scrobbling,” wherein
each timestamped, logged instance of listening to a song is a “scrobble.” Listening
data are used to generate music recommendations for users, as well as to connect
them with other users with similar listening habits on the site’s social network.
Listening statistics are also summarized on a user’s public profile page (showing the
user’s recently listened tracks, most listened artists, and so on). Although users can
listen to music on the site itself using its radio feature, they can also track their
listening in external media software and devices (e.g. iTunes, Windows Media
Player, etc.), in which case listening is tracked with a software plugin, as well as
on other online streaming sites (such as Spotify and Grooveshark). Because the
site tracks listening across various sources, we can be confident that we have a
representative—if not complete—record of users’ listening habits.
Last.fm also incorporates tagging features, and users can tag any artist, album,
or song with arbitrary strings. Being a broad folksonomy, multiple users can tag
the same item (with as many distinct tags as they desire), and users can view the
distribution of tags assigned to any given item. In addition to seeing all the tags
that have been assigned to a given item, users are also able to search through their
own tags (e.g. to see all the songs that one has tagged “favorites”) or view the items
tagged with a particular term by the community at large. From there, they can also
listen to collections of music tagged with that term (e.g. on the page for the tag
“progressive metal” there is a link to the “play progressive metal tag”).
The current version of our dataset consists of complete listening and tagging
histories for over 90,000 Last.fm users for the time period of July 2005 through
December 2012, amounting to over 1.6 billion individual scrobbles and nearly
27 million individual annotations (tuples representing a user’s assignment of a
particular tag to a particular item at a particular time). See Table 6.1 for a high-level
summary. All data were collected either via the Last.fm API or direct scraping of
publicly available user profile pages. We originally collected a larger sample of
tagging data from users (approximately 1.9 million), and the data described here
represent the subsample of those for which we have so far collected listening data.
See our previous work using the larger tagging dataset (Lorince & Todd, 2013;
Lorince, Joseph, & Todd, 2015; Lorince, Zorowitz, Murdock, & Todd, 2014) for
technical details of the crawling process.
The value of these data is that they provide not only a large sample of user
tagging decisions, as in many other such datasets, but also patterns of interaction
over time with the items users have tagged. Thus, for any given artist or song2 a
user has listened to, we can determine if the user tagged that same item and when,
permitting a variety of analyses that explore the interplay between interaction with
an object (in our case, by listening to it) and tagging it. This places us in a unique
position to test if tagging a resource affects subsequent interaction with it in a way
consistent with the memory cue hypothesis.
We of course face limitations. While these data present a new window on our
questions of interest, they cannot establish a causal relationship between tagging
and any future listening, and there may be peculiarities of music listening that limit
the applicability of any findings to other tagging domains (e.g. web bookmarks,
photos, etc.). Nonetheless, we find ourselves in a unique position to examine the
complex interplay between music tagging and listening that can provide insight
into whether or not people tag for future retrieval, and tagging motivation more
generally.
Hypotheses
As we clearly cannot measure motivation directly, we seek to establish a set of
anticipated relationships between music tagging and listening that should hold if
the memory cue hypothesis is correct, or at least in a subset of cases in which
it applies. The overarching prediction of the memory cue hypothesis is that tags
facilitate re-finding music in the future, which should manifest here as increased
levels of listening to tagged music than we would find in the absence of tagging.
Here we outline two concrete hypotheses:
Hypothesis 1. If a user tags an item, this should increase the probability that the user
listens to it in the future. Specifically, assignment of tags to a particular artist/song
should correlate with greater rates of listening to that artist/song later.
If tagging does serve as a retrieval aid, it should increase the chance that a user
interacts with the tagged resource in the future. We would expect that increases in
tagging an artist, on average, should correlate with and precede increased probability
of listening to that artist. This would suggest that increased tagging is predictive of
future listening, which is consistent with the application of tags facilitating later
retrieval of a resource.
Hypothesis 2. Those tags that are most associated with increased future listening
(i.e. those that most likely function as memory cues) should occupy a “sweet spot”
of specificity that makes them useful as retrieval aids.
Even if the memory cue hypothesis holds, it is presumably the case that not all
tags serve as memory cues. Those that do, as evidenced by a predictive relationship
with future listening, should demonstrate moderate levels of information content
(in the information theoretic sense, Shannon, 1948). A tag that is overly specific
(for example, one that uniquely identifies a particular song) is likely to be of little
use in most cases,3 as the user may as well recall the item directly, while one that is
overly broad (one that applies to many different items) is also of little value, for it
picks out too broad a set of items to effectively aid retrieval. Thus we hypothesize
that the specificity of tags (as measured by Shannon entropy) should be more likely
on average to fall in a “sweet spot” between these extremes in those cases where
tagging facilitates future listening.
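As a rough illustration of how this specificity might be operationalized (the chapter does not commit to this exact formula), the Shannon entropy of a tag's distribution of annotations over resources separates overly specific tags (entropy near zero) from overly broad ones (high entropy); the annotations table and column names below are assumptions.

    # Shannon entropy (in bits) of a tag's annotation counts over resources.
    tag_entropy <- function(counts) {
      p <- counts / sum(counts)
      -sum(p * log2(p))
    }

    # One entropy value per tag, given an annotations table with tag and
    # resource columns; zero counts are dropped before computing entropy.
    tag_counts <- table(annotations$tag, annotations$resource)
    entropies  <- apply(tag_counts, 1, function(row) tag_entropy(row[row > 0]))
    head(sort(entropies))  # lowest-entropy (most specific) tags

Under the hypothesis, tags that predict future listening should cluster away from both extremes of this entropy distribution.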
Analytic Approaches
In this section we describe some of the analytic approaches we are employing to test
the memory cue hypothesis, and a selection of early findings. We discuss, in turn,
time series analysis methods including visualization and clustering, information
theoretic analyses of tags, and other approaches to be explored in future work
including modeling the causal influence (or lack thereof) of tagging on subsequent
listening.
Central to the analyses presented below are user-artist listening time series
and user-artist tagging time series. The former consist of the monthly scrobble
frequencies for each user-artist pair in our data (i.e. for every user, there exists
one time series of monthly playcounts for each unique artist he or she has listened
to) in the July 2005 through December 2012 range. We similarly define tagging
time series, which reflect the number of times a particular user tagged a particular
artist each month. Although listening data are available at a higher time resolution
than we use for analysis, users’ historical tagging data are only available at monthly
time resolution. Thus we down-sample all listening data to monthly playcounts to
facilitate comparative analysis with tagging.
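As a concrete illustration of this down-sampling step, the sketch below aggregates raw scrobbles into monthly playcounts per user-artist pair. It is a minimal example rather than the authors' actual pipeline, and the data frame `scrobbles` and its columns (user, artist, timestamp) are hypothetical names.

```r
# Minimal sketch (not the authors' pipeline): down-sample raw scrobbles to
# monthly playcounts per user-artist pair. Assumes a hypothetical data frame
# `scrobbles` with columns user, artist, and timestamp (POSIXct).
scrobbles$month <- format(scrobbles$timestamp, "%Y-%m")

# One row per (user, artist, month) with the number of scrobbles in that month
listening_ts <- aggregate(rep(1, nrow(scrobbles)),
                          by = list(user = scrobbles$user,
                                    artist = scrobbles$artist,
                                    month = scrobbles$month),
                          FUN = sum)
names(listening_ts)[4] <- "playcount"

# Restrict to the July 2005 through December 2012 window used here
listening_ts <- subset(listening_ts, month >= "2005-07" & month <= "2012-12")
```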
While it is possible in principle to define these time series at the level of
particular songs as opposed to artists, the analysis we present here is limited to
the artist level. For this first phase of research we have taken this approach because
(a) the number of unique songs is much larger than the number of unique artists,
greatly increasing the computational demands of analysis, and (b) the listening and
tagging data (especially the latter) for any particular song in our dataset are typically
very sparse. Thus, for the purposes of the work presented here, we associate with
a given artist all annotations assigned directly to that artist, or to any of the artist’s
albums or songs.
Listening time series are normalized to account for variation in baseline levels
of listening across users. We accomplish this by dividing a user's playcount for
a given artist in a given month by that user’s total playcount (across all artists)
for that month. This effectively converts raw listening counts to the proportion
of a user’s listening in a given time period allocated to any given artist. After all
pre-processing, our data consists of 78,271,211 untagged listening time series (i.e.
user-artist pairings in which the user never tagged the corresponding artist), and
5,336,702 tagged time series (user-artist pairings in which the user tagged the artist
at least once in the data collection period).
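The normalization itself amounts to a few lines; the sketch below continues from the hypothetical `listening_ts` data frame above and is again illustrative rather than the authors' code.

```r
# Convert raw monthly playcounts to the proportion of a user's listening in
# that month devoted to each artist (continues the sketch above).
monthly_totals <- aggregate(playcount ~ user + month, data = listening_ts, FUN = sum)
names(monthly_totals)[3] <- "total_playcount"

listening_ts <- merge(listening_ts, monthly_totals, by = c("user", "month"))
listening_ts$norm_playcount <- listening_ts$playcount / listening_ts$total_playcount
```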
FIGURE 6.1 For a given total annotation count N, the proportion of users in our tagging dataset with a total of N annotations, on a log-log scale.
or if users are simply more likely to tag the artists they are more likely to listen to),
aggregate listening patterns are at least consistent with Hypothesis 1.
In concurrent work,4 we are exploring canonical forms of music listening
patterns by applying standard vector clustering methods from computer science
to identify groups of similarly shaped listening time series. The precise
methodological details are not relevant here, but involve representing each time
series as a simple numeric vector, and feeding many such time series into an
algorithm (k-means) that arbitrarily defines k distinct cluster centroids. Vectors
are each assigned to the cluster to whose centroid they are most similar (as
measured by Euclidean distance), and a new centroid is defined for each cluster
as the mean of all its constituent vectors. This process repeats iteratively until
the distribution of vectors over clusters stabilizes.5 Figure 6.3 shows the results of
one such clustering analysis: cluster centroids and the associated probability
distributions of tagging for k = 9 clusters. Plotted are the mean
probability distributions of listening in each cluster, as well as the temporally
aligned probability distribution of tagging for all user-artist pairs in the cluster.
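A minimal sketch of this clustering step in R is given below; the matrix `ts_matrix` (one row per smoothed, density-normalized, temporally aligned listening time series) is a hypothetical name, and the authors' implementation may differ.

```r
# Minimal sketch of the k-means step described above. Assumes a hypothetical
# matrix `ts_matrix`: one row per listening time series, columns = months since
# first listen, values already smoothed and converted to probability densities.
set.seed(1)                                  # centroids are initialized at random
k <- 9
fit <- kmeans(ts_matrix, centers = k, iter.max = 100, nstart = 10)

centroids <- fit$centers                     # mean time-series shape of each cluster
table(fit$cluster)                           # number of time series per cluster
```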
Consideration of the clustered results is useful for two reasons. First, it demonstrates
that tagging is, on average, most likely in the first month a user listens to an artist
even when the user’s listening peaks in a later month, which is impossible to see in
Figure 6.2. Second, it provides further evidence that increases in tagging correlate
FIGURE 6.2 Mean normalized playcount each month (aligned to the month of first listen) for all listening time series in which the user never tagged the associated artist (solid line) and listening time series in which the user tagged the artist in the first month he or she listened to the artist (dashed line).
FIGURE 6.3 Clustering results for k = 9. Shown are mean normalized playcount (solid line) and mean number of annotations (dashed line), averaged over all the time series within each cluster. Time series are converted to probability densities, and aligned to the first month in which a user listened to a given artist. Clusters are labeled with the number of listening time series (out of 1 million) assigned to each cluster. Cluster numbering is arbitrary.
FIGURE 6.4 Mean normalized playcount (by months since first listen) for user-artist listening time series tagged a given number of times (from zero annotations to eight or more).
the hypothesis that the tags used as retrieval cues6 should have moderate levels of
specificity. A useful mathematical formalization of “specificity” for our purposes
here is the information theoretic notion of entropy, as defined by Shannon (1948).
Entropy (H ) is effectively a measure of uncertainty in the possible outcomes of a
random variable. It is defined as
H(X) = -\sum_i P(x_i) \log_b P(x_i),    (1)
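As a concrete illustration (an example added here for exposition, with entropy measured in bits, i.e. b = 2): a tag that a user applies to only one artist has H = 0; a tag spread evenly over four artists has H = -\sum_{i=1}^{4} \frac{1}{4}\log_2\frac{1}{4} = 2 bits; and a tag spread evenly over roughly a thousand artists has an entropy of about 10 bits.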
FIGURE 6.5 Mean probability of listening each month (relative to the month in which a tag is applied) for user-artist time series associated with tags of a given binned entropy (bin width of 0.5 bits). Each line represents the mean listening for a particular entropy bin, with line color indicating the entropy range for the bin (darker shades show lower entropy). Highlighted are the listening probabilities associated with 0.0–0.5 bit entropy tags (bold dashed line) and 0.5 to 1.0 bit entropy tags (bold solid line). The inset plots show the total mean listening (i.e. sum over all values in each line from the main plot) for each entropy bin (left), and the probability distribution of tags by entropy (right).
bin width of 0.5 bits), and for each bin retrieved all listening time series associated
with tags in that bin. We then determined the mean probability of listening to
those artists each month relative to the month when the tag was applied.
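A minimal sketch of the entropy and binning computation is given below. This is one plausible reading of the procedure (the chapter's exact implementation is not shown), and the data frame `tag_counts`, with columns tag, artist, and count, is a hypothetical name.

```r
# Minimal sketch: Shannon entropy of each tag over the artists it annotates,
# then 0.5-bit binning as in Figure 6.5. `tag_counts` is a hypothetical data
# frame with columns tag, artist, count (how often the tag was applied to the artist).
shannon_entropy <- function(counts) {
  p <- counts / sum(counts)
  -sum(p * log2(p))                          # entropy in bits
}

tag_entropy <- tapply(tag_counts$count, tag_counts$tag, shannon_entropy)

breaks <- seq(0, ceiling(max(tag_entropy)) + 0.5, by = 0.5)
entropy_bin <- cut(tag_entropy, breaks = breaks, right = FALSE)
table(entropy_bin)                           # how many tags fall in each entropy bin
```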
The results are shown in Figure 6.5. Each line shows the average probability of
a user listening to an artist at a time X months before or after tagging it, given
that the user annotated that artist with a tag in a given entropy range. Entropies
are binned in 0.5 bit increments, and entropy values are indicated by the color of
each line. Two obvious large-scale trends should be noted. First, consistent with
the earlier finding that tagging overwhelmingly occurs in the first month a user
listens to an artist, the probability of listening to an artist peaks in the month it is
tagged, and is greater in the months following the annotation than preceding it.
Second, there is a general trend of overall lower listening probabilities with higher
entropy, consistent with findings suggesting that greater tag specificity ought to
facilitate retrieval. But, in support of our “sweet spot” hypothesis, this trend is not
wholly monotonic. Tags with the lowest entropy (between 0.0 and 0.5 bits, dashed
bold line) are not associated with the highest listening probabilities; tags with low,
but not too low, entropy (between 0.5 and 1.0 bits, solid bold line) have the highest
rates of listening.
The left-hand inset plot is the probability distribution of total listening by
binned entropy (i.e. the mean sum total of normalized listening within each bin).
This is, effectively, a measure of the total amount of listening, on average, associated
with artists labeled with a tag in a given entropy bin, and makes clear the peak
for tags in the 0.5 to 1.0 bit range. Also of note is the relative stability of total
listening (excepting the aforementioned peak) up to around 7 bits of entropy, after
which total listening drops off rapidly. The right-hand inset plot is the probability
distribution of listening time series across tag entropy bins—or in other words, the
distribution of rates of tag use versus tag entropy. Very low entropy tags (0 to 0.5
bits) are clearly the most common, indicating the existence of many “singleton”
and low-use tags—that is, tags a user applies to only one, or very few, unique
artists. Ignoring these tags, however, we observe a unimodal, relatively symmetric
distribution peaked on the 5.0–5.5 bit entropy bin (marked with a vertical dashed
line) that corresponds more or less directly to the stable region of total listening in
the left-hand inset plot. Precisely what drives the preponderance of “singleton” tags
is not totally clear, but excluding them, these data do suggest that users demonstrate
a preference for moderate-entropy tags associated with relatively high listening
probabilities.
These results do not strongly suggest the existence of a single “sweet spot”
in entropy (the peak in the 0.5–1.0 bit bin may be partly due to noise, given
the relatively low frequency of tags in that entropy range), but do demonstrate
that there is not a simple, monotonic relationship between increased listening and
lower entropy values. Instead, we observed a range of entropy values (from 0.0 to
approximately 7.0 bits) that are associated with higher listening rates. We must be
cautious in drawing strong conclusions from these results, however. Because we
are collapsing tagging and listening activity by artist, we cannot know the number
of particular songs a user might retrieve with a given tag. Thus there may exist
dependencies between tag entropy and the number of associated songs that drive
mean listening rates higher or lower in a misleading manner. For example, a tag
that tends to only be associated with a small number of songs may show low mean
listening rates not because it is an ineffective retrieval cue, but because a small set
of songs may generate low listening rates compared with a larger set.
This is just one of various difficulties in interpreting large-scale data such
as these. When considering the average behavior of many heterogeneous users,
normalization and other transformations (such as our normalization of playcounts
to account for variation in users’ overall listening levels) are necessary, but can
interact with derived measures (such as our entropy calculations) in complex,
sometimes unexpected ways. As we continue this research program, we will need
hurdles, and it is unclear if the sparsity of tagging data will permit such an analysis,
but we hope the approach will prove fruitful. We also aim to expand our analysis
to employ standard machine learning algorithms (such as support vector machines
and logistic regression models) to develop a classifier for categorizing tagged and
untagged time series. If a high-performing classifier based on listening behavior
can be developed, it would indicate that there are systematic differences in listening
behavior for tagged time series. This would suggest that tagging is not simply more
likely for those artists a user is likely to listen to anyway, but instead is associated
with distinctive patterns of listening.
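As a hedged illustration of what such a classifier might look like (this analysis is proposed rather than reported here), a logistic regression on a few summary features of each time series could serve as a simple baseline; the data frame `ts_features` and its feature columns are hypothetical names.

```r
# Illustrative baseline only (the proposed analysis is future work): logistic
# regression predicting the 0/1 `tagged` label from hypothetical summary features.
fit <- glm(tagged ~ total_listens + peak_listens + months_active,
           data = ts_features, family = binomial)
summary(fit)

# Crude check: classification accuracy at a 0.5 threshold
pred <- as.numeric(predict(fit, type = "response") > 0.5)
mean(pred == ts_features$tagged)
```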
One approach that has borne fruit in ongoing work, building upon the time
series analysis methods described above, is the use of a regression model that
predicts future listening rates as a function of past listening rates and whether or
not a user-artist listening time series has been tagged (Lorince, Joseph, & Todd,
2015). Using a generalized additive model (GAM, Hastie & Tibshirani, 1990), our
dependent variable in the regression is the logarithm of the sum of all listens in the
six months after a tag has been applied, to capture the possible effect of tagging
over a wide temporal window (the results are qualitatively the same when testing
listening for each individual month, however), while our independent variables
are a binary indicator of whether or not the time series has been tagged, as
well as seven continuous-valued predictors, one each for the logarithm of total
listens in the month of peak listening7 in the time series and in each of the six
previous months. The regression equation is as follows, where m corresponds to
the month of peak listening, L is the number of listens in any given month, T is
the binary tagged/untagged indicator, and f represents the exponential-family
functions calculated in the GAM (there is a unique function f for each pre-peak
month, see Hastie & Tibshirani, 1990 for details):
\log \sum_{i=1}^{6} L_{m+i} = b_0 + b_1 T + \sum_{i=0}^{6} f(\log L_{m-i})    (2)
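For illustration, a model in the spirit of Equation (2) could be fitted with the mgcv package roughly as below; the data frame `reg_data` and its column names are hypothetical, and the published analysis may differ in its details (see Lorince et al., 2015).

```r
# Sketch of a GAM in the spirit of Equation (2), using the mgcv package.
# Hypothetical data frame `reg_data`, one row per aligned time series:
#   log_post        log of summed listens in the six post-peak months
#   tagged          binary tagged/untagged indicator (T in Equation 2)
#   log_m0..log_m6  log listens in the peak month and the six months before it
library(mgcv)

fit <- gam(log_post ~ tagged +
             s(log_m0) + s(log_m1) + s(log_m2) + s(log_m3) +
             s(log_m4) + s(log_m5) + s(log_m6),
           data = reg_data)
summary(fit)   # the coefficient on `tagged` estimates the effect of tagging
```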
We refer the reader to the full paper for further details, but the model (after
imposing various constraints that permit temporal alignment of tagged and
untagged time series data) allows us to measure the effect of tagging an artist on
future listening, while controlling for users’ past listening rates. Our early results
suggest that tagging has a measurable, but quite small, effect on future listening. As
we cannot visualize the regression results for all model variables at once, Figure 6.6
instead displays the predicted difference in listening corresponding to tagging as
a function of the number of peak listens, calculated with a similar model, which
considers only the effect of listening in the peak month on post-peak listening.
This plot suggests, and the full model confirms, that controlling for all previous
listening behavior, a tag increases the logarithm of post-peak listens by 0.147
(95 percent confidence interval = [0.144, 0.150]). In other words, a tag is associated
with around 1.15 more listens over six months, on average, than if it had not been applied.
FIGURE 6.6 Regression model results, showing the predicted sum total of listening in the 6 months after a tag is assigned as a function of the number of listens in the month of peak listening in a time series. Results are shown on a log-log scale, and shaded regions indicate a bootstrapped 95 percent confidence interval. Figure replicated from Lorince et al. (2015).
These results thus suggest that tagging does lead to increases in listening, but only
very small ones. Further analysis comparing the predictiveness of different tags for
future listening (again, see the full paper for details) indicates that only a small
subset of the tags analyzed have
any significant effect on future listening. Taken together, these tentative results
provide evidence that tags certainly do not always function as memory cues, and
that facilitating later retrieval may actually be an uncommon tagging motivation.
Notes
1 The original definition contains an additional element, such that F := (U, T, R,
Y, ≺). The last term, ≺, represents a user-specific subtag/supertag relation that
folksonomy researchers (including the authors who define it) do not typically
examine, and we do not discuss it here.
2 When crawling a user’s listening history, we are able to determine song names
and the associated artist names, but not the corresponding album names.
3 This is not to say that such tags are never useful. We can imagine the generation
of highly specific cues (such as “favorite song of 1973”) that are associated with
one or only a few targets, but are still useful for retrieval. As we will see below,
however, such high specificity tags are not strongly associated with increased
listening on average.
4 This work is not yet published, but see the following URL for some method-
ological details: https://jlorince.github.io/archive/pres/Chasm2014.pdf.
5 For these analyses, we also applied a Gaussian smoothing kernel to all time
series, and performed clustering on a random subset of 1 million time series,
owing to computational constraints. Qualitative results hold over various
random samples, however.
6 This is not to say that all tags are used as retrieval cues, only that those are the
ones that this hypothesis applies to. How to determine which tags are used as
retrieval cues and which are not is a separate question we do not tackle here;
for the purposes of these analyses we assume that such tags exist in sufficient
numbers for us to see the proposed pattern in the data when considering all
tags.
7 Our methods align all time series to month of peak listening, and consider only
tagged time series where the tag was applied in that peak month.
8 Because the Last.fm software can track listening from various sources, a given
scrobble can represent a direct choice to listen to a particular song/artist, a
recommendation generated by Last.fm, or a recommendation from another
source, such as Pandora or Grooveshark.
References
Ames, M., & Naaman, M. (2007). Why we tag: Motivations for annotation in mobile
and online media. In Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems (pp. 971–980). ACM.
Block, L. G., & Morwitz, V. G. (1999). Shopping lists as an external memory aid for
grocery shopping: Influences on list writing and list fulfillment. Journal of Consumer
Psychology, 8(4), 343–375.
Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., & Scott, S. L. (2014). Inferring
causal impact using Bayesian structural time-series models. Annals of Applied Statistics, 9,
247–274.
Cattuto, C., Baldassarri, A., Servedio, V. D., & Loreto, V. (2007). Vocabulary growth
in collaborative tagging systems. arXiv preprint. Retrieved from https://arxiv.org/abs/
0704.3316.
Cattuto, C., Loreto, V., & Pietronero, L. (2007). Semiotic dynamics and collaborative
tagging. Proceedings of the National Academy of Sciences, 104(5), 1461–1464.
Earhard, M. (1967). Cued recall and free recall as a function of the number of items per
cue. Journal of Verbal Learning and Verbal Behavior, 6(2), 257–263.
Floeck, F., Putzke, J., Steinfels, S., Fischbach, K., & Schoder, D. (2011). Imitation
and quality of tags in social bookmarking systems—collective intelligence leading to
folksonomies. In T. J. Bastiaens, U. Baumöl, & B. J. Krämer (Eds.), On collective intelligence
(pp. 75–91). Berlin: Springer International Publishing.
Glushko, R. J., Maglio, P. P., Matlock, T., & Barsalou, L. W. (2008). Categorization in the
wild. Trends in Cognitive Sciences, 12(4), 129–135.
Golder, S. A., & Huberman, B. A. (2006). Usage patterns of collaborative tagging systems.
Journal of Information Science, 32(2), 198–208.
Granger, C. W. J. (1969). Investigating causal relations by econometric models and
cross-spectral methods. Econometrica, 37(3), 424.
Gupta, M., Li, R., Yin, Z., & Han, J. (2010). Survey on social tagging techniques. ACM
SIGKDD Explorations Newsletter, 12(1), 58–72.
Halpin, H., Robu, V., & Shepherd, H. (2007). The complex dynamics of collaborative
tagging. In Proceedings of the 16th International Conference on World Wide Web (pp. 211–220).
ACM.
Harris, J. E. (1980). Memory aids people use: Two interview studies. Memory & Cognition,
8(1), 31–38.
Hastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models (Vol. 43). London:
CRC Press.
Heckner, M., Mühlbacher, S., & Wolff, C. (2008). Tagging tagging: Analysing user
keywords in scientific bibliography management systems. Journal of Digital Information
(JODI), 9(2), 1–19.
Higbee, K. L. (1979). Recent research on visual mnemonics: Historical roots and
educational fruits. Review of Educational Research, 49(4), 611–629.
Hotho, A., Jäschke, R., Schmitz, C., & Stumme, G. (2006). Information retrieval in
folksonomies: Search and ranking. In Proceedings of 3rd European Semantic Web Conference
(ESWC) (pp. 411–426). Springer International Publishing.
Hunt, R. R., & Seta, C. E. (1984). Category size effects in recall: The roles of relational
and individual item information. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 10(3), 454.
Hunter, I. M. L. (1979). Memory in everyday life. In M. M. Gruneberg & P. E. Morris
(Eds.), Applied problems in memory (pp. 1−11). London: Academic Press.
Intons-Peterson, M. J., & Fournier, J. (1986). External and internal memory aids: When
and how often do we use them? Journal of Experimental Psychology: General, 115(3), 267.
Jäschke, R., Marinho, L., Hotho, A., Schmidt-Thieme, L., & Stumme, G. (2007). Tag
recommendations in folksonomies. In Knowledge discovery in databases: PKDD 2007
(pp. 506–514). Berlin: Springer International Publishing.
Kausler, D. H. (1974). Psychology of verbal learning and memory. New
York: Academic Press.
Körner, C., Benz, D., Hotho, A., Strohmaier, M., & Stumme, G. (2010). Stop thinking,
start tagging: Tag semantics emerge from collaborative verbosity. In Proceedings of the 19th
International Conference on World Wide Web (pp. 521–530). ACM.
Körner, C., Kern, R., Grahsl, H.-P., & Strohmaier, M. (2010). Of categorizers and
describers: An evaluation of quantitative measures for tagging motivation. In Proceedings
of the 21st ACM Conference on Hypertext and Hypermedia (pp. 157–166). ACM.
Lorince, J., & Todd, P. M. (2013). Can simple social copying heuristics explain tag
popularity in a collaborative tagging system? In Proceedings of the 5th Annual ACM Web
Science Conference (pp. 215–224). ACM.
Lorince, J., Joseph, K., & Todd, P. M. (2015). Analysis of music tagging and listening
patterns: Do tags really function as retrieval aids? In Proceedings of the 8th Annual Social
Computing, Behavioral-Cultural Modeling and Prediction Conference (SBP 2015). Washington,
D.C.: Springer International Publishing.
Lorince, J., Zorowitz, S., Murdock, J., & Todd, P. M. (2014). “Supertagger” behavior
in building folksonomies. In Proceedings of the 6th Annual ACM Web Science Conference
(pp. 129–138). ACM.
Lorince, J., Zorowitz, S., Murdock, J., & Todd, P. M. (2015). The wisdom of the few?
“supertaggers” in collaborative tagging systems. The Journal of Web Science, 1(1), 16–32.
Macgregor, G., & McCulloch, E. (2006). Collaborative tagging as a knowledge organisation
and resource discovery tool. Library Review, 55(5), 291–300.
Marlow, C., Naaman, M., Boyd, D., & Davis, M. (2006). HT06, tagging paper, taxonomy,
Flickr, academic article, to read. In Proceedings of the 17th Conference on Hypertext and
Hypermedia (pp. 31–40). ACM.
Moscovitch, M., & Craik, F. I. (1976). Depth of processing, retrieval cues, and uniqueness
of encoding as factors in recall. Journal of Verbal Learning and Verbal Behavior, 15(4),
447–458.
Noll, M. G., Au Yeung, C.-M., Gibbins, N., Meinel, C., & Shadbolt, N. (2009). Telling
experts from spammers: Expertise ranking in folksonomies. In Proceedings of the 32nd
International ACM SIGIR Conference on Research and Development in Information Retrieval
(pp. 612–619). ACM.
Nov, O., Naaman, M., & Ye, C. (2008). What drives content tagging: The case of photos
on Flickr. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
(pp. 1097–1100). ACM.
Robu, V., Halpin, H., & Shepherd, H. (2009). Emergence of consensus and shared
vocabularies in collaborative tagging systems. ACM Transactions on the Web (TWEB), 3(4),
1–34.
Rutherford, A. (2004). Environmental context-dependent recognition memory effects:
An examination of ICE model and cue-overload hypotheses. The Quarterly Journal of
Experimental Psychology Section A, 57(1), 107–127.
Schifanella, R., Barrat, A., Cattuto, C., Markines, B., & Menczer, F. (2010). Folks in
folksonomies: Social link prediction from shared metadata. In Proceedings of the 3rd ACM
International Conference on Web Search and Data Mining (pp. 271–280). ACM.
Seitlinger, P., Ley, T., & Albert, D. (2013). An implicit-semantic tag recommendation
mechanism for socio-semantic learning systems. In T. Ley, M. Ruohonen, M. Laanpere,
& A. Tatnall (Eds.), Open and Social Technologies for Networked Learning (pp. 41–46). Berlin:
Springer International Publishing.
Sen, S., Lam, S. K., Rashid, A. M., Cosley, D., Frankowski, D., Osterhouse, J., . . . Riedl,
J. (2006). Tagging, communities, vocabulary, evolution. In Proceedings of the 2006 20th
Anniversary Conference on Computer Supported Cooperative Work (pp. 181–190). ACM.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical
Journal, 27, 379–423.
Shirky, C. (2005). Ontology is overrated: Categories, links, and tags. Retrieved from
www.shirky.com/writings/ontology_overrated.html.
Sterling, B. (2005). Order out of chaos. Wired Magazine, 13(4).
Tullis, J. G., & Benjamin, A. S. (2014). Cueing others’ memories. Memory & Cognition,
43(4), 634–646.
Tulving, E., & Pearlstone, Z. (1966). Availability versus accessibility of information in
memory for words. Journal of Verbal Learning and Verbal Behavior, 5(4), 381–391.
Vander Wal, T. (2007). Folksonomy coinage and definition. Retrieved July 29, 2014, from
www.vanderwal.net/folksonomy.html.
Von Ahn, L., & Dabbish, L. (2004). Labeling images with a computer game. In Proceedings
of the SIGCHI Conference on Human Factors in Computing Systems (pp. 319–326). ACM.
Watkins, O. C., & Watkins, M. J. (1975). Buildup of proactive inhibition as a cue-overload
effect. Journal of Experimental Psychology: Human Learning and Memory, 1(4), 442.
Weinberger, D. (2008). Everything is miscellaneous: The power of the new digital disorder (1st
edn.). New York: Holt Paperbacks.
Weist, R. M. (1970). Optimal versus nonoptimal conditions for retrieval. Journal of Verbal
Learning and Verbal Behavior, 9(3), 311–316.
Weng, L., & Menczer, F. (2010). GiveALink tagging game: An incentive for social
annotation. In Proceedings of the ACM SIGKDD Workshop on Human Computation (pp.
26–29). ACM.
Weng, L., Schifanella, R., & Menczer, F. (2011). The chain model for social tagging game
design. In Proceedings of the 6th International Conference on Foundations of Digital Games (pp.
295–297). ACM.
Yeung, C.-M. A., Noll, M. G., Gibbins, N., Meinel, C., & Shadbolt, N. (2011). SPEAR:
Spamming-Resistant Expertise Analysis and Ranking in collaborative tagging systems.
Computational Intelligence, 27(3), 458–488.
Zollers, A. (2007). Emerging motivations for tagging: Expression, performance, and
activism. In Workshop on Tagging and Metadata for Social Information Organization,
held at the 16th International World Wide Web Conference.
Zubiaga, A., Körner, C., & Strohmaier, M. (2011). Tags vs shelves: From social tagging to
social classification. In Proceedings of the 22nd ACM Conference on Hypertext and Hypermedia
(pp. 93–102). ACM.
7
FLICKR® DISTRIBUTIONAL TAGSPACE
Evaluating the Semantic Spaces Emerging from
Flickr® Tag Distributions
Marianna Bolognesi
Abstract
Flickr users tag their personal pictures with a variety of keywords. Such annotations could
provide genuine insights into salient aspects of the personal experiences captured in the
pictures, aspects that range beyond purely visual features or language-based
associations. Mining the emergent semantic patterns of these complex open-
ended large-scale bodies of uncoordinated annotations provided by humans is the goal of this
chapter. This is achieved by means of distributional semantics, i.e. by relying on the idea
that concepts that appear in similar contexts have similar meanings (e.g. Latent Semantic
Analysis, LSA, Landauer & Dumais 1997). This chapter presents the Flickr Distributional
Tagspace (FDT), a distributional semantic space built on Flickr tag co-occurrences, and
evaluates it as follows: (1) through a comparison between the semantic representations that
it produces, and those that are obtained from speaker-generated features norms collected
in an experimental setting, as well as with WordNet-based metrics of semantic similarity
between words; (2) through a categorization task and a consequent cluster analysis.
The results of the two studies suggest that FDT can deliver semantic representations
that correlate with those that emerge from aggregations of features norms, and can
cluster fairly homogeneous categories and subcategories of related concepts.
Introduction
The large-scale collections of user-generated semantic labels that can be easily
found online have recently prompted the interest of research communities that focus
on the automatic extraction of meaning from large-scale unstructured data, and the
creation of bottom-up methods of semantic knowledge representation. Fragments
of natural language such as tags are today exploited because they provide contextual
cues that can help solve problems in computer vision research; for example, the
queries in the Google image searching browser, where one can drag an image
and obtain in return other images that are visually similar, can be refined by
providing linguistic cues. For this reason, there is a growing interest in analyzing
(or “mining”) these linguistic labels for identifying latent recurrent patterns and
extracting new conceptual information, without referring to a predefined model,
such as an ontology or a taxonomy. Mining these large sets of unstructured data
retrieved from social networks (also called Big Data) seems increasingly crucial
for uncovering aspects of the human cognitive system, tracking trends that are
latently encoded in the usage of specific tags, and fueling business intelligence and
decision-making in the industry sector (sentiment analysis and opinion mining).
The groupings of semantic labels attributed to digital documents by users, and
the semantic structures that emerge from such uncoordinated actions, known as
folksonomies (folk-taxonomies), are today widely studied in multimedia research
to assess the content of different digital resources (see Peters & Weller, 2008 for an
overview), relying on the “wisdom of the crowd”: If many people agree that
a web page is about cooking, then with high probability it is about cooking
even if its content does not include the exact word “cooking.” Although several
shortcomings of folksonomies have been pointed out (e.g. Peters & Stock, 2007),
this bottom-up approach of collaborative content structuring is seen as the next
transition toward the Web 3.0, or Semantic Web.
Whereas the multimedia researchers that aim to implement new tools for tag
recommendations, machine-tagging, and information retrieval in the semantic
web are already approaching and trying to solve the new challenges set by these
resources, in cognitive science the task-oriented data, collected in experimental
settings, still seem to be the preferred type of empirical data, because we know
little about how and to what extent Big Data can be modeled to reflect human
behavior in typically human cognitive operations.
Burgess & Lund, 1997; Landauer & Dumais, 1997; Sahlgren, 2006; Turney &
Pantel, 2010; Rapp, 2004), and for this reason they have been often “accused”
of yielding language-based semantic representations, rather than experience-based
semantic representations.
In order to overcome the limitations of the language-based distributional
models, there have been recent attempts to create hybrid models, in which
the semantic information retrieved from word co-occurrences is combined with
perceptual information, retrieved in different ways, such as from human-generated
semantic features (Andrews, Vigliocco, & Vinson, 2009; Johns & Jones, 2012;
Steyvers, 2010) or from annotated images, under the assumption that images
provide a valid proxy for perceptual information (see, for example, Bruni, Tran,
& Baroni, 2014). Image-based information has been proven to be non-redundant
and complementary to the text-based information, and the multimodal models
in which the two streams of information are combined perform better than
those based solely on linguistic information (Andrews et al., 2009; Baroni &
Lenci, 2008; Riordan & Jones, 2011). In particular, it has been shown that
while language-based distributional models capture encyclopedic, functional, and
discourse-related properties of words, hybrid models can also harvest perceptual
information, retrieved from images.
Such hybrid models constitute a great leap forward in the endeavor of modeling
human-like semantic knowledge by relying on the distributional hypothesis and on
large amounts of unstructured, human-generated data. Yet, I believe, they present
some questionable aspects, which I hereby summarize.
Combining text-derived with image-derived information by means of sophis-
ticated techniques appears to be an operation that is easily subject to error (how
much information shall be used from each stream and why? Does the merging
technique make sense from a cognitive perspective?). Moreover, this operation
seems to lean too much toward a strictly binary distinction between visual
versus linguistic features (respectively retrieved from two separate streams), leaving
aside other possible sources of information (e.g. emotional responses, cognitive
operations, other sensory reactions that are not captured by purely visual or purely
linguistic corpora).
Furthermore, the way in which visual information is retrieved from images
might present some drawbacks. For example, image-based information included
in hybrid models is often collected through real-time “games with a purpose,”
created ad hoc for stimulating descriptions of given stimuli from individuals, or
coordinated responses between two or more users (for a comprehensive overview,
see Thaler, Simperl, Siorpaes, & Hofer, 2011). In the popular ESP game (Ahn &
Dabbish, 2004, licensed by Google in 2006), for example, two remote participants
that do not know each other have to associate words with a shared image, trying to
coordinate their choices and produce the same associations as fast as possible, thus
forcing each participant to guess how the other participant would “tag” the image.
Although the entertaining nature of these games is crucial for keeping participants
motivated during the task, and involves little or no expense, the specific instructions
provided to the contestants can constrain the range of associations that a user might
attribute to a given stimulus, and trigger ad hoc responses that provide only partial
insights on the content of semantic representations. As Weber, Robertson, and
Vojnovic show (2008), ESP gamers tend to match their annotations to colors, or
to produce generic labels to meet the other gamer quickly, rather than focusing
on the actual details and peculiarities of the image. The authors also show that
a “robot” can predict fairly appropriate tags without even seeing the image. In
addition, ESP as well as other databases of annotated images harvest annotations
provided by people who are not familiar with the images: images are provided
by the system. Arguably, such annotations reflect semantic knowledge about the
concepts represented, which are processed as categories (concept types), rather
than individual experiential instances (concept tokens). Thus, such images cannot
be fully acknowledged to be a good proxy of sensorimotor information, because
there has not been any sensorimotor experience: The annotator has not experienced
the exact situation captured by the image.
Finally, in hybrid models the texts and the images used as sources of information
have been produced/processed by different populations, and thus they may not be
comparable.
Motivated by these concerns, my research question is the following: Can we
build a hybrid distributional space that (1) is based on a unique but intrinsically
variegated source of semantic information, so as to avoid the artificial and arbitrary
merging of linguistic and visual streams; (2) contains spontaneous and therefore
richer data, which are not induced by specific instructions or time constraints such
as in the online games; (3) contains perceptual information that is derived from
direct experience; (4) contains different types of semantic information (perceptual,
conceptual, emotional, etc.) provided by the same individuals in relation to specific
stimuli; (5) is based on dynamic, noisy, and constantly updated (Big) Data?
As explained below, the answer can be found in Flickr Distributional Tagspace
(FDT), a distributional semantic space based on Flickr tags. Big Data meets
cognitive science.
tag green was in both cases the farthest one, probably due to the fact that the word
green is highly polysemic.
That first investigation, aimed at analyzing the distribution of color terms across
Flickr images’ tags, showed that it is possible to actually capture rich semantic
knowledge from the Flickr environment, and that this information is missed by
(two) distributional models based solely on linguistic contexts.
Implementing FDT
The procedure for creating an FDT semantic space relies on the following steps, as
first illustrated in Bolognesi (2014). All the operations can be easily performed
in the R environment for statistical analyses (for these studies the R version 2.14.2
was used), while the raw data (tagsets) can be downloaded from Flickr, through
the freely available Flickr API services.1
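For readers who want to try this, a hedged sketch of one way to retrieve tagsets through the public Flickr API is shown below. The author's exact download procedure is not described at this level of detail; the flickr.photos.search method and the parameter names used here should be checked against the current API documentation, and an API key is required.

```r
# Hedged sketch: retrieving tagsets via the public Flickr API (parameter names
# follow the standard flickr.photos.search method; check the API documentation).
library(httr)
library(jsonlite)

get_tagsets <- function(concept, api_key, per_page = 500, page = 1) {
  resp <- GET("https://api.flickr.com/services/rest/",
              query = list(method = "flickr.photos.search",
                           api_key = api_key,
                           tags = concept,
                           extras = "tags,owner_name",
                           per_page = per_page,
                           page = page,
                           format = "json",
                           nojsoncallback = 1))
  photos <- fromJSON(content(resp, as = "text"))$photos$photo
  # one owner name and one space-separated tagset per photo
  data.frame(owner = photos$ownername, tagset = photos$tags,
             stringsAsFactors = FALSE)
}
```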
This operation is done in order to keep only the most salient features that
users attribute to a picture, which are arguably tagged first.
4 Subset (i.e. filter) the concatenated corpus, in order to drop the redundant
tagsets that belong to the same user, and thus keep only the unique tagsets for
each user (each owner name). This operation should be done to avoid biased
frequencies among the tags’ co-occurrences, due to the fact that users often
tag batches of pictures with the same tagset (copied and pasted). For example,
on a sunny Sunday morning a user might take 100 pictures of a flower, upload
all of them, and tag them with the same tags “sunny,” “Sunday,” “morning,”
“flower.” In FDT only one of these 100 pictures taken by the same user is
kept.
5 Another filtering of the corpus should be done, by dropping those tagsets
where the concept to be analyzed appears after the first three tags.3 This
allows one to keep only those tagsets that describe pictures for which a target
concept is very salient (and therefore is mentioned among the first three tags).
Pictures described by tagsets where the target concept appears late are not
considered to be representative for the given concept.
6 Build the matrix of co-occurrences, which displays the frequencies with
which each target concept appears in the same picture with each related tag.
This table will display the target concepts on the rows and all of the other
tags, that co-appear with each of the target concepts across the downloaded
tagsets, on the columns. The raw frequencies of co-occurrence reported in the
cells should then be turned into measures of association. The measure used
for this distributional semantic space is an adaptation of the pointwise mutual
information (Bouma, 2009), in which the joint co-occurrence of each tag
pair is squared, before dividing it by the product of the individual occurrences
of the two tags. Then, the obtained value is normalized by multiplying the
squared joint frequency by the sample size (N). This double operation (not
very different from that performed in Baroni & Lenci, 2010) is done in order
to limit the general tendency of the mutual information to give weight to
highly specific semantic collocates despite their low overall frequency. This
measure of association is formalized as follows:
SPMI = \log \frac{N \cdot f_{a,b}^{2}}{f_a \times f_b}
where a and b are two tags, f stands for frequency of occurrence (joint
occurrence of a with b in the numerator and individual occurrences of a
and b in the denominator), and N is the corpus size. The obtained value
approximates the likelihood of finding a target concept and each other tag
appearing together in a tagset, taking into account their overall frequency in
the corpus, the frequency of their co-appearance within the same tagsets, and
the sample size. Negative values, as commonly done, are raised to zero.
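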
TABLE 7.1 The three main differences between LSA and FDT, pertaining to context type (of the co-occurrence matrix), measure of association between an element and a context, and dimensionality reduction applied before the computation of the cosine.

                              LSA                                        FDT
Context                       Documents of text (the matrix of           Tagsets (the matrix of
                              co-occurrences is word by document)        co-occurrences is word by word)
Measure of association        Typically tf-idf (term frequency-          SPMI
                              inverse document frequency)
Dimensionality reduction      SVD (singular value decomposition),        None; the matrix is dense.
                              used because the matrix is sparse.
7 Turn the dataframe into a matrix, so that each row constitutes a concept’s
vector, and calculate the pairwise cosines between rows. The cosine, a
commonly used metric in distributional semantics, expresses the geometrical
proximity between two vectors, which has to be interpreted as the
semantic similarity between two concepts. The obtained table represents the
multidimensional semantic space FDT.
All the steps illustrated in the procedure can be easily done with basic R
functions, except for step 7, for which the lsa package is required. In fact, FDT is
substantially similar to LSA; yet, there are some crucial differences between FDT
and LSA, summarized in Table 7.1.
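For concreteness, the sketch below strings steps 4 through 7 together in R. It is a minimal illustration, not the author's released code: the data frame `corpus` (with columns owner and tagset, the latter a space-separated string of tags in the order they were applied) and the character vector `targets` of target concepts are assumed names, and individual tag frequencies are approximated here by the margins of the co-occurrence matrix.

```r
# Minimal sketch of steps 4-7 (assumed data layout: data frame `corpus` with
# columns owner and tagset; character vector `targets` of target concepts).
library(lsa)                                 # for cosine()

tag_lists <- strsplit(corpus$tagset, " ")

# Step 4: drop redundant tagsets copied across a user's pictures
keep_unique <- !duplicated(corpus[, c("owner", "tagset")])

# Step 5: keep tagsets whose first three tags contain at least one target concept
salient <- vapply(tag_lists,
                  function(tags) any(head(tags, 3) %in% targets),
                  logical(1))
tag_lists <- tag_lists[keep_unique & salient]

# Step 6: concept-by-tag co-occurrence counts, then SPMI weighting
all_tags <- sort(unique(unlist(tag_lists)))
cooc <- matrix(0, nrow = length(targets), ncol = length(all_tags),
               dimnames = list(targets, all_tags))
for (tags in tag_lists) {
  hit <- intersect(targets, tags)
  cooc[hit, tags] <- cooc[hit, tags] + 1
}
N <- length(tag_lists)                       # corpus size (number of tagsets)
f_concept <- rowSums(cooc)                   # marginal frequency of each concept
f_tag <- colSums(cooc)                       # marginal frequency of each tag
spmi <- log(N * cooc^2 / outer(f_concept, f_tag))
spmi[!is.finite(spmi) | spmi < 0] <- 0       # log(0) and negative values raised to zero

# Step 7: pairwise cosines between concept vectors (rows of the SPMI matrix)
fdt_space <- cosine(t(spmi))                 # lsa::cosine() compares columns, so transpose
```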
A cluster analysis can finally provide a deeper look into the data. In the studies
described below, the data were analyzed in R through an agglomerative hierarchical
clustering algorithm, Ward's method (El-Hamdouchi & Willett, 1986; Ward,
1963), also called minimum variance clustering (see explanation of this choice in
Study Two). Ward's method works on Euclidean distances (thus the cosines
were transformed into Euclidean distances): It is a variance-minimizing approach,
which minimizes the sum of squared differences within all clusters and does not
require the experimenter to set the number of clusters in advance. In hierarchical
clustering each instance is initially considered a cluster by itself and the instances are
gradually grouped together according to the optimal value of an objective function,
which in Ward’s method is the error sum of squares. Conversely, the commonly
used k-means algorithms require the experimenter to set the number of clusters
in which he or she wants the data to be grouped. However, for observing the
spontaneous emergence of consistent semantic classes from wild data, it seems
preferable to avoid setting a fixed number of clusters in advance. In R it is possible
to use agglomerative hierarchical clustering methods through the function hclust.
An evaluation of the clustering solution, illustrated in the studies below, was
obtained with pvclust R package (Suzuki & Shimodaira, 2006), which allows
the assessment of the uncertainty in hierarchical cluster analysis. For each cluster
in hierarchical clustering, quantities called p-values are calculated via multiscale
bootstrap resampling. The p-value of a cluster is a value between 0 and 1, which
indicates how strongly the cluster is supported by the data.4
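The clustering and bootstrap validation described above can be reproduced along the following lines; this is a minimal sketch that treats each concept's row of cosines (from the `fdt_space` matrix of step 7) as its feature vector, so the exact cosine-to-distance transformation may differ from the one used in the studies.

```r
# Minimal sketch of the cluster analysis described above.
d <- dist(fdt_space, method = "euclidean")   # Euclidean distances between concept rows
hc <- hclust(d, method = "ward.D")           # Ward's method ("ward" in older R versions)
plot(hc)                                     # dendrogram, as in Figure 7.2
groups <- cutree(hc, k = 6)                  # six-way partitioning

# Assess cluster uncertainty via multiscale bootstrap resampling
library(pvclust)
pv <- pvclust(fdt_space, method.hclust = "ward.D",
              method.dist = "euclidean", nboot = 1000)
plot(pv)
pvrect(pv, alpha = 0.95)                     # highlight clusters supported at alpha = 0.95
```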
Study One
The purpose of this study was to evaluate FDT semantic representations against
those obtained from speaker-generated feature norms, and those obtained from
linguistic analyses conducted on WordNet. The research questions approached by
this task can be summarized as follows:
To what extent do the semantic representations created by FDT correlate with
the semantic representations based on human-generated features, and with those
emerging from the computation of semantic relatedness in WordNet, using three
different metrics?
In order to achieve this, the semantic representations of a pool of concepts,
analyzed with FDT, were compared through a correlation study to those obtained
from a database of human-generated semantic features, as well as to the similarities
obtained by computing the pairwise proximities between words in WordNet (three
different metrics).
1 function;
2 external surface property;
3 external component;
4 superordinate;
5 entity behavior.
This distribution suggests that during the property generation task that allowed
the experimenters to collect the features, the participants activated mental
simulations that allowed them to "see" and consequently list visual properties
of the given words, while imagining the corresponding referents in action.5
However, little information is produced with regard to the contextual properties
of such simulations: The information about locations, participants, and associated
entities that would populate the imagined situations is not ranked as high as the
properties attributed directly to the referent that was mentally simulated after the
corresponding verbal stimulus was provided. This does not necessarily mean that
in the mental simulations the concepts are imagined to be floating in a vacuum,
without any surrounding context, but rather that the contextual entities are not as
salient as other types of properties directly attributed to the referent.
On the other hand, according to Beaudoin’s classification of Flickr tags
(Beaudoin, 2007), the most frequent features tagged in Flickr are:
1 geographical locations;
2 compounds;
3 inanimate things;
4 participants;
5 events.
Even though Beaudoin’s taxonomy differs from Wu and Barsalou’s one, it appears
clear that Flickr tags tend to favor contextual entities. As a matter of fact, as a
qualitative exploration, the Wu and Barsalou taxonomy was applied in FDT to the
168 concepts taken from McRae’s features norms (listed in Table 7.2). In particular,
after building the contingency matrix that displays the 168 concepts on the rows
and all of the tags that co-occur with them on the columns (step 6 of the FDT
procedure), the top 20 tags that co-occur with each of the 168 concepts (i.e. the
20 tags with higher SPMI value in each concept vector) were manually annotated
with the semantic roles of the Wu and Barsalou taxonomy. This annotation process
presented intrinsic difficulties, due to the fact that tags are one-word labels and
their relationship with the target concept can be ambiguous even when their part
of speech is not ambiguous. For example, given the concept table, its related tag
coffee might contribute to specify a subordinate category of tables (coffee tables), an
object that often appears on a table (cup of coffee), or an activity that can be done
at a table (having a coffee). In this regard, a specific case is represented by abstract
entities. Consider, for instance, how nature is associated with waterfall. According
to the coding scheme proposed in Wu & Barsalou (2009), this association could
be plausibly labeled as location (waterfalls are found in nature), associated_entity or
associated_abstract_entity (nature is simply associated with waterfalls as a related entity
that does not fulfill any strictly taxonomic relation), origin (waterfalls come from
nature), or superordinate (nature constitutes a taxonomic hypernym of waterfalls).
Although the difficulties in specifying the relation between concepts in FDT
might be considered a drawback of this distributional semantic space, they might
actually become a strength when dealing with abstract concepts. As suggested
in Recchia and Jones (2012), predicates are not necessarily the basic units for
semantic analysis: A concept such as law is clearly associated and described by
courthouse, crime, and justice, and these concepts may play a role in the semantic
representation of law, even though it is difficult to express through a short sentence
the nature of the relationship between law and each of the above mentioned
related concepts. In this respect, FDT is less constrained than other databases
of semantic features collected in experimental settings, which often address only
concrete concepts. The naturalness and genuineness of Flickr tags has been clearly
stated by one of Flickr’s co-founders in the following terms: “free typing loose
associations is just a lot easier than [for example] making a decision about the
degree of match to a pre-defined category (especially hierarchical ones). It’s like
90% of the value of a ‘proper’ taxonomy but 10 times simpler” (Butterfield,
2004).
The outcomes of the manual annotation display a very different scenario when
compared with the features most frequently produced by the speakers in
McRae's features norms. The top-ranked tags in FDT express features that are
associated with the given concepts by relationships of:
1 locations;
2 associated entities;
3 superordinates;
4 functions;
5 external surface properties.
This distribution reflects the fact that the conceptual representations in FDT
are indeed highly contextualized, appearing as they do in photographs that capture
real experiences, and thus unavoidably involve other entities.
TABLE 7.3 The average Pearson’s correlation coefficients between semantic representations in
FDT, McRae’s features norms, and three metrics of similarity/relatedness based on WordNet
(JCN, WUP, and PATH). All coefficients are significant with p < 0.001, besides JCN/WUP,
significant with p < 0.005.
As the table shows, FDT generates semantic representations that correlate positively
with state of the art metrics of semantic similarity/relatedness. Interestingly, FDT
semantic representations show high correlations with those computed through
PATH, which counts the number of nodes along the shortest path between the
senses in the “is-a” WordNet hierarchies.
In conclusion, Study One showed that a distributional analysis performed
with FDT on Flickr Big Data (image tags) delivers semantic representations
that correlate with those obtained from speaker-generated features elicited in an
experimental setting, as well as those obtained from the WordNet environment.
Study Two
The purpose of this study was to evaluate FDT’s ability to categorize given
concepts into semantically meaningful clusters, based on their co-occurrence across
the Flickr tagsets, and compare such categorization to that obtained from semantic
feature norms.
FIGURE 7.1 (a) Data partitioning, analysis of the within-group sum of squares by number of clusters in FDT data. (b) Data partitioning, analysis of the within-group sum of squares by number of clusters on McRae's features norms.
FIGURE 7.2 (a) Cluster analysis performed in R with hclust on FDT data. The function cutree shows the solution for a six-way partitioning. (b) Cluster analysis performed in R with hclust on McRae's features norms data. The function cutree shows the solution for a six-way partitioning.
A look at the dendrogram shows that the six categories are not the six predetermined
categories, and they are not internally coherent. The six main categories that seem
to emerge are: General foods, a sample of vegetables, a sample of fruits, musical
instruments + weapons + vehicles + some animals, clothing, and animals. The
same procedure was performed on McRae’s features norms data, and it is plotted
in Figure 7.2(b). Here, the categories that seem to emerge are also non-coherent
from a semantic perspective: Fruit, other foods, weapons, musical instruments +
clothing + vehicles, birds, other animals.
A quantitative analysis of the data was performed with the package pvclust13
(Suzuki & Shimodaira 2006), which in addition to hclust assesses the uncertainty
in hierarchical clustering via multiscale bootstrap analysis, and highlights the results
that are highly supported by the data (e.g. p > 0.95). Figure 7.3(a) and (b) shows
the clusters on FDT and McRae’s features norms data that are supported by the
data with alpha = 0.95.
FIGURE 7.3 (a) Cluster analysis performed in R with pvclust on FDT data. (b) Cluster analysis performed in R with pvclust on McRae's features norms data.
TABLE 7.4 Different measures of cluster analysis validation, related to the six-way
cluster solutions obtained with FDT data and McRae feature norms, and set top and
bottom rules.
other hand, allowing for a higher number of clusters, FDT shows accurate
intra-category distinctions between different types of vehicles (air, ground, and
water transportation), as well as different types of animals (farm animals and wild
animals).
In conclusion, Study Two showed that mining Flickr tags through the FDT
procedure, it is possible to create a cognitively plausible categorization of given
concepts, which are automatically divided into semantically coherent clusters.
Conclusions
Big Data such as the metadata that characterize the Web 2.0 (the web of
social networks) are increasingly attracting the interest of a variety of scientific
communities for they present at the same time a challenge and an opportunity to
bridge toward the next web, or Web 3.0, the Semantic Web. The challenge is that
of creating adequate tools for mining and structuring the semantic information that
such datasets contain; the opportunity is that of the availability of huge amounts of
data that are not subject to the same biases and constraints of laboratory-collected
data. Yet, the majority of applications that involve Big Data mining seem to be
specifically market-oriented (rather than being aimed at reproducing the human
cognitive system) and task-oriented (designed to serve a specific query). On the
other hand, in cognitive science the distributional hypothesis that underlies the
implementation of several word space models is gaining more attention thanks
to the recent implementation of hybrid and multimodal word spaces that would
account for the grounded nature of human semantic knowledge, by integrating in
the word vectors some perceptually derived information.
The study presented here relates to both of the above-mentioned fields of
research (Big Data mining and cognitive science), for it proposes a distributional
semantic space that harvests semantic representations by looking at concepts’
co-occurrences across Flickr tags, and it compares them to those that are obtained
from semantic data collected in experimental settings (speaker-generated features).
The analyses here presented showed that FDT is a general distributional semantic
space for meaning representation that is based on a unique but intrinsically
variegated source of semantic information, and thus it avoids the artificial and
arbitrary merging of linguistic and visual streams of semantic information, as
it is performed by hybrid distributional models. FDT’s fair ability to model
semantic representations and predict human-like similarity judgments suggests that
this would be a promising way to observe from a distributional perspective the
semantic representations of abstract concepts, a timely topic in cognitive science
(e.g. Pecher, Boot, & Van Dantzig, 2011). The fact that abstract concepts can
hardly be described through predicates suggests that FDT, which is based on
contextualized conceptual associations, could highlight some hidden peculiarities
of the inner structure of abstract concepts. The scientific literature about abstract
concepts’ representations claims that these concepts differ from concrete ones in
that their mental simulation appears to involve a wider range of participating
entities, arguably because they are used in a wider variety of contexts (Barsalou &
Wiemer-Hastings, 2005). Moreover, the processing of abstract concepts seems to
require a special focus on introspections and emotions, which seems to be less
crucial for concrete concepts (Kousta, Vigliocco, Vinson, Andrews, & Del Campo,
2011). These claims will need to be investigated through extensive distributional
analyses of concrete versus abstract concepts in the FDT environment.
Notes
1 On demand, the author can release the raw data and the materials used for the
studies described in the following sections.
2 Ranking the number of tags attributed to each picture in an informal analysis
conducted over a sample of 5 million pictures (i.e. how many pictures have one,
two, three tags, etc.), it appeared that most pictures contain one to 15 tags. After
170 M. Bolognesi
15, the graph’s curve that indicates the number of pictures containing 15+ tags
drops dramatically.
3 This number was chosen without a specific quantitative investigation: Of the 15 tags considered, the first three are assumed to be the most salient, but a deeper psycholinguistic investigation could test whether tagging speed actually decreases after the first three tags, which would suggest a decrease in salience.
4 Other validation methods, such as the popular purity and entropy measures obtained, for example, with the software CLUTO, require the experimenter to set the number of clusters in advance, a step that was deliberately avoided here.
5 In the instructions provided to participants, McRae and colleagues ask them to list properties of the concept to which a given word refers. For the exact wording of the instructions, cf. McRae et al. (2005).
6 I am extremely grateful to Ted Pedersen for personally providing these data.
7 A straightforward count of nodes in the noun and verb WordNet “is-a” hierarchies.
8 Wu and Palmer (1994).
9 Jiang and Conrath (1997).
10 The JCN measure requires that a word be observed in the sense-tagged corpus used to create information content values, and the coverage of this corpus is somewhat limited. As a result, quite a few pairs showed zero relatedness, which should be read as zero information about the pair rather than zero relatedness. The concepts whose vectors consisted entirely of zeroes are: blackbird, carrot, crocodile, dolphin, dunebuggy, flamingo, garlic, giraffe, gorilla, hamster, helicopter, limousine, mittens, moose, mushroom, nectarine, pajamas, parsley, pineapple, pumpkin, pyramid, raspberry, raven, robin, scooter, skateboard, turtle, zebra, zucchini. Excluding these concepts from the correlation study, the average correlation between JCN and FDT is 0.78 (SD = 0.12).
11 Documentation for the R function hclust (package stats). Arguments: method=“ward”; other parameters were left at their default settings.
12 Documentation for the R function cutree (package stats). Arguments: k=6.
13 Documentation for the R function pvclust (package pvclust). Arguments: method.hclust=“ward”, method.dist=“euclidean”.
14 Documentation for the R function clValid (package clValid).
15 Documentation at http://deim.urv.cat/~sergio.gomez/multidendrograms.php.
References
Ahn, L., & Dabbish, L. (2004). Labeling images with a computer game. Proceedings of the
SIGCHI Conference 2004 (pp. 319–326). Wien, Austria.
Ames, M., & Naaman, M. (2007). Why we tag: Motivations for annotation in mobile and
online media. In Proceedings of the SIGCHI Conference 2007 (pp. 971–980). New York,
NY, USA.
Andrews, M., Vigliocco, G., & Vinson, D. (2009). Integrating experiential and
distributional data to learn semantic representations. Psychological Review, 116(3), 463–498.
Baroni, M., & Lenci, A. (2008). Concepts and properties in word spaces. In A. Lenci (Ed.),
From context to meaning: Distributional models of the lexicon in linguistics and cognitive science.
Special issue of the Italian Journal of Linguistics, 20(1), 55–88.
Baroni, M., & Lenci, A. (2010). Distributional Memory: A general framework for corpus-
based semantics. Computational Linguistics, 36(4), 673–721.
Baroni, M., Evert, S., & Lenci, A. (Eds.) (2008). Lexical Semantics: Bridging the
gap between semantic theory and computational simulation. Proceedings of the ESSLLI
Workshop on Distributional Lexical Semantics 2008.
Barsalou, L. (1983). Ad hoc categories. Memory & Cognition, 11, 211–227.
Barsalou, L., & Wiemer-Hastings, K. (2005). Situating abstract concepts. In D. Pecher and
R. Zwaan (Eds.), Grounding cognition: The role of perception and action in memory, language,
and thought (pp. 129–163). New York: Cambridge University Press.
Beaudoin, J. (2007). Flickr® image tagging: Patterns made visible. Bulletin of the American
Society for Information Science and Technology, 34(1), 26–29.
Bolognesi, M. (2014). Distributional Semantics meets Embodied Cognition: Flickr® as a
database of semantic features. Selected Papers from the 4th UK Cognitive Linguistics Conference
(pp. 18–35). London, UK.
Bouma, G. (2009). Normalized (Pointwise) mutual information in collocation extraction.
In C. Chiarcos, R. E. de Castilho, & M. Stede (Eds.), From form to meaning: Processing texts
automatically. Proceedings of the Biennial GSCL Conference (pp. 31–40). Potsdam, Germany:
Narr Verlag.
Bruni, E., Tran, N., & Baroni, M. (2014). Multimodal Distributional Semantics. Journal of
Artificial Intelligence Research, 49, 1–47.
Buratti, A. (2011). FlickrSearch 1.0. Retrieved August 2014 from https://code.google.com/
p/irrational-numbers/wiki/Downloads.
Burgess, C., & Lund, K. (1997). Modelling parsing constraints with high-dimensional
context space. Language and Cognitive Processes, 12, 1–34.
Butterfield, S. (2004, August 4) Sylloge. Retrieved March 20, 2014 from http://
www.sylloge.com/personal/2004/08/folksonomy-social-classification-great.html.
Cree, G., & McRae, K. (2003). Analyzing the factors underlying the structure and
computation of the meaning of chipmunk, cherry, chisel, cheese, and cello and many
other such concrete nouns. Journal of Experimental Psychology, 132, 163–201.
El-Hamdouchi, A., & Willett, P. (1986). Hierarchic document clustering using Ward’s
method. Proceedings of the Ninth International Conference on Research and Development in
Information Retrieval (pp. 149–156). Washington: ACM.
Firth, J. (1957). Papers in Linguistics. London: Oxford University Press.
Harris, Z. (1954). Distributional structure. Word, 10(2–3), 146–162.
Heckner, M., Heilemann, M. & Wolff, C. (2009). Personal information management vs.
resource sharing: Towards a model of information behaviour in social tagging systems.
International AAAI Conference on Weblogs and Social Media (ICWSM), May, San Jose, CA,
USA.
Jiang, J., & Conrath, D. (1997). Semantic similarity based on corpus statistics and lexical
taxonomy. The International Conference on Research in Computational Linguistics, Taiwan.
Johns, B. T., & Jones, M. N. (2012). Perceptual inference from global lexical similarity.
Topics in Cognitive Science, 4(1), 103–120.
Körner, C., Benz, D., Hotho, A., Strohmaier, M., & Stumme, G. (2010) Stop thinking,
start tagging: Tag semantics arise from collaborative verbosity. Proceedings of the 19th
International Conference on World Wide Web (WWW 2010), April, Raleigh, NC, USA:
ACM.
Kousta, S., Vigliocco, G., Vinson, D., Andrews, M., & Del Campo, E. (2011). The
representation of abstract words: Why emotion matters. Journal of Experimental Psychology,
140, 14–34.
Landauer, T., & Dumais, S. (1997). A solution to Plato’s problem: The Latent
Semantic Analysis theory of the acquisition, induction and representation of knowledge.
Psychological Review, 104(2), 211–240.
McRae, K., Cree, G., Seidenberg, M., & McNorgan, C. (2005). Semantic feature
production norms for a large set of living and nonliving things. Behavioral Research
Methods, Instruments, and Computers, 37, 547–559.
Marlow, C., Naaman, M., Boyd, D., & Davis, M. (2006). HT06, tagging paper, taxonomy,
Flickr, academic article, to read. In Proceedings of the 7th Conference on Hypertext and
Hypermedia (pp. 31–40).
Nov, O., Naaman, M., & Ye, C. (2009). Motivational, structural and tenure factors that
impact online community photo sharing. Proceedings of AAAI International Conference on
Weblogs and Social Media (ICWSM 2009).
Pecher, D., Boot, I., & Van Dantzig, S. (2011). Abstract concepts: Sensory-motor
grounding, metaphors, and beyond. In B. Ross (Ed.), The psychology of learning and
motivation, Vol. 54, pp. 217–248. Burlington: Academic Press.
Pedersen, T., Patwardhan, S., & Michelizzi, J. (2004). WordNet::Similarity—Measuring the
relatedness of concepts. Proceedings of Fifth Annual Meeting of the North American Chapter of
the Association for Computational Linguistics (NAACL-04) (pp. 38–41). May, Boston, MA.
Peters, I., & Stock, W. (2007). Folksonomy and information retrieval. Proceedings of the 70th
Annual Meeting of the American Society for Information Science and Technology, 70, 1510–1542.
Peters, I., & Weller, K. (2008). Tag gardening for folksonomy enrichment and maintenance.
Webology, 5(3), article 58.
Rapp, R. (2004). A freely available automatically generated thesaurus of related words.
Proceedings of the 4th International Conference on Language Resources and Evaluation LREC
2004 (pp. 395–398).
Recchia, G., & Jones, M. (2012). The semantic richness of abstract concepts. Frontiers in
Human Neuroscience, 6, article 315.
Riordan, B., & Jones, M. (2010). Redundancy in linguistic and perceptual experience:
Comparing distributional and feature-based models of semantic representation. Topics in
Cognitive Science, 3(2), 303–345.
Sahlgren, M. (2006). The Word-Space Model: Using distributional analysis to represent syntagmatic
and paradigmatic relations between words in high-dimensional vector spaces. Doctoral thesis,
Department of Linguistics, Stockholm University.
Shaoul, C., & Westbury, C. (2008). Performance of HAL-like word space models on
semantic categorization tasks. Proceedings of the Workshop on Lexical Semantics ESSLLI 2008
(pp. 42–46).
Steyvers, M. (2010). Combining feature norms and text data with topic models. Acta
Psychologica, 133, 234–243.
Strohmaier, M., Koerner, C., & Kern, R. (2012). Understanding why users tag: A survey
of tagging motivation literature and results from an empirical study. Web Semantics, 17,
1–11.
Suzuki R., & Shimodaira, H. (2006). Pvclust: An R package for assessing the uncertainty
in hierarchical clustering. Bioinformatics 22(12), 1540–1542.
Thaler, S., Simperl, E., Siorpaes, K., & Hofer, C. (2011). A survey on games for knowledge
acquisition. STI technical report, May 2011, 19.
Turney, P., & Pantel, P. (2010). From frequency to meaning: Vector space models of
semantics. Journal of Artificial Intelligence Research, 37, 141–188.
Ward, J. (1963). Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association, 58, 236–244.
Weber, I., Robertson, S. & Vojnovic, M. (2008) Rethinking the ESP game. Report for Microsoft
Research, Microsoft Corporation.
Wu, L., & Barsalou, L. (2009). Perceptual simulation in conceptual combination: Evidence
from property generation. Acta Psychologica, 132, 173–189.
Wu, Z., & Palmer, M. (1994). Verb semantics and lexical selection. Proceedings of the 32nd
Annual Meeting of the Association for Computational Linguistics (pp. 133–138).
8
LARGE-SCALE NETWORK
REPRESENTATIONS OF SEMANTICS
IN THE MENTAL LEXICON
Simon De Deyne, Yoed N. Kenett, David Anaki,
Miriam Faust, and Daniel Navarro
Abstract
The mental lexicon contains the knowledge about words acquired over a lifetime. A
central question is how this knowledge is structured and changes over time. Here we
propose to represent this lexicon as a network consisting of nodes that correspond
to words and links reflecting associative relations between two nodes, based on free
association data. A network view of the mental lexicon is inherent to many cognitive
theories, but the predictions of a working model strongly depend on a realistic scale,
covering most words used in daily communication. Combining a large network
with recent methods from network science allows us to answer questions about its
organization at different scales simultaneously, such as: How efficiently and robustly is lexical knowledge represented, given the global network architecture? What are the organizing principles of words in the mental lexicon (i.e. thematic versus taxonomic)? How does local connectivity with neighboring words explain why certain words are processed more efficiently than others? Networks built from word associations are particularly well suited to address prominent psychological phenomena such as developmental shifts, individual differences in creativity, and clinical states like schizophrenia. Beyond showing how these phenomena can be studied with such networks, we also discuss various future challenges and the ways in which this proposal complements other perspectives.
Introduction
How do people learn and store the meaning of words? A typical American university
student knows the meaning of about 40,000 words by adulthood, but even young
& McNorgan, 2005), and predict measures such as the relatedness between word
pairs (Dry & Storms, 2009) and typicality as a member of a category (Ameel &
Storms, 2006). On the theoretical side, we can study the semantic properties and
relations among a small set of words using connectionist networks (McClelland, &
Rogers, 2003) or Bayesian models (Kemp, Tenenbaum, Griffiths, Yamada, & Ueda,
2006). One difficulty with these approaches is the very fact that they are small in
scale, and it is not clear which results will generalize when the entire lexicon is
considered. Indeed, most psychological studies rely on small sets of concrete nouns
(Medin, Lynch, & Solomon, 2000) even though work on abstract and relational
words is available (e.g. Recchia & Jones, 2012; Wiemer-Hastings & Xu, 2005).
Moreover, selection biases also extend towards certain types of relations between
these nouns (mostly perceptual properties and categorical relations) and types of task
(highlighting relations at the same hierarchical category level), which might render premature some of the conclusions about how the lexicon is structured and how it represents word meaning.
A different approach that emerges from linguistics might be called the “thesaurus”
model, and is best exemplified by WordNet (Fellbaum, 1998). WordNet is a
linguistic network consisting of over 150,000 words. The basic unit within WordNet
is a synset, a set of words deemed to be synonymous. For instance, the word platypus
belongs to a synset that contains duckbill, duck-billed platypus, and Ornithorhynchus
anatinus. Synsets are connected to one another via is-a-kind-of relationships, so the
platypus synset is linked to a synset that consists of monotreme and egg-laying mammal.1
Unlike the traditional psychological approach, WordNet does not suffer from the
problem of small scale. Quite the contrary, in fact: The synsets and their connections
form an extensive network from which one can predict the similarity of two
words, query the different senses that a word may have, and so on. Unfortunately,
when viewed as a tool for studying the mental lexicon, WordNet has fairly severe
limitations of its own. The fundamental difficulty is that WordNet is not derived
from empirical data, and as a consequence it misses properties that would be
considered critical when studying human semantics. A simple example would be
the fact that it treats the elements of a synset as equivalent. This is highly implausible:
While any Australian would have a detailed mental representation for platypus, only
a small group of experts would have any representation of the term Ornithorhynchus
anatinus. Moreover, the meaning of platypus is culture specific and will be
much more central within (some) Australian cultures than in American culture
(cf. Szalay & Deese, 1978). Even ignoring culture-specific knowledge that
differentiates among the members of the synset, the WordNet representation
misses important lexical knowledge about the word platypus that would be shared
among almost all English speakers. To most English speakers, platypus is a rare
word: The frequency of the word platypus (less than one per million words) is
just a fraction of that of duck (about 25 times per million words), which has a
tremendous influence on how quickly people can decide whether it is a real word
(839 ms for platypus versus 546 ms for duck), or name the animal in question (830 ms
versus 572 ms).2 As this illustrates, the WordNet approach—useful as it may be
for its original purpose—is not well suited to the empirical study of the mental
lexicon.
A third tradition, inspired by information retrieval research in computer science,
is to study semantic knowledge by analyzing the structure of large text corpora. One
of the best known approaches is latent semantic analysis (LSA; Landauer & Dumais,
1997). Using a vocabulary of nearly a hundred thousand words derived from a
variety of documents, LSA is able to capture the meaning of words by comparing
how similar the contexts are in which two words occur. For example, it infers that
opossum, marsupials, mammal, duck-billed, warm bloodedness, and anteater are related
to platypus because these words occur in similar contexts to platypus.3 In recent
years a large number of corpus-based methods have been developed (Recchia,
Sahlgren, Kanerva, Jones, & Jones, 2015). These methods differ in terms of how
they define a word’s context (e.g. the paragraph, the document, etc.), the extent to
which they use grammatical information (e.g. word order), and how the meaning is
represented (e.g. latent spaces, mixture models, etc.). Not surprisingly, the choice
of text corpus also has a very strong influence on how these models behave, and
can even become the determining factor in how well they capture human semantic
processing (Recchia & Jones, 2009).
One of the main selling points of the corpus approach is that it serves as an
existence proof for how meaning can be acquired from the world. That is, if the
text corpus is viewed as a summary of the statistics of the linguistic environment,
then LSA and the like can be construed as methods for extracting meaning from the
world (Firth, 1968; Landauer & Dumais, 1997). Without wishing to make the point
too strongly, there are some reasons to be cautious about the claim that this is how
humans do so. Even supposing that the text corpus is a good proxy for the statistics
of the linguistic input available to the human learner, it is not at all clear that the
linguistic input presents a good summary of the statistics of the world that children
are exposed to. For example, when people are asked to generate associates to banana,
the word yellow is one of the top responses, correctly capturing a relationship that
any child acquires from perceptual data. Yet, due to pragmatic constraints on human
discourse, we rarely talk about yellow bananas. Some studies have looked at this
explicitly, comparing participants who generated ten sentences in response to a set of verbal stimuli with others who completed a closely matched word association task. The results showed that the two types of responses, after careful preprocessing of the sentences, correlated only moderately (Szalay & Deese, 1978, r = 0.48). Similarly, word co-occurrences extracted from a large text corpus show only weak correlations with response frequencies from word associations (De Deyne, Verheyen, & Storms, 2015). Second, non-linguistic processes contribute to word meaning in the lexicon that are picked up by word associations but not necessarily by text (for the reasons
we just mentioned). Further evidence comes from a study on the incidental learning
of word associations during sentence processing (Prior & Bentin, 2003). Finally, in
contrast to natural discourse where fully grammatical utterances need to be formed,
there is less monitoring of the responses in word association tasks. Presumably, the
transformation from idea to language is quicker and easier (Szalay & Deese, 1978).
Limitations notwithstanding, all three methods serve a useful purpose, but each
captures a different aspect of the problem. In this chapter we discuss a different
approach, one that comes with its own perspective on the problem. Drawing
from traditional experimental psychology, we seek to base our study on empirical
measures. Much like the corpus approach, we aim to describe the mental lexicon
on a large scale in a fashion that is psychologically plausible. Finally, like WordNet,
the form of this lexical knowledge can be described using the language of networks.
semantic network. In the next two sections we explain how these two elements are
implemented in an explicit network, where the connections between words are
derived from human responses in a word association task.
FIGURE 8.1 Portion of the associative network around the word platypus showing direct
and indirect neighboring nodes.
Spreading Activation
The second key feature of the Collins and Loftus (1975) proposal is the notion of
spreading activation. Once a word in the network is activated, activation spreads to
other connected words, and quickly dissipates with time and distance (Collins and
Loftus, 1975; Den Heyer & Briand, 1986). This principle has been influential in
many psychological theories and models such as the adaptive control of thought
theory (Anderson, 1983, 1991), and various connectionist models of semantic
cognition (Lerner, Bentin, & Shriki, 2012; McClelland, & Rogers, 2003). Through
spreading activation, the meaning of a word in the network is represented by how
it is linked to other words and how these are interlinked themselves. In that sense,
spreading activation provides a mechanism by which distributed meaning can be
extracted from a network.
Formally, spreading activation can be implemented as a stochastic random walk defined over the network. Starting at a particular node, a random walker selects an outbound edge with a probability proportional to the edge weight and moves across it, gradually exploring more paths around the start node. With many such walkers, the probability of a walker occupying a specific node converges to a distribution that remains stable after many iterations. In this random walk, the relatedness in meaning between two nodes reflects the number and length of the directed paths through the network that connect them. Many short paths between a source and a target node allow a random walk to reach the target quickly, reflecting the fact that the two nodes are highly similar in meaning. In the simplest version, a single parameter determines this walk: It governs the decay of activation by weighting paths according to their length, so that longer paths receive less weight than shorter ones, which might be useful depending on the type of task.
In recent years, various empirical studies have demonstrated how memory search
is governed by a fairly simple random walk over semantic memory (Abbott et al.,
2015; Bourgin, Abbott, Griffiths, Smith, & Vul, 2014; Smith, Huber, & Vul, 2013).
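To illustrate (a minimal sketch under simplified assumptions, not the authors' implementation), the decayed spreading-activation walk can be written as a weighted sum of powers of the transition matrix: paths of length k contribute with weight α^k, so longer paths count less. The toy network and the decay value below are invented for the example.

```python
import numpy as np

# Toy weighted, directed association network (rows = cues, columns = responses).
words = ["platypus", "duck", "animal", "water"]
W = np.array([
    [0.0, 2.0, 3.0, 0.0],   # platypus -> duck, animal
    [1.0, 0.0, 2.0, 3.0],   # duck -> platypus, animal, water
    [0.0, 1.0, 0.0, 1.0],   # animal -> duck, water
    [0.0, 2.0, 1.0, 0.0],   # water -> duck, animal
])

# Row-normalize to obtain the transition matrix of the random walk.
P = W / W.sum(axis=1, keepdims=True)

def walk_relatedness(P, alpha=0.75, max_len=10):
    """Sum paths of length 1..max_len, down-weighting longer paths by alpha**k."""
    R = np.zeros_like(P)
    Pk = np.eye(P.shape[0])
    for k in range(1, max_len + 1):
        Pk = Pk @ P                # probabilities of k-step paths
        R += (alpha ** k) * Pk     # decay: longer paths contribute less
    return R

R = walk_relatedness(P)
i, j = words.index("platypus"), words.index("duck")
print(f"relatedness platypus -> duck: {R[i, j]:.3f}")
```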
To sum up, we propose to construct the mental lexicon as a network derived from word associations. This network is a localist representation with nodes corresponding
to words.5 The semantic representations derived from it are functionally distributed,
in the sense that the meaning of a word is represented by activation distributed
over all edges connected with that word. The scale of the network is crucial: If the
network is too small or too poorly connected the spreading activation mechanism
becomes biased and lower frequency words like platypus might become unreachable
(i.e. they will have no incoming links).
network, rather than any particular part. For example, the network of the mental
lexicon exhibits a small-world structure. In comparison to random networks,
small-world networks are characterized by a high degree of clustering, while
maintaining short paths between nodes (Borge-Holthoefer & Arenas, 2010a; De
Deyne & Storms, 2008; Solé, Corominas-Murtra, Valverde, & Steels, 2010; Steyvers
& Tenenbaum, 2005). Similarly, mental lexicon networks contain a small number of highly connected nodes, or hubs, to a much greater extent than would be expected in a random graph. In network terms, these hubs exhibit a degree (i.e. number of connected nodes) that is much higher than that of other nodes. More generally, the connectivity of the network has a characteristic distribution, in which the degrees of the nodes follow a truncated power-law distribution (Morais et al., 2013).
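The macroscopic indices mentioned above (hubs, clustering, short path lengths) can be computed with standard network tools. The following sketch uses the networkx library on a small invented association graph; real lexicon networks contain thousands of nodes, but the calls are the same.

```python
import networkx as nx

# Toy directed, weighted association network; weights are invented response counts.
edges = [
    ("cat", "dog", 3), ("dog", "cat", 4), ("dog", "bone", 2),
    ("bone", "dog", 1), ("cat", "milk", 2), ("milk", "cow", 3),
    ("cow", "milk", 2), ("cow", "animal", 1), ("animal", "dog", 2),
]
G = nx.DiGraph()
G.add_weighted_edges_from(edges)

# Hubs: the most highly connected nodes in the degree distribution.
degrees = dict(G.degree())
hubs = sorted(degrees, key=degrees.get, reverse=True)[:3]

# Small-world ingredients: clustering and average shortest path length
# (computed on the undirected projection; ASPL requires a connected graph).
U = G.to_undirected()
clustering = nx.average_clustering(U)
aspl = nx.average_shortest_path_length(U)

print("hubs:", hubs)
print(f"average clustering: {clustering:.2f}, average shortest path length: {aspl:.2f}")
```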
These properties are not arbitrary: There is evidence that macroscopic-level
properties such as small-world organization produce networks that are robust against
damage and allow efficient information distribution (Borge-Holthoefer, Moreno, &
Arenas, 2012). Moreover, small world organization often emerges when a network
forms via growth process, as is the case for other dynamic networks such as networks
of scientific collaboration, neural networks, and the World Wide Web (Watts &
Strogatz, 1998). From this perspective, the observed structure of the semantic
network provides insights into how the network grows over time (Steyvers &
Tenenbaum, 2005).
syndrome. In that study, the semantic network of people with Asperger syndrome
was characterized by higher modularity than the network derived from controls,
which is argued to be related to rigidity in thought (Kenett, Gold & Faust, 2015).
Another example of this is the case of schizophrenia. Here semantic networks could
be used to test whether thought disorders can be attributed to a lack of inhibition or
an increase in activation spreading through the lexicon, leading to phenomena such
as hyper-priming, where semantically related words show larger priming effects
compared to normal controls (Gouzoulis-Mayfrank et al., 2003; Pomarol-Clotet,
Oh, Laws, & McKenna, 2008; Spitzer, 1997).
FIGURE 8.2 Hierarchical tree visualization of clusters in the lexicon with the five most central members of the mental lexicon, adapted from De Deyne, Verheyen, & Storms
(2015). While the deepest level of the hierarchy shows coherent content, higher levels
also convey the relations between smaller clusters and highlight other organizational
principles of the lexicon. For example, at the highest level a word’s connotation or
valence tends to capture network structure.
indirect links that exist between them. The idea that the closeness between a pair of
nodes in terms of the paths connecting them predicts the time to verify sentences
like a bird can fly motivated the early propositional network model by Collins and
Quillian (1969).
Human relatedness judgments for word pairs like rabbit—hare or rabbit—carrot provide a direct way to test various topological network properties at the mesoscopic level and to test hypotheses for different kinds of words (e.g. abstract or concrete) and semantic relations. Distinct topological properties of the network affect these predictions. A first factor is the role of weak links: Introducing weak links through continued responses yields a systematic improvement over networks derived from single-response procedures (De Deyne, Navarro, & Storms, 2013; Hahn, 2008; Kenett et al., 2011). A second factor is the role of indirect
links that could contribute to the relatedness of word pairs. Several studies show
that incorporating indirect mesoscopic structure using random walks improves
predictions of human similarity judgments (Borge-Holthoefer & Arenas, 2010a; De
Deyne, Verheyen, & Storms, 2015; Van Dongen, 2000) and can be used to extract
ensemble of paths with arbitrary lengths. At the end of this spectrum, the summed
activation over many nodes will start to resemble the activation of semantic features
in distributed models (e.g. Plaut, McClelland, Seidenberg, & Patterson, 1996).
Thus, a network account provides a flexible, yet well-defined way to understand
many of the documented priming effects but also questions certain theoretical
predictions, for instance in the case of the symmetric nature of semantic priming.
Storms, 2015; Gupta, Jang, Mednick, & Huber, 2012; Thompson & Kello,
2014).
(Steyvers & Tenenbaum, 2005), contextual diversity, and word frequency (Monaco,
Abbott, & Kahana, 2007). In the memory literature, the clustering coefficient of a
node has been proposed to explain a host of other memory and word processing
phenomena including recognition (Nelson, Zhang, & McKinney, 2001) and cued
recall (Nelson et al., 1998). However, more often than not a network is invoked only as an explanatory device rather than as a full-fledged computational model.
Why should large-scale network representations be used to examine the microscopic level of organization of the mental lexicon? The main reason is that large-scale network implementations offer a more complete framework in which the various ways that nodes can have processing advantages can be tested explicitly. If the network is sufficiently large, it is possible to cover both the number of incoming and outgoing links and the number of links that might exist among the neighbors of a node, leading to richer accounts of node centrality than previous proposals. A good example is the concreteness effect, the finding that highly imageable words such as chicken are processed faster and more accurately than abstract ones like intuition in word processing tasks such as lexical decision (Kroll & Merves, 1986). According to one hypothesis about the representation
of these words in memory, concrete words have smaller associate sets than abstract ones
(Galbraith & Underwood, 1973; Schwanenflugel & Shoben, 1983; and see de Groot,
1989), but such an explanation ignores both the weights and directionality of the
links. This goes against evidence suggesting that centrality measures derived from undirected networks correspond less well than directed centrality measures with external variables such as imageability (Galbraith & Underwood, 1973), age of acquisition (De Deyne & Storms, 2008), or decision latencies in lexical decision (De Deyne, Navarro, & Storms, 2013). In particular, estimates
of in-degree and in-strength rely on how representative the set of cues is to build the
network. For example, if, for some reason, the word water, which frequently occurs
as a response, was never presented as a cue, the out-degree or out-strength for many
words will be biased as these responses are not encoded in the network.
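In graph terms, the in-/out-degree and strength measures discussed here are one-liners once the norms are stored as a weighted, directed graph. A small sketch with networkx and invented cue–response counts:

```python
import networkx as nx

# Toy cue -> response network; edge weights are (invented) response counts.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("thirst", "water", 40), ("rain", "water", 25), ("ocean", "water", 30),
    ("water", "drink", 35), ("water", "sea", 20), ("drink", "water", 15),
])

# In-degree/in-strength: how many distinct cues produce the word, and how often.
in_degree = G.in_degree("water")                      # number of incoming links
in_strength = G.in_degree("water", weight="weight")   # summed incoming weights

# Out-degree/out-strength: only observable if the word was itself used as a cue.
out_degree = G.out_degree("water")
out_strength = G.out_degree("water", weight="weight")

print(in_degree, in_strength, out_degree, out_strength)  # 4 110 2 55
```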
Similarly, other studies have found that centrality measures capturing some of the mesoscopic or macroscopic properties might explain additional variance in
word processing. One example of such a measure is the word centrality measure
(Kenett et al., 2011). This measure examines the effect of each node on the general
structure of the network. This is achieved by removing a node and examining
the effect of the node removal on the average shortest path length (ASPL) of
the network without that node. In a study analyzing the Hebrew mental lexicon,
Kenett et al. (2011) found that some nodes greatly increase the ASPL of the network
once they are removed, thus indicating that these facilitative nodes enhance the
spread of activation within the network. The authors also found that some nodes
greatly decrease the ASPL of the network once they are removed, thus indicating the
presence of inhibitive nodes that hamper the spread of activation within the network.
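The node-removal logic behind this word centrality measure can be sketched as follows (a simplified, unweighted illustration rather than the original Hebrew-lexicon analysis): remove a node, recompute the average shortest path length, and compare it with the intact network.

```python
import networkx as nx

def aspl_impact(G, node):
    """Change in average shortest path length (ASPL) when `node` is removed.

    Positive values: removal lengthens paths (a facilitative node);
    negative values: removal shortens paths (an inhibitive node).
    """
    base = nx.average_shortest_path_length(G)
    H = G.copy()
    H.remove_node(node)
    # Restrict to the largest connected component in case removal disconnects the graph.
    giant = H.subgraph(max(nx.connected_components(H), key=len))
    return nx.average_shortest_path_length(giant) - base

# Toy undirected lexicon: "animal" acts as a shortcut between two word clusters.
G = nx.Graph()
G.add_edges_from([
    ("cat", "dog"), ("dog", "wolf"), ("cat", "mouse"),
    ("car", "truck"), ("truck", "bus"), ("bus", "car"),
    ("animal", "cat"), ("animal", "wolf"), ("animal", "bus"),
])

for w in ["animal", "mouse"]:
    print(w, round(aspl_impact(G, w), 3))
```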
Altogether, more complex centrality measures, based on the reverberatory spread of activation from a node or on ASPL, highlight how considering the complexity at each of the three levels provides a richer explanation of word processing effects.
Discussion
In this chapter we have shown how a large-scale network representation of the
mental lexicon and the processes operating on it can account for a large diversity of
cognitive phenomena. At the macroscopic level these include language development,
creativity, and communication and thought disorders in clinical populations. At
the mesoscopic level they include the analysis of lexicon organization principles,
semantic relatedness, semantic priming, and word retrieval processes. At the
microscopic level they include explanations for word processing advantages for
environmental variables such as concreteness, age of acquisition, or word frequency,
but also an overarching framework for memory-based explanations including the
fan effect and potentially other measures of semantic richness of words. These
studies gradually depict a larger, broader picture of the role of lexicon structure
in a wide variety of cognitive phenomena. It can be expected that applications of
network theory to studies of the lexicon will continue to grow in the years to come
(Baronchelli, Ferrer-i-Cancho, Pastor-Satorras, Chater, & Christiansen, 2013; Faust
& Kenett, 2014).
From a modeling perspective, looking at different scales of the network
provides us with a rich way of evaluating and contrasting different proposals. In
particular, any model of semantic processing can now be evaluated in terms of the
type of macroscopic structure it exhibits, which can be achieved by looking at
degree-distributions or global modularity indices. At this level, models are expected
to be robust against damage and promote efficient diffusion of information. At the
mesoscopic level, models should be able to account for the relatedness in meaning
between words both in offline and online tasks. Contrasting different tasks such as
overt relatedness judgments and semantic priming allows us to investigate issues such
as the time course of information retrieval from the lexicon, potential asymmetric
semantic relations, and the interaction between the centrality of a node and how
its meaning is accessed. Finally, at a microscopic level, the network-based account
indicates that word processing advantages can be the result of distinct connectivity
patterns in directed networks, which are equally likely to affect the predictions of
language-based tasks (e.g. production or retrieval; online or offline) in different
ways. At each of these three levels, various studies have provided valid accounts, but
only a few studies have integrated evidence to capture the multilevel structure of the
lexicon (Griffiths, Steyvers, & Tenenbaum (2007) is a notable exception). Adopting this approach limits the space of possible models but also forces models to be comprehensive
from the start. Ultimately, both factors should provide us with a more accurate
appraisal of how knowledge is represented throughout the lexicon.
The application of a multilevel network view might even go further than
providing a general framework upon which several empirical hypotheses and
predictions can be examined simultaneously via quantitative means. It might also
inspire new theoretic views. One example of how large-scale mental networks
are starting to shape theories in cognitive science is a novel proposal which relates
lexicon structure to typical and atypical semantic processing (Faust & Kenett, 2014).
This theory proposes a cognitive continuum of lexicon structure. At one extreme of this continuum lie rigid, structured lexicon networks, such as those exhibited by individuals with Asperger syndrome (Kenett, Gold & Faust, 2015). At the other extreme lie chaotic, unstructured lexicon networks, such as those exhibited by individuals with schizophrenia (Zeev-Wolf, Goldstein, Levkovitz, & Faust, 2014). According to this theory, efficient semantic processing is achieved via a balance between rigid and chaotic lexicon structure (Faust & Kenett, 2014). For example, with regard to individual differences in creative ability, the more rigid the mental lexicon structure, the less creative it is, up to the point of a clinical state; conversely, the more chaotic the structure, the more creative it is, again producing a clinical state in the extreme case (see also Bilder & Knudsen,
2014). This theory demonstrates how network analysis of the mental lexicon can
provide a general account for a wide variety of cognitive phenomena. In this regard,
large-scale representations of the mental lexicon are crucial to advancing such
network analysis in cognitive science. As stated above, uncovering larger portions
of the mental lexicon, via large-scale representations, will advance examination of
cognitive phenomena at all network levels.
Challenges
Recently, Griffiths (2015) presented a manifesto for a computational cognitive
revolution in the era of Big Data. In line with his view, we advocate in this chapter
the significance of investigating the mental lexicon from a multi-level, Big Data,
association-based approach. So far we have mainly focused on the representation
of word meanings, without saying much about other factors that affect semantic
processing and memory retrieval. Various studies provide indirect evidence that
executive functions, working memory, attention, mood and personality traits all
contribute to how we process and retrieve meaning in the lexicon (Bar, 2009; Beaty,
Silvia, Nusbaum, Jauk, & Benedek, 2014; Benedek, Franz, Heene, & Neubauer,
2012; Benedek, Jauk, Sommer, Arendasy, & Neubauer, 2014; Heyman, Van
Rensbergen, Storms, Hutchison, & De Deyne, 2015). To advance the application of large-scale representations of the mental lexicon, such applications must account for the effects of these variables.
A further challenge is elucidating the relation between the phonological network
(Arbesman, Strogatz, & Vitevitch, 2010), which serves as the gateway into the
mental lexicon, and the semantic network. This relation can be studied from a
network of networks perspective (Kenett et al., 2014), which provides a way to analyze networks that are related to each other and the interactions between them. Such
an approach will enable quantitative analysis of broader linguistic issues.
Besides the challenges in further integrating psychological factors and other
aspects of word representations, the use of network analysis also raises some
methodological challenges. The picture drawn so far still only covers a small portion
of how network science for large graphs can contribute to many psychologically
interesting phenomena. Many methods and ideas in network science are recent developments and continue to improve. Only a few years ago, the use of binary undirected networks was dictated by the lack of methods for
weighted directed graphs. Similarly, the statistical underpinnings for identifying
clusters in these types of networks (Lancichinetti et al., 2011) or comparing different
networks have only very recently become available. Currently, developing statistical
models that allow us to test hypotheses for comparing networks remains a major
challenge for applying network science in empirical research. This is mainly due to the difficulty of estimating or collecting a large sample of empirical networks, and only a few statistical methods exist for comparing networks (Moreno & Neville, 2013). In such cases, bootstrapping methods over comparable networks offer a solution (Baxter, Dorogovtsev, Goltsev, & Mendes, 2010). A similar approach is
used in more advanced applications of community detection for large-scale directed
weighted networks, where the cluster membership is determined by evaluating
the likelihood of this event in a comparable random network (Lancichinetti et al.,
2011).
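As a rough illustration of this kind of null-model reasoning (a simplified sketch on a small undirected, unweighted graph, not the weighted directed procedure of Lancichinetti et al.), one can compare the modularity of detected communities against the modularity obtained on degree-preserving randomizations of the same graph:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

def modularity_vs_null(G, n_null=50, seed=1):
    """Observed modularity versus degree-preserving randomized (null) networks."""
    observed = modularity(G, greedy_modularity_communities(G))
    null_values = []
    for i in range(n_null):
        H = G.copy()
        # Rewire while keeping every node's degree fixed.
        nx.double_edge_swap(H, nswap=2 * H.number_of_edges(),
                            max_tries=100 * H.number_of_edges(), seed=seed + i)
        null_values.append(modularity(H, greedy_modularity_communities(H)))
    return observed, null_values

# Toy graph with two loosely connected clusters.
G = nx.Graph()
G.add_edges_from([
    ("cat", "dog"), ("dog", "mouse"), ("mouse", "cat"), ("cat", "pet"),
    ("car", "bus"), ("bus", "truck"), ("truck", "car"), ("bus", "road"),
    ("pet", "road"),
])
obs, null = modularity_vs_null(G)
print(f"observed modularity: {obs:.2f}; null mean: {sum(null) / len(null):.2f}")
```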
Perhaps an even bigger challenge is the need to incorporate dynamic properties into the networks, in order to address the dynamic nature of semantic processing over the mental lexicon as well as its growth and evolution over the lifespan. Such
a dynamic time-course process of semantic retrieval within an individual might
involve the availability of different types of semantic information. In these cases it
Acknowledgments
Work on this chapter by the first author was supported by ARC grant DE140101749. This work was also supported, for authors YNK and MF, by a Binational Science Foundation (BSF) grant (number 2013106) and by the I-CORE Program of the Planning and Budgeting Committee and the Israel Science Foundation (grant 51/11). DJN received salary support from ARC grant FT110100431. Comments may be sent to simon.dedeyne@adelaide.edu.au.
Notes
1 Information retrieved from http://wordnet.princeton.edu/wordnet/man/wnstats.
7WN.html.
2 Results were retrieved from the English Lexicon project website, see http://
elexicon.wustl.edu.
3 For similar examples, try deriving neighbors using the LSA website at
http://lsa.colorado.edu.
4 Some researchers do ask participants to give “meaningful responses” (Nelson,
McEvoy, & Dennis, 2000); in all of our own studies, we have stuck to a truly free association task.
5 The idea that each word maps onto exactly one node is most likely an unrealistic
assumption about how words are actually represented in the brain. However,
this simplification offers us both the flexibility needed to integrate key findings
in word processing and the ability to understand explicitly how information is
retrieved, since the state of the network can be interpreted by looking at which nodes are activated.
6 In this easy example, the related word is star.
References
Abbott, J. T., Austerweil, J. L., & Griffiths, T. L. (2015). Random walks on semantic
networks can resemble optimal foraging. Psychological Review, 122(3), 558–569.
Aitchison, J. (2012). Words in the mind: An introduction to the mental lexicon. Oxford:
Wiley-Blackwell.
Ameel, E., & Storms, G. (2006). From prototypes to caricatures: Geometrical models for
concept typicality. Journal of Memory and Language, 55, 402–421.
Anderson, J. R. (1983). A spreading activation theory of memory. Journal of Verbal Learning
and Verbal Behavior, 22(3), 261–295.
Anderson, J. R. (1991). The adaptive nature of human categorization. Psychological Review,
98(3), 409.
Anderson, P. W. (1972). More is different. Science, 177(4047), 393–396.
Arbesman, S., Strogatz, S. H., & Vitevitch, M. S. (2010). The structure of phonological
networks across multiple languages. International Journal of Bifurcation and Chaos, 20(3),
679–685.
Baker, M. K., & Seifert, L. S. (2001). Syntagmatic-paradigmatic reversal in Alzheimer-type
dementia. Clinical Gerontologist, 23(1–2), 65–79.
Bar, M. (2009). A cognitive neuroscience hypothesis of mood and depression. Trends in
Cognitive Sciences, 13(11), 456–463.
Baronchelli, A., Ferrer-i-Cancho, R., Pastor-Satorras, R., Chater, N., & Christiansen, M. H.
(2013). Networks in cognitive science. Trends in Cognitive Sciences, 17(7), 348–360.
Baxter, G. J., Dorogovtsev, S. N., Goltsev, A. V., & Mendes, J. F. (2010). Bootstrap percolation
on complex networks. Physical Review E, 82(1), 011103.
Beaty, R. E., Silvia, P. J., Nusbaum, E. C., Jauk, E., & Benedek, M. (2014). The roles
of associative and executive processes in creative cognition. Memory & Cognition, 42(7),
1186–1197.
Beckage, N. M., Smith, L. B., & Hills, T. (2010). Semantic network connectivity is related
to vocabulary growth rate in children. In Proceedings of the 32nd Annual Conference of the
Cognitive Science Society, Portland, OR, USA (pp. 2769–2774).
Benedek, M., Franz, F., Heene, M., & Neubauer, A. C. (2012). Differential effects of
cognitive inhibition and intelligence on creativity. Personality and Individual Differences,
53(4), 480–485.
Benedek, M., Jauk, E., Sommer, M., Arendasy, M., & Neubauer, A. C. (2014). Intelligence,
creativity, and cognitive control: The common and differential involvement of executive
functions in intelligence and creativity. Intelligence, 46, 73–83.
Bilder, R. M., & Knudsen, K. S. (2014). Creative cognition and systems biology on the edge
of chaos. Frontiers in Psychology, 1–4.
Borge-Holthoefer, J., & Arenas, A. (2010a). Categorizing words through semantic memory
navigation. The European Physical Journal B-Condensed Matter and Complex Systems, 74(2),
265–270.
Borge-Holthoefer, J., & Arenas, A. (2010b). Semantic networks: Structure and dynamics.
Entropy, 12(5), 1264–1302.
Borge-Holthoefer, J., Moreno, Y., & Arenas, A. (2012). Topological versus dynamical
robustness in a lexical network. International Journal of Bifurcation and Chaos in Applied
Sciences and Engineering, 22(7), 1250157.
Bourgin, D. D., Abbott, J. T., Griffiths, T. L., Smith, K. A., & Vul, E. (2014). Empirical
evidence for markov chain Monte Carlo in memory search. In Proceedings of the 36th
Annual Meeting of the Cognitive Science Society.
Brown, R., & McNeill, D. (1966). The tip of the tongue phenomenon. Journal of Verbal
Learning and Verbal Behavior, 5(4), 325–337.
Cabana, A., Valle-Lisboa, J. C., Elvevåg, B., & Mizraji, E. (2011). Detecting order-disorder
transitions in discourse: Implications for schizophrenia. Schizophrenia Research, 131(1),
157–164.
Cañas, J. J. (1990). Associative strength effects in the lexical decision task. The Quarterly
Journal of Experimental Psychology, 42(1), 121–145.
Capitán, J. A., Borge-Holthoefer, J., Gómez, S., Martinez-Romo, J., Araujo, L., Cuesta,
J. A., & Arenas, A. (2012). Local-based semantic navigation on a networked representation
of information. PLoS One, 7(8), e43694.
Chumbley, J. I. (1986). The roles of typicality, instance dominance, and category dominance
in verifying category membership. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 12, 257–267.
Chwilla, D. J., & Kolk, H. H. (2002). Three-step priming in lexical decision. Memory &
Cognition, 30, 217–225.
Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic processing.
Psychological Review, 82, 407–428.
Collins, A. M., & Quillian, M. R. (1969). Retrieval time from semantic memory. Journal of
Verbal Learning and Verbal Behavior, 8, 240–247.
Cramer, P. (1968). Word association. New York, NY: Academic Press.
Crutch, S. J., & Warrington, E. K. (2005). Abstract and concrete concepts have structurally
different representational frameworks. Brain, 128, 615–627.
De Deyne, S., & Storms, G. (2008). Word associations: Network and semantic properties.
Behavior Research Methods, 40, 213–231.
De Deyne, S., Navarro, D. J., Perfors, A., Storms, G. (2016). Structure at every scale:
A semantic network account of the similarities between unrelated concepts. Journal of
Experimental Psychology: General, 145(9), 1228–1254.
De Deyne, S., Navarro, D. J., & Storms, G. (2013). Better explanations of lexical and semantic
cognition using networks derived from continued rather than single word associations.
Behavior Research Methods, 45, 480–498.
De Deyne, S., Verheyen, S., Ameel, E., Vanpaemel, W., Dry, M., Voorspoels, W., & Storms,
G. (2008). Exemplar by feature applicability matrices and other Dutch normative data for
semantic concepts. Behavior Research Methods, 40, 1030–1048.
De Deyne, S., Verheyen, S., & Storms, G. (2015). Structure and organization of the mental
lexicon: A network approach derived from syntactic dependency relations and word
associations. In A. Mehler, A. Lucking, S. Banisch, P. Blanchard, & B. Job (Eds.), Towards
a theoretical framework for analyzing complex linguistic networks (pp. 47–82). Berlin/New York:
Springer.
De Deyne, S., Verheyen, S., & Storms, G. (2015). The role of corpus size and syntax in
deriving lexico-semantic representations for a wide range of concepts. Quarterly Journal of
Experimental Psychology, 68(8), 1643–1664.
Deese, J. (1965). The structure of associations in language and thought. Baltimore, MD: Johns
Hopkins Press.
de Groot, A. M. (1995). Determinants of bilingual lexicosemantic organisation. Computer
Assisted Language Learning, 8(2–3), 151–180.
Den Heyer, K., & Briand, K. (1986). Priming single digit numbers: Automatic spreading
activation dissipates as a function of semantic distance. The American Journal of Psychology,
99(3), 315–340.
Dry, M., & Storms, G. (2009). Similar but not the same: A comparison of the utility
of directly rated and feature-based similarity measures for generating spatial models of
conceptual data. Behavior Research Methods, 41, 889–900.
Elman, J. L. (2009). On the meaning of words and dinosaur bones: Lexical knowledge
without a lexicon. Cognitive Science, 33(4), 547–582.
Ervin, S. M. (1961). Changes with age in the verbal determinants of word-association. The
American Journal of Psychology, 74(3), 361–372.
Faust, M., & Kenett, Y. N. (2014). Rigidity, chaos and integration: Hemispheric interaction
and individual differences in metaphor comprehension. Frontiers in Human Neuroscience,
8(511), 1–10. doi: 10.3389/fnhum.2014.00511.
Fellbaum, C. (1998). WordNet: An electronic lexical database. Retrieved from
www.cogsci.princeton.edu/wn. Cambridge, MA: MIT Press.
Firth, J. R. (1968). Selected papers of J. R. Firth, 1952–59./edited by F. R. Palmer (Longmans’
Linguistics Library). Harlow: Longmans.
Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3), 75–174.
Galbraith, R. C., & Underwood, B. J. (1973). Perceived frequency of concrete and abstract
words. Memory & Cognition, 1, 56–60.
Goldstone, R. L. (1996). Isolated and interrelated concepts. Memory & Cognition, 24,
608–628.
Gouzoulis-Mayfrank, E., Voss, T., M’orth, D., Thelen, B., Spitzer, M., & Meincke, U.
(2003). Semantic hyperpriming in thought-disordered patients with schizophrenia: State
or trait?—a longitudinal investigation. Schizophrenia Research, 65(2–3), 65–73.
Lancichinetti, A., Radicchi, F., Ramasco, J. J., & Fortunato, S. (2011). Finding statistically
significant communities in networks. PLoS One, 6(4), e18961.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s Problem: The latent semantic
analysis theory of acquisition, induction and representation of knowledge. Psychological
Review, 104, 211–240.
Lerner, A. J., Ogrocki, P. K., & Thomas, P. J. (2009). Network graph analysis of category
fluency testing. Cognitive and Behavioral Neurology, 22(1), 45–52.
Lerner, I., Bentin, S., & Shriki, O. (2012). Spreading activation in an attractor network
with latching dynamics: Automatic semantic priming revisited. Cognitive Science, 36(8),
1339–1382.
Lin, E. L., & Murphy, G. L. (2001). Thematic relations in adults’ concepts. Journal of
Experimental Psychology: General, 130(1), 3–28.
Louwerse, M., & Connell, L. (2011). A taste of words: Linguistic context and perceptual
simulation predict the modality of words. Cognitive Science, 35, 381–398.
Lucas, M. (2000). Semantic priming without association: A meta-analytic review. Psychonomic
Bulletin & Review, 7(4), 618–630.
McClelland, J. L., & Rogers, T. T. (2003). The Parallel Distributed Processing approach to
semantic cognition. Nature Reviews Neuroscience, 4, 310–322.
McEvoy, C. L., Nelson, D. L., & Komatsu, T. (1999). What is the connection between
true and false memories? The differential roles of interitem associations in recall and
recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25(5),
1177.
McNamara, T. P. (1992). Theories of priming: i. associative distance and lag. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 18(6), 1173.
McRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature
production norms for a large set of living and nonliving things. Behavior Research Methods,
37, 547–559.
McRae, K., Khalkhali, S., & Hare, M. (2011). Semantic and associative relations: Examining
a tenuous dichotomy. In V. F. Reyna, S. Chapman, M. Dougherty, & J. Confrey (Eds.),
The adolescent brain: Learning, reasoning, and decision making (pp. 39–66). Washington, DC,
US: American Psychological Association.
Medin, D. L., Lynch, E. B., & Solomon, K. O. (2000). Are there kinds of concepts? Annual
Review of Psychology, 51(1), 121–147.
Mednick, S. (1962). The associative basis of the creative process. Psychological Review,
69(3), 220.
Mollin, S. (2009). Combining corpus linguistics and psychological data on word
co-occurrence: Corpus collocates versus word associations. Corpus Linguistics and Linguistic
Theory, 5, 175–200.
Monaco, J. D., Abbott, L. F., & Kahana, M. J. (2007). Lexico-semantic structure and the
word-frequency effect in recognition memory. Learning & Memory, 14, 204–213.
Morais, A. S., Olsson, H., & Schooler, L. J. (2013). Mapping the structure of semantic
memory. Cognitive Science, 37(1), 125–145.
Moreno, S., & Neville, J. (2013). Network hypothesis testing using mixed Kronecker product
graph models. In The IEEE International Conference on Data Mining series (ICDM) (pp.
1163–1168). IEEE.
Mota, N. B., Vasconcelos, N. A., Lemos, N., Pieretti, A. C., Kinouchi, O., Cecchi, G. A.,
. . . Ribeiro, S. (2012). Speech graphs provide a quantitative measure of thought disorder
in psychosis. PLoS One, 7(4), e34928.
Nelson, D. L., & McEvoy, C. L. (2000). What is this thing called frequency? Memory &
Cognition, 28, 509–522.
Nelson, D. L., McEvoy, C. L., & Bajo, M. T. (1988). Lexical and semantic search in cued
recall, fragment completion, perceptual identification, and recognition. American Journal of
Psychology, 101(4), 465–480.
Nelson, D. L., McEvoy, C. L., & Dennis, S. (2000). What is free association and what does it
measure? Memory & Cognition, 28, 887–899.
Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004). The University of South Florida
free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments,
and Computers, 36, 402–407.
Nelson, D. L., McKinney, V. M., Gee, N. R., & Janczura, G. A. (1998). Interpreting the
influence of implicitly activated memories on recall and recognition. Psychological Review,
105, 299–324.
Nelson, D. L., Zhang, N., & McKinney, V. M. (2001). The ties that bind what is known to
the recognition of what is new. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 27, 1147–1159.
Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank citation ranking:
Bringing order to the web. Stanford InfoLab, Stanford, USA.
Pexman, P. M., Holyk, G. G., & Monfils, M. (2003). Number-of-features effects in semantic
processing. Memory & Cognition, 31, 842–855.
Plaut, D. C., McClelland, J. L., Seidenberg, M. S., & Patterson, K. (1996). Understanding
normal and impaired word reading: Computational principles in quasi-regular domains.
Psychological Review, 103, 56–115.
Pomarol-Clotet, E., Oh, T. M. S. S., Laws, K. R., & McKenna, P. J. (2008). Semantic
priming in schizophrenia: Systematic review and meta-analysis. The British Journal of
Psychiatry: The Journal of Mental Science, 192(2), 92–97.
Prior, A., & Bentin, S. (2003). Incidental formation of episodic associations: The importance
of sentential context. Memory & Cognition, 31, 306–316.
Radvansky, G. A. (1999). The fan effect: A tale of two theories. Journal of Experimental
Psychology: General, 128(2), 198–206.
Recchia, G., & Jones, M. N. (2009). More data trumps smarter algorithms: Comparing
pointwise mutual information with latent semantic analysis. Behavior Research Methods, 41,
647–656.
Recchia, G., & Jones, M. N. (2012). The semantic richness of abstract concepts. Frontiers in
Human Neuroscience, 6, article 315.
Recchia, G., Sahlgren, M., Kanerva, P., Jones, M. N., & Jones, M. (2015). Encoding
sequential information in semantic space models: Comparing holographic reduced
representation and random permutation. Computational Intelligence and Neuroscience, 2015,
1–18.
Roediger, H. L., Balota, D. A., & Watson, J. M. (2001). Spreading activation and arousal of
false memories. In H. L. Roediger III, J. S. Nairne, I. Neath, and A. M. Surprenant (Eds)
The nature of remembering: Essays in honor of Robert G. Crowder (pp. 95–115). Washington,
DC: American Psychological Association.
Rosch, E., Mervis, C., Grey, W., Johnson, D., & Boyes-Braem, P. (1976). Basic objects in
natural categories. Cognitive Psychology, 8, 382–439.
Rossmann, E., & Fink, A. (2010). Do creative people use shorter association pathways?
Personality and Individual Differences, 49, 891–895.
Runco, M. A., & Jaeger, G. J. (2012). The standard definition of creativity. Creativity Research
Journal, 24(1), 92–96.
Schilling, M. A. (2005). A “small-world” network model of cognitive insight. Creativity
Research Journal, 17(2–3), 131–154.
Schwanenflugel, P. J., & Shoben, E. J. (1983). Differential context effects in the
comprehension of abstract and concrete verbal materials. Journal of Experimental Psychology:
Learning, Memory, and Cognition, 9, 82–102.
Smith, K. A., Huber, D. E., & Vul, E. (2013). Multiply-constrained semantic search in the
remote associates test. Cognition, 128(1), 64–75.
Solé, R. V., Corominas-Murtra, B., Valverde, S., & Steels, L. (2010). Language networks:
Their structure, function, and evolution. Complexity, 15(6), 20–26.
Spitzer, M. (1997). A cognitive neuroscience view of schizophrenic thought disorder.
Schizophrenia Bulletin, 23(1), 29–50.
Stam, C. J. (2014). Modern network science of neurological disorders. Nature Reviews
Neuroscience, 15(10), 683–695.
Steyvers, M., & Tenenbaum, J. B. (2005). The large-scale structure of semantic networks:
Statistical analyses and a model of semantic growth. Cognitive Science, 29, 41–78.
Szalay, L. B., & Deese, J. (1978). Subjective meaning and culture: An assessment through word
associations. Hillsdale, NJ: Lawrence Erlbaum.
Thompson, G. W., & Kello, C. T. (2014). Walking across wikipedia: A scale-free network
model of semantic memory retrieval. Frontiers in Psychology, 5, 1–9.
Thompson-Schill, S. L., Kurtz, K. J., & Gabrieli, J. D. E. (1998). Effects of semantic
and associative relatedness on automatic priming. Journal of Memory and Language, 38(4),
440–458.
Van Dongen, S. (2000). Graph clustering by flow simulation. Doctoral dissertation, University
of Utrecht.
Verheyen, S., Stukken, L., De Deyne, S., Dry, M. J., & Storms, G. (2011). The generalized
polymorphous concept account of graded structure in abstract categories. Memory &
Cognition, 39, 1117–1132.
Voorspoels, W., Storms, G., Longenecker, J., Verheyen, S., Weinberger, D. R., & Elvevåg,
B. (2014). Deriving semantic structure from category fluency: Clustering techniques and
their pitfalls. Cortex, 55, 130–147.
Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of ‘small-world’ networks. Nature,
393, 440–442. doi:10.1038/30918.
Waxman, S., & Gelman, R. (1986). Preschoolers’ use of superordinate relations in
classification and language. Cognitive Development, 1(2), 139–156.
Wiemer-Hastings, K., & Xu, X. (2005). Content differences for abstract and concrete
concepts. Cognitive Science, 29, 719–736.
Zeev-Wolf, M., Goldstein, A., Levkovitz, Y., & Faust, M. (2014). Fine-coarse
semantic processing in schizophrenia: A reversed pattern of hemispheric dominance.
Neuropsychologia, 56, 119–128.
Zortea, M., Menegola, B., Villavicencio, A., & Salles, J. F. d. (2014). Graph analysis of semantic word association among children, adults, and the elderly. Psicologia: Reflexão e Crítica, 27(1), 90–99.
9
INDIVIDUAL DIFFERENCES IN
SEMANTIC PRIMING PERFORMANCE
Insights from the Semantic Priming Project
Melvin J. Yap,
Keith A. Hutchison,
and Luuan Chin Tan
Abstract
The semantic/associative priming effect refers to the finding of faster recognition times for words preceded by related primes (e.g. cat—DOG), compared to words preceded by unrelated primes (e.g. hat—DOG). Over the past three decades, a voluminous literature has explored
the influence of semantic primes on word recognition, and this work has been critical in
shaping our understanding of lexical processing, semantic representations, and automatic
versus attentional influences. That said, the bulk of the empirical work in the semantic
priming literature has focused on group-level performance that averages across participants,
despite compelling evidence that individual differences in reading skill and attentional
control can moderate semantic priming performance in systematic and interesting ways.
The present study takes advantage of the power of the semantic priming project (SPP;
Hutchison et al., 2013) to answer two broad, related questions. First, how stable are semantic
priming effects, as reflected by within-session reliability (assessed by split-half correlations)
and between-session reliability (assessed by test–retest correlations)? Second, assuming that
priming effects are reliable, how do they interact with theoretically important constructs such
as reading ability and attentional control? Our analyses replicate and extend earlier work by
Stolz, Besner, and Carr (2005) by demonstrating that the reliability of semantic priming
effects strongly depends on prime-target association strength, and reveal that individuals with greater attentional control and reading ability show stronger priming.
sessions? Should it turn out that semantic priming is inherently unreliable (e.g.
Stolz, Besner, & Carr, 2005), this will severely limit the degree to which
the magnitude of an individual’s semantic priming effect can be expected to
correlate with other measures of interest (Lowe & Rabbitt, 1998). Related to
this, an unreliable dependent measure makes it harder for researchers to detect
between-group differences on this measure (Waechter, Stolz, & Besner, 2010).
The present study capitalizes on the power of the SPP (Hutchison et al., 2013)
to address the above-mentioned questions. The SPP is a freely accessible online
repository (http://spp.montana.edu) containing lexical and associative/semantic
characteristics for 1,661 words, along with lexical decision (i.e. classify letter strings
as words or non-words, e.g. flirp) and speeded pronunciation (i.e. read words aloud) behavioral data from 768 participants recruited at four universities (512 in lexical decision and 256 in speeded pronunciation). Data were collected over two sessions,
separated by no more than one week. Importantly, Hutchison et al. (2013) also
assessed participants on their attentional control (Hutchison, 2007), vocabulary
knowledge, and reading comprehension. It is noteworthy that the SPP contains
data for over 800,000 lexical decision trials and over 400,000 pronunciation trials,
collected from a large and diverse sample of participants, making this a uniquely
valuable resource for studying individual differences in primed lexical processing.
The SPP exemplifies the megastudy approach to studying lexical processing,
in which researchers address a variety of questions using databases that contain
behavioral data and lexical characteristics for very large sets of words (see Balota,
Yap, Hutchison, & Cortese, 2012, for a review). Megastudies allow the language
to define the stimuli, rather than compelling experimenters to select stimuli
based on a limited set of criteria. They now serve as an important complement
to traditional factorial designs, which are associated with selection artifacts, list
context effects, and limited generalizability (Balota et al., 2012; Hutchison
et al., 2013). For example, in the domain of semantic priming, it is often
methodologically challenging to examine differences in priming as a function of
some other categorical variable (e.g. target word frequency), due to the difficulty
of matching experimental conditions on the many dimensions known to influence
word recognition. Using megastudy data, regression analyses can be conducted
to examine the effects of item characteristics on priming, with other correlated
variables statistically controlled for (e.g. Hutchison, Balota, Cortese, & Watson,
2008).
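To make the logic of such item-level regression analyses concrete, the following sketch shows how a priming effect could be regressed on an item characteristic of interest while statistically controlling for correlated lexical variables. The file name and the column names (priming_ms, log_freq, word_length, assoc_strength) are hypothetical placeholders rather than the fields of any actual dataset.

```python
# Sketch of an item-level regression in the spirit of the megastudy approach:
# predict each target's priming effect from item characteristics while
# statistically controlling for correlated lexical variables. The file and
# column names are hypothetical placeholders, not fields from a real dataset.
import pandas as pd
import statsmodels.formula.api as smf

items = pd.read_csv("item_level_priming.csv")    # hypothetical file name

model = smf.ols("priming_ms ~ log_freq + word_length + assoc_strength",
                data=items).fit()
print(model.summary())   # each coefficient reflects a predictor's unique effect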
We will be using the SPP to address two broad and related questions. First,
how stable are semantic priming effects, as reflected by within-session reliability
(assessed by split-half correlations) and between-session reliability (assessed by
test–retest correlations)? Second, assuming that priming effects are stable, how
are they moderated by theoretically important constructs such as reading ability
and attentional control? While these questions are not precisely novel, they have
received relatively little attention in the literature. To our knowledge, the present
study is the first attempt to answer these intertwined questions in a unified and
comprehensive manner, using an unusually large and well-characterized set of
items and participants. We will first provide a selective review of studies that
have explored individual differences in semantic priming, before turning to an
important study by Stolz et al. (2005), who were the first to explore the reliability
of semantic priming effects in word recognition.
in priming (see Hutchison, 2007, for a review); this is known as the relatedness
proportion effect. By definition, increasing relatedness proportion increases the
validity of the prime, leading participants to rely more heavily on expectancy-based
processes, which in turn increases priming. Hutchison (2007) demonstrated that
participants with greater attentional control (AC), as reflected by performance on a battery of attentionally demanding tasks, produced a larger relatedness proportion effect, suggesting that high-AC participants were sensitive to contextual variation in relatedness proportion and adaptively increased their reliance
on expectancy generation processes as prime validity increased. Related to this,
there is also recent evidence, based on the use of symmetrically associated (e.g.
sister—BROTHER), forward associated (i.e. prime to target; e.g. atom—BOMB),
and backward associated (i.e. target to prime; e.g. fire—BLAZE) prime-target pairs
that high-AC individuals are more likely to prospectively generate expected targets
and hold them in working memory, whereas low-AC individuals are more likely
to engage retrospective semantic matching processes (Hutchison, Heap, Neely, &
Thomas, 2014).
In summary, the individual differences literature is consistent with the idea
that as readers accrue more experience with words, they rely more on automatic
lexical processing mechanisms in both isolated (Yap et al., 2012) and primed
(Yap et al., 2009) word recognition. This is consistent with the lexical quality
hypothesis (Perfetti & Hart, 2001), which says that highly skilled readers are
associated with high-quality lexical representations which are both fully specified
and redundant. For these skilled readers, the process of identifying a word involves
the precise activation of the corresponding underlying lexical representation, with
minimal activation of orthographically similar words (Andrews & Hersch, 2010).
Furthermore, such readers are less dependent on the strategic use of context (e.g.
prime information) to facilitate lexical retrieval (Yap et al., 2009). In addition to
high-quality representations, lexical processing is also modulated by the extent to
which readers can exert attentional control; high-AC readers are able to flexibly
adjust their reliance on priming mechanisms so as to maximize performance on a
given task (see Balota & Yap, 2006).
in semantic priming. Stolz et al. (2005) were the first to report the unexpectedly
low reliabilities associated with the semantic priming effect. In Stolz et al. (2005),
the number of participants in each experiment ranged between 48 and 96,
priming effects were based on 25 related and 25 unrelated trials, and the two test
sessions were operationalized by presenting two blocks of trials within the same
experiment. Given the theoretical and applied importance of Stolz et al.’s (2005)
findings, it is worthwhile exploring if they generalize to the SPP sample, which
contains a much larger number of participants and items. Specifically, in the SPP,
512 participants contributed to the lexical decision data, priming effects are based
on at least 100 related and 100 unrelated trials, and the two test sessions were held
on different days, separated by no more than one week. The SPP was also designed
to study priming under different levels of SOA (200 ms versus 1200 ms) and prime
type (first associate versus other associate), allowing us to assess reliability across
varied priming contexts.
Of course, for our purposes, reliability is largely a means to an end. The other
major aspect of the present work concerns the extent to which a participant’s
semantic priming effect is predicted by theoretically important measures of
individual differences, including vocabulary knowledge, reading comprehension,
and attentional control. Although there has been some work relating priming to
vocabulary knowledge (Yap et al., 2009) and to attentional control (Hutchison,
2007), there has not been, to our knowledge, a systematic exploration of the
relationship between semantic priming effects and a comprehensive array of
individual differences measures. The present study will be the first to address
that gap, by considering how priming effects that are assessed to be reliable are
moderated by a host of theoretically important variables. Collectively, the results
of these analyses will help shed more light on the relationships between the quality
of underlying lexical representations, attentional control, and priming phenomena.
Our findings could also be potentially informative for more foundational questions
pertaining to how changes in reading ability are associated with developments in
the semantic system (e.g. Nation & Snowling, 1999).
Method
Dataset
All analyses reported in this chapter are based on archival trial-level lexical
decision data from the semantic priming project (see Hutchison et al., 2013,
for a full description of the dataset). The 512 participants were native English
speakers recruited from four institutions (both private and public) located across
the midwest, northeast, and northwest regions of the United States. Data were
collected over two sessions on different days, separated by no more than one week.
Across both sessions, participants received a total of 1,661 lexical decision trials
(half words and half non-words), with word prime–word target pairs selected from
the Nelson, McEvoy, and Schreiber (2004) free association norms; the relatedness
proportion was fixed at 0.50. Non-words were created from word targets by
changing one or two letters of each word to form pronounceable non-words
that did not sound like real words. For each participant, each session comprised
two blocks (a 200 ms and a 1200 ms SOA block), and within each block, half the
related prime-target pairs featured first associates (i.e. the target is the most common
response to a cue/prime word, e.g. choose—PICK) while the other half featured
other associates (i.e. any response other than the most common response to a cue,
e.g. preference—PICK).
In addition to the lexical decision task, participants completed the vocabulary and passage comprehension subtests of the Woodcock–Johnson III diagnostic reading battery (Woodcock, McGrew, & Mather, 2001) and Hutchison’s (2007) attentional control battery. The vocabulary measures comprise a synonym, an antonym, and an analogy test; for reading comprehension, participants
have to read a short passage and identify a missing keyword that makes sense
in the context of that passage. The attentional control battery consists of an automated operation span task (Unsworth, Heitz, Schrock, & Engle, 2005), a Stroop task, and an antisaccade task (Payne, 2005). In the operation span task, participants have
to learn and correctly recall letter sequences while solving arithmetic problems. In
the Stroop task, participants are presented with incongruent (e.g. red printed
in green), congruent (e.g. green printed in green), and neutral (e.g. deep printed
in green) words and are required to name the ink color of the word as quickly
and accurately as possible; the dependent variable is the difference in the mean
RT or error rate between the congruent and incongruent conditions. In the
antisaccade task, participants are instructed to look away from a flashed star (*)
in order to identify a target (O or Q) that is briefly presented on the other
side of the screen; the dependent variable is the target identification accuracy
rate.
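As an illustration of how the attentional control measures described above could be scored, the sketch below computes a participant-level Stroop effect (the incongruent minus congruent difference) for both RTs and error rates and averages the two standardized effects into a single score, in the spirit of note 2. The file name and the columns (subject, condition, rt, error) are assumptions made for the example, not the actual SPP variables.

```python
# Illustrative scoring of the Stroop measure: incongruent minus congruent
# differences in mean RT and in error rate per participant, with the two
# standardized effects averaged into one composite score (see note 2).
# The file name and column names are assumptions.
import pandas as pd

trials = pd.read_csv("stroop_trials.csv")        # hypothetical file name
means = (trials.groupby(["subject", "condition"])[["rt", "error"]]
               .mean()
               .unstack("condition"))

rt_effect = means[("rt", "incongruent")] - means[("rt", "congruent")]
err_effect = means[("error", "incongruent")] - means[("error", "congruent")]

def zscore(s):
    return (s - s.mean()) / s.std()

# Composite Stroop score: average of the standardized RT and error effects.
stroop_score = (zscore(rt_effect) + zscore(err_effect)) / 2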
Results
We first excluded 14 participants whose datasets were incomplete (i.e. their data
contained fewer than 1,661 trials); this left 498 participants. We then excluded
incorrect trials and trials with response latencies faster than 200 ms or slower than
3000 ms. For the remaining correct trials, RTs more than 2.5 SDs away from each
participant’s mean were also treated as outliers. For the RT analyses, data trimming removed 7.3 percent of trials (4.6 percent errors; 2.7 percent RT outliers). For all analyses,
we used z-score transformed RTs, which serve to control for individual differences
in processing speed (Faust et al., 1999) and to eliminate much of the variability in
priming for items (Hutchison et al., 2008). Z-scores were computed separately for
each participant.
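A minimal sketch of this screening and transformation pipeline is given below, assuming a trial-level data frame with hypothetical columns (subject, rt, accuracy); it is intended only to make the steps explicit, not to reproduce the actual SPP processing code.

```python
# A minimal sketch of the screening steps described above, assuming a
# trial-level data frame with hypothetical columns (subject, rt, accuracy).
import pandas as pd

trials = pd.read_csv("spp_ldt_trials.csv")       # hypothetical file name

# Keep correct trials with latencies between 200 and 3000 ms.
correct = trials[(trials["accuracy"] == 1) &
                 (trials["rt"] >= 200) & (trials["rt"] <= 3000)].copy()

def trim_and_z(group):
    # Drop RTs more than 2.5 SDs from the participant's mean, then z-score
    # the remaining RTs within that participant.
    m, sd = group["rt"].mean(), group["rt"].std()
    kept = group[(group["rt"] - m).abs() <= 2.5 * sd].copy()
    kept["z_rt"] = (kept["rt"] - kept["rt"].mean()) / kept["rt"].std()
    return kept

clean = correct.groupby("subject", group_keys=False).apply(trim_and_z)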
the 200 ms SOA (M = 0.11) than at the 1200 ms SOA (M = 0.09). The
greater priming for first-associate trials is not surprising. However, priming effects
are generally larger (not smaller) at longer SOAs (Neely, 1977; but see Neely,
O’Connor, & Calabrese, 2010). We will comment on this intriguing pattern in
the Discussion.
It is also worth noting that the priming effects in the SPP are somewhat smaller
than what one would expect, using studies such as Hutchison et al. (2008) as a
reference point. As Hutchison et al. (2013) have already acknowledged, it is not
entirely clear why this difference exists, but they suggest that this may be due to the
fact that related trials in semantic priming experiments (e.g. Hutchison et al., 2008) typically feature very strong associates, whereas the SPP stimuli are
far more diverse with respect to semantic and associative relations.
Turning to the reliability analyses, Table 9.2 presents the Pearson correlations between odd- and even-numbered trials (split-half reliability), and between Session 1 and Session 2 trials (test–retest reliability), for participant-level priming effects. Like
Stolz et al. (2005), we are examining correlations between responses to distinct
sets of prime-target pairs, but the counterbalancing procedure ensures that the
descriptive statistics of different variables are relatively similar across different
sub-lists. For lexical decision, with respect to within-session reliability (reflected
by split-half reliability), we observed moderate correlations (rs from 0.21 to 0.27)
for first-associate trials, and very low correlations (rs from 0.07 to 0.08) for
other-associate trials. Turning to between-session reliability (reflected by test–retest
reliability), correlations were moderate (rs from 0.25 to 0.31) for first-associate
trials, and very low (rs from 0.07 to 0.11) for other-associate trials.
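The reliability computations can be summarized as follows: each participant’s priming effect (the unrelated minus related difference in mean z-scored RT) is computed separately for odd- and even-numbered trials, or for the two sessions, and the resulting values are correlated across participants. The sketch below illustrates this logic under assumed column names and codings (relatedness coded "related"/"unrelated", sessions coded 1 and 2); it is not the analysis script used here.

```python
# Sketch of the split-half and test-retest reliability computations, under
# assumed column names and codings; not the SPP's actual field names.
import pandas as pd
from scipy.stats import pearsonr

def priming(df):
    # Unrelated minus related difference in mean z-scored RT.
    m = df.groupby("relatedness")["z_rt"].mean()
    return m["unrelated"] - m["related"]

def priming_by(trials, split_col):
    # One priming effect per participant within each level of split_col.
    return (trials.groupby(["subject", split_col])
                  .apply(priming).unstack(split_col).dropna())

trials = pd.read_csv("spp_clean_trials.csv")     # hypothetical file name
trials["half"] = trials["trial"] % 2             # odd- vs. even-numbered trials

split_half = priming_by(trials, "half")
print(pearsonr(split_half[0], split_half[1]))    # within-session reliability

test_retest = priming_by(trials, "session")
print(pearsonr(test_retest[1], test_retest[2]))  # between-session reliability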
Clearly, the reliabilities of semantic priming for first-associate trials are not
only statistically significant, but are consistently higher than for counterpart
other-associate trials (all ps < 0.05), whose reliabilities approach non-significance.1
It is also reassuring that our estimates (for first-associate trials) fall broadly within
the range of test–retest reliabilities reported by Stolz et al. (2005) for their
conditions where the relatedness proportion was 0.50. Specifically, for the SOAs
of 200 ms, 350 ms, and 800 ms, they found test–retest correlations of 0.30, 0.43,
and 0.27, respectively.
[Figure: scatterplots of z-score priming against reading comprehension (panels: Adj R-sq = 0.01, p = 0.02; Adj R-sq = 0.03, p < 0.001) and vocabulary knowledge (panels: Adj R-sq = 0.00, ns; Adj R-sq = 0.01, p = 0.009).]
(for both short and long SOA) and attentional control measures (antisaccade,
Stroop,2 operation span), and between priming and reading ability (reading
comprehension and vocabulary knowledge).3
In order to address the possibility that reading comprehension differences in
priming are spuriously driven by differences in antisaccade performance (or vice
versa), we also computed partial correlations. For short SOA priming, antisaccade
Discussion
In the present study, by analyzing the large-scale behavioral data from the SPP
(Hutchison et al., 2013), we examined whether semantic priming was reliable, and
if so, how semantic priming was moderated by individual differences. There are a
couple of noteworthy observations. First, we extend previous work on reliability
(Hutchison et al., 2008; Stolz et al., 2005) by demonstrating that the reliability of
semantic priming effects depends on the association strength between primes and
targets. Second, we considered the impact of a broad array of individual difference
measures on semantic priming, and our analyses reveal that participants with greater attentional control and reading ability show stronger priming.
This account helps provide a unified explanation for Stolz et al.’s (2005) results.
That is, priming is unreliable at all SOAs when the relatedness proportion is 0.25
because there is insufficient incentive (only a one in four chance the target is
related to the prime) for the participant to generate potential targets or for the
prime episode to be retrieved. Increasing the relatedness proportion to 0.50 drives
up the likelihood of episodic prime retrieval (for short SOAs) and expectancy
generation (for longer SOAs), which then yields reliable priming effects. However,
one might then wonder why priming was reliable only at the longest SOA of
800 ms when the relatedness proportion was 0.75. Stolz et al. (2005) suggested that
at very high relatedness proportions, participants begin to notice the prime-target
relation and attempt to generate expectancies in an intentional manner. At shorter
SOAs (200 ms and 350 ms), expectancy-based processes are likely to fail, because
there is insufficient time for intentional generation and successful application of
expectations. These failed attempts disrupt and overshadow the impact of prime
retrieval, hence attenuating priming reliability.
We think our results can be nicely accommodated by a similar perspective. In
the SPP, a relatedness proportion of 0.50 was consistently used, which facilitates the
recruitment of the prime episode. However, why was reliability so much higher
for first-associate, compared to other-associate, prime-target pairs? As mentioned
earlier, prime recruitment processes are positively correlated with prime utility.
That is, there is a lower probability of prime recruitment under experimental
contexts where the prime is less useful, such as when the relatedness proportion
is low (Bodner & Masson, 2003), when targets are clearly presented (Balota, Yap,
Cortese, & Watson, 2008), and as the present results suggest, when primes are
weakly associated with their targets. In general, our results provide converging
support for Stolz et al.’s (2005) assertion that an increase in prime episode retrieval
yields greater reliability in priming.
However, while significant test–retest and split-half correlations at the very short
SOA of 200 ms (see Table 9.2) may suggest that reliability under these conditions
is not mediated by expectancy, Hutchison (2007) has cautioned against relying on
a rigid cutoff for conscious strategies such as expectancy generation. Indeed, it
is more plausible that expectancy-based processes vary across items, participants,
and practice. Consistent with this, there is mounting evidence for expectancy
even at short SOAs for strong associates and high-AC individuals (e.g. Hutchison,
2007). We tested this by using a median split to categorize participants as low-AC
or high-AC, based on their performance on the antisaccade task; reliabilities of
short SOA, first-associate trials were then separately computed for the two groups.
Interestingly, reliabilities were numerically higher for the high-AC group (split-half
r = 0.327, p < 0.001; test–retest r = 0.388, p < 0.001) than for the low-AC group
(split-half r = 0.215, p < 0.001; test–retest r = 0.234, p < 0.001); while the group
difference was not significant for split-half reliability, it approached significance
(p = 0.06) for test–retest reliability. Hence, it is possible that for high-AC
Notes
1 Given the weaker priming observed for other-associate, compared to
first-associate, trials, one might wonder if reliability is lower in this condition
because of decreased variability in priming across participants. As suggested by
a reviewer, this possibility is ruled out by our data (see Table 9.1), which reveal
comparable variability in priming for first and other-associate trials.
2 Stroop performance was computed by averaging standardized Stroop effects in
RTs and accuracy rates.
References
Anderson, J. R., & Milson, R. (1989). Human memory: An adaptive perspective.
Psychological Review, 96, 703–719.
Andrews, S. (2012). Individual differences in skilled visual word recognition and reading:
The role of lexical quality. In J. S. Adelman (Ed.), Visual word recognition volume 2: Meaning
and context, individuals, and development (pp. 151–172). Hove, UK: Psychology Press.
Andrews, S., & Hersch, J. (2010). Lexical precision in skilled readers: Individual differences
in masked neighbor priming. Journal of Experimental Psychology: General, 139, 299–318.
Balota, D. A., & Yap, M. J. (2006). Attentional control and flexible lexical processing:
Explorations of the magic moment of word recognition. In S. Andrews (Ed.), From
inkmarks to ideas: Current issues in lexical processing (pp. 229–258). New York: Psychology
Press.
Balota, D. A., Yap, M. J., Cortese, M. J., & Watson, J. M. (2008). Beyond mean response
latency: Response time distributional analyses of semantic priming. Journal of Memory and
Language, 59, 495–523.
Balota, D. A., Yap, M. J., Cortese, M. J., Hutchison, K. A., Kessler, B., Loftis, B.,
. . . Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39,
445–459.
Balota, D. A., Yap, M. J., Hutchison, K.A., & Cortese, M. J. (2012). Megastudies: What
do millions (or so) of trials tell us about lexical processing? In James S. Adelman (Ed.),
Visual word recognition Volume 1: Models and methods, orthography and phonology (pp. 90–115).
Hove, UK: Psychology Press.
Becker, C. A. (1980). Semantic context effects in visual word recognition: An analysis of
semantic strategies. Memory & Cognition, 8, 493–512.
Bodner, G. E., & Masson, M. E. J. (1997). Masked repetition priming of words and
non-words: Evidence for a nonlexical basis for priming. Journal of Memory and Language,
37, 268–293.
Bodner, G. E., & Masson, M. E. J. (2003). Beyond spreading activation: An influence of
relatedness proportion on masked semantic priming. Psychonomic Bulletin and Review, 10,
645–652.
Borgmann, K. W. U., Risko, E. F., Stolz, J. A., & Besner, D. A. (2007). Simon says:
Reliability and the role of working memory and attentional control in the Simon Task.
Psychonomic Bulletin and Review, 14, 313–319.
Butler, B., & Hains, S. (1979). Individual differences in word recognition latency. Memory
and Cognition, 7, 68–76.
Chateau, D., & Jared, D. (2000). Exposure to print and word recognition processes. Memory
and Cognition, 28, 143–153.
Faust, M. E., Balota, D. A., Spieler, D. H., & Ferraro, F. R. (1999). Individual differences in
information processing rate and amount: Implications for group differences in response
latency. Psychological Bulletin, 125, 777–799.
Forster, K. (1998). The pros and cons of masked priming. Journal of Psycholinguistic Research,
27, 203–233.
Goodwin, C. J. (2009). Research in psychology: Methods and design (6th edn.). Hoboken, NJ:
Wiley.
Heyman, T., Van Rensbergen, B. V., Storms, G., Hutchison, K. A., & De Deyne, S. (2015).
The influence of working memory load on semantic priming. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 41, 911–920.
Hutchison, K. A. (2007). Attentional control and the relatedness proportion effect in
semantic priming. Journal of Experimental Psychology: Learning, Memory, and Cognition,
33, 645–662.
Hutchison, K. A., Balota, D. A., Cortese, M., & Watson, J. M. (2008). Predicting semantic
priming at the item-level. The Quarterly Journal of Experimental Psychology, 61, 1036–1066.
Hutchison, K. A., Balota, D. A., Neely, J. H., Cortese, M. J., Cohen-Shikora, E. R., Tse,
C-S., . . . Buchanan, E. (2013). The Semantic Priming Project. Behavior Research Methods,
45, 1099–1114.
Hutchison, K. A., Heap, S. J., Neely, J. H., & Thomas, M. A. (2014). Attentional control
and asymmetric associative priming. Journal of Experimental Psychology: Learning, Memory,
and Cognition, 40, 844–856.
Kinoshita, S., Forster, K. I., & Mozer, M. C. (2008). Unconscious cognition isn’t that smart:
Modulation of masked repetition priming effect in the word naming task. Cognition, 107,
623–649.
Kinoshita, S., Mozer, M. C., & Forster, K. I. (2011). Dynamic adaptation to history of
trial difficulty explains the effect of congruency proportion on masked priming. Journal
of Experimental Psychology: General, 140, 622–636.
Lewellen, M. J., Goldinger, S. D., Pisoni, D. B., & Greene, B. G. (1993). Lexical familiarity
and processing efficiency: Individual differences in naming, lexical decision, and semantic
categorization. Journal of Experimental Psychology: General, 122, 316–330.
Lowe, C., & Rabbitt, P. (1998). Test/re-test reliability of the CANTAB and ISPOCD
neuropsychological batteries: Theoretical and practical issues. Neuropsychologia, 36,
915–923.
McNamara, T. P. (2005). Semantic priming: Perspectives from memory and word recognition. Hove,
UK: Psychology Press.
Matthews, G., & Harley, T. A. (1993). Effects of extraversion and self-report arousal on
semantic priming: A connectionist approach. Journal of Personality and Social Psychology,
65, 735–756.
Meyer, D. E., & Schvaneveldt, R. W. (1971). Facilitation in recognizing pairs of words:
Evidence of a dependence between retrieval operations. Journal of Experimental Psychology,
90, 227–234.
Morgan, C. J. A., Bedford, N. J., & Rossell, S. L. (2006). Evidence of semantic
disorganization using semantic priming in individuals with high schizotypy. Schizophrenia
Research, 84, 272–280.
Nakamura, E., Ohta, K., Okita, Y., Ozaki, J., & Matsushima, E. (2006). Increased inhibition
and decreased facilitation effect during a lexical decision task in children. Psychiatry and
Clinical Neurosciences, 60, 232–239.
Nation, K., & Snowling, M. J. (1999). Developmental differences in sensitivity to semantic
relations among good and poor comprehenders: Evidence from semantic priming.
Cognition, 70, B1–B13.
Neely, J. H. (1977). Semantic priming and retrieval from lexical memory: Roles of
inhibitionless spreading activation and limited-capacity attention. Journal of Experimental
Psychology: General, 106, 226–254.
Neely, J. H. (1991). Semantic priming effects in visual word recognition: A selective review
of current findings and theories. In D. Besner & G. Humphreys (Eds.), Basic processes in
reading: Visual word recognition (pp. 236–264). Hillsdale, NJ: Erlbaum.
Neely, J. H., Keefe, D. E., & Ross, K. L. (1989). Semantic priming in the lexical
decision task: Roles of prospective prime-generated expectancies and retrospective
semantic matching. Journal of Experimental Psychology: Learning, Memory, and Cognition,
15, 1003–1019.
Neely, J. H., O’Connor, P. A., & Calabrese, G. (2010). Fast trial pacing in a lexical decision
task reveals a decay of automatic semantic activation. Acta Psychologica, 133, 127–136.
Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004). The University of South Florida
free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments,
& Computers, 36, 402–407.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd edn.). New York:
McGraw-Hill.
Payne, B. K. (2005). Conceptualizing control in social cognition: How executive functioning
modulates the expression of automatic stereotyping. Journal of Personality and Social
Psychology, 89, 488–503.
Perfetti, C. A. (1992). The representation problem in reading acquisition. In P. B. Gough,
L. C. Ehri, & R. Treiman (Eds.), Reading acquisition (pp. 145–174). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Perfetti, C. A, & Hart, L. (2001). The lexical bases of comprehension skill. In D. S. Gorfein
(Ed.), On the consequences of meaning selection: Perspectives on resolving lexical ambiguity (pp.
67–86). Washington, DC: American Psychological Association.
Plaut, D. C., & Booth, J. R. (2000). Individual and developmental differences in semantic
priming: Empirical and computational support for a single-mechanism account of lexical
processing. Psychological Review, 107, 786–823.
Posner, M. I., & Snyder, C. R. R. (1975). Attention and cognitive control. In R. Solso
(Ed.), Information processing and cognition: The Loyola symposium (pp. 55–85). Hillsdale, NJ:
Erlbaum.
Shipstead, Z., Lindsey, D. R. B., Marshall, R. L., & Engle, R. L. (2014). The mechanisms of
working memory capacity: Primary memory, secondary memory, and attention control.
Journal of Memory and Language, 72, 116–141.
Stolz, J. A., Besner, D., & Carr, T. H. (2005). Implications of measures of reliability for
theories of priming: Activity in semantic memory is inherently noisy and uncoordinated.
Visual Cognition, 12, 284–336.
Thomas, M. A., Neely, J. H., & O’Connor, P. (2012). When word identification gets tough,
retrospective semantic processing comes to the rescue. Journal of Memory and Language, 66,
623–643.
Unsworth, N., Heitz, R. P., Schrock, J. C., & Engle, R. W. (2005). An automated version
of the operation span task. Behavior Research Methods, 37, 498–505.
Waechter, S., Stolz, J. A., & Besner, D. (2010). Visual word recognition: On the reliability
of repetition priming. Visual Cognition, 18, 537–558.
Whittlesea, B. W. A., & Jacoby, L. L. (1990). Interaction of prime repetition with visual
degradation: Is priming a retrieval phenomenon? Journal of Memory and Language, 29,
546–565.
Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock Johnson III tests of
cognitive abilities. Rolling Meadows, IL: Riverside Publishing.
Yap, M. J., Balota, D. A., Sibley, D. E., & Ratcliff, R. (2012). Individual differences in
visual word recognition: Insights from the English Lexicon Project. Journal of Experimental
Psychology: Human Perception and Performance, 38, 53–79.
Yap, M. J., Sibley, D. E., Balota, D. A., Ratcliff, R., & Rueckl, J. (2015). Responding to
non-words in the lexical decision task: Insights from the English Lexicon Project. Journal
of Experimental Psychology: Learning, Memory, and Cognition, 41, 597–613.
Yap, M. J., Tse, C.-S., & Balota, D. A. (2009). Individual differences in the joint effects of
semantic priming and word frequency: The role of lexical integrity. Journal of Memory and
Language, 61, 303–325.
10
SMALL WORLDS AND BIG DATA
Examining the Simplification Assumption in
Cognitive Modeling
Brendan Johns,
Douglas J. K. Mewhort,
and Michael N. Jones
Abstract
The simplification assumption of cognitive modeling proposes that to understand a given
cognitive system, one should focus on the key aspects of the system and allow other sources
of complexity to be treated as noise. The assumption grants much power to a modeller,
permits a clear and concise exposition of a model’s operation, and allows the modeller
to finesse the noisiness inherent in cognitive processes (e.g., McClelland, 2009; Shiffrin,
2010). The small-world (or “toy model”) approach allows a model to operate in a simple
and highly controlled artificial environment. By contrast, Big Data approaches to cognition
(e.g. Landauer & Dumais, 1997; Jones & Mewhort, 2007) propose that the structure of a
noisy environment dictates the operation of a cognitive system. The idea is that complexity
is power; hence, by ignoring complexity in the environment, important information about
the nature of cognition is lost. Using models of semantic memory as a guide, we examine
the plausibility, and the necessity, of the simplification assumption in light of Big Data
approaches to cognitive modeling.
Approaches to modeling semantic memory fall into two main classes: Those
that construct a model from a small and well-controlled artificial dataset (e.g.
Collins and Quillian, 1969; Elman, 1990; Rogers & McClelland, 2004) and
those that acquire semantic structure from a large text corpus of natural language
(e.g. Landauer & Dumais, 1997; Griffiths, Steyvers, & Tenenbaum, 2007; Jones
& Mewhort, 2007). We refer to the first class as the small-world approach1 and the second as the Big Data approach. The two approaches differ in the method by
which to study semantic memory and exemplify a fundamental point of theoretical
divergence in subfields across the cognitive sciences.
The small-world approach to semantic memory is appealing because it makes
ground-truth possible; all of the knowledge about a world is contained within the
FIGURE 10.1 The semantic network used in Collins & Quillian (1969) and Rogers & McClelland (2004).
training materials, and the true generating model is known. For example, Figure 10.1 reconstructs the taxonomic tree from Collins and Quillian (1969), which was also used to generate the training materials in Rogers and McClelland (2004). The reconstructed tree contains knowledge about a limited domain of living things, namely plants and animals. Sampling propositions from the tree (as Rogers & McClelland do) produces well-structured and accurate knowledge for a limited domain of items. Thus, even though the world is small in scale, the theorist knows its structure (e.g. that a canary is a bird) and hence is able to gauge how well the model acquires and processes information from the
training set.
The Big Data approach to the same problem is fundamentally different. Rather
than assuming that the world is structured in a single way, it proposes that the lexical
environment provides the ground-truth for the meaning of words, an approach
that also has a rich history (e.g. Wittgenstein, 1953). Although different learning
mechanisms have been proposed (see Jones, Willits, & Dennis, 2015 for a review),
they all rely upon co-occurrence patterns of words across large text corpora to infer
the meaning of a word. A model’s performance is assessed on a behavioral task (e.g.
a synonym test, ratings of semantic similarity, semantic priming, etc.) where the
model’s use of a word is compared against human performance. The model’s match
to the item-level behavior is used to justify the plausibility of the specific algorithm
used by the model.
selected at each training point, with selected propositions not removed from the search set after being chosen (that is, a proposition could be sampled multiple times across training). Each proposition was treated as a sentence, with both relations (e.g. isa, has, can, is) and features (fly, feathers, wings, swim, etc.) being represented as words with their own unique environmental vectors. Words were represented as the composite of the context and order vectors, similar to the approach taken to understand artificial grammar acquisition in Jamieson and Mewhort (2011). Similarity values taken from BEAGLE were averaged across 25 resamples of the environmental vectors, to ensure that the results were not due to the random initial configuration of similarity and that the emerging structure reflected the learning of propositional information.
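The construction of such a proposition corpus can be illustrated with a brief sketch: each proposition is written as a short sentence whose relation and feature terms are treated as ordinary words, and propositions are sampled with replacement across training. The mini-taxonomy below is an illustrative fragment rather than the full Collins and Quillian network.

```python
# Sketch of generating a proposition corpus: each proposition is a short
# "sentence" whose relation (isa, can, has) and feature terms are treated as
# words, and propositions are sampled with replacement across training.
# The mini-taxonomy here is an illustrative fragment, not the full network.
import random

propositions = [
    ("robin", "isa", "bird"), ("robin", "can", "fly"),
    ("canary", "isa", "bird"), ("canary", "can", "sing"),
    ("salmon", "isa", "fish"), ("salmon", "can", "swim"),
    ("oak", "isa", "tree"), ("oak", "has", "leaves"),
]

def sample_training_corpus(n_sentences, seed=None):
    rng = random.Random(seed)
    # Sampling with replacement: a proposition may recur across training.
    return [rng.choice(propositions) for _ in range(n_sentences)]

corpus = sample_training_corpus(75)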
To test how well the model learned the structure contained in the semantic networks, we computed a rank correlation between the vector cosine of each pair of word representations and the number of features those words share in the semantic network. For example, in the network displayed in Figure 10.1, robin and canary share ten features (isa bird, has wings, can fly, has feathers, isa animal, can move, has skin, can grow, is living, isa living_thing), while robin and oak share three (isa living_thing, can grow, is living). This provides a simple metric of learning: how closely the similarity values track the feature overlap values. We also use dendrograms generated from a hierarchical clustering algorithm to show that the model learns accurate hierarchical information, similar to how structure was illustrated by Rogers and McClelland (2004).
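The two evaluation measures can be illustrated with the following sketch, which computes the rank correlation between pairwise cosine similarities and feature-overlap counts and then builds a dendrogram from hierarchical clustering of the learned vectors. The vectors and overlap counts shown are small placeholders standing in for the trained model and the network.

```python
# Illustrative computation of the two measures: a rank correlation between
# pairwise cosine similarities and feature-overlap counts, and a dendrogram
# from hierarchical clustering of the learned vectors. The vectors and the
# overlap counts are placeholders, not the actual model output.
import numpy as np
from scipy.stats import spearmanr
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ["canary", "oak", "robin", "salmon"]}
feature_overlap = {("canary", "robin"): 10, ("canary", "salmon"): 6,
                   ("robin", "salmon"): 6, ("canary", "oak"): 3,
                   ("oak", "robin"): 3, ("oak", "salmon"): 3}

words = sorted(vectors)
pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]]

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

cosines = [cosine(vectors[a], vectors[b]) for a, b in pairs]
overlaps = [feature_overlap[pair] for pair in pairs]
rho, p = spearmanr(cosines, overlaps)    # rank correlation with feature overlap

# Hierarchical clustering of the learned similarity space (cf. Figures 10.3 and 10.4).
tree = linkage(pdist(np.vstack([vectors[w] for w in words]), metric="cosine"),
               method="average")
dendrogram(tree, labels=words)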
Figure 10.2 shows the increase in correlation between the BEAGLE similarity values and the amount of feature overlap of words in the two proposition corpora, as a function of the number of propositions studied for both the small and large networks. The figure shows a simple trend: The model learned the structure of the small world rapidly, even with minimal exposure to the training materials. For the small network, performance hit asymptote at 75 propositions, close to the size of the actual corpus. However, even at only 25 propositions the model was
capable of inferring a large amount of structure. As would be expected, it took
longer for the model to learn the structure of the large semantic network, with
the model hitting asymptote at around 150 propositions. The entire proposition
set is not needed to acquire accurate knowledge of the small world, due to the
highly structured nature of the learning materials. That is, the model was capable
of acquiring an accurate representation of the small world with only 75 percent of
the total number of propositions, in only a single training session.
To determine whether the model reproduced the hierarchical structure of the propositions correctly, Figure 10.3 shows the dendrogram of the hierarchical clustering of the similarity space for the small network, while Figure 10.4 contains the same information for the large network. The dendrograms were generated from a BEAGLE model that was trained on all of the propositions from the different networks, with no repeats. A dendrogram uses hierarchical clustering to determine the hierarchical relationships among items.
FIGURE 10.2 Increase in correlation between the cosine similarity of items and feature overlap values derived from the two semantic networks, as a function of the number of propositions studied.
FIGURE 10.3 Dendrogram of the hierarchical structure of BEAGLE trained on propositions derived from the small semantic network.
FIGURE 10.4 Dendrogram of the BEAGLE model trained on propositions derived from the large semantic network.
Figures 10.3 and 10.4 show that the
model learned the correct hierarchical generating structure of the environment, in
that it is identical to the structure displayed in the semantic network in Figure 10.1
(for the small network). This demonstrates that the BEAGLE model was able to
acquire hierarchical information (central to the proposals of the semantic cognition
theory), even with no explicit goal to do so, and with a very limited amount of
experience (as compared to the amount of training that a typical backpropagation
algorithm would require). These simulations demonstrate that a simple encoding
of the training materials provides enough power to learn both networks.
Discussion
Our goal in this section was twofold: first, to determine the power that a well-formed description of a small world provides to a learning mechanism (in the form of sets of propositions derived from semantic networks), and second, to assess how easily this information is learned with a simple co-occurrence
Model
The model used here is an approximation of BEAGLE that is based on sparse representations rather than Gaussian ones (Sahlgren, Holst, & Kanerva, 2008; Recchia et al., 2015). The advantage of the sparse-vectors approach is that it greatly reduces the computational complexity of the model and allows for a greater degree of scaling. Because a very large amount of text is used in this analysis, the computationally simpler model has obvious advantages. Only context vectors are used, rather than order vectors, in order to simplify the learning task: the model uses only sentential context to form semantic representations. As in past studies (e.g. Recchia et al., 2015), the vectors are large (5,000-dimensional) and the environmental vectors are very sparse (six non-zero values sampled at random), similar to binary spatter codes.
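A minimal sketch of such a sparse context-vector model is given below: each word receives a fixed 5,000-dimensional environmental vector with six non-zero entries, and a word's semantic representation accumulates the environmental vectors of the words it co-occurs with in each sentence. The parameter values follow the description above, but the code itself is illustrative rather than the authors' implementation.

```python
# Sketch of a sparse context-vector (random indexing) model in the spirit of
# the approximation described above; an illustration, not the authors' code.
import numpy as np
from collections import defaultdict

DIM, NONZERO = 5000, 6
rng = np.random.default_rng(1)

def environmental_vector():
    # Fixed sparse vector: six randomly placed +1/-1 entries.
    v = np.zeros(DIM)
    idx = rng.choice(DIM, size=NONZERO, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

env = defaultdict(environmental_vector)        # one environmental vector per word
memory = defaultdict(lambda: np.zeros(DIM))    # accumulated context vectors

def train(sentences):
    for sentence in sentences:
        for i, word in enumerate(sentence):
            # Add the environmental vectors of all co-occurring words.
            for j, other in enumerate(sentence):
                if i != j:
                    memory[word] += env[other]

train([["robin", "can", "fly"], ["salmon", "can", "swim"]])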
Training materials
The set of texts that will be used to train the model is drawn from five different
sources: (1) Wikipedia; (2) Amazon product descriptions (from McAuley &
Leskovec, 2013); (3) a set of 1,000 fiction books; (4) a set of 1,050 non-fiction
books; and (5) a set of 1,500 young adult books. All book text was scraped from
e-books, and all were published in the last 60 years by popular authors. To ensure
that each text source would contribute equally, each source was trimmed to a set
of six million sentences with random sampling, for a total of 30 million sentences
across all texts (approximately 400 million words). The data fitting method will
determine which set of texts is the most informative for generating the small
worlds.
Small Worlds
The small worlds to be approximated here are the same as those used in the previous section: the small and large semantic networks. The only difference was that the word sunfish was replaced with trout, due to its very low frequency across the different corpora. The first tests examined the similarity values (assessed with a vector cosine) between the word-level items in the hierarchy. These were compared, using a rank correlation, with the amount of feature overlap in the semantic network, providing an examination analogous to the small-world analysis described above.
In order to further test the knowledge acquisition process, we used an additional
test over both semantic networks. Specifically, we used an alternative forced choice
(AFC) task, where the model has to determine which semantic feature (e.g. plant
or animal) is more likely for a particular word (e.g. canary). This test was conducted
at each level of the hierarchy (e.g. the model was asked to discriminate plant/animal, then tree/flower/bird/fish/mammal, and so on until the bottom level was reached).
The test involves 52 questions for the small network and 140 questions for the large
network, derived from every level of the hierarchy for both of the networks. Not all
levels contained the same number of features, so the number of alternatives ranges
from two to ten. Obviously, discriminating among ten alternatives is a difficult task
for the model, but it does provide a strong test of the semantic representation that
the model is constructing. Performance was assessed by determining the proportion
of correct responses on the AFC test.
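The scoring of the AFC test can be sketched as follows: for each question, the model selects the alternative feature whose vector is most similar (by cosine) to the cued word, and accuracy is the proportion of correct selections. The example questions and the memory vectors in the sketch are illustrative stand-ins for the actual test items and trained representations.

```python
# Sketch of alternative forced choice (AFC) scoring: pick the alternative
# feature whose vector is most similar to the cued word, then compute the
# proportion of correct choices. Items and vectors are illustrative.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def afc_accuracy(questions, memory):
    correct = 0
    for word, alternatives, answer in questions:
        choice = max(alternatives, key=lambda f: cosine(memory[word], memory[f]))
        correct += (choice == answer)
    return correct / len(questions)

# e.g. questions = [("canary", ["plant", "animal"], "animal"),
#                   ("canary", ["tree", "flower", "bird", "fish"], "bird")]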
Results
Figure 10.5 shows the fit of the model when it is optimized to account for
feature overlap values for both the small and large semantic networks. As shown
in Figure 10.5, the model is capable of rapidly acquiring the semantic structure
of a small world, as the correlation between the model’s vector cosine and the
feature overlap values from the network increases substantially as more linguistic
information is learned. The complexity of the small world obviously plays a large
role in the amount of linguistic experience that is necessary to account for the
proposed structure. Fit to the small network reached its maximum at 150,000 sentences, while fit to the large network reached its maximum at 880,000 sentences. However, the small network hit
asymptote at around 50,000 sentences, while the large network did the same at
approximately 200,000 sentences.
To demonstrate that the model acquired the same hierarchical information learned in the small-world modeling, the dendrograms for the small and large networks are displayed in Figure 10.6 and Figure 10.7, respectively. For the small network, the hierarchical clustering method accurately infers the clusters across the eight words. For the large network, the model reproduces the overall cluster properties (clustered into trees/flowers/fish/birds/mammals), with only one error (robin was classified in its own cluster, closer to mammals). The error arose because the word robin had approximately equal similarity to both birds and mammals. Additionally, the clusters are not as well discriminated as the dendrogram in Figure 10.4, a result expected given the differences in noise across the training sets.
FIGURE 10.5 Correlation of the similarity between words and the feature overlap values from the two semantic networks, as a function of the number of sentences sampled.
Natural language contains much more knowledge about the words in question (e.g. that a robin nests in trees), which shifts the similarity space. However, the simulation is impressive because acquiring hierarchical information is not an explicit goal for the model. Instead, the language environment alone was enough for the model to acquire such structure, as the hierarchical organization emerged across training, similar to the findings of Rogers and McClelland (2004).
A stronger test of the model’s ability to acquire the requisite information is given
by the AFC test described above. In the AFC test, the model has to discriminate
features associated with the word in the semantic networks. Figure 10.8 displays
the increase in performance across the corpus sampling routine. Even though
the learning task was quite difficult, the model reached an impressive 98 percent
accuracy for the small network questions, and an 87 percent accuracy for the large
network questions. The test is similar to a synonym test, such as the TOEFL
test used in Landauer and Dumais (1997), where LSA achieved a performance
level of 55 percent. That the model achieves such a high performance level
demonstrates the power both of the training materials that were assembled and
of the data fitting method that was used. Rather surprisingly, the model did
not require as many sentences to learn the features of the semantic network as
it did to learn the connections among words, as only 130,000 sentences were
FIGURE 10.6 Dendrogram of the hierarchical structure of the representation learned from the corpus sampling routine for words from the small network.
FIGURE 10.7 Dendrogram of the hierarchical structure of the representation learned from the corpus sampling routine for words from the large network.
FIGURE 10.8 Performance on the AFC test as a function of the number of sentences sampled.
needed to reach maximum performance for the small network, and 560,000
sentences were needed for the large network. It is worth noting that the TASA
corpus—a standard corpus used in co-occurrence models of semantics since
Landauer and Dumais (1997)—contains approximately 760,000 sentences. So, the
number of sentences that the optimization method required is not overly large
when compared with standard corpora that are used in the field of semantic
memory. The test demonstrates that a model trained directly with data from the
linguistic environment not only learned the hierarchical structure of a small world
but also learned the semantic features that define it.
The simulations demonstrate that a Big Data analysis approximates the structure
contained in small worlds quite readily. However, a different question lies in the
comparative complexity of the learning task that models of semantic memory face.
For the small network, the small world had 84 propositions. In the Big Data analysis, the model maximized performance at 1,807 times more sentences than are contained in the proposition corpus. For the larger semantic network, 3,911 times more sentences were required. However, the feature AFC test
only needed 1,566 and 2,488 times more sentences than the proposition corpus
for the small and large networks, respectively, suggesting that different types of
information can be learned more efficiently. Overall, our analysis suggests that the
Discussion
This section attempted to understand the connection between the proposals of small-world analyses of semantic memory and those that rely upon Big Data. We found that a model which relied upon large amounts of lexical experience to form a semantic representation of an item was able to acquire a high-quality approximation to the structure of different small worlds, assessed with multiple tests, including the acquisition of hierarchical information and the learning of semantic features. The corpus-sampling algorithm allowed the model to select a set of texts that provided the best fit to the small-world structure. Even with this data fitting methodology, obtaining a maximal fit required large amounts of natural language materials.
General Discussion
The goal of this chapter was to examine the simplification assumption in cognitive
modeling by comparing the small world versus Big Data approaches to semantic
memory and knowledge acquisition. A central aspect of semantic memory is
the learning of environmental structure through experience. The small-world
approach proposes that in order to understand the mechanisms that underlie this
ability, the structure of the environment must be constrained to lay bare the
learning process itself. By constraining the complexity of the training materials,
it is possible to study the operations of a model at a fine-grained level, as the
theorist knows the generating model that produced the environmental experience.
In contrast, a Big Data approach proposes that ecologically valid training materials
are necessary—natural language materials. Although the researcher loses control of the minutiae of what the model learns, the approach gains power through its plausibility, as such models readily scale up to the linguistic experience that humans receive.
To determine the relation between these two approaches, the small world
analysis examined how readily a vector-accumulation model of semantics (a
standard Big Data model) could acquire the structure of a small world. We found
that the model rapidly acquired knowledge of the small world through a small
corpus of propositions. When the complexity of the training materials was limited, the learning task became quite simple. The structure of the items used was sufficient to explain the behaviors in question, with no advanced learning or abstraction
processes necessary. This echoes recent work on implicit memory and artificial
grammar learning (e.g. Jamieson & Mewhort, 2011) and natural language sentence
processing (Johns & Jones, 2015).
Given the ease with which the model was capable of learning the small world, the Big Data analysis determined how much natural language information was necessary to approximate the structure proposed by the two semantic networks from Rogers and McClelland (2004). We used a very large amount of high-quality language sources—large sets of fiction, non-fiction, and young adult books, along with Wikipedia and Amazon product descriptions. The texts were split into smaller sections, and a hill-climbing algorithm was used to iteratively select the texts that allowed the model to attain the best fit to the proposed structure. Across multiple
tests, the model was readily capable of learning the small worlds, but the amount of
experience needed was orders of magnitude greater than the size of the proposition
corpora. This analysis suggests that the learning tasks under the two approaches
differ greatly in the complexity of the materials used.
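The corpus selection procedure referred to above can be illustrated with a simple greedy (hill-climbing) sketch in which text sections are added one at a time, keeping at each step the section that most improves the model's fit to the target structure. The sections and the fit function are placeholders; the actual sectioning and fitting criterion are only summarized in the text.

```python
# A hedged sketch of greedy (hill-climbing) corpus selection: add one text
# section at a time, keeping the section that most improves fit to the target
# structure. `sections` and `fit` are placeholders for the actual materials
# and fitting criterion (e.g. rank correlation with feature overlap).
def hill_climb(sections, fit, max_steps=100):
    selected, remaining = [], list(sections)
    best = fit(selected)
    for _ in range(max_steps):
        scored = [(fit(selected + [s]), s) for s in remaining]
        score, candidate = max(scored, key=lambda pair: pair[0])
        if score <= best:            # no section improves the fit; stop
            break
        best = score
        selected.append(candidate)
        remaining.remove(candidate)
    return selected, best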
This issue of informational complexity gets at the crux of the problems
surrounding the simplification assumption in studying semantic memory: By
reducing the complexity of the structure of the environment to a level that is
tractable for a theorist to understand fully, the problem faced by a hypothesized
learner is trivialized. The linguistic environment, although heavily structured in
some ways, is still extremely noisy, and requires learning mechanisms that are
capable of discriminating items contained in very large sets of data. Thus, the
simplification assumption biases theory selection towards learning and processing
mechanisms that resemble humans on a small and artificial scale. The emergence of
Big Data approaches to cognition suggests that artificial toy datasets are no longer
necessary. Models can now be trained and tested on data that are on a similar
scale to what people experience, increasing the plausibility of the model selection
and development process. For models of semantic memory, the existence of
high-quality, large corpora to train models eliminates the necessity for oversimplification,
and offers additional constraints as to how models should operate.
This is not to say that the small-world approach is without merit, a point made
clear in the history of cognitive science. The goal of the small-world assumption is
to provide an accurate understanding of the operation of a model with clear and
concise examples, something that models that focus only on Big Data techniques
cannot achieve. Thus, as in the evolution of any new theoretical framework, past
knowledge should be used to constrain new approaches. The Big Data approach
should strive to embody the ideal of the simplification assumption in cognitive
modeling, that of clear and concise explanations, while continuing to expand the
nature of cognitive theory.
Note
1 What we refer to here as a small-world approach is also commonly referred to
as a “toy model” approach.
References
Bannard, C., Lieven, E., & Tomasello, M. (2009). Modeling children’s early grammatical
knowledge. Proceedings of the National Academy of Sciences of the United States of America,
106, 17284–17289.
Barsalou, L. W. (1999). Perceptual symbol systems. Behavioral and Brain Sciences, 22, 577–660.
Bullinaria, J. A., & Levy, J. P. (2012). Extracting semantic representations from word
co-occurrence statistics: Stop-lists, stemming, and SVD. Behavior Research Methods, 44,
890–907.
Collins, A. M., & Quillian, M. R. (1969). Retrieval time from semantic memory. Journal
of Verbal Learning and Verbal Behavior, 8, 240–247.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Estes, W. K. (1975). Some targets for mathematical psychology. Journal of Mathematical
Psychology, 12, 263–282.
Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation.
Psychological Review, 114, 211–244. doi: 10.1037/0033-295X.114.2.211.
Hare, M., Jones, M. N., Thomson, C., Kelley, S., & McRae, K. (2009). Activating event
knowledge. Cognition, 111, 151–167.
Hills, T. T., Jones, M. N., & Todd, P. T. (2012). Optimal foraging in semantic memory.
Psychological Review, 119, 431–440.
Jamieson, R. K., & Mewhort, D. J. K. (2009a). Applying an exemplar model to the
artificial-grammar task: Inferring grammaticality from similarity. Quarterly Journal of
Experimental Psychology, 62, 550–575.
Jamieson, R. K., & Mewhort, D. J. K. (2009b). Applying an exemplar model to the
serial reaction time task: Anticipating from experience. Quarterly Journal of Experimental
Psychology, 62, 1757–1784.
Jamieson, R. K., & Mewhort, D. J. K. (2010). Applying an exemplar model to
the artificial-grammar task: String-completion and performance for individual items.
Quarterly Journal of Experimental Psychology, 63, 1014–1039.
Jamieson, R. K., & Mewhort, D. J. K. (2011). Grammaticality is inferred from global
similarity: A reply to Kinder (2010). Quarterly Journal of Experimental Psychology, 64,
209–216.
Johns, B. T., & Jones, M. N. (2015). Generating structure from experience: A retrieval-based
model of sentence processing. Canadian Journal of Experimental Psychology, 69, 233–251.
Johns, B. T., Taler, V., Pisoni, D. B., Farlow, M. R., Hake, A. M., Kareken, D. A., . . . Jones,
M. N. (2013). Using cognitive models to investigate the temporal dynamics of semantic
memory impairments in the development of Alzheimer’s disease. In Proceedings of the 12th
International Conference on Cognitive Modeling, ICCM.
Jones, M. N., & Mewhort, D. J. K. (2007). Representing word meaning and order
information in a composite holographic lexicon. Psychological Review, 114, 1–37.
Jones, M. N., Kintsch, W., & Mewhort, D. J. K. (2006). High-dimensional semantic space
accounts of priming. Journal of Memory and Language, 55, 534–552.
Jones, M. N., Willits, J. A., & Dennis, S. (2015). Models of semantic memory. In
J. R. Busemeyer & J. T. Townsend (Eds.), Oxford handbook of mathematical and computational
psychology.
Landauer, T. K., & Dumais, S. (1997). A solution to Plato’s problem: The Latent
Semantic Analysis theory of the acquisition, induction, and representation of knowledge.
Psychological Review, 104, 211–240.
McAuley, J., & Leskovec, J. (2013). Hidden factors and hidden topics: Understanding rating
dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender
Systems, Rec-Sys ’13 (pp. 165–172).
McClelland, J. L. (2009). The place of modeling in cognitive science. Topics in Cognitive
Science, 1, 11–38.
Recchia, G. L., & Jones, M. N. (2009). More data trumps smarter algorithms: Comparing
pointwise mutual information to latent semantic analysis. Behavior Research Methods, 41,
657–663.
Recchia, G. L., Jones, M. N., Sahlgren, M., & Kanerva, P. (2015). Encoding sequential
information in vector space models of semantics: Comparing holographic reduced
representation and random permutation. Computational Intelligence and Neuroscience.
Retrieved from http://dx.doi.org/10.1155/2015/986574.
Rogers, T. T., & McClelland, J. L. (2004). Semantic cognition: A parallel distributed processing
approach. Cambridge, MA: MIT Press.
Rumelhart, D. E. (1990). Brain style computation: Learning and generalization. In S. F.
Zornetzer, J. L. Davis, & C. Lau (Eds.), An introduction to neural and electronic networks (pp.
405–420). San Diego, CA: Academic Press.
Rumelhart, D. E., & Todd, P. M. (1993) Learning and connectionist representations. In
D. E. Meyer & S. Kornblum (Eds.), Attention and performance XIV: Synergies in experimental
psychology, artificial intelligence, and cognitive neuroscience (pp. 3–30). Cambridge, MA: MIT
Press.
Sahlgren, M., Holst, A., & Kanerva, P. (2008). Permutations as a means to encode
order in word space. In Proceedings of the 30th Conference of the Cognitive Science Society
(pp. 1300–1305).
Shiffrin, R. M. (2010). Perspectives on modeling in cognitive modeling. Topics in Cognitive
Science, 2, 736–750.
Shiffrin, R. M., Lee, M. D., Kim, W. T., & Wagenmakers, E.-J. (2008). A survey of model
evaluation approaches with a tutorial on hierarchical Bayesian methods. Cognitive Science,
32, 1248–1284.
Simon, H. A. (1969). The sciences of the artificial. Cambridge, MA: MIT Press.
Wittgenstein, L. (1953). Philosophical investigations. Oxford: Blackwell.
11
ALIGNMENT IN WEB-BASED
DIALOGUE
Who Aligns, and How Automatic Is It? Studies in
Big-Data Computational Psycholinguistics
David Reitter
Abstract
Studies on linguistic alignment explain the emergence of mutual understanding in
dialogue. They connect to psycholinguistic models of language processing. Recently,
more computational cognitive models of language processing in the dialogue setting have
been published, and they are increasingly based on large observational datasets rather
than hypothesis-driven experimentation. I review literature in data-intensive computational
psycholinguistics and demonstrate the approach in a study that elucidates the mechanisms
of alignment. I describe consistent results on data from two large online forums, Reddit and
the Cancer Survivors Network, which suggest that linguistic alignment is a consequence of
the architecture of the human memory system and not an explicit, consciously controlled
process tailored to the recipient of a speaker’s contributions to the dialogue.
Introduction
In this chapter, I will survey studies on linguistic alignment in order to connect
the inferences we make from large datasets to the psycholinguistics of dialogue.
This work is motivated by a high-level question: How does our mind select and
combine words in a way that reliably communicates our intentions to a dialogue
partner? The search for a computational answer to this question has been revitalized
in recent years by the advent of large datasets. Large datasets give us a window into an
individual's mind and into a cooperative process between minds. They allow us to look at
how dialogue partners gradually converge in their choices of words and sentence
structure, thereby creating a shared language.
This process relies on an implied contract among the people engaging in
dialogue. It specifies long-term and temporary conventions that establish the
An integrated account can take a stance with respect to each of these questions.
In following Newell’s call for models that “must have all the details” and describe
“how the gears clank and how the pistons go” (Anderson, 2007), the model should
be a computational account that actually carries out language acquisition, language
comprehension, and language production. Further, Newell’s call for functional
models means that we need to cover the broad range of linguistic constructions
present in a corpus. To achieve this objective, we must use the large-scale language
resources that are standard in computational linguistics. They reflect language use
in the wild.
The conversation about such approaches has been taking place in a relatively
new field, computational psycholinguistics, which has discovered a range of phenomena
that may form the basis for how we think about the mechanisms of human language
acquisition and processing.
Linguists have asked provocative questions using these methods. To name a
few: How is information density distributed throughout text, and why? When
is language production incremental? How is working memory used in language
processing? Computational psycholinguistics has pushed the boundaries to cover
the broad expressive range found in corpora.
The field discovers how humans learn, produce, and comprehend natural language,
and the models are informed by observations from contemporary language
use. Standard psycholinguistic methods examine human language performance by
collecting data on comprehension and production speed, eye movements while
reading (Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995; Demberg &
Keller, 2008; Henderson & Ferreira, 2004), or specific processing difficulties
(e.g. self-embedding or general center-embedding: Chomsky & Miller, 1963,
and Gibson, 1991). These methods are productive but require data collection,
while more can be learned from unannotated data. The machine learning
field of semi-supervised learning has developed computational accounts that
describe successful learning from small portions of annotated and large portions
of unannotated data (e.g. Chapelle, Scholkopf, & Zien, 2006; Ororbia II, Reitter,
Wu, & Giles, 2015).
Large datasets have been used to first verify experimental results in naturalistic
language, and now they allow us to find more fine-grained support for theoretical
models of dialogue. The subject that exemplifies the use of Big Data is
alignment, the tendency for people in a conversation to conform to each other
in syntax, word choice, or other linguistic decision levels. The studies relating
to alignment are particularly interesting, as they relate low-level repetition effects
such as syntactic priming to high-level dialogue strategies and even up to the
social roles of those participating in dialogue. Big Data has been contributing
to our understanding of this process. I will bridge the range of work from
traditional controlled experimentation to a new analysis of a very large Internet
dataset, illustrating not only results, but also challenges associated with such
datasets.
Alignment
I focus on a set of well-known memory-related phenomena around alignment
(Bock, 1986; Pickering & Ferreira, 2008). This is an effect of gradual convergence
throughout dialogue. Alignment describes a range of related phenomena that cause
speakers to repeat themselves or others, to gradually adapt to someone else, or to
become more consistent in their own speech. This tendency affects not just the
words speakers use; it also affects their sentence structure and even some aspects
of semantics. Indeed, alignment1 is claimed to be based on adaptation effects at
several linguistic levels (Pickering & Garrod, 2004): Lexical priming, syntactic
(structural) priming, or priming at even higher, behavioral levels.
Syntactic Priming
Syntactic priming is one adaptation effect that contributes to alignment. Speakers
may choose among different words and grammatical structures to express their ideas.
Priming has been demonstrated for alternations such as prepositional objects (The painter
showed his drawing to the customer) versus direct objects (The painter showed the customer his drawing)
(Bock, 1986; Pickering & Branigan, 1998), complex noun phrases (Cleland &
Pickering, 2003), the order of main verbs and auxiliary verbs (Hartsuiker, Bernolet,
Schoonbaert, Speybroeck, & Vanderelst, 2008), and a range of other grammatical
structures in various languages. Not surprisingly, syntactic priming applies to both
dialogues and monologues. Nonetheless, some studies suggest that priming effects
are stronger in dialogue than in monologue (Cleland & Pickering, 2003; Kootstra,
van Hell, & Dijkstra, 2010).
The tendency to repeat a particular structure when it has been recently
encountered has also been observed in corpus studies of spoken language (Reitter,
2008; Szmrecsanyi, 2005; Travis, 2007) and Internet forum conversations (Wang
et al., 2014). Alignment applies to other levels of linguistic analysis, including
referring expressions such as pronouns (Brennan & Clark, 1996) and style
(Niederhoffer & Pennebaker, 2002).
Corpus studies have since provided evidence for syntactic priming outside of
carefully controlled laboratory settings. Speakers adapt in situated, realistic dialogue.
For instance, the Map Task corpus (Anderson et al., 1991; McKelvie, 1998) shows
syntactic priming-like repetition effects (Reitter, Moore, & Keller, 2006c). The cited
corpus study also models priming as an effect that applies to syntactic rules in general,
rather than specific alternations such as those in the above examples.
Analyses of spoken language corpora generally showed that the probability of
repeating a structure decreases as the amount of time between the priming object
and the primed object increases (Gries, 2005; Reitter et al., 2006b; Szmrecsanyi,
2006). This would suggest priming effects decay over time. However, these earlier
corpus studies did not control for the characteristics of the language between prime
and target. To control for this possible confound, I modeled decay of repetition
probability as the variable that quantified priming (Reitter, 2008).
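For illustration only (this is not Reitter's analysis code, and the simulated data and column names repeated, distance, and between_speakers are hypothetical), such a decay analysis can be set up as a logistic regression of structure repetition on log-transformed prime–target distance:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated prime-target pairs: whether the target repeated the prime's structure,
# the temporal distance between them (seconds), and whether the pair crossed speakers.
rng = np.random.default_rng(0)
distance = rng.uniform(1, 15, size=2000)
p_repeat = (0.015 - 0.004 * np.log(distance)).clip(0.001, 0.99)  # toy decaying rate
df = pd.DataFrame({
    "repeated": rng.binomial(1, p_repeat),
    "distance": distance,
    "between_speakers": rng.integers(0, 2, size=2000),
})

# A reliably negative coefficient on the log-distance term indicates decay,
# which is read here as evidence of short-term priming.
model = smf.logit("repeated ~ np.log(distance) * between_speakers", data=df).fit()
print(model.summary())
```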
There are two clearly separate syntactic priming effects: (a) Fast, short-term
and short-lived priming, and (b) slow, long-term adaptation that is likely a result
of implicit learning (see Ferreira & Bock, 2006; Pickering & Ferreira, 2008).
Long-term adaptation is a learning effect that can persist, at least, over several days
(Bock, Dell, Chang, & Onishi, 2007; Kaschak, Kutta, & Schatschneider, 2011).
Recent work has proposed models that explain the mechanisms of these effects
within the context of language acquisition (Bock & Griffin, 2000; Chang, Dell, &
Bock, 2006; Kaschak, Kutta, & Jones, 2011) and general memory retrieval (Reitter
et al., 2011). These studies suggest that priming is the precursor to persistent
language change.
[Figure 11.1 appears here; y-axis: p(prime=target | target, distance); x-axis: temporal distance between prime and target (seconds); curves: Switchboard PP, Switchboard CP, and Map Task PP.]
FIGURE 11.1 Logistic regression models of syntactic repetition fitted to data from
two corpora: Map Task (task-oriented dialogue) and Switchboard (spontaneous
conversation by phone). CP: Priming between speakers, PP: Self-priming.
Short-term priming decays rapidly (Branigan, Pickering, & Cleland, 1999). This decay has been shown to follow a logarithmic
function of time or linguistic activity (Reitter et al., 2006b). Since this decay is
a side effect of priming, we can use it to quantify the strength of the priming:
A stronger priming effect will decay more quickly. This way, we can distinguish
it from other potential sources of increased repetition, such as text genre or the
clustering of topics. Figures 11.1 and 11.4 illustrate the decay effect using different
methods and datasets. The repetition rate of linguistic material can be modeled as
a function of either time (Figure 11.1) or linguistic activity (Figure 11.4). Indeed,
decay can be observed at time-scales much larger than those found in spoken
dialogue, as has been shown in Internet forum conversations (Figure 11.4;
Wang et al., 2014).
Syntactic priming is also cumulative (Hartsuiker & Westenberg, 2000; Jaeger
& Snider, 2008; Kaschak, Loney, & Borregine, 2006). While cumulativity has
not been taken into account by previous corpus analyses, it is included in the
present effort. Models that account for cumulativity and decay in a cognitively
plausible manner will make more precise predictions about structure use, processing
principles, and parameters defining alignment strength.
for adaptation by and among people with social communication disorders and/or
autism spectrum disorders.
Indeed, we have observed an increased amount of syntactic priming in
task-oriented dialogue compared to spontaneous conversation (Reitter & Moore,
2007). A follow-up cognitive model of syntactic priming (Reitter et al., 2011)
may have a mechanistic explanation for the differences we observed between
task-oriented dialogue and spontaneous conversation. According to the model,
the contents of working memory serve as cues for the retrieval of syntactic material. That means
that attention to concepts discussed in the conversation plays a role in making
syntactic material available. Because the model learns associations between concepts and syntactic
decisions, remaining within a specific topic would yield stronger priming effects,
as the relevant semantics and their associations to syntactic constructions are still available,
while switching topics would reduce priming. This does not disallow any
control over adaptation, but it argues for mechanistic adaptation as the default in
conversation. However, more empirical work is clearly needed to back up the
account.
According to the authors of the paper, this means that alignment has been
engrained in communicative patterns and is removed from its functional role. They
even compare linguistic alignment to what was called the Chameleon Effect, namely
the “nonconscious mimicry of the postures, mannerisms, facial expressions, and
other behaviors of one’s interaction partners” (Chartrand & Bargh, 1999: 893).
Modulation of alignment levels may be linked to affect: del Prado Martin and Du Bois
(2015) find that positive attitudes are linked to more alignment. It is possible that
attitudes are linked to engagement and attention, which affects alignment.
This raises the question of just how alignment really depends on
attention. Do we align primarily with attended-to speakers? This would support a
more mechanistic theory of alignment rooted in basic cognitive processes. On the
other hand, it may be the case that alignment is an audience-design effect, which
causes us to align strategically with those speakers we address as opposed to those
speakers to whose language we were most recently exposed.
Branigan, Pickering, McLean, and Cleland (2007) studied the latter hypothesis
using lab-based, staged interactions among experimental participants. The results
were mixed. While there was some alignment to speakers who were not addressed,
the effect was smaller. This leaves an attention-based explanation as well as social
modulation on the table. Observational studies with large datasets allow us to look
beyond primary effects of repetition. They cannot replace studies that establish
causality, but with enough high-quality data, we can examine more fine-grained
interactions that, in this case, would reveal the effects of social modulation on
decay. Such observations would have consequences for the architecture of both
language production and language acquisition.
The proposed memory-based explanation for alignment suggests that alignment
relies on the same mechanisms as language learning. This has the empirically
verifiable consequence that alignment is a precursor to permanent language
change in individuals and among members of a cultural group. We can consider
this possibility in the context of two scientific realms: Psycholinguistics and
information science. From a psycholinguistic perspective, we seek confirmation
of the mechanisms of language production. A sample of pertinent questions:
Is it attention or intention that modulates alignment? Is memory the driver of
alignment, and if so, which kind of memory: Declarative or procedural? Answering
these psycholinguistic questions faces challenges, as Big Data in the form of online
forums comes with both variability and confounds. Variability occurs due to a
lack of control over which messages an author actually read before writing a
reply; however, we can counter such error with more data. Confounds occur,
for instance, because the selection of words is inextricably linked to topics, which
shift systematically throughout dialogue. However, we can avoid this confound by
measuring alignment on other linguistic decisions, such as sentence structure.
From the perspective of information science, I am interested in how choices
of words and topics propagate through a community. How does discourse change
Research Questions
To determine social influence on alignment, we ask whether messages of different
status in a thread can be more or less linguistically influential. We determine the
amount of repetition of words and of syntactic constructions for pairs of messages,
each consisting of an origin and a target message. The earlier “origin” message can take
on one of several roles: It can be the first reply to the topic (F),
a message by the initial author (I), a self-reply (S), or any other message (A).
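As a concrete illustration of this coding scheme (a minimal sketch under assumed data structures, not the pipeline used for the Reddit and CSN analyses, and not the LILLA/SILLA measures of Wang et al., 2014), the following labels the origin post of a pair with one of the four roles and computes a simple word-repetition rate, together with the reply distance discussed below:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Post:
    pid: int
    author: str
    parent: Optional[int]   # pid of the post this one replies to (None for the opener)
    text: str

def role(origin: Post, target: Post, thread: list) -> str:
    """Label the origin relative to the target: F, I, S, or A."""
    opener = thread[0]
    first_reply = next((p for p in thread if p.parent == opener.pid), None)
    if first_reply is not None and origin.pid == first_reply.pid:
        return "F"                      # first reply to the topic
    if origin.author == opener.author:
        return "I"                      # post by the thread initiator
    if origin.author == target.author:
        return "S"                      # self-reply
    return "A"                          # any other post

def reply_distance(origin: Post, target: Post, by_id: dict) -> Optional[int]:
    """Number of reply links from the target back up to the origin (None if not an ancestor)."""
    steps, current = 0, target
    while current.parent is not None:
        steps += 1
        current = by_id[current.parent]
        if current.pid == origin.pid:
            return steps
    return None

def word_repetition(origin: Post, target: Post) -> float:
    """Proportion of the target's word tokens that also occur in the origin."""
    origin_words = set(origin.text.lower().split())
    target_words = target.text.lower().split()
    return sum(w in origin_words for w in target_words) / max(len(target_words), 1)

# Toy thread: opener, first reply, and a reply to that reply.
thread = [
    Post(1, "ann", None, "my doctor suggested a new treatment plan"),
    Post(2, "bob", 1, "which treatment plan did the doctor suggest"),
    Post(3, "ann", 2, "the plan involves a new drug my doctor likes"),
]
by_id = {p.pid: p for p in thread}
origin, target = thread[1], thread[2]
print(role(origin, target, thread),
      reply_distance(origin, target, by_id),
      round(word_repetition(origin, target), 2))   # -> F 1 0.33
```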
Under the hypothesis of social modulation of the memory effect, we would
expect differences in strength of adaptation regardless of the distance. However, we
would also expect differences in decay. Specifically, we would expect to see more
adaptation, and thus more decay, in important origin messages or those origin posts
authored by someone deemed important. For the purposes of this study, we assume
that the initiator of the conversation is important, as is his or her first message, as
well as the first reply to the initiator. Under the alternative hypothesis of a purely
mechanistic effect, we would see no difference in decay in this scenario or possibly
even the opposite relationship.
In order to examine decay of alignment, we first need a system for measuring
the distance between the origin and target. One metric for measuring decay in
forums is the reply distance. For example, if post PA replies to post PB, which
in turn is a reply to post PC, then we would say the reply distance from PA to PC
is 2. As an alternative measure of distance, we can use the actual time that the post
was written. This can be useful because information in a relatively uncontrolled
conversation becomes stale and loses influence as time passes. Notably, there is
no single, correct measure of distance when it comes to the analysis of linguistic
[Figure appears here; y-axes: syntactic alignment (log-SILLA) and lexical alignment (log-LILLA); x-axis: post distance between prime and target post; legend: initial post, any post by initial author, parent.]
origin-target pairs, including those where the origin was written by the thread
initiator. For longer time spans between origins and targets beyond about 30,000
seconds, repetitions may actually increase with increasing distance to the origin
message, which is not alignment. Only the syntax analysis shows a mild repetition
increase, which is then followed by a strong decay. Overall, these data do not
seem to support an audience-design hypothesis. However, I would caution that
the time between writing origin and target posts is a measure that confounds
an individual author’s memory with a form of externalized, networked memory.
This can be appropriate from the perspective of an ecological, high-level model of
multi-party dialogue, but is more problematic when interpreting these data from
the psychologist’s perspective.
The CSN dataset, in an analysis first discussed elsewhere (Wang et al., 2014), tells
a similar story. Lexical alignment in CSN decreases with post distance.
Alignment with posts written by the thread initiator is lower initially and also
decays less (Figure 11.4). For syntactic priming, we even observe an increase in
alignment over time, speaking against the audience-design hypothesis.
The evidence we find, overall, confirms the results by Branigan et al. (2007)
for the cases of both naturalistic multi-party dialogue and lexical alignment. Any
memory mechanisms underlying alignment seem to have little sensitivity to the
role of the source (origin message), as decay is not greater for such roles. Further,
absolute repetitions are initially lower for origin messages by the thread initiator
than for other origin messages. The observed differences in lexical similarity can be
interpreted as the result of the pragmatic consequences of addressing one another’s
[Figure 11.4 appears here; y-axes: LILLA(word ∈ target | word ∈ prime post) and the corresponding syntactic measure; x-axis: post distance between prime and target post; legend: initial post, any post by initial author, any post.]
FIGURE 11.4 Lexical and syntactic alignment in the Cancer Survivors Network. Data
and analysis from Wang et al. (2014).
language. The basic principle we follow is that a speaker will adjust his or
her use of X after hearing another speaker use X. However, for this adjustment of
the speaker's language model to happen, structure X must actually be cognitively
present. Otherwise, there would be no memory item to reinforce. That means
that by showing that speakers adapt upon hearing X, we find evidence in favor
of X as a cognitive artifact. As long as we can cheaply determine, on a large
dataset, where that structure applies, we can measure sensitivity to its use and
thereby detect adaptation. As an example, we have done that in a small study
with competing classes of representations (Reitter, Hockenmaier, & Keller, 2006a).
The structures we looked at described either fully incremental or non-incremental
syntactic processing. (The question here is whether new words and phrases are
immediately adjoined to the semantics or syntactic type of the existing sentence, or
if they are buffered in some form of working memory and combined out of order.)
By looking at adaptation in a relatively small corpus, we found some hints that
incrementality is actually flexible—although more work is necessary to robustly
model incrementality on more data. Much of this work can be done cheaply on
unannotated data once we have the computational means to induce grammar from
data (cf. Bod, 1992) based on weak adaptation effects.
Conclusion
In this chapter, I hoped to demonstrate a Big Data approach to cognitive
science that observes linguistic performance in the wild. The minimally controlled
environment comes with obvious benefits and with some challenges. The benefits
lie in the broad coverage of syntactic constructions, conversational styles, and
communities. With the analysis of dialogue corpora such as Switchboard, Map Task,
and Reddit, we were able to show not only that alignment effects in real-world
data were smaller than those observed in the lab, but also that they varied in theoretically
relevant ways: For example, with task success (Reitter & Moore, 2014), but not
necessarily with the intended audience.
The challenges of the Big Data approach, however, also illustrate where a
carefully constructed experiment can produce more informative conclusions. The
correlation between lexical and syntactic levels is an example of this problem.
Work with large datasets in general comes with an inherent challenge: They
are observational. While we can observe correlations, causal inference is much
more difficult and requires more information, such as temporal relationships
(what happens later cannot have caused what happened earlier). However, direct
causal relationships cannot be inferred without ruling out latent variables. Further, with a
large dataset, we can usually find some correlations that are deemed significant,
numerically. As Adam Kilgarriff put it: “Language is never, ever random”
(Kilgarriff, 2005). We should not rely on a single dataset, or at least not on a single
sample of one, to draw sound conclusions. Most importantly, the opportunity to
observe seemingly reliable correlations between variables emphasizes our obligation
to always begin with a theoretical framework and clear hypotheses. With
hindsight, many models can explain observational data.
Acknowledgments
This research was supported by the National Science Foundation (IIS 1459300
and BCS 1457992). The analyses on the CSN dataset were made possible by a
collaboration agreement between Penn State and the American Cancer Society. I
would like to thank Yafei Wang, Yang Xu, and John Yen for their discussions and
work on the CSN dataset, Kenneth Portier and Greta E. Greer for creating and
making available the CSN dataset, Lillian Lee for her comments on the Reddit
dataset, as well as Jason M. Baumgartner (Pushshift.io) for curating the Reddit
dataset, and Jeremy Cole for advice in preparing this chapter.
Notes
1 Alignment in this sense is distinct from alignment between texts or sentences
as used in machine translation. It is also different from alignment as used in
constituent substitutability, as by Van Zaanen (2000).
2 For a review, see Pickering and Ferreira (2008); the terms syntactic priming
and structural priming are used more or less interchangeably in the literature.
3 A parent is defined as being a message to which the target message responds, or
as being the parent of any such parent message.
References
Anderson, A. H., Bader, M., Bard, E., Boyle, E., Doherty, G. M., Garrod, S., . . . Weinert,
R. (1991). The HCRC Map Task corpus. Language and Speech, 34(4), 351–366.
Anderson, J. R. (1993). Rules of the mind. Hillsdale, NJ: Erlbaum.
Anderson, J. R. (2007). How can the human mind occur in the physical universe? Oxford, UK:
Oxford University Press.
Anderson, J. R., & Lebiere, C. (1998). The atomic components of thought. Mahwah, NJ:
Erlbaum.
Bard, E. G., Anderson, A. H., Sotillo, C., Aylett, M., Doherty-Sneddon, G., & Newlands,
A. (2000). Controlling the intelligibility of referring expressions in dialogue. Journal of
Memory and Language, 42, 1–22.
Bicknell, K., & Levy, R. (2010). A rational model of eye movement control in reading.
In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp.
1168–1178). ACL ’10. Uppsala, Sweden: Association for Computational Linguistics.
Bock, J. K. (1986). Syntactic persistence in language production. Cognitive Psychology, 18,
355–387.
Bock, J. K., & Griffin, Z. M. (2000). The persistence of structural priming: Transient
activation or implicit learning? Journal of Experimental Psychology: General, 129(2), 177.
Bock, J. K., Dell, G. S., Chang, F., & Onishi, K. H. (2007). Persistent structural priming
from language comprehension to language production. Cognition, 104(3), 437–458.
Bod, R. (1992). A computational model of language performance: Data oriented parsing.
In Proceedings of the 14th Conference on Computational Linguistics—volume 3 (pp. 855–859).
Association for Computational Linguistics.
Branigan, H. P., Pickering, M. J., & Cleland, A. A. (1999). Syntactic priming in language
production: Evidence for rapid decay. Psychonomic Bulletin and Review, 6(4), 635–640.
Branigan, H. P., Pickering, M. J., McLean, J. F., & Cleland, A. A. (2007).
Syntactic alignment and participant role in dialogue. Cognition, 104(2), 163–197. doi:
http://dx.doi.org/10.1016/j.cognition.2006.05.006.
Branigan, H. P., Pickering, M. J., Pearson, J., McLean, J. F., & Brown, A. (2011). The
role of beliefs in lexical alignment: Evidence from dialogs with humans and computers.
Cognition, 121(1), 41–57. doi: 10.1016/j.cognition.2011.05.011.
Brennan, S. E., & Clark, H. H. (1996). Conceptual pacts and lexical choice in conversation.
Journal of Experimental Psychology: Learning, Memory, and Cognition, 22(6), 1482.
Chang, F., Dell, G. S., & Bock, J. K. (2006). Becoming syntactic. Psychological Review,
113(2), 234–272.
Chapelle, O., Schölkopf, B., & Zien, A. (Eds.). (2006). Semi-supervised learning: Adaptive
computation and machine learning. Cambridge, MA: MIT Press.
Chartrand, T. L., & Bargh, J. A. (1999). The chameleon effect: The perception-behavior
link and social interaction. Journal of Personality and Social Psychology, 76(6), 893.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.
Chomsky, N., & Miller, G. A. (1963). Introduction to the formal analysis of natural
languages. In R. D. Luce, R. R. Bush, & E. Galanter (Eds), Handbook of mathematical
psychology (Vol. 2, pp. 269–321). New York, NY: Wiley.
Cleland, A. A., & Pickering, M. J. (2003). The use of lexical and syntactic information
in language production: Evidence from the priming of noun-phrase structure. Journal of
Memory and Language, 49, 214–230.
Danescu-Niculescu-Mizil, C., & Lee, L. (2011). Chameleons in imagined conversations: A
new approach to understanding coordination of linguistic style in dialogs. In Proceedings
of the 2nd Workshop on Cognitive Modeling and Computational Linguistics (pp. 76–87).
Association for Computational Linguistics.
Danescu-Niculescu-Mizil, C., Cheng, J., Kleinberg, J., & Lee, L. (2012). You had me at
hello: How phrasing affects memorability. In Proceedings of the 50th Annual Meeting of the
Association for Computational Linguistics: Long papers—volume 1 (pp. 892–901). ACL ’12.
Jeju Island, Korea: Association for Computational Linguistics.
Dell, G. S., Chang, F., & Griffin, Z. M. (1999). Connectionist models of language
production: Lexical access and grammatical encoding. Cognitive Science, 23(4), 517–542.
del Prado Martin, F. M., & Du Bois, J. W. (2015). Syntactic alignment is an index of affective
alignment: An information-theoretical study of natural dialogue. In Proceedings of the 37th
Annual Meeting of the Cognitive Science Society. Pasadena, CA: Cognitive Science Society.
Demberg, V., & Keller, F. (2008). Data from eye-tracking corpora as evidence for theories
of syntactic processing complexity. Cognition, 109(2), 193–210.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Ferreira, V. S., & Bock, J. K. (2006). The functions of structural priming. Language and
Cognitive Processes, 21(7–8), 1011–1029.
Fusaroli, R., Bahrami, B., Olsen, K., Roepstorff, A., Rees, G., Frith, C., & Tylen, K.
(2012). Coming to terms: Quantifying the benefits of linguistic coordination. Psychological
Science, 23(8), 931–939.
Garrod, S., & Anderson, A. (1987). Saying what you mean in dialogue: A study in
conceptual and semantic coordination. Cognition, 27, 181–218.
Genzel, D., & Charniak, E. (2003). Variation of entropy and parse trees of sentences as a
function of the sentence number. In Proceedings of the 2003 Conference on Empirical Methods
in Natural Language Processing (pp. 65–72). Association for Computational Linguistics.
Gibson, E. (1998). Linguistic complexity: Locality of syntactic dependencies. Cognition,
68(1), 1–76. doi: 10.1016/S0010-0277(98)00034-1.
Gibson, E. A. F. (1991). A computational theory of human linguistic processing: Memory limitations
and processing breakdown. Doctoral dissertation, School of Computer Science, Carnegie
Mellon University.
Gries, S. T. (2005). Syntactic priming: A corpus-based approach. Journal of Psycholinguistic
Research, 34(4), 365–399.
Hale, J. (2003). The information conveyed by words in sentences. Journal of Psycholinguistic
Research, 32(2), 101–123.
Hartsuiker, R. J., & Westenberg, C. (2000). Persistence of word order in written and spoken
sentence production. Cognition, 75B, 27–39.
Hartsuiker, R. J., Bernolet, S., Schoonbaert, S., Speybroeck, S., & Vanderelst, D. (2008).
Syntactic priming persists while the lexical boost decays: Evidence from written and
spoken dialogue. Journal of Memory and Language, 58(2), 214–238.
Healey, P. G., Purver, M., & Howes, C. (2014). Divergence in dialogue. PLoS One, 9(6),
e98598.
Henderson, J., & Ferreira, F. (2004). Interface of language, vision and action: Eye movements and
the visual world. New York, NY: Psychology Press.
Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief
nets. Neural Computation, 18(7), 1527–1554.
Jaeger, T. F. (2006). Redundancy and syntactic reduction in spontaneous speech. Doctoral
dissertation, Stanford University.
Linguistics on Human Language Technologies: Short papers (pp. 169–172). HLT-Short ’08.
Columbus, Ohio: Association for Computational Linguistics.
Niederhoffer, K. G., & Pennebaker, J. W. (2002). Linguistic style matching in social
interaction. Journal of Language and Social Psychology, 21(4), 337–360.
Ororbia II, A. G., Giles, C. L., & Reitter, D. (2015). Learning a deep hybrid model
for semi-supervised text classification. In Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing (EMNLP). Lisbon, Portugal.
Ororbia II, A. G., Reitter, D., Wu, J., & Giles, C. L. (2015). Online learning of deep hybrid
architectures for semi-supervised categorization. In ECML PKDD. European conference on
machine learning and principles and practice of knowledge discovery in databases. Porto, Portugal:
Springer.
Paxton, A., Abney, D., Kello, C. K., & Dale, R. (2014). Network analysis of multimodal,
multiscale coordination in dyadic problem solving. In P. M. Bello, M. Guarini, M.
McShane, & B. Scassellati (Eds.), Proceedings of the 36th Annual Meeting of the Cognitive
Science Society. Austin, TX: Cognitive Science Society.
Pickering, M. J., & Branigan, H. P. (1998). The representation of verbs: Evidence
from syntactic priming in language production. Journal of Memory and Language, 39,
633–651.
Pickering, M. J., & Ferreira, V. S. (2008). Structural priming: A critical review. Psychological
Bulletin, 134(4), 427–459.
Pickering, M. J., & Garrod, S. (2004). Toward a mechanistic psychology of dialogue.
Behavioral and Brain Sciences, 27, 169–225.
Pinker, S. (1991). Rules of language. Science, 253(5019), 530–535.
Pollard, C., & Sag, I. A. (1994). Head-driven phrase structure grammar. Chicago: University of
Chicago Press.
Pullum, G. K., & Scholz, B. C. (2002). Empirical assessment of stimulus poverty arguments.
The Linguistic Review, 18(1–2), 9–50.
Qian, T., & Jaeger, T. F. (2011). Topic shift in efficient discourse production. In Proceedings
of the 33rd Annual Conference of the Cognitive Science Society (pp. 3313–3318).
Reichle, E. D., Rayner, K., & Pollatsek, A. (2003). The E-Z reader model of eye-movement
control in reading: Comparisons to other models. Behavioral and Brain Sciences, 26(04),
445–476.
Reitter, D. (2008). Context effects in language production: Models of syntactic priming in dialogue
corpora. Doctoral dissertation, University of Edinburgh.
Reitter, D., & Moore, J. D. (2007). Predicting success in dialogue. In Proceedings of the 45th
Annual Meeting of the Association of Computational Linguistics (pp. 808–815). Prague, Czech
Republic.
Reitter, D., & Moore, J. D. (2014). Alignment and task success in spoken dialogue. Journal
of Memory and Language, 76, 29–46. doi: 10.1016/j.jml.2014.05.008.
Reitter, D., Hockenmaier, J., & Keller, F. (2006a). Priming effects in combinatory categorial
grammar. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language
Processing (EMNLP) (pp. 308–316). Sydney, Australia.
Reitter, D., Keller, F., & Moore, J. D. (2006b). Computational modeling of structural
priming in dialogue. In Proceedings of the Human Language Technology Conference/North
American Chapter of the Association for Computational Linguistics (HLT/NAACL) (pp.
121–124). New York, NY.
Reitter, D., Keller, F., & Moore, J. D. (2011). A computational cognitive model of syntactic
priming. Cognitive Science, 35(4), 587–637. doi: 10.1111/j.1551-6709.2010.01165.
Reitter, D., Moore, J. D., & Keller, F. (2006c). Priming of syntactic rules in task-oriented
dialogue and spontaneous conversation. In Proceedings of the 28th Annual Conference of the
Cognitive Science Society (pp. 685–690). Vancouver, Canada.
Scheepers, C. (2003). Syntactic priming of relative clause attachments: Persistence of
structural configuration in sentence production. Cognition, 89, 179–205.
Schilling, H. E., Rayner, K., & Chumbley, J. I. (1998). Comparing naming, lexical decision,
and eye fixation times: Word frequency effects and individual differences. Memory &
Cognition, 26(6), 1270–1281.
Smith, N. J., & Levy, R. (2013). The effect of word predictability on reading time is
logarithmic. Cognition, 128(3), 302–319.
Snider, N., & Jaeger, T. F. (2009). Syntax in flux: Structural priming maintains probabilistic
representations. Poster at the 15th Annual Conference on Architectures and Mechanisms of
Language Processing, Barcelona.
Steedman, M. (2000). The syntactic process. Cambridge, MA: MIT Press.
Szmrecsanyi, B. (2005). Creatures of habit: A corpus-linguistic analysis of persistence in
spoken English. Corpus Linguistics and Linguistic Theory, 1(1), 113–149.
Szmrecsanyi, B. (2006). Morphosyntactic persistence in spoken English: A corpus study at
the intersection of variationist sociolinguistics, psycholinguistics, and discourse analysis. Berlin,
Germany: Mouton de Gruyter.
Tanenhaus, M. K., Spivey-Knowlton, M. J., Eberhard, K. M., & Sedivy, J. C. (1995).
Integration of visual and linguistic information in spoken language comprehension.
Science, 268(5217), 1632–1634.
Travis, C. E. (2007). Genre effects on subject expression in Spanish: Priming in narrative
and conversation. Language Variation and Change, 19(2), 101–135.
Van Zaanen, M. (2000). ABL alignment-based learning. In Proceedings of the 18th Conference
on Computational Linguistics (pp. 961–967). Association for Computational Linguistics.
Wang, Y., Reitter, D., & Yen, J. (2014). Linguistic adaptation in online conversation
threads: Analyzing alignment in online health communities. In Cognitive Modeling
and Computational Linguistics (CMCL) (pp. 55–62). Baltimore, MD: Association for
Computational Linguistics.
Wang, Y., Yen, J., & Reitter, D. (2015). Pragmatic alignment on social support type in health
forum conversations. In Proceedings of Cognitive Modeling and Computational Linguistics
(CMCL) (pp. 9–18). Denver, CO: Association for Computational Linguistics.
Xu, Y., & Reitter, D. (2015). An evaluation and comparison of linguistic alignment
measures. In Proceedings of Cognitive Modeling and Computational Linguistics (CMCL) (pp.
58–67). Denver, CO: Association for Computational Linguistics.
Xu, Y., & Reitter, D. (2016). Convergence of syntactic complexity in conversation. In
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. ACL
’16. Berlin: Association for Computational Linguistics.
Zeldes, A. (2016). The GUM corpus: Creating multilayer resources in the classroom.
Language Resources and Evaluation, 1–32.
12
ATTENTION ECONOMIES,
INFORMATION CROWDING,
AND LANGUAGE CHANGE
Thomas T. Hills,
James S. Adelman,
and Takao Noguchi
Abstract
Language is a communication system that adapts to the cognitive capacities of its users
and the social environment in which it is used. In this chapter we outline a theory, inspired
by the linguistic niche hypothesis, which proposes that language change is influenced by
information crowding. We provide a formal description of this theory and how information
markets, caused by a growth in the communication of ideas, should influence conceptual
complexity in language over time. Using American English and data from multiple language
corpora, including over 400 billion words, we test the proposed crowding hypothesis as well
as alternative theories of language change, including learner-centered accounts and semantic
bleaching. Our results show a consistent rise in concreteness in American English over the
last 200 years. This rise is not strictly due to changes in syntax, but occurs within word classes
(e.g. nouns), as well as within words of a given length. Moreover, the rise in concreteness
is not correlated with surface changes in language that would support a learner-centered
hypothesis: There is no concordant change in word length across corpora, nor is there a
preference for producing words with earlier age of acquisition. In addition, we also find
no evidence that this change is a function of semantic bleaching: In a comparison of two
different concreteness norms taken 45 years apart, we find no systematic change in word
concreteness. Finally, we test the crowding hypothesis directly by comparing approximately
1.5 million tweets across the 50 US states and find a correlation between population density
and tweet concreteness, which is not explained by state IQ. The results demonstrate both
how Big Data can be used to discriminate among alternative hypotheses as well as how
language may be changing over time in response to the rising tide of information.
Language has changed. Consider the following example from The Scarlet Letter,
written by Nathaniel Hawthorne and originally published in 1850 (Hawthorne,
2004):
If you’re looking for sympathy you’ll find it between shit and syphilis in the
dictionary.
Both examples contain words that readers are likely to recognize. Both allude
to fairly abstract concepts in relation to the word sympathy. However, they use
dramatically different words to do so. Yet, with but a few examples, there is always
the risk that one is cherry picking. Compare, as another example, the following
two quotations, both published in Nature, but more than 100 years apart:
When so little is really known about evolution, even in the sphere of organic
matter, where this grand principle was first prominently brought before our
notice, it may perhaps seem premature to pursue its action further back in
the history of the universe. (Blanshard, 1873)
Each sex is part of the environment of the other sex. This may lead
to perpetual coevolution between the sexes, when adaptation by one sex
reduces fitness of the other. (Rice, 1996)
The difference in language between these examples hints at something
systematic, with the more recent examples appearing more efficient—they are
perhaps more conceptually clear and perhaps more memorable, arguably without
any loss of content or fuzzying of the intended meaning. One could consider
this a kind of conceptual evolution in the language. Still, these are but a few
examples and the change is difficult to quantify. To characterize conceptual change
in language over hundreds of years would require reading thousands or millions
of books and attempting to precipitate out change in a psycholinguistic variable
capable of capturing change in conceptual efficiency. The computational and data
resources associated with “Big Data” make this possible. In recent years millions of
words of natural language have been digitized and made available to researchers, and
large-scale psycholinguistic norms have been published using online crowdsourcing
to rate thousands of words. In this chapter, we combine a wide collection of
such resources—books, newspapers, magazines, public speeches, tweets, and
psycholinguistic norms—to investigate the recent psycholinguistic evolution of
American English.
Language Change
It is well established that languages change over time (Labov, 1980; Lieberman,
Michel, Jackson, Tang, & Nowak, 2007; Michel et al., 2011), and this evolution
has been characterized at both long time scales, on the order of thousands of years
(Pagel et al., 2007; Lupyan & Dale, 2010), and short time scales, on the order
of generations (Labov, 1980; Michel et al., 2011). Word forms change, such as
the transition from Old English lufie to present day love. Words are replaced by
other words, such as replacement of Old English wulcen by the Scandinavian word
sky (Romaine et al., 1992). Words change their grammatical value, such as the
word going, which once strictly referred to physical movement from one place
to another and now refers to the future, as in “They are going to change their
policy.” And words often change in more than one way at the same time, as in the
Proto-Indo-European word kap, which may have originally meant “to seize” (as
in the Latin root cap, from “capture” and “captive”), but now hides in plain sight
in English as the word have (Deutscher, 2010).
The question of which words change and why has seen an upsurge of
interest outside traditional linguistics, fueled by the availability of large databases
of language. Using this approach, researchers have isolated a variety of factors
that influence language change. For example, the average half-life of a word
is between 2,000 and 4,000 years (i.e. the time at which half of all words
are replaced by a new non-cognate word), but the half-life of a word is
prolonged for high-frequency words (Pagel et al., 2007) and for words with
earlier age of acquisition (Monaghan, 2014). Languages are also susceptible to
losing grammatically obligatory word affixes, a feature referred to as morphological
complexity, which varies across languages in predictable ways as a function of the number of
speakers. English, compared to most other languages, has very little morphological
complexity. For example, English speakers add the suffix -ed to regular verbs to
signal the past tense. Navajo, on the other hand, adds obligatory morphemes to
indicate a wide variety of information about past action, for example, whether
it occurred among groups, was repeated, or changed over time (Young, Morgan,
& Midgette, 1992). In a comparison of more than 2,000 languages, languages
with more speakers have less morphological complexity than languages with
fewer speakers (Lupyan & Dale, 2010). All of the above examples indicate that
potential selective pressures such as frequency of shared language use can preserve
language features and that these can be characterized in much the same way
as genetic evolution (compare, e.g. Pagel et al., 2007; Duret & Mouchiroud,
2000).
The hypothesis that languages evolve in response to selective pressures provided
by the social environment in which they are spoken is called the linguistic niche
hypothesis (Lupyan & Dale, 2010). Lupyan and Dale (2010) interpret the above
changes in morphological complexity as evidence consistent with a selective force
Conceptual Crowding
Cognitive performance is intimately connected with our capacity to process,
understand, and recall new information. One of the principal features of cognitive
information is that it can experience crowding. An example is list-length effects,
where items on a longer list are remembered less well than items on shorter lists
(Ratcliff, Clark, & Shiffrin, 1990). Similar information crowding may also be
taking place at the level of our day-to-day experience, where the production of and
exposure to information have seen unprecedented growth (Varian & Lyman, 2000;
Eppler & Mengis, 2004). These conditions create highly competitive environments
for information, especially among languages with many speakers, where speakers
represent potential competitors in the language marketplace. This competition in
information creates information markets, which have inspired the idea of attention
economies (Hansen & Haas, 2001; Davenport & Beck, 2001). Attention represents
a limiting resource in information-rich environments. Attention economies can
drive evolutionary change in communication in the same way that noise drives
change in other communication systems. Acoustic crowding associated with bird
song is known to influence communication strategies in bird species, in ways
that make songs easier to discriminate (Luther, 2009; Grant & Grant, 2010).
Similar to a global cocktail party problem—where communicators compete for
the attention of listeners—conceptual information may also compete for attention,
both at the source—when it is spoken—as well as later, when it is being
retrieved.
To formalize the above argument, consider that crowding mimics the inclusion
of noise in information theoretic accounts of signal transduction (Shannon &
Weaver, 1949). This is often discussed in terms of the signal-to-noise ratio, S/N. From
the perspective of an information producer, other producers’ messages are noise,
and as these increase our own messages are transmitted less efficiently.
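One way to make this gloss concrete (a standard information-theoretic relation given here for illustration, not a derivation from the chapter) is the Shannon–Hartley capacity of a noisy channel, treating competing producers' messages as additional noise:

```latex
\[
  C = B \log_2\!\left(1 + \frac{S}{N}\right),
  \qquad
  C' = B \log_2\!\left(1 + \frac{S}{N_0 + N_c}\right) < C,
\]
```

where B is bandwidth, S is signal power, N_0 is background noise, and N_c is the noise contributed by competing producers. As N_c grows, the capacity available to any one producer's message shrinks, which is the sense in which crowding mimics noise.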
How should cognitive systems adapt to this situation? Simon provided an
answer for receivers: “a wealth of information creates a poverty of attention and
a need to allocate that attention efficiently” (Simon, 1971, pp. 40–41). One way
to accomplish this is to order information sources in proportion to their value
to the receiver. Here, people should learn to tune out (i.e. inhibit) unwanted
messages and pay more attention to those that represent real conceptual value.
This is the basis of algorithmic approaches to filtering junk e-mail (e.g. Sahami,
[Figure appears here, panel (b); y-axis: optimal message length; curves for λ = 1.1, 1.5, 2, and 5.]
when he writes “The professor’s mishandling of his fruit influenced his colleague’s
work.” This is true even for a reader with enough knowledge to disambiguate
the references, because recomposing a message from its parts requires cognitive
processing and more so when those parts are ambiguous (Murphy & Smith, 1982;
Rosch, Mervis, Gray, Johnson, & Boyesbraem, 1976).
Messages that require more cognitive processing are more vulnerable to
incomplete transmission. First, each unit of processing work is vulnerable to
intrinsic failure; the greater the amount of work, the higher the probability of
failure. All else being equal, this will result in more complex messages being lost.
Second, as discussed above, there is competition for the receiver’s resources. New
messages may arrive during the processing of older messages, and interrupt their
transmission.
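To put the first of these points in the form of a toy calculation (our own illustration; the chapter's formal model, summarized in the optimal message length figure above, is not reproduced here): if each unit of processing fails independently with probability ε, then a message requiring L units of work is fully transmitted with probability (1 − ε)^L, and the expected amount of successfully transmitted content is maximized at a finite length:

```latex
\[
  \Pr(\text{complete transmission}) = (1-\varepsilon)^{L},
  \qquad
  \frac{d}{dL}\Big[\,L\,(1-\varepsilon)^{L}\Big] = 0
  \;\Rightarrow\;
  L^{*} = \frac{-1}{\ln(1-\varepsilon)} \approx \frac{1}{\varepsilon}
  \quad\text{for small } \varepsilon.
\]
```

Longer, more complex messages are therefore disproportionately likely to be lost, which is the pressure toward conceptually simpler messages described in this section.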
Messages may also be more complex than their payloads require. Take for
example the following sentence, which won the Bad Writing Contest in 1997:
Regardless of how profound the sentence may be, many have found it to be
most valuable as a lesson in maximizing words per concept (e.g. Pinker, 2014).
In the Dr. Estes example above, using less specific terms such as “fruit” when
“banana” is meant increases the conceptual complexity of the message without
changing the payload, and without a gain in surface complexity. Writers can
eliminate such inefficiencies, but they may not do so unless pressured because
the cost to language users of being more specific is non-zero. G. K. Zipf referred
to a similar balance between reductive and expansive forces in language:
Crowding is but one force that language users must accommodate.
The formal basis for the costs of surface and conceptual complexity are
analogous, but they are subject to different trade-offs. At its simplest, surface
complexity corresponds to physical length (e.g. word length), suggesting a cost
for longer words. However, longer words are also more phonologically isolated,
such that they are effectively error-correcting codes: If a long word is disrupted
slightly, it is still unlikely to be confused with another word, but if a short word
suffers distortion, the outcome can be more easily confused with another word.
Conceptual length is not necessarily associated with physical length, because it
relies on higher-order cognitive processing capacities involved in message retrieval
and production. In the next section we expand further on this by relating
conceptual length to historical change in language.
The important point here is that each of these theories acknowledges the powerful
role that concreteness plays in facilitating linguistic processing.
The wealth of evidence on the value of concreteness in language presents
a problem. Why should words ever become abstract? The assumption in
the mathematical example above provides what we feel is the most likely
explanation—abstract words, by the nature of their generality, provide information
about broader classes of phenomena. The word fruit, being more abstract than items
subordinate to that category, e.g. apple, can efficiently communicate information
about categories of truth that would otherwise take many individual instances of
more concrete examples—“fruit can prevent scurvy when eaten on long ocean
voyages.” Or to take another example, the word essentially is one of the most
abstract words in concreteness norms (see Brysbaert et al., 2013); however, the
sentence “My office hours are essentially on Wednesday at noon” provides a degree
of probabilistic hedging that “My office hours are on Wednesday at noon” does
not.
Besides the theory of crowding proposed above, we know of no prior theories
that speak directly to evolutionary changes in concreteness at the level of word
distributions. Nonetheless, some evidence of cultural change may speak indirectly
to this issue. The most prominent is associated with an explanation for the Flynn
effect. The Flynn effect is the observation that intelligence scores, associated with
both crystallized and fluid intelligence, have risen steadily from approximately the
middle of the last century (Flynn, 2012). Flynn noted that nutrition has failed to
explain the observed effects and, in the absence of evidence for other biological
theories, more cognitive theories have risen to the foreground (Flynn, 2012). In
particular, numerous theories speak to an increase in computational, symbolic, and
potentially more abstract processing abilities (Greenfield, 1998; Fox & Mitchum,
2013). One implication of knowledge-based accounts is that language may change
its composition to reflect our capacity for more abstract processing, and thus show
an increase in abstract words.
However, the causal arrow may point in the other direction. An increase
in concrete language may enhance our individual capacity to process complex
information. By this interpretation, language should have become more concrete,
and where it is concrete people should tend to learn and process more about it
(Sadoski, 2001).
We computed a weighted concreteness value for each year y as C_y = Σ_i p_i c_i,
where c_i is the concreteness for word i as found in the Brysbaert et al. concreteness
norms and p_i is the proportion of word i in year y. The proportion is computed
only over the n words in the concreteness norms, or the appropriate comparison
set (as described in the caption for Figure 12.2). We also computed concreteness
on a per-document basis, as opposed to per word, with similar results.
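A minimal sketch of this computation in Python, assuming hypothetical inputs: `norms`, a dictionary mapping words to Brysbaert-style concreteness ratings, and `counts`, a dictionary mapping each year to that year's word token counts (e.g. derived from the Google Ngrams). This is an illustration of the weighted measure above, not the authors' code.

```python
def yearly_concreteness(counts, norms):
    """Return {year: weighted mean concreteness over words in the norms}."""
    results = {}
    for year, word_counts in counts.items():
        # Restrict to the n words that appear in the concreteness norms.
        in_norms = {w: c for w, c in word_counts.items() if w in norms}
        total = sum(in_norms.values())
        if total == 0:
            continue
        # C_y = sum_i p_i * c_i, with p_i the proportion of word i in year y.
        results[year] = sum((c / total) * norms[w] for w, c in in_norms.items())
    return results
```

The same routine, pointed at the Kuperman et al. (2012) norms instead of the concreteness norms, would give the weighted age of acquisition values discussed later in the chapter.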
As shown in Figure 12.2, we found a steady rise in concreteness across multiple
corpora, including books (Google Ngrams), newspapers and magazines (the
Corpus of Historical American English), and presidential speeches. The Google
Ngrams also provide a corpus based on English fiction, which shows the same
pattern, with a slightly more dramatic rise in concreteness from approximately
2.35 in 1800 to 2.57 in 2000 (data not shown).
We also found that changes in concreteness occurred within word classes and
were therefore not strictly due to changes in syntax (e.g. by a reduction in the
use of articles). Figure 12.3 shows that, over the same time period, concreteness
increases within nouns, verbs, and prepositions. Together these findings show that
the change is not only systematic but systemic, permeating many different aspects
of the language.
This observation is consistent with a number of hypotheses, including crowding
and the influence of second language learners. In what follows, we examine a
number of these hypotheses in an effort to discriminate among a wide set of
possibilities.
Semantic Bleaching
It has been proposed that word evolution follows a fairly predictable pattern over
time, from specific to abstract. This takes a variety of forms including semantic
bleaching, desemanticization, and grammaticalization (Hopper & Traugott, 2003;
Aitchison & Lewis, 2003). An example is the word very, which derives from the
French word vrai. In French, vrai did and still does mean “true.” In Middle English,
the word very also meant “true,” as in a very knight, meaning a real knight. However,
over time the word became a means to emphasize the strength of a relationship,
in a probabilistic way. For example, nowadays we say that something is very true,
meaning that there are degrees of truth and this particular thing may have more of
it than others.
Although semantic bleaching is proposed to be unidirectional, this is not
without debate (Hollmann, 2009). Moreover, it is certainly the case that not all
diachronic linguistic patterns are associated with loss of concreteness. Metonymy
FIGURE 12.3 Changes in concreteness within word classes. Figure taken from Hills &
Adelman (2015).
FIGURE 12.4 Comparison of the Paivio and Brysbaert norms. The Paivio concreteness
norms (Paivio et al., 1968) consist of 925 nouns, collected in the laboratory and using
a 7-point scale. The Brysbaert norms were collected on a 5-point scale. Both norms
are normalized to be between zero and one. (a) Shows the change in concreteness
over the 45-year span between collection. (b) Shows the histogram of concreteness
differences per word. Figure taken from Hills & Adelman (2015).
comparison with the Brysbaert norms and provides the only basis we know of for a
quantitative test of semantic bleaching. Normalizing the ratings for both shows that
there are no systematic changes in word concreteness over the approximately 900
words used for comparison (Figure 12.4). The median change is centered around
zero and a paired t-test finds no significant difference in concreteness (t (876) =
−0.79, p = 0.45). This suggests that a systematic loss of concreteness is unlikely to
explain the apparent rise in concreteness we see in the data.
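A minimal sketch of this normalization and paired comparison, assuming hypothetical dictionaries `paivio` and `brysbaert` that map shared words to ratings on their original 7- and 5-point scales; the helper names are ours, not the authors'.

```python
from scipy import stats

def normalize(norms):
    """Rescale a {word: rating} dictionary to the 0-1 range."""
    lo, hi = min(norms.values()), max(norms.values())
    return {w: (r - lo) / (hi - lo) for w, r in norms.items()}

def compare_norms(paivio, brysbaert):
    """Paired t-test over the words shared by the two (normalized) norm sets."""
    p, b = normalize(paivio), normalize(brysbaert)
    shared = sorted(set(p) & set(b))
    old = [p[w] for w in shared]
    new = [b[w] for w in shared]
    t, pval = stats.ttest_rel(new, old)
    changes = [n - o for n, o in zip(new, old)]  # per-word change in concreteness
    return t, pval, changes
```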
These results also provide a large-scale approach to better understanding the
unidirectionality of semantic bleaching, which to our knowledge has not been
possible in the past. As a preliminary step in that direction, in Table 12.1 we provide
the ten words that have increased or decreased the most in concreteness over the
last 45 years. Importantly, the words from each side of the distribution offer an
effective demonstration that semantic bleaching may be mirrored by an equally
powerful semantic enrichment. A dreamer may have become more concrete—but
the devil, although he may have been in the details in the past, resides more in
abstraction today.
Word Length
More concrete words tend to be phonologically and orthographically shorter.
Among the words in the Brysbaert norms (Brysbaert et al., 2013), the correlation
between concreteness and word length is β = −0.40, p < 0.001. If the selective
forces driving concreteness are more directly driven by preferences for shorter
words, then word length should change in tandem with concreteness. However,
Figure 12.5 shows that the general trends found in concreteness are not preserved
across corpora in relation to word length. In general, word length does not change
much across the three corpora until the 1900s, and then the direction of change
appears to depend on the corpora. Words in presidential speeches get shorter, while
words in books tend to get longer. Words in newspapers and magazines, on the
other hand, first show a trend towards reduced length but then increase in length,
but only up to approximately the point they were in 1800.
[FIGURE 12.5 Word length over time in the Google Ngrams (GN), COHA, and inaugural addresses.]
One can also look at concreteness within words of a given length and ask whether the
rise in concreteness is independent of word length. Figure 12.6 shows that this is
largely the case. Although words of one, two, or three characters in length show
roughly no change in concreteness over time, words of four or more characters
consistently show a rise in concreteness over time (ps < 0.001).
Additional evidence of the independence between concreteness and word
length is found in Figure 12.7, which shows that within a range of concreteness
words tend to grow in length, especially following the 1900s. This is also mirrored
by an increase in word length across the full corpus. This would appear to be
counter to a potential selective force imposed by language learners. In sum, changes
in concreteness do not appear to be driven by changes in word length—on the
contrary, concreteness appears to rise despite an overall trend towards longer words.
[FIGURE 12.6 Changes in concreteness over time for words of one to nine characters in length and for the full corpus.]
Age of Acquisition
Age of acquisition provides an additional, and possibly more direct, measure of
evidence for a learner-centered hypothesis. In a comparison of the 23,858 words
that are shared between the Brysbaert concreteness norms and the Kuperman age
of acquisition norms (Kuperman et al., 2012), age of acquisition is correlated with
concreteness, β = −0.35, p < 0.001. Moreover, previous work has established
that words with earlier age of acquisition are more resilient to age-related decline
and are retrieved more quickly in lexical decision tasks than words acquired later
in life (Juhasz, 2005; Hodgson & Ellis, 1998). If language change in American
English is driven by the influence of language learners—who may only show partial
learning—or the influence of an aging population—who produce earlier acquired
words preferentially—then the language as a whole may move towards words of
earlier age of acquisition.
FIGURE 12.7 Changes in word length within narrow ranges of concreteness. Data are
taken from the Google Ngrams.
To evaluate changes in preference for words of earlier acquisition over time, we
used the Kuperman age of acquisition norms (Kuperman et al., 2012) to compute
a weighted age of acquisition value for each corpus, as was done for concreteness.
Figure 12.8 shows that age of acquisition tends to follow a similar pattern as that
found for word length, but not concreteness. Changes in age of acquisition and
word length are also highly correlated across the three corpora (Google Ngram:
β = 0.96, p < 0.001; COHA: β = 0.66, p < 0.001; inaugural addresses: β = 0.95,
p < 0.001). On the other hand, age of acquisition is not well correlated with
changes in concreteness (e.g. Google Ngram: β = 0.33, p < 0.001).
[FIGURE 12.8 Age of acquisition over time in the Google Ngrams (GN), COHA, and inaugural addresses.]
learners, who we would predict would also show a preference for shorter words
and words with earlier age of acquisition. Furthermore, the results also suggest
that the change in concreteness is not being driven by a rising tide of more
children or lower IQ individuals entering the language market. Again, if this
were the case, we would expect language to also show systematic changes in
surface complexity. In the next section we examine more directly the relationship
between crowding and concrete language by looking at another source of Big Data:
Twitter.
FIGURE 12.9 Concreteness in language increases with the population density across US
states. The data are taken from approximately 1.5 million tweets, with 30,000 tweets
per state.
We collected 66,348,615 tweets, made within 50 states of the USA, using Twitter's
streaming API. The collected tweets exclude retweets (i.e. repetition of tweets
previously made). The number of collected tweets varies between the states from
39,397 (Wyoming) to 8,009,114 (California); thus, to achieve similar measurement
accuracy across states, we randomly sampled 30,000 tweets from each state. Then,
after removing hashtags, non-ASCII characters, and punctuation marks, we
calculated the concreteness for each tweet and then averaged these values for each
state.
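A hedged sketch of this per-tweet, per-state pipeline; the cleaning rules below are a rough stand-in for the preprocessing described above, and `norms` is again a hypothetical word-to-concreteness dictionary.

```python
import re
import statistics

def tweet_concreteness(text, norms):
    """Mean concreteness of a tweet's words that appear in the norms (None if no coverage)."""
    text = re.sub(r"#\w+", " ", text)                  # drop hashtags
    text = text.encode("ascii", "ignore").decode()     # drop non-ASCII characters
    words = re.findall(r"[a-z']+", text.lower())       # keep word-like tokens only
    rated = [norms[w] for w in words if w in norms]
    return statistics.mean(rated) if rated else None

def state_concreteness(tweets_by_state, norms):
    """Average the per-tweet concreteness scores within each state."""
    out = {}
    for state, tweets in tweets_by_state.items():
        scores = [s for s in (tweet_concreteness(t, norms) for t in tweets)
                  if s is not None]
        if scores:
            out[state] = statistics.mean(scores)
    return out
```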
Figure 12.9 shows the relationship between log population density and tweet
concreteness for states (β = 0.36, p < 0.01). There is a clear pattern of rising
concreteness with population density. There are many potential confounds here, as
styles of writing (e.g. syntax and tweet length) may change across states. However,
as we note above, concreteness is but one of many ways that conceptual efficiency
may change and thus we see it as an indicator, which may in turn be driven by other
aspects of language use. One factor that is unlikely to be influenced by crowding,
however, is IQ, which may in turn be associated with concreteness, as we note
in the introduction. In our data, IQ is inversely correlated with concreteness
(β = −0.003, p < 0.001), but this may not be particularly surprising as the
McDaniel (2006) measure is based partly on reading comprehension. However,
the relationship between concreteness and population density is preserved after
partialing out the variance accounted for by changes in IQ (McDaniel, 2006),
with population density still making up approximately 12 percent of the variance
( p < 0.01).
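For readers who want to reproduce this kind of control analysis, a sketch using statsmodels is below; the file name and per-state columns (`concreteness`, `log_pop_density`, `iq`) are hypothetical, and this is not the authors' analysis script.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical table: one row per state.
df = pd.read_csv("state_concreteness.csv")

# Regress tweet concreteness on log population density while controlling for IQ.
full = smf.ols("concreteness ~ log_pop_density + iq", data=df).fit()
print(full.summary())

# Unique (semi-partial) variance attributable to population density:
reduced = smf.ols("concreteness ~ iq", data=df).fit()
print("R^2 uniquely explained by population density:",
      full.rsquared - reduced.rsquared)
```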
Conclusions
Culture is a marketplace for competing ideas. This leads to the prediction that
any broad medium of communication should evolve over time to have certain
properties that facilitate communication. Certain aspects of these signals should
be enhanced as competition (i.e. crowding) increases. In particular, aspects of
information that increase speed of processing and memorability should be favored
as information markets become more crowded. As we have shown, concreteness
facilitates these cognitive demands and has risen systematically in American English
for at least the last 200 years. We have also shown that these changes are not
consistent with a learner-centered hypothesis, because we would expect additional
changes in language associated with a reduction in surface complexity, such as
reduced word length and preference for words with earlier age of acquisition,
which we do not observe. The lack of evidence for these changes also indicates
that the change in concreteness is not due to a general simplifying of the language,
which one might predict if language were being influenced by, for example, a
younger age of entry into the language marketplace or a general dumbing down of
the language.
The work we present here is preliminary in many respects. We have taken
a bird’s eye view of language and focused on psycholinguistic change, but these
necessarily require some assumptions on our part and do not focus on other
factors in language change. It is very likely that there are other changes in writing
and speech conventions that one could document. To see how these align with
our present investigation, one would also need to investigate the causes of these
changes. If writing was once meant to provide additional information about the
intelligence of the author, this may have been lost in modern language—but the
natural question is why? When there are but few authors, the authors may compete
along different dimensions than when there are many, and conventions may change
accordingly.
The present work demonstrates the capacity for data analytic approaches to
language change that can discriminate among alternative hypotheses and even
combine data from multiple sources to better inform hypotheses. Naturally, we
also hope this work leads to future questions and research on the influence of
concreteness and language evolution. In particular, we find it intriguing to ask
whether the rise in concrete language may be associated with the rise in IQ
linked to the Flynn effect (Flynn, 2012). Compared with the writing of
several hundred years ago, the examples we provide in the introduction suggest
that today’s writing is more succinct, often to the point of being terse. It is difficult
to deny the comparative ease with which modern language conveys its message.
Indeed, we suspect that more memorable forms of language (such as aphorisms) share a
similar property of making their point clearly and efficiently.
The research we present also poses questions about the influence of competition
in language. Is language produced in a competitive environment more memorable
in general, or is the increased memorability for some passages just a consequence of
a larger degree of variance among the language produced? If the former, this may
suggest that something about competitive language environments facilitates the
production of more memorable messages, and that this is something that humans
are potentially aware of and capable of modulating. Such a capacity would explain
the enhanced memorability of Facebook status updates relative to other forms
of language (Mickes et al., 2013). If such competition is driving language, and
language change maintains its current course, we may all be speaking Twitterese
in the next 100 to 500 years (compare the y-axis on Figure 12.2 and Figure 12.9).
Finally, this work may also provide applications in relation to producing more
memorable information in learning environments, for example, via a mechanism
for concretizing text or competitive writing. Although we recognize that these
questions are speculative, we hope that this work provides some inspiration for
their further investigation.
References
Adorni, R., & Proverbio, A. M. (2012). The neural manifestation of the word concreteness
effect: an electrical neuroimaging study. Neuropsychologia, 50, 880–891.
Aitchison, J., & Lewis, D. M. (2003). Polysemy and bleaching. In B. Nerlich, Z. Todd, V.
Hermann, & D. D. Clarke (Eds.), Polysemy: Flexible patterns of meaning in mind and language
(pp. 253–265). Berlin: Walter de Gruyter.
Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with python. Sebastopol,
CA: O’Reilly.
Blanshard, C. (1873). Evolution as applied to the chemical elements. Nature, 9, 6–8.
Brysbaert, M., Warriner, A. B., & Kuperman, V. (2013). Concreteness ratings for 40
thousand generally known English word lemmas. Behavior Research Methods, 46, 1–8.
Butler, J. (1997). Further reflections on conversations of our time. Diacritics, 27, 13–15.
Christiansen, M. H., & Chater, N. (2008). Language as shaped by the brain. Behavioral and
Brain Sciences, 31, 489–509.
Davenport, T. H., & Beck, J. C. (2001). The attention economy: Understanding the new currency
of business. Brighton, MA: Harvard Business Press.
Davies, M. (2009). The 385+ million word corpus of contemporary American English
(1990–2008+): Design, architecture, and linguistic insights. International Journal of Corpus
Linguistics, 14, 159–190.
De Groot, A., & Keijzer, R. (2000). What is hard to learn is easy to forget: The roles of
word concreteness, cognate status, and word frequency in foreign-language vocabulary
learning and forgetting. Language Learning, 50, 1–56.
Deutscher, G. (2010). The unfolding of language. London: Random House.
Duret, L., & Mouchiroud, D. (2000). Determinants of substitution rates in mammalian
genes: Expression pattern affects selection intensity but not mutation rate. Molecular
Biology and Evolution, 17, 68–74.
Eppler, M. J., & Mengis, J. (2004). The concept of information overload: A review of
literature from organization science, accounting, marketing, MIS, and related disciplines.
The Information Society, 20, 325–344.
Fliessbach, K., Weis, S., Klaver, P., Elger, C., & Weber, B. (2006). The effect of word
concreteness on recognition memory. Neuroimage, 32, 1413–1421.
Flynn, J. R. (2012). Are we getting smarter? Rising IQ in the twenty-first century. Cambridge,
UK: Cambridge University Press.
Fox, M. C., & Mitchum, A. L. (2013). A knowledge-based theory of rising scores on
‘culture-free’ tests. Journal of Experimental Psychology: General, 142, 979–1000.
Grant, B. R., & Grant, P. R. (2010). Songs of Darwin’s finches diverge when a new species
enters the community. Proceedings of the National Academy of Sciences, 107, 20156–20163.
Greenfield, P. M. (1998). The cultural evolution of IQ. In U. Neisser (Ed.), The rising
curve: Long-term gains in IQ and related measures (pp. 81–123). Washington, DC: American
Psychological Association.
Hansen, J., & Wänke, M. (2010). Truth from language and truth from fit: The impact of
linguistic concreteness and level of construal on subjective truth. Personality and Social
Psychology Bulletin, 36, 1576–1588.
Hansen, M. T., & Haas, M. R. (2001). Competing for attention in knowledge markets:
Electronic document dissemination in a management consulting company. Administrative
Science Quarterly, 46, 1–28.
Hawthorne, N. (2004). The scarlet letter. New York: Simon and Schuster.
Hills, T. T., & Adelman, J. S. (2015). Recent evolution of learnability in American English
from 1800 to 2000. Cognition, 143, 87–92.
Hodgson, C., & Ellis, A. W. (1998). Last in, first to go: Age of acquisition and naming in
the elderly. Brain and Language, 64, 146–163.
Hollmann, W. B. (2009). Semantic change. In J. Culpeper, F. Katamba, P. Kerswill, &
T. McEnery (Eds.), English language: Description, variation and context (pp. 525–537).
Basingstoke: Palgrave.
Hopper, P. J., & Traugott, E. C. (2003). Grammaticalization. Cambridge, UK: Cambridge
University Press.
Huang, H.-W., Lee, C.-L., & Federmeier, K. D. (2010). Imagine that! ERPs provide
evidence for distinct hemispheric contributions to the processing of concrete and abstract
concepts. Neuroimage, 49, 1116–1123.
Abstract
Decision by sampling theory (DbS) offers a unique example of a cognitive science theory in
which the role of Big Data goes beyond providing high-powered tests of hypotheses: Within
DbS, Big Data can actually form the very basis for generating those hypotheses in the first
place. DbS is a theory of decision-making that assumes people evaluate decision variables,
such as payoffs, probabilities, and delays, by comparing them to relevant past observations
and experiences. The theory’s core reliance on past experiences as the starting point in the
decision-making process sets it apart from other decision-making theories and allows it to
form a priori predictions about the patterns of preferences that people will exhibit. To do so,
however, the theory requires good proxies for the relevant distributions of comparison values
(i.e. past observations and experiences) that people are likely to hold in memory. In this
chapter, we summarize the theory of DbS and describe several examples of Big Data being
successfully used as rich proxies for memory distributions that form the foundations of the
theory. We show how, using these Big Data sets, researchers were able to independently predict
(i.e. without fitting choice data) the shapes of several important psychoeconomic functions
that describe standard preference patterns in risky and intertemporal decision-making. These
novel uses of Big Data reveal that well-known patterns of human decision-making, such as
loss aversion and hyperbolic discounting (among others), originate from regularities in the
world.
Introduction
The study of human decision-making has made great strides over the past several
decades: The normatively appealing, but descriptively lacking, axiomatic theories
of expected utility maximization (e.g. von Neumann & Morgenstern, 1947) were
successfully challenged, and gave way to more behaviorally inspired approaches,
such as prospect theory (Kahneman & Tversky, 1979; Tversky & Kahneman,
1992), regret theory (Loomes & Sugden, 1982), and other variants (for reviews,
see Schoemaker, 1982; Starmer, 2000). Another major step forward has been the
development of dynamic theories, which attempt to capture, not just the output
of decision-making, but also the process of deliberation that ultimately leads to
observed choices (e.g. Busemeyer & Townsend, 1993; Usher & McClelland, 2004).
And there is no sign of slowing down: Even the past couple of years have witnessed
the birth of new decision-making theories (e.g. Bhatia, 2013; Dai & Busemeyer,
2014).
Yet, for all of the progress that has been made, most decision-making theories
remain fundamentally tethered to the utility-based approach—that is, they are built
on the core assumption (i.e. have as their starting point) that there are underlying
utility or “value” functions1 that govern our preferences, and thus our choices.
In this way, they have not really escaped the shadow of expected utility theory
(EUT; von Neumann & Morgenstern, 1947), which they sought to challenge
and replace. By comparison, there have been far fewer attempts to conceptualize
the decision-making process as being “free” of utility (or value) functions. Rare
examples include reason-based choice (Shafir, Simonson, & Tversky, 1993), query
theory (Appelt, Hardisty, & Weber, 2011; Johnson, Häubl, & Keinan, 2007; Weber
et al., 2007), and a few simple choice heuristics (e.g. Brandstätter, Gigerenzer, &
Hertwig, 2006; Thorngate, 1980).
More generally, nearly all utility-based and utility-absent theories have one
fundamental thing in common: They conceptualize the decision-making process as
an interaction between a pre-existing set of preferences (or decision rules) and the
attribute values of the choice options under consideration. The decision maker’s
past experiences and observations, however, are either totally absent from the
decision-making process (e.g. Brandstätter et al., 2006), or their role is limited
to a very short time-window (e.g. Gonzalez, Lerch, & Lebiere, 2003; Plonsky,
Teodorescu, & Erev, 2015), or they merely shape his or her beliefs about the
likelihoods of outcomes occurring (e.g. Fudenberg & Levine, 1998). By contrast,
the decision-maker’s extended past typically plays no role in shaping how the
attribute values of choice alternatives are evaluated.2
Thus, a striking feature of most models of decision-making is that each decision
occurs largely in a vacuum, so that only the most recent (if any) environmental cues
are brought to bear on the decision-making process. Indeed, it is typically assumed
that people evaluate the attributes of choice alternatives without reference to their
experience of such attributes in the past. Yet, even intuitively, this assumption
seems implausible. Consider, for example, the decision to purchase a car: Doing
so requires weighing and trading-off relevant attributes, such as fuel consumption
and safety features. However, we are not able to directly evaluate how different
absolute values of these attributes will impact our well-being (e.g. we are not
able to quantify the absolute boost in utility associated with a particular reduction
The process described above can be repeated for other values (e.g. other
monetary loss amounts) that a person might be trying to evaluate. In fact, DbS
allows us to map out the relationship between each possible outcome or attribute
magnitude and its subjective value (or weight) in the eyes of the decision-maker (as
we will illustrate, in the next section). The result is a percentile function relating
each outcome/attribute to its corresponding percentile rank, and therefore to its
subjective value. This percentile function can be used to model and predict people’s
preferences and choices in the same way that utility (or value) functions are used
to predict decisions (Olivola & Sagara, 2009; Stewart, 2009; Stewart, Chater, &
Brown, 2006; Walasek & Stewart, 2015). The key difference between DbS and
utility-based theories is that in DbS the “value functions” emerge locally from the
interactions between the memory-sampling-plus-binary-comparison process and
the distributions of relevant magnitudes that one has experienced or observed over
time. For example, according to DbS, a person’s value function for financial gains
is essentially the function relating each monetary payoff she might encounter to
its percentile rank among all the sums of money that she has earned (or observed
others earning) in the past.
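As a concrete illustration of the mechanism just described, here is a minimal sketch (our own, not part of DbS as published) in which the subjective value of a target outcome is the proportion of a random memory sample that it equals or beats; the sample size and the tie-handling rule are illustrative assumptions.

```python
import random

def dbs_subjective_value(target, memory, sample_size=20, rng=random):
    """Percentile-rank value of `target` within a sample drawn from `memory`."""
    sample = [rng.choice(memory) for _ in range(sample_size)]
    favorable = sum(1 for x in sample if x <= target)  # binary comparisons, ties counted as favorable
    return favorable / sample_size                     # tally of favorable comparisons

def dbs_value_function(targets, memory, **kwargs):
    """Map each candidate outcome to its predicted subjective value."""
    return {t: dbs_subjective_value(t, memory, **kwargs) for t in targets}
```

With ties counted as favorable, this sample proportion estimates the cumulative distribution function of the comparison values, which is why DbS predictions can also be read directly off frequency distributions such as those shown in Figures 13.1–13.4.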
Critically, this implies that the “utility” a person ascribes to an outcome (e.g.
winning a particular sum of money) will be determined by her accumulated
experiences, as these govern the distribution of comparison magnitudes (e.g.
past monetary gains) that she can draw from. Therefore, if we can know (or
at least approximate) the distribution of relevant outcomes that a person has
observed in her lifetime then, using DbS, we can predict the typical shape of
her corresponding value function(s).6 The same logic applies for other attribute
values, such as the probabilities or timings of outcomes: DbS allows us to predict
the shapes of a person’s probability weighting and time-discounting functions
from the distributions of probabilities and delays she has previously observed and
experienced. This contrasts sharply with utility-based approaches, which have no
way of independently predicting a person’s preferences in advance (i.e. without
first observing some of their past choices). Of course, being able to test this
important distinguishing feature of DbS requires access to rich and representative
data concerning the occurrence of these outcomes or attributes. Fortunately, as
we’ll show, the growing availability of Big Data has made it possible for researchers
to estimate a variety of relevant distributions.
Like any theory, DbS rests on a few key assumptions—namely, that
decision-makers sample from their memory, engage in binary comparisons, and
finally tally up the proportion of favorable (versus unfavorable) comparisons. The
first assumption (memory sampling) is supported by evidence that humans and
other animals automatically encode and recall frequencies (Sedlmeier & Betsch,
2002). The second assumption (binary comparison) is supported by the simplicity
of the mechanism and by extensive evidence that people are much more adept at
providing relative, rather than absolute, judgments (Stewart et al., 2005). Finally,
typically non-linear. One of these, prospect theory (Kahneman & Tversky, 1979;
Tversky & Kahneman, 1992), has been particularly successful at accounting for
much of the data observed in the laboratory and the field (e.g. Camerer, 2004).
According to prospect theory (and the copious evidence that supports it), people
treat monetary gains and losses separately and they exhibit a diminishing sensitivity
to both dimensions, such that initial gains (losses) have a larger impact on their
(dis)utility than subsequent gains (losses). Consequently, people tend to be risk
averse for monetary gains and risk seeking for monetary losses (Kahneman &
Tversky, 1979; Tversky & Kahneman, 1992). Another defining feature of prospect
theory is that the value function for losses is assumed to be steeper than the
one for gains, so that people derive more disutility from losing a given amount
of money than utility from winning that same amount. In other words, losing
an amount of money (e.g. –$50) typically feels worse than winning the same
amount (e.g. +$50) feels good. These various assumptions are beautifully and
succinctly represented in prospect theory’s S-shaped and kinked value function
(see Figure 13.1(a)). Although it is descriptively accurate, the prospect theory
value function fails to explain why or how people come to perceive monetary
gains and losses as they do. Fortunately, DbS offers a theoretical framework for not
only predicting, but also explaining these patterns of behaviors. As Stewart et al.
(2006) showed, the diminishing sensitivity that people exhibit for monetary gains
and losses, as well as the tendency to react more strongly to losses than equivalent
gains, emerge from the relevant distributions of values in the world, as predicted
by DbS.
Monetary Gains
DbS assumes that people evaluate the utility (subjective positive value) of a
monetary gain by comparing it to other monetary gains they have experienced
or observed. These comparison values could be past sums of money that they
received or won (e.g. previous earnings or lotteries wins), past sums of money
that they observed others winning (e.g. learning about a colleague’s salary), or
other sums of money that are currently on offer (e.g. when given a choice
between several different payoffs). Together, these various comparison values form
a distribution in memory that a decision-maker samples in order to evaluate the
target payoff. The values that people ascribe to monetary gains therefore depend on
the distribution of earnings and winnings that they typically observe. To estimate
the shape of this distribution, Stewart et al. (2006) analyzed a random sample of
hundreds of thousands of bank deposits (i.e. money that people added to their
own accounts) made by customers of a leading UK bank. Unsurprisingly, the
sums of money that people receive (and deposit into their accounts) follow a
power-law like function, such that small deposits are more frequent than large
ones (Figure 13.1(c)). Consequently, the percentile function for monetary gains
is concave (Figure 13.1(b)), implying diminishing sensitivity and risk-averse
preferences for monetary gains.
FIGURE 13.1 The value function for monetary gains and losses. (a) Shows the standard S-shaped value function from prospect theory, with a
steeper curve for monetary losses. (b) Shows the value functions (black dots) for gains (top-right quadrant) and losses (bottom-left quadrant)
predicted by DbS. These predictions are derived from data on the occurrence frequencies of bank deposits (c) and bank withdrawals (d),
reported in Stewart et al. (2006). The grey dots in the top-right quadrant of (b) represent a 180◦ rotation of the DbS predictions for
monetary losses (bottom-left quadrant), and show that a steeper curve for monetary losses (compared to monetary gains) emerges naturally
from the frequency distributions in (c) and (d).
Monetary Losses
The same logic applies to monetary losses. DbS assumes that people evaluate the
disutility (subjective negative value) of a monetary loss by comparing it to other
monetary losses they have previously experienced or observed. These comparison
values could be past payments they have made (e.g. previous purchases or debts
paid), past sums of money that they observed others losing (e.g. learning about
the sum that someone lost to a friendly bet), or other potential losses under
consideration (e.g. several different ways to pay a bill). As with monetary gains,
these various (negative) comparison values form a distribution in memory that a
decision-maker can sample in order to evaluate the seriousness of a given loss.
The disutilities that people ascribe to monetary losses therefore depend on the
distribution of costs and payments that they typically observe. To estimate the
shape of this second distribution, Stewart et al. (2006) analyzed a random sample
of more than one million bank debits (i.e. money that people withdrew from
their own accounts) made by the same population of UK bank customers. As
with gains, the sizes of payments that people make (and withdraw from their
accounts) follow a power-law like function, such that small payments are more
frequent than large ones (Figure 13.1(d)). Consequently, the percentile function for
monetary losses is concave when plotted against disutility and convex when plotted
against utility (Figure 13.1(b)), implying diminishing sensitivity and risk-seeking
preferences for monetary losses. As another proxy for the monetary losses that
people typically experience and observe, Stewart and his colleagues (Stewart &
Simpson, 2008; Stewart et al., 2006) looked at the distribution of prices for various
goods. Across a wide variety of goods, prices followed similar distributions, such
that cheaper products were more frequent than their more expensive counterparts.
Thus, sampling prices to generate the comparison set would also lead to convex
utility (concave disutility) evaluations.
As Figures 13.1(c) and (d) show, the distribution of magnitudes is more concentrated at small values for
losses than it is for gains. As a result, the percentile function is steeper for losses
than it is for gains (Figure 13.1(b)), which explains the origin of loss aversion.
FIGURE 13.2 The value function for lives saved and lost. (a) Shows the standard S-shaped value function from prospect theory, with a steeper
curve for lives lost. (b) Shows the value functions (black dots) for lives saved (top-right quadrant) and lives lost (bottom-left quadrant)
predicted by DbS. These predictions are derived from data on the frequency of media reporting of events involving human lives saved
(c) and lives lost (d), reported in Olivola and Sagara (2009). The grey dots in the top-right quadrant of (b) represent a 180° rotation of the
DbS predictions for lives lost (bottom-left quadrant), and show that a steeper curve for lives lost (compared to lives saved) emerges naturally
from the frequency distributions in (c) and (d).
one. Olivola and Sagara showed, in line with the predictions of DbS, that people’s
diminishing sensitivity to human lives emerges from the distributions of human
fatalities (or lives saved) that they learn about from the news, reading books, talking
to their friends and colleagues, etc.
Human Fatalities
To proxy the distribution of death tolls that people are likely to observe and thus
hold in memory, Olivola and Sagara (2009) looked at three types of data. First,
they examined a large archival dataset that tracked the occurrence of epidemics,
natural disasters, and industrial disasters, and recorded each event’s associated death
toll. This provided a proxy for the frequency of death tolls, as they actually occur.
Next, they iteratively queried Google News Archives—a massive online repository
of published news stories—for news articles about human fatalities. Specifically,
they counted the number of articles reporting a given number of fatalities, using
search terms that specified the number of fatalities and contained keywords related
to deaths (e.g. “3 died,” “4 killed,” etc.). The resulting counts of “hits” provided a
proxy for the relative frequency with which various death tolls are reported in the
media (Figure 13.2(d)). Finally, Olivola and Sagara asked a sample of respondents
to recall eight past deadly events (the first eight that came to mind) and to estimate
each one’s death toll. These recollections provided a proxy for the distribution
of death tolls that people hold in memory and can access. It turns out that
all three distributions follow a similar power-law-like pattern (Olivola & Sagara,
2009): The larger the death toll (actual, reported, or recalled), the less frequent it
was. Consequently, the percentile functions for all three distributions are concave,
implying a diminishing sensitivity to human fatalities and a preference for higher
variance intervention options that offer a chance of preventing a greater number
of fatalities but also risk failing.
Olivola, Rheinberger, and Hammitt (2015) also looked at the distributions of
death tolls produced by low-magnitude, frequent events; namely, auto-accidents
and avalanches. Specifically, they looked at three sources of data: (i) government
statistics on the occurrence of fatalities caused by auto-accidents (or avalanches);
(ii) the frequencies of news stories reporting deaths caused by auto-accidents (or
avalanches); and (iii) the responses of participants who were asked to estimate the
percentages of auto-accidents (or avalanches) that cause a given death toll. Again,
these proxy distributions all followed a power-law-like pattern: The larger the
death toll, the smaller its frequency (actual, reported, or estimated). Consequently,
the percentile function for smaller-scale, frequent events is also concave, implying a
diminishing sensitivity and a preference for risky rescue strategies when it comes to
preventing fatalities in the context of auto-accidents and avalanches (Olivola et al.,
2015).
Lives Saved
Compared to human fatalities, finding a proxy for the distribution of lives saved is
considerably trickier since there are few good statistics available and the media tends
to focus on the loss of lives associated with deadly events. However, Olivola and
Sagara (2009) were able to obtain such a proxy using the Google News Archives
database by modifying their search terms to focus on lives saved (e.g. “3 saved,” “4
rescued,” etc.) rather than lives lost. Doing so allowed them to capture news stories
that reported on the numbers of lives saved during (potentially) deadly events.
The resulting distribution also follows a power-law-like function (Figure 13.2(c)):
There are more news stories reporting small numbers of lives saved than there are
reporting large numbers of lives saved. The resulting percentile function is thus
concave (Figure 13.2(b)), implying a diminishing sensitivity (i.e. diminishing joy)
and an aversion to risky rescue strategies when it comes to saving human lives.
FIGURE 13.3 The probability weighting function. (a) Shows the standard inverse
S-shaped probability weighting function from prospect theory. (b) Shows the
probability weighting function (black dots + connecting grey lines) predicted by DbS.
These predictions are derived from data on the usage frequency of likelihood terms
(c), reported in Stewart et al. (2006).
1999; Prelec, 1998; Tversky & Kahneman, 1992; Wu & Gonzalez, 1996;
although see Stott, 2006). The non-linear weighting of probabilities incorporated
in prospect theory accurately predicts a number of tendencies in decisions under
risk (Camerer & Ho, 1994; Gonzalez & Wu, 1999; Kahneman & Tversky, 1979;
Tversky & Kahneman, 1992; Wu & Gonzalez, 1996), yet fails to explain the
origin of this tendency. DbS explains this pattern in terms of the distribution of
probability-related terms that people are typically exposed to in the real world. We
say “probability-related” terms because most people are less frequently exposed to
probabilities and other numerical expressions of likelihood (e.g. “0.2 probability,”
“50/50 odds,” “100% chance”) than they are to verbal descriptors that denote
likelihoods (e.g. “unlikely,” “possible,” “certain”). Therefore, one needs to consider
the relative frequency of verbal terms that convey different likelihoods when trying
to proxy the distribution of probability magnitudes that people can draw from to
evaluate a given probability of occurrence.
To find a proxy for this distribution, Stewart et al. (2006) searched the British
National Corpus (BNC) for verbal terms that people typically use to communicate
likelihoods (e.g. “small chance,” “maybe,” “likely,” etc.; see Karelitz & Budescu,
2004). The BNC provides a large corpus of spoken and written English words and
phrases, along with their frequencies of usage. Stewart et al. were therefore able
to estimate the occurrence frequency of each likelihood term. Next, following
the approach used in previous studies (Beyth-Marom, 1982; Budescu & Wallsten,
1985; Clarke, Ruffin, Hill, & Beamen, 1992; Reagan, Mosteller, & Youtz,
1989; for a review, see Budescu & Wallsten, 1995), they asked a sample of
40 participants to translate these verbal likelihood terms into their equivalent
numerical probabilities. This was done to identify the probability magnitudes
that people typically associate with each verbal term. For example, the median
participant in their study judged the word “likely” to indicate a 70 percent
chance of occurrence. The translation of likelihood terms into their equivalent
probabilities allowed Stewart et al. to estimate the frequencies with which people
(implicitly) refer, and are thus exposed, to various probability magnitudes in their
day-to-day lives (Figure 13.3(c)).
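A sketch of how such a frequency-weighted percentile function can be assembled from (median probability, corpus frequency) pairs; the four example entries are purely illustrative and are not the BNC counts or the elicited medians.

```python
def weighted_percentiles(terms):
    """terms: iterable of (median_probability, corpus_frequency) pairs.
    Returns (probability, cumulative weighted percentile) points, sorted by probability."""
    total = sum(freq for _, freq in terms)
    points, cumulative = [], 0.0
    for prob, freq in sorted(terms):
        cumulative += freq / total
        points.append((prob, cumulative))
    return points

# Illustrative values only:
print(weighted_percentiles([(0.05, 12000), (0.50, 3000), (0.70, 8000), (0.95, 15000)]))
```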
When Stewart et al. (2006) plotted the cumulative usage frequencies (i.e.
percentile ranks within the BNC) of probability magnitudes that people associate
with verbal likelihood terms they obtained an inverse S-shaped curve that
mimics the basic features of the prospect theory probability weighting function
(Figure 13.3(b)). This resemblance, between the predictions of DbS and prospect
theory, was more than just qualitative: Stewart et al. fit their data to a commonly
used single-parameter probability weighting function and found that the resulting
estimate (β = 0.59) was close to previous ones obtained from risk preference
data (β = 0.56, 0.61, 0.71, reported by Camerer & Ho, 1994; Tversky &
Kahneman, 1992; Wu & Gonzalez, 1996, respectively). In other words, likelihood
terms that denote very small or very large probabilities are more commonly used
than those that denote mid-range probabilities. As a result, people are more
frequently exposed, and therefore more sensitive, to extreme probabilities than
they are to most medium-sized probabilities. Thus, DbS can explain the classic
pattern of probability weighting that has been observed in the literature without
resorting to a priori assumptions about the shape of an underlying probability
weighting function. Specifically, it suggests that human perceptions of probability
magnitudes, and the peculiar inverse S-shape that they seem to follow, are
governed by the distribution of probability terms that people are exposed to
in their daily lives. People are more sensitive to variations in extremely low or
high probabilities (including departures from certainty and impossibility) because
those are the kinds of likelihoods they most frequently refer to and hear (or read)
others refer to. By contrast, the general hesitance (at least within the Western,
English-speaking world) to communicate mid-range probabilities means there is
far less exposure to these values, leading to a diminished sensitivity outside of the
extremes.
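For reference, one commonly used single-parameter weighting function of this kind is the Tversky and Kahneman (1992) form; we assume the fit reported above used this or a close variant (e.g. Prelec's 1998 one-parameter function):

\[
  w(p) \;=\; \frac{p^{\beta}}{\bigl(p^{\beta} + (1-p)^{\beta}\bigr)^{1/\beta}}, \qquad 0 < \beta \le 1,
\]

which over-weights small probabilities and under-weights mid-range ones for β < 1, producing the inverse S-shape; values of β around 0.6 yield curves comparable to those estimated from choice data.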
FIGURE 13.4 The time discounting function. (a) Shows a standard hyperbolic
time-discounting function. (b) Shows the time discounting function (black dots)
predicted by DbS. These predictions are derived from data on the frequency with
which delay lengths are referenced on the Internet (c), reported in Stewart et al.
(2006).
When Stewart et al. (2006) plotted the resulting cumulative percentile function for delay lengths, they found that it is
close to hyperbolic in shape (Figure 13.4(b)). Thus, DbS is able to predict, and also
explain, how people perceive time delays.
One concern is that the direction of the relationship between the regularities
we observe and people’s preferences might actually be reversed. For example, it
might be the case that people’s preferences for monetary gains and losses shape
their spending and savings decisions, rather than the other way around. At least
two pieces of evidence attenuate this concern. First, this reverse causality is harder
to argue in the case of death tolls from natural disasters (i.e. it seems implausible that
people’s preferences would determine the magnitudes of natural disasters). As such,
reverse causality of this sort fails to explain why distributions of death tolls predict
various features of the value functions for human lives (e.g. Study 1A in Olivola &
Sagara, 2009). Second, and perhaps more to the point, several studies have shown
that experimentally manipulating the relevant distribution(s) people are exposed
to alters their preferences in the direction(s) predicted by DbS (e.g. Walasek &
Stewart, 2015; Study 2 in Olivola & Sagara, 2009). In other words, there also exists
causal evidence in support of DbS (and the hypothesis that real-world distributions
can impact preferences), although we did not focus on these studies in this chapter
since they did not utilize Big Data.
A second concern is that most of the Big Data distributions that have been
used (so far) to test the predictions of DbS share an important common feature:
An inverse relationship between event frequency and event magnitude. Therefore,
one could speculate that DbS is benefitting from the (merely coincidental) fact that
taking the cumulative of these kinds of distributions yields curvatures that would
predict a diminishing relationship between objective outcomes and subjective
evaluations. Such a coincidence could allow the theory to correctly predict
that people will be risk averse for gains, risk seeking for losses, and exhibit
hyperbolic time preferences. Again, several pieces of evidence attenuate this second
concern. First, this coincidence fails to explain why DbS successfully predicts loss
aversion (Olivola & Sagara, 2009; Stewart et al., 2006). Second, not all of the
distributions we examined exhibited the inverse frequency–magnitude relationship.
In particular, the usage frequency of likelihood terms was not inversely related to
their magnitude; indeed, had that been the case, DbS would not predict the inverse
S-shaped probability weighting function. Third, as Olivola and Sagara (2009)
demonstrated (in their Study 3), the predictions of DbS go beyond qualitative
statements about people’s broad preference tendencies. Specifically, they compared
the distributions of death tolls across several different countries and found that DbS
successfully predicted variations in the extent to which the people in each country
were risk seeking when it came to choosing between different outcomes involving
human fatalities. In doing so, Olivola and Sagara clearly showed that DbS’s capacity
to predict patterns of preferences (and even, in this case, cross-country differences
in these patterns) goes well beyond the mere fact that event frequencies are
often inversely related to their magnitudes. In sum, the ability to make nuanced
quantitative predictions with DbS undermines concerns that its predictive power
mainly derives from a general property of all real-world distributions.
Conclusion
The rapid growth and accessibility of Big Data, over the last decade, seems to hold
enormous promise for the study of human behavior (Moat, Olivola, Chater, &
Preis, 2016; Moat, Preis, Olivola, Liu, & Chater, 2014). Indeed, a steady stream of
studies has demonstrated creative uses of Big Data sources, such as for predicting
human behavior on a large scale (e.g. Choi & Varian, 2012; Ginsberg et al., 2009;
Goel, Hofman, Lahaie, Pennock, & Watts, 2010), studying network dynamics
(e.g. Calcagno et al., 2012; Szell, Lambiotte, & Thurner, 2010), verifying the
robustness of empirical laws (e.g. Klimek, Bayer, & Thurner, 2011; Thurner,
Szell, & Sinatra, 2012), or providing new macro-level indices (e.g. Noguchi,
Stewart, Olivola, Moat, & Preis 2014; Preis, Moat, Stanley, & Bishop, 2012;
Saiz & Simonsohn, 2013). However, the contribution of Big Data to cognitive
science has been noticeably smaller than in other areas. One likely reason is that
most existing Big Data sets within the social sciences (with the exception of brain
imaging data) tend to focus on human behaviors and thus lack variables related to
mental processes, making it difficult to form insights about cognition. Decision
by sampling theory (DbS) therefore offers an exceptional example of a theoretical
framework in which Big Data don’t merely provide high-powered tests of existing
hypotheses; they form the basis for developing the hypotheses in the first place.
Acknowledgments
The authors would like to thank Mike Jones for his guidance and support as we
prepared our manuscript, and two anonymous reviewers for useful suggestions that
helped us further improve it. We also thank Neil Stewart and Rich Lewis for
providing us with the data for Figures 13.1, 13.3, and 13.4. Finally, we thank
Aislinn Bohren, Alex Imas, and Stephanie Wang for helpful feedback on the
discussion of economic theories that incorporate experience and memory.
Notes
1 Most dynamic models of decision-making, such as Decision Field Theory
(Busemeyer & Townsend, 1993) and the leaky competing accumulator model
(Usher & McClelland, 2004), also assume the existence of one or more
(typically non-linear) functions that transform objective attribute values into
subjective evaluations. As such, even these dynamic theories could be
considered “utility”-based approaches, to some extent, since they require
utility-like transformations to account for preferences. However, these models
differ from the more classic types of utility theories in that they do not assign
utilities to choice alternatives as a whole (only to their attributes).
2 Although there have been attempts to explicitly model the role of experience
(e.g. Becker & Murphy, 1988) and memory (e.g. Mullainathan, 2002)
in shaping the decision-making process, these theories still start from the
assumption that there are stable, inherent utility functions that interact with
experiences and/or memory to shape preferences. By contrast, the theory
we discuss here (decision by sampling or “DbS”), assumes that the value
functions themselves are inherently malleable and shaped by past experiences,
via memory.
3 This chapter mainly focuses on the role of long-term memory sampling (i.e. the
stored accumulation of past experiences over a person’s lifetime). However, the
presence of two memory sources raises an interesting question concerning the
relative contribution of each one to the final comparison sample. Prior evidence
suggests the predictions of DbS can be surprisingly robust to a wide range of
assumptions about the proportion of sampling that derives from long-term vs.
short-term memory sources (Stewart & Simpson, 2008). At the same time,
it is clear that short-term memory needs to play some role in order for DbS
to explain certain context effects (e.g. Study 2 in Olivola & Sagara, 2009;
Walasek & Stewart, 2015), while long-term memory sampling is necessary for
the theory to also explain systematic preference tendencies (e.g. Studies 1 and
3 in Olivola & Sagara, 2009; Stewart, Chater, & Brown, 2006). The relative
role of these two memory systems remains an open question and (we believe) a
fruitful topic for future research.
4 Counting the proportion of comparison values that are smaller than or equal
to a target value has a nice property: doing so over the entire range of
(potential) target values yields the cumulative distribution function. In other
words, under this approach (for treating ties), DbS predicts that the evaluation
function relating objective (target) values to their subjective percepts (i.e.
utilities or weights) is equivalent to the cumulative distribution function (CDF)
of relevant comparison values. Thus, one can estimate the predictions of
DbS by integrating over the frequency distribution of previously observed
comparison values. Alternatively, one could treat ties differently from other
outcomes. Olivola and Sagara (2009) compared different ways of treating ties
and found that these did not considerably influence the broad predictions of
DbS (at least when it comes to how people evaluate the potential loss of human
lives).
5 Alternatively, she might routinely experience large losses from gambling or
day trading, but nonetheless sample her memory in a narrow fashion when
evaluating the parking ticket; for example, by comparing it only to previous
parking penalties. In this latter case, DbS predicts that the ticket could still seem
quite large, and thus be upsetting to her, if it exceeds most of her previous fines.
6 Of course, the precise shape of her value function for a given decision will
also depend on attentional factors and her most recent (and/or most salient)
experiences, as these jointly shape the sampling process.
References
Appelt, K. C., Hardisty, D. J., & Weber, E. U. (2011). Asymmetric discounting of gains and
losses: A query theory account. Journal of Risk and Uncertainty, 43(2), 107–126.
Ariely, D. (2001). Seeing sets: Representation by statistical properties. Psychological Science,
12(2), 157–162.
Becker, G. S., & Murphy, K. M. (1988). A theory of rational addiction. Journal of Political
Economy, 96(4), 675–700.
Beyth-Marom, R. (1982). How probable is probable: A numerical translation of verbal
probability expressions. Journal of Forecasting, 1, 257–269.
Bhatia, S. (2013). Associations and the accumulation of preference. Psychological Review,
120(3), 522–543.
Brandstätter, E., Gigerenzer, G., & Hertwig, R. (2006). The priority heuristic: Making
choices without trade-offs. Psychological Review, 113(2), 409–432.
Budescu, D. V., & Wallsten, T. S. (1985). Consistency in interpretation of probabilistic
phrases. Organizational Behavior and Human Decision Processes, 36, 391–405.
Budescu, D. V., & Wallsten, T. S. (1995). Processing linguistic probabilities: General
principles and empirical evidence. In J. Busemeyer, D. L. Medin, & R. Hastie (Eds.),
Decision making from a cognitive perspective (pp. 275–318). San Diego, CA: Academic Press.
Busemeyer, J. R., & Townsend, J. T. (1993). Decision field theory: A dynamic-cognitive
approach to decision-making in an uncertain environment. Psychological Review, 100(3),
432–459.
Calcagno, V., Demoinet, E., Gollner, K., Guidi, L., Ruths, D., & de Mazancourt, C.
(2012). Flows of research manuscripts among scientific journals reveal hidden submission
patterns. Science, 338(6110), 1065–1069.
Camerer, C. F. (2004). Prospect theory in the wild: Evidence from the field. In C. F.
Camerer, G. Loewenstein, & M. Rabin (Eds.), Advances in behavioral economics (pp.
148–161). Princeton, NJ: Princeton University Press.
Camerer, C. F., & Ho, T. H. (1994). Violations of the betweenness axiom and non-linearity
in probability judgment. Journal of Risk and Uncertainty, 8, 167–196.
Choi, H., & Varian, H. (2012). Predicting the present with Google Trends. Economic Record,
88, 2–9.
Clarke, V. A., Ruffin, C. L., Hill, D. J., & Beamen, A. L. (1992). Ratings of orally
presented verbal expressions of probability by a heterogeneous sample. Journal of Applied
Social Psychology, 22, 638–656.
Dai, J., & Busemeyer, J. R. (2014). A probabilistic, dynamic, and attribute-wise model of
intertemporal choice. Journal of Experimental Psychology: General, 143(4), 1489–1514.
Frederick, S., Loewenstein, G., & O’Donoghue, T. (2002). Time discounting and time
preference: A critical review. Journal of Economic Literature, 40(2), 351–401.
Fudenberg, D., & Levine, D. K. (1998). The theory of learning in games. Cambridge, MA:
MIT Press.
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L.
(2009). Detecting influenza epidemics using search engine query data. Nature, 457,
1012–1014.
Goel, S., Hofman, J. M., Lahaie, S., Pennock, D. M., & Watts, D. J. (2010). Predicting
consumer behavior with web search. Proceedings of the National Academy of Sciences, 107,
17486–17490.
Gonzalez, C., Lerch, J. F., & Lebiere, C. (2003). Instance-based learning in dynamic
decision-making. Cognitive Science, 27, 591–635.
Gonzalez, R., & Wu, G. (1999). On the shape of the probability weighting function.
Cognitive Psychology, 38(1), 129–166.
Guria, J., Leung, J., Jones-Lee, M., & Loomes, G. (2005). The willingness to accept value
of statistical life relative to the willingness to pay value: Evidence and policy implications.
Environmental and Resource Economics, 32(1), 113–127.
Hanemann, W. M. (1994). Valuing the environment through contingent valuation. The
Journal of Economic Perspectives, 8(4), 19–43.
Hertwig, R. (2012). The psychology and rationality of decisions from experience. Synthese,
187(1), 269–292.
Johnson, E. J., Häubl, G., & Keinan, A. (2007). Aspects of endowment: A query theory
of value construction. Journal of Experimental Psychology: Learning, Memory, and Cognition,
33(3), 461–474.
Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47, 263–292.
Kahneman, D., Ritov, I., Jacowitz, K. E., & Grant, P. (1993). Stated willingness to pay for
public goods: A psychological perspective. Psychological Science, 4(5), 310–315.
Karelitz, T. M., & Budescu, D. V. (2004). You say “probable” and I say “likely”: Improving
interpersonal communication with verbal probability phrases. Journal of Experimental
Psychology: Applied, 10, 25–41.
Klimek, P., Bayer, W., & Thurner, S. (2011). The blogosphere as an excitable social
medium: Richter’s and Omori’s Law in media coverage. Physica A: Statistical Mechanics
and its Applications, 390(21), 3870–3875.
Kornienko, T. (2013). Nature’s measuring tape: A cognitive basis for adaptive utility (Working
paper). Edinburgh, Scotland: University of Edinburgh.
Loomes, G., & Sugden, R. (1982). Regret theory: An alternative theory of rational choice
under uncertainty. Economic Journal, 92, 805–824.
McCrink, K., & Wynn, K. (2007). Ratio abstraction by 6-month-old infants. Psychological
Science, 18(8), 740–745.
McDaniels, T. L. (1992). Reference points, loss aversion, and contingent values for auto
safety. Journal of Risk and Uncertainty, 5(2), 187–200.
Moat, H. S., Olivola, C. Y., Chater, N., & Preis, T. (2016). Searching choices: Quantifying
decision-making processes using search engine data. Topics in Cognitive Science, 8,
685–696.
Moat, H. S., Preis, T., Olivola, C. Y., Liu, C., & Chater, N. (2014). Using big data to
predict collective behavior in the real world. Behavioral and Brain Sciences, 37, 92–93.
Mullainathan, S. (2002). A memory-based model of bounded rationality. Quarterly Journal of
Economics, 117(3), 735–774.
Noguchi, T., Stewart, N., Olivola, C. Y., Moat, H. S., & Preis, T. (2014). Characterizing
the time-perspective of nations with search engine query data. PLoS One, 9, e95209.
Olivola, C. Y. (2015). The cognitive psychology of sensitivity to human fatalities:
Implications for life-saving policies. Policy Insights from the Behavioral and Brain Sciences,
2, 141–146.
Olivola, C. Y., & Sagara, N. (2009). Distributions of observed death tolls govern sensitivity
to human fatalities. Proceedings of the National Academy of Sciences, 106, 22151–22156.
Olivola, C. Y., & Shafir, E. (2013). The martyrdom effect: When pain and effort increase
prosocial contributions. Journal of Behavioral Decision Making, 26, 91–105.
Olivola, C. Y., & Wang, S. W. (in press). Patience auctions: The impact of time versus
money bidding on elicited discount rates. Experimental Economics.
Olivola, C. Y., Rheinberger, C. M., & Hammitt, J. K. (2015). Sensitivity to fatalities from
frequent small-scale deadly events: A Decision-by-Sampling account. Unpublished manuscript,
Carnegie Mellon University.
Plonsky, O., Teodorescu, K., & Erev, I. (2015). Reliance on small samples, the wavy recency
effect, and similarity-based learning. Psychological Review, 122(4), 621–647.
Porter, E. (2011). The price of everything. New York, NY: Penguin.
Preis, T., Moat, H. S., Stanley, H. E., & Bishop, S. R. (2012). Quantifying the advantage of
looking forward. Scientific Reports, 2, 350.
Prelec, D. (1998). The probability weighting function. Econometrica, 66(3), 497–527.
Read, D. (2004). Intertemporal choice. In D. J. Koehler & N. Harvey (Eds.), Blackwell
handbook of judgment and decision-making (pp. 424–443). Oxford, UK: Blackwell.
Read, D., Olivola, C. Y., & Hardisty, D. J. (in press). The value of nothing: Asymmetric
attention to opportunity costs drives intertemporal decision making. Management Science.
Reagan, R. T., Mosteller, F., & Youtz, C. (1989). Quantitative meaning of verbal probability
expressions. Journal of Applied Psychology, 74, 433–442.
Ritov, I., & Baron, J. (1994). Judgements of compensation for misfortune: The role of
expectation. European Journal of Social Psychology, 24(5), 525–539.
Saiz, A., & Simonsohn, U. (2013). Proxying for unobservable variables with Internet
document frequency. Journal of the European Economic Association, 11, 137–165.
Schoemaker, P. J. (1982). The expected utility model: Its variants, purposes, evidence and
limitations. Journal of Economic Literature, 20(2), 529–563.
Sedlmeier, P. E., & Betsch, T. E. (2002). ETC: Frequency processing and cognition. Oxford,
UK: Oxford University Press.
Shafir, E., Simonson, I., & Tversky, A. (1993). Reason-based choice. Cognition, 49(1),
11–36.
Slovic, P. (2007). “If I look at the mass I will never act”: Psychic numbing and genocide.
Judgment and Decision Making, 2, 79–95.
Soman, D., Ainslie, G., Frederick, S., Li, X., Lynch, J., Moreau, P., . . . , Wertenbroch,
K. (2005). The psychology of intertemporal discounting: Why are distant events valued
differently from proximal ones? Marketing Letters, 16(3–4), 347–360.
Starmer, C. (2000). Developments in non-expected utility theory: The hunt for a
descriptive theory of choice under risk. Journal of Economic Literature, 38(2), 332–382.
Stewart, N. (2009). Decision by sampling: The role of the decision environment in risky
choice. The Quarterly Journal of Experimental Psychology, 62, 1041–1062.
Stewart, N., & Simpson, K. (2008). A decision-by-sampling account of decision under risk.
In N. Chater & M. Oaksford (Eds.), The probabilistic mind: Prospects for Bayesian cognitive
science (pp. 261–276). Oxford, UK: Oxford University Press.
Stewart, N., Brown, G. D., & Chater, N. (2005). Absolute identification by relative
judgment. Psychological Review, 112, 881–911.
Stewart, N., Chater, N., & Brown, G. D. A. (2006). Decision by sampling. Cognitive
Psychology, 53, 1–26.
Stewart, N., Chater, N., Stott, H. P., & Reimers, S. (2003). Prospect relativity: How choice
options influence decision under risk. Journal of Experimental Psychology: General, 132,
23–46.
Stott, H. P. (2006). Cumulative prospect theory’s functional menagerie. Journal of Risk and
Uncertainty, 32(2), 101–130.
Szell, M., Lambiotte, R., & Thurner, S. (2010). Multirelational organization of large-scale
social networks in an online world. Proceedings of the National Academy of Sciences, 107(31),
13636–13641.
Thorngate, W. (1980). Efficient decision heuristics. Behavioral Science, 25(3), 219–225.
Thurner, S., Szell, M., & Sinatra, R. (2012). Emergence of good conduct, scaling and Zipf
laws in human behavioral sequences in an online world. PLoS One, 7, e29796.
Tversky, A., & Kahneman, D. (1981). The framing of decisions and the psychology of
choice. Science, 211, 453–458.
Tversky, A., & Kahneman, D. (1992). Advances in prospect theory: Cumulative
representation of uncertainty. Journal of Risk and Uncertainty, 5, 297–323.
Usher, M., & McClelland, J. L. (2004). Loss aversion and inhibition in dynamical models
of multialternative choice. Psychological Review, 111(3), 757–769.
Viscusi, W. K., & Aldy, J. E. (2003). The value of a statistical life: A critical review of market
estimates throughout the world. Journal of Risk and Uncertainty, 27(1), 5–76.
von Neumann, J., & Morgenstern, O. (1947). Theory of games and economic behavior.
Princeton, NJ: Princeton University Press.
Walasek, L., & Stewart, N. (2015). How to make loss aversion disappear and reverse: Tests of
the decision by sampling origin of loss aversion. Journal of Experimental Psychology: General,
144, 7–11.
Walker, M. E., Morera, O. F., Vining, J., & Orland, B. (1999). Disparate WTA–WTP
disparities: The influence of human versus natural causes. Journal of Behavioral Decision
Making, 12(3), 219–232.
Weber, E. U., Johnson, E. J., Milch, K. F., Chang, H., Brodscholl, J. C., & Goldstein, D.
G. (2007). Asymmetric discounting in intertemporal choice: A query-theory account.
Psychological Science, 18(6), 516–523.
Weyl, E. G. (in press). Price theory. Journal of Economic Literature.
Wu, G., & Gonzalez, R. (1996). Curvature of the probability weighting function.
Management Science, 42, 1676–1690.
14
CRUNCHING BIG DATA WITH
FINGERTIPS
How Typists Tune Their Performance Toward
the Statistics of Natural Language
Abstract
People have the extraordinary ability to control the order of their actions. How people
accomplish sequencing and become skilled at it with practice is a long-standing problem
(Lashley, 1951). Big Data techniques can shed new light on these questions. We used the
online crowd-sourcing service, Amazon Mechanical Turk, to measure typing performance
from hundreds of typists who naturally varied in skill level. The large dataset allowed us to
test competing predictions about the acquisition of serial-ordering ability that we derived
from computational models of learning and memory. These models suggest that the time
to execute actions in sequences will correlate with the statistical structure of actions in the
sequence, and that the pattern of correlation changes in particular ways with practice. We
used a second Big Data technique, n-gram analysis of large corpora of English text, to
estimate the statistical structure of the letter sequences that our typists performed. We show
that the timing of keystrokes correlates with sequential structure (letter, bigram, and trigram
frequencies) in English texts, and examine how this sensitivity changes as a function of
expertise. The findings hold new insights for theories of serial-ordering processes, and how
serial-ordering abilities emerge with practice.
Introduction
The infinite monkey theorem says that a room full of monkeys typing letters on
a keyboard for an infinite amount of time will eventually produce any text, including
the works of Shakespeare or this chapter (Borel, 1913). Natural texts occupy a small
corner of this space of possible letter strings, and they have more predictable structure
than the many other random texts produced by the typing monkeys. For example,
letters and bigrams that occur in English appear
with particular frequencies, some high and some low. The present work examines
whether typists, who routinely produce letters by manipulating a keyboard,
become sensitive to these statistical aspects of the texts they type. Answering
this question addresses debate about how people learn to produce serially ordered
actions (Lashley, 1951).
letter pairs in English. Generating sequences of letters using these bigram statistics
will produce sequences with the same bigram frequency structure as the English
language, but will rarely produce words, let alone meaningful sentences. So an
associative chain theory of letter sequencing would take a slightly shorter amount
of infinite time to produce the works of Shakespeare, compared to random
monkeys.
Although associative chains fail to explain serial-ordering behavior in complex
domains like language, the more general goal of explaining serial ordering in terms
of basic learning and memory processes has not been abandoned. For example,
Wickelgren (1969) suggested that associative chains could produce sequences
of greater complexity by allowing contextual information to conditionalize
triggering of upcoming actions. And, as we will soon discuss, contemporary
neural network approaches (Elman, 1990), which are associative in nature, have
been successfully applied as accounts of serial-ordering behavior in a number of
tasks.
Lashley’s critique inspired further development of associative theories, and
opened the door for new cognitive approaches to the serial-order problem, which
we broadly refer to as hierarchical control theories.
These theories assume multiple, nested levels of processing, in which elements from higher levels have a one-to-many mapping with elements
in the lower levels. Levels are encapsulated. The labor of information processing
is divided between levels, and one level may not know the details of how another
level accomplishes its goals. Because of the division of labor, different levels should
respond to different kinds of feedback. Finally, although levels are divided, they
must be connected or coordinated to accomplish task goals.
The terms outer and inner loop are used to refer to the hierarchically nested
processes controlling typing. The outer loop relies on language production and
comprehension to turn ideas into sentences and words, passing the result one word
at a time to the inner loop. The inner loop receives words as plans from the outer
loop, and translates each word into constituent letters and keystrokes. The outer
loop does not know how the inner loop produces keystrokes. For example, typists
are poor at remembering where keys are located on the keyboard (Liu, Crump, &
Logan, 2010; Snyder, Ashitaka, Shimada, Ulrich, & Logan, 2014), and their typing
speed slows when attending to the details of their actions (Logan & Crump, 2009).
The outer and inner loop rely on different sources of feedback, with the outer
loop using visual feedback from the computer screen to detect errors, and the
inner loop using tactile and kinesthetic feedback to guide normal typing, and
to independently monitor and detect errors (Crump & Logan, 2010b; Logan &
Crump, 2010). Finally, word-level representations connect the loops, with words
causing parallel activation of constituent letters within the inner loop’s response
scheduling system (Crump & Logan, 2010a).
pressed. This clever rule is specialized for the task of response-scheduling, but
has emergent qualities because the model types accurately without assuming a
monitoring process.
The model explains expert typing skill, but says nothing about learning and
skill development. It remains unclear how associations between letters and specific
motor movements develop with practice, or how typing speed and accuracy for
individual letters change with practice. Addressing these issues is the primary goal
of the present work.
FIGURE 14.1 Simulations showing that mean letter typing time to type a text is shorter
for simulated typists whose typing times for individual letters negatively correlate with
letter frequency in the typed text. (Two panels, Crafted and Random; x-axis: letter
frequency and simulated IKSI correlation; y-axis: simulated typing time.)
If typists want to finish typing a text quickly, they would benefit from knowing the
letter frequencies in the texts they type. All things being equal, and assuming that
typists are typing non-random texts, a typist whose letter typing times are negatively
correlated with letter frequency norms for the text (i.e. faster times for more frequent
than for less frequent letters) will finish typing the same text faster than a typist whose
letter typing times do not negatively correlate with letter frequency.
To illustrate, we conducted the simulations displayed in Figure 14.1. First,
we created a vector of 26 units populated with the same number (e.g. 150 ms)
representing mean keystroke times for letters in the alphabet. This scheme assumes
that all letters are typed at the same speed. Clearly, overall speed will be increased
by decreasing the value of any of those numbers. However, to show that sensitivity
to letter frequency alone increases overall speed, we crafted new vectors that could
be negatively or positively correlated with letter frequency counts consistent with
English texts (taken from Jones & Mewhort, 2004). We created new vectors using
the following rules: (1) randomly pick a unit and subtract X, then randomly pick a
different unit, and add the same value of X; (2) compute the correlation between
the new vector and the vector of letter frequencies; (3) keep the change to the
vector only if the correlation moves toward the desired (positive or negative) value;
(4) repeat steps 1–3. All of the crafted vectors summed to the same value, but were
differentially correlated with letter frequencies across the range of positive and
negative values. The Random panel shows a
second simulation where the values for simulated letter typing speed were simply
randomly selected, with the constraint that they sum to the same value. Figure 14.1
shows that time to finish typing a text is faster for vectors that are more negatively
correlated with letter frequencies in the text.
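A minimal sketch of this crafting procedure, written in Python with made-up letter frequencies standing in for the Jones and Mewhort (2004) counts (so an illustration of the logic rather than the authors’ code), is shown below:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical relative letter frequencies (stand-ins for corpus counts).
letter_freq = rng.dirichlet(np.ones(26))

def craft_vector(sign, base_ms=150.0, step=1.0, n_iter=5000):
    """Nudge a 26-unit keystroke-time vector so its correlation with letter
    frequency moves toward `sign` (+1 or -1). Each accepted change subtracts
    `step` from one unit and adds it to another, so the vector's sum is fixed."""
    times = np.full(26, base_ms)
    best = -np.inf
    for _ in range(n_iter):
        i, j = rng.choice(26, size=2, replace=False)
        trial = times.copy()
        trial[i] -= step
        trial[j] += step
        if trial.min() < 0:            # keep keystroke times non-negative
            continue
        r = np.corrcoef(trial, letter_freq)[0, 1]
        if sign * r > best:            # keep the change only if it helps
            times, best = trial, sign * r
    return times

def mean_keystroke_time(times):
    # Expected keystroke time for a text whose letters follow letter_freq.
    return float(np.dot(times, letter_freq))

fast = craft_vector(-1)   # negatively correlated with letter frequency
slow = craft_vector(+1)   # positively correlated with letter frequency
print(mean_keystroke_time(fast) < mean_keystroke_time(slow))   # True
```

Because both crafted vectors sum to the same total, the only thing that makes the negatively correlated typist faster on real text is that short times have been allocated to frequent letters.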
The simulation shows that typists have an opportunity to further optimize their
typing speed by modifying individual letter typing speeds in keeping with the
frequencies of individual letters in the text they are typing. Indeed, there is some
existing evidence that, among skilled typists, letter typing times do negatively
correlate with letter and bigram frequencies (Grudin & Larochelle, 1982).
However, it is unclear how these micro-adjustments to the timing of individual
keystrokes take place. If typists are simply changing the timing parameters for each
keystroke whenever they can, without prioritizing the changes as a function of
letter frequency, then we would not expect systematic correlations to exist between
letter typing times and letter frequencies. The next two hypotheses assume that
typists become sensitive to the statistics of their typing environment “for free,”
simply by using the same general learning or memory processes they always use
when learning a new skill.
In instance theory (Logan, 1988), an algorithmic process races against the retrieval
of stored instances from memory to produce an action, with the faster process winning
the race and controlling action. Instances in memory are not created equal, and some
can be retrieved faster than others. As memory for a specific action is populated with
more instances, that action is more likely to be produced by the memory process,
because one of the instances will tend to have a faster retrieval time than the algorithmic
process. In other words, memory-based response speed depends on the likelihood of
sampling increasingly extreme values from increasingly large sets of instances. More
simply, fast memory-based responses become more likely as the instance pool grows.
We simulated predictions of instance theory for acquiring sensitivity to letter
and bigram frequencies in English texts (Jones & Mewhort, 2004) with practice.
Response times to a letter or a bigram were sampled from normal distributions,
with the number of samples constrained by the number of instances in memory
for that letter or bigram. The response time for each was the fastest time sampled
from the distribution.
To simulate practice, we repeated this process between the ranges of 50
experiences with letters and bigrams, up to 1,000,000 experiences. To determine
sensitivity to letter and bigram frequency, we correlated the vectors of retrieval
times for presented letters and bigrams with the vectors of letter and bigram
frequencies from the corpus counts.
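The following is a simplified sketch of that simulation logic in Python (letters only, with hypothetical frequencies; treating the number of stored instances as proportional to corpus frequency is our simplifying assumption, not a detail reported in the chapter):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical relative letter frequencies (stand-ins for corpus counts).
letter_freq = rng.dirichlet(np.ones(26))

def retrieval_times(total_experiences, mu=500.0, sd=100.0, floor=None):
    """Instance-theory style retrieval: each letter's time is the minimum of
    n samples from a normal distribution, where n is the expected number of
    stored instances of that letter after `total_experiences` keystrokes."""
    times = np.empty(26)
    for i, p in enumerate(letter_freq):
        n = max(1, int(round(p * total_experiences)))
        samples = rng.normal(mu, sd, size=n)
        if floor is not None:
            samples = np.maximum(samples, floor)  # physical limit on speed
        times[i] = samples.min()
    return times

for practice in (50, 1000, 100_000, 1_000_000):
    r = np.corrcoef(retrieval_times(practice), letter_freq)[0, 1]
    print(practice, round(r, 3))  # correlations generally grow more negative
```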
The No floor panel in Figure 14.2 shows increasingly negative correlations
between retrieval time and corpus count frequency as a function of practice
for letters and bigrams. The correlations plateau with practice, and letter and
bigram sensitivity develop roughly in parallel. Bigram sensitivity is delayed because
FIGURE 14.2 Simulated instance theory (Logan, 1988) predictions for how correlations
between letter and bigram frequency, and their simulated typing times, would change
as a function of practice. Floor versus No floor refers to whether simulated typing
times were limited by some value reflecting physical limitations for movement time.
(Two panels, Floor and No floor; x-axis: simulated practice (no. of experiences);
y-axis: n-gram frequency by speed correlation.)
experience with specific bigrams accumulates more slowly than experience with
individual letters (which repeat across many different bigrams). So, unlike the SRN
model, the instance model does not predict that sensitivity to lower-order statistics
decreases as sensitivity to higher-order statistics increases. However, a modified version of this simulation
that includes a floor on retrieval times to represent the fact that reaction times will
eventually hit physical limitations does show waxing and waning of sensitivities,
with the value of letter and bigram correlations increasing to a maximum, and
then slowly decreasing toward zero as all retrieval times become more likely to
sample the same floor value.
Importantly, the sequential structure that typists may be learning is fairly stable across
English texts.
Methodological Details
All typists copy-typed five normal paragraphs from the Simple English Wiki, a
version of the online encyclopedia Wikipedia written in basic English. Four
of the paragraphs were from the entry about cats (http://simple.wikipedia.
org/wiki/Cat), and one paragraph was from the entry for music (http://simple.
wikipedia.org/wiki/Music). Each normal paragraph had an average of 131 words
(range 124–137). The paragraphs were representative of English texts and highly
correlated with Gutenberg letter (26 unique letters, 3051 total characters, r = 0.98),
bigram (267 unique bigrams, 2398 total bigrams, r = 0.91), and trigram frequencies
(784 unique trigrams, 1759 total trigrams, r = 0.75).
As part of an exploratory analysis we also had typists copy two paragraphs of
non-English text, each composed of 120 five-letter strings. The strings in the
bigram paragraph were generated according to bigram probabilities from our corpus
counts, resulting in text that approximated the bigram structure of English text (i.e.
a first-order approximation to English; Mewhort, 1966, 1967). This paragraph was
generally well correlated with the Gutenberg letter counts (24 unique, 600 total,
r = 0.969), bigram counts (160 unique, 480 total, r = 0.882), and trigram counts
(276 unique, 360 total, r = 0.442).
The strings in the random letter paragraph were constructed by sampling each
letter from the alphabet randomly with replacement. This paragraph was not well
correlated with Gutenberg letter counts (26 unique, 600 total, r = 0.147), bigram
counts (351 unique, 479 total, r = −0.056), or trigram counts (358 unique,
360 total, r = −0.001).
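A small sketch of how such materials can be generated is given below; the bigram count matrix here is made up for illustration, whereas the chapter used counts from its own corpus analysis:

```python
import numpy as np

rng = np.random.default_rng(3)
letters = list("abcdefghijklmnopqrstuvwxyz")

# Hypothetical 26 x 26 bigram count matrix; counts[i, j] stands for the
# frequency of letters[i] followed by letters[j].
counts = rng.integers(1, 100, size=(26, 26)).astype(float)

def bigram_string(length=5):
    """A letter string whose transitions follow the bigram probabilities
    implied by `counts` (a first-order approximation to English)."""
    first_letter_p = counts.sum(axis=1) / counts.sum()
    seq = [rng.choice(26, p=first_letter_p)]
    while len(seq) < length:
        row = counts[seq[-1]]
        seq.append(rng.choice(26, p=row / row.sum()))
    return "".join(letters[i] for i in seq)

def random_string(length=5):
    """A letter string sampled uniformly at random with replacement."""
    return "".join(rng.choice(letters, size=length))

bigram_paragraph = " ".join(bigram_string() for _ in range(120))
random_paragraph = " ".join(random_string() for _ in range(120))
```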
Workers on Amazon Mechanical Turk (restricted to people in the USA with an over
90% HIT completion rate) found our task, gave consent, and then completed it.
was approved by the institutional review board at Brooklyn College of the City
University of New York. Four hundred individuals started the task; however, data
were only analyzed for the 346 participants who successfully completed the task
(98 men, 237 women, 11 no response). Participants reported their age within
5-year time bins, ranging from under 20 to over 66 years old (mean bin = 35
to 40 years old, +/− 2 age bins). Two hundred and ninety-six participants were
right-handed (33 left-handed, 11 ambidextrous, 6 no response). One hundred
and thirty-five participants reported normal vision (202 corrected, 5 reported
“vision problems,” 4 no response). Three hundred and twenty-nine participants
reported that English was their first language (7 reported English being their second
language, 10 no response). Participants reported that they had been typing between
1 and 60 years (M = 20.2 years, SE = 9.3), and had started typing at between
3 and 49 years old (M = 13.3 years old, SE = 5.5). Two hundred and eighty
participants reported being touch typists (63 not touch typists, 3 no response), and
187 reported having formal training (154 no formal training, 5 no response).
During the task, participants were shown each of the seven different paragraphs
in a text box on their monitor (order randomized). Paragraph text was black,
presented in 14 pt, Helvetica font. Participants were instructed to begin typing
with the first letter in the paragraph. Correctly typed letters turned green, and
typists could only proceed to the next letter by typing the current one correctly. After
completing the task, participants were presented with a debriefing, and a form to
provide any feedback about the task. The task took around 30 to 45 minutes to
complete. Participants who completed the task were paid $1.
The Data
We collected inter-keystroke interval times (IKSIs; in milliseconds), for every
correct and incorrect keystroke for each subject and each paragraph. Each IKSI
is the difference between the timestamp for typing the current letter and the most
recent letter. IKSIs for each letter were also coded in terms of their associated
bigrams and trigrams. Consider typing the word cat. The IKSI for typing letter t
(timestamp of t – timestamp of a) has the letter level t, the bigram level at, and
the trigram level cat. In addition, each letter, bigram, and trigram has a frequency
value from the corpus count.
In this way, for each typist we compiled three signatures of sensitivity to letter,
bigram, and trigram frequency. For letters, we computed the vector of mean IKSIs
for all unique correctly typed letters and correlated it with the vector of letter
frequencies. The same process was repeated for the vector of mean IKSIs for all
unique correctly typed bigrams and trigrams. The resulting correlation values for
each typist appear as individual dots in the figures that follow.
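In code, the per-typist signature computation might look roughly like the following Python sketch (the column names and data layout are our own assumptions; the chapter does not publish its analysis code):

```python
import numpy as np
import pandas as pd

def frequency_sensitivity(keystrokes, freq_table, unit="letter"):
    """One correlation per typist: mean IKSI for each unique, correctly typed
    n-gram, correlated with that n-gram's corpus frequency.

    keystrokes : DataFrame with columns ['subject', unit, 'iksi', 'correct']
    freq_table : dict mapping an n-gram string to its corpus frequency
    (both structures are illustrative assumptions)."""
    correct = keystrokes[keystrokes["correct"]]
    signatures = {}
    for subject, grp in correct.groupby("subject"):
        mean_iksi = grp.groupby(unit)["iksi"].mean()
        freqs = np.array([freq_table.get(g, np.nan) for g in mean_iksi.index],
                         dtype=float)
        ok = ~np.isnan(freqs)
        signatures[subject] = np.corrcoef(mean_iksi.values[ok], freqs[ok])[0, 1]
    return pd.Series(signatures, name=f"{unit}_frequency_correlation")

# Hypothetical usage: letter_r = frequency_sensitivity(df, letter_counts, "letter")
```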
(r = −0.220, SE = 0.003). Additionally, one-sample t-tests revealed that the mean
correlation for each n-gram type was significantly different from zero. All of the
mean correlations were significant and negative, showing that, in general, typing
times are faster for higher- than for lower-frequency n-grams. And the size of the
negative correlation decreases with increasing n-gram order, showing that there
is more sensitivity to lower- than to higher-order structure. The major take-home
finding is that typists are indeed sensitive to the sequential structure of the texts they
type.
[Figure: frequency/IKSI correlations plotted against mean typing speed (three panels).]
The bigram and trigram frequency correlations showed a qualitatively different pattern. Here we see that the faster typists on the left show
larger negative correlations than the slower typists on the right. In other words,
there was a small positive correlation between sensitivity to bigram frequency
and skill (r = 0.144, p < 0.007), and trigram frequency and skill (r = 0.146,
p < 0.006). Again, predictions of the learning and memory models are generally
consistent with the data, which show that highly skilled typists are more sensitive
to higher-order sequential statistics than poor typists.
[Figures: frequency/IKSI correlations plotted against mean typing speed for the bigram and random paragraphs (three panels each).]
p < 0.001; r = 0.340, p < 0.001), and trigram (r = 0.171, p < 0.001; r = 0.244,
p < 0.001) frequencies, respectively, for both the bigram and random paragraphs,
and as a function of mean IKSI or overall typing speed. In general, we see the same
qualitative patterns as before. For the bigram paragraph, the slower typists are more
negatively correlated with letter frequency than the faster typists, and the faster
typists are more negatively correlated with bigram and trigram frequency than the
slower typists. For the random paragraph, the slope of the regression line relating
mean typing speed to letter correlations was not significantly different from 0,
showing no differences between faster and slower typists. However, the figure
shows that all typists’ typing times were negatively correlated with letter frequency. Typing
random strings of letters disrupts normal typing (Shaffer & Hardwick, 1968), and
appears to have turned our faster typists into novices, in that the faster typists’
pattern of correlations looks like the novice signature pattern. It is noteworthy
that even though typing random letter strings disrupted normal typing by slowing
down mean typing speed, it did not cause a breakdown of sensitivity to sequential
structure. We return to this finding in the general discussion.
General Discussion
We examined whether measures of typing performance could test predictions
about how learning and memory participate in the acquisition of skilled
serial-ordering abilities. Models of learning and memory make straightforward
predictions about how people become sensitive to sequential regularities in actions
that they produce. Novices become tuned to lower-order statistics, like single
letter frequencies, then with expertise develop sensitivity to higher-order statistics,
like bigram and trigram frequencies, and in the process appear to lose sensitivity
to lower-order statistics. We saw clear evidence of these general trends in our
cross-sectional analysis of a large number of typists.
The faster typists showed stronger negative correlations with bigram and trigram
frequencies than the slower typists. This is consistent with the prediction that
sensitivity to higher-order sequential structure develops over practice. We also
found that faster typists showed weaker negative correlations with letter frequency
than the slower typists. This is consistent with the prediction that sensitivity to
lower-order sequential structure decreases with practice.
FIGURE 14.5 Histograms of the distribution of differences between the maximum
likelihoods for the typed error letter and the planned correct letter, based on the bigram
context of the correct preceding letter. (x-axis: typed error MLE minus planned correct
MLE; y-axis: number of errors; separate histograms for the Normal, Bigram, and
Random paragraphs.)
Knowledge of sequential statistics might also produce action slips (Norman, 1981), which could be detected in typists’ errors. For example,
consider having to type the letters “qi.” Knowledge of sequential statistics should
lead to some activation of the letter “u,” which is more likely to follow “q” than “i.”
We examined all of our typists’ incorrect keystrokes, following the intuition that
when statistical slips occur, the letters typed in error should have a higher maximum
likelihood expectation given the prior letter than the letter that was supposed to
be typed according to the plan. We limited our analyses to erroneous keystrokes
that were preceded by one correct keystroke. For each of the 38,739 errors, we
subtracted the maximum likelihood expectation for the letter that was typed in
error given the correct predecessor, from the maximum likelihood expectation for
the letter that was supposed to be typed given the correct predecessor.
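A rough Python sketch of this difference-score computation (with a made-up bigram table standing in for the corpus-derived counts used in the chapter) is:

```python
import numpy as np

def bigram_mle(counts):
    """Row-normalize a 26 x 26 bigram count matrix into conditional
    probabilities P(next letter | previous letter)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def slip_scores(errors, mle, letter_index):
    """For each error (prev_correct, typed_error, planned_correct), return
    P(typed | prev) - P(planned | prev). Positive values would indicate that
    errors drift toward statistically likely continuations."""
    scores = []
    for prev, typed, planned in errors:
        p = letter_index[prev]
        scores.append(mle[p, letter_index[typed]] - mle[p, letter_index[planned]])
    return np.array(scores)

# Illustrative usage with hypothetical counts and a single error ("q" then "u"
# typed in place of the planned "i"):
letters = "abcdefghijklmnopqrstuvwxyz"
letter_index = {c: i for i, c in enumerate(letters)}
rng = np.random.default_rng(4)
mle = bigram_mle(rng.integers(1, 100, size=(26, 26)))
print(slip_scores([("q", "u", "i")], mle, letter_index))
```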
Figure 14.5 shows the distributions of difference scores for errors produced
by all typists by paragraph typing conditions. If knowledge of sequential statistics
biases errors, then we would expect statistical action slips to occur. Letters typed
in error should have higher likelihoods than the planned letter, so we would
expect the distributions of difference scores to be shifted away from 0 in a positive
direction. None of the distributions for errors from typing any of the paragraphs
are obviously shifted in a positive direction. So, it remains unclear how learning and
memory processes contribute to the operations of response-scheduling. They do
influence the keystroke speed as a function of n-gram frequency, but apparently
do so without causing a patterned influence on typing errors that would be
expected if n-gram knowledge biased activation weights for typing individual
letters. Typists make many different kinds of errors for other reasons, and larger
datasets could be required to tease out statistical slips from other more common
errors like hitting a nearby key, transposing letters within a word, or missing letters
entirely.
Acknowledgment
This work was supported by a grant from NSF (#1353360) to Matthew Crump.
The authors would like to thank Randall Jamieson, Gordon Logan, and two
anonymous reviewers for their thoughtful comments and discussion in the
preparation of this manuscript.
References
Borel, É. (1913). La mécanique statique et l’irréversibilité. Journal de Physique Theorique et
Appliquee, 3, 189–196.
Botvinick, M. M., & Plaut, D. C. (2004). Doing without schema hierarchies: A recurrent
connectionist approach to normal and impaired routine sequential action. Psychological
Review, 111, 395–429.
Botvinick, M. M., & Plaut, D. C. (2006). Such stuff as habits are made on: A reply to
Cooper and Shallice (2006). Psychological Review, 113, 917–927.
Cleeremans, A. (1993). Mechanisms of implicit learning: Connectionist models of sequence processing.
Cambridge, MA: MIT Press.
Cleeremans, A., & McClelland, J. L. (1991). Learning the structure of event sequences.
Journal of Experimental Psychology: General, 120, 235–253.
Cooper, R. P., & Shallice, T. (2000). Contention scheduling and the control of routine
activities. Cognitive Neuropsychology, 17, 297–338.
Cooper, R. P., & Shallice, T. (2006). Hierarchical schemas and goals in the control of
sequential behavior. Psychological Review, 113, 887–916.
Crump, M. J. C., & Logan, G. D. (2010a). Hierarchical control and skilled typing: Evidence
for word-level control over the execution of individual keystrokes. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 36, 1369–1380.
Crump, M. J. C., & Logan, G. D. (2010b). Warning: This keyboard will deconstruct—The
role of the keyboard in skilled typewriting. Psychonomic Bulletin & Review, 17, 394–399.
Crump, M. J. C., McDonnell, J. V., & Gureckis, T. M. (2013). Evaluating Amazon’s
Mechanical Turk as a tool for experimental behavioral research. PLoS One, 8, e57410.
Dell, G. S., Burger, L. K., & Svec, W. R. (1997). Language production and serial order: A
functional analysis and a model. Psychological Review, 104, 123–147.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Estes, W. K. (1972). An associative basis for coding and organization in memory. In A. W.
Melton & E. Martin (Eds.), Coding processes in human memory (pp. 161–190). Washington,
DC: V. H. Winston & Sons.
Fodor, J. A. (1983). The modularity of mind: An essay on faculty psychology. Cambridge, MA:
MIT Press.
Gentner, D. R., Grudin, J., & Conway, E. (1980). Finger movements in transcription typing.
DTIC document. San Diego, CA: University of California, San Diego, La Jolla Center
for Human Information Processing.
Grudin, J. T., & Larochelle, S. (1982). Digraph frequency effects in skilled typing. DTIC
document. San Diego, CA: University of California, San Diego, La Jolla Center for
Human Information Processing.
Jamieson, R. K., & Mewhort, D. J. K. (2009a). Applying an exemplar model to the
artificial-grammar task: Inferring grammaticality from similarity. The Quarterly Journal
of Experimental Psychology, 62, 550–575.
Jamieson, R. K., & Mewhort, D. J. K. (2009b). Applying an exemplar model to the serial
reaction-time task: Anticipating from experience. The Quarterly Journal of Experimental
Psychology, 62, 1757–1783.
Jones, M. N., & Mewhort, D. J. (2004). Case-sensitive letter and bigram frequency counts
from large-scale English corpora. Behavior Research Methods, Instruments, & Computers, 36,
388–396.
Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the special issue on the web as
corpus. Computational Linguistics, 29, 333–347.
Lashley, K. S. (1951). The problem of serial order in behavior. In L. A. Jeffress (Ed.), Cerebral
mechanisms in behavior (pp. 112–136). New York: Wiley.
Liu, X., Crump, M. J. C., & Logan, G. D. (2010). Do you know where your fingers have
been? Explicit knowledge of the spatial layout of the keyboard in skilled typists. Memory
& Cognition, 38, 474–484.
Logan, G. D. (1988). Toward an instance theory of automatization. Psychological Review, 95,
492–527.
Logan, G. D., & Crump, M. J. C. (2009). The left hand doesn’t know what the right hand is
doing: The disruptive effects of attention to the hands in skilled typewriting. Psychological
Science, 20, 1296–1300.
Logan, G. D., & Crump, M. J. C. (2010). Cognitive illusions of authorship reveal hierarchical
error detection in skilled typists. Science, 330, 683–686.
Logan, G. D., & Crump, M. J. C. (2011). Hierarchical control of cognitive processes: The
case for skilled typewriting. In B. H. Ross (Ed.), The Psychology of Learning and Motivation
(Vol. 54, pp. 1–27). Burlington: Academic Press.
15
CAN BIG DATA HELP US UNDERSTAND
HUMAN VISION?
Abstract
Big Data seems to have an ever-increasing impact on our daily lives. Its application
to human vision has been no less impactful. In particular, Big Data methods
have been applied to both content and data analysis, enabling a new, more
fine-grained understanding of how the brain encodes information about the
visual environment. With respect to content, the most significant advance has
been the use of large-scale, hierarchical models—typically “convolutional neural
networks” or “deep networks”—to explicate how high-level visual tasks such
as object categorization can be achieved based on learning across millions of
images. With respect to data analysis, complex patterns underlying visual behavior
can be identified in neural data using modern machine-learning methods or
“multi-variate pattern analysis.” In this chapter, we discuss the pros and cons
of these applications of Big Data, including limitations in how we can interpret
results. In the end, we conclude that Big Data methods hold great promise
for pursuing the challenges faced by both vision scientists and, more generally,
cognitive neuroscientists.
Introduction
With its inclusion in the Oxford English Dictionary (OED), Big Data has come
of age. Beyond according “Big Data” an entry in its lexicon, the quotes
that the OED chose to accompany the definition are telling, painting a picture
from skepticism in 1980, “None of the big questions has actually yielded to the
bludgeoning of the big data people,” to the realization of Big Data’s value in
2003, “The recognition that big data is a gold mine and not just a collection
of dusty tapes” (Oxford English Dictionary, 2016). Today Big Data is being
used to predict flu season severity (Ginsberg et al., 2009; sometimes imperfectly;
Lazer, Kennedy, King, & Vespignani, 2014), guide game time decisions in sports
(Sawchik, 2015), and precisely call elections before they happen (Clifford, 2008).
These and other high-profile Big Data applications typically rely on quantitative
or textual data—people’s preferences, atmospheric measurements, online search
behavior, hitting and pitching statistics, etc. In contrast, Big Data applied to vision
typically involves image statistics; that is, what kinds of visual information across
millions of images support object categorization, scene recognition, or other
high-level visual tasks. In this vein, perhaps the most well-known result over the
past decade is the finding by a Google/Stanford team that YouTube videos are
frequently populated by cats (Le et al., 2012). Although the cats-in-home-movies
finding was certainly not the first application of Big Data to images, this paper’s notoriety
signaled that Big Data had come to vision.
The vision community’s deployment of Big Data mirrors similar applications
across other domains of artificial and biological intelligence (e.g., natural language
processing). Such domains are unique in that they often attempt to link artificial
systems to biological systems performing the same task. As such, Big Data is
typically deployed in two distinct ways.
First, Big Data methods can be applied to content. That is, learning systems
such as convolutional or “deep” neural networks (LeCun, Bengio, & Hinton,
2015) can be trained on millions of images, movies, sounds, text passages, and
so on. Of course, before the rise of the Internet, accessing such large datasets
was nearly impossible; hence the term “web-scale,” which is sometimes used
to denote models relying on this sort of data (Mitchell et al., 2015; Chen,
Shrivastava, & Gupta, 2013). Recent demonstrations of the application of Big
Data to image classification are numerous, exemplified by the generally strong
interest in the ImageNet competition (Russakovsky et al., 2015), including systems
that automatically provide labels for the content of images drawn from ImageNet
(Deng et al., 2014). However, a more intuitive demonstration of the popularity
and application of Big Data to image analysis can be found in Google Photos.
Sometime in the past few years Google extended Photos’ search capabilities to
support written labels and terms not present in any of one’s photo labels (Rosenberg
& Image Search Team, 2013). For example, the first author almost never labels his
uploaded photos, yet entering the term “cake” into the search bar correctly yielded
five very different cakes from his photo collection (Figure 15.1). Without knowing
exactly what Google is up to, one presumes that they have trained a learning model
on millions of images from their users’ photos and that there are many labeled and
unlabeled images of cakes in this training set. Given this huge training set, when
presented with the label “cake,” Google Photos is able to sift through an unlabeled
photo collection and pick the images most likely to contain a cake.
FIGURE 15.1 Big Data methods applied to visual content. Images returned by
the search term “cake” from the first author’s personal Google Photos collection
(https://photos.google.com).
Second, Big Data methods can be applied to data analysis. That is, a set of
neural (typically) or behavioral (occasionally) data collected in a human experiment
may be analyzed using techniques drawn from machine learning or statistics.
In cognitive neuroscience, a family of such approaches is often referred to as
multi-voxel or multivariate pattern analysis or “MVPA” (Norman, Polyn, Detre, &
Haxby, 2006; Haxby et al., 2001). MVPA is appealing in that it takes into account
the complex pattern of activity encoded across a large number of neural units
rather than simply assuming a uniform response across a given brain region. MVPA
methods are often used to ask where in the brain information is encoded with
respect to a specific stimulus contrast. More specifically, one of many different
classifiers (typically linear) will be trained on a subset of neural data and then
used to establish which brain region(s) are most effective for correctly classifying
new data (i.e. the best separation of the data with respect to the contrast of
interest). However, when one looks more closely at MVPA methods, it is not
clear that the scientific leverage they provide is really Big Data. That is, while
it is certainly desirable to apply more sophisticated models to neuroimaging data,
acknowledging, for example, that neural codes may be spatially distributed in a
non-uniform manner, there is little in MVPA and related approaches that suggests
that they employ sufficient numbers of samples to enable the advantages that come
with Big Data. As we review below, we hold that all present-day neuroimaging
methods are handicapped with respect to how much data can practically be
collected from a given individual or across individuals.
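For readers unfamiliar with the approach, the following is a toy MVPA-style region-of-interest decoding sketch in Python using scikit-learn and simulated voxel data; it is purely illustrative, and the simulated signal, region boundaries, and classifier choice are all assumptions of ours rather than the method of any study discussed here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)

# Simulated data: 200 trials x 500 voxels, two stimulus conditions,
# with a weak condition-related signal embedded in the first 50 voxels.
n_trials, n_voxels = 200, 500
labels = rng.integers(0, 2, size=n_trials)
data = rng.normal(size=(n_trials, n_voxels))
data[:, :50] += 0.5 * labels[:, None]   # signal confined to "region A"

regions = {"region A": slice(0, 50), "region B": slice(50, 500)}
for name, voxels in regions.items():
    clf = LogisticRegression(max_iter=1000)       # a linear classifier
    acc = cross_val_score(clf, data[:, voxels], labels, cv=5).mean()
    print(name, round(acc, 2))   # region A should decode above chance
```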
We believe that the real advantage of Big Data applied to understanding human
vision will come from a third approach—that of using large-scale artificial models as
proxy models of biological processing. That is, as we will argue below, well-specified
computational models of high-level biological vision are scarce. At the same time,
progress in computer vision has been dramatic over the past few years (Russakovsky
et al., 2015). Although there is no way to know, a priori, that a given artificial
vision system either represents or processes visual information in a manner similar
to the human visual system, one can build on the fact that artificial systems rely on
visual input that is often similar to our own visual world—for example, images of
complex scenes, objects, or movies. At the same time, the output of many artificial
vision systems is likewise coincident with the apparent goals of the human visual
system—object and/or scene categorization and interpretation.
Human visual experience does not arrive as a set of independent samples; rather, it is
sequential and governed by the causal dynamics of our environment. As a result, visual
samples taken over short time windows are at least locally non-independent, consisting
of a series of coherent “shots.” The most likely image following a given image in a
dynamic sequence is almost always another image
with nearly the same content. Indeed, the human visual system appears to rely
on this fact to build up three-dimensional representations of objects over time
(Wallis, 1996). At the same time, more globally, there is likely to be significant
independence between samples taken over longer time lags. That is, we do see a
variety of cats, dogs, and much more as we go about our everyday lives. These
sorts of issues are important to consider in terms of stimuli for training artificial
vision systems or testing human vision in the era of Big Data. When systems
and experiments only included a handful of conditions, variation across stimuli
was difficult to achieve and one often worried about the generalizability of one’s
results.
Now, however, with the use of many more stimuli, the issue is one of variation:
We need sufficient variation to support generalization, but sufficient overlap to
allow statistical inferences. That is, with larger stimulus sets, it is important to
consider how image variation is realized within a given set. For example, movies
(the YouTube kind, not the feature-length kind) as stimuli are a great source of
large-scale visual data. By some estimates there are nearly 100,000,000 videos on
YouTube averaging about 10 minutes each in length. In terms of video frames,
assuming a 30 fps frame rate, we have about 1,800,000,000,000 available image
samples on YouTube alone. However, frames taken from any given video are
likely to be highly correlated with one another (Figure 15.2). Thus, nominally
“Big Data” models or experiments that rely on movie frames as input may be
overestimating the actual scale of the employed visual data. Of course, vision
researchers realize that sequential samples from a video sequence are typically
non-independent, but in the service of nominally larger datasets, this consideration
is sometimes overlooked.
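As a rough check on that frame estimate (taking the stated figures of roughly $10^{8}$ videos, about 10 minutes per video, and a 30 fps frame rate at face value):

$$10^{8}\ \text{videos} \times 10\ \tfrac{\text{min}}{\text{video}} \times 60\ \tfrac{\text{s}}{\text{min}} \times 30\ \tfrac{\text{frames}}{\text{s}} = 1.8 \times 10^{12}\ \text{frames}.$$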
Modulo the issue of sample variance, it is the high dimensionality of the visual
world, and of our mental representation of that world, that makes one think “Aha!
FIGURE 15.2 Three non-adjacent frames from a single movie. Although each frame of
the movie might be considered a distinct sample under some approaches, the content
contained in each frame is strongly correlated with the content shown in other frames
from the same movie (adapted from www.beachfrontbroll.com).
Vision is clearly big data—people see millions of different images every day and
there are about 3 × 10⁹ cortical neurons involved in this process” (Sereno &
Allman, 1991). Even better, there are probably about 5,000–10,000 connections
per neuron. Thus, both visual content and biological vision appear to be
domains of Big Data. Moreover, much of the leverage that Big Data provides
for vision is in densely sampling these spaces, thereby providing good coverage
of almost all possible images or vision-related brain states. However, we should
note that although the vision problem is both big and complicated, it is not the
problem itself that determines whether we can find solutions to both artificial and
biological vision problems using Big Data approaches. Rather, it is the kind of data
we can realistically collect that determines whether Big Data provides any leverage
for understanding human vision.
In particular, within the data analysis domain, sample sizes are necessarily
limited by sampling methods, including their bandwidth limitations and their cost,
and by the architecture of the brain itself. With respect to this latter point, almost all
human neuroimaging methods unavoidably yield highly correlated samples driven
by the same stimulus response, rather than samples from discrete responses. This is
true regardless of whether we measure neural activity at spatial locations in the
brain or at time points in the processing stream (or both).
To make this point more salient, consider the methodological poster child
of modern cognitive neuroscience: functional magnetic resonance imaging
(fMRI). fMRI is a powerful, non-invasive method for examining task-driven,
function-related neural activity in the human brain. The strength of fMRI is spatial
localization—where in the brain differences between conditions are reflected in
neural responses. The unit for where is a “voxel”—the minimal volume of brain
tissue across which neural activity may be measured. While the size of voxels in
fMRI has been continually shrinking, at present, the practical volume limit2 for
imaging the entire brain is about 1.0 mm³—containing about 100,000 cortical
neurons (Sereno & Allman, 1991). As a consequence, the response of a voxel in
fMRI is actually the aggregate response across 100,000 or more neurons. This level
of resolution has the effect of blurring any fine-grained neural coding for visual
information and, generally, creating samples that are more like one another than
they otherwise might be if the brain could be sampled at a finer scale. Reinforcing
this point, there is evidence that individual visual neurons have unique response
profiles reflecting high degrees of selectivity for specific visual objects (Woloszyn
& Sheinberg, 2012). A second reason why spatially adjacent voxels tend to exhibit
similar neural response profiles is that the brain is organized into distinct, localized
neural systems that realize different functions (Fodor, 1983). In vision this means
that voxels within a given region of the visual system hierarchy (e.g. V1, V2, V4,
MT, etc.) respond in a like fashion to visual input (which is how functional regions
are defined in fMRI). That is, adjacent voxels are likely to show similarly strong
responses to particular images and similarly weak responses to other images
(a fact codified in almost all fMRI analysis pipelines whereby there is a minimum
cluster or volume size associated with those regions of activity that are considered
to be significant). By way of example, voxels in a local neighborhood often appear
to be selective for specific object categories (Kanwisher, McDermott, & Chun,
1997; Gauthier, Tarr, Anderson, Skudlarski, & Gore, 1999). Thus, regardless of
how many voxels we might sample using fMRI, their potentially high mutual
correlation implies that it may be difficult to obtain enough data to densely sample
the underlying fine-grained visual representational space.
Similar limitations arise in the temporal domain. First, within fMRI there is a
fundamental limit based on the “hemodynamic response function” or HRF. Since
fMRI measures changing properties of blood (oxygenation), the rate at which
oxygenated blood flows into a localized brain region limits the temporal precision
of fMRI. A full HRF spans some 12–16 seconds; however, methodological
cleverness has allowed us to reduce temporal measurements using fMRI down
to about 2–3 seconds (Mumford, Turner, Ashby, & Poldrack, 2012). Still, given
that objects and scenes may be categorized in about 100 ms (Thorpe, Fize, &
Marlot, 1996), 2–3 seconds is a relatively coarse sampling rate that precludes
densely covering many temporal aspects of visual processing. As discussed below,
there are also practical limits on how many experimental trials may be run in a
typical one hour study.
Alternatively, vision scientists have used electroencephalography (EEG; as well
as its functional variant, event-related potentials or ERPs) and magnetoencephalog-
raphy (MEG) to explore the fine-grained temporal aspects—down to the range of
milliseconds—of visual processing. With such techniques, the number of samples
that may be collected in a relatively short time period is much greater than with
fMRI. However, as with spatial sampling, temporal samples arising from neural
activity are likely to be highly correlated with one another and probably should
not be thought of as discrete samples—the measured neural response functions are
typically quite smooth (much as in the wave movie example shown in Figure 15.2).
Thus, the number of discrete temporal windows that might be measured during
visual task performance is probably much smaller than the raw number of samples.
At the same time, the spatial sampling resolution of EEG and MEG is quite
poor in that they both measure summed electrical potentials (or their magnetic
effects) using a maximum of 256–306 scalp sensors. Not only is the dimensionality
of the total sensor space small as compared to the number of potential neural
sources in the human brain, but source reconstruction methods must be used to
estimate the putative spatial locations generating these signals (Yang, Tarr, & Kass,
2014). Estimation of source locations in EEG or MEG is much less reliable than in
fMRI—on the order of 1 cm at best given current methods (at least 100,000,000
neurons per one reconstructed source). As such, it is difficult to achieve any sort
of dense sampling of neural units using either technique.
In trying to leverage Big Data for the study of human vision, we are also
faced with limitations in experimental power. By power, we mean our ability
to reliably detect an effect between conditions (i.e. to correctly reject the
null), which depends directly on sample size. Power is constrained by three different factors that necessarily
limit the amount of data we can collect from both individuals and across a
population.
First, although different neuroimaging methodologies measure different
correlates of neural activity, they are all limited by human performance. That is, we
can only show so many visual stimuli and collect so many responses in the typical
vision experiment. Assuming a minimum display time of between 100 and 500 ms
per stimulus and a minimum response time of between 200 and 500 ms per subject
response, and adding in a fudge factor for recovery between trials, we might ideally
be able to run 2,400 experimental trials of 1.5 sec each during a one hour experiment.
However, prep time, consent, rests between runs, etc. rapidly eat into the total
time we can actually collect data. Moreover, 1.5 sec is an extremely rapid pace that
is likely to fatigue subjects. Realistically, 1,000 is about the maximum number of
discrete experimental trials that might be run in one hour.
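The trial arithmetic can be written out explicitly. In the sketch below, the display and response times are taken from the upper ends of the ranges quoted above, and the half-second recovery interval is our own assumption about how the 1.5 sec trial decomposes.

```python
# Trial budget for a one-hour visual experiment; the 1.5 s trial length and
# 2,400-trial ceiling match the text, the decomposition is assumed.
display_time = 0.5    # s, upper end of the 100-500 ms stimulus range
response_time = 0.5   # s, upper end of the 200-500 ms response range
recovery_time = 0.5   # s, assumed inter-trial "fudge factor"

trial_length = display_time + response_time + recovery_time   # 1.5 s
ideal_trials = (60 * 60) / trial_length                       # 2,400 trials

print(f"ideal trials per hour: {ideal_trials:.0f}")
# Setup, consent, rests between runs, and subject fatigue push the realistic
# ceiling down to roughly 1,000 discrete trials.
```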
Second, as we have already discussed, different neuroimaging methodologies are
limited by what they actually measure. That is, the particular correlates of neural
activity measured by each method have specific spatial or temporal limitations.
Spatially, MRI methods provide the highest resolution using non-invasive
techniques.3 As mentioned above, at present, the best generally deployed MRI
systems can achieve a resolution of about 1 mm3 when covering the entire brain.
More realistically, most vision scientists are likely to use scanning parameters that
will produce a functional brain volume of about 700,000 voxels at one time
point assuming the frequently used 2 mm3 voxel size.4 In contrast, as already
discussed, because EEG and MEG measure electrical potentials at the scalp, the
spatial localization of neural activity requires source reconstruction, for which the
highest resolution that can be achieved is on the order of 1 cm3. In either case, the
sampling density is quite low relative to the dimensionality of neurons in the visual
cortex.
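The spatial sampling gap can be quantified in the same rough way. In the sketch below, the 2 mm3 functional voxel and the 256–306 sensor count come from the text, whereas the whole-brain volume and neuron count are approximate, textbook-scale figures that we supply only for illustration.

```python
# Rough spatial sampling density under stated assumptions.
brain_volume_mm3 = 1.4e6     # assumed whole-brain volume (~1.4 liters)
total_neurons = 86e9         # assumed total neuron count
voxel_volume_mm3 = 2.0       # functional voxel size quoted in the text
meg_sensors = 306            # upper end of the 256-306 scalp-sensor range

n_voxels = brain_volume_mm3 / voxel_volume_mm3     # ~700,000 voxels per volume
neurons_per_voxel = total_neurons / n_voxels       # ~120,000 neurons per voxel

print(f"functional voxels per volume:       {n_voxels:,.0f}")
print(f"neurons averaged within each voxel: {neurons_per_voxel:,.0f}")
print(f"fMRI voxels per MEG sensor:         {n_voxels / meg_sensors:,.0f}")
```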
The challenge of limited sample sizes—in absolute terms or relative to the
dimensionality of the domain-to-be-explained—is exacerbated by the third, highly
practical factor: cost. That is, regardless of the efficacy of a given method,
experimental power is inherently limited by the number of subjects that can be
run. In this vein, both fMRI and MEG tend to be relatively expensive—costing
in the neighborhood of $500/hour. Moreover, this cost estimate typically reflects
only operating costs, not acquisition costs, which run roughly
$1,000,000 per tesla (e.g. a 3T scanner would cost about $3,000,000 to purchase
and install). As such, modern vision science, which has enthusiastically adopted
neuroimaging tools in much of its experimental research, appears to be suffering
from a decrease in experimental power relative to earlier methods. Whether this
matters for understanding human vision is best addressed by asking what makes
Big Data “work”—the question we turn to next.
That is, across 1,000,000s of objects, particular image regularities will emerge and
can be instantiated as probabilistic visual “rules”—that is, default inferences that
are highly likely to hold across most contexts. Of note, because of the scale
of the training data and number of available parameters within the model, the
number of learned regularities can be quite large (i.e. much larger than the posited
number of grammatical rules governing human language) and can be quite specific
to particular categories. What is critical is that this set of visual rules can be applied
across collections of objects and scenes, even if the rules are not so general that they
are applicable across all objects and scenes. Interestingly, this sort of compositional
structure, while learnable by these models (perhaps because of their depth), may
be otherwise difficult to intuit or derive through formal methods.
In sum, Big Data may be an effective tool for studying many domains of
biological intelligence, and, in particular, vision, because it is often realized in
models that are good at both ends of the problem. That is, the sheer number of
parameters in these approaches means that the visual world can be densely learned
in a memory-intensive fashion across 1,000,000s of training examples. At the same
time, to the extent visual regularities exist within the domain of images, such
regularities—not apparent with smaller sample sizes—will emerge as the number
of samples increases. Of course, these benefits presuppose large-scale, sufficiently
varying, discrete samples within the training data—something available when studying
vision in the content domain, but, as we have reviewed above, far less available
in the analysis of neural data (using present-day methodologies). That is,
data arising from neuroimaging studies are rarely of sufficient
scale to provide dense sampling and clear signals about how the human visual
system encodes images and makes inferences about them with respect to either raw
memorial processes or compositional representations.
In one notable decoding study, Huth, Nishimoto, Vu, and Gallant (2012) labeled
the objects and actions appearing in natural movies using 1,705 WordNet categories
and used regularized linear regression to find parameter weights for the
WordNet-based model for the response of each
individual voxel in the brain. The key result of this analysis was a semantic map
across the whole brain, showing which neural units responded to which of the
objects and actions (in terms of the 1,705 lexical labels). Interestingly, this map
took the form of a continuous semantic space, organized by category similarity,
contradicting the idea that visual categories are represented in highly discrete brain
regions.
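The logic of such a voxel-wise analysis can be sketched in a few lines. The code below is a minimal illustration, not the authors' actual pipeline: the arrays are random placeholders standing in for a stimulus-label design matrix (1,705 WordNet categories) and measured voxel responses, and scikit-learn's ridge regression is used as a generic stand-in for the regularized fitting procedure.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Placeholder data: n_timepoints fMRI volumes, 1,705 binary WordNet label
# features per timepoint, and a handful of voxels. A real analysis would
# also convolve the label time courses with a hemodynamic response model.
rng = np.random.default_rng(0)
n_timepoints, n_labels, n_voxels = 600, 1705, 500
X = rng.integers(0, 2, size=(n_timepoints, n_labels)).astype(float)  # label design matrix
Y = rng.standard_normal((n_timepoints, n_voxels))                    # voxel responses

# One regularized linear model per voxel (fit jointly here); each voxel's
# weight vector over the 1,705 labels describes its "semantic tuning", and
# mapping those weights across cortex yields a whole-brain semantic map.
model = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X, Y)
semantic_weights = model.coef_        # shape: (n_voxels, n_labels)
print(semantic_weights.shape)
```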
As an alternative to decoding, proxy models allow the leverage provided by Big
Data in the content domain to be applied to behavioral or neural data. Many
examples of this sort of approach can be seen in the recent, and rapidly growing,
trend of applying models drawn from computer vision, and, in particular, deep
neural network models, to fMRI and neurophysiological data (Agrawal et al., 2014;
Yamins et al., 2014).
A somewhat more structured approach has been adopted by directly using
artificial vision models to account for variance in brain data. As alluded to
earlier, deep neural networks or convolutional neural networks have gained
rapid popularity as models for content analysis in many domains of artificial
intelligence (LeCun et al., 2015). One of the more interesting characteristics of
such models is that they are hierarchical: Higher layers represent more abstract,
high-level visual representations such as object or scene categories, while lower
layers represent low-level visual information, such as lines, edges, or junctions
localized to small regions of the image. This artificial hierarchical architecture
appears quite similar to the low-to-high-level hierarchy realized in the human
visual system. Moreover, these artificial models appear to have similar goals to
human vision: Taking undifferentiated points of light from a camera or a retina and
generating high-level visual representations that capture visual category structure,
including highly abstract information such as living/non-living or functional roles.
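Because these layer-wise representations are exactly what proxy-model analyses relate to brain data, it is worth seeing how cheaply they can be extracted. The sketch below assumes a recent torchvision with pretrained AlexNet weights and an arbitrary image file (the filename is a placeholder); it simply records the activation pattern at each convolutional layer as one candidate feature space per level of the hierarchy.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load a pretrained network (assumed available via torchvision) and the
# standard ImageNet preprocessing transforms.
model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()
preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example_scene.jpg").convert("RGB")).unsqueeze(0)  # placeholder path

# Walk through the convolutional stack, keeping each layer's activations.
activations = {}
x = img
with torch.no_grad():
    for i, layer in enumerate(model.features):
        x = layer(x)
        activations[f"features.{i}"] = x.flatten().numpy()

# Each stored vector is a candidate feature space at one depth of the
# hierarchy, which can then be related to voxel or single-unit responses.
for name, vec in activations.items():
    print(name, vec.shape)
```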
Recently, studies involving both human fMRI (Agrawal et al., 2014) and
monkey electrophysiology (Yamins et al., 2014) have found that, across the visual
perception of objects and scenes, deep neural networks are able to successfully
predict and account for patterns of neural activity in high-level visual areas
(Khaligh-Razavi & Kriegeskorte, 2014). Although the deep network models
employed in these studies are not a priori models of biological vision, they serve, as
we have argued, as proxy models whereby progress will be made by incorporating
and testing the efficacy of biologically derived constraints. For example, based
on present results we can confirm, not surprisingly, that the primate visual
system is indeed hierarchical. Perhaps a little less obviously, Yamins et al. (2014)
used a method known as hierarchical modular optimization to search through
a space of convolutional neural networks and identify which model showed the
best—from a computer vision point of view—object categorization performance.
What is perhaps surprising is that there was a strong correlation between model
performance and a given model’s ability to predict neuron responses recorded from
monkey IT. That is, the model that performed best on the object categorization
task also performed best at predicting the responses of IT neurons. This suggests
that when one optimizes a convolutional neural network to perform the same task
for which we assume the primate ventral stream is optimized, similar intermediate
visual representations emerge in both the artificial and biological systems. Yamins
et al. (2014) support this claim with the finding that the best performing model was
also effective at predicting V4 neuron responses. At the same time, many challenges
remain. In particular, how do we understand such intermediate-level visual
representations—the “dark matter” of both deep networks and human vision?
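The core comparison in that analysis reduces to a simple correlation across candidate models. The sketch below is schematic rather than a reimplementation of hierarchical modular optimization: the accuracy and predictivity values are invented placeholders that serve only to illustrate how performance on the categorization task would be related to how well each model predicts held-out IT responses.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder values for six hypothetical candidate models: object
# categorization accuracy and IT neural predictivity (e.g. cross-validated
# explained variance). The numbers are illustrative only.
categorization_accuracy = np.array([0.41, 0.48, 0.55, 0.60, 0.68, 0.74])
it_predictivity = np.array([0.12, 0.18, 0.22, 0.25, 0.31, 0.36])

r, p = pearsonr(categorization_accuracy, it_predictivity)
print(f"task performance vs. IT predictivity: r = {r:.2f}, p = {p:.3f}")
```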
To demonstrate how we might pursue the question of visual representation
in neural systems, we introduce an example from our own lab regarding how
the human brain represents complex, natural scenes. As in the studies discussed
above, we applied an artificial vision model trained on Big Data—1,000,000s of
images—to fMRI data. However, in this case, rather than predicting IT or V4
neural responses, we used the model to account for responses in three brain regions
already known to be involved in scene processing. Our goal was not to identify
new “scene-selective” areas, but to articulate how scenes are neurally represented
in terms of mid-level scene attributes. That is, we still know very little about how
scenes are neurally encoded and processed. Scenes are extremely complex stimuli,
rich with informative visual features at many different scales. What is unknown is
the “vocabulary” of these features—at present, there is no model for articulating
and defining these visual features that may then be tested against neural scene
processing data.
As mentioned above, we are interested in those mid-level visual features
that are built up from low-level features, and are combined to form high-level
features. In particular, such intermediate features seem likely to play a critical role
in visual recognition (Ullman, Vidal-Naquet, & Sali, 2002). By way of example,
although intuitively we can distinguish a “contemporary” apartment from a “rustic”
apartment, possibly based on the objects present in each scene, there are also
many non-semantic, mid-level visual features that may separate these categories.
Critically, such features are difficult to label or define. As such, it is at this
mid-level that the field of human visual science has fallen short in articulating
clear hypotheses. Why are mid-level features difficult to define? When trying to
articulate potential mid-level visual features we, as theorists, are biased and limited
in two ways. First, we are limited in that we define only those features that we can
explicitly label. However, useful visual features may not be easy to describe and
label, and therefore may not be obvious. This leads to the second limitation: We
are limited to defining those features that we think are important. Yet, introspection
does not provide conscious access to much (or even most) of our visual processing.
In order to move beyond these limitations and define a set of mid-level features
that may correlate with more high-level semantics, we have adopted artificial vision
models trained on “Big Data.” Big Data is likely to be useful here because it
FIGURE 15.3 fMRI results. (a) Regions of interest in scene-selective cortex—PPA,
RSC, and TOS. (b) Activity within an ROI varies across voxels. We create a feature
space using fMRI data with the responses from each voxel within an ROI for each
scene. This feature space is then cross-correlated to create a similarity, or correlation,
matrix that represents the “scene space” for that set of data. Using the data from
computer vision models, the feature space would consist of the different features of the
model instead of voxels, as illustrated here. (c) Cross-correlation of the fMRI similarity
matrix with different computer vision models and behavior. As can be seen, NEIL
does just about as well as SUN, whereas SIFT, a low-level visual model, does not do
nearly as well. (d) A hierarchical regression was run to examine what unique variance
can be accounted for by NEIL. The order of blocks was 1—all low-level models,
2—GIST (a low-level model of spatial frequency that has been shown to be important
in scene perception), 3—GEOM (a probability map of scene sections), 4—SUN,
5—NEIL, and then 6—behavioral data. (e) A table representing the change in R with
each sequential block. NEIL significantly accounted for unique variance, above all
other computer vision models used, in the PPA and TOS (adapted from Aminoff et
al., 2015).
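The similarity-matrix logic described in the caption can be made concrete with a short sketch. Here the arrays are random placeholders: one "scenes by voxels" matrix standing in for an ROI's responses and one "scenes by features" matrix standing in for an artificial vision model such as NEIL or SUN; each is cross-correlated across scenes and the resulting scene spaces are then compared.

```python
import numpy as np

# Placeholder feature spaces: rows are scenes, columns are either voxels
# (for an ROI) or model features (for an artificial vision model).
rng = np.random.default_rng(1)
n_scenes = 100
roi_responses = rng.standard_normal((n_scenes, 300))    # scenes x voxels
model_features = rng.standard_normal((n_scenes, 150))   # scenes x model features

# Cross-correlate scenes within each feature space to obtain a "scene space".
roi_similarity = np.corrcoef(roi_responses)       # n_scenes x n_scenes
model_similarity = np.corrcoef(model_features)    # n_scenes x n_scenes

# Compare the two scene spaces over their unique (upper-triangle) entries.
iu = np.triu_indices(n_scenes, k=1)
match = np.corrcoef(roi_similarity[iu], model_similarity[iu])[0, 1]
print(f"ROI vs. model scene-space correlation: {match:.3f}")
```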
location of the sky and ground, where these sections capture very broad features
that intuitively seem important for scene representation. Again, consistent with
our expectations, we observed that NEIL accounted for unique variance in the
responses seen within PPA and TOS; somewhat puzzling was the fact that NEIL
also accounted for unique variance in early visual control regions (Fig. 15.3(d,e)).
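The unique-variance claim rests on a hierarchical regression in which predictor blocks are entered in a fixed order and the change in R is recorded at each step. The sketch below mirrors that logic with placeholder data and the block ordering given in the caption; it is an illustration of the procedure, not the analysis reported in Aminoff et al. (2015).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder outcome and predictor blocks; in the actual analysis the
# outcome is the ROI's (vectorized) scene-similarity structure and each
# block holds the corresponding model similarity values.
rng = np.random.default_rng(2)
n_obs = 4950                      # e.g. upper-triangle entries of a 100 x 100 matrix
y = rng.standard_normal(n_obs)
blocks = {
    "low-level models": rng.standard_normal((n_obs, 3)),
    "GIST": rng.standard_normal((n_obs, 1)),
    "GEOM": rng.standard_normal((n_obs, 1)),
    "SUN": rng.standard_normal((n_obs, 1)),
    "NEIL": rng.standard_normal((n_obs, 1)),
}

# Enter blocks sequentially and report the change in R after each one.
X, prev_r = None, 0.0
for name, block in blocks.items():
    X = block if X is None else np.hstack([X, block])
    r = np.sqrt(max(LinearRegression().fit(X, y).score(X, y), 0.0))
    print(f"after adding {name}: R = {r:.3f} (delta R = {r - prev_r:.3f})")
    prev_r = r
```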
To be clear, we are not under the illusion that NEIL is a model of human
vision or that the features that emerge from NEIL are the ideal candidate features
for understanding human vision. At the same time, in terms of both inputs and
goals, NEIL and human vision share a great deal. As such, NEIL may serve as
a proxy model—a first step in elucidating a comprehensive account of how we
learn and represent visual information. To the extent that we find NEIL to be
effective in serving this role, we believe that much of its power lies in its use of
large-scale data—learning over millions of images. Data of this scale enables NEIL
to derive features and attributes from emergent statistical regularities that would
otherwise be unavailable to vision scientists. Thus, although our fMRI data is not
“big,” we are able to take advantage of Big Data approaches. In particular, here we
examined neural scene representation and found that NEIL significantly accounted
for variance in patterns of neural activity within scene-selective ROIs. NEIL’s
performance was equivalent or near-equivalent to another artificial model, SUN,
in which features were selected based on intuition (Figure 15.3(c)). Moreover,
NEIL was able to account for unique variance over and above all other artificial
vision models (Figure 15.3(d, e)).
One of the most important aspects of our results using NEIL is that they are
likely to be both scalable and generalizable. Hand-tuned models such as SUN are
only effective when the right features are chosen for a particular set of images.
When the set of images changes, for example, shifting from scenes to objects
or from one subset of scenes to another, models such as SUN may need to be
“reseeded” with new features. In contrast, NEIL learns and makes explicit features
and attributes that are likely to support the recognition of new classes of images.
As such, deploying Big Data artificial vision models such as NEIL or deep neural
networks moves us a step closer to developing successful models of human vision.
Conclusions
At present, both within and outside science, Big Data is, well . . . big. The
question is whether this degree of excitement is warranted—will heretofore
unavailable insights and significant advances in the study of human vision emerge
through novel applications of these new approaches? Or will vision scientists be
disappointed as the promise of these new methods dissipates without much in the
way of real progress (Figure 15.4)? Put another way, are we at the peak of inflated
expectations or the plateau of productivity (Figure 15.5)?
FIGURE 15.4 The Massachusetts Institute of Technology Project MAC Summer Vision
Project. An overly optimistic view of the difficulty of modeling human vision circa
1966. Oops.
It is our speculation that the application of Big Data to biological vision is more
likely at the peak of inflated expectations than at the plateau of productivity. At the
same time, we are somewhat optimistic that the trough will be shallow and that the
toolbox afforded by Big Data will have a significant and lasting impact on the study
of vision. In particular, Big Data has already engendered dramatic advances in our
ability to process and organize visual content and build high-performing artificial
vision systems. However, we contend that, as of 2015, Big Data has actually had
little direct impact on visual cognitive neuroscience. Rather, advances have come
from the application of large-scale content analysis to neural data. That is, given a
shortage of well-specified models of human vision, Big-Data models that capture
both the breadth and the structure of the visual world can serve to help increase our
understanding of how the brain represents and processes visual images. However,
even this sort of application is data limited due to the many constraints imposed
by present-day neuroimaging methods: The dimensionality of the models being
applied is dramatically larger than the dimensionality of the currently available
neural data.
FIGURE 15.5 The Gartner Hype Cycle: visibility plotted over time, rising from the
technology trigger to the peak of inflated expectations, falling into the trough of
disillusionment, and climbing the slope of enlightenment toward the plateau of
productivity. Only time will tell whether Big Data is sitting at the peak of inflated
expectations or at the plateau of productivity. (Retrieved October 19, 2015 from
https://commons.wikimedia.org/wiki/File:Gartner_Hype_Cycle.svg).
Acknowledgments
This work was supported by the National Science Foundation, award 1439237,
and by the Office of Naval Research, award MURI N000141010934.
Notes
1 A decade from now this definition may seem quaint and the concept of “big”
might be something on the order of 109 samples. Of course, our hedge is
likely to be either grossly over-optimistic or horribly naive, with the actual
conception of “big” in 2025 being much larger or much smaller.
2 This limit is based on using state-of-the-art, 7T MRI scanners. However, most
institutions do not have access to such scanners. Moreover, high-field scanning
introduces additional constraints. In particular, many individuals suffer from
Can Big Data Help Us Understand Human Vision? 361
nausea, headaches, or visual phosphenes if they move too quickly within the
magnetic field. As a consequence, even with access to a high-field scanner,
most researchers choose to use lower field, 3T systems, where the minimum
voxel size is about 1.2 to 1.5 mm3.
3 Within the neuroimaging community, there has been a strong focus on
advancing the science by improving the spatial resolution of extant methods.
This has been particularly apparent in MRI, where bigger magnets, better
broadcast/antenna combinations, and innovations in scanning protocols have
yielded significant improvements in resolution.
4 At the extreme end of functional imaging, “resting state” or “functional
connectivity” MRI (Buckner et al., 2013) allows, through the absence of
any task, a higher rate of data collection. Using state-of-the-art scanners,
about 2,300 samples of a 100,000 voxel volume (3 mm3 voxels) can be
collected in one hour using a 700 ms sample rate (assuming no rest periods
or breaks). Other, non-functional, MRI methods may offer even higher
spatial sampling rates. For example, although less commonly employed as a
tool for understanding vision (but see Pyles, Verstynen, Schneider, & Tarr,
2013; Thomas et al., 2009), diffusion imaging may provide
as many as 650 samples per 2 mm3 voxel, where there are about 700,000
voxels in a brain volume. Moreover, because one is measuring connectivity
between these voxels, the total number of potential connections that could be
computed is 490,000,000,000. At the same time, as with most neuroimaging
measurements, there is a lack of independence between samples in structural
diffusion imaging, and the high dimensionality of such data suggests complexity,
but not necessarily “Big Data” of the form that provides leverage into solving
heretofore difficult problems.
References
Agrawal, P., Stansbury, D., Malik, J., & Gallant, J. L. (2014). Pixels to voxels: modeling visual
representation in the human brain. arXiv E-prints, arXiv: 1407.5104v1 [q-bio.NC].
Retrieved from http://arxiv.org/abs/1407.5104v1.
Aminoff, E. M., Toneva, M., Shrivastava, A., Chen, X., Misra, I., Gupta, A., & Tarr,
M. J. (2015). Applying artificial vision models to human scene understanding. Frontiers in
Computational Neuroscience, 9(8), 1–14. doi: 10.3389/fncom.2015.00008.
Anderson, J. R. (1993). Rules of the mind. Hillsdale, NJ: Erlbaum.
Barenholtz, E., & Tarr, M. J. (2007). Reconsidering the role of structure in vision. In
A. Markman & B. Ross (Eds.), Categories in use (Vol. 47, pp. 157–180). San Diego, CA:
Academic Press.
Buckner, R. L., Krienen, F. M., & Yeo, B. T. (2013). Opportunities and limitations
of intrinsic functional connectivity MRI. Nature Neuroscience, 16(7), 832–837. doi:
10.1038/nn.3423.
Chen, X., Shrivastava, A., & Gupta, A. (2013). NEIL: Extracting visual knowledge from
web data. In Proceedings of the International Conference on Computer Vision (ICCV). Sydney:
IEEE.
Clifford, S. (2008). Finding fame with a prescient call for Obama. New York Times (online,
November 9). Retrieved from www.nytimes.com/2008/11/10/business/media/10silver.
html.
Deng, J., Russakovsky, O., Krause, J., Bernstein, M. S., Berg, A., & Fei-Fei, L. (2014).
Scalable multi-label annotation. In CHI ’14: Proceedings of the SIGCHI conference on human
factors in computing systems (pp. 3099–3102). ACM. doi:10.1145/2556288.2557011.
Fodor, J. A. (1983). Modularity of mind. Cambridge, MA: MIT Press.
Gauthier, I., Tarr, M. J., Anderson, A. W., Skudlarski, P., & Gore, J. C. (1999). Activation
of the middle fusiform ‘face area’ increases with expertise in recognizing novel objects.
Nature Neuroscience, 2(6), 568–573. doi: 10.1038/9224.
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L.
(2009). Detecting influenza epidemics using search engine query data. Nature, 457(7232),
1012–1014. doi:10.1038/nature07634.
Haxby, J. V., Gobbini, M. I., Furey, M. L., Ishai, A., Schouten, J. L., & Pietrini, P. (2001).
Distributed and overlapping representations of faces and objects in ventral temporal
cortex. Science, 293(5539), 2425–2430. doi:10.1126/science.1063736.
Huth, A. G., Nishimoto, S., Vu, A. T., & Gallant, J. L. (2012). A continuous semantic
space describes the representation of thousands of objects and action categories across the
human brain. Neuron, 76(6), 1210–1224. doi: 10.1016/j.neuron.2012.10.014.
Kanwisher, N., McDermott, J., & Chun, M. M. (1997). The fusiform face area: A module
in human extrastriate cortex specialized for face perception. Journal of Neuroscience, 17(11),
4302–4311.
Khaligh-Razavi, S. M., & Kriegeskorte, N. (2014). Deep supervised, but not unsupervised,
models may explain IT cortical representation. PLoS Computational Biology, 10(11),
e1003915. doi: 10.1371/journal.pcbi.1003915.
Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept
learning through probabilistic program induction. Science, 350(6266), 1332–1338. doi:
10.1126/science.aab3050.
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google Flu:
Traps in Big Data analysis. Science, 343(6176), 1203–1205. doi: 10.1126/science.1248506.
Le, Q. V., Ranzato, M. A., Monga, R., Devin, M., Chen, K., Corrado, G. S., . . . Ng,
A. Y. (2012). Building high-level features using large scale unsupervised learning. In
International Conference in Machine Learning.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
doi: 10.1038/nature14539.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM,
38(11), 39–41.
Mitchell, T., Cohen, W., Hruschka, E., Talukdar, P., Betteridge, J., Carlson, A.,
. . . Welling, J. (2015). Never-ending learning. In Proceedings of the Twenty-Ninth AAAI
Conference on Artificial Intelligence, AAAI.
Mumford, J. A., Turner, B. O., Ashby, F. G., & Poldrack, R. A. (2012). Deconvolving
BOLD activation in event-related designs for multivoxel pattern classification analyses.
Neuroimage, 59(3), 2636–2643. doi: 10.1016/j.neuroimage.2011.08.076.
Norman, K. A., Polyn, S. M., Detre, G. J., & Haxby, J. V. (2006). Beyond mind-reading:
Multi-voxel pattern analysis of fMRI data. Trends in Cognitive Science, 10(9), 424–430.
doi: 10.1016/j.tics.2006.07.005.
Oxford English Dictionary. (2016). Oxford University Press. Retrieved from www.
oed.com/view/Entry/18833.
Patterson, G., Xu, C., Su, H., & Hays, J. (2014). The SUN attribute database:
Beyond categories for deeper scene understanding. International Journal of Computer Vision,
108(1–2), 59–81.
Pinker, S. (1999). Words and rules: The ingredients of language (pp. xi, 348). New York, NY:
Basic Books Inc.
Pyles, J. A., Verstynen, T. D., Schneider, W., & Tarr, M. J. (2013). Explicating
the face perception network with white-matter connectivity. PLoS One, 8(4). doi:
10.1371/journal.pone.0061611.
Rosenberg, C., & Image Search Team. (2013). Improving photo search: A step
across the semantic gap. Google research blog. Retrieved from http://googleresearch.
blogspot.com/2013/06/improving-photo-search-step-across.html.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., . . . Fei-Fei, L. (2015).
ImageNet large scale visual recognition challenge. International Journal of Computer Vision,
115(3), 211–252. doi: 10.1007/s11263-015-0816-y.
Sawchik, T. (2015). Big Data baseball: Math, miracles, and the end of a 20-year losing streak.
New York, NY: Flatiron Books.
Sereno, M. I., & Allman, J. M. (1991) Cortical visual areas in mammals. In A.G. Leventhal
(Ed.), The Neural Basis of Visual Function (pp. 160–172). London: Macmillan.
Stansbury, D. E., Naselaris, T., & Gallant, J. L. (2013). Natural scene statistics account for
the representation of scene categories in human visual cortex. Neuron, 79(5), 1025–1034.
doi: 10.1016/j.neuron.2013.06.034.
Thomas, C., Avidan, G., Humphreys, K., Jung, K. J., Gao, F., & Behrmann, M. (2009).
Reduced structural connectivity in ventral visual cortex in congenital prosopagnosia.
Nature Neuroscience, 12(1), 29–31. doi: 10.1038/nn.2224.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system.
Nature, 381(6582), 520–522. doi: 10.1038/381520a0.
Ullman, S., Vidal-Naquet, M., & Sali, E. (2002). Visual features of intermediate complexity
and their use in classification. Nature Neuroscience, 5(7), 682–687.
Wallis, G. (1996). Using spatio-temporal correlations to learn invariant object recognition.
Neural Networks, 9(9), 1513–1519.
Woloszyn, L., & Sheinberg, D. L. (2012). Effects of long-term visual experience on
responses of distinct classes of single units in inferior temporal cortex. Neuron, 74(1),
193–205. doi: 10.1016/j.neuron.2012.01.032.
Yamins, D. L. K., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J.
(2014). Performance-optimized hierarchical models predict neural responses in higher
visual cortex. Proceedings of the National Academy of Sciences of the United States of America,
111(23), 8619–8624.
Yang, Y., Tarr, M. J., & Kass, R. E. (2014). Estimating learning effects: A short-time
Fourier transform regression model for MEG source localization. In Springer lecture notes
on artificial intelligence: MLINI 2014: Machine learning and interpretation in neuroimaging.
New York: Springer.