
BIG DATA IN COGNITIVE SCIENCE

While laboratory research is the backbone of collecting experimental data in
cognitive science, a rapidly increasing amount of research is now capitalizing on
large-scale and real-world digital data. Each piece of data is a trace of human
behavior and offers us a potential clue to understanding basic cognitive principles.
However, we have to be able to put the pieces together in a reasonable way,
which necessitates both advances in our theoretical models and development of
new methodological techniques.
The primary goal of this volume is to present cutting-edge examples of mining
large-scale and naturalistic data to discover important principles of cognition and
evaluate theories that would not be possible without such a scale. This book also
has a mission to stimulate cognitive scientists to consider new ways to harness Big
Data in order to enhance our understanding of fundamental cognitive processes.
Finally, this book aims to warn of the potential pitfalls of using, or being
over-reliant on, Big Data and to show how Big Data can work alongside traditional,
rigorously gathered experimental data rather than simply supersede it.
In sum, this groundbreaking volume presents cognitive scientists and those
in related fields with an exciting, detailed, stimulating, and realistic introduction
to Big Data—and shows how it may greatly advance our understanding of
the principles of human memory, perception, categorization, decision-making,
language, problem-solving, and representation.

Michael N. Jones is the William and Katherine Estes Professor of Psychology,
Cognitive Science, and Informatics at Indiana University, Bloomington, and the
Editor-in-Chief of Behavior Research Methods. His research focuses on large-scale
computational models of cognition, and statistical methodology for analyzing
massive datasets to understand human behavior.
BIG DATA IN
COGNITIVE SCIENCE

Edited by Michael N. Jones


First published 2017
by Routledge
711 Third Avenue, New York, NY 10017
and by Routledge
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2017 Taylor & Francis
The right of Michael N. Jones to be identified as the author of the editorial material,
and of the authors for their individual chapters, has been asserted in accordance
with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced or
utilized in any form or by any electronic, mechanical, or other means, now
known or hereafter invented, including photocopying and recording, or in
any information storage or retrieval system, without permission in writing
from the publishers.
Trademark notice: Product or corporate names may be trademarks or registered
trademarks, and are used only for identification and explanation without intent
to infringe.
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library.
Library of Congress Cataloging in Publication Data
Names: Jones, Michael N., 1975– editor.
Title: Big data in cognitive science / edited by Michael N. Jones.
Description: New York, NY : Routledge, 2016. |
Includes bibliographical references and index.
Identifiers: LCCN 2016021775|
ISBN 9781138791923 (hb : alk. paper) |
ISBN 9781138791930 (pb : alk. paper) |
ISBN 9781315413570 (ebk)
Subjects: LCSH: Cognitive science–Research–Data processing. |
Data mining. | Big data.
Classification: LCC BF311 .B53135 2016 | DDC 153.0285–dc23
LC record available at https://lccn.loc.gov/2016021775
ISBN: 978-1-138-79192-3 (hbk)
ISBN: 978-1-138-79193-0 (pbk)
ISBN: 978-1-315-41357-0 (ebk)

Typeset in Bembo
by Out of House Publishing
CONTENTS

Contributors vii

1 Developing Cognitive Theory by Mining Large-scale Naturalistic Data 1
  Michael N. Jones

2 Sequential Bayesian Updating for Big Data 13
  Zita Oravecz, Matt Huentelman, and Joachim Vandekerckhove

3 Predicting and Improving Memory Retention: Psychological Theory Matters in the Big Data Era 34
  Michael C. Mozer and Robert V. Lindsey

4 Tractable Bayesian Teaching 65
  Baxter S. Eaves Jr., April M. Schweinhart, and Patrick Shafto

5 Social Structure Relates to Linguistic Information Density 91
  David W. Vinson and Rick Dale

6 Music Tagging and Listening: Testing the Memory Cue Hypothesis in a Collaborative Tagging System 117
  Jared Lorince and Peter M. Todd

7 Flickr® Distributional Tagspace: Evaluating the Semantic Spaces Emerging from Flickr® Tag Distributions 144
  Marianna Bolognesi

8 Large-scale Network Representations of Semantics in the Mental Lexicon 174
  Simon De Deyne, Yoed N. Kenett, David Anaki, Miriam Faust, and Daniel Navarro

9 Individual Differences in Semantic Priming Performance: Insights from the Semantic Priming Project 203
  Melvin J. Yap, Keith A. Hutchison, and Luuan Chin Tan

10 Small Worlds and Big Data: Examining the Simplification Assumption in Cognitive Modeling 227
   Brendan Johns, Douglas J. K. Mewhort, and Michael N. Jones

11 Alignment in Web-based Dialogue: Who Aligns, and How Automatic Is It? Studies in Big-Data Computational Psycholinguistics 246
   David Reitter

12 Attention Economies, Information Crowding, and Language Change 270
   Thomas T. Hills, James S. Adelman, and Takao Noguchi

13 Decision by Sampling: Connecting Preferences to Real-World Regularities 294
   Christopher Y. Olivola and Nick Chater

14 Crunching Big Data with Fingertips: How Typists Tune Their Performance Toward the Statistics of Natural Language 320
   Lawrence P. Behmer Jr. and Matthew J. C. Crump

15 Can Big Data Help Us Understand Human Vision? 343
   Michael J. Tarr and Elissa M. Aminoff

Index 364
CONTRIBUTORS

James S. Adelman, University of Warwick
Elissa M. Aminoff, Fordham University
David Anaki, Bar-Ilan University
Lawrence P. Behmer Jr., Brooklyn College of the City University of New York
Marianna Bolognesi, University of Amsterdam
Nick Chater, University of Warwick
Matthew J. C. Crump, Brooklyn College of the City University of New York
Rick Dale, University of California at Merced
Simon De Deyne, University of Adelaide
Baxter S. Eaves Jr., Rutgers University
Miriam Faust, Bar-Ilan University
Thomas T. Hills, University of Warwick
Matt Huentelman, Translational Genomics Research Institute
Keith A. Hutchison, Montana State University
Brendan Johns, University at Buffalo
Yoed N. Kenett, University of Pennsylvania
Robert V. Lindsey, University of Colorado
Jared Lorince, Northwestern University
Douglas J. K. Mewhort, Queen’s University
Michael C. Mozer, University of Colorado
Daniel Navarro, University of New South Wales
Takao Noguchi, University College, London
Christopher Y. Olivola, Carnegie Mellon University
Zita Oravecz, The Pennsylvania State University
David Reitter, The Pennsylvania State University
April M. Schweinhart, Rutgers University
Patrick Shafto, Rutgers University
Luuan Chin Tan, National University of Singapore
Michael J. Tarr, Carnegie Mellon University
Peter M. Todd, Indiana University
Joachim Vandekerckhove, University of California, Irvine
David W. Vinson, University of California at Merced
Melvin J. Yap, National University of Singapore
1
DEVELOPING COGNITIVE THEORY BY
MINING LARGE-SCALE NATURALISTIC
DATA
Michael N. Jones

Abstract
Cognitive research is increasingly coming out of the laboratory. It is becoming much
more common to see research that repurposes large-scale and naturalistic data sources to
develop and evaluate cognitive theories at a scale not previously possible. We now have
unprecedented availability of massive digital data sources that are the product of human
behavior and offer clues to understand basic principles of cognition. A key challenge for
the field is to properly interrogate these data in a theory-driven way to reverse engineer
the cognitive forces that generated them; this necessitates advances in both our theoretical
models and our methodological techniques. The arrival of Big Data has been met with
healthy skepticism by the field, but has also been seen as a genuine opportunity to advance
our understanding of cognition. In addition, theoretical advancements from Big Data are
heavily intertwined with new methodological developments—new techniques to answer
questions from Big Data also give us new questions that could not previously have been
asked. The goal of this volume is to present emerging examples from across the field that
use large and naturalistic data to advance theories of cognition that would not be possible in
the traditional laboratory setting.

While laboratory research is still the backbone of tracking causation among


behavioral variables, more and more cognitive research is now letting experimental
control go in favor of mining large-scale and real-world datasets. We are seeing
an exponential1 expansion of data available to us that is the product of human
behavior: Social media, mobile device sensors, images, RFID tags, linguistic
corpora, web search logs, and consumer product reviews, just to name a few
streams. Since 2012, about 2.5 exabytes of digital data are created every day
(McAfee, Brynjolfsson, Davenport, Patil, & Barton, 2012). Each little piece of
data is a trace of human behavior and offers us a potential clue to understand basic
cognitive principles; but we have to be able to put all those pieces together in a
reasonable way. This approach necessitates both advances in our theoretical models
and development of new methodological techniques adapted from the information
sciences.
Big Data sources are now allowing cognitive scientists to evaluate theoretical
models and make new discoveries at a resolution not previously possible. For
example, we can now use online services like Netflix, Amazon, and Yelp to
evaluate theories of decision-making in the real world and at an unprecedented
scale. Wikipedia edit histories can be analyzed to explore information transmission
and problem solving across groups. Linguistic corpora allow us to quantitatively
evaluate theories of language adaptation over time and generations (Lupyan &
Dale, 2010) and models of linguistic entrainment (Fusaroli, Perlman, Mislove,
Paxton, Matlock, & Dale, 2015). Massive image repositories are being used to
advance models of vision and perception based on natural scene statistics (Griffiths,
Abbott, & Hsu, 2016; Khosla, Raju, Torralba, & Oliva, 2015). Twitter and
Google search trends can be used to track the outbreak and spread of “infectious”
ideas, memory contagion, and information transmission (Chen & Sakamoto, 2013;
Masicampo & Ambady, 2014; Wu, Hofman, Mason, & Watts, 2011). Facebook
feeds can be manipulated2 to explore information diffusion in social networks
(Bakshy, Rosenn, Marlow, & Adamic, 2012; Kramer, Guillory, & Hancock,
2014). Theories of learning can be tested at large scales and in real classroom
settings (Carvalho, Braithwaite, de Leeuw, Motz, & Goldstone, 2016; Fox,
Hearst, & Chi, 2014). Speech logs afford both theoretical advancements in auditory
speech processing, and practical advancements in automatic speech comprehension
systems.
The primary goal of this volume is to present cutting-edge examples that use
large and naturalistic data to uncover fundamental principles of cognition and
evaluate theories that would not be possible without such scale. A more general
aim of the volume is to take a very careful and critical look at the role of Big Data
in our field. Hence contributions to this volume were handpicked to be examples
of advancing theory development with large and naturalistic data.

What is Big Data?


Before trying to evaluate whether Big Data could be used to benefit cognitive
science, a very fair question is simply what is Big Data? Big Data is a very
popular buzzword in the contemporary media, producing much hype and many
misconceptions. Whatever Big Data is, it is having a revolutionary impact on a
wide range of sciences, is a “game-changer,” transforming the way we ask and
answer questions, and is a must-have for any modern scientist’s toolbox. But when
pressed for a definition, there seems to be no solid consensus, particularly among
cognitive scientists. We know it probably doesn’t fit in a spreadsheet, but opinions
diverge beyond that. The issue is now almost humorous, with Dan Ariely’s popular
quip comparing Big Data to teenage sex, in that “everyone talks about it, nobody
really knows how to do it, everyone thinks everyone else is doing it, so everyone
claims they are doing it.”
As scientists, we are quite fond of careful operational definitions. However,
Big Data and data science are still-evolving concepts, and are moving targets for
formal definition. Definitions tend to vary depending on the field of study. A
strict interpretation of Big Data from the computational sciences typically refers to
datasets that are so massive and rapidly changing that our current data processing
methods are inadequate. Hence, it is a drive for the development of distributed
storage platforms and algorithms to analyze datasets that are currently out of
reach. The term extends to challenges inherent in data capture, storage, transfer,
and predictive analytics. As a loose quantification, data under this interpretation
currently become “big” at scales north of the exabyte.
Under this strict interpretation, work with true Big Data is by definition quite
rare in the sciences; it is more development of architectures and algorithms to
manage these rapidly approaching scale challenges that are still for the most part on
the horizon (NIST Big Data Working Group, 2014). At this scale, it isn’t clear that
there are any problems in cognitive science that are true Big Data problems yet.
Perhaps the largest data project in the cognitive and neural sciences is the Human
Connectome Project (Van Essen et al., 2012), an ambitious project aiming to
construct a network map of anatomical and functional connectivity in the human
brain, linked with batteries of behavioral task performance. Currently, the project
is approaching a petabyte of data. By comparison, the Large Hadron Collider
project at CERN records and stores over 30 petabytes of data from experiments
each year.3
More commonly, the Gartner 3 Vs definition of Big Data is used across multiple
fields: “Big data is high volume, high velocity, and/or high variety information
assets that require new forms of processing to enable enhanced decision-making,
insight discovery and process optimization” (Laney, 2012). Volume is often
indicative of the fact that Big Data records and observes everything within a
recording register, in contrast to our commonly used methods of sampling in the
behavioral sciences. Velocity refers to the characteristic that Big Data is often a
real-time stream of rapidly captured data. The final characteristic, variety, denotes
that Big Data draws from multiple qualitatively different information sources (text,
audio, images, GPS, etc.), and uses joint inference or fusion to answer questions
that are not possible by any source alone. But far from being expensive to collect,
Big Data is usually a natural byproduct of digital interaction.
So while a strict interpretation of Big Data puts it currently out of reach, it
is simultaneously everywhere by more liberal interpretations. Predictive analytics
based on machine learning has been hugely successful in many applied settings
(see Hu, Wen, & Chua, 2014, for a review). Newer definitions of Big Data
summarize it as more focused on repurposing naturalistic digital footprints; the
size of “big” is relative across different fields (NIST Big Data Working Group,
2014). The NIH BD2K (Big Data to Knowledge) program is explicit that a Big
Data approach is best defined by what is large and naturalistic to specific subfields,
not an absolute value in bytes. In addition, BD2K notes that a core Big Data
problem involves joint inference across multiple databases. Such combinatorial
problems are clearly Big Data, and are perfectly suited for theoretically driven
cognitive models—many answers to current theoretical and practical questions may
be hidden in the complementary relationship between data sources.

What is Big Data to Cognitive Science?


Much of the publicity surrounding Big Data has focused on its insight power
for business analytics. Within the cognitive sciences, we have been considerably
more skeptical of Big Data’s promise, largely because we place such a high value
on explanation over prediction. A core goal of any cognitive scientist is to fully
understand the system under investigation, rather than being satisfied with a simple
descriptive or predictive theory.
Understanding the mind is what makes an explanatory cognitive model distinct
from a statistical predictive model—our parameters often reflect hypothesized
cognitive processes or representations (e.g. attention, memory capacity, decision
thresholds, etc.) as opposed to the abstract predictive parameters of, say, weights
in a regression model. Predictive models are able to make predictions of new data
provided they are of the same sort as the data on which the model was trained (e.g.
predicting a new point on a forgetting curve). Cognitive models go a step further:
An explanatory model should be able to make predictions of how the human will
behave in situations and paradigms that are novel and different from the situations
on which the model was built but that recruit the same putative mechanism(s) (e.g.
explaining the process of forgetting).
Marcus and Davis (2014) have argued rather convincingly that Big Data is
a scientific idea that should be retired. While it is clear that large datasets are
useful in discovering correlations and predicting common patterns, more data do
not on their own yield explanatory causal relationships. Big Data and machine
learning techniques are excellent bedfellows to make predictions with greater
fidelity and accuracy. But the match between Big Data and cognitive models is
less clear; because most cognitive models strive to explain causal relationships, they
may be much better paired with experimental data, which shares the same goal.
Marcus and Davis note several ways in which paying attention to Big Data may
actually lead the scientist astray, compared to a much smaller amount of data from
a well-controlled laboratory scenario.
In addition, popular media headlines are chock-full of statements about how
theory is obsolete now that Big Data has arrived. But theory is a simplified model
of empirical phenomena—theory explains data. If anything, cognitive theory is more
necessary to help us understand Big Data in a principled way given that much of
the data were generated by the cognitive systems that we have carefully studied in
the laboratory, and cognitive models help us to know where to search and what to
search for as the data magnitude grows.
Despite initial skepticism, Big Data has also been embraced by cognitive science
as a genuine opportunity to develop and refine cognitive theory (Griffiths, 2015).
Criticism of research using Big Data in an atheoretic way is a fair critique of
the way some scientists (and many outside academia) are currently using Big
Data. However, there are also scientists making use of large datasets to test
theory-driven questions—questions that would be unanswerable without access
to large naturalistic datasets and new machine learning approaches. Cognitive
scientists are, by training, [experimental] control freaks. But the methods used
by the field to achieve laboratory control also serve to distract it from exploring
cognitive mechanisms through data mining methods applied to Big Data.
Certainly, Big Data is considerably more information than we typically collect
in a laboratory experiment. But it is also naturalistic, and a footprint of cognitive
mechanisms operating in the wild (see Goldstone & Lupyan, 2016, for a recent
survey). There is a genuine concern in the cognitive sciences that many models
we are developing may be overfit to specific laboratory phenomena that neither
exist nor can be generalized beyond the walls of the lab. The standard cognitive
experiment takes place in one hour in a well-controlled setting with variables
that normally covary in the real world held constant. This allows us to determine
conclusively that the flow of causation is from our manipulated variable(s) to the
dependent variable, and often by testing discrete settings (“factorology”; Balota,
Yap, Hutchison, & Cortese, 2012).
It is essential to remember that the cognitive mechanisms we study in the
laboratory evolved to handle real information-processing problems in the real
world. By “capturing” and studying a mechanism in a controlled environment,
we risk discovering experiment or paradigm-specific strategies that are a response
to the experimental factors that the mechanism did not evolve to handle, and in a
situation that does not exist in the real world. While deconfounding factors is an
essential part of an experiment, the mechanism may well have evolved to thrive
in a rich statistically redundant environment. In this sense, cognitive experiments
in the lab may be somewhat analogous to studying captive animals in the zoo and
then extrapolating to behavior in the wild.
The field has been warned about over-reliance on experiments several times
in the past. Even four decades ago Estes (1975) raised a concern in mathematical
psychology that we may be accidentally positing mechanisms that apply only to
artificial situations, and that our experiments may unknowingly hold constant
factors that may covary to produce very different behavior in the real world. More
recently, Miller (1990) reminded cognitive scientists of Estes’ reductionism caution:
I have observed over the years that there is a tendency for even the best
cognitive scientists to lose sight of large issues in their devotion to particular
methodologies, their pursuit of the null hypothesis, and their rigorous efforts
to reduce anything that seems interesting to something else that is not. An
occasional reminder of why we flash those stimuli and measure those reaction
times is sometimes useful.
(Miller, 1990: 7)
Furthermore, we are now discovering that much of the behavior we want
to use to make inferences about cognitive mechanisms is heavy-tail distributed
(exponential and power-law distributions are very common in cognitive research).
Sampling behavior in a one-hour lab setting is simply insufficient to ever observe
the rare events that allow us to discriminate among competing theoretical accounts.
And building a model from the center of a behavioral distribution may fail horribly
to generalize if the tail of the distribution is the important characteristic that the
cognitive mechanism evolved to deal with.
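As a rough illustration of this sampling problem, the sketch below (not from the chapter; the Pareto tail and the two sample sizes are illustrative assumptions) compares how well a lab-sized sample and a naturalistic-scale sample capture the extreme tail of a heavy-tailed behavioral variable.

```python
# A minimal simulation sketch (illustrative assumptions: a Pareto tail
# with shape 1.5 and arbitrary sample sizes), showing that lab-scale
# samples rarely contain the extreme events a large sample reveals.
import numpy as np

rng = np.random.default_rng(1)

def tail_summary(n):
    sample = rng.pareto(a=1.5, size=n) + 1.0  # heavy right tail
    return np.quantile(sample, 0.999), sample.max()

for label, n in [("lab-scale sample (n = 500)", 500),
                 ("naturalistic sample (n = 1,000,000)", 1_000_000)]:
    q999, mx = tail_summary(n)
    print(f"{label:36s} 99.9th percentile = {q999:8.1f}   max = {mx:12.1f}")
```

The small sample systematically underestimates the extreme quantiles, which is exactly the region a tail-driven mechanism would be tuned to.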
So while skepticism about Big Data in cognitive science is both welcome and
warranted, the above points are just a few reasons why Big Data could be a genuine
opportunity to advance our understanding of human cognition. If dealt with in
a careful and theoretically driven way, Big Data offers us a completely new set
of eyes to understand cognitive phenomena, to constrain among theories that
are currently deadlocked with laboratory data, to evaluate generalizability of our
models, and to have an impact on the real-world situations that our models are
meant to explain (e.g. by optimizing medical and consumer decisions, information
discovery, education, etc.). And embracing Big Data brings with it development of
new analytic tools that also allow us to ask new theoretical questions that we had
not even considered previously.

How is Cognitive Research Changing with Big Data?


Cognitive scientists have readily integrated new technologies for naturalistic data
capture into their research. The classic cognitive experiment typically involved a
single subject in a testing booth making two alternative forced choice responses to
stimuli presented on a monitor. To be clear, we have learned a great deal about
fundamental principles of human cognition with this basic laboratory approach.
But the modern cognitive experiment may involve mobile phone games with
multiple individuals competing in resource sharing simultaneously from all over
the world (Dufau et al., 2011; Miller, 2012), or dyads engaged in real-time
debate while their attention and gestures are captured with Google Glass (Paxton,
Rodriguez, & Dale, 2015).
In addition, modern cognitive research is much more open to mining datasets
that were created for a different purpose to evaluate the models we have developed
from the laboratory experiments. Although the causal links among variables are mur-
kier, they are still possible to explore with new statistical techniques borrowed from
informatics, and the scale of data allows the theorist to paint a more complete
and realistic picture of cognitive mechanisms. Furthermore, online labor markets
such as Amazon’s Mechanical Turk have accelerated the pace of experiments
by allowing us to conduct studies that might take years in the laboratory in
a single day online (Crump, McDonnell, & Gureckis, 2013; Gureckis et al.,
2015).
Examples of new data capture technologies advancing our theoretical inno-
vations are emerging all over the cognitive sciences. Cognitive development is a
prime example. While development unfolds over time, the field has traditionally
been reliant on evaluating infants and toddlers in the laboratory for short studies
at regular intervals across development. Careful experimental and stimulus control
is essential, and young children can only provide us with a rather limited range
of response variables (e.g., preferential looking and habituation paradigms are very
common with infants).
While this approach has yielded very useful information about basic cognitive
processes and how they change, we get only a small snapshot of development.
In addition, the small scale is potentially problematic because many theoretical
models behave in a qualitatively different way depending on the amount and
complexity of data (Frank, Tenenbaum, & Gibson, 2013; McClelland, 2009;
Qian & Aslin, 2014; Shiffrin, 2010). Aslin (2014) has also noted that stimulus
control in developmental studies may actually be problematic. We may be
underestimating what children can learn by using oversimplified experimental
stimuli: These controlled stimuli deconfound potential sources of statistical
information in learning, allowing causal conclusions to be drawn, but this may
make the task much more difficult than it is in the real world where multiple
correlated factors offer complementary cues for children to learn the structure of
the world (see Shukla, White, & Aslin, 2011). The result is that we may well
endorse the wrong learning model because it explains the laboratory data well,
but is more complex than is needed to explain learning in the statistically rich real
world.
A considerable amount of developmental research has now come out of the
laboratory. Infants are now wired with cameras to take regular snapshots of
the visual information available to them across development in their real world
experiences (Aslin, 2009; Fausey, Jayaraman, & Smith, 2016; Pereira, Smith, & Yu,
2014). LENATM recording devices are attached to children to record the richness
of their linguistic environments and to evaluate the effect of linguistic environment
on vocabulary growth (VanDam et al., 2016; Weisleder & Fernald, 2013). In one
prominent early example, the Speechome project, an entire house was wired
to record 200,000+ hours of audio and video from one child’s first three years of
life (Roy, Frank, DeCamp, Miller, & Roy, 2015). Tablet-based learning games are
now being designed to collect theoretically constraining data as children are playing
them all over the world (e.g. Frank, Sugarman, Horowitz, Lewis, & Yurovsky,
2016; Pelz, Yung, & Kidd, 2015).
A second prime example of both new data capture methods and data scale
advancing theory is in visual attention. A core theoretical issue surrounds
identification performance as a function of target rarity in visual search, but the
number of trials required to get stable estimates in the laboratory is unrealistic.
Mitroff et al. (2015) opted instead to take a Big Data approach to the problem by
turning visual search into a mobile phone game called “Airport Scanner.” In the
game, participants act the part of a TSA baggage screener searching for prohibited
items as simulated luggage passes through an x-ray scanner. Participants respond
on the touchscreen, and the list of allowed and prohibited items grows as they
continue to play.
Mitroff et al. (2015) analyzed data from the first billion trials of visual search
from the game, making new discoveries about how rare targets are processed
when they are presented with common foils, something that would never have
been possible in the laboratory. Wolfe (1998) had previously analyzed 1 million
visual search trials from across 2,500 experimental sessions which took over
10 years to collect. In contrast, Airport Scanner collects over 1 million trials
each day, and the rate is increasing as the game gains popularity. In addition
to answering theoretically important questions in visual attention and memory,
Mitroff et al.’s example has practical implications for visual detection of rare
targets in applied settings, such as radiologists searching for malignant tumors on
mammograms. Furthermore, data from the game have the potential to give very
detailed information about how people become expert in detection tasks.

Intertwined Theory and Methods


Our theoretical advancements from Big Data and new methodological develop-
ments are heavily interdependent. New methodologies to answer questions from
Big Data are giving us new hypotheses to test. But simultaneously, our new
theoretical models are helping to focus the new Big Data methodologies. Big Data
often flows in as an unstructured stream of information, and our theoretical models
are needed to help tease apart the causal influence of factors, often when the data
are constantly changing. Big Data analyses are not going to replace traditional
laboratory experiments. It is more likely that the two will be complimentary, with
the field settling on a process of recurring iteration between traditional experiments
and data mining methods to progressively zero in on mechanistic accounts of
cognition that explain both levels.
In contrast to our records from behavioral experiments, Big Data is usually
unstructured, and requires sophisticated analytical methods to piece together causal
effects. Digital behavior is often several steps from the cognitive mechanisms we
wish to explore, and these data often confound factors that are carefully teased
apart in the laboratory with experimental control (e.g. the effects of decision,
response, and feedback). To infer causal flow in Big Data, cognitive science has
been adopting more techniques from machine learning and network sciences.4
One concern that accompanies this adoption is that the bulk of current machine
learning approaches to Big Data are primarily concerned with detecting and
predicting patterns, but they tend not to explain why patterns exist. Our ultimate
goal in cognitive science is to produce explanatory models. Predictive models
certainly benefit from more data, but it is questionable whether more data helps to
achieve explanatory understanding of a phenomenon more than a well-controlled
laboratory experiment.
Hence, development of new methods of inquiry from Big Data based on cogni-
tive theory is a priority area of research, and has already seen considerable progress
leading to new tools. Liberman (2014) has compared the advent of such tools in
this century to the inventions of the telescope and microscope in the seventeenth
century. But Big Data and data mining tools on their own are of limited use for
establishing explanatory theories; Picasso had famously noted the same issue about
computers: “But they are useless. They can only give answers.” Big Data in no way
obviates the need for foundational theories based on careful laboratory experimen-
tation. Data mining and experimentation in cognitive science will continue to be
iteratively reinforcing one another, allowing us to generate and answer hypotheses
at a greater resolution, and to draw conclusions at a greater scope.

Acknowledgments
This work was supported by NSF BCS-1056744 and IES R305A150546.

Notes
1 And I don’t use the term “exponential” here simply for emphasis—the amount
of digital information available currently doubles every two years, following
Moore’s Law (Gantz & Reinsel, 2012).
2 However, both the Facebook and OKCupid massive experiments resulted in
significant backlash and ethical complaints.
3 The Large Hadron Collider generates roughly two petabytes of data per second,
but only a small amount is captured and stored.
4 “Drawing Causal Inference from Big Data” was the 2015 Sackler Symposium
organized by the National Academy of Sciences.

References
Aslin, R. N. (2009). How infants view natural scenes gathered from a head-mounted
camera. Optometry and Vision Science: Official publication of the American Academy of
Optometry, 86(6), 561.
Aslin, R. N. (2014). Infant learning: Historical, conceptual, and methodological
challenges. Infancy, 19(1), 2–27.
Bakshy, E., Rosenn, I., Marlow, C., & Adamic, L. (2012). The role of social networks
in information diffusion. In Proceedings of the 21st International Conference on World Wide
Web (pp. 519–528). ACM.
Balota, D. A., Yap, M. J., Hutchison, K. A., & Cortese, M. J. (2012). Megastudies. Visual
word recognition volume 1: Models and methods, orthography and phonology. New York, NY:
Psychology Press, 90–115.
Carvalho, P. F., Braithwaite, D. W., de Leeuw, J. R., Motz, B. A., & Goldstone, R. L.
(2016). An in vivo study of self-regulated study sequencing in introductory psychology
courses. PLoS One 11(3): e0152115.
Chen, R., & Sakamoto, Y. (2013). Perspective matters: Sharing of crisis information
in social media. In System Sciences (HICSS), 2013 46th Hawaii International Conference
(pp. 2033–2041). IEEE.
Crump, M. J., McDonnell, J. V., & Gureckis, T. M. (2013). Evaluating Amazon’s
Mechanical Turk as a tool for experimental behavioral research. PLoS One, 8(3), e57410.
Dufau, S., Duñabeitia, J. A., Moret-Tatay, C., McGonigal, A., Peeters, D., Alario, F. X.,
... & Ktori, M. (2011). Smart phone, smart science: How the use of smartphones can
revolutionize research in cognitive science. PLoS One, 6(9), e24974.
Estes, W. K. (1975). Some targets for mathematical psychology. Journal of Mathematical
Psychology, 12(3), 263–282.
Fausey, C. M., Jayaraman, S., & Smith, L. B. (2016). From faces to hands: Changing visual
input in the first two years. Cognition, 152, 101–107.
Fox, A., Hearst, M. A., & Chi, M. T. H. (Eds.) Proceedings of the First ACM Conference on
Learning At Scale, L@S 2014, March 2014.
Frank, M. C., Sugarman, E., Horowitz, A. C., Lewis, M. L., & Yurovsky, D. (2016). Using
tablets to collect data from young children. Journal of Cognition and Development, 17(1),
1–17.
Frank, M. C., Tenenbaum, J. B., & Gibson, E. (2013). Learning and long-term retention
of large-scale artificial languages. PLoS One, 8(1), e52500.
Fusaroli, R., Perlman, M., Mislove, A., Paxton, A., Matlock, T., & Dale, R. (2015).
Timescales of massive human entrainment. PLoS One, 10(4), e0122742.
Gantz, J., & Reinsel, D. (2012). The digital universe in 2020: Big data, bigger digital
shadows, and biggest growth in the far east. IDC iView: IDC analyze the future, 2007,
1–16.
Goldstone, R. L., & Lupyan, G. (2016). Harvesting naturally occurring data to reveal
principles of cognition. Topics in Cognitive Science., 8(3), 548–568.
Griffiths, T. L. (2015). Manifesto for a new (computational) cognitive revolution. Cogni-
tion, 135, 21–23.
Griffiths, T. L., Abbott, J. T., & Hsu, A. S. (2016). Exploring human cognition using large
image databases. Topics in Cognitive Science, 8(3), 569–588.
Gureckis, T. M., Martin, J., McDonnell, J., Rich, A. S., Markant, D., Coenen,
A., ... & Chan, P. (2015). psiTurk: An open-source framework for conducting
replicable behavioral experiments online. Behavior Research Methods, 1–14. doi:
10.3758/s13428-015-0642-8.
Hu, H., Wen, Y., Chua, T. S., & Li, X. (2014). Toward scalable systems for big data analytics:
A technology tutorial. Access, IEEE, 2, 652–687.
Khosla, A., Raju, A. S., Torralba, A., & Oliva, A. (2015). Understanding and predicting
image memorability at a large scale. In Proceedings of the IEEE International Conference on
Computer Vision (pp. 2390–2398).
Kramer, A. D., Guillory, J. E., & Hancock, J. T. (2014). Experimental evidence of
massive-scale emotional contagion through social networks. Proceedings of the National
Academy of Sciences, 111(24), 8788–8790.
Laney, D. (2012). The importance of ’Big Data’: A definition. Gartner. Retrieved, June 21,
2012.
Liberman, M. (2014). How big data is changing how we study languages. Retrieved from
http://www.theguardian.com/education/2014/may/07/what-big-data-tells-about-
language.
Lupyan, G., & Dale, R. (2010). Language structure is partly determined by social
structure. PLoS One, 5(1), e8559.
McAfee, A., Brynjolfsson, E., Davenport, T. H., Patil, D. J., & Barton, D. (2012). Big
data. The management revolution. Harvard Business Review, 90(10), 61–67.
McClelland, J. L. (2009). The place of modeling in cognitive science. Topics in Cognitive
Science, 1(1), 11–38.
Marcus, G., & Davis, E. (2014). Eight (no, nine!) problems with big data. The New York
Times, 6(4), 2014.
Masicampo, E. J., & Ambady, N. (2014). Predicting fluctuations in widespread interest:
Memory decay and goal-related memory accessibility in Internet search trends. Journal of
Experimental Psychology: General, 143(1), 205.
Miller, G. A. (1990). The place of language in a scientific psychology. Psychological
Science, 1(1), 7–14.
Miller, G. (2012). The smartphone psychology manifesto. Perspectives on Psychological
Science, 7(1), 221–237.
Mitroff, S. R., Biggs, A. T., Adamo, S. H., Dowd, E. W., Winkle, J., & Clark, K. (2015).
What can 1 billion trials tell us about visual search?. Journal of Experimental Psychology:
Human Perception and Performance, 41(1), 1.
NIST Big Data Working Group (2014). http://bigdatawg.nist.gov/home.php.
Paxton, A., Rodriguez, K., & Dale, R. (2015). PsyGlass: Capitalizing on Google Glass for
naturalistic data collection. Behavior Research Methods, 47, 608–619.
Pelz, M., Yung, A., & Kidd, C. (2015). Quantifying curiosity and exploratory play on
touchscreen tablets. In Proceedings of the IDC 2015 Workshop on Digital Assessment and
Promotion of Children’s Curiosity.
Pereira, A. F., Smith, L. B., & Yu, C. (2014). A bottom-up view of toddler word
learning. Psychonomic Bulletin & Review, 21(1), 178–185.
Qian, T., & Aslin, R. N. (2014). Learning bundles of stimuli renders stimulus order as a cue,
not a confound. Proceedings of the National Academy of Sciences, 111(40), 14400–14405.
Roy, B. C., Frank, M. C., DeCamp, P., Miller, M., & Roy, D. (2015). Predicting the
birth of a spoken word. Proceedings of the National Academy of Sciences, 112(41), 12663–
12668.
Shiffrin, R. M. (2010). Perspectives on modeling in cognitive science. Topics in Cognitive
Science, 2(4), 736–750.
Shukla, M., White, K. S., & Aslin, R. N. (2011). Prosody guides the rapid mapping of
auditory word forms onto visual objects in 6-mo-old infants. Proceedings of the National
Academy of Sciences, 108(15), 6038–6043.
VanDam, M., Warlaumont, A., Bergelson, E., Cristia, A., Soderstrom, M., De Palma, P., &
MacWhinney, B. (2016). HomeBank, an online repository of daylong child-centered
audio recordings. Seminars in Speech and Language, 37, 128–142.
Van Essen, D. C., Ugurbil, K., Auerbach, E., Barch, D., Behrens, T. E. J., Bucholz, R.,
... Della Penna, S. (2012). The Human Connectome Project: A data acquisition
perspective. Neuroimage, 62(4), 2222–2231.
Weisleder, A., & Fernald, A. (2013). Talking to children matters: Early language experience
strengthens processing and builds vocabulary. Psychological Science, 24(11), 2143–2152.
Wolfe, J. M. (1998). What can 1 million trials tell us about visual search? Psychological
Science, 9(1), 33–39.
Wu, S., Hofman, J. M., Mason, W. A., & Watts, D. J. (2011). Who says what to
whom on Twitter. In Proceedings of the 20th International Conference on World Wide
Web (pp. 705–714). ACM.
2
SEQUENTIAL BAYESIAN UPDATING
FOR BIG DATA
Zita Oravecz,
Matt Huentelman,
and Joachim Vandekerckhove

Abstract
The velocity, volume, and variety of Big Data present both challenges and opportunities
for cognitive science. We introduce sequential Bayesian updating as a tool to mine these
three core properties. In the Bayesian approach, we summarize the current state of
knowledge regarding parameters in terms of their posterior distributions, and use these
as prior distributions when new data become available. Crucially, we construct posterior
distributions in such a way that we avoid having to repeat computing the likelihood of old
data as new data become available, allowing the propagation of information without great
computational demand. As a result, these Bayesian methods allow continuous inference
on voluminous information streams in a timely manner. We illustrate the advantages of
sequential Bayesian updating with data from the MindCrowd project, in which crowd-sourced
data are used to study Alzheimer’s dementia. We fit an extended LATER (“Linear Approach
to Threshold with Ergodic Rate”) model to reaction time data from the project in order
to separate two distinct aspects of cognitive functioning: speed of information accumulation
and caution.

Introduction
The Big Data era offers multiple sources of data, with measurements that contain
a variety of information in large volumes. For example, neuroimaging data from
a participant might be complemented with a battery of personality tests and a set
of cognitive-behavioral data. At the same time, with brain imaging equipment
more widely accessible the number of participants is unlikely to remain limited
to a handful per study. These advancements allow us to investigate cognitive
phenomena from various angles, and the synthesis of these perspectives requires
highly complex models. Cognitive science is slated to update its set of methods to
foster a more sophisticated, systematic study of human cognition.
Cognitive science has traditionally relied on explicative models to summarize
observed data. Even simple cognitive measurement models (such as, e.g. process
dissociation) are non-linear and can capture complex processes in more interesting
terms than additive elements of true score and noise (such as in regression).
With more complex cognitive process models (e.g. Ratcliff, 1978) we can study
underlying mechanisms in meaningful terms and extract important facets of
cognitive functioning from raw behavioral data.
In this chapter, we focus on methods that can be used for predominantly
model-driven statistical inference. Here we use model-driven as a distinctive term, to
separate these methods from the largely data-driven ones, such as those in machine
learning (for Bayesian methods in machine learning see Zhu, Chen, & Hu, 2014).
In practice, the specifics of a research question, together with relevant domain
knowledge, will inform the choice of methods. Our purpose is not to advocate
one set of methods over the other, but to offer by example an insight into what
model-driven methods can achieve. In particular, we will focus on how Bayesian
methods can be employed to perform model-driven inference for Big Data in an
efficient and statistically coherent manner.
The primary reasoning behind considering model-based inference lies in the
fact that Big Data is often voluminous in both length (number of units, e.g.
people) and width (number of variables, e.g. cognitive measures). While increases
in computing power can help data-driven exploration, this doubly exponential
problem of “thick” datasets often calls for domain-specific expertise. As a start,
simple data curation can help to select variables that matter. These chosen variables
can then be combined into a coherent narrative (in the form of a mathematical
model), which opens up new ways of understanding the complex problem of
human cognition.
First we will review why classical statistical methods are often unsuited for Big
Data purposes. The reason is largely a lack of flexibility in existing methods, but
also the assumptions that are typically made for mathematical convenience, and
the particular way of drawing inference from data. Then we will elaborate on
how Bayesian methods, by contrast, form a principled framework for interpreting
parameter estimates and making predictions. A particular problem with Bayesian
methods, however, is that they can be extremely demanding in terms of
computational load, so one focus of this chapter is on how to reconcile these
issues with Big Data problems. Finally, an example application will focus on a
crowd-sourced dataset as part of a research project on Alzheimer’s dementia (AD).

Two Schools of Statistical Inference


Broadly speaking, there exist two schools of thought in contemporary statistics. In
psychological and cognitive science, the frequentist (or classical) school maintains a
dominant position. Here, we will argue that the Bayesian school (see, e.g. Gelman,
Carlin, Stern, Dunson, Vehtari, & Rubin, 2014; Kruschke, 2014; Lee &
Wagenmakers, 2013), which is rising in popularity, holds particular promise for
the future of Big Data.
The most fundamental difference between the frequentist and the Bayesian
schools lies in the use and interpretation of uncertainty—possibly the most central
issue in statistical inference. In classical statistics (null hypothesis significance
testing/NHST, α and p-values, and confidence intervals), the thought process of
inference starts with an existing hypothesis—usually, the null hypothesis H0 . The
classical reasoning goes: “assuming that the null hypothesis is true, how surprising
are the data I have observed?” The word “surprising” in this context has a very
specific meaning. It means “the probability of a set of observations that is at least as
extreme as the real data”. In the case of a common t-test where the null hypothesis
is that a difference truly is zero but the observation is td , the surprise is given by the
probability of observing a t statistic that is at least as far away from zero as td (i.e.
larger than td if td was positive, and smaller if it was negative). If this probability is
small, then the data are considered to be very surprising, or unlikely, “under the
null,” and the null hypothesis is rejected in favor of the alternative hypothesis H A .
This conditional probability of certain constellations of data given a specific model
(H0 ) is commonly known as the p-value.
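As a concrete illustration of this definition, the minimal sketch below (the observed t statistic and degrees of freedom are made-up values, not data from the chapter) computes the two-sided p-value for an observed statistic td.

```python
# Minimal sketch of the "surprise" computation described above: the
# two-sided p-value for an observed t statistic under H0. The observed
# value t_d and the degrees of freedom are illustrative, not real data.
from scipy import stats

t_d, df = 2.3, 48                        # illustrative observed statistic and df
p_value = 2 * stats.t.sf(abs(t_d), df)   # P(|T| >= |t_d|) under H0
print(f"p = {p_value:.4f}")              # small p => data deemed surprising under H0
```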
One common counterargument to this line of reasoning is that just because the
data are unlikely under H0 does not imply that they are likely under any other hypothesis—it
is possible for data to simply be unlikely under all hypotheses that are being consid-
ered. This argument is somewhat counterintuitive because it is tempting to think
that the probabilities under consideration should sum up to one. A counterexample
is easy to construct. Consider a fictional person K who plays the lottery:

Premise—either K is a person who is not Bertrand Russell (H0) or K is Bertrand Russell (HA)
Premise—if K is a person who is not Bertrand Russell (i.e. if H0 is true), the
probability p of K winning the lottery is very small: p(win | H0 ) < α
Premise—K wins the lottery (an event with p < α has occurred)
Conclusion—therefore, H0 is false, and K is Bertrand Russell

The absurdity is obvious in this example: Conclusions from this method of
reasoning are entirely determined by which hypothesis was arbitrarily chosen as
the null, and clearly the probabilities p(e | H0) and p(e | HA) do not necessarily add
up to one.1 For more discussion on the peculiarities of the p-value see for example
Wagenmakers (2007).
In a more fundamental sense, adherents of the two frameworks think about
data and parameters in rather different ways. The classical framework considers the
data to be random: The current data to be analyzed are just one possible instance
of thousands of hypothetical datasets—a population that is assumed to exist and
that we could observe if we could re-run the study or experiment with the exact
same settings. The observed data are then interpreted against the backdrop of this
population of hypothetical data in order to determine how surprising the outcome
was. The inferred hypothesis itself does not bear any probabilistic meaning: In
the classical sense parameters and hypotheses are fixed, meaning that there exists
a “true” parameter value, an exact value for a parameter that is waiting to be
found. The only probabilistic statements made are about data: How likely were
the data, and if we collect more data and compute confidence intervals, what
are the probabilistic properties of our conclusions?2 It is tempting to invert the
probabilistic statement and make it about the underlying truth rather than about
the data (e.g. “what is the probability that H A is true,” or “what is the probability
that the results are due to chance,” or “what is the probability these results will
reappear in a replication?”); however, such statements can only be evaluated with
the use of Bayes’ rule (see below).
Big Data applications in some sense preempt thoughts of hypothetical
datasets—we have a large amount of data at hand and the size of the sample often
approaches that of the population. Therefore in these settings it is more coherent
to assume that the data are fixed and we compute the probability distributions of
parameter values based on the information contained by all data available at present.
Moreover, a common goal in Big Data analysis is to make predictions about
future trends. Frequentist inference can only assign probabilities to random
events and to long-run frequencies, and is not equipped to make statements
that are conditioned on past data. In fact, by relying on frequentist inference
“one would not be allowed to model business forecast, industrial processes,
demographics patterns, or for that matter real-life sample surveys, all of which
involve uncertainties that cannot simply be represented by physical randomization”
(Gelman, 2006: 149). To summarize, with Bayesian modeling uncertainty can
be directly addressed in terms of probability statements. To further illustrate the
advantages of Bayesian modeling, we first review some of its basic principles.

Principles of Bayesian Statistics


Bayesian methods are used to update current knowledge as information (data)
comes in. The core of Bayesian statistical inference is the posterior distribution
of the parameters, which contains the most up-to-date information about models
and parameters. The posterior is proportional to the product of the likelihood and
a prior distribution. The latter allows us to introduce information into current
inference based on past data. The likelihood describes the assumed data generating
mechanism. Formally, by using Bayes’ rule of conditional probability we can
estimate the probability distribution of parameters given the data:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}, \qquad (1)$$

where θ stands for the vector of all parameters in the model and D denotes the data.
The left-hand side is referred to as the posterior distribution. p(D|θ ) is the likelihood
of the data D given θ . The second factor p(θ) in the numerator is the prior
distribution on θ , which incorporates prior information on the parameter of interest
and formalizes the current state of our knowledge of the parameters (before having
seen the current data, but after having seen all past data). The denominator, p(D),
is the probability of the data averaged over all models under consideration. It does
not depend on the model parameters and serves as a normalization constant in the
equation above. The posterior distribution can often be obtained using only the
repeated application of Bayes’ rule (Equation 1) and the law of total probability:

$$p(a) = \int_B p(a \mid b)\, p(b)\, db, \qquad (2)$$

where B is the domain of the random variable b. For example, Equation 2 can be
used to obtain $p(D) = \int_{\Theta} p(D \mid \theta)\, p(\theta)\, d\theta$.
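For intuition, Equations 1 and 2 can be evaluated numerically for a one-parameter model. The sketch below (an illustration of my own, assuming a binomial likelihood and a flat prior evaluated on a grid; it is not code from the chapter) computes the evidence p(D) by integrating the likelihood against the prior and then normalizes to obtain the posterior.

```python
# Minimal grid-approximation sketch of Equations 1 and 2 for a single
# parameter (a binomial rate theta). The data values are illustrative.
import numpy as np

theta = np.linspace(0.0, 1.0, 10_001)        # grid over the parameter space
prior = np.ones_like(theta)                  # flat prior p(theta) on [0, 1]
C, N = 7, 10                                 # illustrative data: 7 successes in 10 trials
likelihood = theta**C * (1.0 - theta)**(N - C)

# Equation 2: p(D) = integral of p(D | theta) p(theta) d(theta)
evidence = np.trapz(likelihood * prior, theta)

# Equation 1: posterior = likelihood * prior / evidence
posterior = likelihood * prior / evidence
print("posterior mean:", np.trapz(theta * posterior, theta))   # ~ (C + 1) / (N + 2)
```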

That Wretched Prior3


The most frequent criticism of Bayesian statistics involves the necessity of specifying
a prior distribution on the parameters of interest, even in cases when one has no
idea which values are likely to occur (e.g. Trafimow & Marks, 2015). A reply to
this criticism is that the need for a specified prior distribution is not a weakness
of Bayesian statistics but a necessary condition for principled statistical inference.
Alternatively, solutions in terms of uninformative prior distributions have been
offered (e.g. Jaynes, 2003).
Interestingly, however, from the perspective of Big Data, prior specification is
a blessing rather than a curse: Through specifying informative prior distributions
based on past data (or, crucially, previous, smaller subsets of a large dataset), the data
at hand (or other parts of some large dataset) can be analyzed without having to
re-fit the model for past data, while retaining the information from past data in
the newly derived parameter estimates. A worked-out example of this principle
appears at the end of this section, but Figure 2.1 shows a graphical example of
how the conditional posterior distribution of a certain parameter (that is, the
posterior distribution of that parameter conditional on all the other parameters) updates
and becomes more informative as data are added. At the outset, we have next to
no knowledge of the parameter, as expressed in the flat prior distribution. The
prior becomes more peaked when more information becomes available, and with
each update the parameter estimate is less noisy (i.e. has lower posterior standard
deviation). With informative priors, convergence can be fast even if only a handful
of new data points are added at a time.
FIGURE 2.1 Sequential updating of the conditional posterior distribution of a
parameter µ. The parameter µ was simulated to be 5, and the probability density
function of the parameter given all the available data is updated with some number of
participants at a time (the total number is given on the horizontal axis). The distribution
concentrates around the true value as N increases.
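The sketch below mimics the kind of updating shown in Figure 2.1 under simplifying assumptions that are mine, not the chapter's: a normal likelihood with known observation variance and a conjugate normal prior on µ, updated batch by batch, with the posterior from each batch reused as the prior for the next.

```python
# Minimal sketch of sequential conjugate updating of a normal mean mu
# (assumed normal likelihood with known variance; not the model behind
# Figure 2.1). Each posterior becomes the prior for the next batch, so
# old data never need to be revisited.
import numpy as np

rng = np.random.default_rng(7)
true_mu, sigma2 = 5.0, 4.0
m, v = 0.0, 100.0                          # diffuse prior on mu: Normal(mean m, variance v)

batch_sizes = [9, 6, 15, 45, 75, 75, 75]   # cumulative N: 9, 15, 30, 75, 150, 225, 300
seen = 0
for n in batch_sizes:
    x = rng.normal(true_mu, np.sqrt(sigma2), size=n)
    post_prec = 1.0 / v + n / sigma2       # posterior precision
    m = (m / v + x.sum() / sigma2) / post_prec
    v = 1.0 / post_prec                    # posterior variance; reused as the next prior
    seen += n
    print(f"N = {seen:3d}: posterior mean = {m:5.2f}, posterior sd = {np.sqrt(v):.3f}")
```

As in the figure, the posterior standard deviation shrinks with every batch while the mean settles near the true value.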

Obtaining the Posterior


Statistical inference in the Bayesian framework typically focuses on the full
posterior distribution. When models are simple (e.g. linear models), the analytical
form of the posterior can be derived and posterior statistics can be calculated
directly. Most often, however, posteriors are far too complex to obtain through
straightforward derivation. In these cases approximate Bayesian inference can be
applied. We can divide these into two categories: structural and stochastic approaches.
Structural approaches (e.g. variational Bayes; Fox & Roberts, 2012) aim to
find an analytical proxy (variational distribution) of the model parameters that
are maximally similar to the posterior—as defined by some closeness/divergence
criterion—but have a simpler form. Posterior statistics are then based on this
proxy. Once this is derived and tested for a specific model, inference can be
carried out very efficiently (e.g. Ostwald, Kirilina, Starke, & Blankenburg,
2014). However, finding a proxy posterior distribution for new models can
be a labor of some tedium. On the other hand, stochastic (sampling-based)
techniques are implemented in ready-to-use generic inference engines such as
WinBUGS (“Bayesian inference Using Gibbs Sampling”; Lunn, Thomas, Best,
& Spiegelhalter, 2000), JAGS (“Just Another Gibbs Sampler”; Plummer, 2003),
and, more recently, Stan (Stan Development Team, 2013). Moreover, they provide

an asymptotically exact representation of the posterior via Markov chain Monte
Carlo (MCMC) sampling schemes. While the computational cost of sampling may
be prohibitive when considering the large volumes of data in Big Data applications,
the readiness of these methods to fit a wide range of models makes them an
appealing alternative and calls for the development of techniques to overcome
the computational hurdles. Later in this chapter we will describe how sequential
Bayesian updating can be a useful technique to consider.
Another quantity that is important for statistical inference in the Bayesian
framework is the Bayes factor, which is used to compare two models against each
other. The computational details to obtain the Bayes factor can be found in the
literature (e.g. Vandekerckhove, Matzke, & Wagenmakers, in press; Verdinelli &
Wasserman, 1995), but for our purposes it suffices to know that the Bayes factor
expresses the degree to which the available evidence should sway our beliefs from
one model to another. A Bayes factor of one indicates no change in belief, whereas
a Bayes factor of ten for model A over B indicates that we should be ten times more
confident in A over B after seeing the data than we were before.
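As a concrete (and deliberately simple) illustration of the Bayes factor, the sketch below uses the Savage–Dickey density ratio for a binomial rate with hypothetical counts we invented for the example: under a flat Beta(1, 1) prior, the Bayes factor for the point null π = 0.5 is the ratio of posterior to prior density at π = 0.5.

from scipy import stats

a, b = 1.0, 1.0                     # flat Beta prior on the rate
C, N = 15, 20                       # hypothetical data: 15 successes in 20 trials

prior_density = stats.beta.pdf(0.5, a, b)
post_density = stats.beta.pdf(0.5, a + C, b + N - C)

bf_null = post_density / prior_density     # Savage-Dickey ratio: evidence for pi = 0.5
print(f"BF for the null = {bf_null:.2f}, BF against the null = {1/bf_null:.2f}")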

Sequential Updating with Bayesian Methods


A canonical example in statistical inference is that of “the Lady Tasting Tea”
(Fisher, 1935; Lindley, 1993). In an account by Clarke (1991), Ronald Fisher
was once visited by his colleague, a Dr Muriel, who during the course of
a party reprimanded Fisher for pouring tea into a cup first, and milk second. She
claimed to be able to discern the difference and to prefer the reverse order. Fisher,
exclaiming that “surely, it makes no difference,” proceeded to set up a blind tasting
experiment with four pairs of cups. Dr. Muriel correctly identified her preferred
cup each time.
The pivotal quantity in this simple example is the rate π of correct
identifications. We are interested in the posterior distribution of the parameter
π given the data. Call that distribution p (π | C, N ), where C is the number of
correct judgments out of N trials. By Bayes’ theorem,
p(π | C, N) = P(C, N | π) p(π) / P(C, N).
In this case, the likelihood, or the probability of observing the data, takes the form
of a binomial distribution, and is
 
P(C, N | π) = \binom{N}{C} π^C (1 − π)^{N−C}.
The marginal likelihood of the data, also known as the evidence, is
P(C, N) = ∫_0^1 P(C, N | π) p(π) dπ.

Finally, the prior can be set to a Beta distribution with shape parameters α and β:
p(π) = Beta(α, β) = [1/B(α, β)] π^{α−1} (1 − π)^{β−1}.

The mean of this prior distribution is α/(α + β). In order to allow all possible values of
rate π to be a priori equally likely, set α = β = 1, implying a prior mean of 0.5.
These elements can be combined to compute the posterior distribution of
π given the data. To simplify this calculation, isolate all factors that contain the
parameter π and collect the rest in a scale factor S that is independent of rate π :
p(π | C, N) = [ \binom{N}{C} π^C (1 − π)^{N−C} ] [ (1/B(α, β)) π^{α−1} (1 − π)^{β−1} ] / ∫_0^1 P(C, N | π) p(π) dπ
            = S π^C (1 − π)^{N−C} π^{α−1} (1 − π)^{β−1}
            = S π^{C+α−1} (1 − π)^{N−C+β−1}.

Now use the knowledge that the posterior distribution must be a proper
distribution (i.e. it must integrate to 1), so that S can be determined as that unique
value that ensures propriety. We exploit the similarity to the binomial distribution
to obtain:
 
p(π | C, N) = \binom{N + α + β − 2}{C + α − 1} π^{C+α−1} (1 − π)^{N−C+β−1},    (3)

which corresponds to a Beta distribution with updated parameters, Beta(α + C, β + N − C),
and with posterior mean (α + C)/(α + β + N).⁴ Note that if we choose the
“flat prior” parameters α = β = 1, then the posterior reduces to the likelihood.
More interestingly, however, “today’s posterior is tomorrow’s prior” (Lindley,
1972: 2). Suppose that we observe new data from a second round of tastings,
with some sample size N ′ and C ′ correct identifications. We can then combine
this new information using the posterior as a new prior, using the exact same
methods:
p(π | C, N, C′, N′) = \binom{N + N′ + α + β − 2}{C + C′ + α − 1} π^{C+C′+α−1} (1 − π)^{N+N′−(C+C′)+β−1},

which corresponds to a Beta distribution with updated parameters, Beta(α + C + C′,
β + N + N′ − C − C′), and with posterior mean (α + C + C′)/(α + β + N + N′).
Crucially, note the similarity of this equation to Equation 3: This function is
exactly what would have been obtained if C + C ′ correct judgments had been seen
in N + N ′ trials done all at once: The prior distribution of π is updated by the

data (C + C ′ , N + N ′ ) as if there had only ever been one round of tastings. The
Bayesian method of sequential updating is coherent in this sense: Datasets can be
partitioned into smaller parts and yet contribute to the posterior distribution with
equal validity.
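The coherence property is easy to check numerically. The minimal Python sketch below (with made-up counts) updates a flat Beta prior over two rounds of tastings and verifies that the result matches a single all-at-once update; the same identity is what licenses the batch-wise analysis used later in this chapter.

alpha, beta = 1.0, 1.0                             # flat Beta prior

C1, N1 = 4, 4                                      # hypothetical round 1
alpha1, beta1 = alpha + C1, beta + N1 - C1         # posterior after round 1

C2, N2 = 6, 8                                      # hypothetical round 2; prior = round-1 posterior
alpha2, beta2 = alpha1 + C2, beta1 + N2 - C2       # posterior after round 2

alpha_all = alpha + C1 + C2                        # all-at-once update with pooled data
beta_all = beta + (N1 + N2) - (C1 + C2)

assert (alpha2, beta2) == (alpha_all, beta_all)    # sequential and batch updates agree
print(f"posterior: Beta({alpha2:.0f}, {beta2:.0f}), mean = {alpha2/(alpha2+beta2):.3f}")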
We also note here that sequential updating does not always lead to an analytically
tractable solution. The example above has the special property that the prior
distribution of the parameter of interest (the Beta prior for the rate parameter π )
is of the same distributional form as the posterior distribution. This property is
called conjugacy; information from the data enters into the Beta distribution by
changing the parameters of the Beta prior, but not its parametric form. Many
simple and moderately complex problems can be described in terms of conjugate
priors and likelihoods. For models where the conjugacy property is not met,
non-parametric techniques have to be applied to summarize information in the
posterior distribution. Our example application will have conjugate properties,
and we provide further information on non-parametric modeling in the Discussion
section.

Advantages of Sequential Analysis in Big Data Applications

The method of sequential Bayesian updating (SBU) can address the two
computational hurdles of Big Data: volume and velocity. The combination of
these solutions with the possibility of fitting cognitive models that can exploit the
variety in Big Data through flexible modeling makes SBU a useful tool for research
problems in cognitive science.
SBU is not the only way to deal with computational challenges in Bayesian
inference, and we mention some techniques based on parallelization in the
Discussion section. In choosing SBU our focus is more dominantly on the
time-varying aspect of data size and on online inference: Data are assumed to
accumulate over time and—presumably—sharpen the inference. In SBU all data
batches but the first are analyzed using informative priors, which should speed up
convergence relative to the parallel techniques.
As described above, the procedure of SBU is to summarize one's current state
of knowledge regarding parameters in terms of their posterior distributions, and
use these as prior distributions when new data become available. Crucially, we
construct posterior distributions in such a way that we avoid having to repeat
computing the likelihood of old data as new data become available. We can address
the three main properties of Big Data as follows:

Volume
One can think of Big Data simply as a dataset so large that it is infeasible to analyze
at once on the available hardware. Through SBU, one can partition a large

dataset into smaller, more manageable batches, perform model fitting on those
sequentially, using each batch’s posterior distribution as a prior for the next batch.
This procedure avoids having to store large datasets in memory at any given time.
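A schematic of this batch-wise workflow is sketched below; the conjugate Beta update inside fit_batch is only a stand-in for whatever sampler or model one would actually run on each batch (the function names and data are ours, not part of any package).

import numpy as np

def fit_batch(batch, prior):
    # Stand-in for fitting one batch under `prior` (e.g. a JAGS/Stan run);
    # here it is a conjugate Beta update on 0/1 outcomes.
    a, b = prior
    return a + batch.sum(), b + len(batch) - batch.sum()

def sequential_fit(data, n_batches, prior=(1.0, 1.0)):
    for batch in np.array_split(data, n_batches):  # only one batch in memory at a time
        prior = fit_batch(batch, prior)            # posterior becomes the next prior
    return prior

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=10_000)           # hypothetical 0/1 outcomes
a, b = sequential_fit(data, n_batches=40)
print(f"posterior mean = {a / (a + b):.3f}")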

Velocity
Bayesian decision rules are by default sequential in nature, which makes them
suitable for handling Big Data streams. Unlike the frequentist paradigm, Bayesian
methods allow for inferences and decisions to be made at any arbitrary point in the
data stream, without loss of consistency. Information about past data is kept online
by means of the posterior distributions of the model parameters that sufficiently
summarize the data generation process. The likelihood only needs to be calculated
for the new data point to update the model parameters’ posteriors. We will focus
on cases where data are streaming continuously and a relatively complex model is
fit to the data. These principles scale seamlessly and can be applied where a large
volume of data is analyzed with complex models.

Variety
Big Data means a lot of information coming in from different sources. One
needs complex models to combine different sources of information (see van der
Linden, 2007, for a general method for combining information across sources). For
example, often not only neuroimaging data are collected, but several behavioral
measures are available (e.g. the Human Connectome Project). In such a case,
one could combine a neural model describing functional magnetic resonance
imaging data with a cognitive model describing behavioral data (see Turner,
Forstmann, Wagenmakers, Sederberg, & Steyvers, 2013, for an application in
cognitive neuroscience). Off-the-shelf software packages are typically not equipped
for inference with novel, complex models, whereas general-purpose Bayesian tools
make it possible to fit practically any model regardless of complexity.

Application: MindCrowd—Crowdsourcing in the Service of Understanding Alzheimer’s Dementia

MindCrowd
MindCrowd (TGen and The University of Arizona; www.mindcrowd.org) is a
large-scale research project that uses web-based crowdsourcing to study AD. The
focus is on the assessment of cognition in a large cohort of healthy adults of all ages.
The project is in its first phase, where web-based memory testing is conducted
through two tasks: An attention task yielding simple reaction times (five trials)
and a paired-associate learning task with three stages of recall. Moreover, a set
of covariates is collected, including age, gender, marital status, education, whether
the participant or a family member has been diagnosed with AD, and more. The

goal is to collect data from one million people and select various profiles. Then in a
second phase more intensive cognitive testing will be carried out, complemented by
DNA sampling and additional demographic questions. MindCrowd was launched
in April of 2013 and has recruited over 40,000 test takers who have completed
both tasks and answered at least 80 percent of the demographic questions. The
analyses presented here are based on 22,246 participants whose data were available
at the time of writing. With sequential Bayesian updating, inference regarding
substantively interesting parameters can be kept up to date in a continuous fashion,
adding only newly arriving data to a prior that is itself based on previous data.
This means, for example, that when the last responses (the ones closer to the one
million mark) arrive, computing the posterior distribution will be fast.

Modeling Simple Reaction Time with the LATER Model


Data collected through the MindCrowd website provides us with several
opportunities for cognitive modeling. We will focus here on the attention task of
the MindCrowd project: A vigilance task in which participants are asked to respond
as fast as they can to an appearing stimulus, with a randomized interstimulus
interval. The stimulus was a fuchsia disk and participants were instructed to hit
enter/return as soon as they saw it appear on their screen. At the time of writing
the task is still available on the MindCrowd website.
We will apply a hierarchical extension of a widely used process model for
reaction time (RT) called the Linear Approach to Threshold with Ergodic Rate model
(LATER; Reddi & Carpenter, 2000; Ratcliff, Carpenter, & Reddi, 2001). The
LATER model is one of a large class of sequential-sampling models, in which it is
assumed that during the course of a trial, information is accumulated sequentially
until a criterial amount of information is reached, upon which a response is
executed. In the LATER model, the accumulation process is assumed to be
linear, approaching a fixed threshold, with a rate that is random from trial to
trial. A graphical illustration of the model is shown in Figure 2.2.
The LATER model describes the latency distributions of observed RTs by
characterizing the decision-making process in terms of two cognitive variables,
namely (1) person-specific caution θ p , or the amount of information needed by p to
respond (the “threshold”), and (2) the average rate of information accumulation ν p (the
“accretion rate”). In taking this approach, we are fitting a probabilistic model to the
observed behavioral data. We think of this probabilistic abstraction as the generative
model, and it characterizes our assumptions regarding the process by which the data
come about. More specifically, at each trial i, a single, trial-specific, realization
of the accretion rate, denoted as z pi , is generated according to a unit-variance
Gaussian distribution:

z pi ∼ N (ν p , 1), (4)

where ∼ is the common notation used to indicate that the variable on the left
side is a draw from the distribution on the right side. The predicted response time
at trial i is then t_pi = θ_p / z_pi; that is, the person-specific caution θ_p divided by the
person-specific rate at the ith trial, z_pi. Rearranging this expression yields
z_pi = θ_p / t_pi, which by Equation 4 follows a Gaussian distribution with mean ν_p
and variance 1. It further follows that

z_pi / θ_p = 1 / t_pi ∼ N(ν_p / θ_p, 1 / θ_p²),    (5)

where ν p remains the accretion rate parameter for person p, capturing their
information processing speed, and θ p is the threshold parameter implying their
caution in responding.
In what follows, we will apply a regression structure to the accretion rate
parameter in order to quantify between-person differences in speed of information
processing (see, e.g. Vandekerckhove, Tuerlinckx, & Lee, 2011, on hierarchical
Bayesian approaches to cognitive models). To the best of our knowledge this is the
first application of a hierarchical Bayesian LATER model. The caution parameter,
θ , is positive by definition—and is closely related to the precision of the accretion
distribution—so we choose a gamma distribution on the square of caution, θ_p², to
inherit the conjugacy of that distribution:

θ_p² ∼ Γ(s_θ, r_θ),    (6)

with shape s_θ and rate r_θ the parameters of the gamma distribution on θ_p².


Furthermore, assume that C covariates are measured and x pc denotes the score
of person p on covariate c (c = 1, . . . , C). All person-specific covariate scores are
collected into a vector of length C + 1, denoted as x p = (1, x p1 , x p2 , . . . , x pC )T .

FIGURE 2.2 Graphical illustration of the LATER model. Evidence accumulates along
the accumulation dimension over time (in seconds) at accretion rate ν_p toward the
threshold θ_p; the latency t_pi is the time at which the threshold is reached.

The assumed distribution of the accumulation rate parameter ν_p is then:

ν_p ∼ N(x_p β, σ²).    (7)

Finally, we need to choose priors for the remaining parameters of interest. The
regression terms follow a multivariate normal distribution, specified as β ∼
MVN(M_β, Σ_β), with M_β a vector of zeros and Σ_β a covariance matrix with
0.1 on the diagonal and 0 elsewhere. We specify a gamma prior on the inverse of
the residual variance (i.e. on the precision): σ⁻² ∼ Γ(s_σ, r_σ), where s_σ = r_σ = 0.01.
Fitting the specified model through sequential Bayesian updating is described in the
next section.

Study Design
We analyzed the reaction time data of N = 21,947 participants (each providing at
most five valid trials) from the MindCrowd project. While the original sample size
was slightly larger (22,246) we discarded data from participants whose covariate
information was missing. We also omitted reaction times above 5 seconds or below
180 ms, which are unrealistic for a simple vigilance task. As part of the project
several covariates are collected. From this pool, we chose the following variables
for inclusion in our analysis: Age, gender, and whether the participant or a family
member5 had been diagnosed with AD. Our interest is in the effect of the presence
of AD on the speed of information processing, and its possible interaction with
age. The hierarchical LATER model we construct for this purpose is very similar
to a classical regression model, with the main difference being that the predicted
distribution of the data is not a normal distribution, but rather the distribution of
RTs as predicted by a LATER model. The “target” of the regression analysis is
therefore not the mean of a normal distribution but the accretion rate of a LATER
process. For illustration, we write out the mean of the person-specific information
accumulation rates (ν p ) from Equation 7 as a function of age, sex, AD diagnosis
and the interaction of age and AD diagnosis, and the corresponding regression
terms:
x_p β = β_0 + β_1 AGE_p + β_2 SEX_p + β_3 ALZ_p + β_4 AGE_p × ALZ_p.    (8)
The key regression equation (the mean in Equation 7; worked out in Equation 8),
together with Equations 5, 6, and 7 completes our model.
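To make the generative assumptions explicit, the sketch below (our own simulation; all parameter values and covariate codings are arbitrary illustrations, not estimates from the MindCrowd data) draws synthetic data from the hierarchical LATER model of Equations 4 through 8: accretion rates follow the regression structure, squared cautions are gamma distributed, and each trial's RT is t_pi = θ_p / z_pi.

import numpy as np

rng = np.random.default_rng(2)
P, T = 1000, 5                                     # persons, trials per person

age = rng.normal(0, 1, P)                          # hypothetical standardized age
sex = rng.integers(0, 2, P)                        # hypothetical 0/1 coding
alz = rng.integers(0, 2, P)
X = np.column_stack([np.ones(P), age, sex, alz, age * alz])   # Equation 8 design
beta = np.array([5.6, -0.8, 0.6, -0.1, 0.0])       # illustrative regression weights
sigma = 0.5

nu = rng.normal(X @ beta, sigma)                   # accretion rates (Equation 7)
theta = np.sqrt(rng.gamma(shape=20.0, scale=0.1, size=P))     # caution: theta^2 ~ Gamma (Equation 6)
z = rng.normal(nu[:, None], 1.0, size=(P, T))      # trial-specific accretion (Equation 4)
rt = theta[:, None] / z                            # predicted response times

print("median simulated RT (s):", round(float(np.median(rt[rt > 0])), 3))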
For carrying out the analysis, we specify prior distributions on the parameters
in Equations 6, 7, and 8 (i.e. for the regression terms β, for the inverse of the
residual variance σ −2 , and for the person-specific caution θ p ). The parametric
forms of these priors (namely, the multivariate normal distribution and the gamma
distribution) are chosen to be conjugate with the Gaussian likelihood of the data.
The sequential Bayesian updating then proceeds as follows: As described above,
we specify standard non-informative prior distributions for the first batch of data.

We then obtain posterior samples from JAGS. Once JAGS returns the results, we
summarize these samples in terms of the conditional posterior distributions of the
parameters of interest. More specifically, for the regression terms, we calculate the
mean vector and the covariance matrix of the multivariate normal distribution
based on the posterior samples. The mean vector expresses our best current state
of knowledge on the regression terms, the variances on the diagonal quantify
the uncertainty in these, and the covariances in the off-diagonal positions capture
possible trade-offs due to correlation in the covariates. These posterior summary
statistics sufficiently summarize our knowledge on the parameter given the data, up
to a small computational error due to deriving these posterior summaries through
sampling with JAGS, instead of deriving them analytically. The same principle
applies for the residual precision parameter, σ −2 , in terms of the shape parameters
(sσ , rσ ) of its Gamma distribution. Finally, we plug these estimated distributions in
as priors for the next batch of data.
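The summarize-and-carry-forward step might look as follows in Python (the sample arrays stand in for the MCMC draws returned by JAGS, and the moment matching used for the gamma parameters is our own simplification of the procedure described above).

import numpy as np

def summarize_for_next_batch(beta_samples, precision_samples):
    # Collapse posterior samples into the statistics used as priors for the
    # next batch: a multivariate normal for the regression weights and a
    # moment-matched gamma for the residual precision.
    M_beta = beta_samples.mean(axis=0)             # prior mean vector
    S_beta = np.cov(beta_samples, rowvar=False)    # prior covariance matrix
    m, v = precision_samples.mean(), precision_samples.var()
    shape, rate = m**2 / v, m / v                  # Gamma(shape, rate) matching mean and variance
    return M_beta, S_beta, shape, rate

rng = np.random.default_rng(3)                     # hypothetical posterior draws
beta_samples = rng.normal(size=(1500, 5))          # 1,500 samples of 5 regression weights
precision_samples = rng.gamma(2.0, 1.0, size=1500)
print(summarize_for_next_batch(beta_samples, precision_samples))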
In the current analysis we use exclusively conjugate priors (i.e. where the
parametric form of the prior on the parameter combined with the likelihood of
the model results in a conditional posterior distribution of the same parametric
form but with updated parameters based on the data). However, not all models
can be formulated by relying only on conjugate priors. In these cases, conjugacy
can be forced with the use of non-parametric methods, but this is beyond
the scope of the current chapter (but see the Discussion section for further
guidelines).
The analyses presented here were run on a desktop computer with a 3.40 GHz
CPU and 16 GB RAM. While in principle in this phase of the project (with
N = 21,947) we could have analyzed the entire dataset on this machine, for
the purposes of demonstration we divided the full dataset into 40 batches. In a
later phase of the MindCrowd project the sample size will increase substantially,
to an expected one million participants, in which case—due to the desktop
computer’s RAM limitations—batch processing will be required rather than
optional.
We implemented the model in JAGS using a homegrown MATLAB interface.6
The analysis took approximately 10 minutes to run. From the first batch of data,
parameters were estimated by running five chains with 1,500 iterations each,
discarding the first 1,000 samples as burnin.7
From the second batch until the last, we ran five chains with 800 iterations
each, from which 500 were discarded as burnin. The reason why we chose a
shorter adaptation for the second batch was that the algorithm was now better
“informed” by the prior distributions of the parameters inferred from the first
batch, so that we expect faster convergence to the highest posterior density area.
The final sample size was 1,500 samples. Convergence of the five chains was tested
by the R̂ statistics. R̂ was lower than 1.01 for all parameters (with the standard
criterion being R̂ < 1.1).
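For reference, a basic (non-split) version of the R̂ diagnostic can be computed from the chains as in the sketch below; this is a generic illustration, not the exact implementation used in the analysis.

import numpy as np

def rhat(chains):
    # chains: array of shape (n_chains, n_samples) for a single parameter
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()          # mean within-chain variance
    B = n * chain_means.var(ddof=1)                # between-chain variance
    var_hat = (n - 1) / n * W + B / n              # pooled variance estimate
    return float(np.sqrt(var_hat / W))

rng = np.random.default_rng(4)
chains = rng.normal(size=(5, 300))                 # e.g. five chains of 300 retained samples
print("R-hat =", round(rhat(chains), 3))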

FIGURE 2.3 Sequence of conditional posterior distributions for the regression
coefficient parameter β4—the weight of the AD-by-age interaction regressed on the
speed of information processing parameter. As each batch of participants is added to
the analysis, our knowledge of β4 is updated and the posterior standard deviation
decreases while the posterior mean converges to a stable value (in this case, near 0).
(The horizontal axis gives the number of available batches of 549 participants; the
vertical axis gives the regression weight β4.)

Results from the Hierarchical Bayesian LATER Model


Figure 2.3 shows the evolution of the distribution of β4 as more data are
introduced. The results of our regression-style analysis are displayed in Table 2.1.
Parameters β1 and β2 show posterior distributions that are clearly far away from
zero, indicating high confidence in the existence of an effect. β1 is negative,
indicating that speed of information processing decreases with advancing age. β2
is positive, indicating an advantage for men over women in terms of speed of
information processing. Parameters β3 and β4 , however, do not show clear effects.
In both cases, the posterior mean is close to zero. In the case of β3 —the predictor
on whether the participant or a family member has been diagnosed with AD—the
value 0 is included in its 95 percent credibility interval (i.e. the interval between
the 2.5 and 97.5 percentiles), and the Bayes factor indicates weak evidence for the
null hypothesis (i.e. no effect, that is β3 = 0). More precisely, the Bayes factor is
2.7 (1/0.37) in favor of the null hypothesis. Similarly, there is no evidence that
information accumulation changes in relation to the interaction of age and the

TABLE 2.1 Summary of the regression weights where response speed was modeled
with the LATER model and the information accumulation rate was regressed on
age, gender, AD in the family, and the interaction of age and AD in the family.

Predictor          Mean      SD       95% CrI                BF_ALT
β0  Intercept      5.6280    0.0268   ( 5.575,   5.6800)     ≫ 10^10
β1  Age           −0.7878    0.0196   (−0.8261, −0.7492)     ≫ 10^10
β2  Gender         0.6185    0.0368   ( 0.5471,  0.6891)     ≫ 10^10
β3  AD            −0.0704    0.0672   (−0.2011,  0.0560)     0.37
β4  Age×AD         0.0273    0.0602   (−0.0912,  0.1478)     0.21

Mean and SD are posterior mean and standard deviation. CrI stands for “credibility interval.”
BF_ALT is the Savage–Dickey approximation (Verdinelli & Wasserman, 1995) to the Bayes factor
in favor of the (alternative) hypothesis that β ≠ 0.

presence of AD—in fact, the Bayes factor for β4 shows moderate evidence in favor
of the null hypothesis of no effect (1/0.21 = 4.76).
Especially in the case of Big Data, it is important to draw a distinction
between statistical significance—the ability of the data to help us distinguish effects
from non-effects—and practical significance—the degree to which an extant effect
influences people. In the current dataset, the difference in (mean) predicted RT
between a male participant (group mean accretion rate ν̄m ) and a female participant
 
(group mean accretion rate ν̄_f) is approximately θ̄ (1/ν̄_f − 1/ν̄_m), which with our
results works out to about 10 ms. Hence, while the difference between these two
groups is detectable (the Bayes factor against the null is more than 1000:1), it is small
enough that any daily-life consequences are difficult to imagine.
To summarize, our cognitive model allows us to cast light on the information
processing system that is assumed to underlie the simple RT measures. The
process model identifies a parameter of interest—in this case, a rate of information
accumulation—and inferences can then be drawn in terms of this parameter.
Caution in the responding is factored into the inference, treated as a nuisance
variable, and separated from the accumulation rate.

Combining Cognitive Models


The MindCrowd website currently tests volunteers not only on the vigilance
task, but also on a paired-associate learning (PAL) task. Cognitive models exist
to model underlying processes in these decisions as well (e.g. multinomial models
for measuring storage and retrieval processes; Rouder & Batchelder, 1998). In the
hierarchical Bayesian modeling framework, we could combine data from these two
tasks together by specifying a joint hyperprior distribution of the parameters of the

model for PAL and the model for the RTs (e.g. Pe, Vandekerckhove, & Kuppens,
2013; Vandekerckhove, 2014). Combining these joint modeling techniques that
were originally developed in psychometrics (e.g. van der Linden, 2007) with
Bayesian modeling can offer a flexible unified framework for drawing inference
from data that would classically be analyzed separately, thereby partially addressing
the “variety” aspect of Big Data challenges.

Discussion
In this chapter, we discussed one way in which Bayesian methods can contribute
to the challenges introduced by Big Data. A core aspect of Bayesian inference—the
sequential updating that is at the heart of the Bayesian paradigm—allows researchers
to partition large datasets so that they become more manageable under hardware
constraints. We have focused on one specific method for exploiting the sequential
updating property, namely using conjugate priors, which lead to closed-form
posterior distributions that can be characterized with only a few sufficient statistics,
and in turn serve as priors for future data. This particular method is limited because
it requires conjugacy of the focal parameters. However, we were able to apply it to a
non-trivial cognitive model (the hierarchical LATER model) and draw interesting
process-level conclusions. For more complex models, priors and posteriors could
be expressed in non-parametric ways (Gershman & Blei, 2012). This method solves
the need for conjugacy, but will itself introduce new computational challenges.
The sequential updating method is computationally efficient because it collapses
posterior samples into sufficient statistics, but also because the informative priors
that are generated from the first batches of data speed up convergence of later
batches.
Our method has also assumed a certain stationarity of data; that is, it was
assumed that as the data came in, the true parameters of the model did not
change. However, there are many real-world scenarios—ranging from negotiation
theory, learning psychology, and EEG analysis, over epidemiology, ecology, and
climate change, to industrial process control, fraud detection, and stock market
predictions—where the stationarity assumption would clearly be violated and the
academic interest would be in change point detection (e.g. Adams & MacKay, 2007).
Within our current approach, a change point detection model would require
that the parameters relevant to the regime switches are explicitly included, and
posteriors over these parameters can be updated as data become available.
Moving beyond sequential updating, there exist other methods for obtaining
samples of a posterior distribution using large datasets. For example, the Consensus
Monte Carlo Algorithm (Scott, Blocker, & Bonassi, 2013) or the Embarrassingly
Parallel, Asymptotically Exact MCMC algorithm (Neiswanger, Wang, & Xing,
2014) both rely on distributing the computational load across a larger hardware
infrastructure and reducing the total “wall time” required for an analysis. The

method we present here has the advantage of not requiring a large dedicated
computation infrastructure and can be run on a regular desktop computer, with
the size of the data affecting only the computation time.
All of these methods rely on Bayesian inference. As we have argued extensively,
we believe that Bayesian methods are not only useful and feasible in a Big Data
context, but are in fact superior from a philosophical point of view. Classical
inference is well known to generate bias against the null hypothesis, and this bias
increases with increasing data size. Recent attempts to reform statistical practice in
the psychological sciences (Cumming, 2014) shift the focus of statistical analysis
to parameter estimation, but with this there remain several major issues. First,
the estimation framework is still based in classical statistics and does not take into
account the prior distribution of parameters of interest. Second, it is not clear
if inference is possible at all in this framework, and “dichotomous thinking”
is discouraged entirely (though it is tempting to wrongly interpret confidence
intervals as posterior distributions, and to decide that an effect is present if the
interval does not contain zero). These recent recommendations seem to us to
throw the dichotomous baby away with the NHST bathwater, while a Bayesian
approach (as we and many others have demonstrated) is logically consistent, does
allow for inferential statements, and allows one to collect evidence in favor of a null
hypothesis. Especially in the case of Big Data, these are highly desirable qualities
that are not shared by classical methods, and we recommend Bayesian inference as
a default method.

Acknowledgments
JV was supported by grant #48192 from the John Templeton Foundation and by
NSF grant #1230118 from the Methods, Measurements, and Statistics panel.

Notes
1 If our example seems far fetched, consider that the existence of a counterex-
ample means one of two things. Either (a) p-values are never a logically valid
method of inference, or (b) p-values are sometimes a logically valid method
of inference, but there exist necessary boundary conditions on the use of
p-values that must be tested whenever p-values are applied. No such boundary
conditions are known to the authors.
2 These long-run guarantees of classical methods have issues in their own
right, which we will not discuss here. More on problematic interpretation
of confidence intervals can be found in Hoekstra, Morey, Rouder, &
Wagenmakers (2014).
3 This expression is due to Lindley (2004).

4 The variance of the Beta distribution is defined as αβ / [(α + β)²(α + β + 1)], which
becomes (α + C)(β + N − C) / [(α + β + N)²(α + β + N + 1)]. The posterior uncertainty
regarding the parameter is hence a strictly decreasing function of the added sample size N.
5 The phrasing of the item was: “Have you, a sibling, or one of your parents
been diagnosed with Alzheimer’s disease? Yes, No, NA.” The variable took only
two values in the current dataset: 1—a first degree family member has AD
(including respondent, around 4,000 respondents); 0—there is no first degree
relative with AD in the family.
6 All scripts are available from https://git.psu.edu/zzo1/ChapterSBU. MindCrowd’s
data are proprietary.
7 These burnin samples serve two purposes. First, when a model is initialized,
JAGS enters an adaptive mode during which the sampling algorithm modifies
its behaviour for increased efficiency. These changes in the algorithm violate
the detailed balance requirement of Markov chains, so that there is no guarantee
that the so generated samples converge to the desired stationary distribution.
Second, to ensure that the samplers are exploring the posterior parameter space
sufficiently, the sampling algorithm is restarted several times with dispersed
starting values and it is checked whether all these solutions converge into
the same area (as opposed to being stuck in a local optimum, for example).
Posterior inference should be based on samples that form a Markov chain and
are converged into the same area and have “forgotten” their initial values. In
the current analysis the samplers are run independently five times (i.e. we run
five chains). The independence of these MCMC chains implies that they can
be run in parallel, which we do.

References
Adams, R. P., & MacKay, D. J. (2007). Bayesian online changepoint detection. arXiv
preprint arXiv:0710.3742.
Clarke, C. (1991). Invited commentary on R. A. Fisher. American Journal of Epidemiology,
134(12), 1371–1374. Retrieved from http://aje.oxfordjournals.org/content/134/12/1371.short.
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7–29.
Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver and Boyd.
Fox, C., & Roberts, S. (2012). A tutorial on variational Bayes. Artificial Intelligence Review,
38, 85–95.
Gelman, A. (2006). The boxer, the wrestler, and the coin flip: A paradox of robust Bayesian
inference and belief functions. American Statistician, 60, 146–150.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014).
Bayesian Data Analysis (3rd edn.). Boca Raton, FL.: Chapman & Hall/CRC.
Gershman, S. J., & Blei, D. M. (2012). A tutorial on Bayesian nonparametric models.
Journal of Mathematical Psychology, 56, 1–12.

Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (2014). Robust
misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5),
1157–1164.
Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge, UK: Cambridge
University Press.
Kruschke, J. K. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS and Stan (2nd
edn.). London: Academic Press/Elsevier.
Lee, M. D., & Wagenmakers, E.-J. (2013). Bayesian cognitive modeling: A practical course. New York: Cambridge University Press.
Lindley, D. (1972). Bayesian statistics: A review. Philadelphia: Society for Industrial and
Applied Mathematics.
Lindley, D. (1993). The analysis of experimental data: The appreciation of tea and wine.
Teaching Statistics, 15(1), 22–25.
Lindley, D. (2004). That wretched prior. Significance, 1(2), 85–87.
Lunn, D., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS—a Bayesian
modelling framework: concepts, structure, and extensibility. Statistics and Computing, 10,
325–337.
Neiswanger, W., Wang, C., & Xing, E. A. (2014). Asymptotically exact, embarrassingly
parallel MCMC. Retrieved from http://arxiv.org/pdf/1311.4780v2.pdf, 1311.4780.
Ostwald, D., Kirilina, E., Starke, L., & Blankenburg, F. (2014). A tutorial on variational
Bayes for latent linear stochastic time-series models. Journal of Mathematical Psychology, 60,
1–19.
Pe, M. L., Vandekerckhove, J., & Kuppens, P. (2013). A diffusion model account of the
relationship between the emotional flanker task and rumination and depression. Emotion,
13(4), 739.
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using
Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical
Computing (DSC 2003) (pp. 20–22).
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108.
Ratcliff, R., Carpenter, R. H. S., & Reddi, B. A. J. (2001). Putting noise into
neurophysiological models of simple decision making. Nature Neuroscience, 6, 336–337.
Reddi, B. A., & Carpenter, R. H. S. (2000). The influence of urgency on decision time.
Nature Neuroscience, 3, 827–830.
Rouder, J. N., & Batchelder, W. H. (1998). Multinomial models for measuring storage
and retrieval processes in paired associate learning. In C. E. Dowling, F. S. Roberts,
& P. Theuns (Eds.), Recent progress in mathematical psychology (pp. 195–226). New York:
Psychology Press.
Scott, S. L., Blocker, A. W., & Bonassi, F. V. (2013). Bayes and Big Data: The consensus
Monte Carlo algorithm. In Paper presented at the 2013 EFab@Bayes 250 Workshop.
Stan Development Team. (2013). Stan: A C++ Library for Probability and Sampling, Version
1.3. Retrieved from http://mc-stan.org/.
Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, 37(1),
1–2.
Turner, B. M., Forstmann, B. U., Wagenmakers, E. J., Brown, S. D., Sederberg, P. B., &
Steyvers, M. (2013). A Bayesian framework for simultaneously modeling neural and
behavioral data. NeuroImage, 72, 193–206.
Vandekerckhove, J. (2014). A cognitive latent variable model for the simultaneous analysis
of behavioral and personality data. Journal of Mathematical Psychology, 60, 58–71.

Vandekerckhove, J., Matzke, D., & Wagenmakers, E.-J. (in press). Model comparison and the
principle of parsimony. Oxford: Oxford University Press.
Vandekerckhove, J., Tuerlinckx, F., & Lee, M. (2011). Hierarchical diffusion models for
two-choice response times. Psychological Methods, 16, 44–62.
van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy
on test items. Psychometrika, 72(3), 287–308.
Verdinelli, I., & Wasserman, L. (1995). Computing Bayes factors using a generalization
of the Savage-Dickey density ratio. Journal of the American Statistical Association, 90(430),
614–618.
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values.
Psychonomic Bulletin & Review, 14, 779–804.
Zhu, J., Chen, J., & Hu, W. (2014). Big learning with Bayesian methods. Retrieved from
http://arxiv.org/pdf/1411.6370.pdf, 1411.6370v1.
3
PREDICTING AND IMPROVING
MEMORY RETENTION
Psychological Theory Matters
in the Big Data Era

Michael C. Mozer and Robert V. Lindsey

Abstract
Cognitive psychology has long had the aim of understanding mechanisms of human
memory, with the expectation that such an understanding will yield practical techniques
that support learning and retention. Although research insights have given rise to
qualitative advice for students and educators, we present a complementary approach that
offers quantitative, individualized guidance. Our approach synthesizes theory-driven and
data-driven methodologies. Psychological theory characterizes basic mechanisms of human
memory shared among members of a population, whereas machine-learning techniques use
observations from a population to make inferences about individuals. We argue that despite
the power of Big Data, psychological theory provides essential constraints on models. We
present models of forgetting and spaced practice that predict the dynamic time-varying
knowledge state of an individual student for specific material. We incorporate these models
into retrieval-practice software to assist students in reviewing previously mastered material.
In an ambitious year-long intervention in a middle-school foreign language course, we
demonstrate the value of systematic review on long-term educational outcomes, but more
specifically, the value of adaptive review that leverages data from a population of learners to
personalize recommendations based on an individual’s study history and past performance.

Introduction
Human memory is fragile. The initial acquisition of knowledge is slow and
effortful. And once mastery is achieved, the knowledge must be exercised
periodically to mitigate forgetting. Understanding the cognitive mechanisms of
memory has been a longstanding goal of modern experimental psychology, with
the hope that such an understanding will lead to practical techniques that support
learning and retention. Our specific aim is to go beyond the traditional qualitative

forms of guidance provided by psychology and express our understanding in terms
of computational models that characterize the temporal dynamics of a learner’s
knowledge state. This knowledge state specifies what material the individual already
grasps well, what material can be easily learned, and what material is on the verge
of slipping away. Given a knowledge-state model, individualized teaching strategies
can be constructed that select material to maximize instructional effectiveness.
In this chapter we describe a hybrid approach to modeling knowledge
state that combines the complementary strengths of psychological theory and a
Big Data methodology. Psychological theory characterizes basic mechanisms of
human memory shared among members of a population, whereas the Big Data
methodology allows us to use observations from a population to make inferences
about individuals. We argue that despite the power of Big Data, psychological
theory provides essential constraints on models, and that despite the success of
psychological theory in providing a qualitative understanding of phenomena, Big
Data enables quantitative, individualized predictions of learning and performance.
This chapter is organized as follows. First, we discuss the notion of knowledge
state and the challenges involved in inferring knowledge state from
behavior. Second, we turn to traditional psychological theory, describing key
human-memory phenomena and computational models that have been developed
to explain these phenomena. Third, we explain the data-mining technique known
as collaborative filtering, which involves extracting patterns from large datasets for
the purpose of making personalized recommendations. Traditionally, collaborative
filtering has been used by e-commerce merchants to recommend products to buy
and movies to watch, but in our context, we use the technique to recommend
material to study. Fourth, we illustrate how a synthesis of psychological theory and
collaborative filtering improves predictive models. And finally, we incorporate our
predictive models into software that provides personalized review to students, and
show the benefit of this type of modeling in two semester-long experiments with
middle-school students.

Knowledge State
In traditional electronic tutors (e.g. Anderson, Conrad, & Corbett, 1989;
Koedinger & Corbett, 2006; Martin & VanLehn, 1995), the modeling of a
student’s knowledge state has depended on extensive handcrafted analysis of the
teaching domain and a process of iterative evaluation and refinement. We present
a complementary approach to inferring knowledge state that is fully automatic
and independent of the content domain. We hope to apply this approach in any
domain whose mastery can be decomposed into distinct, separable components of
knowledge or items to be learned (VanLehn, Jordan, & Litman, 2007). Applicable
domains range from the concrete to the abstract, and from the perceptual to the

cognitive, and span qualitatively different forms of knowledge from declarative to
procedural to conceptual.
What does it mean to infer a student’s knowledge state, especially in a
domain-independent manner? The knowledge state consists of latent attributes
of the mind such as the strength of a specific declarative memory or a
stimulus–response association, or the psychological representations of interrelated
concepts. Because such attributes cannot be observed directly, a theory of
knowledge state must be validated through its ability to predict a student’s future
abilities and performance.
Inferring knowledge state is a daunting challenge for three distinct reasons.

1. Observations of human behavior provide only weak clues about the knowledge state.
Consider fact learning, the domain which will be a focus of this chapter. If
a student performs cued recall trials, as when flashcards are used for drilling,
each retrieval attempt provides one bit of information: whether it is successful
or not. From this meager signal, we hope to infer quantitative properties
of the memory trace, such as its strength, which we can then use to predict
whether the memory will be accessible in an hour, a week, or a month. Other
behavioral indicators can be diagnostic, including response latency (Lindsey,
Lewis, Pashler, & Mozer, 2010; Mettler & Kellman, 2014; Mettler, Massey,
& Kellman, 2011) and confidence (Metcalfe & Finn, 2011), but they are also
weak predictors.
2. Knowledge state is a consequence of the entire study history, i.e. when in the past the
specific item and related items were studied, the manner and duration of study,
and previous performance indicators. Study history is particularly relevant
because all forms of learning show forgetting over time, and unfamiliar and
newly acquired information is particularly vulnerable (Rohrer & Taylor, 2006;
Wixted, 2004). Further, the temporal distribution of practice has an impact
on the durability of learning for various types of material (Cepeda, Pashler,
Vul, & Wixted, 2006; Rickard, Lau, & Pashler, 2008).
3. Individual differences are ubiquitous in every form of learning. Taking an example
from fact learning (Kang, Lindsey, Mozer, & Pashler, 2014), Figure 3.1(a)
shows extreme variability in a population of 60 participants. Foreign-language
vocabulary was studied at four precisely scheduled times over a four-week
period. A cued-recall exam was administered after an eight-week retention
period. The exam scores are highly dispersed despite the uniformity in
materials and training schedules. In addition to inter-student variability,
inter-item variability is a consideration. Learning a foreign vocabulary word
may be easy if it is similar to its English equivalent, but hard if it is similar
to a different English word. Figure 3.1(b) shows the distribution of recall
accuracy for 120 Lithuanian-English vocabulary items averaged over a set of
students (Grimaldi, Pyc, & Rawson, 2010). With a single round of study, an

FIGURE 3.1 (a) Histogram of proportion of items reported correctly on a cued recall
task for a population of 60 students learning 32 Japanese-English vocabulary pairs
(Kang et al., 2014); (b) Histogram of proportion of subjects correctly reporting an
item on a cued recall task for a population of 120 Lithuanian-English vocabulary pairs
being learned by roughly 80 students (Grimaldi et al., 2010).

exam administered several minutes later suggests that items show a tremendous
range in difficulty (krantas→shore was learned by only 3 percent of students;
lova→bed was learned by 76 percent of students).

Psychological Theories of Long-Term Memory Processes


The most distressing feature of memory is the inevitability of forgetting. Forgetting
occurs regardless of the skills or material being taught, and regardless of the age
or background of the learner. Even highly motivated learners are not immune:
Medical students forget roughly 25–35 percent of basic science knowledge after

one year, more than 50 percent by the next year (Custers, 2010), and 80–85
percent after 25 years (Custers & ten Cate, 2011).
Forgetting is often assessed by teaching participants some material in a single
session and then assessing cued-recall accuracy following some lag t. The
probability of recalling the studied material decays according to a generalized
power-law as a function of t (Wixted & Carpenter, 2007),
Pr(recall) = m(1 + ht)^{−f},
where m, h, and f are constants interpreted as the degree of initial learning
(0 ≤ m ≤ 1), a scaling factor on time (h > 0), and the memory decay exponent
( f > 0), respectively. Figure 3.2(a) shows recall accuracy at increasing study-test
lags from an experiment by Cepeda, Vul, Rohrer, Wixted, & Pashler (2008) in
which participants were taught a set of obscure facts. The solid line in the figure is
the best fitting power-law forgetting curve.
When material is studied over several sessions, the temporal distribution of study
influences the durability of memory. This phenomenon, known as the spacing effect,
is observed for a variety of materials—skills and concepts as well as facts (Carpenter,
Cepeda, Rohrer, Kang, & Pashler, 2012)—and has been identified as showing
great promise for improving educational outcomes (Dunlosky, Rawson, Marsh,
Nathan, & Willingham, 2013).
The spacing effect is typically studied via a controlled experimental paradigm in
which participants are asked to study unfamiliar paired associates in two sessions.
The time between sessions, known as the intersession interval or ISI, is manipulated
across participants. Some time after the second study session, a cued-recall test is
administered to the participants. The lag between the second session and the test
is known as the retention interval or RI. Cepeda et al. (2008) conducted a study in
which RIs were varied from seven to 350 days and ISIs were varied from minutes
to 105 days. Their results are depicted as circles connected with dashed lines in
Figure 3.2(b). (The solid lines are model fits, which we discuss shortly.) For each
RI, Cepeda et al. (2008) find an inverted-U relationship between ISI and retention.
The left edge of the graph corresponds to massed practice, the situation in which
session two immediately follows session one. Recall accuracy rises dramatically
as the ISI increases, reaching a peak and then falling off gradually. The optimal
ISI—the peak of each curve—increases with the RI. Note that for educationally
relevant RIs on the order of weeks and months, the Cepeda et al. (2008) result
indicates that the effect of spacing can be tremendous: Optimal spacing can double
retention over massed practice. Cepeda, Pashler, Vul, Wixted & Rohrer (2006)
conducted a meta-analysis of the literature to determine the functional relationship
between RI and optimal ISI. We augmented their dataset with the more recent
results of Cepeda et al. (2008) and observed an approximately power-function
relationship between RI and optimal ISI (both in days):
Optimal ISI = 0.097 × RI^{0.812}.

FIGURE 3.2 (a) Recall accuracy as a function of lag between study and test for a set
of obscure facts; circles represent data provided by Cepeda et al. (2008) and solid line
is the best power-law fit. (b) Recall accuracy as a function of the temporal spacing
between two study sessions (on the abscissa) and the retention period between the
second study session and a final test. Circles represent data provided by Cepeda et al.
(2008), and solid lines are fits of the model MCM, as described in the text.

This relationship suggests that as material becomes more durable with practice,
ISIs should increase, supporting even longer ISIs in the future, consistent with
an expanding-spacing schedule as qualitatively embodied in the Leitner method
(Leitner, 1972) and SuperMemo (Woźniak, 1990).
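For a quick numerical reading of this relationship (simply evaluating the power function above; no additional data are involved), the following sketch computes the implied optimal ISI for a few retention intervals.

def optimal_isi(ri_days):
    # Optimal intersession interval implied by Optimal ISI = 0.097 * RI^0.812
    return 0.097 * ri_days ** 0.812

for ri in (7, 35, 70, 350):
    print(f"RI = {ri:3d} days  ->  optimal ISI = {optimal_isi(ri):5.1f} days")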

Many models have been proposed to explain the mechanisms of the spacing
effect (e.g. Benjamin & Tullis, 2010; Kording, Tenenbaum, & Shadmehr, 2007;
Mozer, Pashler, Cepeda, Lindsey, & Vul, 2009; Pavlik & Anderson, 2005a;
Raaijmakers, 2003; Staddon, Chelaru, & Higa, 2002). These models have been
validated through their ability to account for experimental results, such as those
in Figure 3.2, which represent mean performance of a population of individuals
studying a set of items. Although the models can readily be fit to an individual’s
performance for a set of items (e.g. Figure 3.1(a)) or a population’s performance
for a specific item (e.g. Figure 3.1(b)), it is a serious challenge in practice to use
these models to predict an individual’s memory retention for a specific item.
We will shortly describe an approach to making such individualized predictions.
Our approach incorporates key insights from two computational models, ACT-R
(Pavlik & Anderson, 2005a) and MCM (Mozer et al., 2009), into a Big Data
technique that leverages population data to make individualized predictions. First,
we present a brief overview of the two models.

ACT-R
ACT-R (Anderson et al., 2004) is an influential cognitive architecture whose
declarative memory module is often used to account for explicit recall following
study. ACT-R assumes that a separate trace is laid down each time an item is studied,
and the trace decays according to a power law, t^{−d}, where t is the age of the
memory and d is the power-law decay for that trace. Following n study episodes,
the activation for an item, m_n, combines the trace strengths of individual study
episodes according to:

m_n = ln( Σ_{k=1}^{n} b_k t_k^{−d_k} ) + β,    (1)

where tk and dk refer to the age and decay associated with trace k, and β is a student
and/or item-specific parameter that influences memory strength. The variable bk
reflects the salience of the kth study session (Pavlik, 2007): Larger values of bk
correspond to cases where, for example, the participant self-tested and therefore
exerted more effort.
To explain spacing effects, Pavlik and Anderson (2005b, 2008) made an
additional assumption: The decay for the trace formed on study trial k depends
on the item’s activation at the point when study occurs:

d_k(m_{k−1}) = c e^{m_{k−1}} + α,

where c and α are constants. If study trial k occurs shortly after the previous
trial, the item’s activation, m k−1 , is large, which will cause trace k to decay
rapidly. Increasing spacing therefore benefits memory by slowing decay of trace k.

However, this benefit is traded off against a cost incurred due to the aging of traces
1...k − 1 that causes them to decay further.
The probability of recall is monotonically related to activation, m:
Pr(recall) = 1/(1 + e^{(τ − m)/s}),

where τ and s are additional parameters. In total, the variant of the model
described here has six free parameters.
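The sketch below is a minimal Python rendering of this mechanism under our reading of the equations above (all parameter values are arbitrary illustrations, and the salience terms b_k are fixed at 1): each study event adds a trace whose decay rate is set by the activation at the moment of study, and recall probability is a logistic function of activation at test. Run as written, it should show the qualitative benefit of spacing the two study sessions.

import numpy as np

def actr_recall(study_times, test_time, b=1.0, c=0.2, alpha=0.2,
                beta=0.0, tau=-0.7, s=0.3):
    # ACT-R-style activation with spacing-dependent decay (illustrative parameters).
    def activation(now, times, decays):
        if not times:
            return -np.inf                          # no traces stored yet
        ages = now - np.array(times)
        return np.log(np.sum(b * ages ** -np.array(decays))) + beta

    times, decays = [], []
    for t_study in study_times:
        m_before = activation(t_study, times, decays)
        d = c * np.exp(m_before) + alpha            # decay of the new trace depends on
        times.append(t_study)                       # activation at the moment of study
        decays.append(d)

    m_test = activation(test_time, times, decays)
    return 1.0 / (1.0 + np.exp((tau - m_test) / s))

# Two study sessions, tested at day 40: a 10-day gap versus a same-day repetition
print("spaced:", round(actr_recall([0, 10], test_time=40), 3))
print("massed:", round(actr_recall([0, 0.01], test_time=40), 3))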
Pavlik and Anderson (2008) use ACT-R activation predictions in a heuristic
algorithm for within-session scheduling of trial order and trial type (i.e. whether an
item is merely studied, or whether it is first tested and then studied). They assume
a fixed spacing between initial study and subsequent review. Thus, their algorithm
reduces to determining how to best allocate a finite amount of time within a
session. Although they show an effect of the algorithm used for within-session
scheduling, between-session manipulation has a greater impact on long-term
retention (Cepeda, Pashler, Vul, & Wixted, 2006).

MCM
ACT-R is predicated on the assumption that memory decay follows a power function.
We developed an alternative model, the Multiscale Context Model or MCM (Mozer
et al., 2009), which provides a mechanistic basis for the power function. Adopting
key ideas from previous models of the spacing effect (Kording et al., 2007;
Raaijmakers, 2003; Staddon et al., 2002) MCM proposes that each time an item is
studied, it is stored in multiple item-specific memory traces that decay at different
rates. Although each trace has an exponential decay, the sum of the traces decays
approximately as a power function of time. Specifically, trace i, denoted xi , decays
over time according to:

x_i(t + Δt) = x_i(t) exp(−Δt/τ_i),

where τi is the decay time constant, ordered such that successive traces have slower
decays, i.e. τ_i < τ_{i+1}. Traces 1 through k are combined to form a net trace strength, s_k,
via a weighted average:

s_k = (1/Γ_k) Σ_{i=1}^{k} γ_i x_i,   where   Γ_k = Σ_{i=1}^{k} γ_i

and γ_i is a factor representing the contribution of trace i. In a cascade of K traces,
recall probability is simply the thresholded strength: Pr(recall) = min(1, s_K).
Spacing effects arise from the trace update rule, which is based on Staddon et
al. (2002). A trace is updated only to the degree that it and faster decaying traces
fail to encode the item at the time of study. This rule has the effect of storing
information on a timescale that is appropriate given its frequency of occurrence in

the environment. Formally, when an item is studied, the increment to trace i is


negatively correlated with the net strength of the first i traces, i.e.

1xi = ǫ(1 − si ),

where $\epsilon$ is a step size. We adopt the retrieval-dependent update assumption of
Raaijmakers (2003): $\epsilon = 1$ for an item that is not recalled at the time of study, and
$\epsilon = \epsilon_r$ ($\epsilon_r > 1$) for an item that is recalled.
The model has five free parameters ($\epsilon_r$, and four parameters that determine
the contributions $\{\gamma_i\}$ and the time constants $\{\tau_i\}$). MCM was designed such that
all of its parameters, with the exception of $\epsilon_r$, could be fully constrained by data
that are easy to collect—the function characterizing forgetting following a single
study session—which then allows the model to make predictions for data that
are difficult to collect—the function characterizing forgetting following a study
schedule consisting of two or more study sessions. MCM has been used to obtain
parameter-free predictions for a variety of results in the spacing literature. The solid
lines in the right panel of Figure 3.2 show parameter-free predictions of MCM for
the Cepeda et al. (2008) study.
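
For concreteness, the following is a minimal sketch of MCM's trace decay, weighted averaging, and update rules for a single item. The time constants, trace weights, and $\epsilon_r$ value are arbitrary placeholders rather than the data-constrained values described above.

import numpy as np

def mcm_recall_prob(study_times, recalled, t_test,
                    taus=(1.0, 7.0, 30.0, 120.0),   # trace time constants (placeholders)
                    gammas=(1.0, 0.7, 0.5, 0.3),    # trace contributions (placeholders)
                    eps_r=2.0):
    # Each trace decays as x_i(t + dt) = x_i(t) * exp(-dt / tau_i); the net
    # strength s_i is the gamma-weighted average of traces 1..i; at study,
    # Delta x_i = eps * (1 - s_i) with eps = eps_r if the item was recalled
    # and eps = 1 otherwise; Pr(recall) = min(1, s_K).
    taus = np.asarray(taus, float)
    gammas = np.asarray(gammas, float)
    x = np.zeros(len(taus))
    t_prev = None
    for t, was_recalled in zip(study_times, recalled):
        if t_prev is not None:
            x = x * np.exp(-(t - t_prev) / taus)          # decay since the last event
        s = np.cumsum(gammas * x) / np.cumsum(gammas)     # net strengths s_1..s_K
        x = x + (eps_r if was_recalled else 1.0) * (1.0 - s)
        t_prev = t
    x = x * np.exp(-(t_test - t_prev) / taus)             # decay until the test
    s_K = np.sum(gammas * x) / np.sum(gammas)
    return min(1.0, s_K)

# Example: study on days 0 and 7 (not recalled at study), tested 30 days after the second session.
print(mcm_recall_prob([0.0, 7.0], [False, False], t_test=37.0))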

Collaborative Filtering
In the last several years, an alternative approach to predicting learners’ performance
has emerged from the machine-learning community. This approach essentially sets
psychological theory aside in favor of mining large datasets collected as students
solve problems. To give a sense of the size of these datasets, we note that the
Khan Academy had over 10 million unique users per month and delivered over
300 million lessons at the end of 2013 (Mullany, 2013). Figure 3.3(a) visualizes
a dataset in which students solve problems over time. Each cell in the tensor
corresponds to a specific student solving a particular problem at a given moment
in time. The contents of a cell indicate whether an attempt was made and if so
whether it was successful. Most of the cells in the tensor are empty. A collaborative
filtering approach involves filling in those missing cells. While the tensor may have
no data about student S solving problem P given a particular study history, the
tensor will have data about other similar students solving P, or about S solving
problems similar to P. Filling in the tensor also serves to make predictions about
future points in time.
Collaborative filtering has a long history in e-commerce recommender systems;
for example, Amazon wishes to recommend products to customers and Netflix
wishes to recommend movies to its subscribers. The problems are all formally
equivalent: simply replace “student” in Figure 3.3(a) with “customer” or
“subscriber,” and replace “problem” with “product” or “movie.” The twist
that distinguishes memory prediction from product or movie prediction is our
understanding of the temporal dynamics of human memory. These dynamics are
FIGURE 3.3 (a) A tensor representing students × problems × time. Each cell describes
a student’s attempt to solve a problem at a particular moment in time. (b) A naive
graphical model representing a teaching paradigm. The nodes represent random
variables and the arrows indicate conditional dependencies among the variables. Given
a student with knowledge state K t at time t, and a problem Pt posed to that student,
Rt denotes the response the student will produce. The evolution of the student’s
knowledge state will depend on the problem that was just posed. This framework can
be used to predict student responses or to determine an optimal sequence of problems
for a particular student given a specific learning objective.

not fully exploited in generic machine-learning approaches. We shortly describe
initial efforts in this regard that leverage computational models like ACT-R and
MCM to characterize memory dynamics.
Collaborative filtering involves inferring a relatively compact set of latent
variables that can predict or explain the observed data. In the case of product
recommendations, the latent variables may refer to features of a product (e.g.
suitable for children) or of the customer (e.g. has children). In the case of
student modeling, the latent variables describe skills required to solve a problem
or a student’s knowledge state. Using these latent variable representations of
problems and student knowledge states, Figure 3.3(b) presents an extremely general
data-driven framework that has been fruitfully instantiated to predict and guide
learning (e.g. Lan, Studer, & Baraniuk, 2014; Sohl-Dickstein, 2013).
A simple collaborative-filtering method that may be familiar to readers is
item-response theory or IRT, the classic psychometric approach to inducing latent
traits of students and items based on exam scores (DeBoek & Wilson, 2004). IRT
is used to analyze and interpret results from standardized tests such as the SAT and
GRE, which consist of multiple-choice questions and are administered to large
populations of students. Suppose that $n_S$ students take a test consisting of $n_I$ items,
and the results are coded in the binary matrix $R \equiv \{r_{si}\}$, where $s$ is an index over
students, $i$ is an index over items, and $r_{si}$ is the binary (correct or incorrect) score
for student $s$'s response to item $i$. IRT aims to predict $R$ from latent traits of the
students and the items. Each student $s$ is assumed to have an unobserved ability,
represented by the scalar $a_s$. Each item $i$ is assumed to have an unobserved difficulty
level, represented by the scalar $d_i$.
IRT specifies the probabilistic relationship between the predicted response, $R_{si}$,
and $a_s$ and $d_i$. The simplest instantiation of IRT, called the one-parameter logistic
(1PL) model because it has one item-associated parameter, is:

$$\Pr(R_{si} = 1) = \frac{1}{1 + \exp(d_i - a_s)}. \qquad (2)$$
A more elaborate version of IRT, called the 3PL model, includes an item-associated
parameter for guessing, but that is mostly useful for multiple-choice questions
where the probability of correctly guessing is non-negligible. Another variant,
called the 2PL model, includes parameters that allow for student ability to have
a non-uniform influence across items. (In simulations we shortly describe, we
explored the 2PL model but found that it provided no benefit over the 1PL
model.) Finally, there are more sophisticated latent-trait models that characterize
each student and item not as a scalar but as a feature vector (Koren, Bell, & Volinsky,
2009).
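
As a concrete illustration of Equation 2, the sketch below computes 1PL IRT predictions and fits abilities and difficulties with a crude gradient-ascent routine; the fitting procedure is our own stand-in for exposition, not the estimation method used in this chapter.

import numpy as np

def irt_1pl_prob(ability, difficulty):
    # Pr(R_si = 1) = 1 / (1 + exp(d_i - a_s))   (Equation 2)
    return 1.0 / (1.0 + np.exp(difficulty - ability))

def fit_1pl(R, n_iters=500, lr=0.05):
    # Crude maximum-likelihood fit by gradient ascent (for illustration only).
    # R is an n_students x n_items binary matrix.
    n_students, n_items = R.shape
    a = np.zeros(n_students)
    d = np.zeros(n_items)
    for _ in range(n_iters):
        err = R - irt_1pl_prob(a[:, None], d[None, :])   # gradient of the Bernoulli log-likelihood
        a += lr * err.sum(axis=1) / n_items
        d -= lr * err.sum(axis=0) / n_students
        d -= d.mean()                                    # pin down the scale's origin
    return a, d

# Toy data: 5 students x 4 items.
rng = np.random.default_rng(0)
R = (rng.random((5, 4)) < 0.6).astype(float)
a, d = fit_1pl(R)
print(irt_1pl_prob(a[0], d[0]))   # predicted Pr(correct) for student 0 on item 0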

Integrating Psychological Theory with Big-Data Methods: A Case Study of Forgetting
IRT is typically applied post hoc to evaluate the static skill level of students (Roussos,
Templin, & Henson, 2007). Extensions have been proposed to model a time
varying skill level (e.g. Andrade & Tavares, 2005), allowing the technique to predict
future performance. However, these extensions are fairly neutral with regard to
their treatment of time: Skill levels at various points in time are treated as unrelated
or as following a random walk. Thus, the opportunity remains to explore dynamic
variants of latent-trait models that integrate the longitudinal history of study and
properties of learning and forgetting to predict future performance of students. In
this section, we take an initial step in this direction by incorporating the latent
traits of IRT into a theory of forgetting. Instead of using IRT to directly predict
behavioral outcomes, we use latent-trait models to infer variables such as initial
memory strength and memory decay rate, and then use the theory of forgetting to
predict knowledge state and behavioral outcomes.

Candidate Models
The forgetting curve we described earlier, based on the generalized power law,
is supported by data from populations of students and/or populations of items.
The forgetting curve cannot be measured for an individual item and a particular
student—which we’ll refer to as a student-item—due to the observer effect and the
all-or-none nature of forgetting. Regardless, we will assume the functional form
of the curve for a student-item is the same, yielding:
$$\Pr(R_{si} = 1) = m(1 + h t_{si})^{-f}, \qquad (3)$$
where $R_{si}$ is the response of student $s$ to item $i$ following retention interval $t_{si}$.
This model has free parameters $m$, $h$, and $f$, as described earlier.
We would like to incorporate the notion that forgetting depends on latent
IRT-like traits that characterize student ability and item difficulty. Because the
critical parameter of forgetting is the memory decay exponent, $f$, and because
$f$ changes as a function of skill and practice (Pavlik & Anderson, 2005a), we can
individuate forgetting for each student-item by determining the decay exponent
in Equation 3 from latent IRT-like traits:

$$f = e^{\tilde{a}_s - \tilde{d}_i}. \qquad (4)$$

We add the tilde to $\tilde{a}_s$ and $\tilde{d}_i$ to indicate that these ability and difficulty parameters
are not the same as those in Equation 2. Using the exponential function ensures
that $f$ is non-negative.
Another alternative we consider is individuating the degree-of-learning
parameter in Equation 3 as follows:

$$m = \frac{1}{1 + \exp(d_i - a_s)}. \qquad (5)$$

With this definition of $m$, Equation 3 simplifies to 1PL IRT (Equation 2) at $t = 0$.
For $t > 0$, recall probability decays as a power-law function of time.
We explored five models that predict recall accuracy of specific student-items:
(1) IRT, the 1PL IRT model (Equation 2); (2) MEMORY, a power-law forgetting
model with population-wide parameters (Equation 3); (3) HYBRID DECAY, a
power-law forgetting model with decay rates based on latent student and item
traits (Equations 3 and 4); (4) HYBRID SCALE, a power-law forgetting model with
the degree-of-learning based on latent student and item traits (Equations 3 and 5);
and (5) HYBRID BOTH, a power-law forgetting model that individuates both the
decay rate and degree-of-learning (Equations 3, 4, and 5). The Appendix describes
a hierarchical Bayesian inference method for parameter estimation and obtaining
model predictions.
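
The five candidate models can be written down compactly as prediction functions. The sketch below does so directly from Equations 2–5; the parameter values in the example call are placeholders that would in practice be inferred with the hierarchical Bayesian procedure described in the Appendix.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_irt(a_s, d_i):
    # IRT: Pr(R = 1) = 1 / (1 + exp(d_i - a_s))                       (Equation 2)
    return sigmoid(a_s - d_i)

def p_memory(t, m, h, f):
    # MEMORY: population-wide power-law forgetting                    (Equation 3)
    return m * (1.0 + h * t) ** (-f)

def p_hybrid_decay(t, a_tilde, d_tilde, m, h):
    # HYBRID DECAY: decay rate f = exp(a~_s - d~_i)                   (Equations 3 and 4)
    return m * (1.0 + h * t) ** (-np.exp(a_tilde - d_tilde))

def p_hybrid_scale(t, a_s, d_i, h, f):
    # HYBRID SCALE: degree of learning m = 1 / (1 + exp(d_i - a_s))   (Equations 3 and 5)
    return sigmoid(a_s - d_i) * (1.0 + h * t) ** (-f)

def p_hybrid_both(t, a_s, d_i, a_tilde, d_tilde, h):
    # HYBRID BOTH: individuates both m and f                          (Equations 3, 4, and 5)
    return sigmoid(a_s - d_i) * (1.0 + h * t) ** (-np.exp(a_tilde - d_tilde))

# Illustrative call: a 7-day retention interval with arbitrary trait values.
print(p_hybrid_both(t=7.0, a_s=0.5, d_i=-0.2, a_tilde=-0.3, d_tilde=0.1, h=1.0))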

Simulation Results
We present simulations of our models using data from two previously published
psychological experiments exploring how people learn and forget facts, summarized
in Table 3.1. In both experiments, students were trained on a set of items
(cue–response pairs) over multiple rounds of practice. In the first round, the cue
and response were both shown. In subsequent rounds, retrieval practice was given:
Students were asked to produce the appropriate response to each cue. Whether
successful or not, the correct response was then displayed. Following this training
procedure was a retention interval tsi specific to each student and each item, after
TABLE 3.1 Experimental data used for simulations.

Study name            S1                             S2
Source                Kang et al. (2014)             Cepeda et al. (2008)
Materials             Japanese-English vocabulary    Interesting but obscure facts
# Students            32                             1354
# Items               60                             32
Rounds of practice    3                              1
Retention intervals   3 min–27 days                  7 sec–53 min

which an exam was administered. The exam obtained the rsi binary value for that
student-item.
To evaluate the models, we performed 50-fold validation. In each fold, a
random 80 percent of elements of R were used for training and the remaining
20 percent were used for evaluation. Each model generates a prediction,
conditioned on the training data, of recall probability at the exam time tsi , which
can be compared against the observed recall accuracy in the held-out data.
Each model’s capability of discriminating successful from unsuccessful recall
trials was assessed with a signal-detection analysis (Green & Swets, 1966). For
each model, we compute the mean area under the receiver operating characteristic
curve (hereafter, AUC) across validation folds as a measure of the model’s predictive
ability. The measure ranges from 0.5 for random guesses to 1.0 for perfect
predictions. The greater the AUC, the better the model is at predicting a particular
student’s recall success on a specific item after a given lag.
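
In sketch form, the evaluation amounts to repeated random 80/20 splits with AUC computed on the held-out responses. The rank-based AUC and the stand-in predictor below are our own illustrative choices; any of the five models could be plugged in as fit_and_predict.

import numpy as np

def auc(y_true, y_score):
    # Area under the ROC curve via the rank-sum identity (ties broken arbitrarily).
    y_true = np.asarray(y_true, bool)
    ranks = np.empty(len(y_score))
    ranks[np.argsort(y_score)] = np.arange(1, len(y_score) + 1)
    n_pos, n_neg = y_true.sum(), (~y_true).sum()
    if n_pos == 0 or n_neg == 0:
        return 0.5
    return (ranks[y_true].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def cross_validate(responses, fit_and_predict, n_folds=50, train_frac=0.8, seed=0):
    # responses: list of (student, item, lag, r) tuples.  fit_and_predict(train, test)
    # stands in for any of the five models and returns predicted recall probabilities
    # for the held-out test tuples.
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_folds):
        idx = rng.permutation(len(responses))
        n_train = int(train_frac * len(responses))
        train = [responses[i] for i in idx[:n_train]]
        test = [responses[i] for i in idx[n_train:]]
        preds = fit_and_predict(train, test)
        scores.append(auc([r for (_, _, _, r) in test], preds))
    return np.mean(scores)

# Stand-in predictor: higher predicted recall for shorter lags (ignores the training data).
def lag_model(train, test):
    return [1.0 / (1.0 + lag / 10.0) for (_, _, lag, _) in test]

rng = np.random.default_rng(1)
toy = [(s, i, float(rng.integers(1, 30)), int(rng.random() < 0.6)) for s in range(20) for i in range(10)]
print(cross_validate(toy, lag_model, n_folds=5))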
Figure 3.4(a) and (b) summarizes the AUC values for studies S1 and S2,
respectively. The baseline MEMORY model performs poorly (p < 0.01 for all
pairwise comparisons by a two-tailed t test unless otherwise noted), suggesting
that the other models have succeeded in recovering latent student and item traits
that facilitate inference about the knowledge state of a particular student-item.
The baseline IRT model, which ignores the lag between study and test, does
not perform as well as the latent-state models that incorporate forgetting. The
HYBRID BOTH model does best in S1 and ties for best in S2 , suggesting that
allowing for individual differences both in degree of learning and rate of forgetting
is appropriate. The consistency of results between the two studies is not entirely
trivial considering the vastly different retention intervals examined in the two
studies (see Table 3.1).

Generalization to New Material


The simulation we described holds out individual student-item pairs for validation.
This approach was convenient for evaluating models but does not correspond to
FIGURE 3.4 Mean AUC values on the five models trained and evaluated on (a) Study
S1 and (b) Study S2 . The error bars indicate a 95 percent confidence interval on
the AUC value over multiple validation folds. Note that the error bars are not useful
for comparing statistical significance of the differences across models, because the
validation folds are matched across models, and the variability due to the fold must be
removed from the error bars.

the manner in which predictions might ordinarily be used. Typically, we may
have some background information about the material being learned, and we wish
to use this information to predict how well a new set of students will fare on
the material. Or we might have some background information about a group of
students, and we wish to use this information to predict how well they will fare
on new material. For example, suppose we collect data from students enrolled in
Spanish 1 in the fall semester. At the onset of the spring semester, when our former
FIGURE 3.5 Mean AUC values when random items are held out during validation
folds, Study S1

Spanish 1 students begin Spanish 2, can we benefit from the data acquired in the
fall to predict their performance on new material?
To model this situation, we conducted a further validation test in which,
instead of holding out random student-item pairs, we held out random items
for all students. Figure 3.5 shows mean AUC values for Study S1 data for the
various models. Performance in this item-generalization task is slightly worse than
performance when the model has familiarity with both the students and the items.
Nonetheless, it appears that the models can make predictions with high accuracy
for new material based on inferences about latent student traits and about other
items.1
To summarize, in this section we demonstrated that systematic individual
(student and item) differences can be discovered and exploited to better predict
a particular student’s retention of a specific item. A model that combines a
psychological theory of forgetting with a collaborative filtering approach to
latent-trait inference yields better predictions than models based purely on
psychological theory or purely on collaborative filtering. However, the datasets
we explored are relatively small—1,920 and 43,328 exam questions. Ridgeway,
Mozer, and Bowles (2016) explore a much larger dataset consisting of 46.3 million
observations collected from 125K students learning foreign language skills with
online training software. Even in this much larger dataset, memory retention
is better predicted using a hybrid model over a purely data-driven approach.2
Furthermore, in naturalistic learning scenarios, students are exposed to material
multiple times, in various contexts, and over arbitrary temporal distributions of
study. The necessity for mining a large dataset becomes clear in such a situation,
but so does the role of psychological theory, as we hope to convince the reader in
the next section.

Integrating Psychological Theory with Big-Data Methods: A Case Study of Personalized Review
We turn now to an ambitious project in which we embedded knowledge-state
models into software that offers personalized recommendations to students about
specific material to study. The motivation for this project is the observation that,
at all levels of the educational system, instructors and textbooks typically introduce
students to course material in blocks, often termed chapters or lessons or units.
At the end of each block, instructors often administer a quiz or problem set to
encourage students to master material in the block. Because students are rewarded
for focusing on this task, they have little incentive at that moment to rehearse
previously learned material. Although instructors appreciate the need for review,
the time demands of reviewing old material must be balanced against the need
to introduce new material, explain concepts, and encourage students toward
initial mastery. Achieving this balance requires understanding when students will
most benefit from review. Controlled classroom studies have demonstrated the
importance of spaced over massed study (Carpenter, Pashler, & Cepeda, 2009;
Seabrook, Brown, & Solity, 2005; Sobel, Cepeda, & Kapler, 2011), but these
studies have been a one-size-fits-all type of approach in which all students reviewed
all material in synchrony. We hypothesized that personalized review might yield
greater benefits, given individual differences such as those noted in the previous
section of this chapter.
We developed software that was integrated into middle-school Spanish foreign
language courses to guide students in the systematic review of course material. We
conducted two semester-long experiments with this software. In each experiment,
we compared several alternative strategies for selecting material to review. Our goal
was to evaluate a Big Data strategy for personalized review that infers the dynamic
knowledge state of each student as the course progressed, taking advantage of both
population data and psychological theory. Just as we leveraged theories of forgetting
to model retention following a single study session, we leverage theories of spaced
practice—in particular, the two models we described earlier, ACT-R and MCM—to
model retention following a complex multi-episode study schedule.

Representing Study History


Before turning to our experiments, we extend our approach to modeling
knowledge state. Previously, we were concerned with modeling forgetting after a
student was exposed to material one time. Consequently, we were able to make the
strong assumption that all student-items have an identical study history. To model
knowledge state in a more naturalistic setting, we must relax this assumption and
allow for an arbitrary study history, defined as zero or more previous exposures at
particular points in time.
Extending our modeling approach, we posit that knowledge state is jointly
dependent on factors relating to (1) an item’s latent difficulty, (2) a student’s latent
ability, and (3) the amount, timing, and outcome of past study. We refer to the
model with the acronym DASH summarizing the three factors (difficulty, ability,
and study history).
DASH predicts the likelihood of student s making a correct response on the kth
trial for item i, conditioned on that student-item’s specific study history:

$$P(R_{sik} = 1 \mid a_s, d_i, t_{1:k}, r_{1:k-1}, \theta) = \sigma\!\left(a_s - d_i + h_\theta(t_{s,i,1:k}, r_{s,i,1:k-1})\right), \qquad (6)$$

where $\sigma(x) \equiv \left(1 + \exp(-x)\right)^{-1}$ is the logistic function, $t_{s,i,1:k}$ are the times at
which trials 1 through $k$ occurred, $r_{s,i,1:k-1}$ are the binary response accuracies on
trials 1 through $k-1$, $h_\theta$ is a function that summarizes the effect of study history
on recall probability, and $\theta$ is a parameter vector that governs $h_\theta$. As before, $a_s$
and $d_i$ denote the latent ability of student $s$ and difficulty of item $i$, respectively.
This framework is an extension of additive-factors models used in educational data
mining (Cen, Koedinger, & Junker, 2006, 2008; Pavlik, Cen, & Koedinger, 2009).
DASH draws on key insights from the psychological models MCM and ACT-R
via a representation of study history that is based on log counts of practice and
success with an item over multiple expanding windows of time, formalized as:

$$h_\theta = \sum_{w=1}^{W} \left[\theta_{2w-1}\log(1 + c_{siw}) + \theta_{2w}\log(1 + n_{siw})\right], \qquad (7)$$

where $w \in \{1, \ldots, W\}$ is an index over time windows, $c_{siw}$ is the number of times
student $s$ correctly recalled item $i$ in window $w$ out of $n_{siw}$ attempts, and $\theta$ are
window-specific weightings. Motivated by the multiple traces of MCM, we include
statistics of study history that span increasing windows of time. These windows
allow the model to modulate its predictions based on the temporal distribution
of study. Motivated by the diminishing benefit of additional study in ACT-R
(Equation 1), we include a similar log transform in Equation 7.³ Both MCM
and ACT-R modulate the effect of past study based on response outcomes, i.e.
whether the student performed correctly or not on a given trial. This property is
incorporated into Equation 7 via the separation of parameters for counts of total
and correct attempts.
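
A minimal sketch of the DASH prediction for a single student-item, combining Equations 6 and 7: attempts and successes are counted within expanding time windows, log-transformed, weighted by $\theta$, and passed through the logistic function together with the student and item traits. The window spans and weights here are illustrative placeholders.

import numpy as np

def dash_prob(a_s, d_i, times, outcomes, t_now, theta,
              windows=(1.0, 7.0, 30.0, 365.0)):
    # Pr(R = 1) = sigma(a_s - d_i + h_theta), with
    # h_theta = sum_w [ theta_{2w-1} log(1 + c_siw) + theta_{2w} log(1 + n_siw) ]  (Eqs. 6 and 7).
    # Window w counts the trials in the last windows[w] time units; the spans and
    # theta values are illustrative placeholders (theta is indexed from 0 here).
    times = np.asarray(times, float)
    outcomes = np.asarray(outcomes, float)
    h = 0.0
    for w, span in enumerate(windows):
        in_window = (t_now - times) <= span      # expanding windows ending at the current time
        n_w = in_window.sum()                    # attempts in window w
        c_w = outcomes[in_window].sum()          # correct responses in window w
        h += theta[2 * w] * np.log1p(c_w) + theta[2 * w + 1] * np.log1p(n_w)
    return 1.0 / (1.0 + np.exp(-(a_s - d_i + h)))

# One student-item with three prior trials (days 0, 2, 9), correct on the last two.
theta = 0.25 * np.ones(8)   # two weights per window
print(dash_prob(a_s=0.3, d_i=0.1, times=[0, 2, 9], outcomes=[0, 1, 1], t_now=10.0, theta=theta))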
Being concerned that the memory dynamics of MCM and ACT-R provided
only loose inspiration to DASH, we designed two additional variants of DASH that
more strictly adopted the dynamics of MCM and ACT-R. The variant we call
DASH[MCM] replaces expanding time windows with expanding time constants,
which determine the rate of exponential decay of memory traces. The model
assumes that the counts $n_{siw}$ and $c_{siw}$ are incremented at each trial and then decay
over time at a timescale-specific exponential rate $\tau_w$. Formally, we use Equation 7
with the counts redefined as:

$$n_{siw} = \sum_{\kappa=1}^{k-1} e^{-(t_{sik} - t_{si\kappa})/\tau_w}, \qquad c_{siw} = \sum_{\kappa=1}^{k-1} r_{si\kappa}\, e^{-(t_{sik} - t_{si\kappa})/\tau_w}. \qquad (8)$$
The variant we call DASH[ACT-R] does not have a fixed number of time windows,
but instead—like ACT-R—allows for the influence of past trials to continuously
decay according to a power-law. DASH[ACT-R] formalizes the effect of study
history to be identical to the memory trace strength of ACT-R (Equation 1):

$$h_\theta = \theta_1 \log\!\left(1 + \sum_{\kappa=1}^{k-1} \theta_{3+r_{si\kappa}}\, (t_{sik} - t_{si\kappa})^{-\theta_2}\right) \qquad (9)$$

Further details of the modeling and a hierarchical Bayesian scheme for inferring
model parameters are given in Lindsey (2014).
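
To make the two variants concrete, the sketch below computes the study-history representations they use: exponentially decaying counts for DASH[MCM] (Equation 8) and the ACT-R-style power-law strength for DASH[ACT-R] (Equation 9). The time constants and $\theta$ values are placeholders, not fitted parameters.

import numpy as np

def dash_mcm_counts(times, outcomes, t_k, taus=(1.0, 7.0, 30.0)):
    # Equation 8: attempt and success counts decayed at timescale-specific rates tau_w.
    # Returns arrays (n_w, c_w), one entry per timescale; taus are placeholders.
    times = np.asarray(times, float)
    outcomes = np.asarray(outcomes, float)
    decay = np.exp(-(t_k - times)[None, :] / np.asarray(taus, float)[:, None])
    n_w = decay.sum(axis=1)
    c_w = (decay * outcomes[None, :]).sum(axis=1)
    return n_w, c_w

def dash_actr_h(times, outcomes, t_k, theta=(1.0, 0.5, 0.1, 0.3)):
    # Equation 9: h = theta_1 * log(1 + sum_kappa theta_{3 + r_kappa} * (t_k - t_kappa)^(-theta_2)),
    # with theta = (theta_1, theta_2, theta_3, theta_4); placeholder values.
    times = np.asarray(times, float)
    ages = np.maximum(t_k - times, 1e-3)
    weights = np.where(np.asarray(outcomes) > 0, theta[3], theta[2])
    return theta[0] * np.log1p(np.sum(weights * ages ** (-theta[1])))

times, outcomes = [0.0, 2.0, 9.0], [0, 1, 1]
print(dash_mcm_counts(times, outcomes, t_k=10.0))
print(dash_actr_h(times, outcomes, t_k=10.0))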

Classroom Studies of Personalized Review


We incorporated systematic, temporally distributed review into Spanish foreign
language instruction at a Denver area middle school using an electronic flashcard
tutoring system. Each week of the semester, students engaged during class in three
20–30 minute sessions with the system, called COLT. COLT presented vocabulary
words and short sentences in English and required students to type the Spanish
translation, after which corrective feedback was provided. The first two sessions
of each week began with a study-to-proficiency phase for new material that was
introduced in that week’s lesson, and then proceeded to a phase during which
previously introduced material was reviewed. In the third session, these activities
were preceded by a quiz on the current lesson, which counted toward the course
grade.
We conducted two semester-long experiments with COLT, the first of which is
described in detail in Lindsey, Shroyer, Pashler, and Mozer (2014) and the second
of which appears only in the PhD thesis of Lindsey (2014). We summarize the
two experiments here.

Experiment 1
Experiment 1 involved 179 third-semester Spanish students, split over six class
periods. The semester covered ten lessons of material. COLT incorporated three
different schedulers to select material from these lessons for review. The goal of each
scheduler was to make selections that maximize long-term knowledge preservation
given the limited time available for review. The scheduler was varied within
participant by randomly assigning one third of a lesson’s items to each scheduler,
counterbalanced across participants. During review, the schedulers alternated in
selecting items for retrieval practice. Each scheduler selected from among the items
assigned to it, ensuring that all items had equal opportunity and that all schedulers
administered an equal number of review trials.
A massed scheduler selected material from the current lesson. It presented the
item in the current lesson that students had least recently studied. This scheduler
reflects recent educational practice: Prior to the introduction of COLT, alternative
software was used that allowed students to select the lesson they wished to study.
Not surprisingly, given a choice, students focused their effort on preparing for
the imminent end-of-lesson quiz, consistent with the preference for massed study
found by Cohen, Yan, Halamish, and Bjork (2013).
A generic-spaced scheduler selected one previous lesson to review at a spacing
deemed to be optimal for a range of students and a variety of material according
to both empirical studies (Cepeda et al., 2006, 2008) and computational models
(Khajah, Lindsey, & Mozer, 2013; Mozer et al., 2009). On the time frame
of a semester—where material must be retained for one to three months—a
one-week lag between initial study and review obtains near-peak performance for
a range of declarative materials. To achieve this lag, the generic-spaced scheduler
selected review items from the previous lesson, giving priority to the least recently
studied.
A personalized-spaced scheduler used our knowledge-state model, DASH, to
determine the specific item a particular student would most benefit from
reviewing. DASH infers the instantaneous memory strength of each item the
student has studied. Although a knowledge-state model is required to schedule
review optimally, optimal scheduling is computationally intractable because it
requires planning over all possible futures (when and how much a student studies,
including learning that takes place outside the context of COLT, and within
the context of COLT, whether or not retrieval attempts are successful, etc.).
Consequently, a heuristic policy is required for selecting review material. We
chose a threshold-based policy that prioritizes items whose recall probability is
closest to a threshold θ . This heuristic policy is justified by simulation studies
as being close to optimal under a variety of circumstances (Khajah et al.,
2013) and by Bjork’s (1994) notion of desirable difficulty, which suggests that
memory is best served by reviewing material as it is on the verge of being
forgotten.
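
In sketch form, the heuristic policy simply scores every candidate item by the model's current recall-probability estimate and picks the item closest to the threshold; the threshold value and the toy estimates below are illustrative placeholders.

def select_review_item(items, predict_recall, threshold=0.5):
    # Threshold-based policy: among candidate items, pick the one whose current
    # predicted recall probability is closest to `threshold`.  predict_recall
    # stands in for the DASH knowledge-state estimate; the threshold is a placeholder.
    return min(items, key=lambda item: abs(predict_recall(item) - threshold))

# Toy example with hypothetical model estimates for three vocabulary items.
estimates = {"gato": 0.92, "perro": 0.55, "caballo": 0.12}
print(select_review_item(estimates, predict_recall=estimates.get))   # -> "perro"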
As the semester progressed, COLT continually collected data and DASH was
retrained with the complete dataset at regular intervals. The retraining was
sufficiently quick and automatic that the model could use data from students in the
first class period of the day to improve predictions for students in the second class
period. This updating was particularly useful when new material was introduced
and DASH needed to estimate item difficulty. By the semester’s end, COLT had
amassed data from about 600,000 retrieval-practice trials.
To assess student retention, two proctored cumulative exams were administered,
one at the semester's end and one 28 days later, at the beginning of the following
semester. Each exam tested half of the course material, randomized for each student
and balanced across chapters and schedulers; no corrective feedback was provided.
On the first exam, the personalized spaced scheduler improved retention by 12.4
percent over the massed scheduler (t (169) = 10.1, p < 0.0001, Cohen’s d = 1.38)
and by 8.3 percent over the generic spaced scheduler (t (169) = 8.2, p < 0.0001,
d = 1.05) (Figure 3.6(a)). Over the 28-day intersemester break, the forgetting
rate was 18.1 percent, 17.1 percent, and 15.7 percent for the massed, generic,
and personalized conditions, respectively, leading to an even larger advantage for
personalized review. On the second exam, personalized review boosted retention
by 16.5 percent over massed review (t (175) = 11.1, p < 0.0001, d = 1.42) and
by 10.0 percent over generic review (t (175) = 6.59, p < 0.0001, d = 0.88). Note
that “massed” review is spaced by usual laboratory standards, being spread out over
at least seven days. This fact may explain the small benefit of generic spaced over
massed review.
In Lindsey et al. (2014), we showed that personalized review has its greatest
effect on the early lessons of the semester, which is sensible because that material
had the most opportunity for being manipulated via review. We also analyzed
parameters of DASH to show that its predictions depend roughly in equal part on
student abilities, item difficulties, and study history.
To evaluate the quality of DASH’s predictions, we compared DASH against
alternative models by dividing the retrieval-practice trials recorded over the
semester into 100 temporally contiguous disjoint sets, and the data for each set was
predicted given the preceding sets. The accumulative prediction error (Wagenmakers,
Grünwald, & Steyvers, 2006) was computed using the mean deviation between
the model’s predicted recall likelihood and the actual binary outcome, normalized
such that each student is weighted equally. Figure 3.6(b) compares DASH against
five alternatives: A baseline model that predicts a student’s future performance to be
the proportion of correct responses the student has made in the past, a Bayesian
form of IRT, Pavlik and Anderson’s (2005b) ACT-R model of spacing effects,
and the two variants of DASH we described earlier that incorporate alternative
representations of study history motivated by models of spacing effects. DASH
and its two variants perform better than the alternatives. The DASH models each
have two key components: (1) A dynamic representation of study history that can
characterize learning and forgetting, and (2) a collaborative filtering approach to
inferring latent difficulty and ability factors. Models that omit the first component
(baseline and IRT) or the second (baseline and ACT-R) do not fare as well. The
DASH variants all perform similarly. Because these variants differ only in the manner
in which the temporal distribution of study and recall outcomes is represented, this
distinction does not appear to be critical.
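
A sketch of the accumulative prediction error computation: trials are split into temporally contiguous blocks, each block is predicted from all preceding data, and absolute deviations between predicted recall probability and the binary outcome are averaged so that each student counts equally. The baseline stand-in mirrors the baseline model described above; any of the other models could be substituted.

import numpy as np

def accumulative_prediction_error(trials, fit_and_predict, n_blocks=100):
    # trials: time-ordered list of (student, ..., r) records.  fit_and_predict(history, block)
    # returns predicted recall probabilities for `block` given all preceding trials; it
    # stands in for DASH, IRT, ACT-R, or the baseline.  Students are weighted equally.
    blocks = np.array_split(np.arange(len(trials)), n_blocks)
    errors_by_student = {}
    for b in blocks[1:]:                          # the first block has no history to condition on
        history = trials[:b[0]]
        block = [trials[i] for i in b]
        preds = fit_and_predict(history, block)
        for trial, p in zip(block, preds):
            s, r = trial[0], trial[-1]
            errors_by_student.setdefault(s, []).append(abs(r - p))
    return np.mean([np.mean(errs) for errs in errors_by_student.values()])

# Stand-in "baseline" model: predict each student's proportion of past correct responses.
def baseline(history, block):
    preds = []
    for trial in block:
        past = [r for (*front, r) in history if front[0] == trial[0]]
        preds.append(np.mean(past) if past else 0.5)
    return preds

rng = np.random.default_rng(2)
toy = [(s, t, int(rng.random() < 0.6)) for t in range(50) for s in range(10)]
print(accumulative_prediction_error(toy, baseline, n_blocks=10))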
FIGURE 3.6 COLT experiment 1. (a) Mean scores on the two cumulative
end-of-semester exams, taken 28 days apart. All error bars indicate ±1 within-student
standard error (Masson & Loftus, 2003). (b) Accumulative prediction error of six
models using the data from the experiment. The models are as follows: A baseline
model that predicts performance from the proportion of correct responses made by
each student, a model based on item-response theory (IRT), a model based on Pavlik
& Anderson’s (2005a, 2005b) ACT-R model, DASH, and two variants of DASH that
adhere more strictly to the tenets of MCM and ACT-R. Error bars indicate ±1 SEM.

Experiment 2
Experiment 1 took place in the fall semester with third-semester Spanish students.
We conducted a follow-up experiment in the next (spring) semester with the
same students, then in their fourth semester of Spanish. (One student of the 179
in experiment 1 did not participate in experiment 2 because of a transfer.) The
semester was organized around eight lessons, followed by two cumulative exams
administered 28 days apart. The two cumulative exams each tested half the course
material, with a randomized split by student.
The key motivations for experiment 2 are as follows.
• In experiment 1, the personalized-review scheduler differed from the other
two schedulers both in its personalization and in its ability to select material
from early in the semester. Because personalized review and long-term review
were conflated, we wished to include a condition in experiment 2 that
involved long-term review but without personalization. We thus incorporated
a random scheduler that drew items uniformly from the set of items that
had been introduced in the course to date. Because the massed scheduler of
experiment 1 performed so poorly, we replaced it with the random scheduler.
• Because the same students participated in experiments 1 and 2, we had
the opportunity to initialize students’ models based on all the data from
experiment 1. The old data provided DASH with fairly strong evidence from
the beginning of the semester about individual student abilities and about the
relationship of study schedule to retention. Given that experiment 2 covered
only eight lessons, versus the ten in experiment 1, this bootstrapping helped
DASH to perform well out of the gate.
FIGURE 3.7 COLT experiment 2. Mean scores on the cumulative end-of-semester
exam. All error bars indicate ±1 within-student standard error (Masson & Loftus,
2003).

• Using the data from experiment 1, DASH[ACT-R] obtains a slightly lower
accumulative prediction error than DASH (Figure 3.6(b)). Consequently, we
substituted DASH[ACT-R] as the model used to select items for review in the
personalized condition.

Figure 3.7 summarizes the experiment outcome. The bars represent scores in
the three review conditions on the initial and delayed exams. The differences
among conditions are not as stark as we observed in experiment 1, in part because
we eliminated the weak massed condition and in part due to an unanticipated issue
which we address shortly. Nonetheless, on the first exam, the personalized-spaced
scheduler improved retention by 4.8 percent over the generic-spaced scheduler
(t (167) = 3.04, p < 0.01, Cohen’s d = 0.23) and by 3.4 percent over the random
scheduler (t (167) = 2.29, p = 0.02, d = 0.18). Between the two exams, the
forgetting rate is roughly the same in all conditions: 16.7 percent, 16.5 percent, and
16.5 percent for the generic, random, and personalized conditions, respectively.
On the second exam, personalized review boosted retention by 4.6 percent over
generic review (t (166) = 2.27, p = 0.024, d = 0.18) and by 3.1 percent over
random review, although this difference was not statistically reliable (t (166) = 1.64,
p = 0.10, d = 0.13).
At about the time we obtained these results, we discovered a significant problem
with the experimental software. Students did not like to review. In fact, at the
end of experiment 1, an informal survey indicated concern among students that
mandatory review interfered with their weekly quiz performance because they
were not able to spend all their time practicing the new lesson that was the subject
of their weekly quiz. Students wished to mass their study due to the incentive
structure of the course, and they requested a means of opting out of review. We
did not accede to their request; instead, the teacher explained the value of review to
long-term retention. Nonetheless, devious students found a way to avoid review:
Upon logging in, COLT began each session with material from the new lesson.
Students realized that if they regularly closed and reopened their browser windows,
they could avoid review. Word spread throughout the student population, and most
students took advantage of this unintended feature of COLT. The total number of
review trials performed in experiment 2 was a small fraction of the number of
review trials in experiment 1. Consequently, our failure to find large and reliable
differences among the schedulers is mostly due to the fact that students simply did
not review.
One solution might be to analyze the data from only those students who
engaged in a significant number of review trials during the semester. We opted
instead to use data from all students and to examine the relative benefit of the
different review schedulers as a function of the amount of review performed. The
amount of review is quantified as the total number of review trials performed by
a student divided by the total number of items, i.e. the mean number of review
trials. Note, however, that this statistic does not imply that each item was reviewed
the same number of times. For each student, we computed the difference of exam
scores between personalized and generic conditions, and between personalized and
random conditions. We performed a regression on these two measures given the
amount of review. Figure 3.8 shows the regression curves that represent the exam
score differences as a function of mean review trials per item. The regressions were
constrained to have an intercept at 0.0 because the conditions are identical when
no review is included. The data points plotted in Figure 3.8 are averages based
on groups of about ten students who performed similar amounts of review. These
groupings make it easier to interpret the scatterplot, but the raw data were used for
the regression analysis.
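
The regression itself is an ordinary least-squares fit with the intercept constrained to zero, as in the sketch below; the data points shown are made-up placeholders, not the values from Figure 3.8.

import numpy as np

def slope_through_origin(x, y):
    # Least-squares slope for y = b * x with the intercept fixed at 0: b = sum(x*y) / sum(x*x).
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    return np.sum(x * y) / np.sum(x * x)

# Made-up illustration: per-student review amount vs. personalized-minus-generic score difference.
review_per_item = [0.5, 1.0, 2.0, 3.5, 5.0, 8.0]
score_difference = [0.3, 0.8, 1.5, 2.9, 4.4, 6.5]
b = slope_through_origin(review_per_item, score_difference)
print(b, 13 * b)   # predicted gain if each item had received 13 additional review trials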
Figure 3.8 shows a positive slope for all four regression lines (all reliable
by t tests with p < 0.01), indicating that with more time devoted to review,
the personalized-review scheduler increasingly outperforms the random and
generic-review schedulers. If, for example, students had studied on COLT for an
average of one more review trial per item for each of the 13 weeks in the semester
leading up to exam 1, Figure 3.8 predicts an (absolute) improvement on exam 1
scores of 10.2 percent with personalized-spaced review over generic-spaced review
and 7.2 percent with personalized-spaced review over random review. We wish to
emphasize that we are not simply describing an advantage of review over no review.
Our result suggests that students will score a letter grade higher (7–10 points out
of 100) with time-matched personalized review over the other forms of review.

FIGURE 3.8 COLT experiment 2. Scatterplot for exams 1 and 2 ((a) and (b),
respectively) showing the advantage of personalized-spaced review over random and
generic-spaced review, as a function of the amount of review that a student performed.
The amount of review is summarized in terms of the total number of review trials
during the semester divided by the number of items. Long-dash regression line
indicates the benefit of personalized over random review; short-dash line indicates
the benefit of personalized over generic review.

In contrast to experiment 1, the effect of personalized review was not amplified
on exam 2 relative to exam 1. Our explanation is that in experiment 1, the 28-day
retention interval between exams 1 and 2 was the holiday break, a time during
which students were unlikely to have much contact with course material. In
experiment 2, the intervening 28-day period occurred during the semester when
students were still in class but spent their time on enrichment activities that were
not part of our experiment (e.g. class projects, a trip to the zoo). Consequently,
students had significant exposure to course content, and this exposure could only
have served to inject noise in our assessment exam.

Discussion
Whereas previous studies offer in-principle evidence that human learning can
be improved by the inclusion and timing of review, our results demonstrate in
practice that integrating personalized-review software into the classroom yields
appreciable improvements in long-term educational outcomes. Our experiment
goes beyond past efforts in its scope: It spans the time frame of a semester, covers
the content of an entire course, and introduces material in a staggered fashion
and in coordination with other course activities. We find it remarkable that the
review manipulation had as large an effect as it did, considering that the duration
of roughly 30 minutes a week was only about 10 percent of the time students were
engaged with the course. The additional, uncontrolled exposure to material from
classroom instruction, homework, and the textbook might well have washed out
the effect of the experimental manipulation. Our experiments go beyond showing
that spaced practice is superior to massed practice: Taken together, experiments 1
and 2 provide strong evidence that personalization of review is superior to other
forms of spaced practice.
Although the outcome of experiment 2 was less impressive than the outcome
of experiment 1, the mere fact that students went out of their way to avoid a
review activity that would promote long-term retention indicates the great need
for encouraging review of previously learned material. One can hardly fault the
students for wishing to avoid an activity they intuited to be detrimental to their
grades. The solution is to better align the students’ goals with the goal of long-term
learning. One method of alignment is to administer only cumulative quizzes. In
principle, there’s no reason to distinguish the quizzes from the retrieval practice
that students perform using COLT, achieving the sort of integration of testing and
learning that educators often seek.

Conclusions
Theory-driven approaches in psychology and cognitive science excel at charac-
terizing the laws and mechanisms of human cognition. Data-driven approaches
from machine learning excel at inferring statistical regularities that describe how
individuals vary within a population. In this chapter, we have argued that in the
domain of learning and memory, a synthesis of theory and data-driven approaches
inherits the strengths of each. Theory-driven approaches characterize the temporal
dynamics of learning and forgetting based on study history and past performance.
Data-driven approaches use data from a population of students learning a collection
of items to make inferences concerning the knowledge state of individual students
for specific items.
The models described in this chapter offer more than qualitative guidance to
students about how to study. In one respect, they go beyond what even a skilled
classroom teacher can offer: They are able to keep track of student knowledge state
at a granularity that is impossible for a teacher who encounters hundreds of students
over the course of a day. A system such as COLT provides an efficient housekeeping
function to ensure that knowledge, once mastered, remains accessible and a part of
each student’s core competency. COLT allows educators to do what they do best:
to motivate and encourage; to help students to acquire facts, concepts, and skills;
and to offer creative tutoring to those who face difficulty. To achieve this sort of
complementarity between electronic tools and educators, a Big Data approach is
essential.
Appendix: Simulation Methodology for Hybrid Forgetting Model
Each of the five forgetting models was cast in a hierarchical Bayesian generative
framework, as specified in Table 3.2. We employed Markov chain Monte Carlo
to draw samples from the posterior, specifically Metropolis-within-Gibbs (Patz
& Junker, 1999), an extension of Gibbs sampling wherein each draw from
the model's full conditional distribution is performed by a single
Metropolis-Hastings step.
Inference on the two sets of latent traits in the HYBRID BOTH model—$\{a_s\}$
and $\{d_i\}$ from IRT, $\{\tilde{a}_s\}$ and $\{\tilde{d}_i\}$ from HYBRID DECAY—is done jointly, leading
to possibly a different outcome than the one that we would obtain by first fitting
IRT and then inferring the decay-rate-determining parameters. In essence, the
HYBRID BOTH model allows the corrupting influence of time to be removed
from the IRT variables, and allows the corrupting influence of static factors to be
removed from the forgetting-related variables.
The hierarchical Bayesian models impose weak priors on the parameters. Each
model assumes that latent traits are normally distributed with mean zero and an
unknown precision parameter shared across the population of items or students.
The precision parameters are all given Gamma priors. Through Normal-Gamma
conjugacy, we can analytically marginalize them before sampling. Each latent

TABLE 3.2 Distributional assumptions of the generative Bayesian response models. The
HYBRID BOTH model shares the same distributional assumptions as the HYBRID DECAY and
HYBRID SCALE models.

IRT:
  $r_{si} \mid a_s, d_i \sim \mathrm{Bernoulli}(p_{si})$
  $p_{si} = (1 + \exp(d_i - a_s))^{-1}$
  $a_s \mid \tau_a \sim \mathrm{Normal}(0, \tau_a^{-1})$
  $d_i \mid \tau_d \sim \mathrm{Normal}(0, \tau_d^{-1})$
  $\tau_a \sim \mathrm{Gamma}(\psi_{a1}, \psi_{a2})$
  $\tau_d \sim \mathrm{Gamma}(\psi_{d1}, \psi_{d2})$

HYBRID DECAY:
  $r_{si} \mid \tilde{a}_s, \tilde{d}_i, m, h, t_{si} \sim \mathrm{Bernoulli}(m\,\tilde{p}_{si})$
  $\tilde{p}_{si} = (1 + h t_{si})^{-\exp(\tilde{a}_s - \tilde{d}_i)}$
  $\tilde{a}_s \mid \tau_{\tilde{a}} \sim \mathrm{Normal}(0, \tau_{\tilde{a}}^{-1})$
  $\tilde{d}_i \mid \tau_{\tilde{d}} \sim \mathrm{Normal}(0, \tau_{\tilde{d}}^{-1})$
  $\tau_{\tilde{a}} \sim \mathrm{Gamma}(\psi_{\tilde{a}1}, \psi_{\tilde{a}2})$
  $\tau_{\tilde{d}} \sim \mathrm{Gamma}(\psi_{\tilde{d}1}, \psi_{\tilde{d}2})$
  $h \sim \mathrm{Gamma}(\psi_{h1}, \psi_{h2})$
  $m \sim \mathrm{Beta}(\psi_{m1}, \psi_{m2})$

HYBRID SCALE:
  $r_{si} \mid a_s, d_i, f, h, t_{si} \sim \mathrm{Bernoulli}(p_{si}\,\tilde{p}_{si})$
  $\tilde{p}_{si} = (1 + h t_{si})^{-f}$
  $f \sim \mathrm{Gamma}(\psi_{f1}, \psi_{f2})$
  All other parameters are the same as in IRT and HYBRID DECAY.
trait's conditional distribution thus has the form of a likelihood (defined in
Equations 2–5) multiplied by the probability density function of a non-standardized
Student's t-distribution. For example, the ability parameter in the HYBRID SCALE
model is drawn via a Metropolis-Hastings step from the distribution
$$p(a_s \mid a_{\neg s}, d, h, m, R) \propto \prod_i P(r_{si} \mid a_s, d_i, h, m) \times \left(1 + \frac{a_s^2}{2\left(\psi_2 + \frac{1}{2}\sum_{j \neq s} a_j^2\right)}\right)^{-\left(\psi_1 + \frac{n_S}{2} - 1\right)}, \qquad (10)$$

where the first term is given by Equations 3 and 5. The effect of the
marginalization of the precision parameters is to tie the traits of different students
together so that they are no longer conditionally independent.
Hyperparameters ψ of the Bayesian models were set so that all the Gamma
distributions had shape parameter 1 and scale parameter 0.1. For each run of each
model, we combined predictions from across three Markov chains, each with a
random starting location. Each chain was run for a burn in of 1,000 iterations and
then 2,000 more iterations were recorded. To reduce autocorrelation among the
samples, we thinned them by keeping every tenth one.
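
A skeletal version of the sampler described above: each parameter is visited in turn and updated with a single Metropolis-Hastings step whose target is that parameter's full conditional. The proposal scale and the toy target below are illustrative; in the actual models the log conditionals would be built from the likelihood and the marginalized t-distribution priors.

import numpy as np

def metropolis_within_gibbs(log_conditionals, init, n_burn=1000, n_keep=2000, thin=10,
                            prop_sd=0.5, seed=0):
    # log_conditionals[j](theta) returns the log of parameter j's full conditional
    # (up to a constant) evaluated at the full parameter vector theta.  Each sweep
    # updates every parameter with one random-walk Metropolis-Hastings step.
    rng = np.random.default_rng(seed)
    theta = np.array(init, float)
    samples = []
    for it in range(n_burn + n_keep):
        for j, log_cond in enumerate(log_conditionals):
            proposal = theta.copy()
            proposal[j] += rng.normal(0.0, prop_sd)
            if np.log(rng.random()) < log_cond(proposal) - log_cond(theta):
                theta = proposal
        if it >= n_burn and (it - n_burn) % thin == 0:   # record thinned post-burn-in draws
            samples.append(theta.copy())
    return np.array(samples)

# Toy target: two parameters whose "full conditionals" are independent standard normals.
log_conds = [lambda th: -0.5 * th[0] ** 2, lambda th: -0.5 * th[1] ** 2]
draws = metropolis_within_gibbs(log_conds, init=[0.0, 0.0], n_burn=200, n_keep=400, thin=5)
print(draws.mean(axis=0), draws.std(axis=0))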
Why did we choose to fit models with hierarchical Bayesian (HB) inference
instead of the more standard maximum likelihood (ML) estimation? The difference
between HB and ML is that HB imposes an additional bias that, in the absence
of strong evidence about a parameter value—say, a student’s ability or an item’s
difficulty—the parameter should be typical of those for other students or other
items. ML does not incorporate this prior belief, and as a result, it is more
susceptible to overfitting a training set. For this reason, we were not surprised
when we tried training models with ML and found they did not perform as well
as with HB.

Acknowledgments
The research was supported by NSF grants SBE-0542013, SMA-1041755, and
SES-1461535 and an NSF Graduate Research Fellowship to R. Lindsey. We thank
Jeff Shroyer for his support in conducting the classroom studies, and Melody
Wisehart and Harold Pashler for providing raw data from their published work
and for their generous guidance in interpreting the spacing literature.

Notes
1 Note that making predictions for new items or new students is principled
within the hierarchical Bayesian modeling framework. From training data, the
models infer not only student or item-specific parameters, but also hyperparameters
that characterize the population distributions. These population
distributions are used to make predictions for new items and new students.
2 In contrast to the present results, Ridgeway et al. (2016) found no improvement
with the HYBRID BOTH over the HYBRID SCALE model.
3 The counts $c_{siw}$ and $n_{siw}$ are regularized by add-one smoothing, which ensures
that the logarithm terms are finite.

References
Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., & Qin, Y. (2004).
An integrated theory of the mind. Psychological Review, 111, 1036–1060.
Anderson, J. R., Conrad, F. G., & Corbett, A. T. (1989). Skill acquisition and the LISP
tutor. Cognitive Science, 13, 467–506.
Andrade, D. F., & Tavares, H. R. (2005). Item response theory for longitudinal data:
Population parameter estimation. Journal of Multivariate Analysis, 95, 1–22.
Benjamin, A. S., & Tullis, J. (2010). What makes distributed practice effective? Cognitive
Psychology, 61, 228–247.
Bjork, R. (1994). Memory and metamemory considerations in the training of human
beings. In J. Metcalfe & A. Shimamura (Eds.), Metacognition: Knowing about knowing
(pp. 185–205). Cambridge, MA: MIT Press.
Carpenter, S. K., Cepeda, N. J., Rohrer, D., Kang, S. H. K., & Pashler, H. (2012). Using
spacing to enhance diverse forms of learning: Review of recent research and implications
for instruction. Educational Psychology Review, 24, 369–378.
Carpenter, S. K., Pashler, H., & Cepeda, N. (2009). Using tests to enhance 8th grade
students’ retention of U. S. history facts. Applied Cognitive Psychology, 23, 760–771.
Cen, H., Koedinger, K., & Junker, B. (2006). Learning factors analysis—a general method
for cognitive model evaluation and improvement. In Proceedings of the Eighth International
Conference on Intelligent Tutoring Systems.
Cen, H., Koedinger, K., & Junker, B. (2008). Comparing two IRT models for conjunctive
skills. In Woolf, B., Aimeur, E., Njambou, R, & Lajoie, S. (Eds.), Proceedings of the Ninth
International Conference on Intelligent Tutoring Systems.
Cepeda, N. J., Coburn, N., Rohrer, D., Wixted, J. T., Mozer, M. C., & Pashler, H. (2009).
Optimizing distributed practice: Theoretical analysis and practical implications. Journal of
Experimental Psychology, 56, 236–246.
Cepeda, N. J., Pashler, H., Vul, E., & Wixted, J. T. (2006). Distributed practice in verbal
recall tasks: A review and quantitative synthesis. Psychological Bulletin and Review, 132,
364–380.
Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed
practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin,
132, 354–380.
Cepeda, N. J., Vul, E., Rohrer, D., Wixted, J. T., & Pashler, H. (2008). Spacing effects in
learning: A temporal ridgeline of optimal retention. Psychological Science, 19, 1095–1102.
Cohen, M. S., Yan, V. X., Halamish, V., & Bjork, R. A. (2013). Do students think
that difficult or valuable materials should be restudied sooner rather than later? Journal of
Experimental Psychology: Learning, Memory, and Cognition, 39(6), 1682–1696.
Custers, E. (2010). Long-term retention of basic science knowledge: A review study.
Advances in Health Science Education: Theory & Practice, 15(1), 109–128.
Custers, E., & ten Cate, O. (2011). Very long-term retention of basic science knowledge
in doctors after graduation. Medical Education, 45(4), 422–430.
DeBoek, P., & Wilson, M. (Eds.). (2004). Explanatory item response models: A generalized
linear and nonlinear approach. New York: Springer.
Dunlosky, J., Rawson, K., Marsh, E., Nathan, M., & Willingham, D. (2013). Improving
students’ learning with effective learning techniques: Promising directions from cognitive
and educational psychology. Psychological Science in the Public Interest, 14(1), 4–58.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York:
Wiley.
Grimaldi, P. J., Pyc, M. A., & Rawson, K. A. (2010). Normative multitrial recall
performance, metacognitive judgments and retrieval latencies for Lithuanian-English
paired associates. Behavioral Research Methods, 42, 634–642.
Kang, S. H. K., Lindsey, R. V., Mozer, M. C., & Pashler, H. (2014). Retrieval practice
over the long term: Should spacing be expanding or equal-interval? Psychonomic
Bulletin & Review, 21, 1544–1550.
Khajah, M., Lindsey, R. V., & Mozer, M. C. (2013). Maximizing students’ retention via
spaced review: Practical guidance from computational models of memory. In Proceedings
of the 35th Annual Conference of the Cognitive Science Society.
Koedinger, K. R., & Corbett, A. T. (2006). Cognitive tutors: Technology bringing learning
science to the classroom. In K. Sawyer (Ed.), The Cambridge handbook of the learning sciences
(pp. 61–78). Cambridge, UK: Cambridge University Press.
Kording, K. P., Tenenbaum, J. B., & Shadmehr, R. (2007). The dynamics of memory as a
consequence of optimal adaptation to a changing body. Nature Neuroscience, 10, 779–786.
Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for
recommender systems. IEEE Computer, 42, 42–49.
Lan, A. S., Studer, C., & Baraniuk, R. G. (2014). Time-varying learning and content
analytics via sparse factor analysis. In ACM SIGKDD conf. on knowledge disc. and data
mining. Retrieved from http://arxiv.org/abs/1312.5734.
Leitner, S. (1972). So lernt man lernen. Angewandte Lernpsychologie – ein Weg zum Erfolg
[Learning how to learn: Applied learning psychology – a path to success]. Freiburg im
Breisgau: Verlag Herder.
Lindsey, R. V. (2014). Probabilistic models of student learning and forgetting (Unpublished
doctoral dissertation). Computer Science Department, University of Colorado
at Boulder, USA.
Lindsey, R. V., Lewis, O., Pashler, H., & Mozer, M. C. (2010). Predicting students’
retention of facts from feedback during training. In S. Ohlsson & R. Catrambone (Eds.),
Proceedings of the 32nd Annual Conference of the Cognitive Science Society (pp. 2332–2337).
Austin, TX: Cognitive Science Society.
Lindsey, R. V., Shroyer, J. D., Pashler, H., & Mozer, M. C. (2014). Improving students’
long-term knowledge retention through personalized review. Psychological Science, 25,
639–647.
Martin, J., & VanLehn, K. (1995). Student assessment using Bayesian nets. International
Journal of Human-Computer Studies, 42, 575–591.
Masson, M., & Loftus, G. (2003). Using confidence intervals for graphically based data
interpretation. Canadian Journal of Experimental Psychology, 57, 203–220.
Metcalfe, J., & Finn, B. (2011). People’s hypercorrection of high confidence errors: Did
they know it all along? Journal of Experimental Psychology: Learning, Memory, and Cognition,
37, 437–448.
Mettler, E., & Kellman, P. J. (2014). Adaptive response-time-based category sequencing in
perceptual learning. Vision Research, 99, 111–123.
Mettler, E., Massey, C., & Kellman, P. J. (2011). Improving adaptive learning technology
through the use of response times. In L. Carlson, C. Holscher, & T. Shipley (Eds.),
Proceedings of the 33rd Annual Conference of the Cognitive Science Society (pp. 2532–2537).
Austin, TX: Cognitive Science Society.
Mozer, M. C., Pashler, H., Cepeda, N., Lindsey, R. V., & Vul, E. (2009). Predicting
the optimal spacing of study: A multiscale context model of memory. In Y. Bengio,
D. Schuurmans, J. Lafferty, C. Williams, & A. Culotta (Eds.), Advances in Neural
Information Processing Systems (Vol. 22, pp. 1321–1329). Boston, MA: MIT Press.
Mullany, A. (2013). A Q&A with Salman Khan. Retrieved December 23, 2014, from
http://live.fastcompany.com/Event/A_QA_With_Salman_Khan.
Patz, R. J., & Junker, B. W. (1999). A straightforward approach to Markov chain Monte
Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24,
146–178.
Pavlik, P. I. (2007). Understanding and applying the dynamics of test practice and study
practice. Instructional Science, 35, 407–441.
Pavlik, P. I., & Anderson, J. R. (2005a). Practice and forgetting effects on vocabulary
memory: An activation-based model of the spacing effect. Cognitive Science, 29(4),
559–586.
Pavlik, P. I., & Anderson, J. (2005b). Practice and forgetting effects on vocabulary memory:
An activation-based model of the spacing effect. Cognitive Science, 29, 559–586.
Pavlik, P. I., & Anderson, J. R. (2008). Using a model to compute the optimal schedule of
practice. Journal of Experimental Psychology: Applied, 14, 101–117.
Pavlik, P. I., Cen, H., & Koedinger, K. (2009). Performance factors analysis—a new
alternative to knowledge tracing. In V. Dimitrova & R. Mizoguchi (Eds.), Proceeding of
the Fourteenth International Conference on Artificial Intelligence in Education. Brighton, UK.
Raaijmakers, J. G. W. (2003). Spacing and repetition effects in human memory: Application
of the SAM model. Cognitive Science, 27, 431–452.
Rickard, T., Lau, J., & Pashler, H. (2008). Spacing and the transition from calculation to
retrieval. Psychonomic Bulletin & Review, 15, 656–661.
Ridgeway, K., Mozer, M. C., & Bowles, A. (2016). Forgetting of foreign-language skills:
A corpus-based analysis of online tutoring software. Cognitive Science Journal. (Accepted
for publication).
Rohrer, D., & Taylor, K. (2006). The effects of overlearning and distributed practice on
the retention of mathematics knowledge. Applied Cognitive Psychology, 20, 1209–1224.
Roussos, L. A., Templin, J. L., & Henson, R. A. (2007). Skills diagnosis using IRT-based
latent class models. Journal of Educational Measurement, 44, 293–311.
Seabrook, R., Brown, G., & Solity, J. (2005). Distributed and massed practice: From
laboratory to classroom. Applied Cognitive Psychology, 19, 107–122.
Sobel, H., Cepeda, N., & Kapler, I. (2011). Spacing effects in real-world classroom
vocabulary learning. Applied Cognitive Psychology, 25, 763–767.
64 M. C. Mozer and R. V. Lindsey

Sohl-Dickstein, J. (2013). Personalized learning and temporal modeling at Khan Academy.


Retrieved from http://lytics.stanford.edu/datadriveneducation/slides/sohldickstein.
pdf.
Staddon, J. E. R., Chelaru, I. M., & Higa, J. J. (2002). Habituation, memory and the brain:
The dynamics of interval timing. Behavioural Processes, 57, 71–88.
van Lehn, K., Jordan, P., & Litman, D. (2007). Developing pedagogically effective tutorial
dialogue tactics: Experiments and a testbed. In Proceedings of the SLaTE Workshop on
Speech and Language (pp. 17–20).
Wagenmakers, E.-J., Grünwald, P., & Steyvers, M. (2006). Accumulative prediction error
and the selection of time series models. Journal of Mathematical Psychology, 50, 149–166.
Wixted, J. T. (2004). The psychology and neuroscience of forgetting. Annual Review of
Psychology, 55, 235–269.
Wixted, J. T., & Carpenter, S. K. (2007). The Wickelgren power law and the Ebbinghaus
savings function. Psychological Science, 18, 133–134.
Woźniak, P. (1990). Optimization of learning (Unpublished master’s thesis). Poznan University
of Technology, Poznan, Poland.
4
TRACTABLE BAYESIAN TEACHING
Baxter S. Eaves Jr.,
April M. Schweinhart,
and Patrick Shafto

Abstract
The goal of cognitive science is to understand human cognition in the real world. However,
Bayesian theories of cognition are often unable to account for anything beyond the
schematic situations whose simplicity is typical only of experiments in psychology labs. For
example, teaching to others is commonplace, but under recent Bayesian accounts of human
social learning, teaching is, in all but the simplest of scenarios, intractable because teaching
requires considering all choices of data and how each choice of data will affect learners’
inferences about each possible hypothesis. In practice, teaching often involves computing
quantities that are either combinatorially intractable or that have no closed-form solution.
In this chapter we integrate recent advances in Markov chain Monte Carlo approximation
with recent computational work in teaching to develop a framework for tractable Bayesian
teaching of arbitrary probabilistic models. We demonstrate the framework on two complex
scenarios inspired by perceptual category learning: phonetic category models and visual scene categorization. In both cases, we find that the predicted teaching data exhibit
surprising behavior. In order to convey the number of categories, the data for teaching
phonetic category models exhibit hypo-articulation and increased within-category variance.
And in order to represent the range of scene categories, the optimal examples for teaching
visual scenes are distant from the category means. This work offers the potential to scale
computational models of teaching to situations that begin to approximate the richness of
people’s experience.

Pedagogy is arguably humankind’s greatest adaptation and perhaps the reason for
our success as a species (Gergely, Egyed, & Kiraly, 2007). Teachers produce data to
efficiently convey specific information to learners and learners learn with this in
mind (Shafto and Goodman, 2008; Shafto, Goodman, & Frank, 2012; Shafto,
Goodman, & Griffiths, 2014). This choice not only ensures that information
lives on after its discoverer, but also ensures that information is disseminated
quickly and effectively. Shafto and Goodman (2008) introduced a Bayesian model
of pedagogical data selection and learning, and used a simple teaching game to
demonstrate that human teachers choose data consistently with the model and that
human learners make stronger inferences from pedagogically sampled data than
from randomly sampled data (data generated according to the true distribution).
Subsequent work, using the same model, demonstrated that preschoolers learn
differently from pedagogically selected data (Bonawitz et al., 2011).
Under the model, a teacher, T , chooses data, x ∗ , to induce a specific belief
(hypothesis, θ ∗ ) in the learner, L. Mathematically, this means choosing data with
probability in proportion with the induced posterior probability of the target
hypothesis,
$$p_T(x^* \mid \theta^*) = \frac{p_L(\theta^* \mid x^*)}{\int_x p_L(\theta^* \mid x)\,dx} \tag{1}$$

$$= \frac{\dfrac{p_L(x^* \mid \theta^*)\,p_L(\theta^*)}{p_L(x^*)}}{\displaystyle\int_x \frac{p_L(x \mid \theta^*)\,p_L(\theta^*)}{p_L(x)}\,dx} \tag{2}$$

$$\propto \frac{p_L(x^* \mid \theta^*)\,p_L(\theta^*)}{p_L(x^*)}. \tag{3}$$
Thus Bayesian teaching includes Bayesian learning as a sub-problem—because it
requires considering all possible inferences given all possible data. At the outer
layer, the teacher considers (integrates; marginalizes) over all possible alternative data choices, $\int_x p_L(\theta^* \mid x)\,dx$; at the inner level, the learner considers all alternative
hypotheses in the marginal likelihood, p L (x ∗ ). The teacher considers how each
possible dataset will affect learning of the target hypothesis and the learner considers
how well the data chosen by the teacher communicate each possible hypothesis.
Pedagogy works because learners and teachers have an implicit understanding of
each other’s behavior. A learner can quickly dismiss many alternatives using the
reasoning that had the teacher meant to convey one of those alternatives, she
would have chosen data differently. The teacher chooses data with this in mind.
Computationally, Bayesian teaching is a complex problem. Producing data that
lead a learner to a specific inference about the world requires the teacher to make
choices between different data. Choosing requires weighing one choice against all
others, which requires computing large, often intractable sums or integrals (Luce,
1977). The complexity of the teacher’s marginalization over alternative data can,
to some extent, be mitigated by standard approximation methods (e.g. Metropolis,
Rosenbluth, Rosenbluth, Teller, & Teller, 1953; Geman & Geman, 1984), but for
teaching, this is not enough. For each choice of dataset, the teacher must consider
how those data will cause the learner to weigh the target hypothesis against all other
hypotheses. As we shall see, this inner marginalization is not one that we can easily
make go away. And as the hypothesis becomes more complex, the marginalization
becomes more complex; often, as is the case in categorization, the size of the
set of alternative hypotheses increases ever faster as the number of data increases.
For example, if a category learner does not know the number of categories, he
or she must assume there can be as few as one category or as many categories
as there are data. Learning complex concepts that are reminiscent of real-world
scenarios often introduces marginalizations that have no closed form solution or
are combinatorially intractable. Because of this, existing work that models teaching
has done so using necessarily simple, typically discrete, hypothesis spaces.
A Bayesian method of eliciting a specific inference in learners has applications
beyond furthering our understanding of social learning, to education, perception,
and machine learning; thus it is in our interest to make Bayesian teaching tractable.
It is our goal in this chapter to leverage approximation methods that allow us to
scale beyond the simple scenarios used in previous research. We employ recent
advances in Monte Carlo approximation to facilitate tractable Bayesian teaching.
We proceed as follows: In the first section we discuss some of the sources of
complexity that arise in Bayesian statistics, such as marginalized probabilities, and
discuss standard methods of combating complexity. In the second section we
briefly discuss new methods from the Bayesian Big Data literature, paying special
attention to one particular method, pseudo-marginal sampling, which affords the same theoretical guarantees as standard Monte Carlo approximation methods while
mitigating the effects of model complexity through further approximation, and
outline the procedure for simulating teaching data. In the third section we apply
the teaching model to the debate within developmental psychology of whether
infant-directed speech is for teaching, which amounts to teaching category models.
Lastly, in the fourth section we apply the teaching model to a more complex
problem of teaching natural scene categories, which we model as categories of
category models. We conclude with a brief recapitulation and meditation on future
directions.

Complexity in Bayesian Statistics


Complexity is the nemesis of the Bayesian modeler. It is a problem from the outset.
Bayes’ theorem states that the posterior probability, π(θ ′ |x) of a hypothesis, θ ′ , given
some data, x is equal to the likelihood, f (x|θ ′ ), of the data given the hypothesis
multiplied by the prior probability, π(θ ′ ), of the hypothesis divided by the marginal
likelihood, m(x), of the data:

$$\pi(\theta' \mid x) = \frac{f(x \mid \theta')\,\pi(\theta')}{m(x)} \tag{4}$$

$$= \frac{f(x \mid \theta')\,\pi(\theta')}{\sum_{\theta \in \Theta} f(x \mid \theta)\,\pi(\theta)}. \tag{5}$$
The immediate problem is this marginal likelihood (lurking menacingly below the likelihood and prior). Often, the sum, or integral, over Θ involves
computing a large, or infinite, number of terms or has no closed-form solution,
rendering it analytically intractable; thus much of the focus of Bayesian statistical
research involves approximating Bayesian inference by either approximating certain
intractable quantities or avoiding their calculation altogether.

Importance Sampling
Importance sampling (IS) is a Monte Carlo method used to approximate
integrals that are analytically intractable or not suitable for quadrature (numerical
integration).1 IS involves re-framing the integral of a function p with respect to θ as an expectation, under a proposal distribution q, of the importance function w(·) = p(·)/q(·), where q(·) > 0 whenever p(·) > 0. One draws a number, M, of independent samples θ̄_1, . . . , θ̄_M from q and takes the arithmetic average of w(θ̄_1), . . . , w(θ̄_M).
By the law of large numbers,
$$\lim_{M \to \infty} \frac{1}{M} \sum_{i=1}^{M} w(\bar\theta_i) = \int_\theta p(\theta)\,d\theta, \tag{6}$$

as M → ∞, the average approaches the true value of the target expectation, which
means that IS produces an unbiased estimate (the expected value of the estimate is
the true value).
If we wish to estimate m(x), we set w(θ) = f (x | θ)π(θ)/q(θ),
$$m(x) = \int_\theta f(x \mid \theta)\,\pi(\theta)\,d\theta = \int_\theta \left[\frac{f(x \mid \theta)\,\pi(\theta)}{q(\theta)}\right] q(\theta)\,d\theta = E_q\!\left[\frac{f(x \mid \theta)\,\pi(\theta)}{q(\theta)}\right] \approx \frac{1}{M} \sum_{i=1}^{M} \frac{f(x \mid \bar\theta_i)\,\pi(\bar\theta_i)}{q(\bar\theta_i)}. \tag{7}$$
When we approximate the integral with the sum, we no longer consider the
differential, dθ, but consider only individual realizations, θ̄, drawn from q. As
we shall see in the third section, the choice of q influences the efficiency of IS. A
straightforward, although usually inefficient, choice is to draw θ̄ from the prior,
q(θ) = π(θ), in which case,
$$m(x) \approx \frac{1}{M} \sum_{i=1}^{M} f(x \mid \bar\theta_i). \tag{8}$$
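As a concrete illustration of Equations 7 and 8, the following minimal sketch (ours, not code from the chapter) estimates a marginal likelihood by averaging the likelihood over prior draws; the toy model (Normal data with a Normal prior on the mean) and all names are illustrative assumptions.

```python
import numpy as np

def prior_is_marginal_likelihood(x, log_likelihood, prior_sampler, M=10_000, rng=None):
    """Estimate m(x) = E_prior[ f(x | theta) ] by averaging the likelihood over
    M draws from the prior (Equation 8), working in log space for stability."""
    rng = np.random.default_rng(rng)
    thetas = prior_sampler(M, rng)                       # theta_1, ..., theta_M ~ pi(theta)
    log_f = np.array([log_likelihood(x, t) for t in thetas])
    return np.logaddexp.reduce(log_f) - np.log(M)        # log( (1/M) sum_i f(x | theta_i) )

# Toy model (illustrative): x_i ~ N(mu, 1) with prior mu ~ N(0, 1).
def log_likelihood(x, mu):
    return -0.5 * np.sum((x - mu) ** 2) - 0.5 * len(x) * np.log(2 * np.pi)

def prior_sampler(M, rng):
    return rng.normal(0.0, 1.0, size=M)

x = np.array([0.4, -0.3, 1.1])
print(prior_is_marginal_likelihood(x, log_likelihood, prior_sampler, M=50_000, rng=1))
```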

The Metropolis-Hastings Algorithm


If we do not explicitly need the quantity m(x), we can avoid calculating it
altogether using the Metropolis-Hastings algorithm (MH; Metropolis, Rosenbluth,
Tractable Bayesian Teaching 69

Rosenbluth, Teller, & Teller, 1953). MH is a Markov chain Monte Carlo (MCMC) algorithm that is used to construct a Markov chain with p(y) as its stationary distribution. This means that, in the limit of state transitions, y will occur in the induced Markov chain with probability p(y). MH requires a function g that is proportional to p, g(y) = Z^{-1} p(y), and a proposal function q(y → y′) that proposes moves to new states, y′, from the current state, y. MH works by repeatedly proposing samples from q and accepting samples (setting y := y′) with probability min[1, A], where

$$A := \frac{g(y')\,q(y' \to y)}{g(y)\,q(y \to y')}. \tag{9}$$
If q is symmetric, that is q(y → y ′ ) = q(y ′ → y) for all y, y ′ , then q cancels from
the equation. For example, if y ∈ R, then proposing y ′ from a normal distribution
centered at y, q(y → y ′ ) := N (y, σ ), is a symmetric proposal density.
To sample from the posterior distribution, set g = f (x | θ)π(θ) and notice that
m(x) is a constant,
$$A := \frac{\pi(\theta' \mid x)}{\pi(\theta \mid x)} \tag{10}$$

$$= \frac{f(x \mid \theta')\,\pi(\theta')\,m(x)}{f(x \mid \theta)\,\pi(\theta)\,m(x)} \tag{11}$$

$$= \frac{f(x \mid \theta')\,\pi(\theta')}{f(x \mid \theta)\,\pi(\theta)}. \tag{12}$$
Thus, to draw posterior samples using MH, one need only evaluate the likelihood and the prior.
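For concreteness, here is a minimal sketch (ours) of MH with a symmetric Gaussian proposal applied to the same toy model as above; only the log-likelihood and log-prior are evaluated, and m(x) never appears.

```python
import numpy as np

def metropolis_hastings(log_likelihood, log_prior, theta0, n_samples,
                        prop_sd=0.5, rng=None):
    """Draw posterior samples using the ratio in Equation 12. The proposal
    q(theta -> theta') = N(theta, prop_sd) is symmetric, so it cancels."""
    rng = np.random.default_rng(rng)
    theta = theta0
    log_post = log_likelihood(theta) + log_prior(theta)   # unnormalized log posterior
    samples = []
    for _ in range(n_samples):
        theta_prime = theta + rng.normal(0.0, prop_sd)
        log_post_prime = log_likelihood(theta_prime) + log_prior(theta_prime)
        # Accept with probability min(1, A), A = exp(log_post_prime - log_post).
        if np.log(rng.uniform()) < log_post_prime - log_post:
            theta, log_post = theta_prime, log_post_prime
        samples.append(theta)
    return np.array(samples)

# Toy target (illustrative): posterior of mu given x_i ~ N(mu, 1) and mu ~ N(0, 1).
x = np.array([0.4, -0.3, 1.1])
chain = metropolis_hastings(lambda mu: -0.5 * np.sum((x - mu) ** 2),
                            lambda mu: -0.5 * mu ** 2,
                            theta0=0.0, n_samples=5000, rng=1)
print(chain[1000:].mean())   # should approach the exact posterior mean, sum(x) / (len(x) + 1)
```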

Recent Advances in Monte Carlo Approximation


Thanks to algorithms like those mentioned in the previous section, the marginal
likelihood is rarely a problem for standard Bayesian inference. It is so little a problem
that modelers rarely acknowledge it, substituting “∝” for “=” to avoid even
writing it, for they shall not bother to calculate it anyway. These days, complex
likelihoods pose a greater problem. For example, in Bayesian teaching, one is
interested in the likelihood of the learner’s inference given data, which is the
learner’s posterior. The complexity of the likelihood increases as the number of
data increases and as the complexity of the learning model increases.
Large amounts of data directly affect computation time. Assuming that data are
not reduced to a summary statistic, computation of the likelihood requires O(N) function evaluations, $f(x \mid \theta) = \prod_{i=1}^{N} \ell(x_i \mid \theta)$. If N is very large and ℓ is expensive
to compute, then computing f is infeasible, which renders standard Monte Carlo
inference infeasible. Methods exist for approximate (biased) MCMC using random
subsets of the data, such as adaptive subsampling (Bardenet, Doucet, & Holmes,
2014) and stochastic gradient methods (Patterson & Teh, 2013). Firefly Monte
Carlo (Maclaurin & Adams, 2014), which uses a clever proposal density to activate
(light up) certain data points, is the first exact MCMC algorithm to use subsets of
data. Other proposals employ multiprocessing strategies such as averaging results
from independent Monte Carlo simulations run on subsets of data (Scott et al.,
2013) and dividing computations and computing parts of the MH acceptance ratio
on multiple processors (Banterle, Grazian, & Robert, 2014).

Pseudo-Marginal Markov Chain Monte Carlo


This chapter primarily deals with complex likelihoods, a problem that is frequently exacerbated by large amounts of data. We focus on an exceptionally simple
technique referred to as pseudo-marginal MCMC (PM-MCMC; Andrieu &
Roberts, 2009; Andrieu & Vihola, 2012), which allows exact MH to be performed
using approximated functions. Assume that p in Equation 9 is difficult to evaluate
but that we can compute an estimate, p̂(y) = W_y p(y), for which W_y ∼ ψ(y) ≥ 0 and the expected value of W_y (which, in keeping with the existing literature, we shall refer to as weights) is equal to some constant, E[W_y] = k. The target
distribution is then a joint distribution over W and y, but we implicitly integrate
over W leaving only p(y). Hence, we can simply substitute p with p̂ in the
original acceptance ratio,

$$A := \frac{w'\,p(y')\,q(y' \to y)}{w\,p(y)\,q(y \to y')} = \frac{\hat p(y')\,q(y' \to y)}{\hat p(y)\,q(y \to y')}. \tag{13}$$

And to simulate from the posterior of a density with an intractable likelihood:

$$A := \frac{\hat f(x \mid \theta')\,\pi(\theta')\,q(\theta' \to \theta)}{\hat f(x \mid \theta)\,\pi(\theta)\,q(\theta \to \theta')}, \tag{14}$$

where f̂(x | θ) is a Monte Carlo estimate of the likelihood, f(x | θ). Thus, PM-MCMC is an exact MCMC algorithm equivalent to standard MH. The
stability and convergence properties of this algorithm have been rigorously
characterized (Andrieu & Vihola, 2012; Sherlock, Thiery, Roberts, & Rosenthal,
2013). In practice, the user need only ensure that p̂(y) has a constant bias and that
each p̂(y) is never recomputed for any y.
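A minimal sketch of the idea (ours, with an illustrative unbiased likelihood estimator standing in for f̂) is shown below; note that the stored estimate for the retained state is reused rather than recomputed.

```python
import numpy as np

def pseudo_marginal_mh(log_lik_hat, log_prior, theta0, n_samples,
                       prop_sd=0.5, rng=None):
    """MH in which the log-likelihood is replaced by the log of a noisy but
    unbiased estimate (Equation 14, symmetric proposal). The estimate attached
    to the retained state is stored and reused, never recomputed."""
    rng = np.random.default_rng(rng)
    theta = theta0
    log_target = log_lik_hat(theta, rng) + log_prior(theta)
    samples = []
    for _ in range(n_samples):
        theta_prime = theta + rng.normal(0.0, prop_sd)
        log_target_prime = log_lik_hat(theta_prime, rng) + log_prior(theta_prime)
        if np.log(rng.uniform()) < log_target_prime - log_target:
            theta, log_target = theta_prime, log_target_prime   # keep the accepted estimate
        samples.append(theta)
    return np.array(samples)

# Toy estimator (illustrative): f(x | theta) = Int N(x | z, 1) N(z | theta, 1) dz,
# estimated without bias by averaging N(x | z_m, 1) over z_m ~ N(theta, 1).
x = 0.7
def log_lik_hat(theta, rng, M=30):
    z = rng.normal(theta, 1.0, size=M)
    log_terms = -0.5 * (x - z) ** 2 - 0.5 * np.log(2 * np.pi)
    return np.logaddexp.reduce(log_terms) - np.log(M)

chain = pseudo_marginal_mh(log_lik_hat, lambda t: -0.5 * t ** 2, 0.0, 5000, rng=1)
print(chain[1000:].mean())
```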

Teaching Using PM-MCMC


The purpose of teaching, from the teacher’s perspective, is to choose one specific
dataset from the collection of all possible datasets, to convey one specific hypothesis
to a learner who considers all hypotheses,

$$p_T(x^* \mid \theta^*) = \frac{p_L(\theta^* \mid x^*)}{m(\theta^*)} \propto \frac{p_L(x^* \mid \theta^*)}{p_L(x^*)}.$$

The teacher marginalizes over datasets, $m(\theta^*) = \int_x p_L(\theta^* \mid x)\,dx$, and for each dataset marginalizes over all possible learning inferences, $p_L(x^*) = \int_\theta p_L(x^* \mid \theta)\,p(\theta)\,d\theta$. To generate teaching data, we must simulate data according to this
probability distribution while navigating these marginalizations.
In order to simulate teaching data, we use PM-MCMC by embedding
importance sampling within the Metropolis-Hastings algorithm. We use MH to avoid calculating the integral over alternative data, $\int_x p_L(\theta^* \mid x)\,dx$, leaving the
MH acceptance ratio,
$$A = \frac{p_L(x' \mid \theta)\,p_L(x^*)}{p_L(x^* \mid \theta)\,p_L(x')}, \tag{15}$$
where x ′ is the proposed (continuous) data and it is assumed that the proposal
density, q, is a symmetric, Gaussian perturbation of the data. Equation 15 indicates
that we must calculate the marginal likelihoods of data in order to use MH for
teaching. This marginalization is inescapable, so we replace it with an importance
sampling approximation, p̂ L (x).
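As a sketch of how this fits together (ours, not the chapter's implementation), the sampler below perturbs the current dataset with Gaussian noise and accepts with the ratio in Equation 15; `log_lik(x)` and `log_marg_hat(x, rng)` are placeholders for log p_L(x | θ*) and the importance sampling estimate log p̂_L(x), which are model-specific and must be supplied by the caller.

```python
import numpy as np

def simulate_teaching_data(log_lik, log_marg_hat, x0, n_samples,
                           prop_sd=0.25, rng=None):
    """PM-MCMC over datasets (Equation 15): propose x' by a symmetric Gaussian
    perturbation of the current dataset, score each dataset by
    log p_L(x | theta*) - log p_hat_L(x), and accept on the score difference."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    score = log_lik(x) - log_marg_hat(x, rng)
    samples = []
    for _ in range(n_samples):
        x_prime = x + rng.normal(0.0, prop_sd, size=x.shape)
        score_prime = log_lik(x_prime) - log_marg_hat(x_prime, rng)
        if np.log(rng.uniform()) < score_prime - score:
            x, score = x_prime, score_prime    # keep the accepted dataset and its score
        samples.append(x.copy())
    return np.array(samples)
```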
Teaching necessarily depends on the content to be taught, and different
problems require different formalizations of learning. In the following two sections
we apply the teaching model to generate data to teach in two distinct perceptual
learning problems involving categorization models: phonetics and visual scenes.
Categorization is well studied psychologically (see Anderson, 1991; Feldman,
1997; Markman & Ross, 2003) and computationally (see Jain, Murty, & Flynn,
1999; Neal, 2000; Rasmussen, 2000); it presents a particularly challenging marginalization problem and is thus an ideal testbed.

Example: Infant-Directed Speech (Infinite Mixture Models)


Infant-directed speech (IDS; motherese) has distinct properties such as a slower
speed, higher pitch, and singsong prosody. Kuhl et al. (1997) discovered that IDS
has unique phonetic properties that might indicate that IDS is suitable for teaching.
Phonemes are defined by their formants, which are peaks in the spectral
envelope. The first formant, F1 , is the lowest frequency peak; the second formant,
F2 , is the second lowest frequency peak; and so on. The first two formants
are usually sufficient to distinguish phonemes. When examples of phonemes are
plotted in F1 × F2 formant space they form bivariate Gaussian clusters. Kuhl et
al. (1997) observed that the clusters of IDS corner vowel examples (/A/, as in pot; /i/, as in beet; /u/, as in boot) are hyper-articulated
(farther apart), resulting in an increased vowel space. The argument is that
hyper-articulation is good for teaching because clusters that are farther apart should
be easier to discriminate.
To date, the IDS research lacks a formal account of teaching vowel phonemes
to infants; rather, arguments are built around intuitions, which conceivably are the
source of much of the contention regarding this topic.2 This sort of question has
been unapproachable previously because languages contain many, many phonemes,
and the set of possible categorizations of even a small number of examples rapidly
results in an intractable quantity to sum over. Here we show how the teaching
model can be applied to such a problem. We first describe a model of learning
Gaussian category models and then describe the relevant teaching framework. We
then generate teaching data and explore their qualitative properties.

Learning Phonetic Category Models


Infants must learn how many phonemes there are in their native language as well
as the shapes, sizes, and location of each phoneme in formant space, all while
inferring to which phoneme each datum belongs. For this task, we employ a
Bayesian non-parametric learning framework for learning category models with
an unknown number of categories (Rasmussen, 2000).3
Following other work, we formalize phonetic category models as mixtures
of Gaussians (Vallabha, McClelland, Pons, Werker, & Amano, 2007; Feldman,
Griffiths, Goldwater, & Morgan, 2013). Each phoneme, φ_1, . . . , φ_K ∈ Φ, is a Gaussian with mean µ_k and covariance matrix Σ_k,

$$\Phi = \{\phi_1, \ldots, \phi_K\} = \{\{\mu_1, \Sigma_1\}, \ldots, \{\mu_K, \Sigma_K\}\}. \tag{16}$$

The likelihood of some data under a finite Gaussian mixture model is


$$f(x \mid \Phi, \Omega) = \prod_{i=1}^{N} \sum_{k=1}^{K} \omega_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k), \tag{17}$$

where N(x | µ, Σ) is the multivariate Normal likelihood of x given µ and Σ, and each ω_k in Ω is a non-negative real number such that $\sum_{k=1}^{K} \omega_k = 1$. The above
model assumes that the learner knows the number of categories. We are interested
in the case where in addition to learning the means and covariance matrices of each
phoneme, the learner learns the assignment of examples to an unknown number of
phonemes. The assignment is represented as an N -length vector z = [z 1 , . . . , z N ].
Each entry z i ∈ 1, . . . , K . In this case the likelihood is,
$$f(x \mid \Phi, z) = \prod_{i=1}^{N} \sum_{k=1}^{K} \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\, \delta_{z_i,k}, \tag{18}$$

where δ_{i,j} is the Kronecker delta function, which assumes value 1 if i = j and value 0 otherwise; δ_{z_i,k} equals 1 if, and only if, the ith datum is a member of phoneme k.
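As a small illustration of Equation 18, the following sketch (ours; the phoneme parameters are made up for the example) computes the log-likelihood of formant data given an assignment vector.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood_given_assignment(x, z, mus, sigmas):
    """log f(x | Phi, z) from Equation 18: each datum x_i contributes the Normal
    density of the phoneme it is assigned to by z_i."""
    return sum(multivariate_normal.logpdf(x[i], mean=mus[z[i]], cov=sigmas[z[i]])
               for i in range(len(x)))

# Two made-up "phonemes" in F1 x F2 space (values are illustrative only).
mus = [np.array([300.0, 2300.0]), np.array([700.0, 1200.0])]
sigmas = [np.eye(2) * 50.0 ** 2, np.eye(2) * 80.0 ** 2]
x = np.array([[310.0, 2250.0], [680.0, 1150.0], [720.0, 1300.0]])
z = [0, 1, 1]                                   # assignment of each datum to a phoneme
print(log_likelihood_given_assignment(x, z, mus, sigmas))
```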
We employ the Dirichlet process Gaussian mixture model (DPGMM) framework. Learners must infer Φ and z. We assume the following generative model:

$$G \sim \mathrm{DP}(\alpha H), \tag{19}$$

$$\phi_k \sim H, \tag{20}$$

$$x_k \sim \mathcal{N}(x_k \mid \phi_k), \tag{21}$$

where DP(αH) is a Dirichlet process with concentration parameter α that emits H, where H is the prior distribution on φ. Here H is the Normal Inverse-Wishart (NIW) prior (Murphy, 2007),

$$\mu_k, \Sigma_k \sim \mathrm{NIW}(\mu_0, \Lambda_0, \kappa_0, \nu_0), \tag{22}$$

which implies

$$\Sigma_k \sim \text{Inverse-Wishart}_{\nu_0}(\Lambda_0^{-1}), \tag{23}$$

$$\mu_k \mid \Sigma_k \sim \mathcal{N}(\mu_0, \Sigma_k/\kappa_0), \tag{24}$$

where Λ0 is the prior scale matrix, µ0 is the prior mean, ν0 ≥ d is the prior degrees of freedom, and κ0 is the number of prior observations.
To formalize inference over z, we introduce a prior, π(z | α), via the Chinese
Restaurant Process (Teh, Jordan, Beal, & Blei, 2006), denoted CRP(α), where the
parameter α affects the probability of new components. Higher α creates a higher
bias toward new components. Data points are assigned to components as follows:

$$P(z_i = k \mid z^{(-i)}, \alpha) = \begin{cases} \dfrac{n_k}{N - 1 + \alpha} & \text{if } k \in 1 \ldots K \\[6pt] \dfrac{\alpha}{N - 1 + \alpha} & \text{if } k = K + 1 \end{cases}, \qquad n_k = \sum_{i=1}^{N} \delta_{z_i,k}, \tag{25}$$

where z^{(−i)} = z \ z_i.
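A minimal sketch (ours) of sequentially sampling an assignment vector from CRP(α) according to Equation 25:

```python
import numpy as np

def sample_crp_assignment(N, alpha, rng=None):
    """Sequentially seat N data points according to Equation 25: with i points
    already seated, an existing category k is chosen with probability
    n_k / (i + alpha) and a new category with probability alpha / (i + alpha)."""
    rng = np.random.default_rng(rng)
    z, counts = [0], [1]                    # the first datum starts category 0
    for _ in range(1, N):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()                # sum(counts) + alpha = i + alpha
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):
            counts.append(1)                # open a new category
        else:
            counts[k] += 1
        z.append(k)
    return z

print(sample_crp_assignment(10, alpha=1.0, rng=1))   # one sampled assignment vector
```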

Teaching DPGMMs
Recall that the probability of the teacher choosing data is proportional to the
induced posterior. The posterior for the DPGMM is,

$$\pi(\Phi, z \mid x) = \frac{f(x \mid \Phi, z)\,\pi(\Phi \mid \mu_0, \Lambda_0, \kappa_0, \nu_0)\,\pi(z \mid \alpha)}{m(x \mid \mu_0, \Lambda_0, \kappa_0, \nu_0, \alpha)}. \tag{26}$$
Our choice of prior allows us to calculate the marginal likelihood exactly,


$$m(x \mid \mu_0, \Lambda_0, \kappa_0, \nu_0, \alpha) = \sum_{z \in Z} \pi(z \mid \alpha) \prod_{k=1}^{K_z} \iint_{\phi_k} f(x_k \mid \phi_k, z)\,\pi(\phi_k \mid \mu_0, \Lambda_0, \kappa_0, \nu_0)\,d\phi_k \tag{27}$$

$$= \sum_{z \in Z} \pi(z \mid \alpha) \prod_{k=1}^{K_z} f(x_k \mid \mu_0, \Lambda_0, \kappa_0, \nu_0), \tag{28}$$

where

$$\pi(z \mid \alpha) = \frac{\Gamma(\alpha) \prod_{k=1}^{K_z} \Gamma(n_k)}{\Gamma(N + \alpha)}\, \alpha^{K_z}, \tag{29}$$

Z is the set of all possible partitions of N data points into 1 to N categories, K_z is the number of categories in assignment vector z, and f(x_k | µ0, Λ0, κ0, ν0) is the
marginal likelihood of the data assigned to category k under NIW (which can be
calculated analytically). The size of Z has its own named combinatorial quantity:
the Bell number, or Bn . If we have sufficiently little data or ample patience, we can
calculate the quantity in Equation 28 by enumerating Z. However, Bell numbers
grow quickly: B_1 = 1, B_5 = 52, B_12 = 4,213,597, and so on. We can produce an importance sampling approximation by setting q(z) := π(z | α),

$$\hat m(x \mid \mu_0, \Lambda_0, \kappa_0, \nu_0, \alpha) = \frac{1}{M} \sum_{i=1}^{M} \prod_{k=1}^{K_{\bar z_i}} f(x_k \mid \mu_0, \Lambda_0, \kappa_0, \nu_0). \tag{30}$$
The approach of drawing from the prior by setting q(θ) := π(θ) is usually
inefficient. Areas of high posterior density contribute most to the marginal
likelihood, thus the optimal q is close to the posterior. Several approaches have
been proposed for estimating the marginal likelihood in finite mixture models
(Chib, 1995; Marin & Robert, 2008; Rufo, Martin, & Pérez, 2010; Fiorentini,
Planas, & Rossi, 2012); here we propose a Gibbs-initialization importance sampling scheme suited to the infinite case.4 Each sample, z̄_1, . . . , z̄_M, is drawn
by sequentially assigning the data to categories based on the standard collapsed
Gibbs sampling scheme (Algorithm 1),

$$q(z) = \prod_{i=2}^{N} p(z_i \mid \{z_1, \ldots, z_{i-1}\}, \{x_1, \ldots, x_{i-1}\}, \Lambda_0, \mu_0, \kappa_0, \nu_0, \alpha), \tag{31}$$

$$p(z_i \mid \{z_1, \ldots, z_{i-1}\}, \{x_1, \ldots, x_{i-1}\}, \Lambda_0, \mu_0, \kappa_0, \nu_0, \alpha) \propto \begin{cases} n_k\, f(x_i \mid x_k, \mu_0, \Lambda_0, \kappa_0, \nu_0) & \text{if } k \in 1 \ldots K \\ \alpha\, f(x_i \mid \mu_0, \Lambda_0, \kappa_0, \nu_0) & \text{if } k = K + 1 \end{cases}. \tag{32}$$
Algorithm 1 Partial Gibbs importance sampling proposal


function PGIBBS(x, µ0, Λ0, κ0, ν0, α)
    q ← 1
    Z ← [1]
    K ← 1
    n ← [1]
    for i ∈ 2, . . . , |x| do
        P ← empty array of length K + 1
        for k ∈ 1, . . . , K do
            y ← {x_j ∈ x_1, . . . , x_{i−1} : Z_j = k}
            P[k] ← n[k] × f(x_i | y, µ0, Λ0, κ0, ν0)
        end for
        P[K + 1] ← α × f(x_i | µ0, Λ0, κ0, ν0)
        P ← P / Σ_{p∈P} p
        z ∼ Discrete(P)
        Z.append(z)
        q ← q × P[z]
        if z ≤ K then
            n[z] ← n[z] + 1
        else
            n.append(1)
            K ← K + 1
        end if
    end for
    return Z, q
end function
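For concreteness, a compact Python rendering of Algorithm 1 is sketched below; it is ours (not the authors' C++ implementation), `log_predictive(xi, y)` is a placeholder for the NIW posterior predictive log f(x_i | y, µ0, Λ0, κ0, ν0) that the caller must supply (an empty y giving the prior predictive), and probabilities are accumulated in log space.

```python
import numpy as np

def partial_gibbs_proposal(x, log_predictive, alpha, rng=None):
    """Draw one assignment z ~ q(z) by sequential (partial Gibbs) seating,
    Equations 31-32, and return (z, log q(z)) for use as an importance sampling
    proposal. `log_predictive(xi, y)` must return log f(xi | y, mu0, Lambda0,
    kappa0, nu0); an empty y gives the prior predictive."""
    rng = np.random.default_rng(rng)
    z, counts, log_q = [0], [1], 0.0
    for i in range(1, len(x)):
        log_p = []
        for k, n_k in enumerate(counts):
            y = [x[j] for j in range(i) if z[j] == k]     # data already in category k
            log_p.append(np.log(n_k) + log_predictive(x[i], y))
        log_p.append(np.log(alpha) + log_predictive(x[i], []))
        log_p = np.array(log_p)
        log_p -= np.logaddexp.reduce(log_p)               # normalize in log space
        k = int(rng.choice(len(log_p), p=np.exp(log_p)))
        log_q += log_p[k]
        if k == len(counts):
            counts.append(1)                              # new category
        else:
            counts[k] += 1
        z.append(k)
    return z, log_q
```

Under importance sampling, each draw z̄ would then receive the weight π(z̄ | α) ∏_k f(x_k | µ0, Λ0, κ0, ν0) / q(z̄).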

Because each sample is independent, we do not have to worry about the label-switching problem, which produces unpredictable estimator biases (see Chib, 1995 and Neal, 1999). Thus, we simulate teaching data for the target model, (Φ, z), according to the pseudo-marginal MH acceptance ratio,

$$\hat A = \frac{f(x' \mid \Phi, z)\,\hat m(x \mid \mu_0, \Lambda_0, \kappa_0, \nu_0, \alpha)}{f(x \mid \Phi, z)\,\hat m(x' \mid \mu_0, \Lambda_0, \kappa_0, \nu_0, \alpha)}. \tag{33}$$
To ensure that the importance sampling estimate of the marginal likelihood
(Equation 30) converges to the exact quantity, we generated 2,500 random
datasets for N = 6, 8, 10 from N ([0, 0], I2 ) and calculated the importance sampling
estimate for up to 10,000 samples. For the calculation, the NIW parameters were
set to µ0 = [0, 0], Λ0 = I_2, κ0 = 1, and ν0 = 2; and the CRP parameter was
α = 1. Figure 4.1(a) shows the average relative error as a function of the number of
samples. The results demonstrate that the relative error of the IS estimate decreases
as the number of samples increases, that there is generally more error for higher N, and that the partial Gibbs importance sampling scheme produces a third of the error of the prior importance sampling scheme. We compared the runtime performance of C++ implementations of exact calculation via enumeration and importance sampling (using M = 1,000 samples) for N = 1, . . . , 13. The results can be seen in Figure 4.1(b) and (c). Enumeration is faster than IS until N = 10, after which the intractability of enumeration becomes apparent. For N = 13, importance sampling with M = 1,000 is 400 times faster than enumeration.

[Figure 4.1: plots omitted; panel contents are described in the caption below. Legend: Enumeration, Prior IS, Gibbs IS.]

FIGURE 4.1 Performance comparison between prior and partial Gibbs importance sampling. (a) Mean relative error, over 2,500 random datasets (y-axis), of the prior importance sampling approximation (dark) and the partial Gibbs importance sampling approximation (light; Equation 31) by number of samples (x-axis) for six (solid), eight (dashed), and ten data points (dotted). (b) Runtime performance (seconds; y-axis) of algorithms for calculating/approximating m(x | µ0, Λ0, κ0, ν0, α) by number of data points (N; x-axis): exact calculation via enumeration (black), 1,000 samples of prior importance sampling (dark gray), and 1,000 samples of partial Gibbs importance sampling (light gray). (c) Separate view of runtime of the importance samplers.

Experiments
We first conduct small-scale experiments to demonstrate that x̂ simulated using Â
(the pseudo-marginal acceptance ratio) is equivalent to x simulated using A (the
exact acceptance ratio) while demonstrating the basic behavior of the model. We
then scale up and conduct experiments to determine what type of behavior (e.g.
hyper- or hypo-articulation, variance increase) can be expected in data designed to
teach complex Gaussian category models to naive learners.
To ensure the exact MH samples and pseudo-marginal MH samples are
identically distributed we used a three-category model, which was to be taught
with two data points assigned to each category. We collected 1,000 samples across
five independent Markov chains, ignoring the first 200 samples from each chain
and thereafter collecting every twentieth sample.5 The prior parameters were set as
in the previous section. Figure 4.2(a) and (b) show the result. Both datasets exhibit
similar behavior including hyper-articulation, denoted by the increased distance
between the category means of the teaching data, and within-category variance
increase. A two-sample, permutation-based, Gaussian Kernel test (Gretton,
Fukumizu, Harchaoui, & Sriperumbudur, 2009; Gretton, Borgwardt, Rasch,
Scholkopf, & Smola, 2012) using 10,000 permutations indicates no detectable difference in distribution between the exact and pseudo-marginal data (p = 0.9990).

[Figure 4.2: scatter plots omitted; panels (a)–(c) with legend Teaching (black) and Original (gray). See caption below.]

FIGURE 4.2 Behavior of the teaching model with exact and pseudo-marginal samples.
(a) Three-category Gaussian mixture model. The gray points are drawn directly from
the target model and the black points are drawn from the teaching model using the
exact acceptance ratio with N = 3. (b) Three-category Gaussian mixture model. The
gray points are drawn directly from the target model and the black points are drawn
from the teaching model using the pseudo-marginal acceptance ratio with N = 3.
(c) Pseudo-marginal samples for a two-category model where both categories have
the same mean.
78 B. S. Eaves Jr., A. M. Schweinhart, and P. Shafto

From a pedagogical standpoint, hyper-articulation is intuitively sensible.


A learner cannot accurately learn the shapes and locations of categories if he or she
has not learned how many categories there are, a task which is made considerably
easier by accentuating the differences between phonemes. Thus, much of the effort
of teaching category models should be allocated toward teaching the number of
categories, perhaps at the expense of accurately conveying their other attributes.
To further demonstrate this behavior in the teaching model, we designed a
two-category model where both categories had identical means ([0, 0]), but had
opposite x and y variances (3.5, 1) and (1, 3.5). Each category was taught with
two data points. We repeated the same sampling procedure used in the previous
experiment and used the same NIW parameters. We see in Figure 4.2(c) that the
samples from the target model (gray) appear to form a single cluster but that the
teaching model elongates the categories perpendicularly to form a cross, which
makes the number of categories clear.
Thus far we have observed hyper-articulation and within-category variance
increase in the teaching data for simple models. We have not particularly pushed
the limits of the pseudo-marginal approach. Whereas dialects of the English
language have around 20 vowel phonemes, we have so far only generated a
three-category model. Here we increase the complexity of the target model to
better represent the complexity of speech and to determine whether more complex
target models necessitate more nuanced teaching manipulations. For example,
hypo-articulation may result when two near clusters move apart; if one pair of
clusters moves apart, the resulting movement may force other pairs of clusters closer
together.
We randomly generated a 20-category target model for which we simulated
teaching data. After disregarding the first 1,000 samples, we collected every fortieth
sample until 500 samples had been collected from each of eight independent
chains of PM-MCMC. We aggregated the samples across chains and calculated
the category means of the teaching data using the aggregated data. The teaching
samples and their category means are plotted along with random samples from
the target model in Figure 4.3 (top). The change in distance (hypo- and
hyper-articulation) between the category pairs can be seen in Figure 4.3 (bottom). The
resulting teaching data exhibit a general tendency to hyper-articulate, although
a number of category pairs seem to hypo-articulate. This hypo-articulation does
not occur only in pairs that are well separated in the target model. Among the
hypo-articulated category pairs are adjacent pairs 0–8, 1–6, 4–14, and 11–14.
Qualitatively, it appears that hypo-articulation is used by the teaching model,
in conjunction with variance increases, to disambiguate cluster boundaries. For
example, clusters 0, 6, 8, and 12 are close together in the original data; in
the teaching data, clusters 6 and 8 move away from each other but move
closer to clusters 1 and 0. This movement has the effect of creating a clearer
separation between clusters 1 and 6, and clusters 0, 8, and 12. Variance
increases then help to disambiguate the hypo-articulated clusters as in Figure 4.2.

[Figure 4.3: plots omitted; panel contents are described in the caption below. Top panel axes: x1 (horizontal) by x2 (vertical); legend: Teaching, Original. Bottom panel: ΔTeaching − ΔRandom by cluster pair.]

FIGURE 4.3 Scale experiment. (Top) Scatter plot of random samples from the target model (gray) and the teaching data (black). The numbered circles represent the means of each of the 20 categories. (Bottom) Change in distance between category pairs from random to teaching samples. Negative values indicate hypo-articulation and positive values indicate hyper-articulation.


These results demonstrate that hypo-articulation is indeed consistent with
teaching.

Discussion
In this section we sought to teach Gaussian category models using a
non-parametric categorization framework, inspired by a debate from the
infant-directed speech literature. We demonstrated how standard MH sampling in
the teaching model becomes intractable at a small number of datapoints/categories
(Figure 4.1(b)) and showed how PM-MCMC using a novel importance sampling
scheme (Algorithm 1) allows for tractable teaching in complex models. We
then conducted experiments demonstrating that PM-MCMC produces results
indistinguishable from standard MH, while demonstrating that, like IDS, the
teaching model produces hyper-articulation and within-category variance increase
(Figure 4.2). We then scaled up and created a random target model with
roughly the same complexity as an English phonetic category model, finding that hypo-articulation, hyper-articulation, and variance increase are all features
consistent with teaching.
The results suggest that these features are consistent with teaching in general,
but do not indicate that they are consistent specifically with teaching phonetic
category models. To that end, one would need to apply the teaching model to
category models derived from empirical phonetics data. We have demonstrated
that, using PM-MCMC, the teaching model is capable of contributing to this and
other theoretical debates in teaching complex categories such as those in natural
language.

Example: Natural Scene Categories: Infinite Mixtures of Infinite Mixtures
The visual environment has regular properties including a predictable anisotropic
distribution of oriented contours. While these properties are never explicitly
taught, there is evidence to indicate that the visual system learns and takes
advantage of these properties through experience in the visual world (Hansen & Essock, 2004; Schweinhart & Essock, 2013; Wainwright, 1999), and the ability to
automatically teach people’s perception would be useful.
The distribution of orientations in the visual environment is bimodal, peaking
at the cardinal orientations (horizontal and vertical: Coppola, Purves, McCoy,
& Purves, 1998; Hansen & Essock, 2004; Switkes, Mayer, & Sloan, 1978). In
carpentered (man-made) environments, this makes sense as buildings and walls
tend to have both horizontal and vertical contours. However, even in the natural
environment (i.e. an outdoor rural scene), there tends to be more structural
information at the cardinal orientations due to the horizon, foreshortening,
and phototropic/gravitropic growth. The average scene contains most of its
structural content around horizontal, second most around vertical and least near
the 45-degree obliques. The human visual system’s processing of oriented structure
is biased in the opposite way, thus neutralizing this anisotropy in natural scenes
by suppressing the perceptual magnitude of content the most at near-horizontal orientations, the least at oblique orientations, and to an intermediate degree at vertical orientations, a pattern termed the horizontal effect (Essock, DeFord, Hansen, &
Sinai, 2003; Essock, Haun, & Kim, 2009).

Sensory Learning of Orientation Distributions


While the general pattern of anisotropy present in natural scenes has been
found to be a good match to perceptual biases (Essock, Haun, & Kim, 2009),
there are substantial differences between the distributions for carpentered and
non-carpentered environments (Girshick, Landy, & Simoncelli, 2011). The
distribution of oriented contours in an office environment has substantially greater
peaks at the cardinal orientations than the distribution in a national park, for
instance. Here we generalize the teaching model described in the previous section
to determine optimal examples for “teaching” the visual system the distribution of
natural perceptual scenes from different categories (e.g. nature versus “carpentered”
environments).
Given data in the form of the amplitudes of various, discrete orientations, scene
categories can themselves be multi-modal. For example, the oriented content
in forest scenes is different from the oriented content in desert scenes, but both
desert and forest scenes fall into the category of natural scenes. In order to begin
quantifying different types of scene categories, we employ a nested categorization
model in which outer categories are composed of inner categories. (For a similar
but more restrictive model see Yerebakan, Rajwa, & Dundar, 2014). More
specifically, we implement a Dirichlet process mixture model where the outer
Dirichlet process emits a Dirichlet process that emits Gaussians according to NIW.
This is a generalization of the DPGMM model outlined in the previous section.
The generative process of this Dirichlet process mixture model of Dirichlet
process Gaussian mixture models (DP-DPGMM) is outlined in Algorithm 2. A
CRP parameter for the outer categories, γ, is drawn from H; and the assignment of data to outer categories, z, is drawn from CRP_N(γ). For each outer category, k = 1, . . . , K, an inner CRP parameter, α_k, is drawn from Λ; a set of NIW parameters, G_k, is drawn from G; and an assignment of data in outer category k to inner categories, v_k, is drawn from CRP_{n_k}(α_k). For each inner category, j = 1, . . . , J_k, a mean and covariance, µ_kj and Σ_kj, are drawn from G_k; and data points are drawn from those µ_kj and Σ_kj. The full joint density is,
$$p(\gamma \mid H)\, p(z \mid \gamma) \prod_{k=1}^{K} \left( p(\alpha_k \mid \Lambda)\, p(v_k \mid \alpha_k)\, p(G_k \mid G) \left[ \prod_{j=1}^{J_k} p(\mu_{kj}, \Sigma_{kj} \mid G_k) \prod_{x \in x_{kj}} p(x \mid \mu_{kj}, \Sigma_{kj}) \right] \right). \tag{34}$$

Teaching DP-DPGMMs
Given data x = x1 , . . . , x N we wish to teach the assignment of data to outer
categories, z, the assignment of data to inner categories, v, and the means
and covariance matrices that make up the inner categories. The DP-DPGMM
framework assumes that G, H, and Λ (the base distributions on G_k and on the outer and inner CRP parameters) are known and that all other quantities are unknown.
Algorithm 2 Generative process of the DP-DPGMM


procedure DP-DPGMM(G, H, Λ, the number of data N)
    γ ∼ H
    z ∼ CRP_N(γ)
    for k ∈ 1, . . . , K_z do
        α_k ∼ Λ
        G_k ∼ G
        v_k ∼ CRP_{n_k}(α_k)
        for j ∈ 1, . . . , J_k do
            µ_kj, Σ_kj ∼ G_k
        end for
        for i ∈ 1, . . . , n_k do
            x_ki ∼ N(µ_{k,v_i}, Σ_{k,v_i})
        end for
    end for
end procedure

To compute the marginal likelihood m(x | G, H, Λ), we must integrate and sum over all unknowns. The resulting quantity is far more complex than the DPGMM marginal likelihood (Equation 28). We approximate m(x | G, H, Λ) via importance sampling by drawing parameters from the generative process and calculating the likelihood of the data,

$$m(x \mid G, H, \Lambda) \approx \frac{1}{M} \sum_{i=1}^{M} \prod_{k=1}^{K_{\bar z_i}} \prod_{j=1}^{J_{\bar v_{ki}}} f(x_{k,j} \mid \bar G_k), \tag{35}$$

where $K_{\bar z_i}$ is the number of outer categories in the ith outer category assignment, $\bar z_i$, and $J_{\bar v_{ki}}$ is the number of inner categories in the kth outer category according to the ith inner category assignment, $\bar v_i$. The MH acceptance ratio is then,
$$A = \frac{\hat m(x \mid G, H, \Lambda) \prod_{k=1}^{K_{z^*}} \prod_{j=1}^{J_{v_k^*}} \mathcal{N}(x'_{kj} \mid \mu^*_{kj}, \Sigma^*_{kj})}{\hat m(x' \mid G, H, \Lambda) \prod_{k=1}^{K_{z^*}} \prod_{j=1}^{J_{v_k^*}} \mathcal{N}(x_{kj} \mid \mu^*_{kj}, \Sigma^*_{kj})}. \tag{36}$$

Notice that all factors of the full joint distribution that do not rely on the data cancel from A, leaving only the likelihood of the data under the inner-category parameters (µ*_kj, Σ*_kj) and the marginal likelihood.

Experiments
Our model can be used to choose the images that would be most efficient data for
teaching the true distribution within and across scene categories. In this vein, we
shall use the teaching model to choose images, from among some set of empirical
data, that are best for teaching scene categories given their orientation distribution.
Different types of visual experience were collected by wearing a head-mounted
camera, which sent an outgoing video feed to a laptop that was stored in
a backpack. The videos were recorded during typical human environmental
interaction as observers walked around different types of environments (a nature
preserve, inside a house, downtown in a city, around a university, etc.).
Subsequently, every thousandth frame of the videos was taken as a representative
sample and sample images were sorted into two outer categories: purely natural (no
man-made structure) or outdoor, but containing carpentered content. Then, the
structural information was extracted using a previously developed image rotation
method (see Schweinhart & Essock, 2013). Briefly, each frame was fast Fourier transformed, rotated to the orientation of interest, and the amplitude of the cardinal
orientations (horizontal and vertical) was extracted and stored. Repeating this
process every 15 degrees allowed each video frame to be condensed into a series of
12 data points representing the amount of oriented structure in the image. In this
work, we focus on amplitudes at 0, 45, 90, and 135 degrees and on natural and
carpentered scenes.
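The rotation-based extraction itself is described in Schweinhart and Essock (2013); purely as a rough illustration of summarizing oriented structure from a 2D FFT (ours, not the authors' pipeline), one could proceed along these lines:

```python
import numpy as np

def orientation_amplitudes(image, step_deg=15):
    """Crude orientation summary of a grayscale image: sum 2D FFT amplitude in
    wedges of +/- step_deg/2 around each contour orientation (0, 15, ..., 165).
    Frequency energy at angle phi corresponds to contours oriented at phi + 90."""
    amp = np.abs(np.fft.fftshift(np.fft.fft2(image)))
    h, w = image.shape
    fy, fx = np.meshgrid(np.arange(h) - h // 2, np.arange(w) - w // 2, indexing="ij")
    amp[h // 2, w // 2] = 0.0                                # ignore the DC component
    theta = (np.degrees(np.arctan2(fy, fx)) + 90.0) % 180.0  # contour orientation
    out = {}
    for ori in range(0, 180, step_deg):
        diff = np.abs((theta - ori + 90.0) % 180.0 - 90.0)   # circular distance (deg)
        out[ori] = amp[diff < step_deg / 2.0].sum()
    return out

# Synthetic image with horizontal stripes; most amplitude should land in the 0-degree bin.
img = np.tile(np.sin(np.linspace(0.0, 20.0 * np.pi, 128)), (128, 1)).T
amps = orientation_amplitudes(img)
print(max(amps, key=amps.get))
```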
To derive a target distribution (means and covariance matrices of inner cate-
gories), we applied expectation-maximization (EM; Dempster, Laird, & Rubin,
1977) to the orientation data from each setting.6 To facilitate cross-referencing
existing images, rather than generating a distribution over datasets, we searched for
the single best dataset, xopt , for teaching the scene categories by searching for the
dataset that maximized the quantity in Equation 3, i.e.

$$x_{\mathrm{opt}} = \operatorname*{argmax}_x \left[\, p_T(x \mid \theta^*) \,\right]. \tag{37}$$

We find the approximate argmax via simulated annealing (Metropolis et al., 1953). Simulated annealing applies a temperature, T, to the MH acceptance ratio, A^{1/T}. A higher temperature has the effect of softening the distribution, which prevents MH from becoming stuck in a local maximum, allowing it to more easily find the global maximum. The temperature is reduced as the MH run progresses and ends with T = 1. We adopt an annealing schedule such that on transition t of t_max total transitions, T^{-1} = t/t_max. We ran 16 independent Markov chains for
3,000 iterations and chose the dataset that produced the maximum score under the
teaching model. The DP-DPGMM hyper-parameters were set as follows,

$$\mu_k, \lambda_k, \kappa_k, \nu_k \sim G,$$

$$\mu_k \sim \mathcal{N}(\bar x, \mathrm{cov}(x)),$$

$$\lambda_k \sim \text{Inverse-Wishart}_{d+1}(\mathrm{cov}(x)),$$

$$\kappa_k \sim \mathrm{Gamma}(2, 2),$$

$$\nu_k \sim \mathrm{Gamma}_d(2, 2),$$

with the intention of being sufficiently vague, where Gamma_d(·, ·) denotes the
gamma distribution with lower bound d, x̄ is the mean of x, and cov(x) is the
covariance of x.7 All α and γ were drawn from Inverse-Gamma(1, 1).
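A minimal sketch (ours) of the annealed search described above, with `log_score(x)` standing in for log p_T(x | θ*) and the schedule 1/T = t/t_max:

```python
import numpy as np

def annealed_argmax(log_score, x0, t_max=3000, prop_sd=0.25, rng=None):
    """Simulated annealing over datasets: raise the MH acceptance ratio to 1/T
    with 1/T = t / t_max (ending at T = 1) and keep the best dataset found."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    score = log_score(x)
    best_x, best_score = x.copy(), score
    for t in range(1, t_max + 1):
        inv_T = t / t_max
        x_prime = x + rng.normal(0.0, prop_sd, size=x.shape)
        score_prime = log_score(x_prime)
        if np.log(rng.uniform()) < inv_T * (score_prime - score):
            x, score = x_prime, score_prime
        if score > best_score:
            best_x, best_score = x.copy(), score
    return best_x, best_score

# Toy check (illustrative): the score below peaks at x = [3, 3].
best_x, _ = annealed_argmax(lambda x: -np.sum((x - 3.0) ** 2), np.zeros(2), rng=1)
print(np.round(best_x, 2))
```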
We note that PM-MCMC does not offer the same theoretical guarantees for
optimization problems that it does for simulation because PM-MCMC relies on
approximate scores; thus the maximum score may be inflated to some degree by
estimator error. Pilot simulations revealed that at 1,000 IS samples, the variance
of m̂ for this problem is acceptable for current purposes. If estimator error is a
concern, one may verify the top few optimal datasets post hoc by re-evaluating their
scores a number of times.
The optimal teaching data are plotted along with the data from the original
model, with the target model means superimposed in Figure 4.4. The images
closest to the teaching data and the empirical means in Euclidean space are
displayed in Figure 4.5. The results demonstrate that the images closest to
the mean in terms of their orientation content are not the best examples to
teach the inner categories; the algorithm instead chose images that contrast the
category distributions. This is especially true for the natural images and when the
distribution of the inner category has higher variance (Figure 4.5, bottom row;
gray data).
Although the teaching model was only given information about the amplitude
of oriented structure in the global image, there are qualitative visual implications
of the choice of images used for teaching. Whereas images near the mean for
both “natural” categories have predominant horizon lines and ground planes, the
teaching model makes a clearer distinction between the two categories by choosing
images with and without a strong horizontal gradient. The teaching model also
more readily distinguishes urban (inner A) from rural-type (inner B) environments
for the carpentered scenes as indicated by the inclusion of cars and buildings in
inner category A (see Figure 4.5). Overall, the teaching model included a wider
variety of vantage points (including looking at the ground) for teaching images of
all categories, better capturing the variability of the image set. This is opposed to
the relatively equal height in the visual field of the centers of the mean images.

[Figure 4.4: image panels omitted; the panel residue indicates amplitude-by-orientation (15°–180°) plots for Carpentered and Natural scenes. See the caption below.]

FIGURE 4.4 Image data associated with the mean empirical data and optimal teaching data. (Top) The two closest images, in Euclidean space, to the optimal teaching datum for each inner category for natural (left) and carpentered (right) scenes. (Bottom) The two closest images, in Euclidean space, to the empirical means for each inner category for natural (left) and carpentered (right) scenes.

Discussion
In this section we sought to select optimal images for teaching categories of natural scenes. We employed a nested categories model to generalize the DPGMM model used in IDS categorization. Unlike the DPGMM, the DP-DPGMM has no closed-form marginal likelihood (due to the use of non-conjugate priors), and therefore computing the MH ratio required approximation. The results of the simulation
indicate that the best examples for teaching the inner categories of purely natural
and carpentered scenes are not the means of the respective categories.
The images that are best for teaching different visual categories under the model
exhibit surprising features; the teaching model emphasizes data away from the
mean in order to contrast the categories and represent the variation. Although we
have not directly tested the effectiveness of the teaching images in visual category
learning, the results of this model have potential implications for fields in which
visual training and image categorization are important.

[Figure 4.5: scatter plots omitted; panel contents are described in the caption below. Orientation–orientation axes at 0°, 45°, 90°, 135°; legend: Model samples, Best teaching samples, Category mean; rows: Carpentered, Natural.]

FIGURE 4.5 Scene category teaching results. Orientation–orientation scatter plots of random samples from the target model. Different marker colors denote different inner categories. Circles represent the target model category means and triangles represent the optimal teaching data. The top row shows data from carpentered scenes; the bottom shows data from natural scenes.

Conclusion
The goal of cognitive science is to understand human cognition in common scenarios; however, a valid complaint against Bayesian theoretical accounts of cognition is that they are often unable to account for anything more than schematic
scenarios. Although we have focused on the problem of teaching categories,
we have demonstrated how recent advances in the so-called Bayesian Big Data
literature allow Bayesian cognitive modelers, in general, to build more compelling
models that are applicable to real-world problems of interest to experimentalists.
We began the chapter by briefly discussing the complexity concerns of the
Bayesian cognitive modeler, especially in the domain of teaching, and outlined
some standard methods of dealing with it. We then discussed pseudo-marginal
sampling and applied it to the problem of teaching complex concepts. We applied
the PM-MCMC-augmented teaching model to teaching phonetic category
models, demonstrating how the framework could be used to contribute to an
active debate in linguistics: whether infant-directed speech is for teaching. The
results suggested that some of the unintuitive properties of IDS are consistent with
teaching although further work is needed to be directly applicable to IDS. We then
applied the teaching model to the far more complex problem of teaching nested
category models. Specifically, we outlined a framework for learning and teaching
scene categories from orientation spectrum data extracted from images. We found
that the optimal data for teaching these categories captured a more descriptive
picture of the nested category than the mean data. Specifically, the teaching data
seek to convey the ranges of the categories.
This work represents a first step toward a general framework for teaching
arbitrary concepts. In the future, we hope to extend the model to teach in richer
domains and under non-probabilistic learning frameworks by creating a symbiosis
between Bayesian and non-Bayesian methods such as artificial neural networks and
convex optimization.

Acknowledgments
This work was supported in part by NSF award DRL-1149116 to P.S.

Notes
1 In general, quadrature is a more precise, computationally efficient solution than
Monte Carlo integration in the situations in which it can be applied.
2 We refer those interested in reading more about this debate to Burnham,
Kitamura, & Vollmer-Conna (2002), de Boer & Kuhl (2003), Uther, Knoll,
& Burnham (2007), McMurray, Kovack-Lesh, Goodwin, & McEchron (2013),
and Cristia & Seidl (2013).
3 The term non-parametric is used to indicate that the number of parameters is
unknown (that we must infer the number of parameters), not that there are no
parameters.
4 For an overview of methods for controlling Monte Carlo variance, see Robert
and Casella (2013, Chapter 4).
5 When the target distribution is multi-modal, Markov chain samplers often
become stuck in a single mode. To mitigate this, it is common practice to
sample from multiple independent Markov chains.
6 We used the implementation of EM in the scikit-learn (Pedregosa, Varoquaux,
Gramfort, Michel, Thirion, Grisel, & Duchesnay, 2011) python module’s
DPGMM class.
7 The degrees of freedom of NIW cannot be less than the number of dimensions,
thus the lower bound on νk must be d.

References
Anderson, J. (1991). The adaptive nature of human categorization. Psychological Review,
98(3), 409.
Andrieu, C., & Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte
Carlo computations. Annals of Statistics, 37(2), 697–725. arXiv: 0903.5480.
Andrieu, C., & Vihola, M. (2012). Convergence properties of pseudo-marginal Markov
chain Monte Carlo algorithms. 25(2), 43. arXiv: 1210.1484.
Banterle, M., Grazian, C., & Robert, C. P. (2014). Accelerating Metropolis-Hastings
algorithms: Delayed acceptance with prefetching, 20. arXiv: 1406.2660.
Bardenet, R., Doucet, A., & Holmes, C. (2014). Towards scaling up Markov chain Monte
Carlo: An adaptive subsampling approach. Proceedings of the 31st International Conference on
Machine Learning, 4, 405–413.
Bonawitz, E., Shafto, P., Gweon, H., Goodman, N. D., Spelke, E., & Schulz, L. (2011).
The double-edged sword of pedagogy: Instruction limits spontaneous exploration and
discovery. Cognition, 120(3), 322–330.
Burnham, D., Kitamura, C., & Vollmer-Conna, U. (2002). What’s new, pussycat? On
talking to babies and animals. Science, 296(5572), 1435.
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical
Association, 90(432), 1313–1321.
Coppola, D. M., Purves, H. R., McCoy, A. N., & Purves, D. (1998). The distribution of
oriented contours in the real world. Proceedings of the National Academy of Sciences of the
United States of America, 95(7), 4002–4006.
Cristia, A., & Seidl, A. (2013). The hyperarticulation hypothesis of infant-directed speech.
Journal of Child Language, 41, 1–22.
de Boer, B., & Kuhl, P. K. (2003). Investigating the role of infant-directed speech with a
computer model. Acoustics Research Letters Online, 4(4), 129.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B
(Methodological), 39(1), 1–38.
Essock, E. A., DeFord, J. K., Hansen, B. C., & Sinai, M. J. (2003). Oblique stimuli are seen
best (not worst!) in naturalistic broad-band stimuli: A horizontal effect. Vision Research,
43(12), 1329–1335.
Essock, E. A., Haun, A. M., & Kim, Y. J. (2009). An anisotropy of orientation-tuned


suppression that matches the anisotropy of typical natural scenes. Journal of Vision, 9(1),
35.1–35.15.
Feldman, J. (1997). The structure of perceptual categories. Journal of Mathematical Psychology,
41(2), 145–170.
Feldman, N. H., Griffiths, T. L., Goldwater, S., & Morgan, J. L. (2013). A role for
the developing lexicon in phonetic category acquisition. Psychological Review, 120(4),
751–778.
Fiorentini, G., Planas, C., & Rossi, A. (2012). The marginal likelihood of dynamic mixture
models. Computational Statistics & Data Analysis, 56(9), 2650–2662.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian
restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6,
721–741.
Gergely, G., Egyed, K., & Kiraly, I. (2007). On pedagogy. Developmental Science, 10(1),
139–146.
Girshick, A. R., Landy, M. S., & Simoncelli, E. P. (2011). Cardinal rules: Visual orientation
perception reflects knowledge of environmental statistics. Nature Neuroscience, 14 (7),
926–932.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Scholkopf, B., & Smola, A. (2012). A kernel
two-sample test. Journal of Machine Learning Research, 13, 723–773.
Gretton, A., Fukumizu, K., Harchaoui, Z., & Sriperumbudur, B. K. (2009). A fast,
consistent kernel two-sample test. Advances in Neural Information Processing Systems, 22,
673–681.
Hansen, B. C., & Essock, E. A. (2004). A horizontal bias in human visual processing of
orientation and its correspondence to the structural components of natural scenes. Journal
of Vision, 4(12), 1044–1060.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing
Surveys (CSUR), 31(3), 264–323.
Kuhl, P. K., Andruski, J. E., Chistovich, I. A., Chistovich, L. A., Kozhevnikova, E. V.,
Ryskina, V. L., . . . Lacerda, F. (1997). Cross-language analysis of phonetic units in
language addressed to infants. Science, 277(5326), 684–686.
Luce, R. (1977). The choice axiom after twenty years. Journal of Mathematical Psychology,
15(3), 215–233.
Maclaurin, D., & Adams, R. P. (2014). Firefly Monte Carlo: Exact MCMC with subsets of
data. arXiv: 1403.5693.
Marin, J.-M., & Robert, C. P. (2008). Approximating the marginal likelihood using copula.
arXiv preprint arXiv: 0804.2414.
Markman, A. B., & Ross, B. H. (2003). Category use and category learning. Psychological
Bulletin, 129(4), 592–613.
McMurray, B., Kovack-Lesh, K., Goodwin, D., & McEchron, W. (2013). Infant directed
speech and the development of speech perception: Enhancing development or an
unintended consequence? Cognition, 129(2), 362–378.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953).
Equation of state calculations by fast computing machines. The Journal of Chemical Physics,
21(6), 1087–1092.
Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution. University of
British Columbia.
Neal, R. M. (1999). Erroneous results in “marginal likelihood from the Gibbs output”. University
of Toronto.
Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models.
Journal of Computational and Graphical Statistics, 9(2), 249–265.
Patterson, S., & Teh, Y. W. (2013). Stochastic gradient Riemannian Langevin dynamics on
the probability simplex. Advances in Neural Information Processing Systems, 1–10.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
. . . Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine
Learning Research, 12, 2825–2830.
Rasmussen, C. (2000). The infinite Gaussian mixture model. Advances in Neural Information
Processing, 11, 554–560.
Robert, C. P., & Casella, G. (2013). Monte Carlo statistical methods. New York: Springer
Science & Business Media.
Rufo, M., Martin, J., & Pérez, C. (2010). New approaches to compute Bayes factor in finite
mixture models. Computational Statistics & Data Analysis, 54(12), 3324–3335.
Schweinhart, A. M., & Essock, E. A. (2013). Structural content in paintings: Artists
overregularize oriented content of paintings relative to the typical natural scene bias.
Perception, 42(12), 1311–1332.
Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., & McCulloch,
R. E. (2013). Bayes and big data: The consensus Monte Carlo algorithm. International
Journal of Management Science and Engineering Management, 11(2), 78–88.
Shafto, P., & Goodman, N. D. (2008). Teaching games: Statistical sampling assumptions
for learning in pedagogical situations. In Proceedings of the 13th Annual Conference of the
Cognitive Science Society.
Shafto, P., Goodman, N. D., & Frank, M. C. (2012). Learning from others: The
consequences of psychological reasoning for human learning. Perspectives on Psychological
Science, 7(4), 341–351.
Shafto, P., Goodman, N. D., & Griffiths, T. L. (2014). A rational account of pedagogical
reasoning: Teaching by, and learning from, examples. Cognitive Psychology, 71C, 55–89.
Sherlock, C., Thiery, A., Roberts, G., & Rosenthal, J. (2013). On the efficiency
of pseudo-marginal random walk Metropolis algorithms. arXiv preprint, 43(1),
238–275. arXiv: 1309.7209v1.
Switkes, E., Mayer, M. J., & Sloan, J. A. (1978). Spatial frequency analysis of the visual
environment: Anisotropy and the carpentered environment hypothesis. Vision Research,
18(10), 1393–1399.
Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes.
Journal of the American Statistical Association, 101(476), 1566–1581.
Uther, M., Knoll, M., & Burnham, D. (2007). Do you speak E-NG-L-I-SH? A comparison
of foreigner- and infant-directed speech. Speech Communication, 49(1), 2–7.
Vallabha, G. K., McClelland, J. L., Pons, F., Werker, J. F., & Amano, S. (2007). Unsupervised
learning of vowel categories from infant-directed speech. Proceedings of the National
Academy of Sciences of the United States of America, 104(33), 13273–13278.
Wainwright, M. J. (1999). Visual adaptation as optimal information transmission. Vision
Research, 39(23), 3960–3974.
Yerebakan, H. Z., Rajwa, B., & Dundar, M. (2014). The infinite mixture of infinite
Gaussian mixtures. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q.
Weinberger (Eds) Advances in neural information processing systems 27 (pp. 28–36). Curran
Associates, Inc.
5
SOCIAL STRUCTURE RELATES TO
LINGUISTIC INFORMATION DENSITY
David W. Vinson and
Rick Dale

Abstract
Some recent theories of language see it as a complex and highly adaptive system, adjusting
to factors at various time scales. For example, at a longer time scale, language may
adapt to certain social or demographic variables of a linguistic community. At a shorter
time scale, patterns of language use may be adjusted by social structures in real time.
Until recently, datasets large enough to test how socio-cultural properties—spanning vast
amounts of time and space—influence language change have been difficult to obtain. The
emergence of digital computing and storage have brought about an unprecedented ability
to collect and classify massive amounts of data. By harnessing the power of Big Data we
can explore what socio-cultural properties influence language use. This chapter explores
how social-network structures, in general, contribute to differences in language use. We
analyzed over one million online business reviews using network analyses and information
theory to quantify social connectivity and language use. Results indicate that perhaps a
surprising proportion of variance in individual language use can be accounted for by subtle
differences in social-network structures, even after fairly aggressive covariates have been
added to regression models. The benefits of utilizing Big Data as a tool for testing classic
theories in cognitive science and as a method toward guiding future research are discussed.

Introduction
Language is a complex behavioral repertoire in a cognitively advanced species. The
sounds, words, and syntactic patterns of language vary quite widely across human
groups, who have developed different linguistic patterns over a long stretch of time
and physical separation (Sapir, 1921). Explanations for this variation derive from
two very different traditions. In the first, many language scientists have sought
to abstract away from this observed variability to discern core characteristics of
language, which are universal and perhaps genetically fixed across people (from
Chomsky, 1957 to Hauser, Chomsky, & Fitch, 2001). The second tradition
sees variability as the mark of an intrinsically adaptive system. For example,
Beckner et al. (2009) argue that language should be treated as being responsive to
socio-cultural change in real time. Instead of abstracting away apparently superficial
variability in languages, this variability may be an echo of pervasive adaptation,
from subtle modulation of real-time language use, to substantial linguistic change
over longer stretches of time. This second tradition places language in the broader
sphere of human behavior and cultural products in a time when environmental
constraints have well-known effects on many aspects of human behavior (see
Triandis, 1994 for review).1
Given these explanatory tendencies, theorists of language can have starkly
divergent ideas of it. An important next step in theoretical mitigation will be new
tools and broad data samples so that, perhaps at last, analyses can match theory in
extent and significance. Before the arrival of modern information technologies,
a sufficient linguistic corpus would have taken years, if not an entire lifetime,
to acquire. Indeed, some projects on the topic of linguistic diversity have this
property of impressive timescale and rigor. Some examples include the Philadelphia
Neighborhood Corpus, compiled by William Labov in the early 1970s, the
Ethnologue, first compiled by Richard Pittman dating back to the early 1950s,
and the World Atlas of Language Structures (WALS), a collection of data and
research from 55 authors on language structures available online, produced in 2008.
Digitally stored language, and to a great extent accessible for analysis, has begun to
exceed several exabytes, generated every day online (Kudyba & Kwatinetz, 2014).2
One way this profound new capability can be harnessed is by recasting current
theoretical foundations, generalized from earlier small-scale laboratory studies, into
a Big Data framework.
If language is pervasively adaptive, and is thus shaped by socio-cultural
constraints, then this influence must be acting somehow in the present day, in
real-time language use. Broader linguistic diversity and its socio-cultural factors
reflect a culmination of many smaller, local changes in the incremental choices of
language users. These local changes would likely be quite small, and not easily
discerned by simple observation, and certainly not without massive amounts of
data. In this chapter, we use a large source of language data, Yelp, Inc. business
reviews, to test whether social-network structures relate in systematic ways to the
language used in these reviews. We frame social-network variables in terms of
well-known network measures, such as centrality and transitivity (Bullmore &
Sporns, 2009), and relate these measures to language measures derived from
information theory, such as information density and uniformity (Aylett, 1999;
Jaeger, 2010; Levy & Jaeger, 2007). In general, we find subtle but detectable
relationships between these two groups of variables. In what follows, we first
motivate the broad theoretical framing of our Big Data question: What shapes
linguistic diversity and language change in the broad historical context? Following
this we describe information theory and its use in quantifying language use. Then,
we explain how social structure may influence language structure. We consider this
a first step in understanding how theories in cognitive and computational social
science can be used to harness the power of Big Data in important and meaningful
ways (see Griffiths, 2015).

What Shapes Language?


As described above, language can be cast as a highly adaptive behavioral property.
If so, we would probably look to social, cultural, or even ecological aspects of the
environment to understand how it changes (Nettle, 1998; Nichols, 1992; Trudgill,
1989, 2011). Many studies, most over the past decade, suggest languages are
dynamically constrained by a diverse range of environmental factors. Differences
in the spread and density of language use (Lupyan & Dale, 2010), the ability of
its users (Bentz et al., submitted; Christiansen & Chater, 2008; Dale & Lupyan,
2012; Ramscar, 2013; Wray & Grace, 2007) and its physical environment (Everett,
2013; Nettle, 1998) impact how a language is shaped online (Labov, 1972a, 1972b)
and over time (Nowak, Komarova & Niyogi, 2002). These factors determine
whether certain aspects of a language will persist or die (Abrams & Strogatz, 2003),
simplify, or remain complex (Lieberman, Michel, Jackson, Tang, & Nowak, 2007).
Language change is also rapid, accelerating at a rate closer to that of the spread of
agriculture (Gray & Atkinson, 2003; cf. Chater, Reali, & Christiansen, 2009) than
genetics. Using data recently made available from WALS and a recent version of
the Ethnologue (Gordon, 2005), Lupyan and Dale (2010) found larger populations
of speakers, spread over a wider geographical space, use less inflection and more
lexical devices. This difference may be due to differences in communicating within
smaller, “esoteric” niches and larger, “exoteric” niches (also see Wray & Grace,
2007), such as the ability of a language’s speakers (Bentz & Winter, 2012; Dale &
Lupyan, 2012; Lupyan & Dale, 2010) or one’s exposure to a growing vocabulary
(Reali, Chater & Christiansen, 2014).
Further evidence of socio-cultural effects may be present in real-time language
usage. This is a goal of the current chapter: Can we detect these population-level
effects in a large database of language use? Before describing our study, we
describe two key motivations of our proposed analyses: The useful application
of (1) information theory in quantifying language use and (2) network theory in
quantifying social structure.

Information and Adaptation


Information theory (Shannon, 1948) defines the second-order information of a word
as the negative log probability of a word occurring after some other word:
I(w_i) = -\log_2 p(w_i \mid w_{i-1})
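
To make this quantity concrete, the following R sketch (our own illustration, not code from the study; all counts are hypothetical) computes the second-order information, in bits, carried by each word of a short phrase given corpus-wide unigram and bigram counts.

# A minimal sketch; the word and bigram counts below are hypothetical.
words <- c("the", "food", "was", "amazing")
unigram_count <- c(the = 1000, food = 120, was = 800, amazing = 15)
bigram_count  <- c("the food" = 40, "food was" = 30, "was amazing" = 5)
second_order_info <- sapply(2:length(words), function(i) {
  p <- bigram_count[paste(words[i - 1], words[i])] / unigram_count[words[i - 1]]
  -log2(p)                      # bits carried by word i given the preceding word
})
second_order_info               # rarer continuations carry more bits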

The theory of uniform information density (UID; Levy & Jaeger 2007; Jaeger,
2010) states that speakers will aim to present the highest amount of information
across a message at a uniform rate, so as to efficiently communicate the most
content without violating a comprehender’s channel capacity. Support for this
theory comes from Aylett (1999), in an early expression of this account, who
found that speech is slower when a message is informationally dense and Jaeger
(2010), who found information-dense messages are more susceptible to optional
word injections, diluting its overall density over time. Indeed, even word length
may be adapted for its informational qualities (Piantadosi, Tily, & Gibson, 2011).
In a recent paper, we investigated how a simple contextual influence, the
intended valence of a message, influences information density and uniformity.
While it is obvious that positive and negative emotions influence what words
individuals use (see Vinson & Dale, 2014a for review), it is less obvious that the
probability structure of language use is also influenced by one’s intentions. Using a
corpus of over 200,000 online customer business reviews from Yelp, Inc., findings
suggest that the information density of a message increases as the valence of that
message becomes more extreme (positive or negative). It also becomes less uniform
(more variable) as message valence becomes more positive (Vinson & Dale, 2014b).
The results are commensurate with theories that suggest language use adapts to a
variety of socio-cultural factors in real time.
In this chapter, we look to information-theoretic measures of these kinds to
quantify aspects of language use, with the expectation that they will also relate in
interesting ways to social structure.

Social-Network Structure
Another key motivation of our proposed analyses involves the use of network
theory to quantify the intricate structural properties that connect a community
of speakers (Christakis & Fowler, 2009; Lazer et al., 2009). Understanding
how specific socio-cultural properties influence language can provide insight
into the behavior of the language user herself (Baronchelli, Ferrer-i-Cancho,
Pastor-Satorras, Chater, & Christiansen, 2013). For instance, Kramer, Guillory,
and Hancock (2014), who analyzed over 600,000 Facebook users, reported that
when a user’s newsfeed was manipulated to show only those posts that were either
positive or negative, a reader’s own posts aligned with the emotional valence of
their friends’ messages. Understanding what a language looks like when under
certain socio-cultural pressures can provide valuable insight into the societal
pressures that help shape a language. Indeed, global changes to one's socio-cultural
context, such as changes in the classification of the severity of crime and punishment
over time, are marked by linguistic change (Klingenstein, Hitchcock, & DeDeo,
2014) while differences in the distance between socio-cultural niches are marked
by differences in language use (Vilhena et al., 2014).

Current Study
In the current study, we utilize the Yelp database as an arena to test how
population-level differences might relate to language use. While previous work
suggests online business reviews may provide insight into the psychological states
of its individual reviewers (Jurafsky, Chahuneau, Routledge, & Smith, 2014), we
expect that structural differences in one’s social community as a whole, where
language is crucial to conveying ideas, will affect language use. We focus on
how a language user’s social niche influences the amount and rate of information
transferred across a message. Agent-based simulations (Chater et al., 2006; Dale &
Lupyan, 2012; Reali et al., 2014) and recent studies on the influences of interaction
in social networks (Bond et al., 2012; Choi, Blumen, Congleton, & Rajaram,
2014) indicate that the structure of language use may be influenced by structural
aspects of a language user’s social interactions. From an exploratory standpoint, we
aim to determine if one’s social-network structure predicts the probability structure
of language use.

Method
Corpus
We used the Yelp Challenge Dataset (www.yelp.com/dataset.challenge), which,
at the time of this analysis, contained reviews from businesses in Phoenix, Las
Vegas, Madison, and Edinburgh. This includes 1,125,458 reviews from 252,898
users who reviewed businesses in these cities. The field entries for reviews include
almost all the information that is supplied on the Yelp website itself, including the
content of the review, whether the review was useful or funny, the star rating that
was conferred upon the business, and so on. It omits a user’s public username, but
includes an array of other useful information, in particular a list of user ID codes
that point to friends of a given user. Yelp users are free to ask any other Yelp user to
be their friend. Friendship connections are driven by users’ mutual agreement to
become friends. These user ID codes allow us to iteratively build social networks
by randomly choosing a user, and expanding the network by connecting friends
and friends of friends, which we further detail below.

Linguistic Measures
The first and simplest measure we explore in our analysis is the number of words in a
given review, its word length. This surface measure is used as a basic but important
covariate for regression analyses. Word length will define the bin count for entropy
and other information analyses, and so directly impacts these measures.
The second measure we use is the reviewer-internal entropy (RI-Ent) of a
reviewer’s word use. This marks the discrete Shannon entropy of a reviewer’s overall
word distribution. If reviewers use many different words, entropy would be high. If
a reviewer reuses a smaller subset of words, the entropy of word distribution would
be low, as this would represent a less uniform distribution over word types.
A third measure is the average information encoded in the reviewer’s word
use, which we’ll call average unigram information (AUI). Information, as described
above, is a measure of the number of bits a word encodes given its frequency in
the overall corpus. Reviews with higher information use less frequent words, thus
offering more specific and less common verbiage in describing a business.
A fourth measure is one more commonly used in studies of informational
structure of language, which we’ll call the average conditional information (ACI). This
is a bit-based measure of a word based on its probability conditioned on the prior
word in the text. In other words, it is a measure of the bits encoded in a given
bigram of the text. We compute the average bits across bigrams of a review, which
reflect the uniqueness in word combinations.3
Finally, we extract two crude measures of information variability by calculating
the standard deviation over AUI and ACI, which we call unigram informational
variability (UIV) and conditional informational variability (CIV), respectively. Both
measures are a reflection of how stable the distribution is over a reviewer’s
average unigram and bigram bit values. These measures relate directly to uniform
information density (see Jaeger, 2010; Levy & Jaeger, 2007). A very uniform
distribution of information is represented by a stable mean and lower UIV/CIV;
a review with unigram or bigram combinations that span a wide range of
informativeness induces a wider range of bit values, and thus a higher UIV/CIV
(less uniform density). A summary of these measures appears in Table 5.1.

TABLE 5.1 Summary of information-theoretic measures quantifying language in reviews.

Measure   Description                              Definition
RI-Ent    Reviewer-internal entropy                RI-Ent_j = -\sum_{i=1}^{N} \log_2 p(w_i \mid R_j)
AUI       Average unigram information              AUI_j = -\frac{1}{N} \sum_{i=1}^{N} \log_2 p(w_i)
ACI       Average conditional information          ACI_j = -\frac{1}{N-1} \sum_{i=1}^{N} \log_2 p(w_i \mid w_{i-1})
UIV       Unigram informational variability        UIV_j = \sigma(UI_j)
CIV       Conditional informational variability    CIV_j = \sigma(CI_j)

N = number of words in a review; p(w) = probability of word w; w_i = ith word of a
review; UI_j = set of unigram information scores for each word of a given review; CI_j =
set of conditional information scores for each word of a given review.
Punctuation, stop words, and spacing were removed using the tm package in R
before information-theoretic measures were obtained.4
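
As a rough illustration of how these measures can be computed, the R sketch below (our own toy example, not the authors' pipeline) cleans a short text with the tm package and derives RI-Ent, AUI, ACI, UIV, and CIV for a single reviewer. The corpus-wide probabilities are hypothetical, and RI-Ent is computed here as the Shannon entropy of the reviewer's word distribution, following the description in the text.

library(tm)
# Rough cleaning in the spirit described above (a sketch, not the authors' exact steps).
txt    <- "Great tacos, great service!"
txt    <- stripWhitespace(removeWords(removePunctuation(tolower(txt)), stopwords("en")))
tokens <- strsplit(trimws(txt), " ")[[1]]                        # "great" "tacos" "great" "service"

# Hypothetical corpus-wide unigram probabilities and conditional bigram probabilities p(w_i | w_{i-1}).
p_uni <- c(great = 0.010, tacos = 0.002, service = 0.005)
p_bi  <- c("great tacos" = 0.0004, "tacos great" = 0.0001, "great service" = 0.0008)

ui <- -log2(p_uni[tokens])                                       # unigram information per word
ci <- -log2(p_bi[paste(head(tokens, -1), tail(tokens, -1))])     # conditional (bigram) information

freq   <- table(tokens) / length(tokens)                         # reviewer-internal word distribution
ri_ent <- -sum(freq * log2(freq))                                # RI-Ent
aui <- mean(ui);  uiv <- sd(ui)                                  # AUI and UIV
aci <- mean(ci);  civ <- sd(ci)                                  # ACI and CIV
c(RI_Ent = ri_ent, AUI = aui, ACI = aci, UIV = uiv, CIV = civ)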

Social Networks
One benefit of the Big Data approach we take in this chapter is that we can pose
our questions about language and social structure using the targeted unit of analysis
of social networks themselves. In other words, we can sample networks from the Yelp
dataset directly, with each network exhibiting network scores that reflect a certain
aspect of local social structure. We can then explore relationships between the
information-theoretic measures and these social-network scores.
We sampled 962 unique social networks from the Yelp dataset, which amounted
to approximately 38,000 unique users and 450,000 unique reviews. Users represent
nodes in social networks and were selected using a selection and connection
algorithm also shown in Figure 5.1.

FIGURE 5.1 The "0"-degree (seed) node was chosen at random among users with between 11 and
20 friends. We then connected these individuals. Then, from ten randomly chosen friends of the
seed node, we chose up to ten friends of friends and connected them. Following this, we
interconnected the whole set. The three panels illustrate the steps: select, select, connect. Note
that this is a simplified visualization of the process, as friends and friends of friends were chosen
up to a count of ten, which produces much larger networks (visualized in examples below).

We start by choosing a random user who has
between 11 and 20 friends in the dataset (we chose this range to obtain networks
which were not too small or too large as to be computationally cumbersome). After
we chose that user, we connected all his or her friends and then expanded the social
network by one degree; randomly selecting ten of his or her friends and connecting
up to ten of his or her friend’s friends to the network. We then interconnected all
users in this set (shown as the first-degree nodes and connections in Figure 5.1).
We conducted this same process of finding friends of these first-degree nodes, and
then interconnected those new nodes of the second degree. In order to make sure
networks did not become too large, we randomly sampled up to ten friends of
each node only. Fifty percent of all networks fell between 89 and 108 reviewers
in size, and the resulting nets reveal a relatively normal distribution of network
metrics described in the next section.
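
A simplified version of this sampling procedure might look like the R sketch below, written with the igraph package. Here g stands for the full Yelp friendship graph, and the function name and the exact expansion rules are our own simplification of the procedure described above, not the original analysis code.

library(igraph)
# Sketch of the seed-and-expand sampling; `g` is assumed to be the full friendship graph.
sample_local_network <- function(g, max_per_node = 10) {
  deg  <- degree(g)
  seed <- sample(which(deg >= 11 & deg <= 20), 1)         # seed user with 11-20 friends
  take <- function(v) {                                   # up to ten neighbours of v
    nb <- as.integer(neighbors(g, v))
    if (length(nb) > max_per_node) nb <- sample(nb, max_per_node)
    nb
  }
  first  <- take(seed)                                    # friends of the seed
  second <- unique(unlist(lapply(first, take)))           # friends of friends
  induced_subgraph(g, unique(c(seed, first, second)))     # keep all ties among the sampled set
}

The final induced_subgraph() call retains every friendship tie among the sampled users, which corresponds to the interconnection ("connect") steps in Figure 5.1.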

Network Measures
A variety of different network measures were used to quantify the structure of each
network. We consider two different categories of network structures: simple and
complex. A summary of all seven (two simple, five complex) network measures
appears in Table 5.2. We used two simple network measures: The number of
reviewers in a network, or nodes, and the number of friendship connections
between reviewers, or edges.
We considered five complex network measures. The first measure, network
degree, is the ratio of edges to nodes. This provides a measure of connectivity across
a network. High-degree networks have a higher edge-to-node ratio than lower
degree networks.
The second measure, network transitivity, determines the probability that
two adjacent nodes are themselves connected (sometimes termed the “clustering
coefficient”). Groups of three nodes, or triples, can either be closed (e.g. fully
connected) or open (e.g. two of the three nodes are not connected). The
ratio of closed triples to total triples provides a measure of the probability that
adjacent nodes are themselves connected. High transitivity occurs when the ratio
of closed-to-open triples is close to one.
A third measure, network betweenness, determines the average number of
shortest paths in a network that pass through some other node. The shortest path
of two nodes is the one that connects both nodes with the fewest edges. A pair of
nodes can have many shortest paths. A node’s betweenness value is the sum of the
ratio of a pair of node’s shortest paths that pass through the node, over the total
number of shortest paths in the network. We determined network betweenness by
taking the average node betweenness for all nodes in the network. Effectively, this
provides a measure of network efficiency. The higher a network’s betweenness, the
faster some bit of new information can travel throughout the network.
TABLE 5.2 Summary of the measures quantifying a network's structure.

Nodes. Number of individuals in the network.

Edges. Number of node-to-node connections (links) in a network.

Degree. Definition: Edges / Nodes. Description: the ratio of connections to nodes in a network.

Transitivity/Clustering Coefficient. Definition: N_{closed triples} / N_{triples}. Description: the average number of completely connected triples given the total number of triples in a network.

Betweenness. Definition: (1/N) \sum_{V} \sum_{s \neq t \neq V} SP_{st}(V) / SP_{st}. Description: SP_{st} is the total number of shortest paths from node s to node t, and SP_{st}(V) is the number of shortest paths from s to t that pass through node V. The sum over all such pairs determines the betweenness of node V; we take the average betweenness over all N nodes in the network.

Centrality. Definition: C_z = \sum_{i=1}^{N} |C_x(n^*) - C_x(n_i)| / \max \sum_{i=1}^{N} |C_x(n^*) - C_x(n_i)|. Description: graph-level centrality is the sum of the absolute differences between the observed maximum central node C_x(n^*) and all other node centrality measures C_x(n_i), divided by the theoretical maximum centrality of a network with the same number of nodes. Because it is the ratio of actual to maximum possible centrality, graph-level centrality always falls between 0 (low centrality) and 1 (high centrality).

Scale Free. Definition: f(x) = x^{-\alpha}. Description: \alpha is the exponent characterizing the power-law fit to the degree distribution x; \alpha is always greater than 1 and typically falls within the range 2 < \alpha < 3, but not always.

A fourth measure stems from node centrality, which determines the number of
connections a single node has with other nodes. Centrality can also be determined
for the whole network, known as graph centrality. Graph centrality is the ratio of
the sum of the absolute value of the centrality of each node, over the maximum
possible centrality of each node (Freeman, 1979). Node centrality is greatest when
a single node is connected to all other nodes, whereas graph centrality is greatest
when all nodes are connected to all other nodes. Information is thought to travel
faster in high-centrality networks. Here we use graph centrality only. From this
point on we will refer to graph centrality simply as centrality. Network betweenness
and network centrality share common theoretical assumptions, but quantify different
structural properties of a network.
Our fifth and final measure determines whether the connections between nodes
in a network share connectivity at both local and global scales. Scale-free networks
display connectivity at all scales, local and global, simultaneously (Dodds, Watts, &
Sabel, 2003). A network is scale free when its degree distribution (i.e., the number
of edge connections per node) fits a power law distribution. Networks that are
less scale free are typically dominated by either a local (a tightly connected set
of nodes) or global connectivity (randomly connected nodes). Networks that
exemplify differences in complex structures are presented in Figure 5.2.
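
The sketch below shows how these seven measures could be computed for a sampled network with igraph; the degree-based centralization and the power-law fit are standard igraph routines that stand in for, rather than reproduce, the authors' exact calculations.

library(igraph)
network_measures <- function(net) {
  c(nodes        = vcount(net),
    edges        = ecount(net),
    degree       = ecount(net) / vcount(net),            # edge-to-node ratio
    transitivity = transitivity(net, type = "global"),   # closed triples / all triples
    betweenness  = mean(betweenness(net)),               # average node betweenness
    centrality   = centr_degree(net)$centralization,     # Freeman-style graph centralization
    alpha        = fit_power_law(degree(net))$alpha)     # scale-free exponent of the degree distribution
}
# e.g. network_measures(net) returns one row of metrics for a sampled network `net`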

Additional Measures
Individual reviews were not quantified independently. Instead, all reviews from
a single individual were concatenated into one document. This allowed for
information-theoretic measures to be performed over a single user’s total set of
reviews. The average information of a network was then computed by taking
the average across all individuals (nodes) in the network. Such an analysis affords
testing how the structure of an individual’s social network impacts that individual’s
overall language use.

FIGURE 5.2 Example Yelp networks with low and high values of each structural property: degree,
transitivity, betweenness, centrality, and scale free.

However, due to the nature of how our information-theoretic
measures were determined, individuals who wrote well over one hundred reviews
were treated the same as those who wrote merely one. This introduces a
possible bias, since information measures are typically underestimated when estimated
from finite samples (as is the case for our measures). While we
control for certain measures such as the average reviewer’s total review length and
network size, additional biases may occur due to the nature of how each measure
was determined (e.g. averaging across reviewers with unequal length reviews).
To address these concerns, two additional measures were used to assess the reliability
of our analyses: (1) a Gini coefficient and (2) a random review baseline. They
are described below.
Gini Coefficient. The Gini coefficient (range = [0,1]) was originally developed
to assess the distribution of wealth across a nation. As the coefficient approaches
zero, wealth is thought to approach greater equality. As the coefficient approaches
one, more of the nation’s wealth is thought to be shared among only a handful
of its residents. We use the Gini coefficient to assess the distribution of reviews
across a network. Since each node's reviews were concatenated, yielding only one
value for each information-theoretic measure, certain reviewers' measures will be
more representative of the linguistic distributions. The Gini coefficient provides
an additional test as to whether a network's average information is influenced by the
network's distribution of reviews.
Random Review Baseline. A random review baseline provides a reference value
against which to compare coefficient values from true networks. Baseline information measures
were computed by randomly sampling (without replacement) the same number of
reviews written by each reviewer. For example, if a reviewer writes five reviews,
then five random reviews are selected to take their place. These five reviews
are then deleted from the pool of total reviews used throughout all networks.
This ensures that the exact same reviews are used in baseline reviews. We did
not go so far as to scramble tokens within reviews. While this would provide
a sufficient baseline, obtaining the true information-theoretic measures of each
review, without token substitution, provides a more conservative measure. The
random review baseline was used to compare all significant true network effects as
an additional measure of reliability.
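
The resampling step can be sketched as follows; review_pool and n_reviews_per_user are hypothetical stand-ins for the pool of all reviews and each reviewer's review count, not objects from the original analysis.

review_pool <- paste("review", 1:100)               # hypothetical pool of review texts
n_reviews_per_user <- c(userA = 3, userB = 5)       # hypothetical per-reviewer counts

baseline_reviews <- list()
for (user in names(n_reviews_per_user)) {
  k      <- n_reviews_per_user[[user]]
  picked <- sample(seq_along(review_pool), k)       # draw the same number of replacement reviews
  baseline_reviews[[user]] <- review_pool[picked]
  review_pool <- review_pool[-picked]               # without replacement: remove them from the pool
}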

Broad Expectations and Some Predictions


This chapter presents a broad thesis, and we test it using the Yelp data:
Social-network structure will relate in interesting ways to patterns of language
use. Although this is a broad expectation, it is not a specific prediction, and
we do not wish to take a strong stance on specific hypotheses here, giving the
reader the impression that we had conceived, in advance, the diverse pattern of
results that we present below. Instead, Big Data is providing rich territory for
new exploration. The benefit of a Big Data approach is to identify interesting
and promising new patterns or relationships and, we hope, encourage further
exploration. As we show below, even after controlling for a variety of collinearities
among these measures, the broad thesis holds. Regression models strongly relate
social-network and information-theoretic measures. Some of the models show
proportion variance accounted for at above 50 percent. Despite this broad
exploratory strategy, a number of potential predictions naturally follow from existing
discussion of information transmission in networks. We consider several of these here
before moving on to the results.
One possibility is that more scale-free networks enhance information trans-
mission, and thus would concomitantly increase channel capacity of nodes in the
network. One might suppose that in the Yelp context, a more local scale-free
structure might have a similar effect: An efficient spread of information may be
indicated by a wide diversity of information measures, as expressed in the UIV and
CIV measures.
A second prediction—not mutually exclusive from the first—drawing from the
work on conceptual entrainment and imitation in psycholinguistics, is that densely
connected nets may induce the least AUI. If nodes in a tightly connected network
infect each other with common vocabulary, then this would reduce the local
information content of these words, rendering them less unique and thus yielding the
lowest entropy, AUI, and so on. One may expect something similar for transitivity,
which would reflect the intensity of local mutual interconnections (closed triples).
However, the reverse is also possible. If language users are more densely
connected it may be more likely that they have established a richer common
ground overall. If so, language use may contain more information-dense words
specific to a shared context (Doyle & Frank, submitted). A fourth prediction is
that more network connectivity over a smaller group (higher network degree) may
afford more complex language use, and so lead to higher AUI and ACI.
A final prediction comes from the use of information theory to measure the rate
of information transmission. When a speaker’s message is more information-dense,
it is more likely that it will also be more uniform. Previous findings show speakers
increase their speech rate when presenting low information-dense messages, but
slow their speech rate for information-dense messages (Pellegrino, Coupé, &
Marsico, 2011). It may be that any social structure that leads to increases in
information density simultaneously decreases information variability.

Results
Simple Measures
The confidence intervals (99.9 percent CI) of five multiple regression models,
where nodes, edges, and the Gini coefficient were used to predict each
information-theoretic measure, are presented in Table 5.3. We set a conservative
criterion for significance ( p < 0.001) for all analyses. Only those analyses that were
significant are presented. Crucially, all significant effects of independent variables
were individually compared to their effects on the random review baseline. To
do this, we treated the random review baseline and true network reviews as two
levels of the same variable: “true_baseline.” Using linear regression we added an
interaction term between independent network variables and the true_baseline
variable. A significant interaction is demarcated by “†” in Tables 5.3 and 5.4.
The effects of these network variables on information-theoretic measures are
significantly different in true networks compared to baseline networks. This helps
to ensure that our findings are not simply an artifact of our methodology.
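
In model form, this amounts to something like the following sketch, with hypothetical data standing in for the per-network measures.

# Stack true and baseline networks and test whether the edges effect differs between them.
d <- data.frame(ACI_resid     = rnorm(200),                      # hypothetical values
                edges         = rnorm(200),
                true_baseline = rep(c("true", "baseline"), each = 100))
m <- lm(ACI_resid ~ edges * true_baseline, data = d)             # main effects plus interaction
summary(m)                                                       # the interaction term is the test of interest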
All variables were standardized (scaled and shifted to have M = 0 and SD
= 1). Additionally, the number of words (length) was log transformed due to a
heavy-tailed distribution. All other variables were normally distributed. Because
length correlates with all information-theoretic measures and UIV and CIV
correlate with the mean of AUI and ACI, respectively (due to the presence of
a true zero), the mean of each information measure was first predicted by length
while UIV and CIV were also predicted by AUI and ACI. The residual variability
of these linear regression models was then predicted by nodes, edges, and the
Gini coefficient. The purpose of residualization is to further ensure that observed
interactions are not due to trivial collinearity between simpler variables (length,
nodes) and ones that may be subtler and more interesting (CIV, centrality, etc.).5

TABLE 5.3 Lexical measures as predicted by nodes, edges, and Gini coefficient.

                  Nodes               Edges              Gini coef           F-statistic
Length            n.s.                (0.09, 0.27)       (0.30, 0.43)        F(3, 958) = 125; R² = 0.28, R²adj = 0.28
RI-Ent_residual   n.s.                (0.10, 0.15)†      (−0.16, −0.12)†     F(3, 958) = 446.6; R² = 0.58, R²adj = 0.58
AUI_residual      (−0.05, −0.01)†     (0.10, 0.12)†      (−0.12, −0.09)†     F(3, 958) = 391; R² = 0.55, R²adj = 0.55
ACI_residual      n.s.                (0.07, 0.12)       (−0.04, −0.01)      F(3, 958) = 154.1; R² = 0.33, R²adj = 0.32
UIV_residual      (−0.01, −0.001)     (0.001, 0.01)      (0.004, 0.01)       F(3, 958) = 39.58; R² = 0.11, R²adj = 0.11
CIV_residual      n.s.                (−0.003, 0)        (0.002, 0.004)      F(3, 958) = 27.99; R² = 0.08, R²adj = 0.08

Only the mean and 99.9 percent confidence intervals for each IV with p < 0.001 are presented. The "†"
symbol denotes all network effects that were significantly different from baseline network effects (p < .001).
Multiple linear regressions were performed in R: lm(DV∼Nodes+Edges+Gini).
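
The residualization-then-regression pipeline can be sketched as below, using a hypothetical data frame of per-network measures (one row per sampled network); this is our own illustration of the logic, not the original analysis script.

set.seed(1)
nets <- data.frame(length = exp(rnorm(962, 8)),                  # hypothetical per-network measures
                   AUI    = rnorm(962, 10),
                   nodes  = rpois(962, 100),
                   edges  = rpois(962, 400),
                   gini   = runif(962))
nets$length <- log(nets$length)                                  # log-transform review length
nets <- as.data.frame(scale(nets))                               # standardize all variables (M = 0, SD = 1)
nets$AUI_resid <- resid(lm(AUI ~ length, data = nets))           # partial review length out of AUI
summary(lm(AUI_resid ~ nodes + edges + gini, data = nets))       # Table 5.3-style model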
The number of nodes provides a crude measure of network size, edges, network
density, and the Gini coefficient (the distribution of reviews across the network).
Importantly, no correlation exists between the Gini coefficient with either edges
or nodes. And, although a strong correlation exists between nodes and edges (r =
0.67, t (960) = 28.23, p < 0.0001), in only two instances, AUI and UIV, did
nodes account for some portion of variance. As nodes increased, average unigram
information, along with average unigram variability, decreased. However, only
the relationship between nodes and average unigram information was significantly
different between the true network and the random review baseline. In all cases,
save conditional information variability (CIV), a significant proportion of variance
in information measures was accounted for by edges, and in all but length and CIV,
the relationship between information measures and edges was significantly different
between the true network and the random review baseline (Figure 5.3(a) presents
an example interaction plot between ACI, edges and true_baseline measures).
Finally, the Gini coefficient accounted for a significant portion of variance for
all information measures, but only for RI-Ent and AUI did it have a significantly
different relationship between the true network and the random review baseline.
One explanation may be that more unique language use may naturally occur
when more individuals contribute more evenly to the conversation. Another
possibility is that networks with less even review distributions are more likely to
have more reviews, suggesting that a larger number of reviewers' language use is
more representative of the overall linguistic distribution of reviews. A simple linear
regression analysis reveals the Gini coefficient accounts for a significant portion of
variance in the total number of reviews in each network (R²adj = 0.21, F[1,960]
= 249.6, p < 0.001, 99.9 percent CI [0.25, 0.38]), increasing as the number of
reviews increases.

TABLE 5.4 Information-theoretic measures predicted by complex network measures.

                  Degree           Transitivity      Betweenness       Centrality        Scale Free: α    F-statistic
Length            n.s.             (0.20, 0.42)      (−0.28, −0.08)    (0.01, 0.16)      n.s.             F(5, 956) = 33.3; R² = 0.15, R²adj = 0.14
RI-Ent_residual   (0.04, 0.13)†    (−0.11, −0.03)    n.s.              n.s.              n.s.             F(5, 956) = 21.47; R² = 0.10, R²adj = 0.10
AUI_residual      (0.02, 0.09)†    n.s.              n.s.              (0.004, 0.05)†    n.s.             F(5, 956) = 10.36; R² = 0.05, R²adj = 0.05
ACI_residual      (0.01, 0.07)†    (−0.07, −0.01)†   n.s.              n.s.              n.s.             F(5, 956) = 11.38; R² = 0.06, R²adj = 0.05
UIV_residual      n.s.             n.s.              n.s.              n.s.              n.s.             n.s.
CIV_residual      n.s.             n.s.              n.s.              n.s.              n.s.             n.s.

Only the mean and 99.9 percent confidence intervals for each IV with p < 0.001 are presented. All reported
values were significant (p < 0.001). The "†" symbol denotes all network effects significantly different from
baseline network effects.

FIGURE 5.3 Residual average conditional information (ACI) plotted against edges (panel (a)), residual
degree (panel (b)), and residual transitivity (panel (c)) for true and baseline networks. All plots show
significant interactions for variables in true networks compared to baseline networks. Linear regression
models with interaction terms were used in R: lm(DV∼IV+true_baseline+IV*true_baseline).

We interpret these results cautiously considering it is a first step toward
understanding what aspects of a network relate to language use. The results
suggest changes in population size and connectivity occur alongside changes in
the structure of language use. Speculatively, the individual language user may be
influenced by the size and connectivity of her network. When the size of his or
her network increases, the words he or she uses may be more frequent. However,
when connectivity increases, the words he or she uses may be of low frequency,
and therefore more information dense. This supports current work that shows how
a shared common ground may lead to an increase in information-dense word use
(Doyle & Frank, submitted). This is further explored in the discussion.
Although we find significant effects, how network size and connectivity
influence information density and channel capacity, and how different ways of
interpreting information (as we have done here) interact with simple network
measures is unclear. Generally, these results suggest that word choice may relate
to social-network parameters.

Complex Measures
The complex network measures, centrality, degree and scale free, were log transformed
for all analyses due to heavy-tailed distributions. Given the larger number of
network variables and their use of similar network properties such as number
of nodes or edges, it is possible that some complex network measures will be
correlated. To avoid any variance inflation that may occur while using multiple
predictors, we determined what variables were collinear using a variance inflation
factor (vif) function in R. We first used nodes and edges to predict the variance
of each complex network measure. We factored this variance out by taking the residuals of
each model and then used the vif function from the R library car to determine
which complex network measures exhibited collinearity. Using a conservative VIF
threshold of five or less (see Craney & Surles, 2002; Stine, 1995 for review) we
determined that no complex network measure used in our model was at risk of
any collinearity that would have seriously inflated the variance.6 All VIF scores
were under the conservative threshold for all complex network variables and are
therefore not reported. Residuals of complex network measures, having factored
out any variance accounted for by nodes and edges, were used to predict each
information-theoretic measure presented in Table 5.4.
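
A sketch of this collinearity screen, using hypothetical data and the vif function from the car package, is shown below; the column names and values are our own stand-ins.

library(car)
set.seed(2)
nets <- data.frame(nodes = rpois(962, 100), edges = rpois(962, 400),     # hypothetical measures
                   degree = runif(962, 1, 8), transitivity = runif(962),
                   betweenness = runif(962, 0, 500), centrality = runif(962),
                   scale_free = runif(962, 1.5, 3.5), AUI_resid = rnorm(962))
complex <- c("degree", "transitivity", "betweenness", "centrality", "scale_free")
nets[complex] <- lapply(nets[complex],                                   # factor nodes and edges out
                        function(x) resid(lm(x ~ nets$nodes + nets$edges)))
m <- lm(AUI_resid ~ degree + transitivity + betweenness + centrality + scale_free, data = nets)
vif(m)   # values under 5 suggest collinearity is not seriously inflating variance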
One or more complex measures accounted for a significant proportion of
variance in each information density measure. Certain trends are readily observed
across these models. Specifically, word length increased as network transitivity and
centrality increased and decreased as network betweenness increased; however, no
network measure effects were significantly different from random review baseline
effects (significance marked by the "†" symbol in Table 5.4). Additionally, RI-Ent,
AUI, and ACI increased as network degree increased, accounting for ∼5–10
percent of the variance in each measure. The relationship between network
degree and corresponding information measures in true networks was significantly
different from baseline. This was also the case for network centrality for both
AUI and network transitivity for ACI. Figure 5.3 presents interaction plots for
residual ACI by degree (b) and residual transitivity (c) between true and random
review baseline networks. Complex network measures did not share a significant
relationship with UIV or CIV.
It is clear that certain network structures predict differences in information
density measures even after stringent controls were applied to both information
and network measures. Specifically, support for higher information-dense messages
may be the result of networks that exhibit high global connectivity driven by
increases in specific network properties, namely, network degree and centrality.
This further supports previous work showing that a shared common ground may
bring about higher information-dense language use. Specifically, networks that
exhibit a more centralized nucleus and are more densely connected (higher degree)
may be more likely to share a similar common ground among many members of
the group. If so, a shared common ground may result in more unique language
use. Yet, networks that exhibit close, niche-like groupings exemplified by high
network transitivity may infect its members with the same vocabulary, decreasing
the overall variability in language use. Further analysis is necessary to unpack the
relationship that different social-network structures have with language use.

Discussion
We find pervasive relationships between language patterns, as expressed in
information-theoretic measures of review content, and social-network variables,
even after taking care to control for collinearity. The main findings are listed here:

1. Reviewers used more information-dense words (RI-Ent, AUI) and bigrams
(ACI) in networks with more friendship connections.
2. Reviewers used more information-dense words (RI-Ent, AUI) in networks
that have a lower Gini coefficient; networks where reviews were more evenly
distributed.
3. Reviewers use more information-dense words (RI-Ent, AUI) and bigrams
(ACI) as network degree (ratio of friendships connections to number of
individuals in the network) increased and as individuals in the network were
grouped more around a single center (AUI only).
4. Reviewers used fewer information-dense bigrams as the number of local
friendship connections increased (e.g. network transitivity).
5. Unigram information variability (UIV) was higher with higher connectivity;
channel capacity was less uniform in networks with more friendship
connections.

The predictions laid out at the end of the Methods section are somewhat
borne out. Scale-free networks do not appear to have a strong relationship with
information-theoretic scores, but networks that exhibit higher transitivity do
lead to lower information-dense bigrams (though not significant for any other
information measure) and, while more connections lead to higher information
density, they do not lead to lower information variability. Indeed, when
considering the last finding, the opposite was true: Networks with higher
connectivity used more information-dense words at a more varied rate. Although
this was not what we predicted, it is in line with previous work supporting the
notion that certain contextual influences afford greater resilience to a varied rate
of information transmission (Vinson & Dale, 2014b). In this case, more friendship
connections may allow for richer information-dense messages to be successfully
communicated less uniformly.
We found support for two predictions: (1) high transitivity networks lead to
less information-dense bigram language use and (2) high degree networks tend
to exhibit higher information density. In addition, more centralized networks
also lead to higher information-dense unigram language use. The first prediction
suggests that networks where more local mutual interconnections exist may be
more likely to infect other members with similar vocabulary. That is, more local
connectivity may lead to more linguistic imitation or entrainment. Here we merely
find that the structure of reviewers’ language use is similar to one another. It is
possible that similarities in linguistic structure reveal similarities in semantic content
across connected language users, but future research is needed to support this claim.
Support for the second prediction suggests that users adapt their
information-dense messages when they are more highly connected. This effect can
be explained if we assume that certain social network structures afford groups the
ability to establish an overall richer common ground. Previous work shows that
increased shared knowledge leads to more information dense messages (Doyle &
Frank, submitted; Qian & Jaeger, 2012). It may be that increases in network degree
and centrality enhance network members’ abilities to establish a richer common
ground, leading to more information-dense messages. One possibility may be
that certain networks tend to review similar types of restaurants. Again, further
exploration into how the number of friendship connections, network degree,
and centrality impact information density and variability is needed to determine
the importance of specific network properties in language use. Figure 5.4(a–c)
provide example networks that exhibit low, middle, and high network ACI given
the specific network structures that predict ACI above (e.g. increases in network
degree and decreases in transitivity).

FIGURE 5.4 Yelp networks occurring at the tails of certain complex network measure distributions
(as specified in the text), presenting ideal conditions for language use exhibiting high (a), middle
(b), and low (c) average conditional information (ACI).

General Discussion
In this chapter we explored how language use might relate to social structure. We
built 962 social networks from over 200,000 individuals who collectively wrote
over one million online customer business reviews. This massive, structured dataset
allowed testing how language use might adapt to structural differences in social
networks. Utilizing Big Data in this way affords assessing how differences in one’s
local social environment might relate to language use and communication more
generally. Our findings suggest that as the connectivity of a population increases,
speakers use words that are less common. Complex network variables such as the
edge-to-node ratio, network centrality, and local connectivity (e.g. transitivity) also
predict changes in the properties of words used. The variability of word use was
also affected by simple network structures in interesting ways. As a first exploration
our findings suggest local social interactions may contribute in interesting ways to
language use. A key strength of using a Big Data approach is in uncovering new
ways to test theoretical questions about cognitive science, and science in general.
Below we discuss how our results fit into the broader theoretical framework of
understanding what shapes language.
When controlling for nodes, edges, and review length, many R2 values in
our regression models were lowered. However, finding that some variability in
language use is accounted for by population connectivity suggests language use may
be partly a function of the interactions among individuals. Network degree,
centrality, and transitivity all varied in predictable ways with information measures.
Mainly, as the number of connections between nodes increased and as the network
became more centralized the use of less frequent unigrams (AUI) increased.
Interestingly, networks that exhibit high connectivity and greater centrality may
have more long-range connections. A growing number of long-range connections
may lead to the inclusion of individuals that would normally be farther away from
the center of the network. Individuals in a network with these structural properties
may be communicating more collectively, having more readily established a richer
common ground. If so, higher information density is more probable, as the
communication that is taking place can become less generic and more complex.
Additionally, networks with higher local connectivity, or high transitivity, tend
to use more common language, specifically bigrams. This again may be seen
as supporting a theory of common ground, that individuals with more local
connectivity are more likely to communicate using similar terminology, in this
case, bigrams. Using a Big Data approach it is possible to further explore other
structural aspects of one’s network that might influence language use.
While we merely speculate about potential conclusions, it is possible to obtain
rough measures of the likelihood of including more individuals at longer ranges.
Specifically, a network’s diameter—the longest stretch of space between two
individual nodes in any network—may serve as a measure of the distance that a
network occupies in socio-cultural space. This may be taken as a measure of how
many strangers are in a network, with longer diameters being commensurate with
the inclusion of more strangers.
It may be fruitful to explore the impact of a single individual on a network’s
language use. We do not yet explore processes at the individual level, opting
instead to sample networks and explore their aggregate linguistic tendencies.
Understanding the specifics of individual interaction may be crucial toward
understanding how and why languages adapt. We took an exploratory approach
and found general support for the idea that network structure influences certain
aspects of language use, but we did not look for phonological or syntactic patterns;
in fact our analysis could be regarded as a preliminary lexical
distribution analysis. However, information theory finds fruitful application in quantifying
massive text-based datasets and has been touted as foundational in an emerging
understanding of language as an adaptive and efficient communicative system
(Jaeger, 2010; Moscoso Del Prado Martín, Kostíc, & Baayen, 2004). In addition,
previous work investigating the role of individual differences in structuring one’s
network are important to consider. For instance, differences in personality, such as
being extroverted or introverted, are related to specific network-level differences
(Kalish & Robbins, 2006). It is open to further exploration as to how information
flows take place in networks, such as through hubs and other social processes.
Perhaps tracing the origin of the network by determining the oldest reviews of
the network and comparing these to the network’s average age may provide insight
into the importance of how certain individuals or personalities contribute to the
network’s current language use.
We see the current results as suggestive of an approach toward language as an
adaptive and complex system (Beckner et al., 2009; Lupyan & Dale, in press). Our
findings stand alongside previous research that reveals some aspect of the structure
of language adapts to changes in one’s socio-cultural context (Klingenstein et al.,
2014; Kramer et al., 2014; Lupyan & Dale, 2010; Vilhena et al., 2014). Since
evolution can be thought of as the aggregation of smaller adaptive changes taking
place from one generation to the next, finding differences in language within
social networks suggests languages are adaptive, more in line with shifts in social
and cultural structure than genetic change (Gray & Atkinson, 2003; cf. Chater
et al., 2008). The results of this study suggest that general language adaptation may
occur over shorter time scales, in specific social contexts, and could be detected
in accessible Big Data repositories (see, e.g., Stoll, Zakharko, Moran,
Schikowski, & Bickel, 2015). The space of communicable ideas may be more
dynamic, adapting to both local and global constraints at multiple scales of time.
A deeper understanding of why language use changes may help elucidate what
ideas can be communicated when and why. The application of sampling local
social networks provides one method toward understanding what properties of a
population of speakers may relate to language change over time—at the very least,
as shown here, in terms of general usage patterns.
Testing how real network structures influence language use is not possible
without large amounts of data. The network sampling technique used here allows
smaller networks to be sampled within a much larger social-network structure. The
use of Big Data in this way provides an opportunity to measure subtle and intricate
features whose impacts may go unnoticed in smaller-scale experimental datasets.
Still, we would of course recommend interpreting initial results cautiously. The
use of Big Data can provide further insight into the cognitive factors contributing
to behavior, but can only rarely be used to test for causation. To this point, one
major role the use of Big Data plays in cognitive science, and one we emphasize
here, is its ability to provide a sense of direction and a series of new hypotheses.

Acknowledgments
We would like to thank reviewers for their helpful and insightful commentary. This
work was supported in part by NSF grant INSPIRE-1344279 and an IBM PhD
fellowship awarded to David W. Vinson for the 2015-16 academic year.

Notes
1 This description involves some convenient simplification. Some abstract and
genetic notions of language also embrace ideas of adaptation (Pinker &
Bloom, 1990), and other sources of theoretical subtlety render our description
of the two traditions an admittedly expository approximation. However,
the distinction between these traditions is stark enough to warrant the
approximation: The adaptive approach sees all levels of language as adaptive
across multiple time scales, whereas more fixed, abstract notions of language see
it as only adaptive in a restricted range of linguistic characteristics.
2 Massive online sites capable of collecting terabytes of metadata per day have only
emerged in the last 10 years: Google started in 1998; Myspace 2003; Facebook
2004; Yelp 2004; Google+ 2011. Volume, velocity, and variety of incoming
data are thought to be the biggest challenges toward understanding Big Data
today (McAfee, Brynjolfsson, Davenport, Patil, & Barton, 2012).
3 Previous research calls this Information Density and uses this as a measure
of Uniform Information Density. We use the name Average Conditional
Information given the breadth of information-theoretic measures used in this
study.
4 Note: AUI and ACI were calculated by taking only the unique n-grams.
5 Our approach toward controlling for collinearity by residualizing variables
follows that of previous work (Jaeger, 2010). However, it is important to note
the process of residualizing to control for collinearity is currently in debate (see
Wurm & Fisicaro, 2014 for review). It is our understanding that the current
stringent use of this method is warranted provided it stands as a first pass toward
understanding how language use is influenced by network structures.
6 The variance inflation factor (VIF) considered acceptable for a given model is thought
to be somewhere between five and ten (Craney & Surles, 2002). After the variance predicted by
nodes and edges was removed from our analyses, no complex network measure
reached the variance inflation threshold of five.
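For readers who want to reproduce this kind of check, a minimal sketch using statsmodels is given below; the predictor table and its column names are invented for illustration and do not come from the dataset analyzed in this chapter.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictor table: one row per sampled network, columns are
# the kinds of structural measures discussed in the chapter.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "nodes": rng.poisson(200, 500),
    "edges": rng.poisson(800, 500),
    "centralization": rng.random(500),
    "transitivity": rng.random(500),
})

Xc = sm.add_constant(X.astype(float))
vifs = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns,
)
print(vifs)  # predictors with VIF above five would be flagged
```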

References
Abrams, D. M., & Strogatz, S. H. (2003). Linguistics: Modelling the dynamics of language
death. Nature, 424(6951), 900.
Aylett, M. P. (1999). Stochastic suprasegmentals: Relationships between redundancy,
prosodic structure and syllabic duration. Proceedings of ICPhS–99, San Francisco.
Baronchelli, A., Ferrer-i-Cancho, R., Pastor-Satorras, R., Chater, N., & Christiansen,
M. H. (2013). Networks in cognitive science. Trends in Cognitive Sciences, 17(7), 348–360.
Beckner, C., Blythe, R., Bybee, J., Christiansen, M. H., Croft, W., Ellis, N. C.,
. . . Schoenemann, T. (2009). Language is a complex adaptive system: Position paper.
Language learning, 59(s1), 1–26.
Bentz, C., Verkerk, A., Kiela, D., Hill, F., & Buttery, P. (2015). Adaptive communication:
Languages with more non-native speakers tend to have fewer word forms. PLoS One,
10(6), e0128254.
Bentz, C., & Winter, B. (2013). Languages with more second language speakers tend to
lose nominal case. Language Dynamics and Change, 3, 1–27.
Bond, R. M., Fariss, C. J., Jones, J. J., Kramer, A. D., Marlow, C., Settle, J. E., &
Fowler, J. H. (2012). A 61-million-person experiment in social influence and political
mobilization. Nature, 489(7415), 295–298.
Bullmore, E., & Sporns, O. (2009). Complex brain networks: Graph theoretical analysis of
structural and functional systems. Nature Reviews Neuroscience, 10(3), 186–198.
Chater, N., Reali, F., & Christiansen, M. H. (2009). Restrictions on biological adaptation
in language evolution. Proceedings of the National Academy of Sciences, 106(4), 1015–1020.
Choi, H. Y., Blumen, H. M., Congleton, A. R., & Rajaram, S. (2014). The role of
group configuration in the social transmission of memory: Evidence from identical and
reconfigured groups. Journal of Cognitive Psychology, 26(1), 65–80.
Chomsky, N. (1957). Syntactic structures. The Hague: Mouton.
Christiansen, M. H., & Chater, N. (2008). Language as shaped by the brain. Behavioral and
Brain Sciences, 31(5), 489–509.
Christakis, N. A., & Fowler, J. H. (2009). Connected: The surprising power of our social networks
and how they shape our lives. New York, NY: Little, Brown.
Craney, T. A., & Surles, J. G. (2002). Model-dependent variance inflation factor cutoff
values. Quality Engineering, 14(3), 391–403.
Dale, R., & Lupyan, G. (2012). Understanding the origins of morphological diversity: The
linguistic niche hypothesis. Advances in Complex Systems, 15, 1150017/1–1150017/16.
Dale, R., & Vinson, D. W. (2013). The observer’s observer’s paradox. Journal of
Experimental & Theoretical Artificial Intelligence, 25(3), 303–322.
Dodds, P. S., Watts, D. J., & Sabel, C. F. (2003). Information exchange and the robustness
of organizational networks. Proceedings of the National Academy of Sciences, 100(21),
12516–12521.
Doyle, G., & Frank, M. C. (2015). Shared common ground influences information density
in microblog texts. In Proceedings of NAACL-HLT (pp. 1587–1596).
Ember, C. R., & Ember, M. (2007). Climate, econiche, and sexuality: Influences on
sonority in language. American Anthropologist, 109(1), 180–185.
Everett, C. (2013). Evidence for direct geographic influences on linguistic sounds: The case
of ejectives. PLoS One, 8(6), e65275.
Freeman, L. C. (1979). Centrality in social networks: Conceptual clarification. Social
Networks, 1(3), 215–239.
Gordon, R. G. (2005). Ethnologue: Languages of the World, 15th Edition. Dallas, TX: SIL
International.
Gray, R. D., & Atkinson, Q. D. (2003). Language-tree divergence times support the
Anatolian theory of Indo-European origin. Nature, 426(6965), 435–439.
Griffiths, T. L. (2015). Manifesto for a new (computational) cognitive revolution. Cognition,
135, 21–23.
Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The faculty of language: What is it,
who has it, and how did it evolve? Science, 298(5598), 1569–1579.
Jaeger, T. F. (2010). Redundancy and reduction: Speakers manage syntactic information
density. Cognitive Psychology, 61(1), 23–62.
Jurafsky, D., Chahuneau, V., Routledge, B. R., & Smith, N. A. (2014). Narrative framing
of consumer sentiment in online restaurant reviews. First Monday, 19(4).
Kalish, Y., & Robins, G. (2006). Psychological predispositions and network structure: The
relationship between individual predispositions, structural holes and network closure.
Social Networks, 28(1), 56–84.
Klingenstein, S., Hitchcock, T., & DeDeo, S. (2014). The civilizing process in London’s
Old Bailey. Proceedings of the National Academy of Sciences, 111(26), 9419–9424.
Kramer, A. D., Guillory, J. E., & Hancock, J. T. (2014). Experimental evidence of
massive-scale emotional contagion through social networks. Proceedings of the National
Academy of Sciences, 111(14), 8788–8790.
Kudyba, S., & Kwatinetz, M. (2014). Introduction to the Big Data era. In S. Kudyba (Ed.),
Big Data, Mining, and Analytics: Components of Strategic Decision Making (pp. 1–15). Boca
Raton, FL: CRC Press.
Labov, W. (1972a). Language in the inner city: Studies in the Black English vernacular
(Vol. 3). Philadelphia, PA: University of Pennsylvania Press.
Labov, W. (1972b). Sociolinguistic patterns (No. 4). Philadelphia, PA: University of
Pennsylvania Press.
Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., & Van Alstyne,
M. (2009). Life in the network: The coming age of computational social science. Science,
323(5915), 721.
Levy, R., & Jaeger, T. F. (2007). Speakers optimize information density through syntactic
reduction. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.), Advances in neural information
processing systems (NIPS) 19, pp. 849–856. Cambridge, MA: MIT Press.
Lieberman, E., Michel, J. B., Jackson, J., Tang, T., & Nowak, M. A. (2007). Quantifying
the evolutionary dynamics of language. Nature, 449(7163), 713–716.
Lupyan, G., & Dale, R. (2010). Language structure is partly determined by social structure.
PLoS One, 5(1), e8559.
Lupyan, G., & Dale, R. (2015). The role of adaptation in understanding linguistic
diversity. In R. LaPolla, & R. de Busser (Eds.), The shaping of language: The relationship
between the structures of languages and their social, cultural, historical, and natural environments
(pp. 289–316). Amsterdam, The Netherlands: John Benjamins Publishing Company.
McAfee, A., Brynjolfsson, E., Davenport, T. H., Patil, D. J., & Barton, D. (2012). Big Data:
The management revolution. Harvard Business Review, 90(10), 61–67.
Moscoso del Prado Martín, F., Kostić, A., & Baayen, R. H. (2004). Putting the bits together:
An information theoretical perspective on morphological processing. Cognition, 94(1),
1–18.
Nettle, D. (1998). Explaining global patterns of language diversity. Journal of Anthropological
Archaeology, 17(4), 354–374.
Nichols, J. (1992). Linguistic diversity in space and time. Chicago, IL: University of Chicago
Press.
Nowak, M. A., Komarova, N. L., & Niyogi, P. (2002). Computational and evolutionary
aspects of language. Nature, 417(6889), 611–617.
Pellegrino, F., Coupé, C., & Marsico, E. (2011). Across-language perspective on speech
information rate. Language, 87(3), 539–558.
Piantadosi, S. T., Tily, H., & Gibson, E. (2011). Word lengths are optimized for efficient
communication. Proceedings of the National Academy of Sciences, 108(9), 3526–3529.
Pinker, S., & Bloom, P. (1990). Natural language and natural selection. Behavioral and Brain
Sciences, 13(4), 707–727.
Qian, T., & Jaeger, T. F. (2012). Cue effectiveness in communicatively efficient discourse
production. Cognitive Science, 36(7), 1312–1336.
Ramscar, M. (2013). Suffixing, prefixing, and the functional order of regularities in
meaningful strings. Psihologija, 46(4), 377–396.
Reali, F., Chater, N., & Christiansen, M. H. (2014). The paradox of linguistic complexity
and community size. In E. A. Cartmill, S. Roberts, H. Lyn & H. Cornish (Eds.),
The evolution of language. Proceedings of the 10th International Conference (pp. 270–277).
Singapore: World Scientific.
Sapir, E. (1921). Language: An introduction to the study of speech. New York: Harcourt, Brace
and Company.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical
Journal, 27, 379–423, 623–656.
Stine, R. A. (1995). Graphical interpretation of variance inflation factors. The American
Statistician, 49(1), 53–56.
Stoll, S., Zakharko, T., Moran, S., Schikowski, R., & Bickel, B. (2015). Syntactic
mixing across generations in an environment of community-wide bilingualism. Frontiers
in Psychology, 6, 82.
Triandis, H. C. (1994). Culture and social behavior. New York, NY: McGraw-Hill Book
Company.
Trudgill, P. (1989). Contact and isolation in linguistic change. In L. Breivik & E. Jahr (Eds.),
Language change: Contribution to the study of its causes (pp. 227–237). Berlin: Mouton de
Gruyter.
Trudgill, P. (2011). Sociolinguistic typology: Social determinants of linguistic complexity. Oxford,
UK: Oxford University Press.
Vilhena, D. A., Foster, J. G., Rosvall, M., West, J. D., Evans, J., & Bergstrom, C. T.
(2014). Finding cultural holes: How structure and culture diverge in networks of scholarly
communication. Sociological Science, 1, 221–238.
Vinson, D. W., & Dale, R. (2014a). An exploration of semantic tendencies in word of
mouth business reviews. In Proceedings of the Science and Information Conference (SAI), 2014
(pp. 803–809). IEEE.
Vinson, D. W., & Dale, R. (2014b). Valence weakly constrains the information density of
messages. In Proceedings of the 36th Annual Conference of the Cognitive Science Society (pp.
1682–1687). Austin, TX: Cognitive Science Society.
Wray, A., & Grace, G. W. (2007). The consequences of talking to strangers: Evolutionary
corollaries of socio-cultural influences on linguistic form. Lingua, 117(3), 543–578.
Wurm, L. H., & Fisicaro, S. A. (2014). What residualizing predictors in regression analyses
does (and what it does not do). Journal of Memory and Language, 72, 37–48.
6
MUSIC TAGGING AND LISTENING
Testing the Memory Cue Hypothesis in a Collaborative Tagging System

Jared Lorince and Peter M. Todd

Abstract
As an example of exploring human memory cue use in an ecologically valid context, we
present ongoing work to examine the “memory cue hypothesis” in collaborative tagging. In
collaborative tagging systems, which allow users to assign freeform textual labels to digital
resources, it is generally assumed that tags function as memory cues that facilitate future
retrieval of the resources to which they are assigned. There is, however, little empirical
evidence demonstrating that this is in fact the case. Employing large-scale music listening
and tagging data from the social music website Last.fm as a case study, we present a set of
time series and information theoretic analytic methods we are using to explore how patterns
of content tagging and interaction support or refute the hypothesis that tags function as
retrieval cues. Early results are, on average, consistent with the hypothesis. There is an
immediate practical application of this work to those working with collaborative tagging
systems (are user motivations what we think they are?), but our work also comprises
contributions of interest to the cognitive science community: First, we are expanding our
understanding of how people generate and use memory cues “in the wild.” Second, we
are enriching the “toolbox” available to cognitive scientists for studying cognition using
large-scale, ecologically valid data that is latent in the logged activity of web users.

Introduction
Humans possess a unique capacity to manipulate the environment in the pursuit
of goals. These goals can be physical (building shelter, creating tools, etc.), but
also informational, such as when we create markers to point the way along a path
or leave a note to ourselves as a reminder to pick up eggs from the market. In
the informational case, the creation of reminders or pointers in the environment
functions as a kind of cognitive offloading, enriching our modes of interaction with
the environment while requiring reduced internal management of information.
The proliferation of web-based technologies has massively increased the number
of opportunities we have for such offloading, the variety of ways we can go
about it, and the need to do so (if we are to keep up with the ever expanding
mass of information available online). This is particularly true with respect to the
various “Web 2.0” technologies that have recently gained popularity. As jargony
and imprecise a term as it may be, “Web 2.0” entails a variety of technologies of
interest to cognitive scientists, including the sort of informational environment
manipulations that interest us here. More than anything else, the “upgrade” from
Web 1.0 that has occurred over the past 10–15 years has seen the evolution of
the average web user from passive information consumer to active information
producer, using web tools as a means of interacting with digital content and other
individuals. The active web user generates a variety of data of interest to our field,
facilitating the study of cognitive processes like memory and categorization, as well
as a wealth of applied problems that methods and theory from the cognitive sciences
can help address. The systematic recording of user data by Web systems means there
is a wealth of “Big Data” capturing such behavior available to cognitive scientists.
Collaborative tagging is one of the core technologies of Web 2.0, and entails
the assignment of freeform textual labels (tags) to online resources (photos, music,
documents, etc.) by users. These tag assignments are then aggregated into a socially
generated semantic structure known as a “folksonomy.” The commonly assumed
purpose of tagging is for personal information management: Users tag resources
to facilitate their own retrieval of tagged items at a later time. In effect, then,
such tags serve as memory cues, signals offloaded to the (virtual) environment
that allow users to find resources in the future. If this assumption holds, tagging
behavior can serve as a useful window on the psychological processes described
above. However, while the “tags as memory cues” hypothesis is assumed across
a majority of tagging research, there is little in the way of empirical evidence
supporting this interpretation of tagging behavior. Our current research thus
serves to test this hypothesis, examining Big Data from social tagging systems to
determine whether users are in fact using tags as memory cues. Using a unique
dataset from the social music website Last.fm that includes records of both what
music users have tagged and how they have interacted with that music over time
(in the form of music listening histories), we examine if and how patterns of
content interaction support or contradict the memory cue interpretation. There is
an immediate practical application of this work to those working with collaborative
tagging systems (are user motivations what we think they are?), but our work also
comprises contributions of interest to the cognitive science community: First, we
are expanding our understanding of how people generate and use memory cues “in
the wild.” Second, we are enriching the “toolbox” available to cognitive scientists
for studying cognition using large-scale, ecologically valid data that is latent in the
logged activity of web users.
We begin the chapter by providing background on precisely what collaborative
tagging entails, describing the existing theories of tagging motivation, and briefly
summarizing the relevant work in psychology and cognitive science on memory
cue generation and usage, relating it to the case of online tagging. We then
formalize our research objectives, outlining the difficulties in making claims about
why people are tagging based on histories of what they have tagged, presenting
the details of our dataset and how it offers a partial solution to those difficulties,
and delineating our concrete hypotheses. Finally, we present the novel analysis
methodologies we are employing and some of the results they have generated.

Background
What is Collaborative Tagging?
In collaborative tagging, many individuals assign freeform metadata in the form
of arbitrary strings (tags) to resources in a shared information space. These
resources can, in principle, be any digital object, and web services across a wide
variety of domains implement tagging features. Examples include web bookmarks
(Delicious.com, Pinboard.in), music (Last.fm), photos (Flickr.com, 500px.com),
academic papers (academia.edu, mendeley.com), books (LibraryThing.com), and
many others. When many users engage in tagging of a shared corpus of content,
the emergent semantic structure is known as a folksonomy, a term defined by
Thomas Vander Wal as a “user-created bottom-up categorical structure . . . with
an emergent thesaurus” (Vander Wal, 2007). Under his terminology, a folksonomy
can either be broad, meaning many users tag the same, shared resources, or narrow,
in which any given resource tends to be tagged by only one user (usually the
content creator or uploader). Last.fm, on which we are performing our analyses,
is a canonical example of the former, and Flickr, where users upload and tag their
own photos, is a good example of the latter.
Folksonomies have been lauded as a radical new approach to content
classification (Heckner, Mühlbacher, & Wolff, 2008; Shirky, 2005; Sterling,
2005; Weinberger, 2008). In principle, they leverage the “wisdom of the
crowds” to generate metadata both more flexibly (multiple classification of content
is built in to the system) and at lower economic cost (individual users are,
generally, self-motivated and uncompensated) than in traditional, expert, or
computer-generated taxonomies, as one might find in a library. The approach
is not uncontroversial, however, with critics from library science in particular
(Macgregor & McCulloch, 2006) pointing out the difficulties that the wholly
uncontrolled vocabularies of folksonomies can introduce (especially poor handling
of homonyms and hierarchical relationships between tags). In broad folksonomies,
the existence of social imitation effects (Floeck, Putzke, Steinfels, Fischbach, &
Schoder, 2011; Lorince & Todd, 2013) can also cast doubt on whether agreement
as to how an item ought to be tagged reflects true consensus, or instead bandwagon
effects that do not “correctly” categorize the item. Given our current focus
on individuals’ tagging motivations, the level of efficacy of tagging systems for
collective classification is not something we address here.
Hotho, Jäschke, Schmitz, & Stumme (2006a) formally define a folksonomy
as a tuple F := (U, T, R, Y),1 where U, T, and R are finite sets representing,
respectively, the set of all unique users, tags, and resources in the tagging system.
Y is a ternary relation between them (Y ⊆ U × T × R), representing the set of tag
assignments (or, equivalently, annotations) in the folksonomy (i.e. instances of a
particular user assigning a particular tag to a particular resource). They also define
the personomy of a particular user, P := (T_u, R_u, Y_u), which is simply the subset of
F corresponding to the tagging activity of a single user.
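A minimal sketch of this formalization in code (hypothetical names, not tied to any particular tagging system) might look as follows:

```python
from collections import namedtuple

# One element of Y: a (user, tag, resource) annotation.
Annotation = namedtuple("Annotation", ["user", "tag", "resource"])

def folksonomy(annotations):
    """Return (U, T, R, Y) from an iterable of annotations."""
    Y = set(annotations)
    U = {a.user for a in Y}
    T = {a.tag for a in Y}
    R = {a.resource for a in Y}
    return U, T, R, Y

def personomy(annotations, user):
    """Restrict the folksonomy to a single user's tagging activity."""
    Y_u = {a for a in annotations if a.user == user}
    return {a.tag for a in Y_u}, {a.resource for a in Y_u}, Y_u

# Toy example.
annotations = [
    Annotation("alice", "rock", "artist:queen"),
    Annotation("alice", "favorites", "artist:queen"),
    Annotation("bob", "rock", "artist:rush"),
]
U, T, R, Y = folksonomy(annotations)
print(personomy(annotations, "alice"))
```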
Collaborative tagging systems began to be developed in the early 2000s, with
the launch of the social bookmarking tool Delicious in 2003 marking the first
to gain widespread popularity. Three years later, Golder and Huberman’s (2006)
seminal paper on the stabilization of tag distributions on Delicious sparked interest
in tagging as an object of academic study. In the years since, a substantial
literature on the dynamics of tagging behavior has developed. Research has
covered topics as diverse as the relationship between social ties and tagging
habits (Schifanella, Barrat, Cattuto, Markines, & Menczer, 2010), vocabulary
evolution (Cattuto, Baldassarri, Servedio, & Loreto, 2007), mathematical and
multi-agent modeling of tagging behaviors (Cattuto, Loreto, & Pietronero, 2007;
Lorince & Todd, 2013), identification of expert taggers (Noll, Au Yeung, Gibbins,
Meinel, & Shadbolt, 2009; Yeung, Noll, Gibbins, Meinel, & Shadbolt, 2011),
emergence of consensus among taggers (Halpin, Robu, & Shepherd, 2007; Robu,
Halpin, & Shepherd, 2009), and tag recommendation (Jäschke, Marinho, Hotho,
Schmidt-Thieme, & Stumme, 2007; Seitlinger, Ley, & Albert, 2013), among
others.
This small sample of representative work is indicative of the fact that, at least
at the aggregate level, researchers have a fairly good idea of how people tag. What
is comparatively poorly understood (and relevant to our purposes here) is exactly
why users tag.

Why People Tag


The prevailing assumption about tagging behavior is that tags serve as retrieval or
organizational aids. Take the original definition of “folksonomy” as a canonical
example: “Folksonomy is the result of personal free tagging of information and
objects (anything with a URL) for one’s own retrieval” (Vander Wal, 2007, emphasis
added). Couched in psychological terms, this is to say that tags function as memory
cues of some form, facilitating future retrieval of the items to which they are
assigned. There are various manifestations of this perspective (see, among many
examples, Glushko, Maglio, Matlock, & Barsalou, 2008, Halpin et al., 2007,
and Golder & Huberman, 2006), and it is one generally in line with the design
goals of tagging systems. Although tagged content can be used in various ways
beyond retrieval, such as resource discovery and sharing, the immediate motivation
for a user to tag a given item is most often assumed (not illogically) to be
the achievement of an information organization and retrieval goal. This is not to imply
that other tagging objectives, such as social sharing, are necessarily illogical, only
that they are less often considered primary motivators of tagging choices. Such
retrieval goals are implemented in tagging systems by allowing users to use tags
as search keywords (returning items labeled with a particular tag from among a
user’s own tagged content, or the global folksonomy) and by allowing them to
directly browse the tags they or others have generated. On Last.fm, for example,
a user can click on the tag “rock” on the tag listing accessible from his or her
profile page, and view all the music to which he or she has assigned the tag
“rock.”
While our current goal is to test whether this assumption holds when
considering users’ histories of item tagging and interaction, it is important to
recognize that alternative motivations for tagging can exist. Gupta, Li, Yin, & Han
(2010), for instance, posit no fewer than nine possible reasons, beyond future retrieval,
for which a user might tag: Contribution and sharing, attracting attention to one’s
own resources, play and competition, self-presentation, opinion expression, task
organization, social signaling, earning money, and “technological ease” (i.e. when
software greatly reduces the effort required to tag content). We will not analyze
each of these motivational factors in depth, but present the list in its entirety
to make clear that tagging motivation can extend well beyond a pure retrieval
function. We do, however, briefly review the most well-developed theories of tag
motivation in the literature.
What is likely to be the most critical distinction in a user’s decision to tag
a resource is the intended audience of the tag, namely whether it is self- or
other-directed. This distinction maps onto what Heckner et al. (2008) refer to
as PIM (personal information management) and resource sharing. The sort of
self-generated retrieval cues that interest us here fall under the umbrella of PIM,
while tags generated for the purposes of resource sharing are intended to help
other people find tagged content. For example, a user may apply tags to his or her
Flickr photos that serve no personal organizational purpose, but are intended to
make it easier for others to discover his or her photos. Social motivations can be
more varied, however. Zollers (2007), for instance, argues that opinion expression,
performance, and activism are all possible motivations for tagging. Some systems
also implement game-like features to encourage tagging (Weng & Menczer, 2010;
Weng, Schifanella, & Menczer, 2011; Von Ahn & Dabbish, 2004) that can invoke
socially directed motivations.
Ames and Naaman (2007) present a two-dimensional taxonomy of tagging
motivation, dividing motivation not only along dimensions of sociality (like
Heckner et al., 2008), but also a second, functional dimension. Under their
terminology, tags can be either organizational or communicative. When self-directed,
organizational tags are those used for future retrieval, while communicative tags
provide contextual information about a tagged resource, but are not intended to
aid in retrieval. Analogously, social tags can either be intended to help other users
find a resource (organizational) or communicate information about a resource once
it is found (communicative).
While all of these theories of tagging motivation appear reasonable (to varying
degrees), there is little in the way of empirically rigorous work demonstrating that
user tagging patterns actually align with them. The most common methods for
arriving at such taxonomies are examining the interface and features of tagging
systems to infer how and why users might tag (e.g. in a system where a user can
only see his or her own tags, social factors are likely not to be at play, see Marlow,
Naaman, Boyd, & Davis, 2006), semantic analysis and categorization of tags (e.g.
“to read” is likely to be a self-directed organizational tag, while tagging a photo
with one’s own username is likely to be a socially directed tag suggesting a variety
of self-advertisement, see Sen et al., 2006; Zollers, 2007), and qualitative studies in
which researchers explicitly ask users why they tag (e.g. Ames & Naaman, 2007;
Nov, Naaman, & Ye, 2008). All of these methods certainly provide useful insights
into why people tag, but none directly measure quantitative signals of any proposed
motivational factor. One notable exception to this trend is the work of Körner
and colleagues (Körner, Benz, Hotho, Strohmaier, & Stumme, 2010; Körner,
Kern, Grahsl, & Strohmaier, 2010; Zubiaga, Körner, & Strohmaier, 2011), who
propose that taggers can be classified as either categorizers (who use constrained tag
vocabularies to facilitate later browsing of resources) or describers (who use broad,
varied vocabularies to facilitate later keyword-based search over resources). They
then develop and test quantitative measures that, they hypothesize, should indicate
that a user is either a categorizer or describer. Although Körner and colleagues
are able to classify users along the dimensions they define, they cannot know
if describers actually use their tags for search, or that categorizers use them for
browsing. This is a problem pervasive in work on tagging motivation (for lack of
the necessary data, as we will discuss below); there is typically no way to verify that
users actually use the tags they have applied in a manner consistent with a given
motivational framework.

Connections to Psychological Research on Memory Cues


We now turn to work from the psychological literature on how humans generate
and employ the kinds of externalized memory cues that tags may represent. There
is little work directly addressing the function of tags in web-based tagging systems
as memory cues, but some literature has explored self-generated, external memory
cues. This research finds its roots more broadly in work on mnemonics and other
memory aids that gained popularity in the 1970s (Higbee, 1979). Although most
work has focused on internal memory aids (e.g. rhyming, rehearsal strategies, and
other mnemonics), some researchers have explored the use of external aids, which
are typically defined as “physical, tangible memory prompts external to the person,
such as writing lists, writing on one’s hand, and putting notes on a calendar” (Block
& Morwitz, 1999, p. 346). We of course take the position that digital objects, too,
can serve as memory cues, and some early work (Harris, 1980; Hunter, 1979;
Intons-Peterson & Fournier, 1986) was sensitive to this possibility long before
tagging and related technologies were developed.
The work summarized above, although relevant, provides little in the way
of testable hypotheses with respect to how people use tags. Classic research on
human memory—specifically on so-called cued recall—can offer such concrete
hypotheses. If the conceptualization of tags as memory cues is a valid one,
we would expect users’ interaction with them to conform, to at least some
degree, with established findings on cued retrieval of memories. The literature
on cued recall is too expansive and varied to succinctly summarize here (see
Kausler & Kausler, 1974 for a review of classic work), but broadly speaking
describes scenarios in which an individual is presented with target items (most
typically words presented on a screen) and associated cues (also words, generally
speaking), and is later tested on his or her ability to remember the target items
when presented with the previously learned cues. The analog to tagging is that
tags themselves function as cues, and are associated with particular resources that
the user wishes to retrieve (recall) at a later time. The scenarios, of course, are
not perfectly isomorphic. While in a cued-recall context, a subject is presented
with the cue, and must retrieve from memory the associated item(s), in a tagging
context the user may often do the opposite, recalling the cue, which triggers
automatic retrieval (by the tagging system) of the associated items “for free” with
no memory cost to the user. Furthermore, it is likely to be true in many cases
that a user may not remember the specific items they have tagged with a given
term at all. Instead, a tag might capture some relevant aspect of the item it is
assigned to, such that it can serve to retrieve a set of items sharing that attribute
(with no particular resource being sought). As an example, a user might tag upbeat,
high-energy songs with the word “happy,” and then later use that tag to listen to
upbeat, happy songs. In such a case, the user may have no particular song in mind
when using the tag for retrieval, as would be expected in a typical cued-recall
scenario.
These observations reveal that, even when assuming tags serve a retrieval
function, how exactly that function plays out in user behavior can take various
forms. Nonetheless, we take the position that an effective tag—if and when that
tag serves as retrieval cue—should share attributes of memory cues shown to be
effective in the cued recall literature. In particular, we echo Earhard’s (1967) claim
that “the efficiency of a cue for retrieval is dependent upon the number of items
for which it must act, and that an efficient strategy for remembering must be some
compromise between the number of cues used and the number of items assigned
to each cue” (p. 257). We base this on the assumption that tags, whether used for
search, browsing, or any other retrieval-centric purpose, still serve as cue-resource
associates in much the same way as in cued recall research; useful tags should
connect a user with desired resources in a way that is efficient and does not impose
unreasonable cognitive load.
In cases of tagging for future retrieval, this should manifest as a balance between
the number of unique tags (cues) a user employs, and the number of items which
are labeled with each of those tags. Some classic research on cued recall would
argue against such a balancing act, with various studies suggesting that recall
performance reliably increases as a function of cue distinctiveness (Moscovitch &
Craik, 1976). This phenomenon is sometimes explained by the cue-overload effect
(Rutherford, 2004; Watkins & Watkins, 1975), under which increasing numbers
of targets associated with a cue will “overload” the cue such that its effectiveness
for recalling those items declines. In other words, the more distinctive a cue is (in
terms of being associated with fewer items), the better. But when researchers have
considered not only the number of items associated with a cue, but also the total
number of cues a subject must remember, results have demonstrated that at both
extremes—too many distinct cues or too many items per cue—recall performance
suffers. Various studies support this perspective (e.g. Hunt & Seta, 1984; Weist,
1970), with two particularly notable cued recall studies being those by Earhard
(1967), who found recall performance to be an increasing function of the number
of items per cue, but a decreasing function of the total number of cues, and Tulving
& Pearlstone (1966), who found that subjects were able to remember a larger
proportion of a set of cues, but fewer targets per cue, as the number of targets
associated with each cue increased.
Two aspects of tagging for future retrieval that are not well captured by existing
work are (a) the fact that, in tagging, cues are self-generated and (b) differences in
scale (the number of items to be remembered and tags used far exceed, in many
cases by orders of magnitude, the number of cues and items utilized in cued recall
studies). Tullis & Benjamin (2014) have recently begun to explore the question of
self-generated cues in experiments where subjects are explicitly asked to generate
cues for later recall of associated items, and their findings are generally consistent
with the account of cued recall described here. Results suggest that people are
sensitive to the set of items to be remembered in their choice of cues, and that
their choices generally support the view that cue distinctiveness aids in recall. The
issue of scale remains unaddressed, however.
In sum, the case of online tagging has important distinctions from the paradigms
used in cued recall research, but we nonetheless find the cued recall framework to
be a useful one for generating the specific hypotheses we explore below.
Problem Formalization and Approach


Stated formally, our overarching research question is this: By jointly examining
when and how people tag resources, along with their patterns of interaction over
time with those same resources, can we find quantitative evidence supporting or
refuting the prevailing hypothesis that tags tend to serve as memory cues? In this
section we address the challenges associated with answering this question, describe
our dataset and how it provides an opportunity for insight into this topic, and
outline specific hypotheses.

The Challenge
As discussed above, there is no shortage of ideas as to why people tag, but actually
finding empirical evidence supporting the prevalent memory cue hypothesis—or
any other possible tagging motivation, for that matter—is difficult. The simple fact
of the matter is that there are plenty of data logging what, when, and with which
terms people tag content in social tagging systems, but to our knowledge there are
no publicly available datasets that reveal how those tags are subsequently used for
item retrieval (or for any other reason). Of the various ways a user might interact
with or be exposed to a tag after he or she has assigned it to an item (either by
using it as a search term, clicking it in a list, simply seeing it onscreen, etc.), none
are open to direct study. This is not impossible in principle, as a web service could
log such information, but such data are not present in publicly available datasets or
possible to scrape from any existing tagging systems.
Thus, we face the problem of making inferences about why a user tagged an
item based only on the history of what, how, and when that user has tagged,
without any ability to test if future use of the tag matches our inferences. It may
seem, then, that survey approaches that directly ask users why they tag might
necessarily be our best option, but we find this especially problematic. Not only
are such self-reported motivations not wholly reliable, but we are also more interested in
whether tags actually function as memory cues than whether users intend to use
them as such. With all this in mind, we now turn to describing the dataset with
which we are currently working, and why we believe it provides a partial resolution
to these challenges.

Dataset
Our current work revolves around data crawled over the course of 2013 and 2014
from the social music website Last.fm. The core functionality of the site (a free
service) is tracking listening habits in a process known as “scrobbling,” wherein
each timestamped, logged instance of listening to a song is a “scrobble.” Listening
data are used to generate music recommendations for users, as well as to connect
them with other users with similar listening habits on the site’s social network.
Listening statistics are also summarized on a user’s public profile page (showing the
user’s recently listened tracks, most listened artists, and so on). Although users can
listen to music on the site itself using its radio feature, they can also track their
listening in external media software and devices (e.g. iTunes, Windows Media
Player, etc.), in which case listening is tracked with a software plugin, as well as
on other online streaming sites (such as Spotify and Grooveshark). Because the
site tracks listening across various sources, we can be confident that we have a
representative—if not complete—record of users’ listening habits.
Last.fm also incorporates tagging features, and users can tag any artist, album,
or song with arbitrary strings. Being a broad folksonomy, multiple users can tag
the same item (with as many distinct tags as they desire), and users can view the
distribution of tags assigned to any given item. In addition to seeing all the tags
that have been assigned to a given item, users are also able to search through their
own tags (e.g. to see all the songs that one has tagged “favorites”) or view the items
tagged with a particular term by the community at large. From there, they can also
listen to collections of music tagged with that term (e.g. on the page for the tag
“progressive metal” there is a link to the “play progressive metal tag”).
The current version of our dataset consists of complete listening and tagging
histories for over 90,000 Last.fm users for the time period of July 2005 through
December 2012, amounting to over 1.6 billion individual scrobbles and nearly
27 million individual annotations (tuples representing a user’s assignment of a
particular tag to a particular item at a particular time). See Table 6.1 for a high-level
summary. All data were collected either via the Last.fm API or direct scraping of
publicly available user profile pages. We originally collected a larger sample of
tagging data from users (approximately 1.9 million), and the data described here
represent the subsample of those for which we have so far collected listening data.
See our previous work using the larger tagging dataset (Lorince & Todd, 2013;
Lorince, Joseph, & Todd, 2015; Lorince, Zorowitz, Murdock, & Todd, 2014) for
technical details of the crawling process.
The value of these data is that they provide not only a large sample of user
tagging decisions, as in many other such datasets, but also patterns of interaction
over time with the items users have tagged. Thus, for any given artist or song2 a
user has listened to, we can determine if the user tagged that same item and when,
permitting a variety of analyses that explore the interplay between interaction with
an object (in our case, by listening to it) and tagging it. This places us in a unique
position to test if tagging a resource affects subsequent interaction with it in a way
consistent with the memory cue hypothesis.
We of course face limitations. While these data present a new window on our
questions of interest, they cannot establish a causal relationship between tagging
and any future listening, and there may be peculiarities of music listening that limit
TABLE 6.1 Dataset summary. Per-user medians in parentheses.

Measure                      Count (per-user median)
Total users                  90,603
Total scrobbles              1,666,954,788 (7,163)
Unique artists scrobbled     3,922,349 (486)
Total annotations            26,937,952 (37)
Total unique tags            551,898 (16)
Unique artists tagged        620,534 (16)

the applicability of any findings to other tagging domains (e.g. web bookmarks,
photos, etc.). Nonetheless, we find ourselves in a unique position to examine the
complex interplay between music tagging and listening that can provide insight
into whether or not people tag for future retrieval, and tagging motivation more
generally.

Hypotheses
As we clearly cannot measure motivation directly, we seek to establish a set of
anticipated relationships between music tagging and listening that should hold if
the memory cue hypothesis is correct, or at least in a subset of cases in which
it applies. The overarching prediction of the memory cue hypothesis is that tags
facilitate re-finding music in the future, which should manifest here as increased
levels of listening to tagged music than we would find in the absence of tagging.
Here we outline two concrete hypotheses:

Hypothesis 1. If a user tags an item, this should increase the probability that the user
listens to it in the future. Specifically, assignment of tags to a particular artist/song
should correlate with greater rates of listening to that artist/song later.

If tagging does serve as a retrieval aid, it should increase the chance that a user
interacts with the tagged resource in the future. We would expect that increases in
tagging an artist, on average, should correlate with and precede increased probability
of listening to that artist. This would suggest that increased tagging is predictive of
future listening, which is consistent with the application of tags facilitating later
retrieval of a resource.

Hypothesis 2. Those tags that are most associated with increased future listening
(i.e. those that most likely function as memory cues) should occupy a “sweet spot”
of specificity that makes them useful as retrieval aids.
Even if the memory cue hypothesis holds, it is presumably the case that not all
tags serve as memory cues. Those that do, as evidenced by a predictive relationship
with future listening, should demonstrate moderate levels of information content
(in the information theoretic sense, Shannon, 1948). A tag that is overly specific
(for example, one that uniquely identifies a particular song) is likely to be of little
use in most cases,3 as the user may as well recall the item directly, while one that is
overly broad (one that applies to many different items) is also of little value, for it
picks out too broad a set of items to effectively aid retrieval. Thus we hypothesize
that the specificity of tags (as measured by Shannon entropy) should be more likely
on average to fall in a “sweet spot” between these extremes in those cases where
tagging facilitates future listening.

Analytic Approaches
In this section we describe some of the analytic approaches we are employing to test
the memory cue hypothesis, and a selection of early findings. We discuss, in turn,
time series analysis methods including visualization and clustering, information
theoretic analyses of tags, and other approaches to be explored in future work
including modeling the causal influence (or lack thereof) of tagging on subsequent
listening.
Central to the analyses presented below are user-artist listening time series
and user-artist tagging time series. The former consist of the monthly scrobble
frequencies for each user-artist pair in our data (i.e. for every user, there exists
one time series of monthly playcounts for each unique artist he or she has listened
to) in the July 2005 through December 2012 range. We similarly define tagging
time series, which reflect the number of times a particular user tagged a particular
artist each month. Although listening data are available at a higher time resolution
than we use for analysis, users’ historical tagging data are only available at monthly
time resolution. Thus we down-sample all listening data to monthly playcounts to
facilitate comparative analysis with tagging.
While it is possible in principle to define these time series at the level of
particular songs as opposed to artists, the analysis we present here is limited to
the artist level. For this first phase of research we have taken this approach because
(a) the number of unique songs is much larger than the number of unique artists,
greatly increasing the computational demands of analysis, and (b) the listening and
tagging data (especially the latter) for any particular song in our dataset are typically
very sparse. Thus, for the purposes of the work presented here, we associate with
a given artist all annotations assigned directly to that artist, or to any of the artist’s
albums or songs.
Listening time series are normalized to account for variation in baseline levels
of listening across users. We accomplish this by dividing a user’s playcount for
a given artist in a given month by that user’s total playcount (across all artists)
for that month. This effectively converts raw listening counts to the proportion
of a user’s listening in a given time period allocated to any given artist. After all
pre-processing, our data consists of 78,271,211 untagged listening time series (i.e.
user-artist pairings in which the user never tagged the corresponding artist), and
5,336,702 tagged time series (user-artist pairings in which the user tagged the artist
at least once in the data collection period).
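As a hedged illustration of this preprocessing step (a sketch, not the authors' actual pipeline; all column names and values are invented), the following pandas snippet builds monthly user-artist playcounts from a scrobble log and normalizes them by each user's monthly total:

```python
import pandas as pd

# Hypothetical scrobble log: one row per (user, artist, timestamp) listening event.
scrobbles = pd.DataFrame({
    "user":   ["u1", "u1", "u1", "u2", "u2"],
    "artist": ["a1", "a1", "a2", "a1", "a3"],
    "ts": pd.to_datetime(["2006-01-03", "2006-01-20", "2006-01-21",
                          "2006-02-02", "2006-02-15"]),
})
scrobbles["month"] = scrobbles["ts"].dt.to_period("M")

# Monthly playcounts for each user-artist pair.
counts = (scrobbles.groupby(["user", "artist", "month"])
                   .size()
                   .rename("plays")
                   .reset_index())

# Normalize by the user's total plays that month, giving the proportion of that
# month's listening devoted to each artist.
monthly_totals = counts.groupby(["user", "month"])["plays"].transform("sum")
counts["norm_plays"] = counts["plays"] / monthly_totals
print(counts)
```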

Time Series Analysis


With our time series thus defined, a number of analyses become possible to address
our first hypothesis defined above. In most cases, such time series analysis at the
level of the individual is very difficult, as listening and tagging data (especially
the latter) tend to be sparse for any single user. But by aggregating many time
series together, we can determine if user behavior, on average, is consistent with
our hypotheses. Tagging data are not sparse for all users, however, and some
users are in fact prolific taggers with thousands of annotations. As is clear from
Figure 6.1, tagging levels show a long-tailed distribution in which most users tag
very little, and a small number tag a great deal. Although we average across users
for the analyses presented here, these discrepancies between typical taggers and
“supertaggers”—the implications of which we directly examine in other work
(Lorince et al., 2014, 2015)—suggest that future work may benefit from analyzing
different groups of taggers separately.
A first, high-level perspective is to compare the overall average listening of
tagged versus untagged listening time series (that is, comparing listening patterns
on average for user-artist pairs in which the user has tagged that artist, and those
in which he or she has not), to see if they match the intuitions set forth in
Hypothesis 1. As is apparent in Figure 6.2, they do. Here, after temporally aligning
all time series to the first month in which a user listened to a given artist, we
plot the mean normalized playcount (i.e. proportion of a user’s listening in a given
month) among all untagged (solid line) and tagged (dashed line) time series. As
predicted, tagging is correlated with increased listening to an artist after the tag is
applied (and also within the month the tag is applied), as evidenced by the higher
peak and slower decay of listening for tagged time series. Note that the tagged
time series analyzed here are limited to those tagged in the first month a user
listens to a given artist. We ignore cases where a user only tagged an artist in the
preceding or subsequent months, as there is no principled way to align the tagged
and untagged time series for comparison under these circumstances. However,
tagging is by far the most common in the first month a user listens to an artist
(more than 52 percent of tagged time series have an annotation the month of the
first listen), so this analysis still captures a majority of the data. While these results
are correlational (we cannot know if increased listening levels are caused by tagging,
FIGURE 6.1 For a given total annotation count N, the proportion of users in our tagging dataset with a total of N annotations, on a log-log scale.

or if users are simply more likely to tag the artists they are more likely to listen to),
aggregate listening patterns are at least consistent with Hypothesis 1.
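A sketch of this kind of comparison is given below; it builds on the hypothetical `counts` table from the earlier sketch plus an assumed set of tagged user-artist pairs, aligns each series to the month of first listen, and averages tagged and untagged series separately (the solid and dashed curves of Figure 6.2 would correspond to the two groups).

```python
import pandas as pd

def aligned_mean_curves(counts, tagged_pairs, max_months=84):
    """Mean normalized listening by months since first listen, split by whether
    the user tagged the artist.

    Assumes `counts` has columns user, artist, month (monthly period dtype) and
    norm_plays; `tagged_pairs` is a set of (user, artist) tuples.
    """
    df = counts.copy()
    # An integer month index makes the alignment arithmetic simple.
    df["m"] = df["month"].dt.year * 12 + df["month"].dt.month
    df["offset"] = df["m"] - df.groupby(["user", "artist"])["m"].transform("min")
    df = df[df["offset"] < max_months]
    df["tagged"] = [(u, a) in tagged_pairs
                    for u, a in zip(df["user"], df["artist"])]
    return df.groupby(["tagged", "offset"])["norm_plays"].mean()
```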
In concurrent work,4 we are exploring canonical forms of music listening
patterns by applying standard vector clustering methods from computer science
to identify groups of similarly shaped listening time series. The precise
methodological details are not relevant here, but involve representing each time
series as a simple numeric vector, and feeding many such time series into an
algorithm (k-means) that arbitrarily defines k distinct cluster centroids. Vectors
are each assigned to the cluster to whose centroid they are most similar (as
measured by Euclidean distance), and a new centroid is defined for each cluster
as the mean of all its constituent vectors. This process repeats iteratively until
the distribution of vectors over clusters stabilizes.5 In Figure 6.3 we show results
of one of various clustering analyses, showing cluster centroids and associated
probability distributions of tagging for k = 9 clusters. Plotted are the mean
probability distributions of listening in each cluster, as well as the temporally
aligned probability distribution of tagging for all user-artist pairs in the cluster.
Consideration of the clustered results is useful for two reasons. First, it demonstrates
that tagging is, on average, most likely in the first month a user listens to an artist
even when the user’s listening peaks in a later month, which is impossible to see in
Figure 6.2. Second, it provides further evidence that increases in tagging correlate
FIGURE 6.2 Mean normalized playcount each month (aligned to the month of first listen) for all listening time series in which the user never tagged the associated artist (solid line) and listening time series in which the user tagged the artist in the first month he or she listened to the artist (dashed line).

with and precede increases in listening. This is demonstrated by the qualitatively
similar shapes of the tagging and listening distributions, but more importantly
by the fact that the tagging distributions are shifted leftward (that is, earlier in time)
compared to the listening distributions.
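For illustration, a comparable clustering step can be sketched with scikit-learn's KMeans as a stand-in for the procedure described above; the input matrix here is random placeholder data rather than real listening time series.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder matrix: one row per user-artist listening time series, one column
# per month since first listen (normalized playcounts).
rng = np.random.default_rng(0)
series = rng.random((1000, 84))

# Convert each series to a probability density, as described for Figure 6.3.
series = series / series.sum(axis=1, keepdims=True)

km = KMeans(n_clusters=9, n_init=10, random_state=0).fit(series)
centroids = km.cluster_centers_   # canonical listening "shapes"
labels = km.labels_               # cluster assignment for each time series
print(np.bincount(labels))        # number of series in each cluster
```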
We have established that, on average, the relative behavior of listening and
tagging time series is in line with our expectations, but an additional useful
analysis is to explore if the probability of listening to an artist increases with the
number of times that artist is tagged. Tagged time series should demonstrate more
listening, as we have shown, but presumably the more times a user has tagged
an artist, the more pronounced this effect should be. Figure 6.4 confirms the
hypothesis, plotting the mean probability of listening to an artist as a function
of the number of months since a user first listened to that artist, separated by
the number of times the user has tagged the artist (or associated songs/albums).
Formally, given that a user has listened to an artist for the first time at T_0, what
is the probability that he or she listened to the artist one or more times in month
T_1, T_2, . . . , T_n? Tagged time series show greater listening as compared to untagged
series, with listening probabilities increasing with the total number of times they
are tagged.

[Figure 6.3 shows nine panels of probability density versus months since first listen: Cluster 1 (N = 62716), Cluster 2 (N = 140215), Cluster 3 (N = 78418), Cluster 4 (N = 168848), Cluster 5 (N = 289971), Cluster 6 (N = 62355), Cluster 7 (N = 126026), Cluster 8 (N = 27219), and Cluster 9 (N = 44232).]

FIGURE 6.3 Clustering results for k = 9. Shown are mean normalized playcount (solid
line) and mean number of annotations (dashed line), averaged over all the time series
within each cluster. Time series are converted to probability densities, and aligned to
the first month in which a user listened to a given artist. Clusters are labeled with
the number of listening time series (out of 1 million) assigned to each cluster. Cluster
numbering is arbitrary.

Taken together, these preliminary comparisons of tagging and listening behavior
demonstrate that tagging behavior is associated with increased probability of
interaction with the tagged item, consistent with but not confirming Hypothesis 1.
In the next section we describe some of the information theoretic methods used
to explore Hypothesis 2.

Information Theoretic Analyses


We have discussed the hypothesized importance of tag specificity in determining
whether or not a tag serves as an effective retrieval aid, and here describe some analyses testing


FIGURE 6.4 Mean normalized playcount for user-artist listening time series tagged a
given number of times.

the hypothesis that the tags used as retrieval cues6 should have moderate levels of
specificity. A useful mathematical formalization of “specificity” for our purposes
here is the information theoretic notion of entropy, as defined by Shannon (1948).
Entropy (H) is effectively a measure of uncertainty in the possible outcomes of a
random variable. It is defined as

    H(X) = -\sum_{i} P(x_i) \log_b P(x_i)    (1)

where P(x_i) is the probability of random variable X having outcome x_i, and b is
the base of the logarithm. We follow the convention of using b = 2 such that values
of H are measured in bits. The greater the value of H, the greater the uncertainty
in the outcome of X. We can thus define the entropy of a tag by thinking of it as a
random variable, whose possible “outcomes” are the different artists to which it is
assigned. The more artists a tag is assigned to, and the more evenly it is distributed
over those artists, the higher its entropy. H thus provides just the sort of specificity
measure we need. High values of H correspond to low specificity, and low values
of H indicate high specificity (H = 0 for a tag assigned to only one artist, as there
is zero uncertainty as to which artist the tag is associated with).
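As a minimal illustration of Equation (1) applied to a single tag, the R sketch below computes H (in bits) from invented counts of how many times a user applied that tag to each distinct artist.

```r
# Shannon entropy (in bits) of a tag, given the number of times a user applied
# it to each distinct artist. Implements Equation (1) with b = 2.
tag_entropy <- function(counts) {
  p <- counts / sum(counts)        # probability of each artist "outcome"
  -sum(p * log2(p))                # H = -sum p_i * log2(p_i)
}

# Toy examples (invented counts):
tag_entropy(c(12))                 # used for a single artist  -> H = 0 bits
tag_entropy(c(5, 5))               # split evenly over two     -> H = 1 bit
tag_entropy(c(8, 1, 1))            # skewed over three artists -> H ~ 0.92 bits
```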
We can define tag entropy at the level of an individual user's vocabulary, where
H for a given tag is calculated over the artists to which that user has assigned it;
we computed this value for each of every user's tags. We then binned all tags by their entropy (with a

FIGURE 6.5 Mean probability of listening each month (relative to the month in which
a tag is applied) for user-artist time series associated with tags of a given binned entropy
(bin width of 0.5 bits). Each line represents the mean listening for a particular entropy
bin, with line color indicating the entropy range for the bin (darker shades show lower
entropy). Highlighted are the listening probabilities associated with 0.0–0.5 bit entropy
tags (bold dashed line) and 0.5 to 1.0 bit entropy tags (bold solid line). The inset plots
show the total mean listening (i.e. sum over all values in each line from the main plot)
for each entropy bin (left), and the probability distribution of tags by entropy (right).

bin width of 0.5 bits), and for each bin retrieved all listening time series associated
with tags in that bin. We then determined the mean probability of listening to
those artists each month relative to the month when the tag was applied.
The results are shown in Figure 6.5. Each line shows the average probability of
a user listening to an artist at a time X months before or after tagging it, given
that the user annotated that artist with a tag in a given entropy range. Entropies
are binned in 0.5 bit increments, and entropy values are indicated by the color of
each line. Two obvious large-scale trends should be noted. First, consistent with
the earlier finding that tagging overwhelmingly occurs in the first month a user
listens to an artist, the probability of listening to an artist peaks in the month it is
tagged, and is greater in the months following the annotation than preceding it.
Second, there is a general trend of overall lower listening probabilities with higher
entropy, consistent with findings suggesting that greater tag specificity ought to
facilitate retrieval. But, in support of our “sweet spot” hypothesis, this trend is not
wholly monotonic. Tags with the lowest entropy (between 0.0 and 0.5 bits, dashed
bold line) are not associated with the highest listening probabilities; tags with low,
but not too low, entropy (between 0.5 and 1.0 bits, solid bold line) have the highest
rates of listening.
The left-hand inset plot is the probability distribution of total listening by
binned entropy (i.e. the mean sum total of normalized listening within each bin).
This is, effectively, a measure of the total amount of listening, on average, associated
with artists labeled with a tag in a given entropy bin, and makes clear the peak
for tags in the 0.5 to 1.0 bit range. Also of note is the relative stability of total
listening (excepting the aforementioned peak) up to around 7 bits of entropy, after
which total listening drops off rapidly. The right-hand inset plot is the probability
distribution of listening time series across tag entropy bins—or in other words, the
distribution of rates of tag use versus tag entropy. Very low entropy tags (0 to 0.5
bits) are clearly the most common, indicating the existence of many “singleton”
and low-use tags—that is, tags a user applies to only one, or very few, unique
artists. Ignoring these tags, however, we observe a unimodal, relatively symmetric
distribution peaked on the 5.0–5.5 bit entropy bin (marked with a vertical dashed
line) that corresponds more or less directly to the stable region of total listening in
the left-hand inset plot. Precisely what drives the preponderance of “singleton” tags
is not totally clear, but excluding them, these data do suggest that users demonstrate
a preference for moderate-entropy tags associated with relatively high listening
probabilities.
These results do not strongly suggest the existence of a single “sweet spot”
in entropy (the peak in the 0.5–1.0 bit bin may be partly due to noise, given
the relatively low frequency of tags in that entropy range), but do demonstrate
that there is not a simple, monotonic relationship between increased listening and
lower entropy values. Instead, we observed a range of entropy values (from 0.0 to
approximately 7.0 bits) that are associated with higher listening rates. We must be
cautious in drawing strong conclusions from these results, however. Because we
are collapsing tagging and listening activity by artist, we cannot know the number
of particular songs a user might retrieve with a given tag. Thus there may exist
dependencies between tag entropy and the number of associated songs that drive
mean listening rates higher or lower in a misleading manner. For example, a tag
that tends to only be associated with a small number of songs may show low mean
listening rates not because it is an ineffective retrieval cue, but because a small set
of songs may generate low listening rates compared with a larger set.
This is just one of various difficulties in interpreting large-scale data such
as these. When considering the average behavior of many heterogeneous users,
normalization and other transformations (such as our normalization of playcounts
to account for variation in users’ overall listening levels) are necessary, but can
interact with derived measures (such as our entropy calculations) in complex,
sometimes unexpected ways. As we continue this research program, we will need
to further evaluate and refine the normalization methods we employ. Nonetheless,
these early results are suggestive of systematic, meaningful relationships between
listening habits and tag specificity.

Next Steps: Causal Analyses


The major shortcoming of the results we have presented thus far is that they
cannot provide a causal argument in support of the memory cue hypothesis.
Tagging is certainly correlated with listening, and early results suggest that observed
tagging/listening relationships are, on average, in line with our hypotheses, but
this is insufficient to make a strong causal argument. There is no simple method to
address the critical question here: Does tagging an artist result in a user’s listening
to that artist being measurably different than it would have been had the user not tagged
the artist?
Without addressing the philosophical problems surrounding claims about “true”
causality, we are still tasked with testing if any sort of predictive causality exists
between tagging and subsequent changes in listening behavior. Several relevant
statistical methods exist, such as Granger causality (Granger, 1969), which tests
for a causal relationship between two time series, as well as new methods like
Bayesian structural time-series models (Brodersen, Gallusser, Koehler, Remy, &
Scott, 2014), which estimate the causal impact of an intervention on time series
data as compared to control data without an intervention. Although these and
related methods are powerful, their applicability to our case appears limited for
two reasons: First, tagging data are very sparse for any particular user-artist pair
(typically consisting of one, or only a few, annotations), making methods that
measure the impact of one time series on another, like Granger causality, untenable.
Second, and more importantly, it is currently difficult to determine—even if
tagging shows a predictive relationship with future listening—whether tagging
actually facilitates retrieval of resources, thereby increasing listening, or if it is
simply the case that users are more likely to tag those artists which they are
independently more likely to listen to. Methods like Granger causality are useful
when only two variables are involved, but cannot eliminate the possibility of a
third variable driving both processes (in our case, intrinsic interest in an artist on
the part of a user might increase both listening and the probability of tagging that
artist).
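For readers unfamiliar with the method, the sketch below shows roughly how a Granger test can be run in R with the lmtest package on a pair of monthly series; the series are simulated purely for illustration and, as noted above, the approach is poorly suited to the sparse tagging data considered here.

```r
library(lmtest)                                    # provides grangertest()

# Two simulated monthly series, purely for illustration:
# x might stand for tagging activity, y for listening activity.
set.seed(1)
n <- 60
x <- as.numeric(arima.sim(list(ar = 0.5), n = n))
y <- c(0, 0.6 * head(x, -1)) + rnorm(n, sd = 0.5)  # y loosely follows x with a one-month lag

# Does the history of x improve prediction of y beyond y's own history?
grangertest(y ~ x, order = 2)
```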
We are currently exploring methods to sidestep this problem, but it is without
doubt a challenging one. One possible approach may employ clustering methods
similar to those described above to identify similar partial listening time series. If
there exists a sufficient number of time series that show similar forms during the
first N months a user listens to an artist, and if enough of those time series are
tagged in month N , we can compare if and how tagged time series tend to diverge
from untagged time series once a tag is applied. This poses some computational
hurdles, and it is unclear if the sparsity of tagging data will permit such an analysis,
but we hope the approach will prove fruitful. We also aim to expand our analysis
to employ standard machine learning algorithms (such as support vector machines
and logistic regression models) to develop a classifier for categorizing tagged and
untagged time series. If a high-performing classifier based on listening behavior
can be developed, it would indicate that there are systematic differences in listening
behavior for tagged time series. This would suggest that tagging is not simply more
likely for those artists a user is likely to listen to anyway, but instead is associated
with distinctive patterns of listening.
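To make the idea concrete, the schematic sketch below fits a logistic regression that predicts the tagged/untagged label from a few summary features of a time series; the data are simulated and the features are invented placeholders, so this illustrates the general approach rather than the classifiers we will ultimately evaluate.

```r
# Schematic sketch: logistic regression separating tagged from untagged time
# series using summary features of listening (all data simulated here).
set.seed(7)
n <- 2000
d <- data.frame(
  total_listens = rpois(n, 40),            # total plays in the series
  months_active = rpois(n, 10),            # months with at least one listen
  peak_listens  = rpois(n, 15)             # plays in the peak month
)
p        <- plogis(-2 + 0.02 * d$total_listens + 0.05 * d$months_active)
d$tagged <- rbinom(n, 1, p)                # simulated label, weakly tied to the features

clf <- glm(tagged ~ total_listens + months_active + peak_listens,
           data = d, family = binomial)
summary(clf)

# In-sample accuracy at a 0.5 threshold (for illustration only)
mean((predict(clf, type = "response") > 0.5) == d$tagged)
```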
One approach that has borne fruit in ongoing work, building upon the time
series analysis methods described above, is the use of a regression model that
predicts future listening rates as a function of past listening rates and whether or
not a user-artist listening time series has been tagged (Lorince, Joseph, & Todd,
2015). Using a generalized additive model (GAM, Hastie & Tibshirani, 1990), our
dependent variable in the regression is the logarithm of the sum of all listens in the
six months after a tag has been applied, to capture the possible effect of tagging
over a wide temporal window (the results are qualitatively the same when testing
listening for each individual month, however), while our independent variables
are a binary indicator of whether or not the time series has been tagged, as
well as seven continuous-valued predictors, one each for the logarithm of total
listens in the month of peak listening7 in the time series and in each of the six
previous months. The regression equation is as follows, where m corresponds to
the month of peak listening, L is the number of listens in any given month, T is
the binary tagged/untagged indicator, and f represents the exponential-family
functions calculated in the GAM (there is a unique function f for each pre-peak
month, see Hastie & Tibshirani, 1990 for details):
    \log \sum_{i=1}^{6} L_{m+i} = b_0 + b_1 T + \sum_{i=0}^{6} f(\log L_{m-i})    (2)
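A model of this general form could be fit in R roughly as follows, for example with the mgcv package (one widely used GAM implementation). The data frame and variable names are invented placeholders rather than the authors' actual code, and the simulated data merely stand in for the aligned listening time series.

```r
library(mgcv)                               # one widely used GAM implementation in R

# Invented placeholder data, one row per user-artist time series:
#   logl0..logl6 : log listens in the peak month and the six preceding months
#   tagged       : 0/1 indicator of tagging in the peak month
#   log_post     : log of total listens in the six months after the peak
set.seed(3)
n    <- 5000
logl <- matrix(log1p(rpois(n * 7, 20)), ncol = 7,
               dimnames = list(NULL, paste0("logl", 0:6)))
d    <- data.frame(logl, tagged = rbinom(n, 1, 0.3))
d$log_post <- 0.5 * d$logl0 + 0.15 * d$tagged + rnorm(n, sd = 0.3)

fit <- gam(log_post ~ tagged + s(logl0) + s(logl1) + s(logl2) +
             s(logl3) + s(logl4) + s(logl5) + s(logl6), data = d)
summary(fit)     # the coefficient on `tagged` plays the role of b1 in Equation (2)
```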

We refer the reader to the full paper for further details, but the model (after
imposing various constraints that permit temporal alignment of tagged and
untagged time series data) allows us to measure the effect of tagging an artist on
future listening, while controlling for users’ past listening rates. Our early results
suggest that tagging has a measurable, but quite small, effect on future listening. As
we cannot visualize the regression results for all model variables at once, Figure 6.6
instead displays the predicted difference in listening corresponding to tagging as
a function of the number of peak listens, calculated with a similar model, which
considers only the effect of listening in the peak month on post-peak listening.
This plot suggests, and the full model confirms, that, controlling for all previous
listening behavior, a tag increases the logarithm of post-peak listens by 0.147
(95 percent confidence interval = [0.144, 0.150]). In other words, a tag is associated
with roughly 1.15 times as many listens over six months, on average, as
FIGURE 6.6 Regression model results, showing predicted sum total of listening in
the 6 months after a tag is assigned as a function of the number of listens in the
month of peak listening in a time series. Results are shown on a log-log scale, and shaded
regions indicate a bootstrapped 95 percent confidence interval. Figure replicated
from Lorince et al. (2015).

if it had not been applied. These results thus suggest that tagging does
lead to increases in listening, but only very small ones. Further analysis comparing
the predictiveness of different tags for future listening (again, see the full paper
for details) indicates that only a small subset of the tags analyzed has
any significant effect on future listening. Taken together, these tentative results
provide evidence that tags certainly do not always function as memory cues, and
that facilitating later retrieval may actually be an uncommon tagging motivation.

Summary and Conclusions


In this chapter, we have made the following concrete contributions:
• A description of collaborative tagging systems, and how they offer valuable
data on people’s use of external memory cues in their day-to-day lives;
• a description of the “memory cue hypothesis,” and the value of empirically
testing it both for researchers specifically interested in tagging systems and
cognitive scientists interested in human memory cue use;
• a review of the challenges associated with testing the “memory cue hypothesis”
and a description of a new dataset that can help address them;
• two concrete hypotheses with respect to tagging and listening behavior that
should hold if tags do in fact serve as memory cues; and
• a set of analytic methods for exploring those hypotheses.


Studying human cognition “in the wild” simultaneously presents great promise
and difficult challenges. Big Data like that described here permit correlational
analysis on a large scale, with often compelling results, but can leave causal
relationships difficult to discern. The time series and information theoretic analysis
methods we have introduced do provide evidence that, on average, music tagging
and listening behavior interact in a way consistent with the memory cue hypothesis
insofar as tagging is associated with greater levels of listening and that moderate
entropy tags are most strongly correlated with high listening probabilities. But as we
have discussed, much work remains to be done to determine whether a compelling
causal case can be made: Does tagging actually cause increases in listening that
would not have occurred otherwise, specifically by facilitating retrieval? Our early
results using a regression model suggest otherwise.
A second issue, particularly relevant to our data, but problematic in any study of
“choice” in web environments, is the pervasiveness of recommendation systems. In
comparing listening and tagging patterns, we have made the tacit assumption that
users are making (more or less) intentional decisions about their music listening.
In reality, however, an unknown proportion of users’ listening is driven not by the
active choice to listen to a particular artist (whether or not it is mediated by usage
of a tag), but instead by the algorithms of a recommendation engine.8
These are challenges faced in any “Big Data” scenario, but a secondary issue
is particularly relevant for psychologists and other researchers interested in making
claims about individual cognitive processes. By analyzing and averaging data from
many thousands of users, we are essentially describing the activity of an “average
user,” but must be hesitant to claim that any particular user behaves in the manner
our results suggest. Even if aggregate data suggest that tags do (or do not) function
as memory cues, we must remain sensitive to the limits on the conclusions we can
draw from such findings. Large-scale data analysis is a valuable tool for psycholog-
ical researchers, but must be interpreted with care. This is particularly important
given the non-normal distribution of tagging behavior observed in our data.
Although our results are tentative, we have presented an informative case
study of human memory cue use in a real-world environment (digital though
it may be), and a suite of tools for analyzing it. Our hope is that this work
has provided evidence of the usefulness of collaborative tagging data for studying
human memory and categorization, an introduction to some of the methods we
can employ for research in this domain, and more generally an example of the
power of Big Data as a resource for cognitive scientists.

Notes
1 The original definition contains a fourth element, such that F := (U, T, R,
Y, ≺). The last term, ≺, represents a user-specific subtag/supertag relation that
folksonomy researchers (including the authors who define it) do not typically
examine, and we do not discuss it here.
2 When crawling a user’s listening history, we are able to determine song names
and the associated artist names, but not the corresponding album names.
3 This is not to say that such tags are never useful. We can imagine the generation
of highly specific cues (such as “favorite song of 1973”) that are associated with
one or only a few targets, but are still useful for retrieval. As we will see below,
however, such high specificity tags are not strongly associated with increased
listening on average.
4 This work is not yet published, but see the following URL for some methodological details: https://jlorince.github.io/archive/pres/Chasm2014.pdf.
5 For these analyses, we also applied a Gaussian smoothing kernel to all time
series, and performed clustering on a random subset of 1 million time series,
owing to computational constraints. Qualitative results hold over various
random samples, however.
6 This is not to say that all tags are used as retrieval cues, only that those are the
ones that this hypothesis applies to. How to determine which tags are used as
retrieval cues and which are not is a separate question we do not tackle here;
for the purposes of these analyses we assume that such tags exist in sufficient
numbers for us to see the proposed pattern in the data when considering all
tags.
7 Our methods align all time series to month of peak listening, and consider only
tagged time series where the tag was applied in that peak month.
8 Because the Last.fm software can track listening from various sources, a given
scrobble can represent a direct choice to listen to a particular song/artist, a
recommendation generated by Last.fm, or a recommendation from another
source, such as Pandora or Grooveshark.

References
Ames, M., & Naaman, M. (2007). Why we tag: Motivations for annotation in mobile
and online media. In Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems (pp. 971–980). ACM.
Block, L. G., & Morwitz, V. G. (1999). Shopping lists as an external memory aid for
grocery shopping: Influences on list writing and list fulfillment. Journal of Consumer
Psychology, 8(4), 343–375.
Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., & Scott, S. L. (2014). Inferring
causal impact using Bayesian structural time-series models. Annals of Applied Statistics, 9,
247–274.
Cattuto, C., Baldassarri, A., Servedio, V. D., & Loreto, V. (2007). Vocabulary growth
in collaborative tagging systems. arXiv preprint. Retrieved from https://arxiv.org/abs/
0704.3316.
Cattuto, C., Loreto, V., & Pietronero, L. (2007). Semiotic dynamics and collaborative
tagging. Proceedings of the National Academy of Sciences, 104(5), 1461–1464.
Earhard, M. (1967). Cued recall and free recall as a function of the number of items per
cue. Journal of Verbal Learning and Verbal Behavior, 6(2), 257–263.
Floeck, F., Putzke, J., Steinfels, S., Fischbach, K., & Schoder, D. (2011). Imitation
and quality of tags in social bookmarking systems—collective intelligence leading to
folksonomies. In T. J. Bastiaens, U. Baumöl, & B. J. Krämer (Eds.), On collective intelligence
(pp. 75–91). Berlin: Springer International Publishing.
Glushko, R. J., Maglio, P. P., Matlock, T., & Barsalou, L. W. (2008). Categorization in the
wild. Trends in Cognitive Sciences, 12(4), 129–135.
Golder, S. A., & Huberman, B. A. (2006). Usage patterns of collaborative tagging systems.
Journal of Information Science, 32(2), 198–208.
Granger, C. W. J. (1969). Investigating causal relations by econometric models and
cross-spectral methods. Econometrica, 37(3), 424–438.
Gupta, M., Li, R., Yin, Z., & Han, J. (2010). Survey on social tagging techniques. ACM
SIGKDD Explorations Newsletter, 12(1), 58–72.
Halpin, H., Robu, V., & Shepherd, H. (2007). The complex dynamics of collaborative
tagging. In Proceedings of the 16th International Conference on World Wide Web (pp. 211–220).
ACM.
Harris, J. E. (1980). Memory aids people use: Two interview studies. Memory & Cognition,
8(1), 31–38.
Hastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models (Vol. 43). London:
CRC Press.
Heckner, M., Mühlbacher, S., & Wolff, C. (2008). Tagging tagging: Analysing user
keywords in scientific bibliography management systems. Journal of Digital Information
(JODI), 9(2), 1–19.
Higbee, K. L. (1979). Recent research on visual mnemonics: Historical roots and
educational fruits. Review of Educational Research, 49(4), 611–629.
Hotho, A., Jäschke, R., Schmitz, C., & Stumme, G. (2006). Information retrieval in
folksonomies: Search and ranking. In Proceedings of 3rd European Semantic Web Conference
(ESWC) (pp. 411–426). Springer International Publishing.
Hunt, R. R., & Seta, C. E. (1984). Category size effects in recall: The roles of relational
and individual item information. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 10(3), 454.
Hunter, I. M. L. (1979). Memory in everyday life. In M. M. Gruneberg & P. E. Morris
(Eds.), Applied problems in memory (pp. 1−11). London: Academic Press.
Intons-Peterson, M. J., & Fournier, J. (1986). External and internal memory aids: When
and how often do we use them? Journal of Experimental Psychology: General, 115(3), 267.
Jäschke, R., Marinho, L., Hotho, A., Schmidt-Thieme, L., & Stumme, G. (2007). Tag
recommendations in folksonomies. In Knowledge discovery in databases: PKDD 2007
(pp. 506–514). Berlin: Springer International Publishing.
Kausler, D. H. (1974). Psychology of verbal learning and memory. New
York: Academic Press.
Körner, C., Benz, D., Hotho, A., Strohmaier, M., & Stumme, G. (2010). Stop thinking,
start tagging: Tag semantics emerge from collaborative verbosity. In Proceedings of the 19th
International Conference on World Wide Web (pp. 521–530). ACM.
Körner, C., Kern, R., Grahsl, H.-P., & Strohmaier, M. (2010). Of categorizers and
describers: An evaluation of quantitative measures for tagging motivation. In Proceedings
of the 21st ACM Conference on Hypertext and Hypermedia (pp. 157–166). ACM.
Lorince, J., & Todd, P. M. (2013). Can simple social copying heuristics explain tag
popularity in a collaborative tagging system? In Proceedings of the 5th Annual ACM Web
Science Conference (pp. 215–224). ACM.
Lorince, J., Joseph, K., & Todd, P. M. (2015). Analysis of music tagging and listening
patterns: Do tags really function as retrieval aids? In Proceedings of the 8th Annual Social
Computing, Behavioral-Cultural Modeling and Prediction Conference (SBP 2015). Washington,
D.C.: Springer International Publishing.
Lorince, J., Zorowitz, S., Murdock, J., & Todd, P. M. (2014). “Supertagger” behavior
in building folksonomies. In Proceedings of the 6th Annual ACM Web Science Conference
(pp. 129–138). ACM.
Lorince, J., Zorowitz, S., Murdock, J., & Todd, P. M. (2015). The wisdom of the few?
“supertaggers” in collaborative tagging systems. The Journal of Web Science, 1(1), 16–32.
Macgregor, G., & McCulloch, E. (2006). Collaborative tagging as a knowledge organisation
and resource discovery tool. Library Review, 55(5), 291–300.
Marlow, C., Naaman, M., Boyd, D., & Davis, M. (2006). HT06, tagging paper, taxonomy,
Flickr, academic article, to read. In Proceedings of the 17th Conference on Hypertext and
Hypermedia (pp. 31–40). ACM.
Moscovitch, M., & Craik, F. I. (1976). Depth of processing, retrieval cues, and uniqueness
of encoding as factors in recall. Journal of Verbal Learning and Verbal Behavior, 15(4),
447–458.
Noll, M. G., Au Yeung, C.-M., Gibbins, N., Meinel, C., & Shadbolt, N. (2009). Telling
experts from spammers: Expertise ranking in folksonomies. In Proceedings of the 32nd
International ACM SIGIR Conference on Research and Development in Information Retrieval
(pp. 612–619). ACM.
Nov, O., Naaman, M., & Ye, C. (2008). What drives content tagging: The case of photos
on Flickr. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
(pp. 1097–1100). ACM.
Robu, V., Halpin, H., & Shepherd, H. (2009). Emergence of consensus and shared
vocabularies in collaborative tagging systems. ACM Transactions on the Web (TWEB), 3(4),
1–34.
Rutherford, A. (2004). Environmental context-dependent recognition memory effects:
An examination of ICE model and cue-overload hypotheses. The Quarterly Journal of
Experimental Psychology Section A, 57(1), 107–127.
Schifanella, R., Barrat, A., Cattuto, C., Markines, B., & Menczer, F. (2010). Folks in
folksonomies: Social link prediction from shared metadata. In Proceedings of the 3rd ACM
International Conference on Web Search and Data Mining (pp. 271–280). ACM.
Seitlinger, P., Ley, T., & Albert, D. (2013). An implicit-semantic tag recommendation
mechanism for socio-semantic learning systems. In T. Ley, M. Ruohonen, M. Laanpere,
& A. Tatnall (Eds.), Open and Social Technologies for Networked Learning (pp. 41–46). Berlin:
Springer International Publishing.
Sen, S., Lam, S. K., Rashid, A. M., Cosley, D., Frankowski, D., Osterhouse, J., . . . Riedl,
J. (2006). Tagging, communities, vocabulary, evolution. In Proceedings of the 2006 20th
Anniversary Conference on Computer Supported Cooperative Work (pp. 181–190). ACM.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical
Journal, 27, 379–423.
Shirky, C. (2005). Ontology is overrated: Categories, links, and tags. Retrieved from
www.shirky.com/writings/ontology_overrated.html.
Sterling, B. (2005). Order out of chaos. Wired Magazine, 13(4).
Tullis, J. G., & Benjamin, A. S. (2014). Cueing others’ memories. Memory & Cognition,
43(4), 634–646.
Tulving, E., & Pearlstone, Z. (1966). Availability versus accessibility of information in
memory for words. Journal of Verbal Learning and Verbal Behavior, 5(4), 381–391.
Vander Wal, T. (2007). Folksonomy coinage and definition. Retrieved July 29, 2014, from
www.vanderwal.net/folksonomy.html.
Von Ahn, L., & Dabbish, L. (2004). Labeling images with a computer game. In Proceedings
of the SIGCHI Conference on Human Factors in Computing Systems (pp. 319–326). ACM.
Watkins, O. C., & Watkins, M. J. (1975). Buildup of proactive inhibition as a cue-overload
effect. Journal of Experimental Psychology: Human Learning and Memory, 1(4), 442.
Weinberger, D. (2008). Everything is miscellaneous: The power of the new digital disorder (1st
edn.). New York: Holt Paperbacks.
Weist, R. M. (1970). Optimal versus nonoptimal conditions for retrieval. Journal of Verbal
Learning and Verbal Behavior, 9(3), 311–316.
Weng, L., & Menczer, F. (2010). GiveALink tagging game: An incentive for social
annotation. In Proceedings of the ACM SIGKDD Workshop on Human Computation (pp.
26–29). ACM.
Weng, L., Schifanella, R., & Menczer, F. (2011). The chain model for social tagging game
design. In Proceedings of the 6th International Conference on Foundations of Digital Games (pp.
295–297). ACM.
Yeung, C.-M. A., Noll, M. G., Gibbins, N., Meinel, C., & Shadbolt, N. (2011). SPEAR:
Spamming-Resistant Expertise Analysis and Ranking in collaborative tagging systems.
Computational Intelligence, 27(3), 458–488.
Zollers, A. (2007). Emerging motivations for tagging: Expression, performance, and
activism. In Workshop on Tagging and Metadata for Social Information Organization,
held at the 16th International World Wide Web Conference.
Zubiaga, A., Körner, C., & Strohmaier, M. (2011). Tags vs shelves: From social tagging to
social classification. In Proceedings of the 22nd ACM Conference on Hypertext and Hypermedia
(pp. 93–102). ACM.
7
FLICKR® DISTRIBUTIONAL TAGSPACE
Evaluating the Semantic Spaces Emerging from
Flickr® Tag Distributions

Marianna Bolognesi

Abstract
Flickr users tag their personal pictures with a variety of keywords. Such annotations could
provide genuine insights on salient aspects emerging from the personal experiences that
have been captured in the picture, which range beyond the purely visual features, or the
language-based associations. Mining the emergent semantic patterns of these complex open-
ended large-scale bodies of uncoordinated annotations provided by humans is the goal of this
chapter. This is achieved by means of distributional semantics, i.e. by relying on the idea
that concepts that appear in similar contexts have similar meanings (e.g. Latent Semantic
Analysis, LSA, Landauer & Dumais 1997). This chapter presents the Flickr Distributional
Tagspace (FDT), a distributional semantic space built on Flickr tag co-occurrences, and
evaluates it as follows: (1) through a comparison between the semantic representations that
it produces, and those that are obtained from speaker-generated features norms collected
in an experimental setting, as well as with WordNet-based metrics of semantic similarity
between words; (2) through a categorization task and a consequent cluster analysis.

The results of the two studies suggest that FDT can deliver semantic representations
that correlate with those that emerge from aggregations of features norms, and can
cluster fairly homogeneous categories and subcategories of related concepts.

Introduction
The large-scale collections of user-generated semantic labels that can be easily
found online have recently prompted the interest of research communities that focus
on the automatic extraction of meaning from large-scale unstructured data, and the
creation of bottom-up methods of semantic knowledge representation. Fragments
of natural language such as tags are today exploited because they provide contextual
cues that can help solve problems in computer vision research; for example, the
queries in the Google image searching browser, where one can drag an image
and obtain in return other images that are visually similar, can be refined by
providing linguistic cues. For this reason, there is a growing interest in analyzing
(or “mining”) these linguistic labels for identifying latent recurrent patterns and
extracting new conceptual information, without referring to a predefined model,
such as an ontology or a taxonomy. Mining these large sets of unstructured data
retrieved from social networks (also called Big Data), seems more and more crucial
for uncovering aspects of the human cognitive system, to track trends that are
latently encoded in the usage of specific tags, and to fuel business intelligence and
decision-making in the industry sector (sentiment analysis and opinion mining).
The groupings of semantic labels attributed to digital documents by users, and
the semantic structures that emerge from such uncoordinated actions, known as
folksonomies (folk-taxonomies), are today widely studied in multimedia research
to assess the content of different digital resources (see Peters & Weller, 2008 for an
overview), relying on the “wisdom of the crowd”: If many people agree that
a web page is about cooking, then with high probability it is about cooking
even if its content does not include the exact word “cooking.” Although several
shortcomings of folksonomies have been pointed out (e.g. Peters & Stock, 2007),
this bottom-up approach of collaborative content structuring is seen as the next
transition toward the Web 3.0, or Semantic Web.
Whereas the multimedia researchers who aim to implement new tools for tag
recommendations, machine-tagging, and information retrieval in the semantic
web are already tackling the new challenges set by these resources, in cognitive
science task-oriented data collected in experimental settings still seem to be the
preferred type of empirical data, because we know little about how and to what
extent Big Data can be modeled to reflect human behavior in typically human
cognitive operations.

Related Work: Monomodal and Multimodal Distributional Semantics
In the past 20 years several scholars managed to harvest and model word
meaning representations by retrieving semantic information from large amounts
of unstructured data, relying on the distributional hypothesis (Harris, 1954; Firth,
1957). The distributional hypothesis suggests that words that appear in similar
contexts tend to have similar meanings. Distributional models allow the retrieval
of paradigmatic relations between words that do not themselves co-occur, but that
co-occur with the same other terms: Book and manual are distributionally similar
because the two words are used in similar sentences, not because they are often
used together in the same sentence. Such models have been classically built from
the observation of word co-occurrences in corpora of texts (Baroni & Lenci, 2010;
Burgess & Lund, 1997; Landauer & Dumais, 1997; Sahlgren, 2006; Turney &
Pantel, 2010; Rapp, 2004), and for this reason they have been often “accused”
of yielding language-based semantic representations, rather than experience-based
semantic representations.
In order to overcome the limitations of the language-based distributional
models, there have been recent attempts to create hybrid models, in which
the semantic information retrieved from word co-occurrences is combined with
perceptual information, retrieved in different ways, such as from human-generated
semantic features (Andrews, Vigliocco, & Vinson, 2009; Johns & Jones, 2012;
Steyvers, 2010) or from annotated images, under the assumption that images
provide a valid proxy for perceptual information (see, for example, Bruni, Tran,
& Baroni, 2014). Image-based information has been proven to be non-redundant
and complimentary to the text-based information, and the multimodal models
in which the two streams of information are combined perform better than
those based on solely linguistic information (Andrews et al., 2009; Baroni &
Lenci, 2008; Riordan & Jones, 2011). In particular, it has been shown that
while language-based distributional models capture encyclopedic, functional, and
discourse-related properties of words, hybrid models can also harvest perceptual
information, retrieved from images.
Such hybrid models constitute a great leap forward in the endeavor of modeling
human-like semantic knowledge by relying on the distributional hypothesis and on
large amounts of unstructured, human-generated data. Yet, I believe, they present
some questionable aspects, which I hereby summarize.
Combining text-derived with image-derived information by means of sophis-
ticated techniques appears to be an operation that is easily subject to error (how
much information shall be used from each stream and why? Does the merging
technique make sense from a cognitive perspective?). Moreover, this operation
seems to lean too much toward a strictly binary distinction between visual
versus linguistic features (respectively retrieved from two separate streams), leaving
aside other possible sources of information (e.g. emotional responses, cognitive
operations, other sensory reactions that are not captured by purely visual or purely
linguistic corpora).
Furthermore, the way in which visual information is retrieved from images
might present some drawbacks. For example, image-based information included
in hybrid models is often collected through real-time “games with a purpose,”
created ad hoc for stimulating descriptions of given stimuli from individuals, or
coordinated responses between two or more users (for a comprehensive overview,
see Thaler, Simperl, Siorpaes, & Hofer, 2011). In the popular ESP game (Ahn &
Dabbish, 2004, licensed by Google in 2006), for example, two remote participants
that do not know each other have to associate words with a shared image, trying to
coordinate their choices and produce the same associations as fast as possible, thus
forcing each participant to guess how the other participant would “tag” the image.
Although the entertaining nature of these games is crucial to keep the participants
motivated during the task, and involves little or no expense, the specific instructions
provided to the contestants can constrain the range of associations that a user might
attribute to a given stimulus, and trigger ad hoc responses that provide only partial
insights on the content of semantic representations. As Weber, Robertson, and
Vojnovic show (2008), ESP gamers tend to match their annotations to colors, or
to produce generic labels to meet the other gamer quickly, rather than focusing
on the actual details and peculiarities of the image. The authors also show that
a “robot” can predict fairly appropriate tags without even seeing the image. In
addition, ESP as well as other databases of annotated images harvest annotations
provided by people who are not familiar with the images: images are provided
by the system. Arguably, such annotations reflect semantic knowledge about the
concepts represented, which are processed as categories (concept types), rather
than individual experiential instances (concept tokens). Thus, such images cannot
be fully acknowledged to be a good proxy of sensorimotor information, because
there has not been any sensorimotor experience: The annotator has not experienced
the exact situation captured by the image.
Finally, in hybrid models the texts and the images used as sources of information
have been produced/processed by different populations, and thus they may not be
comparable.
Motivated by these concerns, my research question is the following: Can we
build a hybrid distributional space that (1) is based on a unique but intrinsically
variegated source of semantic information, so as to avoid the artificial and arbitrary
merging of linguistic and visual streams; (2) contains spontaneous and therefore
richer data, which are not induced by specific instructions or time constraints such
as in the online games; (3) contains perceptual information that is derived from
direct experience; (4) contains different types of semantic information (perceptual,
conceptual, emotional, etc.) provided by the same individuals in relation to specific
stimuli; (5) is based on dynamic, noisy, and constantly updated (Big) Data?
As explained below, the answer can be found in Flickr Distributional Tagspace
(FDT), a distributional semantic space based on Flickr tags. Big Data meets
cognitive science.

Flickr Distributional Tagspace


FDT is a distributional semantic space based on Flickr tags, i.e. linguistic labels
associated with the images uploaded on Flickr. As a distributional semantic space,
FDT delivers tables of proximities among words, built from the observation of tags’
covariance across large amounts of Flickr images: Two tags that appear in similar
pictures have similar meanings, even though the two tags do not appear together in
the same pictures.

The Flickr Environment


Flickr is a video/picture hosting service powered by Yahoo!. All the visual
contents hosted on Flickr are user-contributed (they are personal pictures and
videos provided by registered users), and spontaneously tagged by users themselves.
Tagging rights are restricted to self-tagging (and at best permission-based tagging,
although in practice self-tagging is most prevalent, see Marlow, Naaman, Boyd, &
Davis, 2006 for further documentation). Moreover, the Flickr interface mostly
affords for blind-tagging instead of suggested-tagging, i.e. tags are not based on a
dictionary, but are freely chosen from an uncontrolled vocabulary, and thus might
contain spelling mistakes, invented words, etc. Users can attribute a maximum of
75 tags to each picture, and this group of tags constitutes the image’s tagset.
To the best of my knowledge there has been only one attempt to systematically
categorize the tags attributed to pictures in Flickr. Such classification, performed
by Beaudoin (2007), encompasses 18 post hoc created categories, which include
syntactic property types (e.g. adjectives, verbs), semantic classes (human partici-
pants, living things other than humans, non-living things), places, events/activities
(e.g. wedding, Christmas, holidays), ad hoc created categories (such as photographic
vocabulary, e.g. macro, Nikon), emotions, formal classifications such as terms
written in any language other than English, and compound terms written as one
word (e.g. mydog). Of all the 18 types of tags identified, Beaudoin reports that
the most frequent are: (1) Geographical locations, (2) compounds, (3) inanimate
things, (4) participants, and (5) events.
The motivations that stimulate the tagging process in Flickr, as well as in other
digital environments, have been classified in different ways, the most popular being a
macro-distinction between categorizers (users who employ shared high-level features
for later browsing) and describers (users who accurately and precisely describe
resources for later searching) (Körner, Benz, Hotho, Strohmaier, & Stumme,
2010). While Flickr users are homogeneously distributed across these two types,
ESP users for example are almost all describers (Strohmaier, Körner, & Kern,
2012). Other models suggest different categories of tagging motivations: Marlow et
al. (2006) suggest a main distinction between organizational and social motivations;
Ames and Naaman (2007) suggest a double distinction, between self versus
social tagging, and organization versus communication-driven tags; Heckner,
Heilemann, and Wolff (2009) suggest a distinction between personal information
management versus resource sharing; Nov, Naaman, and Ye (2009) propose
a wider range of categories for tagging motivation, which include enjoyment,
commitment, self-development, and reputation. In general, all classifications
suggest that Flickr users tend to attribute to their pictures a variety of tags that
ranges beyond the purely linguistic associations of the purely visual features,
suggesting that Flickr tags indeed include a wide variety of semantic information,
which makes this environment an interesting corpus of dynamic, noisy, accessible,
and spontaneous Big Data.
Because all Flickr contents are user-contributed, they represent personal
experiences lived by the user and then reported on the social network through
photographs. Thus, each image can be considered as a visual proxy for the
actual experience lived by the photographer and captured in the picture. In fact,
operations such as post-processing, image manipulation, and editing seem to be
used by Flickr users to improve the appearance of the pictures, rather than to
create artificial scenes such as, for example, the conceptual images created ad hoc by
advertisers and creative artists, where entities are artificially merged together and
words are superimposed. However, at this stage this is a qualitative consideration,
and would require further (quantitative) investigation.
Although (as described above) the motivations for tagging personal pictures
on Flickr may differ across the variety of users, each tag can be defined as a
salient feature stimulated by the image, which captures an experience lived by the
photographer. These features (these tags) are not simply concrete descriptors of
the visual stimulus, but they often denote cognitive operations, associated entities,
and emotions experienced in that situation or triggered later on by the picture
itself, which are encoded in the tags. Being an a posteriori process, tagging also
includes cognitive operations that range beyond the purely visual features but are
still triggered by the image.

The Distributional Tagspace


This work builds upon an exploratory study proposed in Bolognesi (2014), where
the idea of exploiting the user-generated tags from Flickr for creating semantic
representations that encompass perceptual information was first introduced. The
claim was investigated through a study based on an inherently perceptual domain:
the words that denote primary and secondary colors. The covariance of the tags red,
orange, yellow, green, blue, and purple across Flickr images was analyzed, and as a result
the pairwise proximities between all the six tags were plotted in a dendrogram.
The same thing was done by retrieving the semantic information about the six
color terms through two distributional models based on corpora of texts (LSA,
Landauer & Dumais 1997; DM, Baroni & Lenci 2010). The cluster analysis based
on Flickr tags showed a distribution of the colors that resembled the Newton
color wheel (or the rainbow), which is also the distribution of the wavelengths
perceived by the three types of cones that characterize the human eye, thanks to
which we are sensitive to three different chromatic spectra. On the other hand, the
two “blind” distributional models based on corpora of texts, and therefore on the
solely linguistic information, could not reproduce the same order: In the “blind”
distributional models the three primary colors were closer to one another, and the
tag green was in both cases the farthest one, probably due to the fact that the word
green is highly polysemic.
That first investigation, aimed at analyzing the distribution of color terms across
Flickr images’ tags, showed that it is possible to actually capture rich semantic
knowledge from the Flickr environment, and that this information is missed by
(two) distributional models based on solely linguistic contexts.

Implementing FDT
The procedure for creating an FDT semantic space relies on the following steps, as
was first illustrated in Bolognesi (2014). All the operations can be easily performed
in the R environment for statistical analyses (for these studies the R version 2.14.2
was used), while the raw data (tagsets) can be downloaded from Flickr, through
the freely available Flickr API services.1

1 Set up the pool of chosen concepts to be analyzed and represented in the
distributional semantic space.
2 Download from Flickr® a corpus of tagsets that include each of the target
concepts as a tag. The metadata must be downloaded through the API
flickr.photos.search, whose documentation can be found on the Flickr website:
(www.flickr.com/services/api/explore/flickr.photos.search.html). In order to
implement FDT the arguments api_key, tags, and extras need to be used. In
api_key one should provide the Flickr-generated password to sign each call;
in tags one should provide the concepts to be mined; in extras one should
indicate owner_name and tags. The reason for including the field owner_name
is explained in point 4, while tags is needed to obtain the tagsets. There are
several other optional arguments in flickr.photos.search, and they can be used
to further filter the results, such as for example the date of upload. The
number of pictures to be downloaded for each target concept depends on
their availability. As a rule of thumb, it is preferable to download roughly
100,000 pictures for each concept and then concatenate the obtained samples
into one corpus (uploaded on R as a dataframe). An informal investigation has
shown that smaller amounts of pictures for each tag produce variable semantic
representations of the target concept, while for more than 100,000 tagsets per
concept the resulting semantic representation remains stable. Thus, in order to
keep the computations fast, 100,000 tagsets per concept is the optimal value.
The tagsets download can be performed with the open source command-line
utility implemented by Buratti (2011) for unsupervised downloads of metadata
from flickr.com. This powerful tool is hosted on code.google.com and can be
freely downloaded (a minimal sketch of a direct API call, and of the filtering
in steps 3-5, is given after this list of steps).
3 After concatenating the tagsets into one dataframe, they should be cut at the
fifteenth tag so that the obtained corpus consists of tagsets of 15 tags each.2
This operation is done in order to keep only the most salient features that
users attribute to a picture, which are arguably tagged first.
4 Subset (i.e. filter) the concatenated corpus, in order to drop the redundant
tagsets that belong to the same user, and thus keep only the unique tagsets for
each user (each owner name). This operation should be done to avoid biased
frequencies among the tags’ co-occurrences, due to the fact that users often
tag batches of pictures with the same tagset (copied and pasted). For example,
on a sunny Sunday morning a user might take 100 pictures of a flower, upload
all of them, and tag them with the same tags “sunny,” “Sunday,” “morning,”
“flower.” In FDT only one of these 100 pictures taken by the same user is
kept.
5 Another filtering of the corpus should be done, by dropping those tagsets
where the concept to be analyzed appears after the first three tags.3 This
allows one to keep only those tagsets that describe pictures for which a target
concept is very salient (and therefore is mentioned among the first three tags).
Pictures described by tagsets where the target concept appears late are not
considered to be representative for the given concept.
6 Build the matrix of co-occurrences, which displays the frequencies with
which each target concept appears in the same picture with each related tag.
This table will display the target concepts on the rows and all of the other
tags, that co-appear with each of the target concepts across the downloaded
tagsets, on the columns. The raw frequencies of co-occurrence reported in the
cells should then be turned into measures of association. The measure used
for this distributional semantic space is an adaptation of the pointwise mutual
information (Bouma, 2009), in which the joint co-occurrence of each tag
pair is squared, before dividing it by the product of the individual occurrences
of the two tags. Then, the obtained value is normalized by multiplying the
squared joint frequency by the sample size (N). This double operation (not
very different from that performed in Baroni & Lenci, 2010) is done in order
to limit the general tendency of the mutual information, to give weight to
highly specific semantic collocates, despite their low overall frequency. This
measure of association is formalized as follows:
    \mathrm{SPMI} = \log\left( N \cdot \frac{f_{a,b}^{2}}{f_a \times f_b} \right)
where a and b are two tags, f stands for frequency of occurrence (joint
occurrence of a with b in the numerator and individual occurrences of a
and b in the denominator), and N is the corpus size. The obtained value
approximates the likelihood of finding a target concept and each other tag
appearing together in a tagset, taking into account their overall frequency in
the corpus, the frequency of their co-appearance within the same tagsets, and
the sample size. Negative values, as commonly done, are raised to zero.

TABLE 7.1 The three main differences between LSA and FDT, pertaining to
context type (of the co-occurrence matrix), measure of association between an
element and a context, and dimensionality reduction applied before the
computation of the cosine.

                           LSA                                    FDT
Context                    Documents of text (the matrix of       Tagsets (the matrix of
                           co-occurrences is word by document)    co-occurrences is word by word)
Measure of association     typically tf-idf (term frequency–      SPMI
                           inverse document frequency)
Dimensionality reduction   SVD (singular value decomposition),    None, the matrix is dense.
                           used because the matrix is sparse.

7 Turn the dataframe into a matrix, so that each row constitutes a concept’s
vector, and calculate the pairwise cosines between rows. The cosine, a
commonly used metric in distributional semantics, expresses the geometrical
proximity between two vectors, which has to be interpreted as the
semantic similarity between two concepts. The obtained table represents the
multidimensional semantic space FDT.
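To make step 2 (and the subsequent filtering) more concrete, the sketch below shows what a single call to flickr.photos.search might look like from R using the httr and jsonlite packages, followed by a rough version of the filtering in steps 3-5. The API key is a placeholder, the parameter names follow the API documentation cited in step 2, and the live service and its response format should be checked before relying on this exact form.

```r
# Minimal sketch of one call to flickr.photos.search from R (see step 2).
# "YOUR_API_KEY" is a placeholder; check the current API documentation,
# since the service and its response format may change.
library(httr)
library(jsonlite)

resp <- GET("https://api.flickr.com/services/rest/",
            query = list(method         = "flickr.photos.search",
                         api_key        = "YOUR_API_KEY",
                         tags           = "dolphin",           # target concept to mine
                         extras         = "owner_name,tags",   # needed for the filtering steps
                         per_page       = 500,
                         page           = 1,
                         format         = "json",
                         nojsoncallback = 1))

photos <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$photos$photo
# If the call succeeds, `photos` is a data frame with one row per picture,
# including the owner name and a space-separated tagset in the `tags` column.
head(photos[, c("ownername", "tags")])

# Steps 3-5 (sketch): keep at most the first 15 tags, drop duplicate tagsets per
# owner, and keep only tagsets where the target appears among the first 3 tags.
tag_list <- lapply(strsplit(photos$tags, " "), head, 15)
keep     <- !duplicated(paste(photos$ownername, sapply(tag_list, paste, collapse = " "))) &
            sapply(tag_list, function(ts) "dolphin" %in% head(ts, 3))
tag_list <- tag_list[keep]
```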

All the steps illustrated in the procedure can be easily done with the basic R
functions, besides step 7 for which the package LSA is required. In fact, FDT is
substantially similar to LSA; yet, there are some crucial differences between FDT
and LSA, summarized in Table 7.1.
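As a rough, self-contained illustration of steps 6 and 7, the sketch below builds an SPMI-weighted target-by-tag matrix from a toy set of tagsets and then computes pairwise cosines with the lsa package. The tagsets, target concepts, and helper code are invented for illustration; a real corpus would of course be vastly larger.

```r
library(lsa)   # provides cosine(); all data below are invented toy examples

# Toy corpus of tagsets (each element = the tags of one picture)
tagsets <- list(c("dog", "pet", "animal", "park"),
                c("cat", "pet", "animal", "home"),
                c("dog", "animal", "beach"),
                c("cat", "animal", "sleep"),
                c("dog", "pet", "ball"),
                c("cat", "pet", "windowsill"),
                c("sunset", "beach", "sky"),
                c("sunset", "sky", "clouds"))
targets  <- c("dog", "cat", "sunset")           # concepts to be represented
all_tags <- sort(unique(unlist(tagsets)))
N        <- length(tagsets)                     # corpus size

# Step 6: co-occurrence counts (targets x tags) and individual tag frequencies
co <- sapply(all_tags, function(tg)
        sapply(targets, function(tc)
          sum(sapply(tagsets, function(ts) tc %in% ts && tg %in% ts && tc != tg))))
freq <- sapply(all_tags, function(tg) sum(sapply(tagsets, function(ts) tg %in% ts)))

# SPMI = log( N * f_ab^2 / (f_a * f_b) ), with negative values raised to zero
spmi <- log(N * co^2 / outer(freq[targets], freq, "*"))
spmi[!is.finite(spmi) | spmi < 0] <- 0

# Step 7: pairwise cosines between the target concepts' row vectors
cosine(t(spmi))   # lsa::cosine() works column-wise, hence the transpose
```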
A cluster analysis can finally provide a deeper look into the data. In the studies
described below, the data were analyzed in R through an agglomerative hierarchical
clustering algorithm, Ward's method (El-Hamdouchi & Willett, 1986; Ward,
1963), also called minimum variance clustering (see explanation of this choice in
Study Two). The Ward method works on Euclidean distances (thus the cosines
were transformed into Euclidean distances): It is a variance-minimizing approach,
which minimizes the sum of squared differences within all clusters and does not
demand the experimenter to set the amount of clusters in advance. In hierarchical
clustering each instance is initially considered a cluster by itself and the instances are
gradually grouped together according to the optimal value of an objective function,
which in Ward’s method is the error sum of squares. Conversely, the commonly
used k-means algorithms demand the experimenter to set the number of clusters
Flickr® Distributional Tagspace 153

in which he or she wants the data to be grouped. However, for observing the
spontaneous emergence of consistent semantic classes from wild data, it seems
preferable to avoid setting a fixed number of clusters in advance. In R it is possible
to use agglomerative hierarchical clustering methods through the function hclust.
An evaluation of the clustering solution, illustrated in the studies below, was
obtained with pvclust R package (Suzuki & Shimodaira, 2006), which allows
the assessment of the uncertainty in hierarchical cluster analysis. For each cluster
in hierarchical clustering, quantities called p-values are calculated via multiscale
bootstrap resampling. The p-value of a cluster is a value between 0 and 1, which indicates how strongly the cluster is supported by the data.4
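As an illustration (not the original analysis scripts), the Ward clustering and its bootstrap evaluation might look as follows, assuming fdt is the square cosine matrix produced in step 7; the exact cosine-to-distance transformation is not specified in the chapter, so the plain Euclidean distance between the rows of the cosine matrix is used here.

    library(pvclust)

    # Ward clustering on Euclidean distances derived from the rows of the cosine matrix
    d  <- dist(fdt, method = "euclidean")
    hc <- hclust(d, method = "ward.D")   # older versions of R accept method = "ward"
    plot(hc)                             # agglomerative dendrogram

    # Multiscale bootstrap assessment of cluster uncertainty (pvclust clusters the columns)
    pv <- pvclust(fdt, method.hclust = "ward.D",
                  method.dist = "euclidean", nboot = 1000)
    plot(pv)
    pvrect(pv, alpha = 0.95)             # highlight clusters supported with p > 0.95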

Study One
The purpose of this study was to evaluate FDT semantic representations against
those obtained from speaker-generated feature norms, and those obtained from
linguistic analyses conducted on WordNet. The research questions approached by
this task can be summarized as follows:
To what extent do the semantic representations created by FDT correlate with
the semantic representations based on human-generated features, and with those
emerging from the computation of semantic relatedness in WordNet, using three
different metrics?
In order to achieve this, the semantic representations of a pool of concepts,
analyzed with FDT, were compared through a correlation study to those obtained
from a database of human-generated semantic features, as well as to the similarities
obtained by computing the pairwise proximities between words in WordNet (three
different metrics).

Semantic Spaces and Concept Similarities in FDT and in McRae's Features Norms
Given the encouraging outcomes of the exploratory study conducted on color
terms, from which it emerged that FDT can capture perceptual information that is
missed by other distributional models based on corpora of texts, a new investigation
was conducted, aimed at comparing the distributional representations obtained
from FDT with those derived from the database of McRae’s features norms, a
standard that has often been used for evaluating how well distributional models
perform (e.g. Baroni, Evert, & Lenci, 2008; Baroni & Lenci, 2008; Shaoul &
Westbury, 2008).
McRae’s features norms is a database that covers 541 concrete, living, and
non-living basic-level concepts, which have been described by 725 subjects in
a property generation task: Given a verbal stimulus such as dolphin, participants had
to list the features that they considered salient for defining that animal. The features
produced in McRae’s database were then standardized and classified by property


types, according to two different sets of categories: The taxonomy proposed by
Cree and McRae (2003), and a modified version of the feature type taxonomy
proposed in Wu and Barsalou (2009). Both taxonomies are reported in McRae,
Cree, Seidenberg, and McNorgan (2005).
Moreover, McRae and colleagues released a distributional semantic space where
the proximities between each concept and the other 540 are measured through
the cosines between each two concept vectors, whose coordinates are the raw
frequencies of co-occurrence between a concept and each produced feature. The
resulting table is a square and symmetric matrix displaying all the proximities
between each pair of concepts, like a distance chart between cities. Each row (or column) of the matrix describes the distances (or rather, the proximities) of a given
concept against all the other concepts. In this study, a similar matrix was built
with FDT, analyzing the concepts co-occurrences across Flickr tags, and then the
lists of similarities characterizing the concepts in McRae’s features norms were
compared to the lists of similarities characterizing the concepts in FDT through
the computation of the Pearson correlation coefficient.
However, since in Flickr not all the concepts listed in McRae’s features norms
are well represented, only a subset of 168 concepts were selected because of their
high frequency among Flickr® tags (>100,000 photographs retrieved; e.g. airplane
was considered, while accordion was dropped because in Flickr the amount of tagsets
containing accordion among the first three tags was <<100,000). Concepts that
were ambiguous or highly polysemic were omitted (e.g. apple, which refers to both
the fruit and the brand) because in McRae’s features norms the distinct meanings of
polysemic words were disambiguated by the experimenter, while FDT, at this stage,
does not distinguish the different senses (e.g. the top contextual tags retrieved for
apple include juicy and picking, as well as computer and iPod). Finally, those concepts
that in Flickr trigger a meaning that is not covered in McRae’s features norms were
left aside (e.g. bomb in Flickr activates many contextual tags that refer to the graffiti
scenario, and the action of “bombing,” i.e. signing walls with personal logos). The
selected concepts are listed in Table 7.2.

Types of Features in Flickr and McRae’s Features Norms


The Wu & Barsalou (2009) taxonomy, used to label the concepts in McRae’s
features norms, distinguishes four classes of features that can be associated with
a given stimulus: ENTITY, SITUATION, TAXONOMIC, and INTROSPEC-
TIVE. Each of these classes includes several sub-types. For example, the class
ENTITY includes abstract entities that were associated with given stimuli (e.g.
given the stimulus church, the produced feature <associated with religion>),
behavior (airplane, <flies>), external and internal components (apple, <has skin>,
<has seeds>), and the list goes on. The class SITUATION includes feature
types that express contextual properties, such as locations (apple, <found on trees>), participants (toy, <used by kids>), associated entities (spoon, <used with
bowls>), and function (knife, <used for cutting>). TAXONOMIC relations
include coordinates, subordinates and individuals, superordinates, and synonyms.
Finally, the INTROSPECTIVE class includes emotions, evaluations, and cognitive
operations that require the activation of internal states, triggered by the verbal
stimulus.
According to this taxonomy, of the 27 different sub-types of features gathered
in McRae’s features norms, the highest ranked types (i.e. features with highest
production frequency) are:

1 function;
2 external surface property;
3 external component;
4 superordinate;
5 entity behavior.

This pattern suggests that during the property generation task that allowed the experimenters to collect the features, the participants activated mental simulations that allowed them to "see" and consequently list visual properties of the given words, while imagining the corresponding referents in action.5
However, little information is produced with regard to the contextual properties
of such simulations: The information about locations, participants, and associated
entities that would populate the imagined situations are not ranked as high as the
properties attributed directly to the referent that was mentally simulated after the
corresponding verbal stimulus was provided. This does not necessarily mean that
in the mental simulations the concepts are imagined to be floating in a vacuum,
without any surrounding context, but rather that the contextual entities are not as
salient as other types of properties directly attributed to the referent.
On the other hand, according to Beaudoin’s classification of Flickr tags
(Beaudoin, 2007), the most frequent features tagged in Flickr are:

1 geographical locations;
2 compounds;
3 inanimate things;
4 participants;
5 events.

Even though Beaudoin's taxonomy differs from Wu and Barsalou's, it appears clear that Flickr tags tend to favor contextual entities. As a matter of fact, as a qualitative exploration, the Wu and Barsalou taxonomy was applied in FDT to the 168 concepts taken from McRae's features norms (listed in Table 7.2). In particular,
after building the contingency matrix that displays the 168 concepts on the rows
and all of the tags that co-occur with them on the columns (step 6 of the FDT
procedure), the top 20 tags that co-occur with each of the 168 concepts (i.e. the
20 tags with the highest SPMI values in each concept vector) were manually annotated
with the semantic roles of the Wu and Barsalou taxonomy. This annotation process
presented intrinsic difficulties, due to the fact that tags are one-word labels and
their relationship with the target concept can be ambiguous even when their part
of speech is not ambiguous. For example, given the concept table, its related tag
coffee might contribute to specify a subordinate category of tables (coffee tables), an
object that often appears on a table (cup of coffee), or an activity that can be done
at a table (having a coffee). In this regard, a specific case is represented by abstract
entities. Consider, for instance, how nature is associated with waterfall. According
to the coding scheme proposed in Wu & Barsalou (2009), this association could
be plausibly labeled as location (waterfalls are found in nature), associated_entity or
associated_abstract_entity (nature is simply associated with waterfalls as a related entity
that does not fulfill any strictly taxonomic relation), origin (waterfalls come from
nature), or superordinate (nature constitutes a taxonomic hypernym of waterfalls).
Although the difficulties in specifying the relation between concepts in FDT
might be considered a drawback of this distributional semantic space, they might
actually become a strength when dealing with abstract concepts. As suggested
in Recchia and Jones (2012), predicates are not necessarily the basic units for
semantic analysis: A concept such as law is clearly associated and described by
courthouse, crime, and justice, and these concepts may play a role in the semantic
representation of law, even though it is difficult to express through a short sentence
the nature of the relationship between law and each of the above mentioned
related concepts. In this respect, FDT is less constrained than other databases
of semantic features collected in experimental settings, which often address only
concrete concepts. The naturalness and genuineness of Flickr tags has been clearly
stated by one of Flickr’s co-founders in the following terms: “free typing loose
associations is just a lot easier than [for example] making a decision about the
degree of match to a pre-defined category (especially hierarchical ones). It’s like
90% of the value of a ‘proper’ taxonomy but 10 times simpler” (Butterfield,
2004).
The outcomes of the manual annotation display a very different scenario, when
compared to the features that are most frequently produced by the speakers in
McRae’s features norms. The top-ranked tags in FDT express features that are
associated with the given concepts by relationships of:

1 locations;
2 associated entities;
3 superordinates;
4 functions;
5 external surface properties.
This pattern is explained by the fact that the conceptual representations in FDT are indeed highly contextualized, appearing as they do in photographs that capture real experiences and thus unavoidably involve other entities.

Correlation Coefficients Between the Semantic Representations in FDT and Norms
Following the procedure explained in Implementing FDT, the table of similarities
was created by calculating the pairwise cosines between the 168 concept-vectors.
The obtained square matrix contained the similarities between each pair of
concepts, interpreted geometrically as their proximity, expressed by the cosine
measure. Each row (or each column, since the table is square and symmetric)
displays a set of coordinates, which defines the position of a given concept against
each other concept, in a multidimensional space.
For the purpose of comparing the two semantic spaces, obtained from McRae
features norms and FDT, a table of 168 by 168 cosines, computed on McRae’s
features norms, was created. At this point, each of the 168 concepts was defined
by the same coordinates across two spaces: McRae’s features norms and FDT. For
example, bike was defined in FDT as a concept whose proximity to airplane is 0.14,
to apartment is 0.20, to avocado is 0.08, etc. Similarly, in the multidimensional space
based on McRae’s features norms bike was defined as the concept whose proximity
to airplane is 0.10, to apartment is 0, to avocado is 0, etc.
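A minimal sketch of this comparison is the following; cos_fdt and cos_mcrae are assumed to be the two 168 by 168 cosine matrices with identical row and column ordering (illustrative names), and the trivial self-similarity cell is dropped here, a detail the chapter does not specify.

    stopifnot(identical(rownames(cos_fdt), rownames(cos_mcrae)))

    # Pearson correlation between each concept's similarity profile in the two spaces
    concept_r <- sapply(rownames(cos_fdt), function(w) {
      keep <- rownames(cos_fdt) != w    # drop the self-similarity entry
      cor(cos_fdt[w, keep], cos_mcrae[w, keep], method = "pearson")
    })
    mean(concept_r)                     # the chapter reports a mean of about 0.69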
The correlation study reported in Table 7.2 shows the Pearson’s coefficients
between the semantic representations of each of the 168 concepts considered across
the two distributions.
All the representations correlate positively across the two spaces, showing medium-to-high values (M = 0.69), comparable to other state-of-the-art distributional semantic spaces. However, such degrees of correlation are not extremely high,
suggesting that the aspects of meaning captured by McRae’s features norms
somehow differ from those captured by FDT. This can be explained by the fact
that the type of features that behave as contexts in the two spaces, and therefore
contribute to shaping the semantic representations of the concepts, are different,
as illustrated in the previous section. Curiously, the semantic representations of
the concepts denoting animals, and in particular birds, show particularly poor
correlation coefficients across the two spaces, compared to other categories (e.g.
owl: 0.53; peacock: 0.50; raven: 0.42; seagull: 0.43). The semantic representations
of birds (there are 15 instances of the category birds, reported in Table 7.2)
have average correlation 0.52, across FDT and McRae’s features norms, while
the semantic representations of any other concept (non-birds) shows an average
correlation of 0.70. The difference between these coefficients is statistically
significant (t=8.015, p < 0.001). A closer look shows that the features that are most
commonly associated with birds are: <is_bird> (323 counts); <has_feathers>
TABLE 7.2 Pearson's correlation coefficients between the semantic representations of the concepts in FDT and in McRae's features norms.

Concept     Correlation    Concept     Correlation    Concept     Correlation
            (Norms/FDT)                (Norms/FDT)                (Norms/FDT)
airplane 0.732 falcon 0.492 pillow 0.756
apartment 0.696 fence 0.624 pineapple 0.753
avocado 0.727 flamingo 0.590 pistol 0.824
bag 0.722 football 0.823 plum 0.758
ball 0.673 fox 0.590 pony 0.645
banana 0.687 fridge 0.794 potato 0.838
bear 0.513 garage 0.749 pumpkin 0.665
bed 0.680 garlic 0.698 pyramid 0.674
bike 0.616 gate 0.630 rabbit 0.596
bison 0.685 giraffe 0.846 radio 0.850
blackbird 0.532 goat 0.669 raspberry 0.797
boat 0.703 goose 0.514 raven 0.424
book 0.710 gorilla 0.645 revolver 0.818
bottle 0.689 grape 0.774 rice 0.583
bread 0.669 guitar 0.794 rifle 0.821
brick 0.583 gun 0.719 robin 0.571
bridge 0.643 hamster 0.648 rooster 0.592
building 0.689 helicopter 0.847 salmon 0.709
bull 0.697 horse 0.557 sandals 0.857
bus 0.704 house 0.651 scarf 0.658
butterfly 0.593 jacket 0.850 scooter 0.637
cabin 0.721 jeans 0.795 seagull 0.433
cage 0.715 jeep 0.755 sheep 0.623
cake 0.781 jet 0.765 shell 0.829
candle 0.736 lamp 0.633 ship 0.728
car 0.550 lemon 0.661 shirt 0.726
carrot 0.771 lettuce 0.799 shoes 0.714
cat 0.484 limousine 0.745 skateboard 0.615
cathedral 0.697 lion 0.689 skyscraper 0.776
chapel 0.805 magazine 0.780 socks 0.793
cheese 0.793 marble 0.597 spoon 0.659
cherry 0.604 menu 0.781 squirrel 0.615
chicken 0.300 mirror 0.665 stone 0.537
church 0.717 missile 0.833 strawberry 0.701
cigarette 0.610 mittens 0.856 swan 0.548
clock 0.679 moose 0.651 sword 0.736
corn 0.706 mug 0.751 table 0.603
cottage 0.693 mushroom 0.726 taxi 0.623
cow 0.576 necklace 0.710 telephone 0.742
crocodile 0.746 nectarine 0.790 tiger 0.766
cup 0.741 oak 0.731 toy 0.697
dandelion 0.678 olive 0.637 train 0.633
deer 0.655 owl 0.531 turtle 0.699
desk 0.661 pajamas 0.920 umbrella 0.601
dog 0.501 pan 0.749 vine 0.581
dolphin 0.797 parsley 0.744 violin 0.850
door 0.573 peach 0.761 wall 0.678
dress 0.695 peacock 0.495 whale 0.628
drum 0.794 pear 0.771 wheel 0.582
duck 0.611 pearl 0.600 whistle 0.770
dunebuggy 0.722 pen 0.696 willow 0.617
eagle 0.543 pencil 0.724 woodpecker 0.606
eggplant 0.865 pepper 0.507 worm 0.794
elephant 0.623 piano 0.721 yacht 0.690
emerald 0.629 pie 0.787 zebra 0.679
envelope 0.824 pig 0.636 zucchini 0.755
(226); <flies> (213); <has_beak> (164); <lays_eggs> (154); <has_wings> (139); <an_animal> (101); <eats> (64). These are all external components (visible to
the eye) and entity behaviors. On the other hand, the top tags for each instance
of birds, gathered in FDT, are locations (pond, farm, nest), associated entities (hen,
worm, car) and hypernyms (animal, nature).
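The birds versus non-birds comparison of per-concept correlations reported above can be sketched as follows; concept_r is the vector of correlations from the earlier sketch and is_bird is a logical vector marking the 15 bird concepts (both hypothetical names), and the chapter does not state which t-test variant was used.

    # Hypothetical objects: concept_r (per-concept correlations) and is_bird (logical marker)
    t.test(concept_r[is_bird], concept_r[!is_bird])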
An additional explanation of the poor correlation of this category across the
two distributions could be that birds, more than other animals or other concrete
entities, might fulfill symbolic functions and thus appear to be often depicted and
represented in contexts where real birds would not naturally appear (e.g. on clothes,
accessories such as earrings, logos, etc.). This type of information is captured by
FDT, which weights locations and associated entities, and not by McRae’s features
norms, thus lowering the correlation between the semantic representations of the
instances of the bird category across the two distributional spaces.

A Comparison with WordNet-Based Similarity Metrics and Discussion


The similarities among semantic representations (vectors) of the 168 concepts
were compared to those expressed by three other metrics of similarity, based
on WordNet. For example, the similarity between bike and car in FDT and
that in McRae’s features norms were compared to the similarity between bike
and car obtained through three WordNet-based metrics, computed with the Perl
module WordNet::Similarity6 (Pedersen, Patwardhan, & Michelizzi, 2004). The
chosen measures are: PATH7 and WUP,8 which are both based on the idea
that the similarity between two concepts is a function of the length of the path
that links them in the WordNet taxonomy; and JCN,9 which is an information
content-based measure, i.e. based on the idea that the more common information
two concepts share, the more similar they are. The average correlations are reported
in Table 7.3.

TABLE 7.3 The average Pearson's correlation coefficients between semantic representations in FDT, McRae's features norms, and three metrics of similarity/relatedness based on WordNet (JCN, WUP, and PATH). All coefficients are significant with p < 0.001, except for JCN/WUP, which is significant with p < 0.005.

             FDT                       McRae f.n.       JCN              WUP              PATH
FDT          1
McRae f.n.   0.69 (sd=0.10)            1
JCN          0.62 (sd=0.30) (note 10)  0.57 (sd=0.30)   1
WUP          0.46 (sd=0.12)            0.47 (sd=0.15)   0.22 (sd=0.11)   1
PATH         0.79 (sd=0.08)            0.72 (sd=0.11)   0.65 (sd=0.31)   0.65 (sd=0.08)   1
As the table shows, FDT generates semantic representations that correlate positively with state-of-the-art metrics of semantic similarity/relatedness. Interestingly, FDT
semantic representations show high correlations with those computed through
PATH, which counts the number of nodes along the shortest path between the
senses in the “is-a” WordNet hierarchies.
In conclusion, Study One showed that a distributional analysis performed
with FDT on Flickr Big Data (image tags) delivers semantic representations
that correlate with those obtained from speaker-generated features elicited in an
experimental setting, as well as those obtained from the WordNet environment.

Study Two
The purpose of this study was to evaluate FDT’s ability to categorize given
concepts into semantically meaningful clusters, based on their co-occurrence across
the Flickr tagsets, and compare such categorization to that obtained from semantic
feature norms.

Categorization Task in FDT and in McRae’s Feature Norms


Six semantic categories were established in advance for this task: (i) Animals (e.g.
moose, eagle, cow), (ii) food (e.g. garlic, raspberry, rice), (iii) clothing (e.g. jeans, shirt,
shoes), (iv) weapons (e.g. gun, revolver, sword), (v) musical instruments (e.g. violin,
drum, guitar), (vi) vehicles (taxi, jet, boat). Ninety-four concepts that were best
represented in Flickr were selected from the list of concepts analyzed in McRae’s
features norms, because they belong to one of the six chosen categories. Some of the
chosen categories were particularly broad and included more concepts than others
(e.g. animals), in order to allow FDT to perform spontaneously additional divisions
into subcategories (e.g. mammals versus birds, farm animals versus wild animals,
etc.). Other categories were more restricted and less populated (e.g. weapons),
so that the accuracy of FDT could be tested also with regard to more specific
categories. The research questions approached with this task can be summarized as
follows:
• To what extent can FDT predict these six categories, based only on tags' co-occurrences across Flickr photographs?
• To what extent can FDT spontaneously perform intra-category distinctions,
and what types of subcategories does it highlight?
• How does the categorization performed by FDT resemble the categorization
of the same concepts emerging from McRae’s features norms?
The cluster analysis was performed through hierarchical clustering, using the
Ward method, which identifies clusters that have minimum internal variance. As
explained in the section Implementing FDT, this approach was preferred to the
classic k-means methods, where experimenters set up in advance the number of clusters in which they want the data to be organized. With the Ward method, the number of clusters is not specified, so that categories and subcategories can emerge spontaneously. The first goal was to check whether FDT would support a cluster analysis into six predetermined categories.

FIGURE 7.1 (a) Data partitioning: analysis of the within-groups sum of squares by number of clusters in FDT data. (b) Data partitioning: analysis of the within-groups sum of squares by number of clusters on McRae's features norms.
Following the procedure outlined in the section Implementing FDT, the
distributional semantic space that encompassed the 94 vectors was implemented.
Then the table of similarities (cosines) was analyzed, looking for cluster solutions.
As a first step, an attempt at partitioning the data and determining the optimal
number of clusters was performed in a plot of the “within groups sum of squares”
by “number of clusters” (Figure 7.1(a)). The obtained graph does not show a clear
bend (or "elbow"), which would indicate a best solution, even though there seems to be a slight bend around the partition into six clusters. In other words, it seems that FDT in this case cannot clearly predict a neat division into the six predetermined categories. Partly because of this lack of a clear cut into six classes, a hierarchical clustering algorithm, in which the number of clusters is not established a priori by the experimenter, was preferred. In comparison, the analysis conducted on the
same concepts, retrieved from the table of similarities (cosines) in McRae’s features
norms, shows clearly that here the semantic representations can be clustered into
six categories (Figure 7.1(b)). Yet, this does not imply that the concepts clustered
in each of the categories are those established in advance.
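One common way to produce such a plot in R is sketched below; it assumes fdt94 is the 94 by 94 cosine matrix for the Study Two concepts (an illustrative name) and uses repeated k-means runs only to compute the within-groups sum of squares for each candidate number of clusters, not for the final clustering, which the chapter performs hierarchically.

    set.seed(1)
    wss <- sapply(1:15, function(k) {
      kmeans(fdt94, centers = k, nstart = 25)$tot.withinss
    })
    plot(1:15, wss, type = "b",
         xlab = "Number of clusters",
         ylab = "Within groups sum of squares")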
As a second step, the data were analyzed through the function hclust,11 which
produced a cluster solution that was then plotted into a dendrogram, on which
the function cutree12 displayed the partition into the six more stable clusters (Figure 7.2(a)). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.
FIGURE 7.2 (a) Cluster analysis performed in R with hclust on FDT data; the function cutree shows the solution for a six-way partitioning. (b) Cluster analysis performed in R with hclust on McRae's features norms data; the function cutree shows the solution for a six-way partitioning.
A qualitative look at the dendrogram shows that the six categories are not the six predetermined
categories, and they are not internally coherent. The six main categories that seem
to emerge are: General foods, a sample of vegetables, a sample of fruits, musical
instruments + weapons + vehicles + some animals, clothing, and animals. The
same procedure was performed on McRae’s features norms data, and it is plotted
in Figure 7.2(b). Here, the categories that seem to emerge are also non-coherent
from a semantic perspective: Fruit, other foods, weapons, musical instruments +
clothing + vehicles, birds, other animals.
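The partitions discussed here and in the next subsection can be extracted from the Ward dendrogram with cutree; the sketch below assumes hc is the hclust object fitted on the 94 Study Two concepts, as in the earlier sketch.

    six_way    <- cutree(hc, k = 6)    # six-way partition (Figure 7.2)
    eleven_way <- cutree(hc, k = 11)   # eleven-way partition (Figure 7.4)

    table(six_way)                     # cluster sizes
    split(names(six_way), six_way)     # concepts grouped by cluster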
A quantitative analysis of the data was performed with the package pvclust13
(Suzuki & Shimodaira 2006), which in addition to hclust assesses the uncertainty
in hierarchical clustering via multiscale bootstrap analysis, and highlights the results
that are highly supported by the data (e.g. p > 0.95). Figure 7.3(a) and (b) shows
the clusters on FDT and McRae’s features norms data that are supported by the
data with alpha = 0.95.

Revised Categorization in FDT


Since the Ward clustering method minimizes the sum of squared differences within
all clusters (inertia criterion), the deviance between clusters gradually diminishes
while the deviance inside each cluster increases, during the recursive procedure
performed by this method. Thus, the smaller each cluster, the smaller is the
within-cluster variance, and the better is the solution. With the FDT matrix of
cosines, in fact, partitioning the data into a larger number of clusters contributes
to highlighting different subcategories. For example, an 11-way partitioning
performed with cutree shows the following semantic categories and subcategories:
Some foods, vegetables, fruits, musical instruments, weapons, water and ground
vehicles, air vehicles (plus falcon), animals, clothing, exotic animals, and farm
animals (Figure 7.4).

Cluster Validation and Discussion


Several measures for validating the six-way cluster solutions are hereby reported. These measures refer to both internal and external evaluation measures performed with FDT data and McRae's features norms. The meaning of and documentation for each measure can be found in the manuals of the respective methods used to compute it (see the notes); a minimal sketch of the clValid-based measures is given below.
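As an indication of how the clValid-based measures in Table 7.4 can be obtained, the following sketch assumes fdt94 is the Study Two cosine matrix (illustrative name); the arguments used in the original analysis are not reported, so plausible values are shown.

    library(clValid)

    cv <- clValid(fdt94, nClust = 6,
                  clMethods = "hierarchical", method = "ward",
                  validation = c("internal", "stability"))
    summary(cv)   # connectivity, Dunn index, silhouette, APN, AD, ADM, FOM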
In general, it can be stated that FDT can predict semantic categories on the basis of tags' co-occurrences, but these seem to be ad hoc categories (Barsalou, 1983), rather than classic taxonomic ones. The macro-distinction into the six
predetermined categories is not as clear-cut as expected; then again, not even the speaker-generated semantic features collected in an experimental setting produce stable and semantically homogeneous six-way cluster solutions.
FIGURE 7.3 (a) Cluster analysis performed in R with pvclust on FDT data (highly supported clusters). (b) Cluster analysis performed in R with pvclust on McRae's features norms data (highly supported clusters).

FIGURE 7.4 Eleven-way cluster analysis performed in R with pvclust on FDT data.
TABLE 7.4 Different measures of cluster analysis validation, related to the six-way cluster solutions obtained with FDT data and McRae's feature norms, and set top and bottom rules.

Measure                                                            Score FDT   Score McRae f.n.   Method
Connectivity (0 to infinity, should be minimized)                     28.740       26.537         R (clValid)14
Dunn index (0 to infinity, should be maximized)                        0.427        0.195         R (clValid)
Silhouette (-1 to 1, should be maximized)                              0.329        0.439         R (clValid)
APN (should be minimized)                                              0.014        0.009         R (clValid)
AD (should be minimized)                                               0.519        0.475         R (clValid)
ADM (should be minimized)                                              0.017        0.016         R (clValid)
FOM (should be minimized)                                              0.092        0.101         R (clValid)
Cophenetic correlation coefficient (max. 1, should be maximized)       0.563        0.563         Multidendrograms15
Normalized mean squared error (should be minimized)                  351.147      813.508         Multidendrograms
Normalized mean absolute error (should be minimized)                  17.248       54.203         Multidendrograms

On the other hand, allowing for a higher number of clusters, FDT shows accurate
intra-category distinctions between different types of vehicles (air, ground, and
water transportation), as well as different types of animals (farm animals and wild
animals).
In conclusion, Study Two showed that, by mining Flickr tags through the FDT procedure, it is possible to create a cognitively plausible categorization of given
concepts, which are automatically divided into semantically coherent clusters.

Conclusions
Big Data such as the metadata that characterize the Web 2.0 (the web of
social networks) are increasingly attracting the interest of a variety of scientific communities because they present at the same time a challenge and an opportunity to bridge toward the next web, or Web 3.0, the Semantic Web. The challenge is that of creating adequate tools for mining and structuring the semantic information that such datasets contain; the opportunity lies in the availability of huge amounts of data that are not subject to the same biases and constraints as laboratory-collected
data. Yet, the majority of applications that involve Big Data mining seem to be
specifically market-oriented (rather than being aimed at reproducing the human
cognitive system) and task-oriented (designed to serve a specific query). On the
other hand, in cognitive science the distributional hypothesis that underlies the implementation of several word space models is gaining more attention thanks to the recent implementation of hybrid and multimodal word spaces that would account for the grounded nature of human semantic knowledge by integrating perceptually derived information into the word vectors.
The study hereby presented relates to both the above-mentioned fields of
research (Big Data mining and cognitive science) for it proposes a distributional
semantic space that harvests semantic representations by looking at concepts’
co-occurrences across Flickr tags, and it compares them to those that are obtained
from semantic data collected in experimental settings (speaker-generated features).
The analyses here presented showed that FDT is a general distributional semantic
space for meaning representation that is based on a unique but intrinsically
variegated source of semantic information, and thus it avoids the artificial and
arbitrary merging of linguistic and visual streams of semantic information, as
it is performed by hybrid distributional models. FDT’s fair ability to model
semantic representations and predict human-like similarity judgments suggests that
this would be a promising way to observe from a distributional perspective the
semantic representations of abstract concepts, a timely topic in cognitive science
(e.g. Pecher, Boot, & Van Dantzig, 2011). The fact that abstract concepts can
hardly be described through predicates suggests that FDT, which is based on
contextualized conceptual associations, could highlight some hidden peculiarities
of the inner structure of abstract concepts. The scientific literature about abstract
concepts’ representations claims that these concepts differ from concrete ones in
that their mental simulation appears to involve a wider range of participating
entities, arguably because they are used in a wider variety of contexts (Barsalou &
Wiemer-Hastings, 2005). Moreover, the processing of abstract concepts seems to
require a special focus on introspections and emotions, which seems to be less
crucial for concrete concepts (Kousta, Vigliocco, Vinson, Andrews, & Del Campo,
2011). These claims will need to be investigated through extensive distributional
analyses of concrete versus abstract concepts in the FDT environment.

Notes
1 On demand, the author can release the raw data and the materials used for the
studies described in the following sections.
2 Ranking the number of tags attributed to each picture in an informal analysis
conducted over a sample of 5 million pictures (i.e. how many pictures have one,
two, three tags, etc.), it appeared that most pictures contain one to 15 tags. After
15, the graph’s curve that indicates the number of pictures containing 15+ tags
drops dramatically.
3 This number is chosen without a specific quantitative investigation: Out of 15
tags considered, the first three tags are considered to be the most salient, but
a deeper psycholinguistic investigation could test whether the tagging speed
actually decreases after the first three tags, suggesting a decrease in salience.
4 Other validation methods such as the popular purity and entropy measures
obtained for example with the software CluTo demand the experimenter to set
the number of clusters, an operation which here was avoided on purpose.
5 In the instruction provided to the participants, McRae and colleagues request
them to list properties of the concept to which a given word refers. For the
exact wording used in the instructions, cf. McRae et al. (2005).
6 I am extremely grateful to Ted Pedersen for personally providing these data.
7 Straightforward nodes count in the noun and verb WordNet “is-a” hierarchies.
8 Wu and Palmer (1994).
9 Jiang and Conrath (1997).
10 The JCN measure requires that a word be observed in the sense-tagged corpus
used to create information content values, and the coverage of this corpus
is somewhat limited. As a result, quite a few pairs showed zero relatedness,
but in fact this has to be read as zero information on this pair, rather than
zero relatedness. The concepts whose vector was a list of zeroes are: blackbird,
carrot, crocodile, dolphin, dunebuggy, flamingo, garlic, giraffe, gorilla, hamster,
helicopter, limousine, mittens, moose, mushroom, nectarine, pajamas, parsley,
pineapple, pumpkin, pyramid, raspberry, raven, robin, scooter, skateboard,
turtle, zebra, zucchini. Not considering these concepts in the correlation study,
the average correlation between JCN and FDT is 0.78 (SD=0.12).
11 Documentation for hclust is provided by the R stats package (see help(hclust)). Arguments: method="ward." Other parameters were left on the default settings.
12 Documentation for cutree is provided by the R stats package (see help(cutree)). Arguments: k=6.
13 Documentation for pvclust is provided by the pvclust package (see help(pvclust)). Arguments: method.hclust="ward," method.dist="euclidean".
14 Documentation for clValid is provided by the clValid package (see help(clValid)).
15 Documentation can be found at: http://deim.urv.cat/∼sergio.gomez/multidendro
grams.php.

References
Ahn, L., & Dabbish, L. (2004). Labeling images with a computer game. Proceedings of the
SIGCHI Conference 2004 (pp. 319–326). Wien, Austria.
Ames, M., & Naaman, M. (2007). Why we tag: Motivations for annotation in mobile and
online media. In Proceedings of the SIGCHI Conference 2007 (pp. 971–980). New York,
NY, USA.
Andrews, M., Vigliocco, G., & Vinson, D. (2009). Integrating experiential and
distributional data to learn semantic representations. Psychological Review, 116(3), 463–498.
Baroni, M., & Lenci, A. (2008). Concepts and properties in word spaces. In A. Lenci (Ed.),
From context to meaning: Distributional models of the lexicon in linguistics and cognitive science.
Special issue of the Italian Journal of Linguistics, 20(1), 55–88.
Baroni, M., & Lenci, A. (2010). Distributional Memory: A general framework for corpus-
based semantics. Computational Linguistics, 36(4), 673–721.
Baroni, M., Evert, S., & Lenci, A. (Eds.) (2008). Lexical Semantics: Bridging the
gap between semantic theory and computational simulation. Proceedings of the ESSLLI
Workshop on Distributional Lexical Semantics 2008.
Barsalou, L. (1983). Ad hoc categories. Memory & Cognition, 11, 211–227.
Barsalou, L., & Wiemer-Hastings, K. (2005). Situating abstract concepts. In D. Pecher and
R. Zwaan (Eds.), Grounding cognition: The role of perception and action in memory, language,
and thought (pp. 129–163). New York: Cambridge University Press.
Beaudoin, J. (2007). Flickr® image tagging: Patterns made visible. Bulletin of the American
Society for Information Science and Technology, 34(1), 26–29.
Bolognesi, M. (2014). Distributional Semantics meets Embodied Cognition: Flickr® as a
database of semantic features. Selected Papers from the 4th UK Cognitive Linguistics Conference
(pp. 18–35). London, UK.
Bouma, G. (2009). Normalized (Pointwise) mutual information in collocation extraction.
In C. Chiarcos, R. E. de Castilho, & M. Stede (Eds.), From form to meaning: Processing texts
automatically. Proceedings of the Biennial GSCL Conference (pp. 31–40). Potsdam, Germany:
Narr Verlag.
Bruni, E., Tran, N., & Baroni, M. (2014). Multimodal Distributional Semantics. Journal of
Artificial Intelligence Research, 49, 1–47.
Buratti, A. (2011). FlickrSearch 1.0. Retrieved August 2014 from https://code.google.com/
p/irrational-numbers/wiki/Downloads.
Burgess, C., & Lund, K. (1997). Modelling parsing constraints with high-dimensional
context space. Language and Cognitive Processes, 12, 1–34.
Butterfield, S. (2004, August 4) Sylloge. Retrieved March 20, 2014 from http://
www.sylloge.com/personal/2004/08/folksonomy-social-classification-great.html.
Cree, G., & McRae, K. (2003). Analyzing the factors underlying the structure and
computation of the meaning of chipmunk, cherry, chisel, cheese, and cello and many
other such concrete nouns. Journal of Experimental Psychology, 132, 163–201.
El-Hamdouchi, A., & Willett, P. (1986). Hierarchic document clustering using Ward’s
method. Proceedings of the Ninth International Conference on Research and Development in
Information Retrieval (pp. 149–156). Washington: ACM.
Firth, J. (1957). Papers in Linguistics. London: Oxford University Press.
Harris, Z. (1954). Distributional structure. Word, 10(2-3), 146–162.
Heckner, M., Heilemann, M. & Wolff, C. (2009). Personal information management vs.
resource sharing: Towards a model of information behaviour in social tagging systems.
International AAAI Conference on Weblogs and Social Media (ICWSM), May, San Jose, CA,
USA.
Jiang, J., & Conrath, D. (1997). Semantic similarity based on corpus statistics and lexical
taxonomy. The International Conference on Research in Computational Linguistics, Taiwan.
Johns, B. T., & Jones, M. N. (2012). Perceptual inference from global lexical similarity.
Topics in Cognitive Science, 4(1), 103–120.
Körner, C., Benz, D., Hotho, A., Strohmaier, M., & Stumme, G. (2010) Stop thinking,
start tagging: Tag semantics arise from collaborative verbosity. Proceedings of the 19th
International Conference on World Wide Web (WWW 2010), April, Raleigh, NC, USA:
ACM.
Kousta, S., Vigliocco, G., Vinson, D., Andrews, M., & Del Campo, E. (2011). The
representation of abstract words: Why emotion matters. Journal of Experimental Psychology,
140, 14–34.
Landauer, T., & Dumais, S. (1997). A solution to Plato’s problem: The Latent
Semantic Analysis theory of the acquisition, induction and representation of knowledge.
Psychological Review, 104(2), 211–240.
McRae, K., Cree, G., Seidenberg, M., & McNorgan, C. (2005). Semantic feature
production norms for a large set of living and nonliving things. Behavior Research Methods, Instruments, and Computers, 37, 547–559.
Marlow, C., Naaman, M., Boyd, D., & Davis, M. (2006). HT06, tagging paper, taxonomy,
Flickr, academic article, to read. In Proceedings of the 7th Conference on Hypertext and
Hypermedia (pp. 31–40).
Nov, O., Naaman, M., & Ye, C. (2009). Motivational, structural and tenure factors that
impact online community photo sharing. Proceedings of AAAI International Conference on
Weblogs and Social Media (ICWSM 2009).
Pecher, D., Boot, I., & Van Dantzig, S. (2011). Abstract concepts: Sensory-motor
grounding, metaphors, and beyond. In B. Ross (Ed.), The psychology of learning and
motivation, Vol. 54, pp. 217–248. Burlington: Academic Press.
Pedersen, T., Patwardhan, S., & Michelizzi, J. (2004). WordNet::Similarity—Measuring the
relatedness of concepts. Proceedings of Fifth Annual Meeting of the North American Chapter of
the Association for Computational Linguistics (NAACL-04) (pp. 38–41). May, Boston, MA.
Peters, I., & Stock, W. (2007). Folksonomy and information retrieval. Proceedings of the 70th
Annual Meeting of the American Society for Information Science and Technology, 70, 1510–1542.
Peters, I., & Weller, K. (2008). Tag gardening for folksonomy enrichment and maintenance.
Webology, 5(3), article 58.
Rapp, R. (2004). A freely available automatically generated thesaurus of related words.
Proceedings of the 4th International Conference on Language Resources and Evaluation LREC
2004 (pp. 395–398).
Recchia, G., & Jones, M. (2012). The semantic richness of abstract concepts. Frontiers in
Human Neuroscience, 6, article 315.
Riordan, B., & Jones, M. (2010). Redundancy in linguistic and perceptual experience:
Comparing distributional and feature-based models of semantic representation. Topics in
Cognitive Science, 3(2), 303–345.
Sahlgren, M. (2006). The Word-Space Model: Using distributional analysis to represent syntagmatic
and paradigmatic relations between words in high-dimensional vector spaces. Doctoral thesis,
Department of Linguistics, Stockholm University.
Shaoul, C., & Westbury, C. (2008). Performance of HAL-like word space models on
semantic categorization tasks. Proceedings of the Workshop on Lexical Semantics ESSLLI 2008
(pp. 42–46).
Steyvers, M. (2010). Combining feature norms and text data with topic models. Acta Psychologica, 133, 234–243.
Strohmaier, M., Koerner, C., & Kern, R. (2012). Understanding why users tag: A survey
of tagging motivation literature and results from an empirical study. Web Semantics, 17,
1–11.
Suzuki R., & Shimodaira, H. (2006). Pvclust: An R package for assessing the uncertainty
in hierarchical clustering. Bioinformatics 22(12), 1540–1542.
Thaler, S., Simperl, E., Siorpaes, K., & Hofer, C. (2011). A survey on games for knowledge
acquisition. STI technical report, May 2011, 19.
Turney, P., & Pantel, P. (2010). From frequency to meaning: Vector space models of
semantics. Journal of Artificial Intelligence Research, 37, 141–188.
Ward, J. (1963). Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association, 58, 236–244.
Weber, I., Robertson, S. & Vojnovic, M. (2008) Rethinking the ESP game. Report for Microsoft
Research, Microsoft Corporation.
Wu, L., & Barsalou, L. (2009). Perceptual simulation in conceptual combination: Evidence
from property generation. Acta Psychologica, 132, 173–189.
Wu, Z., & Palmer, M. (1994). Verb semantics and lexical selection. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (pp. 133–138).
8
LARGE-SCALE NETWORK
REPRESENTATIONS OF SEMANTICS
IN THE MENTAL LEXICON
Simon De Deyne, Yoed N. Kenett, David Anaki,
Miriam Faust, and Daniel Navarro

Abstract
The mental lexicon contains the knowledge about words acquired over a lifetime. A
central question is how this knowledge is structured and changes over time. Here we
propose to represent this lexicon as a network consisting of nodes that correspond
to words and links reflecting associative relations between two nodes, based on free
association data. A network view of the mental lexicon is inherent to many cognitive
theories, but the predictions of a working model strongly depend on a realistic scale,
covering most words used in daily communication. Combining a large network
with recent methods from network science allows us to answer questions about its
organization at different scales simultaneously, such as: How efficient and robust is
lexical knowledge represented considering the global network architecture? What
are the organization principles of words in the mental lexicon (i.e. thematic versus
taxonomic)? How does the local connectivity with neighboring words explain why
certain words are processed more efficiently than others? Networks built from word
associations are specifically suited to address prominent psychological phenomena
such as developmental shifts, individual differences in creativity, or clinical states like
schizophrenia. While these phenomena can be studied using these networks, various
future challenges and ways in which this proposal complements other perspectives
are also discussed.

Introduction
How do people learn and store the meaning of words? A typical American university
student knows the meaning of about 40,000 words by adulthood, but even young
children know a remarkable number of words (around 3,000 in their spoken
vocabulary by the age of 5; Aitchison, 2012). To accomplish this feat, people must
extract regularities from the linguistic input they receive and store it in some fashion.
In this chapter, our focus is on how this knowledge is organized.
One way to think about word meanings is with the idea of a mental lexicon. The
mental lexicon can be thought of as a dictionary-like structure, in the sense that it
organizes words according to various different properties. This includes semantic
properties (meaning) and syntactic properties (e.g. part-of-speech), but might also
include perceptual characteristics (e.g. pronunciation) and pragmatic ones (e.g.
appropriate usage). However, in other respects the mental lexicon is very different
to a typical dictionary. For instance, rather than provide explicit definitions for
words, the mental lexicon represents meanings in terms of patterns of word use and
the connections between words and sensory experiences (Elman, 2009). Similarly,
the dictionary metaphor provides a poor guide to thinking about how people
retrieve information from the lexicon. Understanding the structure of the mental
lexicon helps us explain a variety of phenomena including tip-of-the-tongue states
(Brown, & McNeill, 1966), learning new words in a second language (de Groot,
1995), and various forms of anomia and aphasia (Aitchison, 2012).
If the mental lexicon is not exactly a dictionary, what kind of organization does
it possess? Our goal in this chapter is to discuss how ideas from network science can
be used to provide these insights. In particular, we focus on the importance of using
large-scale networks derived from free association norms. The structure of the chapter
is as follows. In the remainder of this section we discuss our approach in some detail
and compare it to other perspectives on the problem. In the second section we
illustrate how the study of large networks leads to new predictions that could not
be detected in smaller scale studies, and allows us to investigate the structure of the
mental lexicon at a global (macroscopic) level, an intermediate (mesoscopic) level,
and at the fine-grained (microscopic) level. Finally, in the third section we discuss
how the basic approach can be extended to capture differences between different
populations and even among different individuals.

Studying the Mental Lexicon


A cursory review of the literature in psychology, computer science, and linguistics
reveals that there is a variety of different ways in which the mental lexicon could
be studied. It is not our goal to provide an exhaustive survey, but a brief overview
is useful for highlighting the manner in which different approaches are useful for
addressing different questions.
In psychology there is a long tradition of studying word meaning on the small
scale. For example, “feature listing” tasks can be used to empirically measure how
meaning is represented in small parts of the lexicon (e.g. De Deyne, Verheyen,
Ameel, Vanpaemel, Dry, Voorspoels, & Storms, 2008; McRae, Cree, Seidenberg,
& McNorgan, 2005), and predict measures such as the relatedness between word
pairs (Dry & Storms, 2009) and typicality as a member of a category (Ameel &
Storms, 2006). On the theoretical side, we can study the semantic properties and
relations among a small set of words using connectionist networks (McClelland, &
Rogers, 2003) or Bayesian models (Kemp, Tenenbaum, Griffiths, Yamada, & Ueda,
2006). One difficulty with these approaches is the very fact that they are small in
scale, and it is not clear which results will generalize when the entire lexicon is
considered. Indeed, most psychological studies rely on small sets of concrete nouns
(Medin, Lynch, & Solomon, 2000) even though work on abstract and relational
words is available (e.g. Recchia & Jones, 2012; Wiemer-Hastings & Xu, 2005).
Moreover, selection biases also extend towards certain types of relations between
these nouns (mostly perceptual properties and categorical relations) and types of task
(highlighting relations at the same hierarchical category level), which might render
some of the conclusions regarding how the lexicon is structured and represents
word meaning premature.
A different approach that emerges from linguistics might be called the “thesaurus”
model, and is best exemplified by WordNet (Fellbaum, 1998). WordNet is a
linguistic network consisting of over 150,000 words. The basic unit within WordNet
is a synset, a set of words deemed to be synonymous. For instance, the word platypus
belongs to a synset that contains duckbill, duck-billed platypus, and Ornithorhynchus
anatinus. Synsets are connected to one another via is-a-kind-of relationships, so the
platypus synset is linked to a synset that consists of monotreme and egg-laying mammal.1
Unlike the traditional psychological approach, WordNet does not suffer from the
problem of small scale. Quite the contrary, in fact: The synsets and their connections
form an extensive network from which one can predict the similarity of two
words, query the different senses that a word may have, and so on. Unfortunately,
when viewed as a tool for studying the mental lexicon, WordNet has fairly severe
limitations of its own. The fundamental difficulty is that WordNet is not derived
from empirical data, and as a consequence it misses properties that would be
considered critical when studying human semantics. A simple example would be
the fact that it treats the elements of a synset as equivalent. This is highly implausible:
While any Australian would have a detailed mental representation for platypus, only
a small group of experts would have any representation of the term Ornithorhynchus
anatinus. Moreover, the meaning of platypus is culture specific and will be
much more central within (some) Australian cultures than in American culture
(cf. Szalay & Deese, 1978). Even ignoring culture-specific knowledge that
differentiates among the members of the synset, the WordNet representation
misses important lexical knowledge about the word platypus that would be shared
among almost all English speakers. To most English speakers, platypus is a rare
word: The frequency of the word platypus (less than one per million words) is
just a fraction of that of duck (about 25 times per million words), which has a
tremendous influence on how quickly people can decide whether it is a real word
(839 ms for platypus versus 546 ms for duck), or name the animal in question (830 ms
versus 572 ms).2 As this illustrates, the WordNet approach—useful as it may be
for its original purpose—is not well suited to the empirical study of the mental
lexicon.
A third tradition, inspired by information retrieval research in computer science,
is to study semantic knowledge by analyzing the structure of large text corpora. One
of the best known approaches is latent semantic analysis (LSA; Landauer & Dumais,
1997). Using a vocabulary of nearly a hundred thousand words derived from a
variety of documents, LSA is able to capture the meaning of words by comparing
how similar the contexts are in which two words occur. For example, it infers that
opossum, marsupials, mammal, duck-billed, warm bloodedness, and anteater are related
to platypus because these words occur in similar contexts to platypus.3 In recent
years a large number of corpus-based methods have been developed (Recchia,
Sahlgren, Kanerva, Jones, & Jones, 2015). These methods differ in terms of how
they define a word’s context (e.g. the paragraph, the document, etc.), the extent to
which they use grammatical information (e.g. word order), and how the meaning is
represented (e.g. latent spaces, mixture models, etc.). Not surprisingly, the choice
of text corpus also has a very strong influence on how these models behave, and
can even become the determining factor of how well they capture human semantic
processing (Recchia & Jones, 2009).
One of the main selling points of the corpus approach is that it serves as an
existence proof for how meaning can be acquired from the world. That is, if the
text corpus is viewed as a summary of the statistics of the linguistic environment,
then LSA and the like can be construed as methods for extracting meaning from the
world (Firth, 1968; Landauer & Dumais, 1997). Without wishing to make the point
too strongly, there are some reasons to be cautious about the claim that this is how
humans do so. Even supposing that the text corpus is a good proxy for the statistics
of the linguistic input available to the human learner, it is not at all clear that the
linguistic input presents a good summary of the statistics of the world that children
are exposed to. For example, when people are asked to generate associates to banana,
the word yellow is one of the top responses, correctly capturing a relationship that
any child acquires from perceptual data. Yet, due to pragmatic constraints on human
discourse, we rarely talk about yellow bananas. Some studies have looked at this
explicitly, by comparing participants who generated ten sentences in response to a number of verbal stimuli with participants who completed a closely similar word association task. The results showed that the types of responses, after careful preprocessing of the sentences, correlated only moderately (Szalay & Deese, 1978, r = 0.48). Similarly, word co-occurrences extracted from a large text corpus show only weak correlations
with response frequencies from word associations (De Deyne, Verheyen, & Storms,
2015). Second, non-linguistic processes contribute to word meaning in the lexicon; these contributions are picked up by word associations but not necessarily by text (for the reasons
we just mentioned). Further evidence comes from a study on the incidental learning
of word associations during sentence processing (Prior & Bentin, 2003). Finally, in
contrast to natural discourse where fully grammatical utterances need to be formed,
there is less monitoring of the responses in word association tasks. Presumably, the
transformation from idea to language is quicker and easier (Szalay & Deese, 1978).
Limitations notwithstanding, all three methods serve a useful purpose, but each
captures a different aspect of the problem. In this chapter we discuss a different
approach, one that comes with its own perspective on the problem. Drawing
from traditional experimental psychology, we seek to base our study on empirical
measures. Much like the corpus approach, we aim to describe the mental lexicon
on a large scale in a fashion that is psychologically plausible. Finally, like WordNet,
the form of this lexical knowledge can be described using the language of networks.

Using Association Networks to Represent Lexical Knowledge


The approach we take is to construct large-scale semantic networks from word
association data. As such there are two key elements to this approach: The reliance
on association data, and the use of network representation. In a typical free word
association task, a person is asked to write down the first word(s) that spontaneously
come to mind after reading a cue word. The important aspect of this task is that
it is “free”: Unlike tasks such as semantic feature generation, no restrictions are
imposed on the answers that are produced.4 What makes free associations unique
and useful is the fact that they are simply the expression of thought, freed from the
demands of syntax and morphology (Szalay & Deese, 1978). If people frequently
generate the word weird when given platypus as a cue, we assume this reflects an
association between these words (e.g. people think that platypuses are weird). From
this perspective, the word association task is a measurement tool: It is an empirical
method for getting access to the semantic representations that people possess. Of
course, no empirical measure is perfect (e.g. the relationship between response
frequency and associative strength is by no means simple); our view is that the word
association task is less problematic than the alternatives.
The second element to our approach is the network representation. Words are
represented by nodes, and a directed edge connects platypus to weird because people
generate weird in response to platypus. This is not the only way in which the empirical
data could be described (e.g. we could adopt an LSA-like approach and construct
a latent geometric space), but it is one that has a strong justification. At a bare
minimum, the network representation can be motivated as a transparent reflection of
the word association task: The essence of the task is to generate connections among
words, and as such the empirical data are quite naturally described in this fashion. At
a more theoretical level, our choice is motivated by the seminal work of Collins and
Loftus (1975), who argued that a network representation provides a psychologically
plausible way of describing the mental lexicon. Put somewhat crudely, our approach
is to use the empirical word association network as an approximation to a latent
semantic network. In the next two sections we explain how these two elements are
implemented in an explicit network, where the connections between words are
derived from human responses in a word association task.
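As a concrete illustration of this construction, the sketch below builds a weighted, directed graph from cue–response counts using the Python networkx library. The tiny set of associations is invented purely for illustration and is not taken from any actual norms.

```python
# A minimal sketch: turning cue-response counts from a free association task
# into a weighted, directed graph. The toy data below are invented purely for
# illustration; real norms contain thousands of cues and responses.
import networkx as nx

# (cue, response, count): how often each response was given to each cue
association_counts = [
    ("platypus", "weird", 12), ("platypus", "duck", 30), ("platypus", "animal", 45),
    ("duck", "water", 25), ("duck", "bird", 40), ("weird", "strange", 50),
]

G = nx.DiGraph()
for cue, response, count in association_counts:
    # Edge direction follows the task: cue -> response.
    G.add_edge(cue, response, count=count)

# Convert raw counts into associative strengths: the probability of producing
# each response given the cue (counts normalized per cue).
for cue in G.nodes():
    total = sum(data["count"] for _, _, data in G.out_edges(cue, data=True))
    for _, response, data in G.out_edges(cue, data=True):
        data["weight"] = data["count"] / total if total > 0 else 0.0

print(G["platypus"]["weird"]["weight"])  # associative strength platypus -> weird
```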

Representation of Semantic Similarity


One important property of the Collins and Loftus (1975) proposal is that the
semantic network expresses both semantic similarity and lexical co-occurrence
among the pattern of connections among words. The idea, which traces back to
the work of Deese (1965), is that two words have a semantic relationship if they
are connected to the same words, which in this context is sometimes referred to as
having shared “semantic features.”
In order for this idea to be reflected in a word association network, it must be the
case that free associations are not merely sensitive to simple linguistic collocations
(e.g. ugly—duck), they must also capture a variety of semantic relationships. This is
generally thought to be true (Mollin, 2009). Because the generation of associates
is “free,” it includes a wide range of relations that might also indicate thematic, or
affective content. To illustrate this, Figure 8.1 shows part of the word association
network that is centered on the word platypus. The most common associates of
platypus include animal, duck, mammal, water, bill, Australia, eggs, funny, cute, beak, etc.
The fact that these connections reflect a variety of taxonomic, thematic, and
affective properties illustrates how word association networks possess the kind of
expressiveness required by a semantic network.
Given the importance of having a network that can capture a broad range of
possible relationships, it is worth highlighting how this expressivity is related to
the experimental method used to collect word association data. Historically, most
studies have used a procedure in which only a single response per cue word is
collected (Kiss, 1968; Nelson, McEvoy, & Schreiber, 2004). However, collecting
more than a single response is crucial in capturing the distributional properties
of the mental lexicon (Aitchison, 2012; De Deyne, Navarro, & Storms, 2013;
Hahn, 2008; Kenett et al., 2011). This “continued word association paradigm,” in
which people provide multiple associates to each cue, offers two advantages over the
traditional approach. First, weaker associations can be collected, which is especially
important for cues that have very dominant associations (e.g. blood and red). Second,
the resulting network representations are denser (i.e. contain more links between
words) and therefore are more suited to capturing the distributional properties of
meaning compared to the more homogeneous responses in single word associations
(typically including just a handful of different responses for a specific cue). This
allows us to model human relatedness judgments and test predictions about ways the
lexicon is organized, but also allows us to compare groups of speakers, which might
represent or process meaning in the lexicon differently (Szalay & Deese, 1978).
FIGURE 8.1 Portion of the associative network around the word platypus showing direct
and indirect neighboring nodes.

Spreading Activation
The second key feature of the Collins and Loftus (1975) proposal is the notion of
spreading activation. Once a word in the network is activated, activation spreads to
other connected words, and quickly dissipates with time and distance (Collins and
Loftus, 1975; Den Heyer & Briand, 1986). This principle has been influential in
many psychological theories and models such as the adaptive control of thought
theory (Anderson, 1983, 1991), and various connectionist models of semantic
cognition (Lerner, Bentin, & Shriki, 2012; McClelland & Rogers, 2003). Through
spreading activation, the meaning of a word in the network is represented by how
it is linked to other words and how these are interlinked themselves. In that sense,
spreading activation provides a mechanism by which distributed meaning can be
extracted from a network.
Formally, spreading activation can be implemented as a stochastic random walk
defined over the network. Starting at a particular node, a random walker selects an
out-bound edge with a probability proportional to the edge weight and moves across
it. Gradually, it will explore more paths around the start node. With many such
walkers, the probability of a walker being at a specific node converges to a
distribution that remains stable after many iterations. In this random walk, the
relatedness in meaning between nodes reflects the number and length of the directed
paths through the network that connect two nodes. Many short paths between a
source and a target node allow a random walk to reach the target quickly, which
reflects the fact that both nodes are closely related in meaning. In the simplest version, a single
parameter determines this walk. This parameter governs the decay of activation: It
determines the weight of paths of a specific length in such a way that longer paths get
less weight than shorter ones, which might be useful depending on the type of task.
In recent years, various empirical studies have demonstrated how memory search
is governed by a fairly simple random walk over semantic memory (Abbott et al.,
2015; Bourgin, Abbott, Griffiths, Smith, & Vul, 2014; Smith, Huber, & Vul, 2013).
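One simple way to implement this decayed walk is to sum the contributions of paths of every length, discounting a path of length k by a factor alpha to the power k; the resulting matrix is the inverse of (I minus alpha times P), minus the identity, where P is the row-normalized transition matrix. The sketch below illustrates this generic computation; it is not the exact model of any of the studies cited above.

```python
# Spreading activation as a decayed random walk over the association network.
# Paths of length k are discounted by alpha**k; summing over all lengths gives
# inv(I - alpha*P) - I, where P is the row-normalized transition matrix.
# This is a generic sketch of the idea, not the exact model of any cited study.
import numpy as np

def walk_relatedness(adjacency, alpha=0.75):
    """Return a node-by-node matrix of walk-based relatedness scores.

    adjacency: square matrix of associative strengths (cue rows, response columns).
    alpha: decay parameter in (0, 1); smaller values emphasize shorter paths.
    """
    A = np.asarray(adjacency, dtype=float)
    row_sums = A.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0            # avoid division by zero for sink nodes
    P = A / row_sums                          # transition probabilities
    n = P.shape[0]
    # Sum over all path lengths: alpha*P + alpha^2*P^2 + ...
    return np.linalg.inv(np.eye(n) - alpha * P) - np.eye(n)

# Toy example with three nodes (0 -> 1 -> 2, no direct 0 -> 2 link):
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
S = walk_relatedness(A, alpha=0.5)
print(S[0, 2])  # nonzero: node 2 is reachable from node 0 via an indirect path
```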
To sum up, we propose to construct a mental lexicon as a network derived from
word association. This network is a localist representation with nodes corresponding
to words.5 The semantic representations derived from it are functionally distributed,
in the sense that the meaning of a word is represented by activation distributed
over all edges connected with that word. The scale of the network is crucial: If the
network is too small or too poorly connected the spreading activation mechanism
becomes biased and lower frequency words like platypus might become unreachable
(i.e. they will have no incoming links).

The Structure of Semantic Networks


The discussion up to this point has focused on the core ideas behind the network
approach, outlining the theoretical basis for the approach and some methodological
considerations. We now turn to a discussion of the “payoff ” that a network approach
brings. That is, what does this perspective tell us about the organization of the
mental lexicon? To address this, we can examine the structure of large networks
simultaneously at three different levels: macroscopic, mesoscopic, and microscopic
(Borge-Holthoefer & Arenas, 2010b). Depending on the complexity and level of
analysis of the network, different functional patterns emerge, which are captured by
the phrase more is different (Anderson, 1972). In other words, network science offers
a framework that allows the examination of a network at different resolutions or
levels, without ignoring qualitative differences between these levels. Each of these
levels provides different insights and we discuss some of the findings related to each
of these levels in turn.

Insights at the Macroscopic Level


The macroscopic or network level reflects the combined role of all the connections
between the nodes of the network. It refers to structural properties of the entire
network, rather than any particular part. For example, the network of the mental
lexicon exhibits a small-world structure. In comparison to random networks,
small-world networks are characterized by a high degree of clustering, while
maintaining short paths between nodes (Borge-Holthoefer & Arenas, 2010a; De
Deyne & Storms, 2008; Solé, Corominas-Murtra, Valverde, & Steels, 2010; Steyvers
& Tenenbaum, 2005). Similarly, the mental lexicon networks contain a small
number of highly connected nodes or hubs, to a much greater extent than would be
expected of a random graph. In network terms, these hubs exhibit a degree (i.e. number
of connected nodes) that is much higher than that of other nodes. More generally, the
connectivity of the network has a characteristic distribution, in which the degrees of
the nodes in the network follow a truncated power-law distribution (Morais et al., 2013).
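For illustration, these macroscopic quantities can be computed with standard graph tools. The sketch below assumes the networkx library and a toy small-world graph in place of a real association network, and compares clustering and average path length against a size-matched random graph, which is one common way of quantifying small-world structure.

```python
# Macroscopic diagnostics on a toy graph standing in for a lexical network:
# clustering, average shortest path length (ASPL), degree distribution, and a
# comparison against a size-matched random graph. Illustrative sketch only.
import networkx as nx

G = nx.connected_watts_strogatz_graph(n=1000, k=10, p=0.1, seed=1)

C = nx.average_clustering(G)
L = nx.average_shortest_path_length(G)

# Size-matched Erdos-Renyi baseline (ASPL computed on its largest component).
R = nx.gnm_random_graph(G.number_of_nodes(), G.number_of_edges(), seed=1)
C_rand = nx.average_clustering(R)
largest = max(nx.connected_components(R), key=len)
L_rand = nx.average_shortest_path_length(R.subgraph(largest))

# Small-world structure: much higher clustering than random at a similar ASPL.
sigma = (C / C_rand) / (L / L_rand)
print(f"C = {C:.3f} (random {C_rand:.3f}), L = {L:.2f} (random {L_rand:.2f}), sigma = {sigma:.1f}")

# Degree distribution: counts of nodes having degree 0, 1, 2, ...
print(nx.degree_histogram(G)[:15])
```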
These properties are not arbitrary: There is evidence that macroscopic-level
properties such as small-world organization produce networks that are robust against
damage and allow efficient information distribution (Borge-Holthoefer, Moreno, &
Arenas, 2012). Moreover, small world organization often emerges when a network
forms via a growth process, as is the case for other dynamic networks such as networks
of scientific collaboration, neural networks, and the World Wide Web (Watts &
Strogatz, 1998). From this perspective, the observed structure of the semantic
network provides insights into how the network grows over time (Steyvers &
Tenenbaum, 2005).

Using Network Structure to Investigate Language Development


In cognitive science, the macroscopic properties of the lexicon have been extensively
studied to investigate how new words are gradually incorporated into the mental
lexicon. Various growth principles have been proposed to explain the characteristic
degree distributions of small world networks. For example, the mechanism of
preferential attachment assumes that new nodes connect to existing nodes with a
probability proportional to the number of connections those existing nodes already
have (Steyvers & Tenenbaum, 2005). Alternatively, growth could also reflect the
mechanism of preferential acquisition, where new nodes become preferentially
attached to other nodes depending on the structure of the learning environment
(Hills, Maouene, Maouene, Sheya, & Smith, 2009). Models of network growth are
interesting in their own right, but they also provide constraints to various models
of the lexicon and predict a number of interesting phenomena. For instance, the
network growth model by Steyvers and Tenenbaum explains how the age of word
acquisition and its frequency in language independently contribute to the ease with
which a word is processed.
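As a toy illustration of the preferential attachment principle only (not the full Steyvers and Tenenbaum model, which also involves differentiation of existing nodes), the sketch below grows a network in which each new node connects to existing nodes with probability proportional to their current degree.

```python
# Toy simulation of growth by preferential attachment: each new node connects
# to m existing nodes chosen with probability proportional to their current
# degree. This is a sketch of the principle only; the Steyvers and Tenenbaum
# (2005) model additionally involves semantic differentiation of existing nodes.
import random
from collections import Counter

def grow_preferential(n_nodes=2000, m=3, seed=0):
    rng = random.Random(seed)
    # Start from a small fully connected seed network of m + 1 nodes.
    nodes = list(range(m + 1))
    edges = [(i, j) for i in nodes for j in nodes if i < j]
    # A node appears in this list once per edge it is part of, so sampling
    # uniformly from it is equivalent to sampling proportional to degree.
    endpoints = [v for e in edges for v in e]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints.extend([new, t])
    return edges

edges = grow_preferential()
degree = Counter(v for e in edges for v in e)
# A few highly connected hubs emerge, far above the median degree.
print("max degree:", max(degree.values()),
      "median degree:", sorted(degree.values())[len(degree) // 2])
```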
Recent studies have also correlated macroscopic network properties with the
typical and atypical development of the mental lexicon (for example, Hills et al.,
2009; Kenett et al., 2013; Steyvers & Tenenbaum, 2005; Zortea, Menegola,
Villavicencio, & Salles, 2014). One of these studies compared the development
of individual networks in children and found that small world connectivity is
indicative of later vocabulary development, whereas children with more cohesive
and structured networks are more proficient language learners (Beckage, Smith, &
Hills, 2010).

The Relationship Between Creativity and Network Structure


According to the classical associative theory of creativity (Mednick, 1962;
Runco & Jaeger, 2012), creative individuals have a richer and more flexible
associative network than less creative individuals. Thus, creative individuals may have
more associative links in their network and can connect associative relations faster
than less creative individuals, thereby facilitating more efficient search processes
(Rossmann & Fink, 2010). Others have suggested that insight problem solving is
a result of a successful search throughout the semantic memory network, enabled
by either finding “shortcuts” or by the creation of new links between previously
unconnected nodes in the network (Schilling, 2005).
A macroscopic analysis to examine creative ability and problem solving might
shed new light on these phenomena. Recently, such an analysis revealed that the
semantic network of low creativity persons is more rigid than that of high creativity
persons (Kenett, Anaki, & Faust, 2014). This higher rigidity was expressed by
the degree of structure in the network in terms of tight clusters (as expressed
by the network modularity); longer distances connecting words (average shortest
path length); and lower small-world-ness (as expressed by a ratio of clustering
and distance, see Kenett, Anaki, & Faust, 2014). This macroscopic analysis not
only directly verified (modularity), but also extended Mednick’s classical theory
(Mednick, 1962) in terms of network distance and connectivity.

Structural Differences in Clinical Populations


Structure at the macroscopic scale can not only explain the abrupt emergence of
new cognitive functions during development, but also the degradation of these
functions with aging or neurodegenerative illness (Baronchelli, Ferrer-i-Cancho,
Pastor-Satorras, Chater, & Christiansen, 2013). While network science is widely
used to explore the neural aspects of clinical populations (Stam, 2014), a similar
methodology aimed at the cognitive level of clinical populations might also be
productive in this field. Such network tools can be used to examine the mental
lexicon organization of clinical populations suffering from speech, language and
thought disorders and provide novel insights into the nature of their deficiencies.
Currently, such studies mainly focus on analyzing small-scale representations of
lexical category organization (Beckage, Smith, & Hills, 2010; Kenett et al., 2013;
Lerner, Ogrocki, & Thomas, 2009; Voorspoels et al., 2014), or on the analysis of
speech acts (Cabana et al., 2011; Holshausen et al., 2014; Mota et al., 2012). A final
example is a recent study on the semantic networks of individuals with Asperger
syndrome. In that study, the semantic network of people with Asperger syndrome
was characterized by higher modularity than the network derived from controls,
which is argued to be related to rigidity in thought (Kenett, Gold & Faust, 2015).
Another example of this is the case of schizophrenia. Here semantic networks could
be used to test whether thought disorders can be attributed to a lack of inhibition or
an increase in activation spreading through the lexicon, leading to phenomena such
as hyper-priming, where semantically related words show larger priming effects
compared to normal controls (Gouzoulis-Mayfrank et al., 2003; Pomarol-Clotet,
Oh, Laws, & McKenna, 2008; Spitzer, 1997).

Insights at the Mesoscopic Level


The mesoscopic or group level involves the properties of a considerable subset of
nodes in the network. The structure at the mesoscopic level in the mental lexicon is
informative about the meaning of words. We can investigate structure at this level by
computing the distance between a set of words through a set of direct and indirect
paths connecting them. These distances allow us to identify closely knit regions
or clusters in the network, which are referred to as communities (or modules) in
network science.
Apart from clustering, the mesoscopic level also allows us to evaluate whether
people infer additional information by using indirect paths when comparing the
meaning of two words or when retrieving words from memory. This
process can be modeled as the stochastic random walk, which we introduced earlier
in this chapter, resulting in a measure of relatedness that reflects both the number
and the length of paths connecting two nodes in the network. Altogether, the
mesoscopic level informs us about (a) the actual content that is activated or accessed
in retrieving words and their meaning, which allows us to differentiate between
different types of words, and (b) processes and parts of the network that are involved
in retrieving word meaning.

Thematic Organization of the Mental Lexicon


Inspecting network structure at the mesoscopic level allows us to better grasp
abstract properties at the macroscopic level. Whereas the macroscopic properties
of the network summarize the network in terms of its efficiency, growth, and
global structure, it does not provide any knowledge about the content, qualitative
properties, or the similarity relationships between words in the lexicon. For example,
while we might learn that the lexicon is characterized by a small number of hubs
(words like water, money, food, and car), understanding how these hubs arise requires
investigating how they are embedded in the network. Similarly, a measure of
modularity might give us an idea about the degree of clustering for the entire graph,
but it does not provide us with any information about the nature of individual
clusters.
Looking at the qualitative aspects of specific clusters of words in the lexicon
also provides us with a direct way of grasping the principles governing lexical
organization at the semantic level. Such principles can be taxonomically or
thematically based, an issue which is still debated (Hutchison, 2003; Lucas,
2000). Furthermore, at a higher hierarchical level, larger clusters might also reveal
something about neuroanatomical constraints for how words are represented, given
that various studies indicate systematic differences between animals and artifacts
(Goldstone, 1996; Verheyen, Stukken, De Deyne, Dry, & Storms, 2011) or between abstract
and concrete words (Crutch & Warrington, 2005; Hampton, 1981).
In general these questions can be addressed by identifying clusters or groups of
nodes with a higher level of interconnection among themselves than with the rest
of the network. There are many ways through which such clusters can be derived
(see Fortunato, 2010, for an overview), and some of the more statistical approaches
are especially promising. An example is the clustering of a large-scale network of
the mental lexicon with over 12,000 nodes derived from Dutch word association data
(De Deyne, Verheyen, & Storms, 2015). In the weighted directed graph used to
represent the lexicon, the quality of the derived clusters is compared with a suitable
null model network to see whether these groups occur beyond chance levels
(Lancichinetti et al., 2011). Inspecting the clusters confirmed that the lexicon
consistently shows a widespread thematic structure (De Deyne, Verheyen, & Storms,
2015), which could also be described as a free categories organization principle (Kenett et al.,
2011).
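To give a flavor of the general procedure, the sketch below detects communities by modularity maximization on a small invented weighted graph using networkx; this is not the statistically grounded method of Lancichinetti et al. (2011) used in the cited work, but it illustrates how clusters and a modularity score are obtained.

```python
# Illustration of mesoscopic clustering: detect communities by modularity
# maximization on a small weighted graph. The cited work uses a different,
# statistically grounded method (Lancichinetti et al., 2011) on a directed
# weighted graph; this sketch only shows the general idea.
import networkx as nx
from networkx.algorithms import community

G = nx.Graph()
weighted_edges = [
    # a "bird"-themed cluster
    ("bird", "beak", 3.0), ("bird", "nest", 2.5), ("bird", "egg", 2.0), ("beak", "egg", 1.0),
    # a "money"-themed cluster
    ("money", "bank", 3.0), ("money", "coin", 2.5), ("bank", "coin", 1.5),
    # a weak link between the two themes
    ("egg", "money", 0.2),
]
G.add_weighted_edges_from(weighted_edges)

communities = community.greedy_modularity_communities(G, weight="weight")
Q = community.modularity(G, communities, weight="weight")
for i, c in enumerate(communities):
    print(f"cluster {i}: {sorted(c)}")
print(f"modularity Q = {Q:.2f}")
```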
An example of how the network could be organized at the mesoscopic level
is presented in Figure 8.2. In line with the structure presented in Figure 8.2, a
mesoscopic community detection analysis for an extensive lexicon argues against an
exclusively taxonomic view of the mental lexicon (Rosch, Mervis, Gray, Johnson,
& Boyes-Braem, 1976) but instead shows thematic structure across the hierarchies
that were derived from the data grouping. For example, for a typical taxonomic
category like birds, it would group together various birds but also words like
beak, nest, whistle, or egg. This converges with recent evidence that highlights the
role of thematic representations even in domains such as animals (Lin & Murphy,
2001).
It is quite likely that a thematic organization is an inherent property of language,
where most words are taxonomically related to only a small number of other words
but might occur in a variety of thematic settings. This also illustrates the effect of
removing the selection bias towards concrete words, since large-scale networks
represent all types of words in language, including adjectives, verbs, and nouns.

Predicting Human Judgments of Relatedness


An important test of the structure at the mesoscopic level and the type of processes
it supports is the extent to which two nodes are related depending on the direct and
indirect links that exist between them.

FIGURE 8.2 Hierarchical tree visualization of clusters in the lexicon with the five most
central members of the mental lexicon, adapted from De Deyne, Verheyen, & Storms
(2015). While the deepest level of the hierarchy shows coherent content, higher levels
also convey the relations between smaller clusters and highlight other organizational
principles of the lexicon. For example, at the highest level a word's connotation or
valence tends to capture network structure.

The idea that the closeness between a pair of
nodes in terms of the paths connecting them predicts the time to verify sentences
like a bird can fly motivated the early propositional network model by Collins and
Quillian (1969).
Human relatedness judgments for word pairs like rabbit—hare or rabbit—carrot
provide a direct way to test various topological network properties at the mesoscopic
level and to test hypotheses for different kinds of words (e.g. abstract or concrete) and
semantic relations. Distinct topological properties of the network affect these
predictions. A first one is the role of weak links, where the introduction of
weak links through continued responses represents a systematic improvement over
networks derived from single-response procedures (De Deyne, Navarro, & Storms,
2013; Hahn, 2008; Kenett et al., 2011). A second factor is the role of indirect
links that could contribute to the relatedness of word pairs. Several studies show
that incorporating indirect mesoscopic structure using random walks improves
predictions of human similarity judgments (Borge-Holthoefer & Arenas, 2010a; De
Deyne, Verheyen, & Storms, 2015; Van Dongen, 2000) and can be used to extract
categorical relations between words (Borge-Holthoefer & Arenas, 2010a). A final
factor is the directionality of the network. When undirected networks are derived,
the density of the network increases as the presence of an undirected link is based
on either an in or out-going link. Whereas additional density through continued
responses improves prediction, ignoring the directionality actually hampers the
prediction of human similarity judgments (De Deyne, Navarro, & Storms, 2013).
Altogether, incorporating weak links and considering indirect and directed paths
contribute to explaining human semantic cognition.
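One simple way to turn such a network into pairwise relatedness scores is to compute the decayed-walk matrix sketched earlier and compare the walk profiles of two words, for instance with a cosine. The sketch below uses invented associative strengths and is only meant to show how direct and indirect directed paths both enter the score; it is not the exact measure used in the studies cited above.

```python
# One simple way to score pairwise relatedness from both direct and indirect
# directed paths: compute the decayed-walk matrix and take the cosine between
# the walk profiles of the two words. Sketch of the general idea only, with
# invented associative strengths; not the exact measure of any cited study.
import numpy as np

words = ["rabbit", "hare", "carrot", "vegetable", "fur"]
# Toy associative strengths (rows = cues, columns = responses); invented values.
A = np.array([
    [0.0, 0.4, 0.3, 0.0, 0.3],   # rabbit -> hare, carrot, fur
    [0.5, 0.0, 0.0, 0.0, 0.5],   # hare -> rabbit, fur
    [0.2, 0.0, 0.0, 0.8, 0.0],   # carrot -> rabbit, vegetable
    [0.0, 0.0, 0.7, 0.0, 0.0],   # vegetable -> carrot
    [0.6, 0.4, 0.0, 0.0, 0.0],   # fur -> rabbit, hare
])

def walk_matrix(A, alpha=0.75):
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    n = len(A)
    return np.linalg.inv(np.eye(n) - alpha * P) - np.eye(n)

def relatedness(w1, w2, S):
    i, j = words.index(w1), words.index(w2)
    a, b = S[i], S[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

S = walk_matrix(A)
print(relatedness("rabbit", "hare", S))    # high: many shared direct and indirect paths
print(relatedness("rabbit", "carrot", S))  # lower, but nonzero via indirect paths
```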

Capturing Semantic Priming Effects


A quintessential example of the role of processing requirements, directionality,
associative strength, and direct and indirect paths is priming. In priming tasks, the
processing of a target word is enhanced when it is preceded by a related cue word.
In the case of associative priming this involves the presentation of a cue such as dog,
which facilitates processing of the target bone. In network terms, such facilitation
might be explained by the presence of an associative link between these words.
Even more so, this priming not only reflects the presence of an associative link, but
also the strength of the links between nodes (Cañas, 1990).
Closely related is mediated priming, whereby a cue primes a target through a
mediated link, as in the example of stripes—tiger—lion (Chwilla & Kolk, 2002).
This type of priming is of particular theoretical importance, as it allows testing
the assumption of spreading activation throughout the network (Hutchison, 2003).
Mediated priming also extends to more complex scenarios such as three-step
priming, where two intervening non-presented concepts exist between prime
and target. An example would be where the prime mane activates a target stripes
through the mediators lion and tiger, which are never presented (McNamara, 1992).
Using large-scale free association networks, this type of mediated priming can be
investigated empirically by considering the paths connecting word pairs (Kenett,
Anaki, & Faust, 2015).
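One way to operationalize this on a large association network is simply to enumerate the short directed paths linking prime and target. The sketch below does so with networkx on a toy graph built around the mane–lion–tiger–stripes example.

```python
# Investigating mediated priming by enumerating the short directed paths that
# link a prime to a target. Toy graph reproducing the mane -> lion -> tiger ->
# stripes example; path enumeration via networkx. Illustrative sketch only.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("mane", "lion"), ("mane", "hair"),
    ("lion", "tiger"), ("lion", "roar"),
    ("tiger", "stripes"), ("tiger", "lion"),
    ("stripes", "zebra"),
])

# All directed paths of length <= 3 edges from prime to target.
paths = list(nx.all_simple_paths(G, source="mane", target="stripes", cutoff=3))
for p in paths:
    print(" -> ".join(p))   # mane -> lion -> tiger -> stripes
```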
A final type of priming that is often considered distinct from the two previous
ones is semantic priming. Here, an ensemble of shared features or links rather than a
single connection determines whether priming occurs. In contrast with associative
priming, semantic priming is considered to be symmetrical, which allows
disentangling both types of priming (Thompson-Schill, Kurtz, & Gabrieli, 1998). From
a large-scale network perspective, spreading activation over a semantic network may
account for various types of priming. First of all, the spreading activation account
is often used to explain associative priming, through finding the shortest path
between the prime and target (Thompson-Schill, Kurtz, & Gabrieli, 1998). When
activation spreads through every possible path connecting two words, it captures
different components of meaning where the summed activation reflects semantic
similarity. The model spans a continuum going from a single direct path to an
ensemble of paths with arbitrary lengths. At the end of this spectrum, the summed
activation over many nodes will start to resemble the activation of semantic features
in distributed models (e.g. Plaut, McClelland, Seidenberg, & Patterson, 1996).
Thus, a network account provides a flexible, yet well-defined way to understand
many of the documented priming effects but also questions certain theoretical
predictions, for instance in the case of the symmetric nature of semantic priming.

Memory Retrieval and Search in the Lexicon


Given the large amount of knowledge stored in the lexicon, retrieving information
from the lexicon is a formidable challenge. In contrast to direct similarity
judgments (which highlight meaning) and priming (which allows us to grasp fast,
automatic, or even unconscious processing), memory and information search studies
highlight both cue salience and meaning during encoding and retrieval.
Two experimental paradigms have been developed to assess the underlying structure
that would support information retrieval from the lexicon.
A first line of research involves episodic memory effects such as false memories
(McEvoy, Nelson, & Komatsu, 1999; Roediger, Balota, & Watson, 2001),
recognition, and cued recall (Nelson et al., 1998; Nelson, McEvoy, & Bajo, 1988).
An interesting property of these tasks is how both direct and indirect paths between
words affect the intrusion of non-presented items in false memories and efficiency
in cued recall and recognition tasks. In cued recall, for example, participants are
required to retrieve words presented earlier when given a cue word in a
subsequent test phase. Because the cue word itself is not presented during the study
phase, the degree to which it helps retrieval of a studied word depends on the
pre-existing connectivity these words have in the lexicon. If the cue word has both
many and strong indirect paths connecting it with the target word, this significantly
increases the probability of correctly recalling the target from memory.
A second line of research might provide even more missing pieces of the puzzle, as
it involves much broader sampling of structure at the mesoscopic level by requiring
the integration of many paths between words. Two tasks that leverage how
humans navigate the network are the remote associates task (RAT; Mednick, 1962) and
the more recent remote triad task (RTT; De Deyne, Navarro, Perfors, & Storms, 2016).
In the first task, participants are shown three cue words (e.g. falling—actor—dust)
and have to guess which word relates them.6 In the second task, three words
are randomly chosen from a large pool of words (e.g. Sunday—vitamin—idiot)
and participants have to choose which pair is most related. Both tasks entail
the integration of paths connecting words but do this for different ranges (short
in the case of the RAT, long in the case of the RTT). Similar to the judged
relatedness studies, accessing the deep mesoscopic structure of the lexicon through
random walks seems to be key for explaining human performance in the RAT
and RTT (Abbott et al., 2015; Capitán et al., 2012; De Deyne, Verheyen &
Storms, 2015; Gupta, Jang, Mednick, & Huber, 2012; Thompson & Kello,
2014).
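Given any relatedness measure defined over the network, such as the walk-based score sketched earlier, a minimal model of the remote triad task scores the three possible pairs and predicts the choice of the strongest one. The scores in the sketch below are invented solely to illustrate the logic.

```python
# A minimal model of the remote triad task: given some relatedness measure over
# the lexicon (e.g. the walk-based score sketched earlier), score the three
# possible pairs of a triad and choose the strongest. The scores below are
# invented solely to illustrate the logic.
from itertools import combinations

toy_relatedness = {
    frozenset(["Sunday", "vitamin"]): 0.08,
    frozenset(["Sunday", "idiot"]): 0.03,
    frozenset(["vitamin", "idiot"]): 0.02,
}

def predict_triad_choice(triad, relatedness):
    pairs = list(combinations(triad, 2))
    return max(pairs, key=lambda pair: relatedness[frozenset(pair)])

print(predict_triad_choice(["Sunday", "vitamin", "idiot"], toy_relatedness))
# -> ('Sunday', 'vitamin'): the pair with the highest relatedness score
```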

Insights at the Microscopic Level


The microscopic or node level of analysis of the network focuses on how a single
node is connected with the rest of the network. One example of a network measure
at this level is node centrality, expressed as the number of different connections
(set size) of a word. This type of centrality has been studied quite extensively in
psycholinguistics and explains why certain words are processed more efficiently than
others (Chumbley, 1986; Hutchison, 2003; Nelson & McEvoy, 2000). However, set
size provides a highly impoverished view of how words can be central in a network.
Instead, the network view provides a richer hypothesis space by distinguishing
weighted and directed relations connecting nodes. The centrality of a node can
alternatively be characterized by the number of in and out-going edges (the in
and out-degree), the strength of these edges (in and out-strength), or a reciprocal
measure that combines in and out-edges. Moreover, centrality measures may reflect
some degree of mesoscopic structure as in the cases of eigen-centrality measures
like PageRank (Page, Brin, Motwani, & Winograd, 1998). These measures take
into account the centrality of the neighboring nodes as well and proved valuable
in information retrieval. PageRank, for example, indexes the importance of web
pages based on how important the pages that link to it are and analytically closely
resembles an implementation of spreading activation based on random walks.
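The sketch below illustrates several of these measures on a toy directed, weighted association graph using networkx: in- and out-degree, in- and out-strength, and PageRank as an example of a measure that also takes the centrality of a node's neighbors into account. The graph and its weights are invented for illustration.

```python
# Several microscopic centrality measures on a directed, weighted association
# graph: in/out-degree, in/out-strength, and PageRank (which also reflects the
# centrality of a node's neighbors). Toy graph for illustration only.
import networkx as nx

G = nx.DiGraph()
G.add_weighted_edges_from([
    ("platypus", "animal", 0.45), ("platypus", "duck", 0.30), ("platypus", "weird", 0.12),
    ("duck", "animal", 0.40), ("duck", "water", 0.25),
    ("dog", "animal", 0.50), ("dog", "water", 0.10),
])

word = "animal"
print("in-degree:   ", G.in_degree(word))                      # number of incoming links
print("out-degree:  ", G.out_degree(word))                     # number of outgoing links
print("in-strength: ", G.in_degree(word, weight="weight"))     # summed incoming weights
print("out-strength:", G.out_degree(word, weight="weight"))    # summed outgoing weights

# PageRank: a reverberatory measure closely related to spreading activation.
pr = nx.pagerank(G, alpha=0.85, weight="weight")
print("PageRank:    ", round(pr[word], 3))
```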

Simple Network Centrality Measures to Explain Word Processing Advantages
At the microscopic level of the mental lexicon, the interconnectivity of a particular
node with its neighboring nodes affects how a word is retrieved or processed. The
basic concept of interconnectivity is expressed throughout cognitive science. In
psycholinguistics, a host of environmental variables have been proposed as to why
some words are processed more efficiently, based on word frequency, contextual
diversity, age of acquisition, and so on. In the memory literature, network-inspired
explanations include the fan effect, where the more things that are learned about
a word, the longer it takes to retrieve any one of those facts (Radvansky, 1999).
Similarly, various studies have found that in semantic tasks, words with many features
are processed more efficiently than words with just a few features (Pexman, Holyk,
& Monfils, 2003; Recchia & Jones, 2012).
Mechanistic explanations of how these environmental variables affect structure
in the lexicon are often based on the idea that the number of connections a word
has in the network influences processing time. In some cases network accounts have
been explicitly tested, for instance for explaining the effects of age of acquisition
(Steyvers & Tenenbaum, 2005), contextual diversity, and word frequency (Monaco,
Abbott, & Kahana, 2007). In the memory literature, the clustering coefficient of a
node has been proposed to explain a host of other memory and word processing
phenomena including recognition (Nelson, Zhang, & McKinney, 2001) and cued
recall (Nelson et al., 1998). However, more often than not a network is only invoked
as an explanatory device rather than a full-fledged computational model.
Why should large-scale network representations be used to examine the microscopic
organization level of the mental lexicon? The main reason is that large-scale
network implementations offer a more complete framework that allows us to
explicitly test various ways in which nodes can have processing advantages. If
the network is sufficiently large, it is possible to cover both the number of in
and out-going links as well as the number of links that might exist between
the neighbors of a node, leading to richer explanations of node centrality than
previous proposals. A good example is the concreteness effect, a finding where
highly imageable words such as chicken will be processed faster and more accurately
than abstract ones like intuition in word processing tasks like lexical decision
(Kroll & Merves, 1986). According to one hypothesis about the representation
of these words in memory, concrete words have smaller associate sets than abstract ones
(Galbraith & Underwood, 1973; Schwanenflugel & Shoben, 1983; and see de Groot,
1989), but such an explanation ignores both the weights and directionality of the
links. This conflicts with evidence that centrality measures derived from
undirected networks do not correspond as well with external measures
such as imageability (Galbraith & Underwood, 1973), age of acquisition (De Deyne
& Storms, 2008), or decision latencies in lexical decision (De Deyne, Navarro, &
Storms, 2013) as directed centrality measures do. In particular, estimates
of in-degree and in-strength rely on how representative the set of cues is to build the
network. For example, if, for some reason, the word water, which frequently occurs
as a response, was never presented as a cue, the out-degree or out-strength for many
words will be biased as these responses are not encoded in the network.

Reverberatory and Other Complex Network Centrality Measures


Besides incorporating edge weights and directionality, recent studies indicate that
centrality might reflect even richer structure. Taking into account how central the
neighbors of a node are tends to result in better predictions than measures
that are based only on the centrality (in terms of degree, for instance) of the node itself.
For example, in the phonological fluency task, in which participants generate as
many words starting with a specific letter as possible, the PageRank measure was
able to account for more of the variance than word frequency (Griffiths, Steyvers,
& Firl, 2007). In other words, a network perspective provides a way to test how
reverberatory or feedback effects could contribute to how efficiently words are
retrieved.
Similarly, other studies have also found that centrality measures that capture some
of the mesoscopic or macroscopic properties might explain additional variance in
word processing. One example of such a measure is the word centrality measure
(Kenett et al., 2011). This measure examines the effect of each node on the general
structure of the network. This is achieved by removing a node and examining
the effect of the node removal on the average shortest path length (ASPL) of
the network without that node. In a study analyzing the Hebrew mental lexicon,
Kenett et al. (2011) found that some nodes greatly increase the ASPL of the network
once they are removed, thus indicating that these facilitative nodes enhance the
spread of activation within the network. The authors also found that some nodes
greatly decrease the ASPL of the network once they are removed, thus indicating the
presence of inhibitive nodes that hamper the spread of activation within the network.
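A minimal sketch of this node-removal logic is given below. Studies differ in how they handle the disconnection that removal can cause; here, as a simplifying assumption, ASPL is always computed on the largest connected component, and the toy graph is constructed so that one node is facilitative and another inhibitive in the sense just described.

```python
# Sketch of the node-removal centrality described above: compare the average
# shortest path length (ASPL) of the network with and without a given node.
# Studies differ in how they treat disconnection; here ASPL is computed on the
# largest connected component as a simplifying assumption. Toy graph only.
import networkx as nx

def aspl(G):
    component = max(nx.connected_components(G), key=len)
    return nx.average_shortest_path_length(G.subgraph(component))

def aspl_change(G, node):
    H = G.copy()
    H.remove_node(node)
    return aspl(H) - aspl(G)

# A ring of six words plus one pendant word attached to "water".
G = nx.cycle_graph(["animal", "dog", "cat", "pet", "fish", "water"])
G.add_edge("water", "aquarium")

print(round(aspl_change(G, "cat"), 3))       # positive: removal lengthens paths (facilitative node)
print(round(aspl_change(G, "aquarium"), 3))  # negative: removal shortens paths (inhibitive node)
```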
Altogether, more complex centrality measures based on the reverberatory spread
of activation of a node or ASPL highlight how interpreting the complexity at each
of the three levels provides a richer explanation for word processing effects.

Discussion
In this chapter we have shown how a large-scale network representation of the
mental lexicon and the processes operating on it can account for a large diversity of
cognitive phenomena. At the macroscopic level these include language development,
creativity, and communication and thought disorders in clinical populations. At
the mesoscopic level they include the analysis of lexicon organization principles,
semantic relatedness, semantic priming, and word retrieval processes. At the
microscopic level they include explanations for word processing advantages for
environmental variables such as concreteness, age of acquisition, or word frequency,
but also an overarching framework for memory-based explanations including the
fan effect and potentially other measures of semantic richness of words. These
studies gradually depict a larger, broader picture of the role of lexicon structure
in a wide variety of cognitive phenomena. It can be expected that applications of
network theory to studies of the lexicon will continue to grow in the years to come
(Baronchelli, Ferrer-i-Cancho, Pastor-Satorras, Chater, & Christiansen, 2013; Faust
& Kenett, 2014).
From a modeling perspective, looking at different scales of the network
provides us with a rich way of evaluating and contrasting different proposals. In
particular, any model of semantic processing can now be evaluated in terms of the
type of macroscopic structure it exhibits, which can be achieved by looking at
degree-distributions or global modularity indices. At this level, models are expected
to be robust against damage and promote efficient diffusion of information. At the
mesoscopic level, models should be able to account for the relatedness in meaning
between words both in offline and online tasks. Contrasting different tasks such as
overt relatedness judgments and semantic priming allows us to investigate issues such
as the time course of information retrieval from the lexicon, potential asymmetric
semantic relations, and the interaction between the centrality of a node and how
its meaning is accessed. Finally, at a microscopic level, the network-based account
indicates that word processing advantages can be the result of distinct connectivity
patterns in directed networks, which are equally likely to affect the predictions of
language-based tasks (e.g. production or retrieval; online or offline) in different
ways. At each of these three levels, various studies have provided valid accounts, but
only a few studies have integrated evidence to capture the multilevel structure of the
lexicon (Griffiths, Steyvers, & Tenenbaum, 2007, is a notable exception). Adopting
this approach limits the possible models but also forces models to be comprehensive
from the start. Ultimately, both factors should provide us with a more accurate
appraisal of how knowledge is represented throughout the lexicon.
The application of a multilevel network view might even go further than
providing a general framework upon which several empirical hypotheses and
predictions can be examined simultaneously via quantitative means. It might also
inspire new theoretic views. One example of how large-scale mental networks
are starting to shape theories in cognitive science is a novel proposal which relates
lexicon structure to typical and atypical semantic processing (Faust & Kenett, 2014).
This theory proposes a cognitive continuum of lexicon structure. On one extreme
of this continuum lies rigid, structured lexicon networks, such as those exhibited in
individuals with Asperger syndrome (Kenett, Gold & Faust, 2015). On the other
end of this continuum lies chaotic, unstructured lexicon networks, such as those
exhibited in individuals with schizophrenia (Zeev-Wolf, Goldstein, Levkovitz, &
Faust, 2014). According to this theory, efficient semantic processing is achieved
via a balance between rigid and chaotic lexicon structure (Faust & Kenett, 2014).
For example, with regard to individual differences in creative ability, the more
rigid the mental lexicon structure, the less creative the individual, even to the point
of a clinical state. Conversely, the more chaotic the mental lexicon structure, the
more creative the individual, again producing a clinical state in the extreme case
(see also Bilder & Knudsen,
2014). This theory demonstrates how network analysis of the mental lexicon can
provide a general account for a wide variety of cognitive phenomena. In this regard,
large-scale representations of the mental lexicon are crucial to advancing such
network analysis in cognitive science. As stated above, uncovering larger portions
of the mental lexicon, via large-scale representations, will advance examination of
cognitive phenomena at all network levels.

Extending the Models to Specific Groups and Individuals


In the introduction to this chapter we have argued that a mental network derived
from word associations captures the mental properties of meaning that so far have
not been available in linguistically inspired expert models or text corpus-based
models. One of the assets of this framework is that it easily extends to homogeneous
subpopulations of language users or even individuals.
One obvious way to extend this work is by looking at developmental
patterns in order to obtain a more integrated view of why abrupt developmental
changes in the lexicon occur. In children, this would primarily include the
syntagmatic-to-paradigmatic shift (Ervin, 1961) around the age of six, or the
thematic-to-taxonomic shift around the age of nine (Waxman & Gelman, 1986).
In older adults, a similar explanation might account for a reverse shift towards
syntagmatic responses in Alzheimer patients (Baker & Seifert, 2001), which
potentially extends to the aging lexicon more generally. Again, the issue of
network size should be considered as some of these shifts might indicate a transition
simply caused by increased size or connectivity.
A second possible extension of the large-scale network representation model is
towards examining individual semantic networks. While most lexicon networks
are collected from many individuals, some studies have tried to look at network
structure within specific individuals. To date, only a few studies have investigated the
macroscopic properties of the semantic networks of different individuals. One such
study was conducted by Morais et al. (2013). Similar to aggregated networks, the
individual networks showed short average distances and a high degree of clustering
between nodes. Moreover, the degree-distribution also followed a truncated power
law just like the aggregated counterparts. One of the interesting observations in this
study was the variability in the network sizes for different individuals, with some
networks consisting of just over 5,000 links and others over 27,000 links. Potentially,
this relates to other individual differences, such as executive functions, intelligence,
attention, etc. On the other hand, these differences might also reflect task effects
such as sensitivity to semantic satiation, the effect whereby stimuli temporarily lose
their meaning, which potentially occurs in prolonged tasks (Cramer, 1968; Szalay
& Deese, 1978). More research is needed to establish ways to minimize task effects
specific to individuals, groups of people, and assessments over prolonged periods.
Altogether, investigating network structure at the individual level might be feasible
and many of the implications of individual differences are yet to be explored.
A final extension of large-scale network representations is towards the study
of clinical populations. While investigations into the cognitive aspects of clinical
populations from a network scientific perspective are just beginning to take off,
a few studies so far prove the feasibility (Cabana et al., 2011; Holshausen et al.,
2014; Kenett et al., 2015, 2013; Mota et al., 2012). Conducting large-scale network
representation studies in clinical populations could greatly contribute by providing
quantitative measures related to clinical deficiencies. Furthermore, such research
can be used in clinical diagnostics, by examining lexicon structure during clinical
treatment.

Challenges
Recently, Griffiths (2015) presented a manifesto for a computational cognitive
revolution in the era of Big Data. In line with his view, we advocate in this chapter
the significance of investigating the mental lexicon from a multi-level, Big Data,
association-based approach. So far we have mainly focused on the representation
of word meanings, without saying much about other factors that affect semantic
processing and memory retrieval. Various studies provide indirect evidence that
executive functions, working memory, attention, mood and personality traits all
contribute to how we process and retrieve meaning in the lexicon (Bar, 2009; Beaty,
Silvia, Nusbaum, Jauk, & Benedek, 2014; Benedek, Franz, Heene, & Neubauer,
2012; Benedek, Jauk, Sommer, Arendasy, & Neubauer, 2014; Heyman, Van
Rensbergen, Storms, Hutchison, & De Deyne, 2015). To advance the application of
large-scale representations of the mental lexicon, such applications must account for
the effects of these variables.
A further challenge is elucidating the relation between the phonological network
(Arbesman, Strogatz, & Vitevitch, 2010), which serves as the gateway into the
mental lexicon, and the semantic network. This relation can be studied from a
network of networks perspective (Kenett et al., 2014), that provides a way to analyze
networks which are related to each other and the interaction between them. Such
an approach will enable quantitative analysis of broader linguistic issues.
Besides the challenges in further integrating psychological factors and other
aspects of word representations, the use of network analysis also raises some
methodological challenges. The picture drawn so far still only covers a small portion
of how network science for large graphs can contribute to many psychologically
interesting phenomena. At the moment, many methods and ideas in network
science are just recent developments and continue to improve. Only a few years
ago, the use of binary undirected networks was dictated by the lack of methods for
weighted directed graphs. Similarly, the statistical underpinnings for identifying
clusters in these types of networks (Lancichinetti et al., 2011) or comparing different
networks have only very recently become available. Currently, developing statistical
models that allow us to test hypotheses for comparing networks remains a major
challenge for applying network science in empirical research. This is mainly due to
difficulties in estimating or collecting a large sample of empirical networks and the
fact that only a few statistical methods exist for comparing networks (Moreno & Neville,
2013). In these cases, bootstrapping methods over comparable networks can offer a
solution (Baxter, Dorogovtsev, Goltsev, & Mendes, 2010). A similar approach is
used in more advanced applications of community detection for large-scale directed
weighted networks, where the cluster membership is determined by evaluating
the likelihood of this event in a comparable random network (Lancichinetti et al.,
2011).
Perhaps an even bigger challenge is the need to incorporate dynamic properties
into the networks, in order to address the dynamic nature of semantic
processing over the mental lexicon and its growth and evolution over the lifespan. Such
a dynamic time-course process of semantic retrieval within an individual might
involve the availability of different types of semantic information. In these cases it
will be important to distinguish qualitatively different types of links between nodes,
which would lead to a multiplex network, where the contribution of different
types of information (be they thematic, taxonomic, language, or imagery-based) is
time-dependent.
While studying network dynamics together with labeled semantic relations
could help us better grasp developmental changes throughout our lives, it does not
explain how episodic experiences eventually become encoded as part of the
semantic knowledge represented in the lexicon. This brings us to a last issue of
a more theoretical nature. While the properties of the linguistic environment are
an important key to the puzzle of where semantic representations come from, and
text-corpus models do indicate that this linguistic input is richly structured, it
remains unclear how to arrive at mental models that operate on this input. Similarly,
it is unclear to what extent input from the linguistic environment in itself suffices
to capture the richness in meaning of the mental lexicon. In previous research
(De Deyne et al., 2015; De Deyne, Verheyen, & Storms, 2015), we found only
limited agreement between textual-corpora based and word association based
networks, which adds empirical support to the idea that the association task does
not rely on the same properties as common language production, but should rather
be seen as tapping into the semantic information of the mental lexicon (McRae,
Khalkhali, & Hare, 2011; Mollin, 2009). Of course, humans do encode structure
from the languages they are exposed to and in many cases this structure can even
mimic properties that are considered to be non-linguistic in the first place (Louwerse
& Connell, 2011). This suggests that text corpus data can provide us with at least
a partial answer of how language shapes the mental lexicon but large-scale word
association networks might give us a more privileged view into the structure,
dynamics, and processes of the mental lexicon.
The above makes clear that different approaches to study word meaning are
complementary to each other. Whether derived from text corpora or from more
direct word associations, they highlight the valuable role of Big Data to understand
how words are acquired and represented typically and in atypical cases such as
psychiatric or neurological disorders. In doing so, there is not a single preferred level
of analysis, and just like the qualitative properties of the data, considering different
levels of complexity in these data will be important to constrain future theories and
understand a large diversity of empirical findings in this field.

Acknowledgments
Work on this chapter was supported by ARC grant DE140101749 to the first
author. The work of authors YNK and MF was supported by a Binational Science
Foundation (BSF) grant (number 2013106) and by the I-CORE
Program of the Planning and Budgeting Committee and the Israel Science
Foundation (grant 51/11). DJN received salary support from ARC grant
FT110100431. Comments may be sent to simon.dedeyne@adelaide.edu.au.

Notes
1 Information retrieved from http://wordnet.princeton.edu/wordnet/man/wnstats.
7WN.html.
2 Results were retrieved from the English Lexicon project website, see http://
elexicon.wustl.edu.
3 For similar examples, try deriving neighbors using the LSA website at
http://lsa.colorado.edu.
4 Some researchers do ask participants to give “meaningful responses” (Nelson,
McEvoy, & Dennis, 2000); in all studies of our own, we have stuck to a true
free task.
5 The idea that each word maps onto exactly one node is most likely an unrealistic
assumption about how words are actually represented in the brain. However,
this simplification offers us both the flexibility needed to integrate key findings
in word processing and the ability to understand explicitly how information is
retrieved as the state of the network is interpretable by looking at which nodes
are activated.
6 In this easy example, the related word is star.

References
Abbott, J. T., Austerweil, J. L., & Griffiths, T. L. (2015). Random walks on semantic
networks can resemble optimal foraging. Psychological Review, 122(3), 558–569.
Aitchison, J. (2012). Words in the mind: An introduction to the mental lexicon. Oxford:
Wiley-Blackwell.
Ameel, E., & Storms, G. (2006). From prototypes to caricatures: Geometrical models for
concept typicality. Journal of Memory and Language, 55, 402–421.
Anderson, J. R. (1983). A spreading activation theory of memory. Journal of Verbal Learning
and Verbal Behavior, 22(3), 261–295.
Anderson, J. R. (1991). The adaptive nature of human categorization. Psychological Review,
98(3), 409.
Anderson, P. W. (1972). More is different. Science, 177(4047), 393–396.
Arbesman, S., Strogatz, S. H., & Vitevitch, M. S. (2010). The structure of phonological
networks across multiple languages. International Journal of Bifurcation and Chaos, 20(3),
679–685.
Baker, M. K., & Seifert, L. S. (2001). Syntagmatic-paradigmatic reversal in Alzheimer-type
dementia. Clinical Gerontologist, 23(1–2), 65–79.
Bar, M. (2009). A cognitive neuroscience hypothesis of mood and depression. Trends in
Cognitive Sciences, 13(11), 456–463.
Baronchelli, A., Ferrer-i-Cancho, R., Pastor-Satorras, R., Chater, N., & Christiansen, M. H.
(2013). Networks in cognitive science. Trends in Cognitive Sciences, 17(7), 348–360.
Baxter, G. J., Dorogovtsev, S. N., Goltsev, A. V., & Mendes, J. F. (2010). Bootstrap percolation
on complex networks. Physical Review E, 82(1), 011103.
Beaty, R. E., Silvia, P. J., Nusbaum, E. C., Jauk, E., & Benedek, M. (2014). The roles
of associative and executive processes in creative cognition. Memory & Cognition, 42(7),
1186–1197.
Beckage, N. M., Smith, L. B., & Hills, T. (2010). Semantic network connectivity is related
to vocabulary growth rate in children. In Proceedings of the 32nd Annual Conference of the
Cognitive Science Society, Portland, OR, USA (pp. 2769–2774).
Benedek, M., Franz, F., Heene, M., & Neubauer, A. C. (2012). Differential effects of
cognitive inhibition and intelligence on creativity. Personality and Individual Differences,
53(4), 480–485.
Benedek, M., Jauk, E., Sommer, M., Arendasy, M., & Neubauer, A. C. (2014). Intelligence,
creativity, and cognitive control: The common and differential involvement of executive
functions in intelligence and creativity. Intelligence, 46, 73–83.
Bilder, R. M., & Knudsen, K. S. (2014). Creative cognition and systems biology on the edge
of chaos. Frontiers in Psychology, 1–4.
Borge-Holthoefer, J., & Arenas, A. (2010a). Categorizing words through semantic memory
navigation. The European Physical Journal B-Condensed Matter and Complex Systems, 74(2),
265–270.
Borge-Holthoefer, J., & Arenas, A. (2010b). Semantic networks: Structure and dynamics.
Entropy, 12(5), 1264–1302.
Borge-Holthoefer, J., Moreno, Y., & Arenas, A. (2012). Topological versus dynamical
robustness in a lexical network. International Journal of Bifurcation and Chaos in Applied
Sciences and Engineering, 22(7), 1250157.
Bourgin, D. D., Abbott, J. T., Griffiths, T. L., Smith, K. A., & Vul, E. (2014). Empirical
evidence for markov chain Monte Carlo in memory search. In Proceedings of the 36th
Annual Meeting of the Cognitive Science Society.
Brown, R., & McNeill, D. (1966). The tip of the tongue phenomenon. Journal of Verbal
Learning and Verbal Behavior, 5(4), 325–337.
Cabana, A., Valle-Lisboa, J. C., Elvevåg, B., & Mizraji, E. (2011). Detecting order-disorder
transitions in discourse: Implications for schizophrenia. Schizophrenia Research, 131(1),
157–164.
Cañas, J. J. (1990). Associative strength effects in the lexical decision task. The Quarterly
Journal of Experimental Psychology, 42(1), 121–145.
Capitán, J. A., Borge-Holthoefer, J., Gómez, S., Martinez-Romo, J., Araujo, L., Cuesta,
J. A., & Arenas, A. (2012). Local-based semantic navigation on a networked representation
of information. PLoS One, 7(8), e43694.
Chumbley, J. I. (1986). The roles of typicality, instance dominance, and category dominance
in verifying category membership. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 12, 257–267.
Chwilla, D. J., & Kolk, H. H. (2002). Three-step priming in lexical decision. Memory &
Cognition, 30, 217–225.
Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic processing.
Psychological Review, 82, 407–428.
Collins, A. M., & Quillian, M. R. (1969). Retrieval time from semantic memory. Journal of
Verbal Learning and Verbal Behavior, 9, 240–247.
Cramer, P. (1968). Word association. New York, NY: Academic Press.
Crutch, S. J., & Warrington, E. K. (2005). Abstract and concrete concepts have structurally
different representational frameworks. Brain, 128, 615–627.
De Deyne, S., & Storms, G. (2008). Word associations: Network and semantic properties.
Behavior Research Methods, 40, 213–231.
De Deyne, S., Navarro, D. J., Perfors, A., Storms, G. (2016). Structure at every scale:
A semantic network account of the similarities between unrelated concepts. Journal of
Experimental Psychology: General, 145(9), 1228–1254.
De Deyne, S., Navarro, D. J., & Storms, G. (2013). Better explanations of lexical and semantic
cognition using networks derived from continued rather than single word associations.
Behavior Research Methods, 45, 480–498.
De Deyne, S., Verheyen, S., Ameel, E., Vanpaemel, W., Dry, M., Voorspoels, W., & Storms,
G. (2008). Exemplar by feature applicability matrices and other Dutch normative data for
semantic concepts. Behavior Research Methods, 40, 1030–1048.
De Deyne, S., Verheyen, S., & Storms, G. (2015). Structure and organization of the mental
lexicon: A network approach derived from syntactic dependency relations and word
associations. In A. Mehler, A. Lucking, S. Banisch, P. Blanchard, & B. Job (Eds.), Towards
a theoretical framework for analyzing complex linguistic networks (pp. 47–82). Berlin/New York:
Springer.
De Deyne, S., Verheyen, S., & Storms, G. (2015). The role of corpus size and syntax in
deriving lexico-semantic representations for a wide range of concepts. Quarterly Journal of
Experimental Psychology, 68(8), 1643–1644.
Deese, J. (1965). The structure of associations in language and thought. Baltimore, MD: Johns
Hopkins Press.
de Groot, A. M. (1995). Determinants of bilingual lexicosemantic organisation. Computer
Assisted Language Learning, 8(2–3), 151–180.
Den Heyer, K., & Briand, K. (1986). Priming single digit numbers: Automatic spreading
activation dissipates as a function of semantic distance. The American Journal of Psychology,
99(3), 315–340.
Dry, M., & Storms, G. (2009). Similar but not the same: A comparison of the utility
of directly rated and feature-based similarity measures for generating spatial models of
conceptual data. Behavior Research Methods, 41, 889–900.
Elman, J. L. (2009). On the meaning of words and dinosaur bones: Lexical knowledge
without a lexicon. Cognitive Science, 33(4), 547–582.
Ervin, S. M. (1961). Changes with age in the verbal determinants of word-association. The
American Journal of Psychology, 74(3), 361–372.
Faust, M., & Kenett, Y. N. (2014). Rigidity, chaos and integration: Hemispheric interaction
and individual differences in metaphor comprehension. Frontiers in Human Neuroscience,
8(511), 1–10. doi: 10.3389/fnhum.2014.00511.
Fellbaum, C. (1998). WordNet: An electronic lexical database. Retrieved from
www.cogsci.princeton.edu/wn. Cambridge, MA: MIT Press.
Firth, J. R. (1968). Selected papers of J. R. Firth, 1952–59./edited by F. R. Palmer (Longmans’
Linguistics Library). Harlow: Longmans.
Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3), 75–174.
Galbraith, R. C., & Underwood, B. J. (1973). Perceived frequency of concrete and abstract
words. Memory & Cognition, 1, 56–60.
Goldstone, R. L. (1996). Isolated and interrelated concepts. Memory & Cognition, 24,
608–628.
Gouzoulis-Mayfrank, E., Voss, T., Mörth, D., Thelen, B., Spitzer, M., & Meincke, U.
(2003). Semantic hyperpriming in thought-disordered patients with schizophrenia: State
or trait?—a longitudinal investigation. Schizophrenia Research, 65(2–3), 65–73.
Griffiths, T. L. (2015). Manifesto for a new (computational) cognitive revolution. Cognition,
135, 21–23.
Griffiths, T. L., Steyvers, M., & Firl, A. (2007). Google and the mind. Psychological Science,
18, 1069–1076.
Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation.
Psychological Review, 114, 211–244.
Gupta, N., Jang, Y., Mednick, S. C., & Huber, D. E. (2012). The road not taken:
Creative solutions require avoidance of high-frequency responses. Psychological Science,
23, 288–294.
Hahn, L. W. (2008). Overcoming the limitations of single response free associations. Electronic
Journal of Integrative Biosciences, 5, 25–36.
Hampton, J. A. (1981). An investigation of the nature of abstract concepts. Memory &
Cognition, 9, 149–156.
Heyman, T., VanRensbergen, B., Storms, G., Hutchison, K. A., & De Deyne, S. (2015). The
influence of working memory load on semantic priming. Journal of Experimental Psychology:
Learning, Memory, and Cognition, 41(3), 911–920.
Hills, T. T., Maouene, M., Maouene, J., Sheya, A., & Smith, L. (2009). Longitudinal analysis
of early semantic networks: Preferential attachment or preferential acquisition? Psychological
Science, 20(6), 729–739.
Holshausen, K., Harvey, P. D., Elvevag, B., Foltz, P. W., & Bowie, C. R. (2014). Latent
semantic variables are associated with formal thought disorder and adaptive behavior in
older inpatients with schizophrenia. Cortex, 55, 88–96.
Hutchison, K. A. (2003). Is semantic priming due to association strength or feature overlap?
Psychonomic Bulletin and Review, 10, 785–813.
Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T., & Ueda, N. (2006). Learning
systems of concepts with an infinite relational model. In Proceedings of the 21st National
Conference on Artificial Intelligence.
Kenett, D. Y., Jianxi, G., Huang, X., Shao, S., Vodenska, I., Buldyrev, S. V., . . . Havlin,
S. (2014). Network of interdependent networks: Overview of theory and applications.
In Gregorio D’Agostino, Antonio Scala (Eds), Networks of networks: The last frontier of
complexity (pp. 3–36). Cham: Springer.
Kenett, Y. N., Anaki, D., & Faust, M. (2014). Investigating the structure of semantic networks
in low and high creative persons. Frontiers in Human Neuroscience, 8(407), 1–16.
Kenett, Y. N., Anaki, D., & Faust, M. (2015). The semantic distance task: Quantifying semantic
distance with semantic network path length. Manuscript under review.
Kenett, Y. N., Gold, R., & Faust, M. (2016). The hyper-modular associative mind:
A computational analysis of associative responses of persons with Asperger syndrome.
Language and Speech, 59(3), 297–317. DOI: 10.1177/0023830915589397.
Kenett, Y. N., Kenett, D. Y., Ben-Jacob, E., & Faust, M. (2011). Global and local features of
semantic networks: Evidence from the Hebrew mental lexicon. PLoS One, 6(8), e23912.
Kenett, Y. N., Wechsler-Kashi, D., Kenett, D. Y., Schwartz, R. G., Ben-Jacob, E., & Faust,
M. (2013). Semantic organization in children with cochlear implants: Computational
analysis of verbal fluency. Frontiers in Psychology, 4 (543), 1–11.
Kiss, G. R. (1968). Words, associations, and networks. Journal of Verbal Learning and Verbal
Behavior, 7, 707–713.
Kroll, J. F., & Merves, J. S. (1986). Lexical access for concrete and abstract words. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 12(1), 92.
Lancichinetti, A., Radicchi, F., Ramasco, J. J., & Fortunato, S. (2011). Finding statistically
significant communities in networks. PLoS One, 6(4), e18961.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s Problem: The latent semantic
analysis theory of acquisition, induction and representation of knowledge. Psychological
Review, 104, 211–240.
Lerner, A. J., Ogrocki, P. K., & Thomas, P. J. (2009). Network graph analysis of category
fluency testing. Cognitive and Behavioral Neurology, 22(1), 45–52.
Lerner, I., Bentin, S., & Shriki, O. (2012). Spreading activation in an attractor network
with latching dynamics: Automatic semantic priming revisited. Cognitive Science, 36(8),
1339–1382.
Lin, E. L., & Murphy, G. L. (2001). Thematic relations in adults’ concepts. Journal of
Experimental Psychology: General, 130(1), 3–28.
Louwerse, M., & Connell, L. (2011). A taste of words: Linguistic context and perceptual
simulation predict the modality of words. Cognitive Science, 35, 381–398.
Lucas, M. (2000). Semantic priming without association: A meta-analytic review. Psychonomic
Bulletin & Review, 7(4), 618–630.
McClelland, J. L., & Rogers, T. T. (2003). The Parallel Distributed Processing approach to
semantic cognition. Nature Reviews Neuroscience, 4, 310–322.
McEvoy, C. L., Nelson, D. L., & Komatsu, T. (1999). What is the connection between
true and false memories? The differential roles of interitem associations in recall and
recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25(5),
1177.
McNamara, T. P. (1992). Theories of priming: i. associative distance and lag. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 18(6), 1173.
McRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature
production norms for a large set of living and nonliving things. Behavior Research Methods,
37, 547–559.
McRae, K., Khalkhali, S., & Hare, M. (2011). Semantic and associative relations: Examining
a tenuous dichotomy. In V. F. Reyna, S. Chapman, M. Dougherty, & J. Confrey (Eds.),
The adolescent brain: Learning, reasoning, and decision making (pp. 39–66). Washington, DC,
US: American Psychological Association.
Medin, D. L., Lynch, E. B., & Solomon, K. O. (2000). Are there kinds of concepts? Annual
Review of Psychology, 51(1), 121–147.
Mednick, S. (1962). The associative basis of the creative process. Psychological Review,
69(3), 220.
Mollin, S. (2009). Combining corpus linguistics and psychological data on word
co-occurrence: Corpus collocates versus word associations. Corpus Linguistics and Linguistic
Theory, 5, 175–200.
Monaco, J. D., Abbott, L. F., & Kahana, M. J. (2007). Lexico-semantic structure and the
word-frequency effect in recognition memory. Learning & Memory, 14, 204–213.
Morais, A. S., Olsson, H., & Schooler, L. J. (2013). Mapping the structure of semantic
memory. Cognitive Science, 37(1), 125–145.
Moreno, S., & Neville, J. (2013). Network hypothesis testing using mixed Kronecker product
graph models. In The IEEE International Conference on Data Mining series (ICDM) (pp.
1163–1168). IEEE.
Mota, N. B., Vasconcelos, N. A., Lemos, N., Pieretti, A. C., Kinouchi, O., Cecchi, G. A.,
. . . Ribeiro, S. (2012). Speech graphs provide a quantitative measure of thought disorder
in psychosis. PLoS One, 7(4), e34928.
Nelson, D. L., & McEvoy, C. L. (2000). What is this thing called frequency? Memory &
Cognition, 28, 509–522.
Nelson, D. L., McEvoy, C. L., & Bajo, M. T. (1988). Lexical and semantic search in cued
recall, fragment completion, perceptual identification, and recognition. American Journal of
Psychology, 101(4), 465–480.
Nelson, D. L., McEvoy, C. L., & Dennis, S. (2000). What is free association and what does it
measure? Memory & Cognition, 28, 887–899.
Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004). The University of South Florida
free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments,
and Computers, 36, 402–407.
Nelson, D. L., McKinney, V. M., Gee, N. R., & Janczura, G. A. (1998). Interpreting the
influence of implicitly activated memories on recall and recognition. Psychological Review,
105, 299–324.
Nelson, D. L., Zhang, N., & McKinney, V. M. (2001). The ties that bind what is known to
the recognition of what is new. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 27, 1147–1159.
Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank citation ranking:
Bringing order to the web. Stanford InfoLab, Stanford, USA.
Pexman, P. M., Holyk, G. G., & Monfils, M. (2003). Number-of-features effects in semantic
processing. Memory & Cognition, 31, 842–855.
Plaut, D. C., McClelland, J. L., Seidenberg, M. S., & Patterson, K. (1996). Understanding
normal and impaired word reading: Computational principles in quasi-regular domains.
Psychological Review, 103, 56–115.
Pomarol-Clotet, E., Oh, T. M. S. S., Laws, K. R., & McKenna, P. J. (2008). Semantic
priming in schizophrenia: Systematic review and meta-analysis. The British Journal of
Psychiatry: The Journal of Mental Science, 192(2), 92–97.
Prior, A., & Bentin, S. (2003). Incidental formation of episodic associations: The importance
of sentential context. Memory & Cognition, 31, 306–316.
Radvansky, G. A. (1999). The fan effect: A tale of two theories. Journal of Experimental
Psychology: General, 128(2), 198–206.
Recchia, G., & Jones, M. N. (2009). More data trumps smarter algorithms: Comparing
pointwise mutual information with latent semantic analysis. Behavior Research Methods, 41,
647–656.
Recchia, G., & Jones, M. N. (2012). The semantic richness of abstract concepts. Frontiers in
Human Neuroscience, 41(3), 647–656.
Recchia, G., Sahlgren, M., Kanerva, P., Jones, M. N., & Jones, M. (2015). Encoding
sequential information in semantic space models: Comparing holographic reduced
representation and random permutation. Computational Intelligence and Neuroscience, 2015,
1–18.
Roediger, H. L., Balota, D. A., & Watson, J. M. (2001). Spreading activation and arousal of
false memories. In H. L. Roediger III, J. S. Nairne, I. Neath, and A. M. Surprenant (Eds)
The nature of remembering: Essays in honor of Robert G. Crowder (pp. 95–115). Washington,
DC: American Psychological Association.
Rosch, E., Mervis, C., Grey, W., Johnson, D., & Boyes-Braem, P. (1976). Basic objects in
natural categories. Cognitive Psychology, 8, 382–439.
Rossmann, E., & Fink, A. (2010). Do creative people use shorter association pathways?
Personality and Individual Differences, 49, 891–895.
Runco, M. A., & Jaeger, G. J. (2012). The standard definition of creativity. Creativity Research
Journal, 24(1), 92–96.
Schilling, M. A. (2005). A “small-world” network model of cognitive insight. Creativity
Research Journal, 17(2–3), 131–154.
Schwanenflugel, P. J., & Shoben, E. J. (1983). Differential context effects in the
comprehension of abstract and concrete verbal materials. Journal of Experimental Psychology:
Learning, Memory, and Cognition, 9, 82–102.
Smith, K. A., Huber, D. E., & Vul, E. (2013). Multiply-constrained semantic search in the
remote associates test. Cognition, 128(1), 64–75.
Solé, R. V., Corominas-Murtra, B., Valverde, S., & Steels, L. (2010). Language networks:
Their structure, function, and evolution. Complexity, 15(6), 20–26.
Spitzer, M. (1997). A cognitive neuroscience view of schizophrenic thought disorder.
Schizophrenia Bulletin, 23(1), 29–50.
Stam, C. J. (2014). Modern network science of neurological disorders. Nature Reviews
Neuroscience, 15(10), 683–695.
Steyvers, M., & Tenenbaum, J. B. (2005). The large-scale structure of semantic networks:
Statistical analyses and a model of semantic growth. Cognitive Science, 29, 41–78.
Szalay, L. B., & Deese, J. (1978). Subjective meaning and culture: An assessment through word
associations. Hillsdale, NJ: Lawrence Erlbaum.
Thompson, G. W., & Kello, C. T. (2014). Walking across wikipedia: A scale-free network
model of semantic memory retrieval. Frontiers in Psychology, 5, 1–9.
Thompson-Schill, S. L., Kurtz, K. J., & Gabrieli, J. D. E. (1998). Effects of semantic
and associative relatedness on automatic priming. Journal of Memory and Language, 38(4),
440–458.
Van Dongen, S. (2000). Graph clustering by flow simulation. Doctoral dissertation, University
of Utrecht.
Verheyen, S., Stukken, L., De Deyne, S., Dry, M. J., & Storms, G. (2011). The generalized
polymorphous concept account of graded structure in abstract categories. Memory &
Cognition, 39, 1117–1132.
Voorspoels, W., Storms, G., Longenecker, J., Verheyen, S., Weinberger, D. R., & Elvevåg,
B. (2014). Deriving semantic structure from category fluency: Clustering techniques and
their pitfalls. Cortex, 55, 130–147.
Watts, D. J. & Strogatz, S. H. (1998). Collective dynamics of ‘small-world’ networks. Nature,
393, 440–442. doi:10.1038/30918.
Waxman, S., & Gelman, R. (1986). Preschoolers’ use of superordinate relations in
classification and language. Cognitive Development, 1(2), 139–156.
Wiemer-Hastings, K., & Xu, X. (2005). Content differences for abstract and concrete
concepts. Cognitive Science, 29, 719–736.
Zeev-Wolf, M., Goldstein, A., Levkovitz, Y., & Faust, M. (2014). Fine-coarse
semantic processing in schizophrenia: A reversed pattern of hemispheric dominance.
Neuropsychologia, 56, 119–128.
Zortea, M., Menegola, B., Villavicencio, A., & Salles, J. F. d. (2014). Graph analysis of
semantic word association among children, adults, and the elderly. Psicologia: Reflexão e
Crítica, 27(1), 90–99.
9
INDIVIDUAL DIFFERENCES IN
SEMANTIC PRIMING PERFORMANCE
Insights from the Semantic Priming Project

Melvin J. Yap,
Keith A. Hutchison,
and Luuan Chin Tan

Abstract
The semantic/associative priming effect refers to the finding of faster recognition times for target words
preceded by related primes (e.g. cat—DOG), compared to target words preceded by unrelated
primes (e.g. hat—DOG). Over the past three decades, a voluminous literature has explored
the influence of semantic primes on word recognition, and this work has been critical in
shaping our understanding of lexical processing, semantic representations, and automatic
versus attentional influences. That said, the bulk of the empirical work in the semantic
priming literature has focused on group-level performance that averages across participants,
despite compelling evidence that individual differences in reading skill and attentional
control can moderate semantic priming performance in systematic and interesting ways.
The present study takes advantage of the power of the semantic priming project (SPP;
Hutchison et al., 2013) to answer two broad, related questions. First, how stable are semantic
priming effects, as reflected by within-session reliability (assessed by split-half correlations)
and between-session reliability (assessed by test–retest correlations)? Second, assuming that
priming effects are reliable, how do they interact with theoretically important constructs such
as reading ability and attentional control? Our analyses replicate and extend earlier work by
Stolz, Besner, and Carr (2005) by demonstrating that the reliability of semantic priming
effects strongly depends on prime-target association strength, and reveal that individuals
with greater attentional control and reading ability show stronger priming.

Words preceded by a semantically related word (e.g. cat—DOG) are recognized
faster than when preceded by a semantically unrelated word (e.g. hat—DOG)
(Meyer & Schvaneveldt, 1971). This deceptively simple phenomenon, which has
been termed the semantic priming effect, is one of the most important observations
in cognitive science, and has profoundly shaped our understanding of word
recognition processes, the nature of the semantic system, and the distinction
between automatic and controlled processes (for excellent reviews, see McNamara,
2005; Neely, 1991). Priming phenomena are too complex to be explained
by a single process, and various theoretical mechanisms, varying from automatic
(i.e. conscious awareness not required) to strategic (i.e. controlled and adaptively
modulated by task context), have been invoked to explain priming (Neely, 1991).
Priming mechanisms can also operate prospectively (i.e. before a target is presented)
or retrospectively (i.e. after a target is presented).
The most well-known prospective mechanism is automatic spreading activation
(Posner & Snyder, 1975). That is, a prime (e.g. cat) automatically pre-activates
related nodes (e.g. DOG) via associative/semantic pathways, facilitating identi-
fication of these words when they are subsequently presented. Priming effects
also seem to reflect controlled processes such as expectancy and semantic matching.
Expectancy operates prospectively and refers to the intentional generation of
candidates for the to-be-presented target (Becker, 1980), whereas semantic
matching reflects a retrospective process that searches for a relation from the target
to the prime (Neely, Keefe, & Ross, 1989). The presence of a prime-target
relationship is diagnostic of the target’s lexical status, because non-words cannot
be related to their primes.
In addition to spreading activation, expectancy, and semantic matching,
researchers have also proposed hybrid mechanisms that combine automatic and
strategic aspects. For example, Bodner & Masson’s (1997) memory-recruitment
account of priming, which is itself based on Whittlesea and Jacoby’s (1990) retrieval
account of priming, is predicated on the idea that the processing of the prime
establishes an episodic memory trace that can then be used to facilitate processing
of the upcoming target. Importantly, the extent to which the system relies on the
episodic trace of the prime is dependent on the prime’s task relevance (Anderson &
Milson, 1989). Interestingly, memory recruitment processes are postulated to
operate even when primes are presented too briefly to be consciously processed
(but see Kinoshita, Forster, & Mozer, 2008; Kinoshita, Mozer, & Forster, 2011,
for an alternative account of Bodner & Masson’s, 1997, findings). Of course, we
should clarify that the foregoing mechanisms are not mutually exclusive and very
likely operate collectively to produce priming.
For the most part, the semantic priming literature has focused on group-level
data, which are collapsed across participants. Likewise, most models of semantic
priming have not taken into account the systematic individual differences that exist
among skilled readers (but see Plaut & Booth, 2000). However, this pervasive
emphasis on the characterization of a “prototypical” reader is difficult to reconcile
with mounting evidence that readers vary on dimensions which moderate word
recognition performance (see Andrews, 2012, for a review). Are individual
differences in the magnitude of a participant’s semantic priming effect related
to individual differences such as reading comprehension ability, vocabulary size,
and attentional control? Closely connected to this question is whether semantic
priming effects are reliable. More specifically, to what extent is the magnitude
of an individual’s semantic priming effect stable within and across experimental
sessions? Should it turn out that semantic priming is inherently unreliable (e.g.
Stolz, Besner, & Carr, 2005), this will severely limit the degree to which
the magnitude of an individual’s semantic priming effect can be expected to
correlate with other measures of interest (Lowe & Rabbitt, 1998). Related to
this, an unreliable dependent measure makes it harder for researchers to detect
between-group differences on this measure (Waechter, Stolz, & Besner, 2010).
The present study capitalizes on the power of the SPP (Hutchison et al., 2013)
to address the above-mentioned questions. The SPP is a freely accessible online
repository (http://spp.montana.edu) containing lexical and associative/semantic
characteristics for 1,661 words, along with lexical decision (i.e. classify letter strings
as words or non-words, e.g. flirp) and speeded pronunciation (i.e. read words aloud)
behavioral data of 768 participants from four testing universities (512 in lexical
decision and 256 in speeded pronunciation). Data were collected over two sessions,
separated by no more than one week. Importantly, Hutchison et al. (2013) also
assessed participants on their attentional control (Hutchison, 2007), vocabulary
knowledge, and reading comprehension. It is noteworthy that the SPP contains
data for over 800,000 lexical decision trials and over 400,000 pronunciation trials,
collected from a large and diverse sample of participants, making this a uniquely
valuable resource for studying individual differences in primed lexical processing.
The SPP exemplifies the megastudy approach to studying lexical processing,
in which researchers address a variety of questions using databases that contain
behavioral data and lexical characteristics for very large sets of words (see Balota,
Yap, Hutchison, & Cortese, 2012, for a review). Megastudies allow the language
to define the stimuli, rather than compelling experimenters to select stimuli
based on a limited set of criteria. They now serve as an important complement
to traditional factorial designs, which are associated with selection artifacts, list
context effects, and limited generalizability (Balota et al., 2012; Hutchison
et al., 2013). For example, in the domain of semantic priming, it is often
methodologically challenging to examine differences in priming as a function of
some other categorical variable (e.g. target word frequency), due to the difficulty
of matching experimental conditions on the many dimensions known to influence
word recognition. Using megastudy data, regression analyses can be conducted
to examine the effects of item characteristics on priming, with other correlated
variables statistically controlled for (e.g. Hutchison, Balota, Cortese, & Watson,
2008).
We will be using the SPP to address two broad and related questions. First,
how stable are semantic priming effects, as reflected by within-session reliability
(assessed by split-half correlations) and between-session reliability (assessed by
test–retest correlations)? Second, assuming that priming effects are stable, how
are they moderated by theoretically important constructs such as reading ability
and attentional control? While these questions are not precisely novel, they have
received relatively little attention in the literature. To our knowledge, the present
study is the first attempt to answer these intertwined questions in a unified and
comprehensive manner, using an unusually large and well-characterized set of
items and participants. We will first provide a selective review of studies that
have explored individual differences in semantic priming, before turning to an
important study by Stolz et al. (2005), who were the first to explore the reliability
of semantic priming effects in word recognition.

Individual Differences in Semantic Priming


The great majority of the word recognition literature has been dominated by
what Andrews (2012) calls the uniformity assumption. Specifically, there seems to
be an implicit assumption that the architecture of the lexical processing system is
relatively invariant across skilled readers, and this assumption is reflected in the
field’s reliance on measures of group-level performance which aggregate across
relatively small samples (typically 25–35) of participants. However, even within a
sample of college students, who are already selected for their reading and writing
ability, there remains substantial variability in psychometric measures of reading
and spelling, and in experimental measures of word recognition performance
(Andrews & Hersch, 2010).
There is considerable evidence against the uniformity hypothesis, predomi-
nantly from word recognition studies where participants identify words presented
in isolation (i.e. words are not preceded by a prime). For example, two important
aspects of reading ability are vocabulary knowledge (i.e. knowledge of word forms
and meaning) and print exposure (i.e. amount of text read). In both isolated
lexical decision and speeded pronunciation, there is evidence that participants
high on vocabulary knowledge and print exposure recognize words faster and
more accurately, and are generally less influenced by lexical characteristics such
as word frequency, word length (i.e. number of letters), and number of orthographic
neighbors (i.e. words obtained by substituting a single letter in the target word,
e.g. sand’s neighbors include band and sang) (Butler & Hains, 1979; Chateau
& Jared, 2000; Lewellen, Goldinger, Pisoni, & Greene, 1993). However,
because processing time is positively correlated with the size of experimental
effects (Faust, Balota, Spieler, & Ferraro, 1999), it is possible that skilled lexical
processors are showing smaller effects of stimulus characteristics simply because
they are responding faster. Yap, Balota, Sibley, and Ratcliff (2012), in their
large-scale analysis of data from the English Lexicon Project (Balota et al., 2007;
http://elexicon.wustl.edu), controlled for variability in participants’ processing
speed by using z-score standardized response times (RTs), and still found that
participants with more vocabulary knowledge produced smaller effects of lexical
variables. These findings are consistent with the general perspective that as readers
acquire more experience with written language, they become more reliant on
relatively automatic lexical processing mechanisms, and are consequently less
influenced by word characteristics (Yap et al., 2012).
To a lesser degree, researchers have also used priming paradigms to explore the
influence of individual differences on word recognition. In the typical priming
experiment, two letter strings (which are related in some manner) are presented
to the participants, with the first letter string serving as the prime and the
second as the target. Strings might be morphologically (touching—TOUCH),
orthographically (couch—TOUCH), phonologically (much—TOUCH), or seman-
tically (feel—TOUCH) related; researchers also have the option of masking the
prime (i.e. presenting it too briefly to be consciously processed) to shed light on
early lexical processes and to minimize strategic effects (Forster, 1998). A small
number of studies have examined how priming effects are moderated by individual
differences. For example, Yap, Tse, and Balota (2009) examined the joint effects
of semantic relatedness (related versus unrelated prime) and target frequency (high
versus low) on lexical decision performance, and found that these joint effects
were moderated by the vocabulary knowledge of the participants. Specifically,
participants with more vocabulary knowledge produced additive effects of priming
and frequency (i.e. equal priming for low and high-frequency words), whereas
participants with less vocabulary knowledge produced an overadditive interaction
(i.e. stronger priming for low, compared to high-frequency words).
These results are consistent with the idea that for high vocabulary knowledge
participants, both high and low-frequency words are associated with strong,
fully specified lexical representations which can be fluently accessed. Analyses
of RT distributions revealed that for such participants, semantic priming largely
reflects a relatively modular “head-start” mechanism, whereby primes prospectively
pre-activate related high and low-frequency targets to a similar extent through
spreading activation or expectancy-based processes, thereby speeding up lexical
access by some constant amount of time (Yap et al., 2009). In contrast, participants
with less vocabulary knowledge possess weaker lexical representations, particularly
for low-frequency words. For these participants, head-start mechanisms still
contribute to priming, but there is increased retrospective reliance on prime
information for low-frequency words, which yields more priming for such words.
In addition to vocabulary knowledge, there is intriguing evidence that semantic
priming mechanisms are modulated by individual differences in attentional control
(AC). Broadly speaking, AC refers to the capacity to coordinate attention
and memory so as to optimize task performance by enhancing task-relevant
information (Hutchison, 2007), especially when the environment is activating
conflicting information and prepotent responses (Shipstead, Lindsey, Marshall, &
Engle, 2014). A number of studies have demonstrated that when the stimulus
onset asynchrony (SOA) between the prime and target is sufficiently long for
participants to form expectancies, an increase in relatedness proportion (i.e. the
proportion of word prime-word target pairs that are related) leads to an increase
in priming (see Hutchison, 2007, for a review); this is known as the relatedness
proportion effect. By definition, increasing relatedness proportion increases the
validity of the prime, leading participants to rely more heavily on expectancy-based
processes, which in turn increases priming. Hutchison (2007) demonstrated that
participants who were associated with greater AC (as reflected by performance
on a battery of attentionally demanding tasks) produced a larger relatedness
proportion effect, suggesting that high-AC participants were sensitive to contextual
variation in relatedness proportion, and were adaptively increasing their reliance
on expectancy generation processes as prime validity increased. Related to this,
there is also recent evidence, based on the use of symmetrically associated (e.g.
sister—BROTHER), forward associated (i.e. prime to target; e.g. atom—BOMB),
and backward associated (i.e. target to prime; e.g. fire—BLAZE) prime-target pairs
that high-AC individuals are more likely to prospectively generate expected targets
and hold them in working memory, whereas low-AC individuals are more likely
to engage retrospective semantic matching processes (Hutchison, Heap, Neely, &
Thomas, 2014).
In summary, the individual differences literature is consistent with the idea
that as readers accrue more experience with words, they rely more on automatic
lexical processing mechanisms in both isolated (Yap et al., 2012) and primed
(Yap et al., 2009) word recognition. This is consistent with the lexical quality
hypothesis (Perfetti & Hart, 2001), which says that highly skilled readers are
associated with high-quality lexical representations which are both fully specified
and redundant. For these skilled readers, the process of identifying a word involves
the precise activation of the corresponding underlying lexical representation, with
minimal activation of orthographically similar words (Andrews & Hersch, 2010).
Furthermore, such readers are less dependent on the strategic use of context (e.g.
prime information) to facilitate lexical retrieval (Yap et al., 2009). In addition to
high-quality representations, lexical processing is also modulated by the extent to
which readers can exert attentional control; high-AC readers are able to flexibly
adjust their reliance on priming mechanisms so as to maximize performance on a
given task (see Balota & Yap, 2006).

Is Semantic Priming Reliable?


Reliability, which is typically assessed in the domain of psychological testing, is a
fundamental psychometric property that reflects the stability or consistency of a
measure. The consistency of a measure can be evaluated across time (test–retest
reliability) and across items within a test (split-half reliability). Establishing
reliability is a critical prerequisite in the development of paper-and-pencil measures
of intelligence, aptitude, personality, interests, and attitudes, but relatively little
attention has been paid to the reliability of RT measures (Goodwin, 2009).
Without first establishing reliability, one cannot tell if variability on a measure
reflects meaningful individual differences or measurement noise. In order to address
this, Yap et al. (2012), using trial-level data from the English Lexicon Project
(Balota et al., 2007), examined the stability of word recognition performance
in isolated lexical decision and speeded pronunciation. They found reassuringly
high within and between-session correlations in the means and standard deviations
of RTs, as well as RT distributional characteristics across distinct sets of items,
suggesting that participants are associated with a relatively stable processing profile
that extends beyond simple mean RT (see also Yap, Sibley, Balota, Ratcliff, &
Rueckl, 2015). There was also reliability in participants’ sensitivity to underlying
lexical dimensions. For example, participants who produced large word frequency
effects in session 1 also tended to produce large frequency effects in session 2.
However, while the findings above lend support to the idea that participants
respond to the influence of item characteristics in a reliable manner, it is less clear
if the influence of item context is equally reliable. In other words, are semantic
priming effects stable in the way word frequency or word length effects are?
The answer to this question is surprisingly inconclusive. In semantic priming
studies, there is tremendous variability in the magnitude of priming produced
across different participants (Stolz et al., 2005). In order to determine if these
individual differences reflect systematic or random processes, Stolz and colleagues
(2005) examined the split-half and test–retest reliability of semantic priming
across different experimental conditions where relatedness proportion and SOA
were factorially manipulated. Surprisingly, they found that reliability was zero
under certain conditions (e.g. short SOA, low relatedness proportion), which
maximized the impact of automatic priming mechanisms and became statistically
significant only under conditions (e.g. long SOA, high relatedness proportion)
which made strategically mediated priming more likely. According to Stolz et al.
(2005), these results point to a semantic system whose activity is “inherently
noisy and uncoordinated” (p. 328) when priming primarily reflects automatic
spreading activation, with performance becoming more coherent when task
demands increase the controlled influence of mechanisms such as expectancy.
Such an observation places major constraints on researchers who intend to use
semantic priming as a tool to study individual differences in domains such as
personality (e.g. Matthews & Harley, 1993) or psychopathology (e.g. Morgan,
Bedford, & Rossell, 2006), or who are interested in exploring how semantic
priming might be moderated by variables such as attentional control (Hutchison,
2007), vocabulary knowledge (Yap et al., 2009), and item characteristics
(Hutchison et al., 2008).

The Present Study


As mentioned earlier, we intend to leverage the power of the SPP (Hutchison
et al., 2013) to address the related questions of reliability and individual differences
in semantic priming. Stolz et al. (2005) were the first to report the unexpectedly
low reliabilities associated with the semantic priming effect. In Stolz et al. (2005),
the number of participants in each experiment ranged between 48 and 96,
priming effects were based on 25 related and 25 unrelated trials, and the two test
sessions were operationalized by presenting two blocks of trials within the same
experiment. Given the theoretical and applied importance of Stolz et al.’s (2005)
findings, it is worthwhile exploring if they generalize to the SPP sample, which
contains a much larger number of participants and items. Specifically, in the SPP,
512 participants contributed to the lexical decision data, priming effects are based
on at least 100 related and 100 unrelated trials, and the two test sessions were held
on different days, separated by no more than one week. The SPP was also designed
to study priming under different levels of SOA (200 ms versus 1200 ms) and prime
type (first associate versus other associate), allowing us to assess reliability across
varied priming contexts.
Of course, for our purposes, reliability is largely a means to an end. The other
major aspect of the present work concerns the extent to which a participant’s
semantic priming effect is predicted by theoretically important measures of
individual differences, including vocabulary knowledge, reading comprehension,
and attentional control. Although there has been some work relating priming to
vocabulary knowledge (Yap et al., 2009) and to attentional control (Hutchison,
2007), there has not been, to our knowledge, a systematic exploration of the
relationship between semantic priming effects and a comprehensive array of
individual differences measures. The present study will be the first to address
that gap, by considering how priming effects that are assessed to be reliable are
moderated by a host of theoretically important variables. Collectively, the results
of these analyses will help shed more light on the relationships between the quality
of underlying lexical representations, attentional control, and priming phenomena.
Our findings could also be potentially informative for more foundational questions
pertaining to how changes in reading ability are associated with developments in
the semantic system (e.g. Nation & Snowling, 1999).

Method
Dataset
All analyses reported in this chapter are based on archival trial-level lexical
decision data from the semantic priming project (see Hutchison et al., 2013,
for a full description of the dataset). The 512 participants were native English
speakers recruited from four institutions (both private and public) located across
the midwest, northeast, and northwest regions of the United States. Data were
collected over two sessions on different days, separated by no more than one week.
Across both sessions, participants received a total of 1,661 lexical decision trials
(half words and half non-words), with word prime–word target pairs selected from
the Nelson, McEvoy, and Schreiber (2004) free association norms; the relatedness
proportion was fixed at 0.50. Non-words were created from word targets by
changing one or two letters of each word to form pronounceable non-words
that did not sound like real words. For each participant, each session comprised
two blocks (a 200 ms and a 1200 ms SOA block), and within each block, half the
related prime-target pairs featured first associates (i.e. the target is the most common
response to a cue/prime word, e.g. choose—PICK) while the other half featured
other associates (i.e. any response other than the most common response to a cue,
e.g. preference—PICK).
Additional demographic information collected included performance on the
vocabulary and passage comprehension subtests of the Woodcock–Johnson III
diagnostic reading battery (Woodcock, McGrew, & Mather, 2001) and on
Hutchison’s (2007) attentional control battery. The vocabulary measures include
a synonym, antonym, and an analogy test; for reading comprehension, participants
have to read a short passage and identify a missing keyword that makes sense
in the context of that passage. The attentional control battery consists of an
automated operational span (Unsworth, Heitz, Schrock, & Engle, 2005), Stroop,
and antisaccade task (Payne, 2005). In the operational span task, participants have
to learn and correctly recall letter sequences while solving arithmetic problems. In
the Stroop task, participants are presented with incongruent (e.g. red printed
in green), congruent (e.g. green printed in green), and neutral (e.g. deep printed
in green) words and are required to name the ink color of the word as quickly
and accurately as possible; the dependent variable is the difference in the mean
RT or error rate between the congruent and incongruent conditions. In the
antisaccade task, participants are instructed to look away from a flashed star (*)
in order to identify a target (O or Q) that is briefly presented on the other
side of the screen; the dependent variable is the target identification accuracy
rate.
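
To make these dependent measures concrete, the following Python sketch shows one
plausible way to derive the Stroop and antisaccade scores from trial-level data. It is a
minimal sketch, not the SPP scoring code; the data frames and column names
(stroop_trials, antisaccade_trials, participant, condition, rt, correct) are illustrative
assumptions.

```python
import pandas as pd

def stroop_interference(stroop_trials):
    # Per-participant Stroop effect: incongruent minus congruent mean RT
    # (an analogous difference can be computed on error rates).
    means = (stroop_trials
             .groupby(['participant', 'condition'])['rt']
             .mean()
             .unstack('condition'))
    return means['incongruent'] - means['congruent']

def antisaccade_accuracy(antisaccade_trials):
    # Per-participant proportion of correctly identified targets (O vs. Q).
    return antisaccade_trials.groupby('participant')['correct'].mean()
```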

Results
We first excluded 14 participants whose datasets were incomplete (i.e. their data
contained fewer than 1,661 trials); this left 498 participants. We then excluded
incorrect trials and trials with response latencies faster than 200 ms or slower than
3000 ms. For the remaining correct trials, RTs more than 2.5 SDs away from each
participant’s mean were also treated as outliers. For the RT analyses, data trimming
removed 7.3 percent (4.6 percent errors; 2.7 percent RT outliers). For all analyses,
we used z-score transformed RTs, which serve to control for individual differences
in processing speed (Faust et al., 1999) and to eliminate much of the variability in
priming for items (Hutchison et al., 2008). Z-scores were computed separately for
each participant.
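
As a concrete illustration of the trimming and standardization steps just described, here
is a minimal pandas sketch. The data frame trials and its columns (participant, rt in
milliseconds, accuracy coded 1/0) are assumed names for illustration, not the actual SPP
field names.

```python
import pandas as pd

def trim_and_standardize(trials):
    # Drop participants with incomplete datasets (fewer than 1,661 trials).
    n_trials = trials.groupby('participant')['rt'].transform('size')
    trials = trials[n_trials >= 1661]

    # Keep correct trials with response latencies between 200 ms and 3000 ms.
    correct = trials[(trials['accuracy'] == 1) &
                     (trials['rt'] >= 200) & (trials['rt'] <= 3000)].copy()

    # Remove RTs more than 2.5 SDs away from each participant's mean.
    by_p = correct.groupby('participant')['rt']
    deviation = (correct['rt'] - by_p.transform('mean')) / by_p.transform('std')
    correct = correct[deviation.abs() <= 2.5].copy()

    # z-score RTs separately for each participant to control for processing speed.
    by_p = correct.groupby('participant')['rt']
    correct['z_rt'] = (correct['rt'] - by_p.transform('mean')) / by_p.transform('std')
    return correct
```
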
Analysis 1: Reliability of Semantic Priming


Trials for each participant were first partitioned into session 1 (S1) trials,
session 2 (S2) trials, odd-numbered trials, and even-numbered trials; the trial
number denotes the order in which the trials were presented to the participant.
For each participant, we then computed the z-score priming effect (mean
unrelated z-score RT—mean related z-score RT) for all trials, S1 trials, S2
trials, odd-numbered trials, and even-numbered trials, as a function of SOA
and prime type. Distinguishing between S1 and S2 trials, and between odd and
even-numbered trials, are prerequisites for computing test–retest and split-half
reliability, respectively. Table 9.1 presents the means and standard deviations of
priming effects by experimental condition and trial type (all trials, odd-numbered
trials, even-numbered trials, S1 trials, S2 trials).
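
Expressed as a formula, a participant's priming effect in a given cell is simply the mean
z-scored RT for unrelated trials minus the mean z-scored RT for related trials. A pandas
sketch of this computation, building on the hypothetical z_rt column above and assuming
columns soa, prime_type, relatedness, session, and trial_number, might look as follows:

```python
import pandas as pd

def priming_effects(correct, mask=None):
    # Per-participant z-score priming effect (unrelated minus related),
    # computed separately for each SOA x prime-type cell.
    data = correct if mask is None else correct[mask]
    cell_means = (data
                  .groupby(['participant', 'soa', 'prime_type', 'relatedness'])['z_rt']
                  .mean()
                  .unstack('relatedness'))
    return cell_means['unrelated'] - cell_means['related']

prime_all  = priming_effects(correct)                                    # all trials
prime_s1   = priming_effects(correct, correct['session'] == 1)           # session 1
prime_s2   = priming_effects(correct, correct['session'] == 2)           # session 2
prime_odd  = priming_effects(correct, correct['trial_number'] % 2 == 1)  # odd trials
prime_even = priming_effects(correct, correct['trial_number'] % 2 == 0)  # even trials
```
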
Subjecting the overall priming effects to a 2 (SOA) × 2 (prime type) repeated
measures analysis of variance revealed main effects of SOA, F(1, 497) = 30.00,
p < 0.001, MSE = 0.012, ηp² = 0.06, and Prime Type, F(1, 497) = 70.18,
p < 0.001, MSE = 0.012, ηp² = 0.12; the SOA × Prime Type interaction was
not significant, F < 1. Unsurprisingly, priming effects were larger when first
associates (M = 0.12) were presented as targets, compared to when other associates
(M = 0.08) were used. Less expectedly, priming effects were slightly larger at

TABLE 9.1 Means and standard deviations of z-score transformed priming effects as a
function of experimental condition and trial type.

Lexical decision (N = 498)
                                 Overall   Odd    Even   Session 1   Session 2
Short SOA first associate   M     0.14     0.15   0.13     0.14        0.13
                            SD    0.13     0.17   0.17     0.17        0.16
Short SOA other associate   M     0.09     0.10   0.09     0.10        0.08
                            SD    0.11     0.16   0.15     0.15        0.15
Long SOA first associate    M     0.11     0.12   0.10     0.12        0.10
                            SD    0.13     0.17   0.17     0.16        0.17
Long SOA other associate    M     0.07     0.07   0.07     0.07        0.06
                            SD    0.12     0.16   0.16     0.16        0.16
All conditions              M     0.10     0.11   0.10     0.11        0.09
                            SD    0.08     0.09   0.10     0.10        0.10

the 200 ms SOA (M = 0.11) than at the 1200 ms SOA (M = 0.09). The
greater priming for first-associate trials is not surprising. However, priming effects
are generally larger (not smaller) at longer SOAs (Neely, 1977; but see Neely,
O’Connor, & Calabrese, 2010). We will comment on this intriguing pattern in
the Discussion.
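
For readers who wish to reproduce this kind of analysis, the 2 (SOA) × 2 (prime type)
repeated measures analysis of variance can be run with statsmodels' AnovaRM. The sketch
below assumes that the per-participant priming effects from the earlier sketch are available
as prime_all; it illustrates the analysis rather than reproducing the exact SPP code.

```python
from statsmodels.stats.anova import AnovaRM

# One priming effect per participant x SOA x prime-type cell, in long format.
long_df = prime_all.rename('priming').reset_index()

anova = AnovaRM(data=long_df,
                depvar='priming',
                subject='participant',
                within=['soa', 'prime_type']).fit()
print(anova)  # F ratios, degrees of freedom, and p-values for the two main
              # effects and the SOA x prime-type interaction
```
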
It is also worth noting that the priming effects in the SPP are somewhat smaller
than what one would expect, using studies such as Hutchison et al. (2008) as a
reference point. As Hutchison et al. (2013) have already acknowledged, it is not
entirely clear why this difference exists, but they suggest that this may be due to the
fact that related trials in semantic priming experiments (e.g. Hutchison et al., 2008)
typically predominantly feature very strong associates, whereas the SPP stimuli are
far more diverse with respect to semantic and associative relations.
Turning to the reliability analyses, Table 9.2 presents the Pearson correlations
between odd and even-numbered trials (split-half reliability), and between S1
and S2 trials (test–retest reliability), for participant-level priming effects. Like
Stolz et al. (2005), we are examining correlations between responses to distinct
sets of prime-target pairs, but the counterbalancing procedure ensures that the
descriptive statistics of different variables are relatively similar across different
sub-lists. For lexical decision, with respect to within-session reliability (reflected
by split-half reliability), we observed moderate correlations (rs from 0.21 to 0.27)
for first-associate trials, and very low correlations (rs from 0.07 to 0.08) for
other-associate trials. Turning to between-session reliability (reflected by test–retest
reliability), correlations were moderate (rs from 0.25 to 0.31) for first-associate
trials, and very low (rs from 0.07 to 0.11) for other-associate trials.
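
Operationally, each reliability estimate is a Pearson correlation across participants
between two priming effects computed from distinct sets of trials. A minimal sketch for
one cell, reusing the hypothetical priming series defined earlier (the 'short'/'first' labels
stand in for the actual condition codes):

```python
import pandas as pd
from scipy.stats import pearsonr

def priming_reliability(effects_a, effects_b, soa, prime_type):
    # Correlate two sets of per-participant priming effects for a single
    # SOA x prime-type cell (e.g. odd vs. even trials, or session 1 vs. 2).
    a = effects_a.xs((soa, prime_type), level=['soa', 'prime_type'])
    b = effects_b.xs((soa, prime_type), level=['soa', 'prime_type'])
    paired = pd.concat([a, b], axis=1, join='inner').dropna()
    return pearsonr(paired.iloc[:, 0], paired.iloc[:, 1])

# Split-half and test-retest reliability for short-SOA first-associate priming.
r_split, p_split = priming_reliability(prime_odd, prime_even, 'short', 'first')
r_retest, p_retest = priming_reliability(prime_s1, prime_s2, 'short', 'first')
```
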
Clearly, the reliabilities of semantic priming for first-associate trials are not
only statistically significant, but are consistently higher than for counterpart
other-associate trials (all ps < 0.05), whose reliabilities approach non-significance.1
It is also reassuring that our estimates (for first-associate trials) fall broadly within

TABLE 9.2 Correlations between session 1 and session 2 participant-level priming effects,
and odd and even-numbered trial participant-level priming effects.

Lexical decision (N = 498)
               Short SOA                           Long SOA
           First associate   Other associate   First associate   Other associate
Odd–even      0.269***          0.084†            0.207***          0.069
S1–S2         0.308***          0.107*            0.246***          0.070

*** p < 0.001; * p < 0.05; † p < 0.10


the range of test–retest reliabilities reported by Stolz et al. (2005) for their
conditions where the relatedness proportion was 0.50. Specifically, for the SOAs
of 200 ms, 350 ms, and 800 ms, they found test–retest correlations of 0.30, 0.43,
and 0.27, respectively.

Analysis 2: Individual Differences in Semantic Priming


Consistent with Stolz et al. (2005), we found moderate-sized split-half and
test–retest reliabilities for semantic priming, but only for first-associate trials.
For other-associate trials, correlations were mostly non-significant, even when
a relatedness proportion of 0.50 was used. These results indicate that priming is
more likely to be reliable when related primes and targets are strongly associated,
and reliability is present even when the SOA is very short (i.e. 200 ms). Having
established that first-associate semantic priming is reliable, we next examined
whether this was moderated by theoretically important individual differences.
Before conducting these analyses, we excluded 17 participants who scored more
than 2.5 standard deviations below the sample mean for any individual difference
measure, leaving 481 participants. Table 9.3 presents the correlations between the
individual difference measures, while Table 9.4 (see also Figures 9.1 and 9.2)
presents the correlations between participant-level first-associate priming effects

FIGURE 9.1 Scatterplots (with 95 percent confidence interval) between standardized
priming effects and the five individual difference measures, when first-associate primes
at an SOA of 200 ms are presented. [Panels not reproduced; x-axes: antisaccade, Stroop,
operational span, reading comprehension, and vocabulary knowledge; y-axes: z-score
priming. Adjusted R² = 0.02 (p = 0.002) for antisaccade, 0.01 (p = 0.02) for reading
comprehension, and 0.00 (ns) for Stroop, operational span, and vocabulary knowledge.]

FIGURE 9.2 Scatterplots (with 95 percent confidence interval) between standardized
priming effects and the five individual difference measures, when first-associate primes
at an SOA of 1,200 ms are presented. [Panels not reproduced; x-axes: antisaccade, Stroop,
operational span, reading comprehension, and vocabulary knowledge; y-axes: z-score
priming. Adjusted R² = 0.03 (p < 0.001) for antisaccade, 0.03 (p < 0.001) for reading
comprehension, 0.01 (p = 0.009) for vocabulary knowledge, and 0.00 (ns) for Stroop and
operational span.]

TABLE 9.3 Correlations between the individual difference measures.

                        Antisaccade   Operational   Stroop       Reading          Vocabulary
                                      span                       comprehension    knowledge
Antisaccade                  –         0.273***     −0.109*        0.163***         0.234***
Operational span                          –         −0.277***      0.312***         0.374***
Stroop                                                  –         −0.182***        −0.206***
Reading comprehension                                                  –            0.654***
Vocabulary knowledge                                                                    –

*** p < 0.001; * p < 0.05

(for both short and long SOA) and attentional control measures (antisaccade,
Stroop,2 operational span), and between priming and reading ability (reading
comprehension and vocabulary knowledge).3
In order to address the possibility that reading comprehension differences in
priming are spuriously driven by differences in antisaccade performance (or vice
versa), we also computed partial correlations. For short SOA priming, antisaccade
TABLE 9.4 Correlations between participant-level priming effects with attentional control
and reading ability measures.

200 ms SOA (N = 481)
Individual difference measure      Correlation with priming
Attentional control
  Antisaccade                            0.136**
  Stroop                                −0.057
  Operational span                      −0.007
Reading ability
  Reading comprehension                  0.104*
  Vocabulary knowledge                   0.061

1200 ms SOA (N = 481)
Individual difference measure      Correlation with priming
Attentional control
  Antisaccade                            0.181***
  Stroop                                −0.020
  Operational span                       0.042
Reading ability
  Reading comprehension                  0.168***
  Vocabulary knowledge                   0.119**

*** p < 0.001; ** p < 0.01; * p < 0.05

performance predicted priming (r = 0.121, p = 0.008) when reading comprehension was
controlled for, while the correlation between reading comprehension
and priming was borderline significant (r = 0.083, p = 0.068) when antisaccade
performance was controlled for. For long SOA priming, antisaccade performance
predicted priming (r = 0.159, p < 0.001) when reading comprehension and
vocabulary knowledge were controlled for. Reading comprehension predicted
priming (r = 0.119, p = 0.009) when antisaccade performance and vocabulary
knowledge were controlled for. Vocabulary knowledge did not predict priming
(r = −0.015, ns) when antisaccade performance and reading comprehension were
controlled.
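
The partial correlations above ask whether a given predictor still relates to priming once
the other measures are statistically removed. One standard way to compute them, sketched
here with numpy and scipy, is to regress both variables on the covariates and correlate the
residuals; all variable names below are placeholders rather than SPP variable names.

```python
import numpy as np
from scipy.stats import pearsonr

def partial_corr(x, y, covariates):
    # Pearson correlation between x and y after regressing out the covariates.
    # covariates: array with one column per control variable. The r value matches
    # the standard partial correlation; the p-value returned by pearsonr does not
    # adjust the degrees of freedom for the number of covariates.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(x)), np.asarray(covariates, dtype=float)])
    resid_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    resid_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return pearsonr(resid_x, resid_y)

# e.g. antisaccade vs. long-SOA priming, controlling for reading comprehension
# and vocabulary knowledge (hypothetical per-participant arrays):
# r, p = partial_corr(antisaccade, priming_long,
#                     np.column_stack([reading_comp, vocabulary]))
```
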
In sum, the results are clear-cut and straightforward to summarize. Specifically,
for both SOAs, two of the three measures of attentional control (Stroop and
operational span) were not correlated with priming, while participants who
performed better on the antisaccade task were associated with larger priming
effects. Of the two reading ability measures, reading comprehension was more
reliably related to priming, with reading comprehension scores correlating
positively with priming effects at both SOAs. It is worth reiterating that these
correlations cannot simply be attributed to scaling or to general slowing, because
the participant-level priming effects were computed from z-score RTs, which
controls for individual differences in processing speed (Faust et al., 1999).

Discussion
In the present study, by analyzing the large-scale behavioral data from the SPP
(Hutchison et al., 2013), we examined whether semantic priming was reliable, and
if so, how semantic priming was moderated by individual differences. There are a
couple of noteworthy observations. First, we extend previous work on reliability
(Hutchison et al., 2008; Stolz et al., 2005) by demonstrating that the reliability of
semantic priming effects depends on the association strength between primes and
targets. Second, we considered the impact of a broad array of individual difference
measures on semantic priming, and our analyses reveal that participants with greater
attentional control and reading ability show stronger priming effects.

Reliability of Semantic Priming


Stolz et al. (2005) were the first to examine the reliability of semantic priming
across different levels of SOA and relatedness proportion and found that although
priming effects were very robust at the level of the group, there was much less
stability in the within and between-session performance of individual participants.
Specifically, test–retest and split-half reliabilities were statistically significant only
when the relatedness proportion was 0.50 (at SOAs of 200 ms, 350 ms, and 800
ms), and when a long SOA (800 ms) was paired with a high relatedness proportion
of 0.75. In our study, we found moderate-sized and reliable priming with a
relatedness proportion of 0.50, at both short (200 ms) and long (1200 ms) SOAs,
and extended Stolz et al.’s (2005) findings by showing that significant reliability
is observed only when relatively strong associates are used as related prime-target
pairs.
In their account of these findings, Stolz et al. (2005) suggested that the contents of
semantic memory are inherently noisy and uncoordinated, and that when priming
reflects fully automatic processes (e.g. automatic spreading activation), effects can
be expected to be unreliable. However, when semantic priming also reflects
more strategic mechanisms such as memory recruitment and expectancy-based
processes, which are sensitive to the task context, priming effects become more
reliable (see Borgmann, Risko, Stolz, & Besner, 2007, which makes a similar
argument for reliability in the Simon task). For example, consider the performance
of a participant undergoing the primed lexical decision task on two separate
occasions. Participants who are more likely to recruit the prime episode during
target processing or to generate expectancies in response to the prime on the first
occasion can also be expected to behave the same way on the second occasion.
This account helps provide a unified explanation for Stolz et al.’s (2005) results.
That is, priming is unreliable at all SOAs when the relatedness proportion is 0.25
because there is insufficient incentive (only a one in four chance the target is
related to the prime) for the participant to generate potential targets or for the
prime episode to be retrieved. Increasing the relatedness proportion to 0.50 drives
up the likelihood of episodic prime retrieval (for short SOAs) and expectancy
generation (for longer SOAs), which then yields reliable priming effects. However,
one might then wonder why priming was reliable only at the longest SOA of
800 ms when the relatedness proportion was 0.75. Stolz et al. (2005) suggested that
at very high relatedness proportions, participants begin to notice the prime-target
relation and attempt to generate expectancies in an intentional manner. At shorter
SOAs (200 ms and 350 ms), expectancy-based processes are likely to fail, because
there is insufficient time for intentional generation and successful application of
expectations. These failed attempts disrupt and overshadow the impact of prime
retrieval, hence attenuating priming reliability.
We think our results can be nicely accommodated by a similar perspective. In
the SPP, a relatedness proportion of 0.50 was consistently used, which facilitates the
recruitment of the prime episode. However, why was reliability so much higher
for first-associate, compared to other-associate, prime-target pairs? As mentioned
earlier, prime recruitment processes are positively correlated with prime utility.
That is, there is a lower probability of prime recruitment under experimental
contexts where the prime is less useful, such as when the relatedness proportion
is low (Bodner & Masson, 2003), when targets are clearly presented (Balota, Yap,
Cortese, & Watson, 2008), and as the present results suggest, when primes are
weakly associated with their targets. In general, our results provide converging
support for Stolz et al.’s (2005) assertion that an increase in prime episode retrieval
yields greater reliability in priming.
However, while significant test–retest and split-half correlations at the very short
SOA of 200 ms (see Table 9.2) may suggest that reliability under these conditions
is not mediated by expectancy, Hutchison (2007) has cautioned against relying on
a rigid cutoff for conscious strategies such as expectancy generation. Indeed, it
is more plausible that expectancy-based processes vary across items, participants,
and practice. Consistent with this, there is mounting evidence for expectancy
even at short SOAs for strong associates and high-AC individuals (e.g. Hutchison,
2007). We tested this by using a median split to categorize participants as low-AC
or high-AC, based on their performance on the antisaccade task; reliabilities of
short SOA, first-associate trials were then separately computed for the two groups.
Interestingly, reliabilities were numerically higher for the high-AC group (split-half
r = 0.327, p < 0.001; test–retest r = 0.388, p < 0.001) than for the low-AC group
(split-half r = 0.215, p < 0.001; test–retest r = 0.234, p < 0.001); while the group
difference was not significant for split-half reliability, it approached significance
(p = 0.06) for test–retest reliability. Hence, it is possible that for high-AC
participants, expectancy generation may have partly contributed to the reliability
of short SOA priming.
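A rough sketch of this group comparison is shown below, assuming one already has two priming scores per participant (e.g. from odd and even trials) and an antisaccade score; the Fisher r-to-z test is one conventional way to compare independent correlations, although the chapter does not state which test was used.

```python
import numpy as np
from scipy import stats

def compare_independent_rs(r1, n1, r2, n2):
    # Fisher r-to-z test for the difference between two independent correlations
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    z = (z1 - z2) / np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return z, 2 * stats.norm.sf(abs(z))

# Hypothetical per-participant scores
rng = np.random.default_rng(1)
antisaccade = rng.normal(size=481)
priming_half1 = rng.normal(size=481)   # priming computed from one half of the trials
priming_half2 = rng.normal(size=481)   # priming computed from the other half

high = antisaccade >= np.median(antisaccade)   # median split into high-/low-AC groups
r_high = np.corrcoef(priming_half1[high], priming_half2[high])[0, 1]
r_low = np.corrcoef(priming_half1[~high], priming_half2[~high])[0, 1]
z, p = compare_independent_rs(r_high, high.sum(), r_low, (~high).sum())
```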

Individual Differences in Semantic Priming


The literature on individual differences in semantic priming is relatively sparse,
but previous work suggests that readers with higher-quality lexical representations
(as assessed by vocabulary knowledge) rely more heavily on relatively automatic
prospective priming mechanisms, which are not modulated by target difficulty,
whereas readers with lower-quality representations show more influence of
retrospective mechanisms (e.g. episodic prime retrieval or semantic matching) that
increase prime reliance for more difficult targets (Yap et al., 2009). There is also
evidence that individuals with more attentional control are better at strategically
calibrating their reliance on expectancy-based processes in response to contextual
variations in the predictive power of the prime (Hutchison, 2007).
We found that individuals who performed better on the antisaccade task also
produced larger priming effects. This is consistent with a recent study by Hutchison
et al. (2014), which reported greater priming for high-AC participants, but only
when forward associated prime-target pairs (e.g. atom—BOMB) were presented.
These results suggest that high-AC individuals are better able to prospectively
generate and maintain expectancy sets of possible related targets in response
to primes, thereby increasing priming for such participants. Indeed, Heyman, Van
Rensbergen, Storms, Hutchison, and De Deyne (2015) recently replicated and
extended Hutchison et al. (2014) by demonstrating that imposing a high working
memory load entirely eliminated forward priming, but left backward priming
intact. It is not entirely clear why operation span and Stroop performance did not
correlate with semantic priming effects,4 since all three attentional capacity tasks
putatively require participants to maintain information in working memory while
performing an ongoing task (Hutchison, 2007). However, in a comprehensive
analysis of the relationships between various working memory tasks, Shipstead
et al. (2014), using structural equation modeling, reported that operational span,
Stroop, and antisaccade performance do not seem to reflect a unitary construct.
Instead, operational span primarily taps the primary and secondary memory
components of working memory capacity (i.e. size of a person’s attentional focus),
whereas Stroop and antisaccade performance are more directly related to the
attentional control component of working memory capacity. Notably, antisaccade
performance (loading = 0.71) loaded far more highly on the AC factor than Stroop
performance (loading = −0.25), and Shipstead et al. (2014) speculated that this
was because the demands of the antisaccade task make it particularly sensitive
to individuals’ ability to rapidly recover from attentional capture. In contrast, the
Stroop task taps the efficiency of mechanisms that support the early inhibition
of distracting information. The present analyses further reinforce the importance
of considering the influence of AC on word recognition performance, and also
suggest that researchers may want to give more weight to antisaccade performance
when operationalizing AC.
In addition to attentional control, there was also evidence that more skilled
lexical processors (as reflected by better performance on reading comprehension
and, to a lesser extent, vocabulary knowledge measures) showed greater priming.
According to the lexical quality hypothesis (Perfetti & Hart, 2001), individuals
with higher quality representations should rely less on the strategic use of prime
information for resolving targets (Yap et al., 2009), and one would therefore predict
less, not more, priming for better lexical processors. However, if one assumes that
higher quality representations allow prime words to be identified faster, thereby
increasing the efficiency of prospective priming mechanisms such as automatic
spreading activation and expectancy, related primes provide a greater head-start for
highly skilled readers, which increases priming (Hutchison et al., 2014). This is also
consistent with developmental data which indicate that as an individual’s semantic
network develops over the lifespan, the links between nodes in the semantic
network become stronger, which in turn increases the influence of automatic
priming mechanisms (Nakamura, Ohta, Okita, Ozaki, & Matsushima, 2006).
Of course, the preceding post hoc explanation is speculative and awaits empirical
verification.

Reliability of Isolated Versus Primed Lexical Decision


The results of our reliability analyses are broadly in line with Stolz et al.’s (2005)
earlier findings. That is, semantic priming effects show moderate reliability (r ≈
0.30), but only when the relatedness proportion is sufficiently high (i.e. 0.50)
and when primes and targets are strongly related. Related to this, the correlations
between semantic priming effects and individual differences are similarly modest.
In classical test theory, the correlation between two variables cannot exceed
the square root of the product of the two variables’ reliabilities (Nunnally &
Bernstein, 1994). That is, even if semantic priming were correlated with a perfectly
reliable measure, the correlation cannot, in principle, exceed 0.55 in absolute
magnitude. In practice, when less reliable measures are examined, the correlations
between attentional control and priming have ranged between 0.18 and 0.20 (see
Hutchison, 2007; Hutchison et al., 2014).
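Written out, the classical test theory bound from the preceding paragraph is simply

\[ |r_{XY}| \le \sqrt{r_{XX'}\, r_{YY'}} , \]

so with a priming reliability of roughly 0.30 and a perfectly reliable criterion measure (reliability of 1.00), the ceiling is \(\sqrt{0.30 \times 1.00} \approx 0.55\).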
These findings contrast strongly with recent work examining individual
differences in isolated word recognition. Specifically, when one considers the
stability of participants’ sensitivity to lexical characteristics such as length,
neighborhood characteristics, and word frequency in isolated lexical decision,
correlations (rs between 0.38 and 0.75) are considerably higher (Yap et al., 2012).
The implication here is that although spreading activation processes within the
semantic network do not operate in a deterministic manner, this incoherence
does not extend to the processing of a word’s orthographic and phonological
characteristics. Indeed, this claim meshes well with findings from Waechter
et al. (2010), who showed that unlike semantic priming, repetition priming
(e.g. dog—DOG) performance, which provides insights into orthographic and
phonological input coding processes, was largely reliable.

Why was Priming Attenuated at a Longer SOA?


One unexpected result from the present analyses was the observation of weaker
priming at a longer SOA. Priming effects at very short (< 300 ms) SOAs should
primarily reflect automatic processes (but see Hutchison, 2007), whereas priming
effects at longer SOAs become larger due to an added influence of controlled
expectancy-based processes. Because the strategic processes which underlie the
generation of potential candidates from the prime require time to develop, they
should therefore be most influential at long SOAs (Neely, 1977). We considered a
couple of possible explanations for these results.
First, as mentioned before, in typical semantic priming experiments, strongly
associated prime-target pairs are overrepresented as stimuli. In the SPP, the large
number of stimuli featured a full range of associative strength and feature overlap,
and this perhaps made it more difficult for participants to generate and maintain
expectancies. However, as reported in the Results section (see Analysis 1), prime
type and SOA produced additive effects on semantic priming, that is, SOA
decreased priming to an equivalent extent for first-associate and other-associate
pairs. If prime-target relatedness did matter, then one would predict a larger effect
of SOA for other-associate trials. Second, the participants in the SPP were exposed
to an atypically large number of trials (over 800 trials per session) over two sessions,
and it is possible that this caused participants to stop maintaining expectancies over
time. Additional analyses provide some evidence for this. Specifically, the effect of
session was significant, F(1, 497) = 9.27, p = 0.002, MSE = 0.022, ηp² = 0.02,
with less priming at session 2 (M = 0.09) than at session 1 (M = 0.11).

Limitations and Future Directions


In the present study, we examined trial-level data from the SPP, an online
behavioral repository of over 800,000 primed lexical decision trials from over
500 participants. In line with previous studies, we found evidence that semantic
priming effects are moderately reliable, but mainly under conditions which
foster reliance on episodic prime retrieval or the strategic generation of related
targets. We also observed greater priming for participants with superior attentional
control and higher quality lexical representations, suggesting that such individuals
are better able to take advantage of prime information to drive prospective
priming mechanisms. From a more methodological perspective, this work strongly
underscores the utility of the megastudy approach for investigating semantic
priming phenomena, and we look forward to seeing more interesting questions
being answered by the SPP.
Of course, there are a number of limitations associated with the present
work. First, due to the archival nature of the SPP dataset, we were only
able to explore the influence of the individual differences measures available.
Researchers (Perfetti, 1992) have proposed that the most faithful index of lexical
quality is spelling performance, because accurate spelling requires precise lexical
representations; performance on other tasks (e.g. vocabulary knowledge, reading
comprehension) can be achieved on the basis of partial lexical information, which
can be compensated for by context (Andrews, 2012). In future work on individual
differences in semantic priming, researchers may want to explore the role of
spelling performance, which is presumably a more direct measure of lexical quality.
Second, we did not distinguish between forward, backward, and symmetrically
associated prime-target pairs in our analyses. Recent work (Heyman et al., 2015;
Hutchison et al., 2014; Thomas, Neely, & O’Connor, 2012) makes it clear that
prime direction can modulate the influence of automatic and controlled priming
processes. Future investigations can consider the interplay between attentional
control, lexical quality, and priming mechanisms in a finer-grained manner by
taking prime direction into account. Third, while we agree with Stolz et al.
(2005) that the present empirical evidence is most consistent with a noisy and
uncoordinated semantic system, it is difficult to entirely eliminate the effects of
controlled processing, even with a short SOA and low relatedness proportion. To
provide less equivocal support for the instability of automatic priming, it will be
useful to conduct similar reliability analyses using masked priming paradigms that
minimize the contaminating influence of strategy (Forster, 1998). Finally, despite
the scope of the SPP, one could plausibly argue that a sample of college students,
who are to a large extent selected for their reading ability and attentional control,
will show a restricted range of these measures compared to the general population.
Hence, it is possible that the true relationships between reading ability, attentional
control, and priming may be even larger than we have reported here.

Notes
1 Given the weaker priming observed for other-associate, compared to
first-associate, trials, one might wonder if reliability is lower in this condition
because of decreased variability in priming across participants. As suggested by
a reviewer, this possibility is ruled out by our data (see Table 9.1), which reveal
comparable variability in priming for first and other-associate trials.
2 Stroop performance was computed by averaging standardized Stroop effects in
RTs and accuracy rates.
3 Vocabulary knowledge was computed by averaging scores on the synonym,
antonym, and analogy tests.
4 We also explored this by examining the correlations between priming effects
and performance on the three tasks in previous studies (Hutchison, 2007;
Hutchison et al., 2014). While antisaccade performance was overall the
strongest predictor of priming, all three measures correlated with priming in
at least one of the experiments. Of course, the priming effects in these studies
were based on a smaller number of observations (20 related and 20 unrelated
trials).

References
Anderson, J. R., & Milson, R. (1989). Human memory: An adaptive perspective.
Psychological Review, 96, 703–719.
Andrews, S. (2012). Individual differences in skilled visual word recognition and reading:
The role of lexical quality. In J. S. Adelman (Ed.), Visual word recognition volume 2: Meaning
and context, individuals, and development (pp. 151–172). Hove, UK: Psychology Press.
Andrews, S., & Hersch, J. (2010). Lexical precision in skilled readers: Individual differences
in masked neighbor priming. Journal of Experimental Psychology: General, 139, 299–318.
Balota, D. A., & Yap, M. J. (2006). Attentional control and flexible lexical processing:
Explorations of the magic moment of word recognition. In S. Andrews (Ed.), From
inkmarks to ideas: Current issues in lexical processing (pp. 229–258). New York: Psychology
Press.
Balota, D. A., Yap, M. J., Cortese, M. J., & Watson, J. M. (2008). Beyond mean response
latency: Response time distributional analyses of semantic priming. Journal of Memory and
Language, 59, 495–523.
Balota, D. A., Yap, M. J., Cortese, M. J., Hutchison, K. A., Kessler, B., Loftis, B.,
. . . Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39,
445–459.
Balota, D. A., Yap, M. J., Hutchison, K. A., & Cortese, M. J. (2012). Megastudies: What
do millions (or so) of trials tell us about lexical processing? In J. S. Adelman (Ed.),
Visual word recognition Volume 1: Models and methods, orthography and phonology (pp. 90–115).
Hove, UK: Psychology Press.
Becker, C. A. (1980). Semantic context effects in visual word recognition: An analysis of
semantic strategies. Memory & Cognition, 8, 493–512.
Bodner, G. E., & Masson, M. E. J. (1997). Masked repetition priming of words and
non-words: Evidence for a nonlexical basis for priming. Journal of Memory and Language,
37, 268–293.
Bodner, G. E., & Masson, M. E. J. (2003). Beyond spreading activation: An influence of
relatedness proportion on masked semantic priming. Psychonomic Bulletin and Review, 10,
645–652.
Borgmann, K. W. U., Risko, E. F., Stolz, J. A., & Besner, D. A. (2007). Simons says:
Reliability and the role of working memory and attentional control in the Simon Task.
Psychonomic Bulletin and Review, 14, 313–319.
Butler, B., & Hains, S. (1979). Individual differences in word recognition latency. Memory
and Cognition, 7, 68–76.
Chateau, D., & Jared, D. (2000). Exposure to print and word recognition processes. Memory
and Cognition, 28, 143–153.
Faust, M. E., Balota, D. A., Spieler, D. H., & Ferraro, F. R. (1999). Individual differences in
information processing rate and amount: Implications for group differences in response
latency. Psychological Bulletin, 125, 777–799.
Forster, K. (1998). The pros and cons of masked priming. Journal of Psycholinguistic Research,
27, 203–233.
Goodwin, C. J. (2009). Research in psychology: Methods and design (6th edn.). Hoboken, NJ:
Wiley.
Heyman, T., Van Rensbergen, B. V., Storms, G., Hutchison, K. A., & De Deyne, S. (2015).
The influence of working memory load on semantic priming. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 41, 911–920.
Hutchison, K. A. (2007). Attentional control and the relatedness proportion effect in
semantic priming. Journal of Experimental Psychology: Learning, Memory, and Cognition,
33, 645–662.
Hutchison, K. A., Balota, D. A., Cortese, M., & Watson, J. M. (2008). Predicting semantic
priming at the item-level. The Quarterly Journal of Experimental Psychology, 61, 1036–1066.
Hutchison, K. A., Balota, D. A., Neely, J. H., Cortese, M. J., Cohen-Shikora, E. R., Tse,
C-S., . . . Buchanan, E. (2013). The Semantic Priming Project. Behavior Research Methods,
45, 1099–1114.
Hutchison, K. A., Heap, S. J., Neely, J. H., & Thomas, M. A. (2014). Attentional control
and asymmetric associative priming. Journal of Experimental Psychology: Learning, Memory,
and Cognition, 40, 844–856.
Kinoshita, S., Forster, K. I., & Mozer, M. C. (2008). Unconscious cognition isn’t that smart:
Modulation of masked repetition priming effect in the word naming task. Cognition, 107,
623–649.
Kinoshita, S., Mozer, M. C., & Forster, K. I. (2011). Dynamic adaptation to history of
trial difficulty explains the effect of congruency proportion on masked priming. Journal
of Experimental Psychology: General, 140, 622–636.
Lewellen, M. J., Goldinger, S. D., Pisoni, D. B., & Greene, B. G. (1993). Lexical familiarity
and processing efficiency: Individual differences in naming, lexical decision, and semantic
categorization. Journal of Experimental Psychology: General, 122, 316–330.
Lowe, C., & Rabbitt, P. (1998). Test/re-test reliability of the CANTAB and ISPOCD
neuropsychological batteries: Theoretical and practical issues. Neuropsychologia, 36,
915–923.
McNamara, T. P. (2005). Semantic priming: Perspectives from memory and word recognition. Hove,
UK: Psychology Press.
Matthews, G., & Harley, T. A. (1993). Effects of extraversion and self-report arousal on
semantic priming: A connectionist approach. Journal of Personality and Social Psychology,
65, 735–756.
Meyer, D. E., & Schvaneveldt, R. W. (1971). Facilitation in recognizing pairs of words:
Evidence of a dependence between retrieval operations. Journal of Experimental Psychology,
90, 227–234.
Morgan, C. J. A., Bedford, N. J., & Rossell, S. L. (2006). Evidence of semantic
disorganization using semantic priming in individuals with high schizotypy. Schizophrenia
Research, 84, 272–280.
Nakamura, E., Ohta, K., Okita, Y., Ozaki, J., & Matsushima, E. (2006). Increased inhibition
and decreased facilitation effect during a lexical decision task in children. Psychiatry and
Clinical Neurosciences, 60, 232–239.
Nation, K., & Snowling, M. J. (1999). Developmental differences in sensitivity to semantic
relations among good and poor comprehenders: Evidence from semantic priming.
Cognition, 70, B1–B13.
Neely, J. H. (1977). Semantic priming and retrieval from lexical memory: Roles of
inhibitionless spreading activation and limited-capacity attention. Journal of Experimental
Psychology: General, 106, 226–254.
Neely, J. H. (1991). Semantic priming effects in visual word recognition: A selective review
of current findings and theories. In D. Besner & G. Humphreys (Eds.), Basic processes in
reading: Visual word recognition (pp. 236–264). Hillsdale, NJ: Erlbaum.
Neely, J. H., Keefe, D. E., & Ross, K. L. (1989). Semantic priming in the lexical
decision task: Roles of prospective prime-generated expectancies and retrospective
semantic matching. Journal of Experimental Psychology: Learning, Memory, and Cognition,
15, 1003–1019.
Neely, J. H., O’Connor, P. A., & Calabrese, G. (2010). Fast trial pacing in a lexical decision
task reveals a decay of automatic semantic activation. Acta Psychologica, 133, 127–136.
Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004). The University of South Florida
free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments,
& Computers, 36, 402–407.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd edn.). New York:
McGraw-Hill.
Payne, B. K. (2005). Conceptualizing control in social cognition: How executive functioning
modulates the expression of automatic stereotyping. Journal of Personality and Social
Psychology, 89, 488–503.
Perfetti, C. A. (1992). The representation problem in reading acquisition. In P. B. Gough,
L. C. Ehri, & R. Treiman (Eds.), Reading acquisition (pp. 145–174). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Perfetti, C. A., & Hart, L. (2001). The lexical bases of comprehension skill. In D. S. Gorfein
(Ed.), On the consequences of meaning selection: Perspectives on resolving lexical ambiguity (pp.
67–86). Washington, DC: American Psychological Association.
Plaut, D. C., & Booth, J. R. (2000). Individual and developmental differences in semantic
priming: Empirical and computational support for a single-mechanism account of lexical
processing. Psychological Review, 107, 786–823.
Posner, M. I., & Snyder, C. R. R. (1975). Attention and cognitive control. In R. Solso
(Ed.), Information processing and cognition: The Loyola symposium (pp. 55–85). Hillsdale, NJ:
Erlbaum.
Shipstead, Z., Lindsey, D. R. B., Marshall, R. L., & Engle, R. W. (2014). The mechanisms of
working memory capacity: Primary memory, secondary memory, and attention control.
Journal of Memory and Language, 72, 116–141.
Stolz, J. A., Besner, D., & Carr, T. H. (2005). Implications of measures of reliability for
theories of priming: Activity in semantic memory is inherently noisy and uncoordinated.
Visual Cognition, 12, 284–336.
Thomas, M. A., Neely, J. H., & O’Connor, P. (2012). When word identification gets tough,
retrospective semantic processing comes to the rescue. Journal of Memory and Language, 66,
623–643.
Unsworth, N., Heitz, R. P., Schrock, J. C., & Engle, R. W. (2005). An automated version
of the operation span task. Behavior Research Methods, 37, 498–505.
Waechter, S., Stolz, J. A., & Besner, D. (2010). Visual word recognition: On the reliability
of repetition priming. Visual Cognition, 18, 537–558.
Whittlesea, B. W. A., & Jacoby, L. L. (1990). Interaction of prime repetition with visual
degradation: Is priming a retrieval phenomenon? Journal of Memory and Language, 29,
546–565.
Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock Johnson III tests of
cognitive abilities. Rolling Meadows, IL: Riverside Publishing.
Yap, M. J., Balota, D. A., Sibley, D. E., & Ratcliff, R. (2012). Individual differences in
visual word recognition: Insights from the English Lexicon Project. Journal of Experimental
Psychology: Human Perception and Performance, 38, 53–79.
Yap, M. J., Sibley, D. E., Balota, D. A., Ratcliff, R., & Rueckl, J. (2015). Responding to
non-words in the lexical decision task: Insights from the English Lexicon Project. Journal
of Experimental Psychology: Learning, Memory, and Cognition, 41, 597–613.
Yap, M. J., Tse, C.-S., & Balota, D. A. (2009). Individual differences in the joint effects of
semantic priming and word frequency: The role of lexical integrity. Journal of Memory and
Language, 61, 303–325.
10
SMALL WORLDS AND BIG DATA
Examining the Simplification Assumption in Cognitive Modeling

Brendan Johns, Douglas J. K. Mewhort, and Michael N. Jones

Abstract
The simplification assumption of cognitive modeling proposes that to understand a given
cognitive system, one should focus on the key aspects of the system and allow other sources
of complexity to be treated as noise. The assumption grants much power to a modeller,
permits a clear and concise exposition of a model’s operation, and allows the modeller
to finesse the noisiness inherent in cognitive processes (e.g., McClelland, 2009; Shiffrin,
2010). The small-world (or “toy model”) approach allows a model to operate in a simple
and highly controlled artificial environment. By contrast, Big Data approaches to cognition
(e.g. Landauer & Dumais, 1997; Jones & Mewhort, 2007) propose that the structure of a
noisy environment dictates the operation of a cognitive system. The idea is that complexity
is power; hence, by ignoring complexity in the environment, important information about
the nature of cognition is lost. Using models of semantic memory as a guide, we examine
the plausibility, and the necessity, of the simplification assumption in light of Big Data
approaches to cognitive modeling.

Approaches to modeling semantic memory fall into two main classes: Those
that construct a model from a small and well-controlled artificial dataset (e.g.
Collins and Quillian, 1969; Elman, 1990; Rogers & McClelland, 2004) and
those that acquire semantic structure from a large text corpus of natural language
(e.g. Landauer & Dumais, 1997; Griffiths, Steyvers, & Tenenbaum, 2007; Jones
& Mewhort, 2007). We refer to the first class as the small-world approach1 and
the latter as the Big Data approach. The two approaches differ on the method by
which to study semantic memory and exemplify a fundamental point of theoretical
divergence in subfields across the cognitive sciences.
The small-world approach to semantic memory is appealing because it makes
ground-truth possible; all of the knowledge about a world is contained within the
FIGURE 10.1 The semantic network used in Collins & Quillian (1969) and Rogers &
McClelland (2004).

training materials, and the true generating model is known. For example, Figure
10.1 reconstructs the taxonomic tree from Collins and Quillian (1969), and is also
used to generate the training materials in Rogers and McClelland (2004). The
reconstructed tree contains knowledge about a limited domain of living things,
namely plants and animals. Sampling propositions from the tree (as Rogers &
McClelland do) produces well-structured and accurate knowledge for a limited
domain of items. Thus, even though it is small in scale, the theorist knows the
structure of the world under question (e.g. that a canary is a bird), and hence, is
able to gauge how well the model acquires and processes information from the
training set.
The Big Data approach to the same problem is fundamentally different. Rather
than assuming that the world is structured in a single way, it proposes that the lexical
environment provides the ground-truth for the meaning of words, an approach
that also has a rich history (e.g. Wittgenstein, 1953). Although different learning
mechanisms have been proposed (see Jones, Willits, & Dennis, 2015 for a review),
they all rely upon co-occurrence patterns of words across large text corpora to infer
the meaning of a word. A model’s performance is assessed on a behavioral task (e.g.
a synonym test, ratings of semantic similarity, semantic priming, etc.) where the
model’s use of a word is compared against human performance. The model’s match
to the item-level behavior is used to justify the plausibility of the specific algorithm
used by the model.
Big Data techniques underscore the amount of information that is available in
the linguistic environment about a word’s meaning and the kind of mechanisms
that are needed to learn from it (e.g. Bullinaria & Levy, 2012; Recchia & Jones,
2009). Although Big Data approaches include more noise and uncertainty than
the small-world approaches, they have greater ecological validity. The Big Data
approach reminds us of the natural environment in which cognition is embedded,
ideas that have a long history in the cognitive sciences (Estes, 1975; Simon, 1969).
As engines with which to understand semantic memory, both approaches
have advantages and disadvantages. Small worlds support a more complete
understanding of how a model functions and what type of data it can learn than the
Big Data approach; the Big Data approach, by contrast, learns the structure of the
natural linguistic environment, data that are more plausible than hand-constructed
representations. Hence, there is a trade-off of clarity for plausibility in the different
techniques.
Clarity is a central tenet of the simplification assumption in cognitive modeling
(McClelland, 2009; Shiffrin, 2010). The assumption allows researchers to focus on
specific aspects of the cognitive system, while assuming other complexities away.
As McClelland (2009: 18) states, “The more detail we incorporate, the harder the
model is to understand.” However, as McClelland (2009) also points out, there is
a need to ensure that one’s assumptions are plausible and accurate. Otherwise one
may be misled by assuming away too many details.
In the study of semantic memory, the small-world approach exemplifies the
simplification assumption (Rogers & McClelland, 2004). The approach assumes
that the environment is structured in systematic ways and that the important issue
to be understood is how this information is acquired and processed. The risk is,
however, that a small world ignores the complexity of the problem, namely the acquisition
of meaning from the natural environment. By using artificially structured data to
train a model, one risks endorsing a learning mechanism that cannot be scaled to
the complexity of the natural environment. That is, the approach may trade plausibility
for clarity; such a trade-off forces the model to solve an artificial problem that
is not representative of human experience.
Our goal is to explore the issues of simplification in cognitive modeling, using
semantic memory as an example. The first part of the chapter examines multiple
small worlds used in semantic memory modeling with the goal of understanding
the power that using highly structured datasets provides. The second part examines
the amount of natural language information that is needed to approximate the
structure that is contained in the small world. We will use a standard co-occurrence
model of semantics to examine both issues (Jones & Mewhort, 2007; Recchia, Jones,
Sahlgren, & Kanerva, 2015). Assessing the amount of data needed to form an
approximation provides insight into the relationship between the small-world
and the Big Data approaches.
Analysis of a Small World


In their book Semantic Cognition, Rogers and McClelland (2004) offer a simple and
powerful model of semantic memory. This model accounts for a variety of semantic
memory phenomena, such as the learning of different concepts, developmental
trends, and the use of causal knowledge in semantics. The basic operation of
the model is backpropagation of propositional information within a connectionist
network, based on the previous work of Rumelhart (1990) and Rumelhart and
Todd (1993). The model is experience dependent and requires training materials,
derived from a corpus of propositions, to account for behavior.
Although we accept many of the central tenets of the Rogers and McClelland
semantic cognition theory, we question one aspect of their approach, namely the
training set used. They used a hierarchical representation of the world, building off
of Collins and Quillian (1969, see Figure 10.1), to generate a corpus of training
propositions. Only the item-level propositions are used in training (e.g. canary isa
living_thing, canary isa bird, canary can fly, etc.). Two hierarchies were used, one with
eight items and one with 21 items.
Although the plausibility of using such propositional information has been
questioned elsewhere (e.g. Barsalou, 1999), our question is the requisite amount
of structure in the training materials. The small world, as outlined in Figure 10.1,
is essentially noise free and contains a great deal of structure about the relationship
between items. The natural environment is much noisier and more ambiguous than
this setup. Given such heavily structured training materials, we question whether
the explanatory power of the model derives from the structure of the materials
used in training or the learning mechanism used by the model.
The question echoes recent work in the implicit memory literature: There
may be no need for an advanced learning mechanism to acquire the structure
of an artificial grammar; rather, the similarity structure of the items to
which subjects are exposed contains sufficient information to model the relevant
behavioral findings using a simple memory model (Jamieson & Mewhort, 2009a,
2009b, 2010, 2011). That is, the structure of the training materials is sufficiently
constrained that a simple encoding provides enough power to account for the
behavior under question. Recent work in natural language sentence processing
has strengthened the importance of item-level information in the modeling of
language effects (Bannard, Lieven, & Tomasello, 2012; Johns & Jones, 2015).
To analyze the amount of structure in Rogers and McClelland’s (2004) small
world, we will employ the BEAGLE model of lexical semantics (Jones & Mewhort,
2007) to construct representations of items. BEAGLE has been used in several
domains, including semantic priming (Jones, Kintsch, & Mewhort, 2006; Hare,
Jones, Thomson, Kelley, & McRae, 2009), memory search (Hills, Jones, & Todd,
2012), and semantic deficits in cognitive impairment (Johns et al., 2013). A key
aspect of the model—one that makes it appealing for use here—is that its lexical
representations are a direct reflection of the language environment; there is no
advanced inference or abstraction procedure. Thus, its representation will provide
insight into just how structured the small world is.
BEAGLE uses four vectors to represent a word: (1) A static environment
vector (with elements sampled from a Gaussian distribution in the standard
implementation), which is assumed to represent perceptual properties of a word;
(2) a context vector, which represents pure co-occurrence relationships; (3) an
order vector, which records the position of a word relative to the other words in
a sentence; and (4) a composite vector formed by summing the context and order
vectors. A word’s context vector is updated with the sum of the environmental
vectors for the other words appearing in the same sentence. A word’s order vector
is formed by binding it with all ngram chunks in the sentence with directional
circular convolution (see Jones & Mewhort, 2007 for additional detail).
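To make the encoding scheme concrete, the following is a minimal sketch of how context and order information could be accumulated for a single sentence. The dimensionality, the bigram-only chunking, and the use of fixed random permutations to make the convolution directional are simplifying assumptions of this sketch, not a faithful reimplementation of Jones and Mewhort (2007).

```python
import numpy as np

DIM = 1024
rng = np.random.default_rng(42)

def env_vector():
    # Static environmental vector: random Gaussian elements, as in standard BEAGLE
    return rng.normal(0.0, 1.0 / np.sqrt(DIM), DIM)

def cconv(a, b):
    # Circular convolution, computed via FFTs
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

# Fixed permutations mark left/right position so that the binding is directional
perm_left = rng.permutation(DIM)
perm_right = rng.permutation(DIM)

env, context, order = {}, {}, {}

def encode_sentence(words):
    for w in words:
        if w not in env:
            env[w] = env_vector()
            context[w] = np.zeros(DIM)
            order[w] = np.zeros(DIM)
    for i, w in enumerate(words):
        # Context: add the environmental vectors of the other words in the sentence
        context[w] += sum(env[o] for j, o in enumerate(words) if j != i)
        # Order: bind the word with its immediate neighbours (bigram chunks only)
        if i > 0:
            order[w] += cconv(env[words[i - 1]][perm_left], env[w][perm_right])
        if i < len(words) - 1:
            order[w] += cconv(env[w][perm_left], env[words[i + 1]][perm_right])

encode_sentence("canary can fly".split())
composite = {w: context[w] + order[w] for w in env}   # composite = context + order
```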
The BEAGLE model offers a fundamentally different view of semantic memory
than Rogers and McClelland’s (2004) approach. The most obvious difference is
that BEAGLE does not depend on error-driven learning, unlike a backpropagation
model. BEAGLE is a vector accumulation model; it builds a representation of
the world through the direct encoding of experience without requiring an error
signal. Hence, meaning is not based in the refinement of the predictions from the
model, but from the gradual buildup of knowledge through episodic experience.
The vector accumulation approach to semantics provides a simple framework to
determine the power of experience in explaining structure in semantic memory.

Learning a Small World


In this section, we will examine the acquisition of hierarchical knowledge through
the learning of propositional information. Rogers and McClelland demonstrated
that, across hundreds or thousands (in the case of the large network) of learning
epochs, the model gradually acquired a hierarchical representation of the items,
similar to the semantic network displayed in Figure 10.1. They assumed that
semantic memory is generally hierarchical, so the acquisition of a hierarchical
representation was of central importance to their modeling goals.
To determine how well BEAGLE learns this type of structure, we used the
same set of propositions from Rogers and McClelland (2004). This was done for
both the small network displayed in Figure 10.1, which includes eight items and
84 item-level propositions (propositions with the items in them), as well as the
larger semantic network contained in Appendix B.3 of Rogers and McClelland
(2004). The larger semantic network consisted of 21 words and 220 item-level
propositions.
To train BEAGLE, 25 propositions were iteratively sampled randomly with
replacement, and the word vectors were updated with those propositions.
Iteratively sampling with replacement means that a set of 25 propositions was
selected at each training point, with a selected proposition not being removed
from the search set if it was chosen (that is, it could be sampled multiple times
across training). Each proposition was treated as a sentence, with both relations
(e.g. isa, has, can, is) and features (fly, feathers, wings, swim, etc.), being represented
as words with their own unique environmental vectors. Words were represented
as the composite of the context and order vectors, similar to the approach taken
to understand artificial grammar acquisition in Jamieson and Mewhort (2011).
Similarity values taken from BEAGLE were averaged across 25 resamples of the
environmental vectors, to ensure that the results were not due to the random initial
configuration of similarity and that the emerging structure is from the learning of
propositional information.
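The sampling protocol itself is straightforward; a sketch is given below, reusing the hypothetical `encode_sentence` helper and `rng` from the previous sketch. The proposition strings are illustrative stand-ins for the Rogers and McClelland corpora, and in the reported simulations the resulting similarity values were averaged over 25 resamples of the environmental vectors.

```python
propositions = [
    "canary isa bird", "canary can fly", "canary can sing",
    "robin isa bird", "robin has wings", "salmon isa fish",
    # ... the remaining item-level propositions (84 for the small network, 220 for the large)
]

def train_increment(n_props=25):
    # One training point: sample n_props propositions with replacement and encode each
    # as a sentence (relations and features are treated as ordinary words)
    for _ in range(n_props):
        prop = propositions[rng.integers(len(propositions))]
        encode_sentence(prop.split())
```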
To test how well the model learned the structure contained in the semantic
networks, we computed a rank correlation between the cosine similarity of the words’
vector representations and the number of features the words share in the semantic network.
For example, in the network displayed in Figure 10.1, robin and canary share ten
features (isa bird, has wings, can fly, has feathers, isa animal, can move, has skin, can grow,
is living, isa living_thing), while robin and oak contain three (isa living_thing, can grow,
is living). This gives a simple metric of learning: how closely the similarity
values track the feature overlap values. We also use dendrograms generated from a
hierarchical clustering algorithm to show that the model learns accurate
hierarchical information, similar to how structure was illustrated by Rogers and
McClelland (2004).
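A sketch of this evaluation step is given below. It assumes a dictionary of learned word vectors and a dictionary mapping each word to its set of features from the network (both hypothetical names); the rank correlation comes from SciPy's `spearmanr` and the cluster structure from SciPy's hierarchical clustering routines.

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram

def evaluate(vectors, features):
    # vectors: {word: np.ndarray}; features: {word: set of feature strings}
    words = sorted(vectors)
    mat = np.vstack([vectors[w] for w in words])
    sim = 1.0 - squareform(pdist(mat, metric="cosine"))   # pairwise cosine similarity
    cosines, overlaps = [], []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            cosines.append(sim[i, j])
            overlaps.append(len(features[words[i]] & features[words[j]]))
    rho, _ = spearmanr(cosines, overlaps)                 # rank correlation with feature overlap
    tree = linkage(pdist(mat, metric="cosine"), method="average")
    clusters = dendrogram(tree, labels=words, no_plot=True)  # hierarchical cluster structure
    return rho, clusters
```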
Figure 10.2 shows the increase in correlation of the BEAGLE similarity values
to the amount of feature overlap of words in the two proposition corpora, as
a function of the number of propositions studied for both the small and large
networks. Figure 10.2 shows a simple trend: The model learned the structure of
the small world rapidly, even with minimal exposure to the training materials. For
the small network, performance hits asymptote at 75 propositions, close to the
size of the actual corpus. However, even at only 25 propositions the model was
capable of inferring a large amount of structure. As would be expected, it took
longer for the model to learn the structure of the large semantic network, with
the model hitting asymptote at around 150 propositions. The entire proposition
set is not needed to acquire accurate knowledge of the small world, due to the
highly structured nature of the learning materials. That is, the model was capable
of acquiring an accurate representation of the small world with only 75 percent of
the total number of propositions, in only a single training session.
To determine if the model reproduced the hierarchical structure of the
propositions correctly, Figure 10.3 shows the dendrogram of the hierarchical
clustering of the similarity space for the small network, while Figure 10.4 contains
the same information for the large network. The dendrograms were generated from
a BEAGLE model that was trained on all of the propositions from the different
networks, with no repeats. A dendrogram uses hierarchical clustering to determine
FIGURE 10.2 Increase in correlation between the cosine similarity of items and feature
overlap values derived from the two semantic networks, as a function of the number
of propositions studied.

FIGURE 10.3 Dendrogram of the hierarchical structure of BEAGLE trained on
propositions derived from the small semantic network.
FIGURE 10.4 Dendrogram of the BEAGLE model trained on propositions derived from
the large semantic network.

the hierarchical relationships among items. Figures 10.3 and 10.4 show that the
model learned the correct hierarchical generating structure of the environment, in
that it is identical to the structure displayed in the semantic network in Figure 10.1
(for the small network). This demonstrates that the BEAGLE model was able to
acquire hierarchical information (central to the proposals of the semantic cognition
theory), even with no explicit goal to do so, and with a very limited amount of
experience (as compared to the amount of training that a typical backpropagation
algorithm would require). These simulations demonstrate that a simple encoding
of the training materials provides enough power to learn both networks.

Discussion
Our goal in this section was twofold: First to determine the power that a
well-formed description of a small world provides a learning mechanism (in
the form of sets of propositions derived from semantic networks), and second
to assess how easily this information is learned with a simple co-occurrence
model of semantics. We did not intend to compare the vector accumulation
techniques of Jones and Mewhort (2007) with the backpropagation techniques of
Rogers and McClelland (2004). Both are perfectly capable of learning the required
information, but the BEAGLE model provides a simple method of determining the
power contained within the structure of the training materials. There is nothing
complicated about the vector accumulation model; it provides a mechanism for
efficiently recording the usage of a word. The model’s explanatory power comes
from the statistical structure of the environment.
BEAGLE was able to learn the structure of the small worlds efficiently, with very
limited training, through exposure to propositional information. One difference
between the vector accumulation approach and that of Rogers and McClelland
(2004) is that we assumed propositional information was equivalent to a sentence
in a natural language corpus. Rogers and McClelland, by contrast, assumed that
the goal of the learning operation was to take a word (e.g. canary) and a relation
(e.g. can) to construct a prediction about what the output pattern should be (e.g.
sing). There is no a priori reason why either approach should be preferred over
the other, as in both models the learning tasks are completely linguistic (that is,
there is no formal basis for assuming that the training materials represent more
than patterns to associate).
However, the BEAGLE model did not even require the full proposition set
to learn the small world (that is, it required less than a single training epoch);
backpropagation required hundreds or thousands of epochs of training to capture
the correct hierarchical structure of the training materials (Rogers & McClelland,
2004), depending on the size of the training corpus. The differential amount
of training required to learn the latent structure of the small worlds provides an
interesting look into the motivation of the two theories. Rogers and McClelland
used the evolution of knowledge gained across epochs to relate the model to
developmental trends in the acquisition of knowledge. The BEAGLE model was
not designed to explain small datasets, and, given the limited training materials
required for the model to acquire the small world, it would not be possible to
speak to developmental data with this analysis. Instead, the appealing aspect of the
framework is that it can be easily scaled up to analyze massive amounts of text
(Recchia et al., 2015), on a scale that is consistent with the amount of language a
typical adult would experience.
The ability to scale provides an opportunity to constrain the learning
mechanisms used to explain semantic memory, as it should be possible to
determine the amount of natural language information that is necessary to learn
the structure of a proposed small world. Thus, it allows for a connection to be
formed between the assumptions of the small world approach (heavily structured,
small-scale training materials) and the Big Data approach (noisy, large-scale data)
to understanding semantic memory. This allows a formal relationship to be established
between the two approaches: Given the structure of a small world (where the
simplification assumption makes the learning task much more straightforward),
it should be possible to determine how much natural language information is
necessary to approximate a small-world structure.

Big Data Analysis of Small-World Structure


This section will assess the minimum amount of natural language information
that would be necessary to learn the approximate structure of a proposed small
world with a semantic space model. The aim is to obtain an understanding of
the complexity of scaling from a small world to the natural environment. We will
use a vector accumulation model to analyze a large, and unique, collection of
high quality texts. A data fitting methodology will be used to determine the most
informative set of texts to approximate the structure of a small world. The high
quality of the texts and the use of a data fitting method will allow for confidence
that a set of highly informative texts is being assembled.

Model
The model that will be used here is an approximation of BEAGLE that is based on
sparse representations rather than Gaussian ones (Sahlgren, Holst, & Kanerva, 2008;
Recchia et al., 2015). The advantage of the sparse-vectors approach is that it greatly
reduces the computational complexity of the model and allows for a greater degree
of scaling. A very large amount of text is going to be used in this analysis, so the
computationally simpler model has obvious advantages. Only context vectors will
be used, rather than order vectors, in order to simplify the learning task, as now the
model is only using sentential context to form semantic representations. Similar to
past studies (e.g. Recchia et al., 2015), vectors will be large (5,000 dimensional) and
the environmental vectors are very sparse (six non-zero values randomly sampled),
similar to binary spatter codes.
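A minimal sketch of this sparse accumulation scheme is given below. The dimensionality and the six non-zero elements follow the values in the text; the ±1 entries and the dictionary bookkeeping are assumptions of the sketch.

```python
import numpy as np

DIM = 5000       # vector dimensionality
NONZERO = 6      # non-zero elements per environmental vector
rng = np.random.default_rng(7)

def sparse_env_vector():
    # Sparse ternary environmental vector: six randomly placed +1/-1 entries
    v = np.zeros(DIM)
    idx = rng.choice(DIM, size=NONZERO, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

env, context = {}, {}

def update_context(sentence_words):
    # Add the environmental vectors of the other words in the sentence to each
    # word's context vector; no order information is encoded, as in the text
    for w in sentence_words:
        if w not in env:
            env[w] = sparse_env_vector()
            context[w] = np.zeros(DIM)
    total = sum(env[w] for w in sentence_words)
    for w in sentence_words:
        context[w] += total - env[w]
```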

Training materials
The set of texts that will be used to train the model is drawn from five different
sources: (1) Wikipedia; (2) Amazon product descriptions (from McAuley &
Leskovec, 2013); (3) a set of 1,000 fiction books; (4) a set of 1,050 non-fiction
books; and (5) a set of 1,500 young adult books. All book text was scraped from
e-books, and all were published in the last 60 years by popular authors. To ensure
that each text source would contribute equally, each source was trimmed to a set
of six million sentences with random sampling, for a total of 30 million sentences
across all texts (approximately 400 million words). The data fitting method will
determine which set of texts are the most informative for generating the small
worlds.
Data fitting methodology


Rather than training the model with all language materials, we used a new way to
determine which sets of text offer the best fit to a proposed small world. Taking
the different corpora described above, we split them into subsections of 10,000
sentences (approximately the size of an average fiction novel, a rather small amount
of language). The result was 3,000 separate text subsections. The subsections were
treated as individual corpora, and a set of vectors was learned for each subsection.
That is, 3,000 different models were constructed. These representations were
necessarily limited in the knowledge they contain, given the small amount of
language in each subsection.
These 3,000 different vector sets were used to generate an overall representation
that was maximized on the amount of knowledge contained about the small
world. To do this, a hill-climbing algorithm was used to determine which parts
provide the best fit to the proposed structure of a small world, by iteratively
selecting the sections that provide the largest increase in fit to the structure
of the small worlds. The first iteration of the method selects the section
that provides the best fit. On subsequent iterations, vector sets are summed
together to form an overall representation. Whichever section provides the largest
increase in fit is selected and summed into the representation. All sets are
combined with the overall representation at each iteration, meaning that the
entire language space is being tested. Once a vector set has been selected, it
is removed from the search set. In this way, the semantic representation that is
constructed continuously increases its resolution, as does the amount of knowledge
it contains about the small world. The process ends when the addition of
further vectors into the overall representation decreases the fit of the model to
the data.
Although the problems with hill-climbing algorithms are well-known (e.g.
getting stuck in local maxima), the method used here provides a simple
means by which to determine how much linguistic information is necessary
to form an approximation of the structure contained in a small world. One
could think of the process as a form of parameter fitting, whose use is
ubiquitous across cognitive modeling (see Shiffrin, Lee, Kim, & Wagenmakers,
2008), but instead of maximizing the internal parameters to explain a set of
behavioral data, we maximized the structure of the external world (i.e. linguistic
information).
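The greedy selection loop might look something like the sketch below, where `fit` stands for whatever statistic is being maximized (here, the rank correlation with the small world's feature-overlap values) and `subsection_vectors` is the list of per-subsection vector sets; both names are hypothetical, and the sketch assumes each set covers the same target vocabulary.

```python
import numpy as np

def hill_climb(subsection_vectors, fit):
    # Greedily add the 10,000-sentence subsection whose vectors most improve the
    # fit of the summed representation; stop when no addition improves the fit.
    words = list(subsection_vectors[0])
    dim = len(next(iter(subsection_vectors[0].values())))
    overall = {w: np.zeros(dim) for w in words}
    remaining = list(range(len(subsection_vectors)))
    best_fit, selected = -np.inf, []

    while remaining:
        scored = []
        for i in remaining:
            candidate = {w: overall[w] + subsection_vectors[i][w] for w in words}
            scored.append((fit(candidate), i))
        new_fit, best_i = max(scored)
        if new_fit <= best_fit:       # adding more text no longer helps
            break
        best_fit = new_fit
        overall = {w: overall[w] + subsection_vectors[best_i][w] for w in words}
        selected.append(best_i)
        remaining.remove(best_i)      # a chosen subsection leaves the search set
    return overall, selected, best_fit
```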

Small Worlds
The small worlds that will be approximated here are the same as those used in the
previous section, both the small and large semantic networks. The only difference was
that the word sunfish was replaced with trout, due to its very low frequency across the
different corpora. The first tests conducted examined the similarity values (assessed
with a vector cosine) between the word-level items in the hierarchy. These were
assessed with a rank correlation to the amount of feature overlap in the semantic
network, identical to the small world analysis described above. This analysis will
provide an analogous examination to those used in the small world analysis.
In order to further test the knowledge acquisition process, we used an additional
test over both semantic networks. Specifically, we used an alternative forced choice
(AFC) task, where the model has to determine which semantic feature (e.g. plant
or animal) is more likely for a particular word (e.g. canary). This test was conducted
at each level of the hierarchy (e.g. the model was asked to discriminate plant/animal,
and then tree/flower/bird/fish/mammal, and so on, until the bottom level was reached).
The test involves 52 questions for the small network and 140 questions for the large
network, derived from every level of the hierarchy for both of the networks. Not all
levels contained the same number of features, so the number of alternatives ranges
from two to ten. Obviously, discriminating among ten alternatives is a difficult task
for the model, but it does provide a strong test of the semantic representation that
the model is constructing. Performance was assessed by determining the proportion
of correct responses on the AFC test.
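To make the test concrete, a question of this kind can be scored by comparing the
target word's vector with the vectors of the alternative features and choosing the most
similar one. The sketch below assumes that the features (e.g. plant, animal) are
themselves represented as vectors in the same space; the function and variable names
are hypothetical:

import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v) or 1.0
    return float(u @ v) / denom

def afc_accuracy(questions, vectors):
    """questions: list of (word, correct_feature, list_of_alternative_features).
    vectors: dict mapping words and feature labels to their vectors."""
    correct = 0
    for word, answer, alternatives in questions:
        # The model picks the alternative whose vector is closest to the word's vector.
        choice = max(alternatives, key=lambda f: cosine(vectors[word], vectors[f]))
        correct += int(choice == answer)
    return correct / len(questions)

# Example: a two-alternative question from the top level of the hierarchy.
# afc_accuracy([("canary", "animal", ["plant", "animal"])], vectors)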

Results
Figure 10.5 shows the fit of the model when it is optimized to account for
feature overlap values for both the small and large semantic networks. As shown
in Figure 10.5, the model is capable of rapidly acquiring the semantic structure
of a small world, as the correlation between the model’s vector cosine and the
feature overlap values from the network increases substantially as more linguistic
information is learned. The complexity of the small world obviously plays a large
role in the amount of linguistic experience that is necessary to account for the
proposed structure. The small network maximized at 150,000 sentences, while the
large network maximized at 880,000 sentences. However, the small network hit
asymptote at around 50,000 sentences, while the large network did the same at
approximately 200,000 sentences.
To demonstrate that the model acquired the same hierarchical information
learned in the small-world modeling, the dendrograms for both the small and large
networks are displayed in Figure 10.6 and Figure 10.7, respectively. For the small
network, the hierarchical clustering method accurately infers the clusters across
the eight words. For the large network, the model reproduces the overall cluster
properties (clustered into trees/flowers/fish/bird/mammals), with only one error (robin
was classified in its own cluster, closer to mammals). The error arose because
the word robin had approximately equal similarity to both birds and mammals.
Additionally, the clusters are not as well discriminated as those in the dendrogram
in Figure 10.4, a result expected given the differences in noise across the training
sets. Natural
FIGURE 10.5 Correlation of the similarity between words and the feature overlap
values from the two semantic networks, as a function of the number of sentences
sampled.

language contains much more knowledge about the words in question (e.g.
that a robin nests in trees), which shifts the similarity space. However, the simulation
is impressive because acquiring hierarchical information is not an explicit goal for
the model. Instead, the language environment alone was sufficient, as the hierarchical
structure emerged across training, similar to the findings of Rogers and McClelland
(2004).
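A dendrogram of this kind can be produced directly from the learned vectors. The
sketch below, which uses SciPy's agglomerative clustering on cosine distances, is one
way to do so; the choice of library and linkage method is an assumption for
illustration, not a description of the authors' pipeline:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

def plot_dendrogram(words, vectors):
    """Cluster the word vectors by cosine distance and draw the dendrogram."""
    matrix = np.vstack([vectors[w] for w in words])
    distances = pdist(matrix, metric="cosine")    # condensed pairwise distances
    tree = linkage(distances, method="average")   # average-link clustering
    dendrogram(tree, labels=words, orientation="right")
    plt.show()

# plot_dendrogram(["pine", "oak", "daisy", "rose",
#                  "trout", "salmon", "robin", "canary"], vectors)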
A stronger test of the model’s ability to acquire the requisite information is given
by the AFC test described above. In the AFC test, the model has to discriminate
features associated with the word in the semantic networks. Figure 10.8 displays
the increase in performance across the corpus sampling routine. Even though
the learning task was quite difficult, the model reached an impressive 98 percent
accuracy for the small network questions, and an 87 percent accuracy for the large
network questions. The test is similar to a synonym test, such as the TOEFL
test used in Landauer and Dumais (1997), where LSA achieved a performance
level of 55 percent. That the model achieves such a high performance level
demonstrates the power both of the training materials that were assembled and
of the data fitting method that was used. Rather surprisingly, the model did
not require as many sentences to learn the features of the semantic network as
it did to learn the connections among words, as only 130,000 sentences were
[Dendrogram leaves, top to bottom: pine, oak, daisy, rose, trout, salmon, robin,
canary.]
FIGURE 10.6 Dendrogram of the hierarchical structure of the representation learned
from the corpus sampling routine for words from the small network.

[Dendrogram leaves, top to bottom: flounder, cod, trout, salmon, canary, penguin,
sparrow, goat, pig, cat, dog, mouse, robin, pine, birch, maple, oak, daisy, tulip,
sunflower, rose.]
FIGURE 10.7 Dendrogram of the hierarchical structure of the representation learned
from the corpus sampling routine for words from the large network.
FIGURE 10.8 Performance on the AFC test as a function of the number of sentences
sampled.

needed to reach maximum performance for the small network, and 560,000
sentences were needed for the large network. It is worth noting that the TASA
corpus—a standard corpus used in co-occurrence models of semantics since
Landauer and Dumais (1997)—contains approximately 760,000 sentences. So, the
number of sentences that the optimization method required is not overly large
when compared with standard corpora that are used in the field of semantic
memory. The test demonstrates that a model trained directly with data from the
linguistic environment not only learned the hierarchical structure of a small world
but also learned the semantic features that define it.
The simulations demonstrate that a Big Data analysis approximates the structure
contained in small worlds quite readily. However, a different question lies in the
comparative complexity of the learning task that models of semantic memory face.
For the small network, the small world had 84 propositions. In the Big Data
analysis, the model maximized performance at 1,807 times more sentences than
are contained in the proposition corpus. For the larger semantic network, 3,911
times more sentences were required. However, the feature AFC test
only needed 1,566 and 2,488 times more sentences than the proposition corpus
for the small and large networks, respectively, suggesting that different types of
information can be learned more efficiently. Overall, our analysis suggests that the
amount of information contained in a small world versus that contained in Big
Data are on scales that differ by multiple orders of magnitude.

Discussion
This section attempted to understand the connection between small-world analyses
of semantic memory and approaches that rely upon Big Data. We found that a
model which relied upon large amounts of lexical
experience to form a semantic representation of an item was able to acquire
a high quality approximation to the structure of different small worlds, assessed
with multiple tests, including the acquisition of hierarchical information and the
learning of semantic features. The corpus-sampling algorithm allowed the model
to select a set of texts that provided the best fit to the small-world structure. Even
with this data-fitting methodology, obtaining a maximal fit required large amounts
of natural language materials.

General Discussion
The goal of this chapter was to examine the simplification assumption in cognitive
modeling by comparing the small world versus Big Data approaches to semantic
memory and knowledge acquisition. A central aspect of semantic memory is
the learning of environmental structure through experience. The small-world
approach proposes that in order to understand the mechanisms that underlie this
ability, the structure of the environment must be constrained to lay bare the
learning process itself. By constraining the complexity of the training materials,
it is possible to study the operations of a model at a fine-grained level, as the
theorist knows the generating model that produced the environmental experience.
In contrast, a Big Data approach proposes that ecologically valid training materials
are necessary—natural language materials. Although the researcher loses control
of the minutiae of what the model learns, the approach gains power through its
plausibility, as such models readily scale up to the linguistic experience that
humans receive.
To determine the relation between these two approaches, the small world
analysis examined how readily a vector-accumulation model of semantics (a
standard Big Data model) could acquire the structure of a small world. We found
that the model rapidly acquired knowledge of the small world through a small
corpus of propositions. By limiting the complexity of the training materials, the
learning task became quite simple. The structure of the items used was sufficient
to explain the behaviors in question, with no advanced learning or abstraction
processes necessary. This echoes recent work on implicit memory and artificial
grammar learning (e.g. Jamieson & Mewhort, 2011) and natural language sentence
processing (Johns & Jones, 2015).
Given the ease with which the model was capable of learning the small world,
the Big Data analysis determined how much natural language information was
necessary to approximate the structure proposed by the two semantic networks
from Rogers and McClelland (2004). We used a very large amount of high quality
language sources—large sets of fiction, non-fiction, and young adult books, along
with Wikipedia and Amazon product descriptions. The texts were split into smaller
sections, and a hill-climbing algorithm was used to select the texts iteratively that
allowed the model to attain the best fit to the proposed structure. Across multiple
tests, the model was readily capable of learning the small worlds, but the amount of
experience needed was orders of magnitude greater than the size of the proposition
corpora. This analysis suggests that the learning tasks under the two approaches
differ greatly in the complexity of the materials used.
This issue of informational complexity gets at the crux of the problems
surrounding the simplification assumption in studying semantic memory: By
reducing the complexity of the structure of the environment to a level that is
tractable for a theorist to understand fully, the problem faced by a hypothesized
learner is trivialized. The linguistic environment, although heavily structured in
some ways, is still extremely noisy, and requires learning mechanisms that are
capable of discriminating items contained in very large sets of data. Thus, the
simplification assumption biases theory selection towards learning and processing
mechanisms that resemble humans on a small and artificial scale. The emergence of
Big Data approaches to cognition suggests that artificial toy datasets are no longer
necessary. Models can now be trained and tested on data that are on a similar
scale to what people experience, increasing the plausibility of the model selection
and development process. For models of semantic memory, the existence of high
quality large corpora to train models eliminates the necessity for oversimplification,
and offers additional constraints as to how models should operate.
This is not to say that the small-world approach is without merits, a point
clear in the history of cognitive science. The goal of the small-world assumption is
to provide an accurate understanding of the operation of a model with clear and
concise examples, something that models that focus only on Big Data techniques
cannot achieve. Thus, as in the evolution of any new theoretical framework, past
knowledge should be used to constrain new approaches. The Big Data approach
should strive to embody the ideals of the simplification assumption in cognitive
modeling, that of clear and concise explanations, while continuing to expand the
nature of cognitive theory.

Note
1 What we refer to here as a small-world approach is also commonly referred to
as a “toy model” approach.

References
Bannard, C., Lieven, E., & Tomasello, M. (2009). Modeling children’s early grammatical
knowledge. Proceedings of the National Academy of Sciences of the United States of America,
106, 17284–17289.
Barsalou, L. W. (1999). Perceptual symbol systems. Behavioral Brain Science, 22, 577–660.
Bullinaria, J. A., & Levy, J. P. (2012). Extracting semantic representations from word
co-occurrence statistics: Stop-lists, stemming, and SVD. Behavior Research Methods, 44,
890–907.
Collins, A. M., & Quillian, M. R. (1969). Retrieval time from semantic memory. Journal
of Verbal Learning and Verbal Behavior, 8, 240–247.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Estes, W. K. (1975). Some targets for mathematical psychology. Journal of Mathematical
Psychology, 12, 263–282.
Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation.
Psychological Review, 114, 211–244. doi: 10.1037/0033-295X.114.2.211.
Hare, M., Jones, M. N., Thomson, C., Kelley, S., & McRae, K. (2009). Activating event
knowledge. Cognition, 111, 151–167.
Hills, T. T., Jones, M. N., & Todd, P. T. (2012). Optimal foraging in semantic memory.
Psychological Review, 119, 431–440.
Jamieson, R. K., & Mewhort, D. J. K. (2009a). Applying an exemplar model to the
artificial-grammar task: Inferring grammaticality from similarity. Quarterly Journal of
Experimental Psychology, 62, 550–575.
Jamieson, R. K., & Mewhort, D. J. K. (2009b). Applying an exemplar model to the
serial reaction time task: Anticipating from experience. Quarterly Journal of Experimental
Psychology, 62, 1757–1784.
Jamieson, R. K., & Mewhort, D. J. K. (2010). Applying an exemplar model to
the artificial-grammar task: String-completion and performance for individual items.
Quarterly Journal of Experimental Psychology, 63, 1014–1039.
Jamieson, R. K., & Mewhort, D. J. K. (2011). Grammaticality is inferred from global
similarity: A reply to Kinder (2010). Quarterly Journal of Experimental Psychology, 64,
209–216.
Johns, B. T., & Jones, M. N. (2015). Generating structure from experience: A retrieval-based
model of sentence processing. Canadian Journal of Experimental Psychology, 69, 233–251.
Johns, B. T., Taler, V., Pisoni, D. B., Farlow, M. R., Hake, A. M., Kareken, D. A., . . . Jones,
M. N. (2013). Using cognitive models to investigate the temporal dynamics of semantic
memory impairments in the development of Alzheimer’s disease. In Proceedings of the 12th
International Conference on Cognitive Modeling, ICCM.
Jones, M. N., & Mewhort, D. J. K. (2007). Representing word meaning and order
information in a composite holographic lexicon. Psychological Review, 114, 1–37.
Jones, M. N., Kintsch, W., & Mewhort, D. J. K. (2006). High-dimensional semantic space
accounts of priming. Journal of Memory and Language, 55, 534–552.
Jones, M. N., Willits, J. A., & Dennis, S. (2015). Models of semantic memory. In
J. R. Busemeyer & J. T. Townsend (Eds.), Oxford handbook of mathematical and computational
psychology.
Landauer, T. K., & Dumais, S. (1997). A solution to Plato’s problem: The Latent
Semantic Analysis theory of the acquisition, induction, and representation of knowledge.
Psychological Review, 104, 211–240.
McAuley, J., & Leskovec, J. (2013). Hidden factors and hidden topics: Understanding rating
dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender
Systems, Rec-Sys ’13 (pp. 165–172).
McClelland, J. L. (2009). The place of modeling in cognitive science. Topics in Cognitive
Science, 1, 11–38.
Recchia, G. L., & Jones, M. N. (2009). More data trumps smarter algorithms: Comparing
pointwise mutual information to latent semantic analysis. Behavior Research Methods, 41,
657–663.
Recchia, G. L., Jones, M. N., Sahlgren, M., & Kanerva, P. (2015). Encoding sequential
information in vector space models of semantics: Comparing holographic reduced
representation and random permutation. Computational Intelligence and Neuroscience.
Retrieved from http://dx.doi.org/10.1155/2015/986574.
Rogers, T. T., & McClelland, J. L. (2004). Semantic Cognition: A parallel distributed processing
approach. Cambridge, MA: MIT Press.
Rumelhart, D. E. (1990). Brain style computation: Learning and generalization. In S. F.
Zornetzer, J. L. Davis, & C. Lau (Eds.), An introduction to neural and electronic networks (pp.
405–420). San Diego, CA: Academic Press.
Rumelhart, D. E., & Todd, P. M. (1993) Learning and connectionist representations. In
D. E. Meyer & S. Kornblum (Eds.), Attention and performance XIV: Synergies in experimental
psychology, artificial intelligence, and cognitive neuroscience (pp. 3–30). Cambridge, MA: MIT
Press.
Sahlgren, M., Holst, A., & Kanerva, P. (2008). Permutations as a means to encode
order in word space. In Proceedings of the 30th Conference of the Cognitive Science Society
(pp. 1300–1305).
Shiffrin, R. M. (2010). Perspectives on modeling in cognitive modeling. Topics in Cognitive
Science, 2, 736–750.
Shiffrin, R. M., Lee, M. D., Kim, W. T., & Wagenmakers, E.-J. (2008). A survey of model
evaluation approaches with a tutorial on hierarchical Bayesian methods. Cognitive Science,
32, 1248–1284.
Simon, H. A. (1969). The sciences of the artificial. Cambridge, MA: MIT Press.
Wittgenstein, L. (1953). Philosophical investigations. Oxford: Blackwell.
11
ALIGNMENT IN WEB-BASED
DIALOGUE
Who Aligns, and How Automatic Is It? Studies in
Big-Data Computational Psycholinguistics

David Reitter

Abstract
Studies on linguistic alignment explain the emergence of mutual understanding in
dialogue. They connect to psycholinguistic models of language processing. Recently,
more computational cognitive models of language processing in the dialogue setting have
been published, and they increasingly are based on observational, large datasets rather
than hypothesis-driven experimentation. I review literature in data-intensive computational
psycholinguistics and demonstrate the approach in a study that elucidates the mechanisms
of alignment. I describe consistent results on data from large online forums, Reddit, and
the Cancer Survivors Network, which suggest that linguistic alignment is a consequence of
the architecture of the human memory system and not an explicit, consciously controlled
process tailored to the recipient of a speaker’s contributions to the dialogue.

Introduction
In this chapter, I will survey studies on linguistic alignment in order to connect
the inferences we make from large datasets to the psycholinguistics of dialogue.
This work is motivated by a high-level question: How does our mind select and
combine words in a way that reliably communicates our intentions to a dialogue
partner? The search for a computational answer to this question has been revitalized
in recent years by the advent of large datasets. Large data give us a window into an
individual’s mind and a cooperative process between minds. They allow us to look at
how dialogue partners gradually converge in their choices of words and sentence
structure, thereby creating a shared language.
This process relies on an implied contract among the people engaging in
dialogue. It specifies long-term and temporary conventions that establish the
meaning of words, idioms, syntactic structure, or the general topics of conversation.
The rules governing convention-forming are subject to debate: How does the
contract evolve, and how do people accommodate the linguistic needs of their
dialogue partners? What are the cognitive mechanisms that access memory and
produce sentences in order to comprehend, and produce contextualized language?
The studies I discuss in the following aim to shed light on these mechanisms.
The availability of data and new methods from information science has given
researchers the tools they need to answer these questions in reference to “language
in the wild”: Real-world and large-scale language use as opposed to hand-picked
examples or carefully constructed experimental materials. However, information
and network science as well as modern-era computational linguistics have all been
somewhat agnostic to the psychological processes that produce the data that they
study. Yet, many of their methods are useful in the context of cognitive science.
In the following, I will use linguistic and structural cues to identify syntactic
repetition, but also to characterize an interlocutor’s role in contributing novelty
to the conversation.

Integrating Psycholinguistics and Cognitive Modeling
Thus far, models of language production have used representations that were
either too specialized or too generic. Grammar formalisms are representations
that describe syntax at a high level, or that provide a computational account
of the syntactic process (e.g. Pollard & Sag, 1994; Joshi & Schabes, 1997;
Steedman, 2000). However, these representations leave open many computational
questions. They may fall short of explaining all permissible sentences, or they can
over-generate by permitting too many sentences. Connectionist representations
(e.g. Dell, Chang, & Griffin, 1999; Elman, 1990) are often focused on specific
aspects of language processing, although the machine learning and AI literature
has advanced far beyond (e.g. Mnih & Hinton, 2009; Mikolov, Sutskever, Chen,
Corrado, & Dean, 2013). While models of language processing have been
connected to the psychological literature, their assumptions about memory use
are intentionally more modest (e.g. Gibson, 1998). To move forward, we need a
tighter integration of models of language processing and cognitive architectures.
This raises several questions, which address the core of cognitive science.

1 Cognitive Plausibility: Given the available information and computational-
cognitive resources, which accounts of linguistic representation can be learned
and processed in real-time by the mind?
2 Representations: Which learned mental representations guide language
processing?
3 Specialization: Which computational operations and memory components
are specific to language, rather than being shared with general cognition?

An integrated account can take a stance with respect to each of these questions.
In following Newell’s call for models that “must have all the details” and describe
“how the gears clank and how the pistons go” (Anderson, 2007), the model should
be a computational account that actually carries out language acquisition, language
comprehension, and language production. Further, Newell’s call for functional
models means that we need to cover the broad range of linguistic constructions
present in a corpus. To achieve this objective, we must use the large-scale language
resources that are standard in computational linguistics. They reflect language use
in the wild.
The conversation about such approaches has been taking place in a relatively
new field, computational psycholinguistics, which discovered a range of phenomena
that may form the basis for how we think about the mechanisms of human language
acquisition and processing.
Linguists have asked provocative questions using these methods. To name a
few: How is information density distributed throughout text, and why? When
is language production incremental? How is working memory used in language
processing? Computational psycholinguistics has pushed the boundaries to cover
the broad expressive range found in corpora.
The field investigates how humans learn, produce, and comprehend natural language,
and its models are informed by observations from contemporary language
use. Standard psycholinguistic methods examine human language performance by
collecting data on comprehension and production speed, eye movements while
reading (Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995; Demberg &
Keller, 2008; Henderson & Ferreira, 2004), or specific processing difficulties
(e.g. self-embedding or general center-embedding: Chomsky & Miller, 1963,
and Gibson, 1991). These methods are productive but require data collection,
while more can be learned from unannotated data. The machine learning
field of semi-supervised learning has developed computational accounts that
describe successful learning from small portions of annotated and large portions
of unannotated data (e.g. Chapelle, Scholkopf, & Zien, 2006; Ororbia II, Reitter,
Wu, & Giles, 2015).
Large datasets have been used to first verify experimental results in naturalistic
language, and now they allow us to find more fine-grained support for theoretical
models of dialogue. The subject that exemplifies the use of Big Data is
alignment, the tendency for people in a conversation to conform to each other
in syntax, word choice, or other linguistic decision levels. The studies relating
to alignment are particularly interesting, as they relate low-level repetition effects
such as syntactic priming to high-level dialogue strategies and even up to the
social roles of those participating in dialogue. Big Data has been contributing
to our understanding of this process. I will bridge the range of work from
traditional controlled experimentation to a new analysis of a very large Internet
dataset, illustrating not only results, but also challenges associated with such
datasets.

Alignment
I focus on a set of well-known memory-related phenomena around alignment
(Bock, 1986; Pickering & Ferreira, 2008). This is an effect of gradual convergence
throughout dialogue. Alignment describes a range of related phenomena that cause
speakers to repeat themselves or others, to gradually adapt to someone else, or to
become more consistent in their own speech. This tendency affects not just the
words speakers use; it also affects their sentence structure and even some aspects
of semantics. Indeed, alignment1 is claimed to be based on adaptation effects at
several linguistic levels (Pickering and Garrod, 2004): Lexical priming, syntactic
(structural) priming, or priming at even higher, behavioral levels.

Function in Dialogue: Alignment as a Driver of Dialogue
The interactive alignment model (IAM) (Pickering and Garrod, 2004) suggests
that the convergence of situation models in dialogue is the result of an interactive
process. It is based on mechanistic repetition at a number of linguistic levels.
This mechanistic repetition, which forms the basic building blocks of alignment,
may be due to a known memory effect called priming. Pickering and Garrod (2004)
suggest that a cascade of priming effects at different levels is a driver of mutual
understanding and cooperation. Indeed, one can find priming or a priming-like
effect in everything from phonetic reductions (Bard et al., 2000; Jaeger, 2006) to
a joint interpretation of the dialogue situation (Garrod & Anderson, 1987). The
notion of alignment in dialogue also applies to the distributional similarity and con-
trast of verbal and non-verbal features (Paxton, Abney, Kello, & Dale, 2014), and
also the use of high-frequency words (Nenkova, Gravano, & Hirschberg, 2008).
Empirical evidence also underlines the function of alignment in discourse. Speakers
tailor the amount of their alignment to both the perceived needs of the interlocutor
(Branigan, Pickering, Pearson, McLean, & Brown, 2011) and dialogue goals
(Reitter et al., 2006b). The adaptation at several linguistic levels is predictive of task
success (Reitter & Moore, 2007; Fusaroli, et al., 2012; Reitter & Moore, 2014).
Alignment can be seen as a default approach to ensuring mutual understanding.
The converse would require speakers to model their interlocutors much more
explicitly, tracking what they understand and what information they agree on
(grounding, Brennan & Clark, 1996). However, there is only limited evidence that
alignment has a possible function as a signal of agreement between speakers, as it
seems to be mostly automatic. Danescu-Niculescu-Mizil and Lee (2011) examined
a corpus of movie scripts, finding alignment of function word use in dialogue
between movie characters. Even though the authors of these scripts did not receive
any potential social benefits of alignment, they still created aligned dialogue.

Syntactic Priming
Syntactic priming is one adaptation effect that contributes to alignment. Speakers
can choose among many different words and grammatical structures to express their
ideas.
However, when people can choose between several alternative grammatical
structures, their choice tends to be influenced by what has already been said
in the conversation. Speakers tend to repeat previously encountered grammatical
structures, a pattern of behavior that is referred to as syntactic (or structural)
priming.2
Syntactic priming is interesting in the context of the present discussion because
corpus-based studies of this effect have demonstrated that the Big Data inquiry
can reveal mental representation and processes in dialogue. It is an informative case
study in connecting experimental evidence to observational data.
Priming occurs when comprehending or producing language material alters
the likelihood of future linguistic choices. Syntactic or structural priming applies
this definition to syntactic choices, while semantic priming refers to the priming
of words. A descriptive measure of the magnitude of priming effects has been a
challenge to define. Szmrecsanyi (2005), Gries (2005), Jaeger and Snider (2007),
and Reitter (2008) either look at repetition counts or use logistic regression to
examine priming’s effect size or decay. These are not commensurable measures
that would be suitable to compare magnitudes across studies. Work is under way
to define a statistically useful, robust metric with a sensitive measuring apparatus
(Fusaroli, et al., 2012; Jones, Cotterill, Dewdney, Muir, & Joinson, 2014; Xu &
Reitter, 2015).
As a result of syntactic priming, speakers have a tendency to prefer one syntactic
construction over an available alternative shortly after having used this structure or
having heard an interlocutor use it (Bock, 1986).
Alignment can be used as a method to infer information about language
processing from data, if we agree on the following paradigm: Structures that do
not exist during language production do not prime. These structures refer to the actual
content of our memories: In this way, alignment can actually serve as a window
into the mind. This paradigm can be applied to more or less detailed process
models. However, the focus of the study I will present is to go beyond simple
short-term repetition effects. I use lexical and syntactic alignment to analyze
memory effects that are the precursor to permanent language change in the
individual, which in turn are precursors to language change within a cultural
community.
Syntactic priming effects are particularly interesting because they reflect implicit
decisions. Speakers do not consciously choose syntactic structure, as they might
with some words. The effect has been studied extensively in psychological
literature. In a now-classic study, Bock (1986) found that adults who listened to
and repeated a sentence in a passive form (The boy was kissed by the girl) were more
likely to describe an image about something completely different using the passive
form (The cat was chased by the dog) as opposed to the active form (The dog chased
the cat). Many other syntactic choices have been shown to exhibit priming as well.
Some others include prepositional objects (The painter showed his drawing to the
customer) versus direct objects (The painter showed the customer his drawing)
(Bock, 1986; Pickering & Branigan, 1998), complex noun phrases (Cleland &
Pickering, 2003), the order of main verbs and auxiliary verbs (Hartsuiker, Bernolet,
Schoonbaert, Speybroeck, & Vanderelst, 2008), and a range of other grammatical
structures in various languages. Not surprisingly, syntactic priming applies to both
dialogues and monologues. Nonetheless, some studies suggest that priming effects
are stronger in dialogue than in monologue (Cleland & Pickering, 2003; Kootstra,
van Hell, & Dijkstra, 2010).
The tendency to repeat a particular structure when it has been recently
encountered has also been observed in corpus studies of spoken language (Reitter,
2008; Szmrecsanyi, 2005; Travis, 2007) and Internet forum conversations (Wang
et al., 2014). Alignment applies to other levels of linguistic analysis, including
referring expressions such as pronouns (Brennan & Clark, 1996) and style
(Niederhoffer & Pennebaker, 2002).
Corpus studies have since provided evidence for syntactic priming outside of
carefully controlled laboratory settings. Speakers adapt in situated, realistic dialogue.
For instance, the Map Task corpus (Anderson, et al., 1991; McKelvie, 1998) shows
syntactic priming-like repetition effects (Reitter, Moore, & Keller, 2006c). The cited
corpus study also models priming as an effect that applies to syntactic rules in general,
rather than specific alternations such as those in the above examples.
Analyses of spoken language corpora generally showed that the probability of
repeating a structure decreases as the amount of time between the priming object
and the primed object increases (Gries, 2005; Reitter et al., 2006b; Szmrecsanyi,
2006). This would suggest priming effects decay over time. However, these earlier
corpus studies did not control for the characteristics of the language between prime
and target. To control for this possible confound, I modeled decay of repetition
probability as the variable that quantified priming (Reitter, 2008).
There are two clearly separate syntactic priming effects: (a) Fast, short-term
and short-lived priming, and (b) slow, long-term adaptation that is likely a result
of implicit learning (see Ferreira & Bock, 2006; Pickering & Ferreira, 2008).
Long-term adaptation is a learning effect that can persist, at least, over several days
(Bock, Dell, Chang, & Onishi, 2007; Kaschak, Kutta, & Schatschneider, 2011).
Recent work has proposed models that explain the mechanisms of these effects
within the context of language acquisition (Bock & Griffin, 2000; Chang, Dell, &
Bock, 2006; Kaschak, Kutta, & Jones, 2011) and general memory retrieval (Reitter
et al., 2011). These studies suggest that priming is the precursor to persistent
language change.

Characteristics of Syntactic Priming
Priming means that the use of a syntactic construction increases the probability of
its future occurrence overall, but initially, the probability decays rapidly (Branigan,
[Plot of repetition probability, p(prime = target | target, distance), against the
temporal distance between prime and target (roughly 2–14 seconds), with separate
curves for Switchboard PP, Switchboard CP, Map Task PP, and Map Task CP.]
FIGURE 11.1 Logistic regression models of syntactic repetition fitted to data from
two corpora: Map Task (task-oriented dialogue) and Switchboard (spontaneous
conversation by phone). CP: Priming between speakers, PP: Self-priming.

Pickering, & Cleland, 1999). This decay has been shown to follow a logarithmic
function of time or linguistic activity (Reitter et al., 2006b). Since this decay is
a side effect of priming, we can use it to quantify the strength of the priming:
A stronger priming effect will decay more quickly. This way, we can distinguish
it from other potential sources of increased repetition, such as text genre or the
clustering of topics. Figures 11.1 and 11.4 illustrate the decay effect using different
methods and datasets. The repetition rate of linguistic material can be modeled as
a function of either time (Figure 11.1) or linguistic activity (Figure 11.4). Indeed,
decay can be observed at much larger time-scales than the one found in spoken
dialogue, which has been observed in Internet forum conversations (Figure 11.4,
Wang et al., 2014).
Syntactic priming is also cumulative (Hartsuiker & Westenberg, 2000; Jaeger
& Snider, 2008; Kaschak, Loney, & Borregine, 2006). While cumulativity has
not been taken into account by previous corpus analyses, it is included in the
proposed effort. Models that account for cumulativity and decay in a cognitively
plausible manner will make more precise predictions about structure use, processing
principles, and parameters defining alignment strength.

Priming is Evident in Corpus Data
Previous work showed that syntactic priming and several related interactions can be
observed in corpora of spoken dialogue (Gries, 2005; Reitter et al., 2006b; Reitter,
observed in corpora of spoken dialogue (Gries, 2005; Reitter et al., 2006b; Reitter,
2008; Szmrecsanyi, 2006). In a series of experiments, we developed logistic
linear mixed-effects regression models (logistic GLMMs) that predict repetition
probability as a function of the prime-target distance, which represents the decay of
short-term effects, and prior exposure, which represents the effect of longer-term
learning (Reitter et al., 2006b; Reitter & Moore, 2014). We found priming
resulted from both comprehending and producing sentences. Interestingly, we
find not only syntactic priming, but also convergence of syntactic complexity (Xu
& Reitter, 2016) and general communicative (pragmatic) intent (Wang, Yen, &
Reitter, 2015).
Note that GLMMs have become the statistical model of choice to describe
corpus data. GLMMs are models that predict variables as functions of discrete
factors, continuous predictors, their interactions, and additional random variables.
The purpose of these additional random variables is to control for repeated items
or several data-points from a single speaker. They remain, however, linear with the
goal of fitting response variables that can be described as the sum of the covariates
and their interactions, in contrast with non-statistical process models formulated in
connectionist paradigms or cognitive frameworks such as ACT-R.
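A stripped-down version of such a model can be fit with off-the-shelf tools. The
Python sketch below uses a plain logistic regression as an approximation; the published
analyses additionally include random intercepts for speakers and items (i.e. full
GLMMs, typically fit with lme4 in R), and the column names here are assumptions about
how the repetition opportunities might be tabulated:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Assumed data frame: one row per prime-target opportunity, with
#   repeated  -- 1 if the syntactic rule recurs in the target, else 0
#   distance  -- prime-target distance (seconds or intervening utterances)
#   same_spkr -- 1 for self-priming (PP), 0 for priming between speakers (CP)
df = pd.read_csv("priming_opportunities.csv")

# A log-transformed distance term captures rapid early decay with a long tail.
model = smf.logit(
    "repeated ~ np.log(distance) + same_spkr + np.log(distance):same_spkr",
    data=df).fit()
print(model.summary())   # a negative log-distance coefficient indicates decay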
The work shows that memory effects can be studied in large, observational
datasets and that Big Data methodology can go beyond the replication of known
effects. For instance, we have modeled syntactic and lexical priming to predict
the successful outcome of conversations (Reitter & Moore, 2014). This work also
shows that linear mixed-effects regression can be used to contrast syntactic priming
in different conditions. The regression models approximate the decay described by
cognitive psychology (Anderson, 1993; Anderson & Lebiere, 1998).

How Mechanistic is the Effect?
The observation that syntactic priming can differ in strength brings into question
how mechanistic the adaptation effect actually is. Do people have some form of
how mechanistic the adaptation effect actually is. Do people have some form of
control over how much they adapt? Is it strategic as opposed to automatic?
Both copying and contrasting syntactic structure are common rhetorical
devices. The sentence Guess what, I went to the Waffle Shop for breakfast this morning!
could be answered with Guess what, I went to the gym for a workout to demonstrate
a contrast. By comparison, Well, I did a workout at the gym today does not convey
quite the same pragmatic implicature. How much control do speakers exercise over
sentence structure? Is this what controls interactive alignment in dialogue?
A range of modulation effects could address this set of questions. For one, under
an account of alignment-for-a-purpose we would expect increased adaptation
when needed, such as in task-oriented dialogue. There could also be social
modulation of adaptation, such as by social status. There could be modulation
according to a speaker’s role in the conversation, and there could be implications
for adaptation by and among people with social communication disorders and/or
autism spectrum disorders.
Indeed, we have observed an increased amount of syntactic priming in
task-oriented dialogue compared to spontaneous conversation (Reitter & Moore,
2007). A follow-up cognitive model of syntactic priming (Reitter et al., 2011)
may have a mechanistic explanation for the differences we observed between
task-oriented dialogue and spontaneous conversation. According to the model,
working memory serves as cues to the retrieval of syntactic material. That means
that attention to concepts discussed in the conversation plays a role in making
syntactic material available. By learning associations of concepts with syntactic
decisions, remaining within a specific topic would yield stronger priming effects,
because the semantics and associations to syntactic construction are still available,
while switching topics would reduce priming effects. This does not disallow any
control over adaptation, but it argues for mechanistic adaptation as the default in
conversation. However, more empirical work is clearly needed to back up the
account.

Examining Social Modulation of Alignment
Interactive alignment suggests an alternative theory to the deep cognitive processes
suggested by explicit grounding (Brennan & Clark, 1996). The new theory
assumed a cascade of simple, mechanistic priming effects at all linguistic levels
that led to a shared language (Pickering and Garrod, 2004). At least at the syntactic
level, the priming effects can be explained in terms of general cue-based memory
retrieval and decay (Reitter et al., 2011). The model is formulated within an
independently validated cognitive architecture, ACT-R (Anderson, 2007), and
it occupies a middle ground between psychological theories that argue syntactic
priming is purely the result of either implicit learning or residual activation. To
summarize, a complex, explicit thought process was largely replaced by a fast,
intuitive heuristic default. This parallels a general trend in behavioral science. This
heuristic may not always produce the normatively correct answer, but it seems to
generally work well enough.
What this theoretical commitment does not mean is that speakers lack control
over their choices of words and sentence structure. It does not mean that alignment
cannot be modulated under any circumstance. But there are explanations of this
modulation that arise out of the mechanistic adaptation effect. For instance,
I explain stronger priming in task-oriented dialogue as the result of increased
persistence in working memory, leading to more associative activation of syntactic
constructions. Danescu-Niculescu-Mizil and Lee (2011) examined a corpus of
movie scripts, finding alignment of function word use in dialogue between movie
characters. Even though the authors of these scripts did not receive any potential
“social benefits” of alignment, they still created linguistically coordinated dialogue.
According to the authors of the paper, this means that alignment has been
engrained in communicative patterns and is removed from its functional role. They
even compare linguistic alignment to what was called the Chameleon Effect, namely
the “nonconscious mimicry of the postures, mannerisms, facial expressions, and
other behaviors of one’s interaction partners” (Chartrand & Bargh, 1999: 893).
Modulation of alignment levels may be linked to affect: del Prado Martin and Bois
(2015) find that positive attitudes are linked to more alignment. It is possible that
attitudes are linked to engagement and attention, which affects alignment.
This brings up the question of just how alignment really depends on
attention. Do we align primarily with attended-to speakers? This would support a
more mechanistic theory of alignment rooted in basic cognitive processes. On the
other hand, it may be the case that alignment is an audience-design effect, which
causes us to align strategically with those speakers we address as opposed to those
speakers to whose language we were most recently exposed.
Branigan, Pickering, McLean, and Cleland (2007) studied the latter hypothesis
using lab-based, staged interactions among experimental participants. The results
were mixed. While there was some alignment to speakers that weren’t addressed,
the effect was smaller. This leaves an attention-based explanation as well as social
modulation on the table. Observational studies with large datasets allow us to look
beyond primary effects of repetition. They cannot replace studies that establish
causality, but with enough high-quality data, we can examine more fine-grained
interactions that, in this case, would reveal the effects of social modulation on
decay. Such observations would have consequences for the architecture of both
language production and language acquisition.
The proposed memory-based explanation for alignment suggests that alignment
relies on the same mechanisms as language learning. This has the empirically
verifiable consequence that alignment is a precursor to permanent language
change in individuals and among members of a cultural group. We can consider
this possibility in the context of two scientific realms: Psycholinguistics and
information science. From a psycholinguistic perspective, we seek confirmation
of the mechanisms of language production. A sample of pertinent questions:
Is it attention or intention that modulates alignment? Is memory the driver of
alignment, and if so, which kind of memory: Declarative or procedural? Answering
these psycholinguistic questions faces challenges, as Big Data in the form of online
forums comes with both variability and confounds. Variability occurs due to a
lack of control over which messages an author actually read before writing a
reply; however, we can counter such error with more data. Confounds occur,
for instance, because the selection of words is inextricably linked to topics, which
shift systematically throughout dialogue. However, we can avoid this confound by
measuring alignment on other linguistic decisions, such as sentence structure.
From the perspective of information science, I am interested in how choices
of words and topics propagate through a community. How does discourse change
as a consequence of an individual’s contributions? How does a speaker’s role (as
initiator, moderator, or information provider) determine his or her influence in
the larger-scale process of language change?

Data and Methods
We make use of data from two web forums that allow us to study the case
of asynchronous, written dialogue. The first is the Cancer Survivors
Network (CSN) web forum. This forum represents about 10 years of online
conversation (5 GB) in a community of cancer patients and survivors, which
focuses on providing peer support on informational and emotional levels. The
second is the Reddit web forum, which is one of the largest discussion platforms
on the Internet, consisting of 2 TB of uncompressed text by early 2015. Reddit
comprises years of discussions, each started by a single question or submission.
The CSN conversations, or threads, are treated as a flat sequence of replies. The
user interface of the web forum made it easier to reply to a post with a message
for the whole thread than it did to reply to a specific message on that thread. By
contrast, the Reddit conversations are hierarchical, giving us an additional criterion
to identify structure in language change. We first published our analysis of the CSN
dataset in Wang et al. (2014), and I owe gratitude to Yafei Wang for preprocessing
the dataset, and to her and John Yen for our collaboration that led to the initial
methodology published in Wang et al.
Both datasets are, in principle, publicly available through scraping the respective
websites. However, the CSN dataset was obtained through an agreement with
the American Cancer Society. The Reddit dataset was curated by a Reddit
user and augmented through use of the Reddit interfaces. We use a distributed
NoSQL database, which provides high performance in exchange for computational
limitations.
In the studies presented in this chapter, I will sample from these datasets to
infer statistical models rather than use the complete data for two reasons. The first
is computational convenience (I admit); the second is predictive validity: Novel
hypotheses should be tested on a fresh dataset to prevent data-fishing and the
associated risk of non-replication.
In both online forums, threads are structured into an initial submission, followed
by a tree hierarchy of replies. Each post, a synonym for a written message on a
forum, replies to exactly one other post. The submission can be thought of as the
first post, although it can be empty and only contain a picture or hyperlink. In
the case of CSN, the subsequent posts after the submission are flat, rather than
hierarchical, due to the presentation of the messages on the website. Therefore,
we can only utilize the dependency information in Reddit. Posts are written
by authors using pseudonyms, so the authors can be identified throughout the
discourse.
The two corpora were processed using natural-language processing tools:
They were parsed with the Stanford CoreNLP PCFG parser (Manning et al.,
2014); syntax trees for each sentence were converted into sets of rules such as
NP→DET N. Then, measures for syntactic (SILLA) and lexical repetition (LILLA)
(Fusaroli et al., 2012) were calculated over pairs of posts:
\[
\mathrm{S/LILLA}(\mathit{target},\mathit{origin}) =
  \frac{\sum_{\mathit{word}_i \in \mathit{target}} \delta(\mathit{word}_i)}
       {\mathrm{len}(\mathit{origin})\,\mathrm{len}(\mathit{target})},
\qquad
\delta(\mathit{word}_i) =
  \begin{cases}
    1 & \text{if } \mathit{word}_i \in \mathit{origin}\\
    0 & \text{otherwise}
  \end{cases}
\]
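In code, the measure amounts to counting how many units of the target also occur in
the origin, normalized by the lengths of both posts. A minimal sketch, assuming that
posts are given as lists of word tokens (for LILLA) or of extracted phrase-structure
rules such as NP→DET N (for SILLA):

def illa(target_units, origin_units):
    """Indiscriminate local linguistic alignment between two posts.

    Pass word tokens for LILLA or syntactic rules for SILLA.
    Returns 0 for empty posts.
    """
    if not target_units or not origin_units:
        return 0.0
    origin_set = set(origin_units)
    repeated = sum(1 for unit in target_units if unit in origin_set)
    return repeated / (len(origin_units) * len(target_units))

# Example with word tokens (LILLA):
# illa("the scan came back clear".split(), "so glad the scan was clear".split())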
The Reddit dataset reflects a sample of 3 million origin-target pairs which
stem from 2,200 different discussion threads that occurred in 2014 and 2015. No
outliers were removed, and messages were not selected to come from specific,
potentially more interactive subreddits. The CSN data are a sample of 3,000
conversation threads containing 23,045 posts. The SILLA/LILLA measures tend
to have favorable distributional properties (Xu & Reitter, 2015) that allow for
parametric inference.

Research Questions
To determine social influence on alignment, we ask whether messages of different
status in a thread can be more or less linguistically influential. We determine the
amount of repetition of words and of syntactic constructions for pairs of messages,
each consisting of an origin and a target message. The earlier “origin” message can
take on one of several roles: the first reply to the topic (F), a message by the
initial author (I), a self-reply (S), or any other message (A).
Under the hypothesis of social modulation of the memory effect, we would
expect differences in strength of adaptation regardless of the distance. However, we
would also expect differences in decay. Specifically, we would expect to see more
adaptation, and thus more decay, in important origin messages or those origin posts
authored by someone deemed important. For the purposes of this study, we assume
that the initiator of the conversation is important, as is his or her first message, as
well as the first reply to the initiator. Under the alternative hypothesis of a purely
mechanistic effect, we would see no difference in decay in this scenario or possibly
even the opposite relationship.
In order to examine decay of alignment, we first need a system for measuring
the distance between the origin and target. One metric for measuring decay in
forums is the reply distance. For example, if post PA replies to post PB, which
in turn is a reply to post PC, then we would say the reply distance from PA to PC
is 2. As an alternative measure of distance, we can use the actual time that the post
was written. This can be useful because information in a relatively uncontrolled
conversation becomes stale and loses influence as time passes by. Notably, there is
no single, correct measure of distance when it comes to the analysis of linguistic
expression in a study of adaptation. As is typical for a study of Internet-mediated
dialogue, we do not have information about whether and when exactly an author
of a target message has read the origin message. Likewise, we have no information
about how closely she or he has paid attention to that message or any intervening
material. Distance in replies and in time is simply a proxy for how much material
has intervened, and how much time has passed between consuming the messages.
The lack of control and the inaccuracy of the measurement proxy is counteracted
by the sheer amount of data.
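The sketch below shows how reply distance, time distance, and the role of the origin
message can be derived from a thread's reply structure. The post attributes (parent,
author, timestamp) and the role-assignment order are assumptions about how the scraped
data might be organized, not a description of the actual processing pipeline:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Post:
    post_id: str
    parent_id: Optional[str]   # None for the thread's initial submission
    author: str
    timestamp: float           # posting time in seconds since the epoch

def reply_distance(posts, origin_id, target_id):
    """Reply links from the target up to the origin (None if not on the path)."""
    by_id = {p.post_id: p for p in posts}
    steps, current = 0, target_id
    while current is not None:
        if current == origin_id:
            return steps
        current = by_id[current].parent_id
        steps += 1
    return None

def time_distance(origin, target):
    """Seconds elapsed between writing the origin and the target post."""
    return target.timestamp - origin.timestamp

def origin_role(posts, origin, target):
    """Label the origin: I(nitial author), F(irst reply), S(elf-reply), A(ny other)."""
    submission = min(posts, key=lambda p: p.timestamp)
    replies = sorted((p for p in posts if p.parent_id is not None),
                     key=lambda p: p.timestamp)
    if origin.author == submission.author:
        return "I"
    if replies and origin.post_id == replies[0].post_id:
        return "F"
    if origin.author == target.author:
        return "S"
    return "A"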

Results and Discussion
In the Reddit dataset, we observe some differences in the decay of syntactic or
lexical alignment (Figure 11.2) when comparing different types of sources for the
origin of each origin-target pair. Messages show the most lasting alignment with
the initial post, while experiencing stronger decay when aligning with just any
given parent3 post. Importantly, lexical and syntactic alignment with the thread
initiator is consistently lower and shows no characteristic decay in the case of lexical
alignment. This is incompatible with the hypothesis of alignment as audience
design: Messages do not align more with the messages of the person who initiated
the thread, who should be the most socially important person.
In a secondary analysis (Figure 11.3), we show decay over time rather than
as a function of intervening messages. Early decay is observed for all classes of

[Two panels: lexical alignment (log-LILLA) and syntactic alignment (log-SILLA)
plotted against post distance between prime and target post, with separate curves
for the initial post, any post by the initial author, and the parent post.]
FIGURE 11.2 Lexical and syntactic alignment in Reddit as a function of origin-target
distance measured in intervening replies (+1). Repetition between the first (F) post
of a thread and any other, a parent (P) origin and a target post, and those posts by
the thread initiator (I) and any target post. Shaded areas indicate approx. 95 percent
confidence intervals assuming Gaussian errors around LOESS regression (as in other
graphs).
[Two panels: lexical alignment (LILLA) and syntactic alignment (SILLA) plotted
against temporal distance between prime and target post (in seconds, up to about
60,000), with separate curves for the initial post, any post by the initial author,
and the parent post.]
FIGURE 11.3 Lexical and syntactic alignment in Reddit as a function of origin-target
distance measured in time of posting. Repetition between the first (F) post of a thread
and any other, a parent (P) origin and a target post, and those posts by the thread
initiator (I) and any target post.

origin-target pairs, including those where the origin was written by the thread
initiator. For longer time spans between origins and targets beyond about 30,000
seconds, repetitions may actually increase with increasing distance to the origin
message, which is not alignment. Only the syntax analysis shows a mild repetition
increase, which is then followed by a strong decay. Overall, these data do not
seem to support an audience-design hypothesis. However, I would caution that
the time between writing origin and target posts is a measure that confounds
an individual author’s memory with a form of externalized, networked memory.
This can be appropriate from the perspective of an ecological, high-level model of
multi-party dialogue, but is more problematic when interpreting these data from
the psychologist’s perspective.
The CSN dataset, in an analysis first discussed elsewhere (Wang et al., 2014), tells
a similar story. Lexical alignment in CSN decreases with post distance.
Alignment with posts written by the thread initiator is lower initially and also
decays less (Figure 11.4). For syntactic priming, we even observe an increase in
alignment over time, speaking against the audience design hypothesis.
The evidence we find, overall, confirms the results by Branigan et al. (2007)
for the cases of both naturalistic multi-party dialogue and lexical alignment. Any
memory mechanisms underlying alignment seem to have little sensitivity to the
role of the source (origin message), as decay is not greater for such roles. Further,
absolute repetitions are initially lower for origin messages by the thread initiator
than for other origin messages. The observed differences in lexical similarity can be
interpreted as the result of the pragmatic consequences of addressing one another’s
[Two panels: LILLA(word ∈ target | word ∈ prime post) and SILLA(rule ∈ target |
rule ∈ prime post) plotted against post distance between prime and target post
(0–100), with separate curves for the initial post, any post by the initial author,
and any post.]
FIGURE 11.4 Lexical and syntactic alignment in the Cancer Survivors Network. Data
and analysis from Wang et al. (2014).

messages throughout the conversation, rather than as a sign of a lower-level
mechanistic process.
The Cancer Survivors Network and Reddit are very different communities: In
CSN, members aim to provide emotional and social support. Reddit is a forum
where a writer does not necessarily address another individual. Replies respond to
specific questions and comments, but we cannot assume that the author of a reply
has a specific addressee in mind. Using natural-language processing and perhaps the
analysis of pronouns, we may be able to infer better information about addressees
in the future. This caveat implies that one needs to analyze more than one corpus
to draw conclusions about audience design, for we cannot always determine who
a writer’s audience actually is.
With the parallel presentation of lexical and syntactic adaptation data, I would
like to draw attention to some problems with the use of multi-level alignment in
corpus data for psycholinguistic modeling. While syntactic priming lends itself to
corpus study, lexical adaptation may not be due to a priming effect in the process
of lexical choice. Lexical repetition has much to do with the topic structure of
text: Lexical choices are of course, in part, a consequence of shifting topic clusters.
Topic shifting similarly interacts with information distribution in discourse (Qian
& Jaeger, 2011). Short of modeling topic flow in an attempt to subtract topic
effects from the observed lexical alignment, we have to rely on additional syntactic
analyses to draw conclusions about language production.
Experimental control for lexical repetition and topic clustering is possible to an extent. And indeed, when such controls are applied, syntactic priming effects vanish or are overshadowed by negative priming effects whereby speakers avoid syntactic parallels (Healey, Purver, & Howes, 2014). These results, on non-task-oriented, spontaneous conversation, first underline the need to verify lab-derived findings against naturalistic data in order to get a picture of how ubiquitous a reported effect
really is. Despite the observations I presented here as an argument against audience
design, it is clear that alignment is still sensitive to modulation as a memory-based
effect. High syntactic rule frequency reduces priming (Scheepers, 2003; Snider &
Jaeger, 2009), and semantic repetition or relatedness (as in topic chains) is predicted
to boost syntactic priming (Reitter et al., 2011).

Questions and Challenges for Data-Intensive


Computational Psycholinguistics
As I have shown, alignment is a phenomenon that can be examined using
naturalistic language. I use datasets that range from small, annotated corpora, such as MapTask, to huge web forums, such as Reddit. Oftentimes, the size of the dataset comes with a tradeoff regarding annotations. Small datasets offer deeper, more precise, hand-corrected annotations (e.g. Zeldes, 2016), while Big Data in the range of hundreds of megabytes is far too large for such annotations.
In that case, interesting variables such as inter-speaker relationships have to be
inferred from network structure or linguistic phenomena.
Existing datasets present an attractive opportunity for psycholinguistics, and
cognitive science as a whole. Small-scale data with reliable secondary information,
such as the Dundee eye-tracking corpus (Kennedy & Pynte, 2005) or the Schilling
corpus (Schilling, Rayner, & Chumbley, 1998) may give insights into syntax
processing (Demberg & Keller, 2008; Bicknell & Levy, 2010). Such data may
be used to evaluate and refine even complex models of reading (Reichle, Rayner,
& Pollatsek, 2003). Larger-scale, naturalistic, and relatively unannotated datasets
similarly play a role, as in work on memorability (Danescu-Niculescu-Mizil,
Cheng, Kleinberg, & Lee, 2012) or on alignment (Danescu-Niculescu-Mizil and
Lee, 2011) using movie dialogue transcripts.
I will suggest three components of a vision for psycholinguistics that uses
datasets to examine language processing and communication. Just like this chapter
as a whole, they reach from internal representations and micro-processes to
high-level models of text and dialogue.

Adaptation as a Window into the Mind


Syntactic priming has been of great interest to psycholinguists. Its temporal
decay and its associations with other linguistic representations give clues about
the memory representations involved in language production and comprehension.
Going beyond, syntactic priming and indeed alignment at multiple levels can
be useful in identifying concrete processes and mental representations. To give
an example: Suppose we hypothesize a structure X that is part of a language’s
grammar, and that is representative of a cognitive process involved in speaking that
language. The basic principle we follow is that a speaker will adjust his or her uses of X after hearing another speaker use X. However, for this adjustment of the speaker’s language model to happen, structure X must actually be cognitively present. Otherwise, there would be no memory item to reinforce. That means that by identifying that speakers adapt upon hearing X, we find evidence in favor of X as a cognitive artifact. As long as we can cheaply determine, on a large
dataset, where that structure applies, we can measure sensitivity to its use and
thereby detect adaptation. As an example, we have done that in a small study
with competing classes of representations (Reitter, Hockenmaier, & Keller, 2006a).
The structures we looked at described either fully incremental or non-incremental
syntactic processing. (The question here is whether new words and phrases are
immediately adjoined to the semantics or syntactic type of the existing sentence, or
if they are buffered in some form of working memory and combined out of order.)
By looking at adaptation in a relatively small corpus, we found some hints that
incrementality is actually flexible—although more work is necessary to robustly
model incrementality on more data. Much of this work can be done cheaply on
unannotated data once we have the computational means to induce grammar from
data (cf. Bod, 1992) based on weak adaptation effects.
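As a schematic illustration of this logic (not an analysis from the studies cited above), the sketch below checks, on simulated data, whether a hypothesized structure X is more likely in a target after it has appeared in the prime; in practice one would also control for baseline frequency, topic, and speaker, for example with mixed-effects models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: for each prime-target pair, whether a candidate
# structure X occurs in the prime and whether it occurs in the target.
x_in_prime = rng.binomial(1, 0.3, size=10000)
# Simulate a small adaptation effect: X is slightly more likely in the
# target when it has just been encountered in the prime.
x_in_target = rng.binomial(1, np.where(x_in_prime == 1, 0.12, 0.10))

# Compare P(X in target | X in prime) with P(X in target | X not in prime).
p_primed   = x_in_target[x_in_prime == 1].mean()
p_unprimed = x_in_target[x_in_prime == 0].mean()
odds_ratio = (p_primed / (1 - p_primed)) / (p_unprimed / (1 - p_unprimed))

# An odds ratio reliably above 1 is the signature of adaptation: exposure to
# X raises the probability of producing X, which presupposes that X is
# cognitively represented in the first place.
print(p_primed, p_unprimed, odds_ratio)
```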

Evaluation of Integrated Models of Language Acquisition


Large-scale datasets rarely come annotated with interpreted linguistic knowledge.
However, using predefined deep linguistic knowledge should not be more than
a temporary goal anyway. After all, we model how individuals can learn to
process surface form into semantics. Syntactic structure, for example, is transient. It
reflects a cognitive process rather than permanent mental representations. Which
representations are learned from the data is the consequence of computational
constraints for language processing, prior knowledge, and general-purpose learning
algorithms. Inference from raw language data (in multiple languages) without prior
constraints may well be possible. However, it is more likely that the integration
of language and general cognitive architectures will provide useful priors. I see
two complementary approaches. The first approach here is to integrate what
cognitive psychology teaches us, in architecture as well as quantitatively, about
memory and processing. ACT-R (Anderson, 2007) defines an independently
validated set of principles that codify that knowledge computationally. In
Reitter et al. (2011), I proposed a model of syntactic priming in ACT-R,
and we are now extending its coverage to corpora. The second approach
to integrating general cognitive architectures is to assume basic constraints of
learnability. Do the data that a learner is exposed to hold enough information
to acquire the hypothesized structures and processes when combined with a
general-purpose learning mechanism? In this context, we should consider con-
nectionist approaches, which have seen a remarkable resurgence, mostly prompted
by Hinton, Osindero, and Teh’s (2006) discovery of methods that facilitate
the learning of multi-layered (“deep”) networks, which lead to plausible real-world


performance in machine-learning tasks. The underlying artificial neural networks
are only loosely inspired by biological neural networks. However, at a higher
level, some types of networks may serve as models of learning. Frameworks
with the potential to enable not just language processing (Manning, 2015) but
general artificial intelligence may be integrated into psycholinguistic models. This
approach involves online, semi-supervised learning mechanisms (e.g. Ororbia II,
Reitter, & Giles, 2015). These systems will acquire structural representations that
allow us to make context-dependent processing decisions. They do so rapidly
with just a few annotated and many unannotated examples. If complex syntactic
and lexical representations can be learned from unannotated data, we may have
a computational answer to the poverty of the stimulus argument (Chomsky, 1965;
Pullum & Scholz, 2002), which suggests that children need to have a substantive
language acquisition device (e.g. Pinker, 1991).

Information-Theoretic Models of Text and Dialogue


The mention of learnability brings us to the question of computational and
psychological plausibility of processing: Which algorithms can recognize words,
parse sentences, and interpret meaning given the information available in text or
dialogue (cf. Lewis, Vasishth, & Van Dyke, 2006)? Computational psycholinguists
have been particularly interested in the question of predictability, as it gives
insights into probabilistic learning of, for example, word meanings or syntactic
choices. For instance, expectations that are guided by past experience can
facilitate or burden online processing (e.g. Hale, 2003; Levy, 2008; Smith &
Levy, 2013). These accounts may be seen as agnostic to the concrete algorithms
that produce or comprehend language. Yet they do represent a higher-level
description of informativity. Information content, or conversely, entropy, varies
systematically throughout small and large units of text (Genzel & Charniak, 2003).
It has been hypothesized that speakers striving to distribute information evenly
among text are doing so in order to optimize the use of cognitive resources
(Jaeger, 2010). To date, dialogue as a text genre has been under-studied with
respect to entropy distribution, even though a model of entropy in dialogue
may answer a key question: Which dialogue partner contributes information,
and why?
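As a toy illustration of this kind of measurement, the sketch below computes per-word surprisal from an add-one-smoothed bigram model and averages it per sentence; the work cited above relies on much richer language models, so this is only a schematic stand-in.

```python
import math
from collections import Counter

def surprisal_per_sentence(sentences, vocab_size=None):
    """Average per-word surprisal (in bits) for each sentence, using a
    bigram model with add-one smoothing estimated on the same corpus.
    Assumes non-empty, tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    V = vocab_size or len(unigrams)

    results = []
    for sent in sentences:
        tokens = ["<s>"] + sent
        bits = [-math.log2((bigrams[(w1, w2)] + 1) / (unigrams[w1] + V))
                for w1, w2 in zip(tokens, tokens[1:])]
        results.append(sum(bits) / len(bits))
    return results

# Toy usage: average information content per sentence in a tiny corpus.
corpus = [["the", "dog", "barked"], ["the", "dog", "barked", "loudly"],
          ["a", "cat", "slept"]]
print(surprisal_per_sentence(corpus))
```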

Conclusion
In this chapter, I hoped to demonstrate a Big Data approach to cognitive
science that observes linguistic performance in the wild. The minimally controlled
environment comes with obvious benefits and with some challenges. The benefits
lie in the broad coverage of syntactic constructions, conversational styles, and
communities. With the analysis of dialogue corpora such as Switchboard, Maptask,
and Reddit, we were able to not only show that alignment effects in real-world
data were smaller than observed in the lab, but that they also varied in theoretically
relevant ways: For example, with task success (Reitter & Moore, 2014), but not
necessarily with the intended audience.
The challenges of the Big Data approach, however, also illustrate where a
carefully constructed experiment can produce more informative conclusions. The
correlation between lexical and syntactic levels is an example of this problem.
Work with large datasets in general comes with an inherent challenge: They
are observational. While we can observe correlations, causal inference is much
more difficult and requires more information, such as temporal relationships
(what happens later cannot have caused what happened earlier). However, direct
causal relationships without latent variables cannot be inferred. Further, with a
large dataset, we can usually find some correlations that are deemed significant,
numerically. As Adam Kilgarriff put it: “Language is never, ever random” (Kilgarriff, 2005). We should not rely on a single dataset, or at least not on a single sample of one, to draw good conclusions. Most importantly, the opportunity to
observe seemingly reliable correlations between variables emphasizes our obligation
to always begin with a theoretical framework and clear hypotheses. For with
hindsight, many models can explain observational data.

Acknowledgments
This research was supported by the National Science Foundation (IIS 1459300
and BCS 1457992). The analyses on the CSN dataset were made possible by a
collaboration agreement between Penn State and the American Cancer Society. I
would like to thank Yafei Wang, Yang Xu, and John Yen for their discussions and
work on the CSN dataset, Kenneth Portier and Greta E. Greer for creating and
making available the CSN dataset, Lillian Lee for her comments on the Reddit
dataset, as well as Jason M. Baumgartner (Pushshift.io) for curating the Reddit
dataset, and Jeremy Cole for advice in preparing this chapter.

Notes
1 Alignment in this sense is distinct from alignment between texts or sentences
as used in machine translation. It is also different from alignment as used in
constituent substitutability, as by Van Zaanen (2000).
2 For a review, see Pickering and Ferreira (2008a); the terms syntactic priming
and structural priming are used more or less interchangeably in the literature.
3 A parent is defined as the message to which the target message responds, or as the parent of any such parent message.
References
Anderson, A. A., Bader, M., Bard, E., Boyle, E., Doherty, G. M., Garrod, S., . . . Weinert,
R. (1991). The HCRC Map Task corpus. Language and Speech, 34(4), 351–366.
Anderson, J. R. (1993). Rules of the mind. Hillsdale, NJ: Erlbaum.
Anderson, J. R. (2007). How can the human mind occur in the physical universe? Oxford, UK:
Oxford University Press.
Anderson, J. R., & Lebiere, C. (1998). The atomic components of thought. Mahwah, NJ:
Erlbaum.
Bard, E. G., Anderson, A. H., Sotillo, C., Aylett, M., Doherty-Sneddon, G., & Newlands,
A. (2000). Controlling the intelligibility of referring expressions in dialogue. Journal of
Memory and Language, 42, 1–22.
Bicknell, K., & Levy, R. (2010). A rational model of eye movement control in reading.
In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp.
1168–1178). ACL ’10. Uppsala, Sweden: Association for Computational Linguistics.
Bock, J. K. (1986). Syntactic persistence in language production. Cognitive Psychology, 18,
355–387.
Bock, J. K., & Griffin, Z. M. (2000). The persistence of structural priming: Transient
activation or implicit learning? Journal of Experimental Psychology: General, 129(2), 177.
Bock, J. K., Dell, G. S., Chang, F., & Onishi, K. H. (2007). Persistent structural priming
from language comprehension to language production. Cognition, 104(3), 437–458.
Bod, R. (1992). A computational model of language performance: Data oriented parsing.
In Proceedings of the 14th Conference on Computational Linguistics—volume 3 (pp. 855–859).
Association for Computational Linguistics.
Branigan, H. P., Pickering, M. J., & Cleland, A. A. (1999). Syntactic priming in language
production: Evidence for rapid decay. Psychonomic Bulletin and Review, 6(4), 635–640.
Branigan, H. P., Pickering, M. J., McLean, J. F., & Cleland, A. A. (2007).
Syntactic alignment and participant role in dialogue. Cognition, 104(2), 163–197. doi:
http://dx.doi.org/10.1016/j.cognition.2006.05.006.
Branigan, H. P., Pickering, M. J., Pearson, J., McLean, J. F., & Brown, A. (2011). The
role of beliefs in lexical alignment: Evidence from dialogs with humans and computers.
Cognition, 121(1), 41–57. doi: 10.1016/j.cognition.2011.05.011.
Brennan, S. E., & Clark, H. H. (1996). Conceptual pacts and lexical choice in conversation.
Journal of Experimental Psychology: Learning, Memory, and Cognition, 22(6), 1482.
Chang, F., Dell, G. S., & Bock, J. K. (2006). Becoming syntactic. Psychological Review,
113(2), 234–272.
Chapelle, O., Schölkopf, B., Zien, A. (Eds.). (2006). Semi-supervised learning: Adaptive
computation and machine learning. Cambridge, MA: MIT Press.
Chartrand, T. L., & Bargh, J. A. (1999). The chameleon effect: The perception-behavior
link and social interaction. Journal of Personality and Social Psychology, 76(6), 893.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.
Chomsky, N., & Miller, G. A. (1963). Introduction to the formal analysis of natural
languages. In R. D. Luce, R. R. Bush, & E. Galanter (Eds), Handbook of mathematical
psychology (Vol. 2, pp. 269–321). New York, NY: Wiley.
Cleland, A. A., & Pickering, M. J. (2003). The use of lexical and syntactic information
in language production: Evidence from the priming of noun-phrase structure. Journal of
Memory and Language, 49, 214–230.
Danescu-Niculescu-Mizil, C., & Lee, L. (2011). Chameleons in imagined conversations: A
new approach to understanding coordination of linguistic style in dialogs. In Proceedings
of the 2nd Workshop on Cognitive Modeling and Computational Linguistics (pp. 76–87).
Association for Computational Linguistics.
Danescu-Niculescu-Mizil, C., Cheng, J., Kleinberg, J., & Lee, L. (2012). You had me at
hello: How phrasing affects memorability. In Proceedings of the 50th Annual Meeting of the
Association for Computational Linguistics: Long papers—volume 1 (pp. 892–901). ACL ’12.
Jeju Island, Korea: Association for Computational Linguistics.
Dell, G. S., Chang, F., & Griffin, Z. M. (1999). Connectionist models of language
production: Lexical access and grammatical encoding. Cognitive Science, 23(4), 517–542.
del Prado Martin, F. M., & Bois, J. W. D. (2015). Syntactic alignment is an index of affective
alignment: An information-theoretical study of natural dialogue. In Proceedings of the 37th
Annual Meeting of the Cognitive Science Society. Pasadena, CA: Cognitive Science Society.
Demberg, V., & Keller, F. (2008). Data from eye-tracking corpora as evidence for theories
of syntactic processing complexity. Cognition, 109(2), 193–210.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Ferreira, V. S., & Bock, J. K. (2006). The functions of structural priming. Language and
Cognitive Processes, 21(7–8), 1011–1029.
Fusaroli, R., Bahrami, B., Olsen, K., Roepstorff, A., Rees, G., Frith, C., & Tylen, K.
(2012). Coming to terms: Quantifying the benefits of linguistic coordination. Psychological
Science, 23(8), 931–939.
Garrod, S., & Anderson, A. (1987). Saying what you mean in dialogue: A study in
conceptual and semantic coordination. Cognition, 27, 181–218.
Genzel, D., & Charniak, E. (2003). Variation of entropy and parse trees of sentences as a
function of the sentence number. In Proceedings of the 2003 Conference on Empirical Methods
in Natural Language Processing (pp. 65–72). Association for Computational Linguistics.
Gibson, E. (1998). Linguistic complexity: Locality of syntactic dependencies. Cognition,
68(1), 1–76. doi: 10.1016/S0010–0277(98)00034-1.
Gibson, E. A. F. (1991). A computational theory of human linguistic processing: Memory limitations
and processing breakdown. Doctoral dissertation, School of Computer Science, Carnegie
Mellon University.
Gries, S. T. (2005). Syntactic priming: A corpus-based approach. Journal of Psycholinguistic
Research, 34(4), 365–399.
Hale, J. (2003). The information conveyed by words in sentences. Journal of Psycholinguistic
Research, 32(2), 101–123.
Hartsuiker, R. J., & Westenberg, C. (2000). Persistence of word order in written and spoken
sentence production. Cognition, 75B, 27–39.
Hartsuiker, R. J., Bernolet, S., Schoonbaert, S., Speybroeck, S., & Vanderelst, D. (2008).
Syntactic priming persists while the lexical boost decays: Evidence from written and
spoken dialogue. Journal of Memory and Language, 58(2), 214–238.
Healey, P. G., Purver, M., & Howes, C. (2014). Divergence in dialogue. PLoS One, 9(6),
e98598.
Henderson, J., & Ferreira, F. (2004). Interface of language, vision and action: Eye movements and
the visual world. New York, NY: Psychology Press.
Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief
nets. Neural Computation, 18(7), 1527–1554.
Jaeger, T. F. (2006). Redundancy and syntactic reduction in spontaneous speech. Doctoral
dissertation, Stanford University.
Jaeger, T. F. (2010). Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology, 61(1), 23–62.
Jaeger, T. F., & Snider, N. (2007). Implicit learning and syntactic persistence: Surprisal and
cumulativity. University of Rochester Working Papers in the Language Sciences, 3(1), 26–44.
Jaeger, T. F., & Snider, N. (2008). Implicit learning and syntactic persistence: Surprisal and
cumulativity. In Proceedings of the 30th Annual Conference of the Cognitive Science Society (pp.
1061–1066). Washington: Cognitive Science Society.
Jones, S., Cotterill, R., Dewdney, N., Muir, K., & Joinson, A. (2014). Finding zelig in text:
A measure for normalising linguistic accommodation. In The 25th International Conference
on Computational Linguistics (COLING 2014). Dublin, Ireland.
Joshi, A. K., & Schabes, Y. (1997). Tree-adjoining grammars. In G. Rozenberg & A.
Salomaa (Eds.), Handbook of formal languages (Vol. 3, pp. 69–124). Berlin, New York:
Springer.
Kaschak, M. P., Kutta, T. J., & Jones, J. L. (2011). Structural priming as implicit learning:
Cumulative priming effects and individual differences. Psychonomic Bulletin & Review,
18(6), 1133–1139.
Kaschak, M. P., Kutta, T. J., & Schatschneider, C. (2011). Long-term cumulative structural
priming persists for (at least) one week. Memory & Cognition, 39(3), 381–388.
Kaschak, M. P., Loney, R. A., & Borregine, K. L. (2006). Recent experience affects the
strength of structural priming. Cognition, 99, B73–B82.
Kennedy, A., & Pynte, J. (2005). Parafoveal-on-foveal effects in normal reading. Vision
Research, 45(2), 153–168.
Kilgarriff, A. (2005). Language is never ever ever random. Corpus Linguistics and Linguistic
Theory, 1(2), 263–276.
Kootstra, G. J., van Hell, J. G., & Dijkstra, T. (2010). Syntactic alignment and shared word
order in code-switched sentence production: Evidence from bilingual monologue and
dialogue. Journal of Memory and Language, 63(2), 210–231.
Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3), 1126–
1177.
Lewis, R. L., Vasishth, S., & Van Dyke, J. A. (2006). Computational principles of working
memory in sentence comprehension. Trends in Cognitive Sciences, 10(10), 447–454.
McKelvie, D. (1998). SDP – spoken dialogue parser (tech. rep. No. HCRC-TR/96).
Edinburgh, UK: Human Communication Research Centre.
Manning, C. D. (2015). Computational linguistics and deep learning. Computational
Linguistics, 41(4), 701–707.
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., & McClosky, D.
(2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of
52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations
(pp. 55–60).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed
representations of words and phrases and their compositionality. In Advances in neural
information processing systems (pp. 3111–3119).
Mnih, A., & Hinton, G. E. (2009). A scalable hierarchical distributed language model. In
Advances in neural information processing systems (pp. 1081–1088).
Nenkova, A., Gravano, A., & Hirschberg, J. (2008). High frequency word entrainment in
spoken dialogue. In Proceedings of the 46th Annual Meeting of the Association for Computational
Linguistics on Human Language Technologies: Short papers (pp. 169–172). HLT-Short ’08.
Columbus, Ohio: Association for Computational Linguistics.
Niederhoffer, K. G., & Pennebaker, J. W. (2002). Linguistic style matching in social
interaction. Journal of Language and Social Psychology, 21(4), 337–360.
Ororbia II, A. G., Giles, C. L., & Reitter, D. (2015). Learning a deep hybrid model
for semi-supervised text classification. In Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing (EMNLP). Lisbon, Portugal.
Ororbia II, A. G., Reitter, D., Wu, J., & Giles, C. L. (2015). Online learning of deep hybrid
architectures for semi-supervised categorization. In ECML PKDD. European conference on
machine learning and principles and practice of knowledge discovery in databases. Porto, Portugal:
Springer.
Paxton, A., Abney, D., Kello, C. K., & Dale, R. (2014). Network analysis of multimodal,
multiscale coordination in dyadic problem solving. In P. M. Bello, M. Guarini, M.
McShane, & B. Scassellati (Eds.), Proceedings of the 36th Annual Meeting of the Cognitive
Science Society. Austin, TX: Cognitive Science Society.
Pickering, M. J., & Branigan, H. P. (1998). The representation of verbs: Evidence
from syntactic priming in language production. Journal of Memory and Language, 39,
633–651.
Pickering, M. J., & Ferreira, V. S. (2008). Structural priming: A critical review. Psychological
Bulletin, 134(4), 427–459.
Pickering, M. J., & Garrod, S. (2004). Toward a mechanistic psychology of dialogue.
Behavioral and Brain Sciences, 27, 169–225.
Pinker, S. (1991). Rules of language. Science, 253(5019), 530–535.
Pollard, C., & Sag, I. A. (1994). Head-driven phrase structure grammar. Chicago: University of
Chicago Press.
Pullum, G. K., & Scholz, B. C. (2002). Empirical assessment of stimulus poverty arguments.
The Linguistic Review, 18(1–2), 9–50.
Qian, T., & Jaeger, T. F. (2011). Topic shift in efficient discourse production. In Proceedings
of the 33rd Annual Conference of the Cognitive Science Society (pp. 3313–3318).
Reichle, E. D., Rayner, K., & Pollatsek, A. (2003). The E-Z reader model of eye-movement
control in reading: Comparisons to other models. Behavioral and Brain Sciences, 26(04),
445–476.
Reitter, D. (2008). Context effects in language production: Models of syntactic priming in dialogue
corpora. Doctoral dissertation, University of Edinburgh.
Reitter, D., & Moore, J. D. (2007). Predicting success in dialogue. In Proceedings of the 45th
Annual Meeting of the Association of Computational Linguistics (pp. 808–815). Prague, Czech
Republic.
Reitter, D., & Moore, J. D. (2014). Alignment and task success in spoken dialogue. Journal
of Memory and Language, 76, 29–46. doi: 10.1016/j.jml.2014.05.008.
Reitter, D., Hockenmaier, J., & Keller, F. (2006a). Priming effects in combinatory categorial
grammar. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language
Processing (EMNLP) (pp. 308–316). Sydney, Australia.
Reitter, D., Keller, F., & Moore, J. D. (2006b). Computational modeling of structural
priming in dialogue. In Proceedings of the Human Language Technology Conference/North
American Chapter of the Association for Computational Linguistics (HLT/NAACL) (pp.
121–124). New York, NY.
Reitter, D., Keller, F., & Moore, J. D. (2011). A computational cognitive model of syntactic
priming. Cognitive Science, 35(4), 587–637. doi: 10.1111/j.1551–6709.2010.01165.
Reitter, D., Moore, J. D., & Keller, F. (2006c). Priming of syntactic rules in task-oriented
dialogue and spontaneous conversation. In Proceedings of the 28th Annual Conference of the
Cognitive Science Society (pp. 685–690). Vancouver, Canada.
Scheepers, C. (2003). Syntactic priming of relative clause attachments: Persistence of
structural configuration in sentence production. Cognition, 89, 179–205.
Schilling, H. E., Rayner, K., & Chumbley, J. I. (1998). Comparing naming, lexical decision,
and eye fixation times: Word frequency effects and individual differences. Memory &
Cognition, 26(6), 1270–1281.
Smith, N. J., & Levy, R. (2013). The effect of word predictability on reading time is
logarithmic. Cognition, 128(3), 302–319.
Snider, N., & Jaeger, T. F. (2009). Syntax in flux: Structural priming maintains probabilistic
representations. Poster at the 15th Annual Conference on Architectures and Mechanisms of
Language Processing, Barcelona.
Steedman, M. (2000). The syntactic process. Cambridge, MA: MIT Press.
Szmrecsanyi, B. (2005). Creatures of habit: A corpus-linguistic analysis of persistence in
spoken English. Corpus Linguistics and Linguistic Theory, 1(1), 113–149.
Szmrecsanyi, B. (2006). Morphosyntactic persistence in spoken English: A corpus study at
the intersection of variationist sociolinguistics, psycholinguistics, and discourse analysis. Berlin,
Germany: Mouton de Gruyter.
Tanenhaus, M. K., Spivey-Knowlton, M. J., Eberhard, K. M., & Sedivy, J. C. (1995).
Integration of visual and linguistic information in spoken language comprehension.
Science, 268(5217), 1632–1634.
Travis, C. E. (2007). Genre effects on subject expression in Spanish: Priming in narrative
and conversation. Language Variation and Change, 19(2), 101–135.
Van Zaanen, M. (2000). ABL alignment-based learning. In Proceedings of the 18th Conference
on Computational Linguistics (pp. 961–967). Association for Computational Linguistics.
Wang, Y., Reitter, D., & Yen, J. (2014). Linguistic adaptation in online conversation
threads: Analyzing alignment in online health communities. In Cognitive Modeling
and Computational Linguistics (CMCL) (pp. 55–62). Baltimore, MD: Association for
Computational Linguistics.
Wang, Y., Yen, J., & Reitter, D. (2015). Pragmatic alignment on social support type in health
forum conversations. In Proceedings of Cognitive Modeling and Computational Linguistics
(CMCL) (pp. 9–18). Denver, CO: Association for Computational Linguistics.
Xu, Y., & Reitter, D. (2015). An evaluation and comparison of linguistic alignment
measures. In Proceedings of Cognitive Modeling and Computational Linguistics (CMCL) (pp.
58–67). Denver, CO: Association for Computational Linguistics.
Xu, Y., & Reitter, D. (2016). Convergence of syntactic complexity in conversation. In
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. ACL
’16. Berlin: Association for Computational Linguistics.
Zeldes, A. (2016) The GUM corpus: Creating multilayer resources in the classroom.
Language Resources and Evaluation, 1–32.
12
ATTENTION ECONOMIES,
INFORMATION CROWDING,
AND LANGUAGE CHANGE
Thomas T. Hills,
James S. Adelman,
and Takao Noguchi

Abstract
Language is a communication system that adapts to the cognitive capacities of its users
and the social environment in which it is used. In this chapter we outline a theory, inspired
by the linguistic niche hypothesis, which proposes that language change is influenced by
information crowding. We provide a formal description of this theory and how information
markets, caused by a growth in the communication of ideas, should influence conceptual
complexity in language over time. Using American English and data from multiple language
corpora, including over 400 billion words, we test the proposed crowding hypothesis as well
as alternative theories of language change, including learner-centered accounts and semantic
bleaching. Our results show a consistent rise in concreteness in American English over the
last 200 years. This rise is not strictly due to changes in syntax, but occurs within word classes
(e.g. nouns), as well as within words of a given length. Moreover, the rise in concreteness
is not correlated with surface changes in language that would support a learner-centered
hypothesis: There is no concordant change in word length across corpora nor is there a
preference for producing words with earlier age of acquisition. In addition, we also find
no evidence that this change is a function of semantic bleaching: In a comparison of two
different concreteness norms taken 45 years apart, we find no systematic change in word
concreteness. Finally, we test the crowding hypothesis directly by comparing approximately
1.5 million tweets across the 50 US states and find a correlation between population density
and tweet concreteness, which is not explained by state IQ. The results demonstrate both how Big Data can be used to discriminate among alternative hypotheses and how
language may be changing over time in response to the rising tide of information.

Language has changed. Consider the following example from The Scarlet Letter,
written by Nathaniel Hawthorne and originally published in 1850 (Hawthorne,
2004):
Children have always a sympathy in the agitations of those connected with


them; always, especially, a sense of any trouble or impending revolution, of
whatever kind, in domestic circumstances; and therefore Pearl, who was the
gem on her mother’s unquiet bosom, betrayed, by the very dance of her
spirits, the emotions which none could detect in the marble passiveness of
Hester’s brow.
Compare this with a more recent example of writing from David Sidaris’s Barrel
Fever, published in 1994 (Sidaris, 1994):

If you’re looking for sympathy you’ll find it between shit and syphilis in the
dictionary.
Both examples contain words that readers are likely to recognize. Both allude
to fairly abstract concepts in relation to the word sympathy. However, they use
dramatically different words to do so. Yet, with but a few examples, there is always
the risk that one is cherry picking. Compare, as another example, the following
two quotations, both published in Nature, but more than 100 years apart:

When so little is really known about evolution, even in the sphere of organic
matter, where this grand principle was first prominently brought before our
notice, it may perhaps seem premature to pursue its action further back in
the history of the universe. (Blanshard, 1873)

Each sex is part of the environment of the other sex. This may lead
to perpetual coevolution between the sexes, when adaptation by one sex
reduces fitness of the other. (Rice, 1996)
The difference in language between these examples hints at something
systematic, with the more recent examples appearing more efficient—they are
perhaps more conceptually clear and perhaps more memorable, arguably without
any loss of content or fuzzying of the intended meaning. One could consider
this a kind of conceptual evolution in the language. Still, these are but a few
examples and the change is difficult to quantify. To characterize conceptual change
in language over hundreds of years would require reading thousands or millions
of books and attempting to precipitate out change in a psycholinguistic variable
capable of capturing change in conceptual efficiency. The computational and data
resources associated with “Big Data” make this possible. In recent years millions of
words of natural language have been digitized and made available to researchers and
large-scale psycholinguistic norms have been published using online crowdsourcing
to rate thousands of words. In this chapter, we combine a wide collection of
such resources—books, newspapers, magazines, public speeches, tweets, and
psycholinguistic norms—to investigate the recent psycholinguistic evolution of
American English.
Language Change
It is well established that languages change over time (Labov, 1980; Lieberman,
Michel, Jackson, Tang, & Nowak, 2007; Michel et al., 2011), and this evolution
has been characterized at both long time scales, on the order of thousands of years
(Pagel et al., 2007; Lupyan & Dale, 2010), and short time scales, on the order
of generations (Labov, 1980; Michel et al., 2011). Word forms change, such as
the transition from Old English lufie to present day love. Words are replaced by
other words, such as replacement of Old English wulcen by the Scandinavian word
sky (Romaine et al., 1992). Words change their grammatical value, such as the
word going, which once strictly referred to physical movement from one place
to another and now refers to the future, as in “They are going to change their
policy.” And words often change in more than one way at the same time, as in the
Proto-Indo-European word kap, which may have originally meant “to seize” (as
in the Latin root cap, from “capture” and “captive”), but now hides in plain sight
in English as the word have (Deutscher, 2010).
The factors influencing which words change, and why, have seen an upsurge of interest outside traditional linguistics, fueled by the availability of large databases
of language. Using this approach, researchers have isolated a variety of factors
that influence language change. For example, the average half-life of a word
is between 2,000 and 4,000 years (i.e. the time at which half of all words
are replaced by a new non-cognate word), but the half-life of a word is
prolonged for high-frequency words (Pagel et al., 2007) and for words with
earlier age of acquisition (Monaghan, 2014). Languages are also susceptible to
losing grammatically obligatory word affixes, called morphological complexity.
This varies across languages in predictable ways as a function of the number of
speakers. English, compared to most other languages, has very little morphological
complexity. For example, English speakers add the suffix -ed to regular verbs to
signal the past tense. Navajo, on the other hand, adds obligatory morphemes to
indicate a wide variety of information about past action, for example, whether
it occurred among groups, was repeated, or changed over time (Young, Morgan,
& Midgette, 1992). In a comparison of more than 2,000 languages, languages
with more speakers have less morphological complexity than languages with
fewer speakers (Lupyan & Dale, 2010). All of the above examples indicate that
potential selective pressures such as frequency of shared language use can preserve
language features and that these can be characterized in much the same way
as genetic evolution (compare, e.g. Pagel et al., 2007; Duret & Mouchiroud,
2000).
The hypothesis that languages evolve in response to selective pressures provided
by the social environment in which they are spoken is called the linguistic niche
hypothesis (Lupyan & Dale, 2010). Lupyan & Dale (2010) interpret the above
changes in morphological complexity as evidence consistent with a selective force
introduced by an influx of adult language learners among languages with more


speakers. This shares much in common with other learner-centered accounts of
language change (Trudgill, 2002; McWhorter, 2007). Our goal in the present
work is to extend the linguistic niche hypothesis by developing theory around
another potential source of social selection.

Conceptual Crowding
Cognitive performance is intimately connected with our capacity to process,
understand, and recall new information. One of the principal features of cognitive
information is that it can experience crowding. An example is list-length effects,
where items on a longer list are remembered less well than items on shorter lists
(Ratcliff, Clark, & Shiffrin, 1990). Similar information crowding may also be
taking place at the level of our day to day experience where the production and
exposure to information has seen unprecedented growth (Varian & Lyman, 2000;
Eppler & Mengis, 2004). These conditions create highly competitive environments
for information, especially among languages with many speakers, where speakers
represent potential competitors in the language marketplace. This competition in
information creates information markets, which have inspired the idea of attention
economies (Hansen & Haas, 2001; Davenport & Beck, 2001). Attention represents
a limiting resource in information-rich environments. Attention economies can
drive evolutionary change in communication in the same way that noise drives
change in other communication systems. Acoustic crowding associated with bird
song is known to influence communication strategies in bird species, in ways
that make songs easier to discriminate (Luther, 2009; Grant & Grant, 2010).
Similar to a global cocktail party problem—where communicators compete for
the attention of listeners—conceptual information may also compete for attention,
both at the source—when it is spoken—as well as later, when it is being
retrieved.
To formalize the above argument, consider that crowding mimics the inclusion
of noise in information theoretic accounts of signal transduction (Shannon &
Weaver, 1949). This is often discussed in terms of the signal-to-noise ratio, S/N. From
the perspective of an information producer, other producers’ messages are noise,
and as these increase our own messages are transmitted less efficiently.
How should cognitive systems adapt to this situation? Simon provided an
answer for receivers: “a wealth of information creates a poverty of attention and
a need to allocate that attention efficiently” (Simon, 1971, pp. 40–41). One way
to accomplish this is to order information sources in proportion to their value
to the receiver. Here, people should learn to tune out (i.e. inhibit) unwanted
messages and pay more attention to those that represent real conceptual value.
This is the basis of algorithmic approaches to filtering junk e-mail (e.g. Sahami,
Dumais, Heckerman, & Horvitz, 1998). Conceptually ambiguous messages may be


tuned out in the crowded information market. However, message senders can also
respond to crowded information markets by changing the message. As language
adapts to the needs of its users (Christiansen & Chater, 2008), language should also
change in response to information crowding. In particular, language should evolve
to increase signal strength, S.
How can languages increase S? Shannon’s mathematical account of information
theory is not so helpful here. This is because S must necessarily be adapted to the
characteristics of the receiver. In the case of language, this means that messages
which are most likely to be processed effectively are messages that are more rapidly
understood and later recalled. Thus, one way to increase S is to reduce the
conceptual length of a message.

An Illustrative Example: Optimal Conceptual Length


To provide a quantitative illustration of the impact of noise on the conceptual
evolution of language, consider the following example. Let a message of bit size,
b, indicate truth information about the world, which provides information about
the locations of resource targets. Here we take b to represent the conceptual length
of a message. Suppose that the number of targets, n_b, identified in a message of a given length is n_b = b^λ. When λ > 1, longer messages contain more information per conceptual bit than short messages. If the error rate per bit is f, then the probability that a message is received without error is ω = (1 − f)^b.

Assume that transmitters repeat messages about targets until they are successfully received and then transmit locations of additional targets. With a constant bit rate, the target acquisition rate for messages of length b is

$$R = \frac{b^{\lambda}\,\omega}{b} \qquad (1)$$

which is the product of the message value in targets and the rate of successful signal transmission. R(b) is maximized when

$$b^{*} = -\frac{\lambda - 1}{\ln(1 - f)} \qquad (2)$$

Formally, b* → 0 as f → 1.
As noise increases signal lengths should become smaller. Figure 12.1 shows
this relationship for a variety of error rates. This is a general finding that is not
constrained to words, but to messages and the time it takes to process them. It is
similar to the inverse relationship between channel capacity and noise in Shannon’s
noisy-channel encoding theorem (MacKay, 2003), except here we are concerned
with optimal message length when message interruption leads to complete message
failure.
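Equation (2) is easy to explore numerically. The short sketch below (an illustration added here, not code from the chapter) evaluates the optimal conceptual length for the error rates and values of λ shown in Figure 12.1.

```python
import numpy as np

def optimal_message_length(f, lam):
    """Evaluate b* = -(lambda - 1) / ln(1 - f) from equation (2)."""
    return -(lam - 1) / np.log(1 - f)

# As the per-bit error rate f grows, the optimal conceptual length shrinks
# toward zero; larger lambda (more targets per bit) favors longer messages.
error_rates = np.linspace(0.05, 0.95, 10)
for lam in (1.1, 1.5, 2, 5):
    print(lam, np.round(optimal_message_length(error_rates, lam), 2))
```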
FIGURE 12.1 Optimal message length as a function of error rate, shown for λ = 1.1, 1.5, 2, and 5.

Importantly, the result points to how conceptual information in communication


should respond to noise in the form of information crowding. As a selective force,
crowding should enhance the conceptual efficiency of language by reducing the
conceptual length of messages. In the absence of crowding, this selective force
is reduced and languages may respond to other influences, such as signalling
information about the speaker or providing details that offer more information
with respect to the intended message. In crowded environments, the cost of this
additional information may jeopardize the message in its entirety.

Surface Versus Conceptual Complexity


The above example leaves ambiguous what is precisely meant by conceptual length.
That is, the processing required to render a message into its cognitive output
depends on multiple aspects. A receiver of a message must not only veridically
identify the particular surface forms that have been transmitted, but also interpret
and understand the conceptual content of the message. Thus, the amount of
processing required depends on message form and content. A reader will have
less work to do to determine who has done what to which object when the writer
writes “Dr Estes dropped his morning banana on the desk in Dr Wade’s office” than
when he writes “The professor’s mishandling of his fruit influenced his colleague’s
work.” This is true even for a reader with enough knowledge to disambiguate
the references, because recomposing a message from its parts requires cognitive
processing and more so when those parts are ambiguous (Murphy & Smith, 1982;
Rosch, Mervis, Gray, Johnson, & Boyesbraem, 1976).
Messages that require more cognitive processing are more vulnerable to
incomplete transmission. First, each unit of processing work is vulnerable to
intrinsic failure; the greater the amount of work, the higher the probability of
failure. All else being equal, this will result in more complex messages being lost.
Second, as discussed above, there is competition for the receiver’s resources. New
messages may arrive during the processing of older messages, and interrupt their
transmission.
Messages may also be more complex than their payloads require. Take for
example the following sentence, which won the Bad Writing Contest in 1997:

The move from a structuralist account in which capital is understood


to structure social relations in relatively homologous ways to a view of
hegemony in which power relations are subject to repetition, convergence,
and rearticulation brought the question of temporality into the thinking
of structure, and marked a shift from a form of Althusserian theory that
takes structural totalities as theoretical objects to one in which the insights
into the contingent possibility of structure inaugurate a renewed conception
of hegemony as bound up with the contingent sites and strategies of the
rearticulation of power. (Butler, 1997: 13)

Regardless of how profound the sentence may be, many have found it to be
most valuable as a lesson in maximizing words per concept (e.g. Pinker, 2014).
In the Dr. Estes example above, using less specific terms such as “fruit” when
“banana” is meant, increases the conceptual complexity of the message without
changing the payload, and without a gain in surface complexity. Writers can
eliminate such inefficiencies, but they may not do so unless pressured because
the cost to language users of being more specific is non-zero. G. K. Zipf referred
to a similar balance between reductive and expansive forces in language:

. . . whenever a person uses words to convey meanings he will automatically


try to get his ideas across most efficiently by seeking a balance between the
economy of a small wieldy vocabulary of more general reference on the one
hand, and the economy of a larger one of more precise reference on the
other, with the result that the vocabulary of n different words in his resulting
flow of speech will represent a vocabulary balance between our theoretical
forces of unification and diversification. (Zipf, 1949: 22)

Crowding is but one force that language users must accommodate.
The formal basis for the costs of surface and conceptual complexity are
analogous, but they are subject to different trade-offs. At its simplest, surface
complexity corresponds to physical length (e.g. word length), suggesting a cost
for longer words. However, longer words are also more phonologically isolated,
such that they are effectively error-correcting codes: If a long word is disrupted
slightly, it is still unlikely to be confused with another word, but if a short word
suffers distortion, the outcome can be more easily confused with another word.
Conceptual length is not necessarily associated with physical length, because it
relies on higher-order cognitive processing capacities involved in message retrieval
and production. In the next section we expand further on this by relating
conceptual length to historical change in language.

Conceptual Efficiency and Concreteness


The properties of language that are best associated with conceptual
efficiency—being more rapidly understood and later recalled—have been
extensively studied by psycholinguists. In particular, concreteness is well marked
by its ability to enhance these processing capacities. Concreteness refers to a word’s
ability to make specific and definite reference to particular objects. Among psycholinguistic variables, the range and depth of concreteness in cognitive processing
is easily among the most profound. Paivio’s dual-coding theory (Paivio, 1971),
which proposed both a visual and verbal contribution to linguistic information,
led to years of research showing that concrete words had a memory advantage in
recall tasks (Paivio, Walsh, & Bons, 1994; Romani, McAlpine, & Martin, 2008;
Fliessbach, Weis, Klaver, Elger, & Weber, 2006; Miller & Roodenrys, 2009).
This initial research has since led to numerous studies articulating the influence of
concreteness as an important psycholinguistic variable. A Google Scholar search of
“concreteness” and “linguistics” finds approximately 30,000 articles that contain
both terms, with approximately 2,000 published in the last year.
The breadth of this research is compelling. Concrete language is perceived as
more truthful (Hansen & Wanke, 2010) and it is more interesting and easier to
comprehend (Sadoski, 2001). Concrete words are recognized more quickly in
lexical decision tasks than more abstract words (James, 1975). Concrete words show
an advantage in bilingual translation and novel word learning (Tokowicz & Kroll,
2007; Kaushanskaya & Rechtzigel, 2012). And concrete words are more readily
learned by both second and first language learners (De Groot & Keijzer, 2000). In
addition, concrete and abstract words are processed differently in the brain (Adorni
& Proverbio, 2012; Huang, Lee, & Federmeier, 2010).
Multiple explanations for the importance of concreteness have been proposed,
ranging from the imageability of words (Paivio, 1971), to their contextual availability
(Schwanenflugel, Harnishfeger, & Stowe, 1988), to a more recent account based
on emotional valence (Kousta, Vigliocco, Vinson, Andrews, & Del Campo, 2011).
The important point here is that each of these theories acknowledges the powerful
role that concreteness plays in facilitating linguistic processing.
The wealth of evidence on the value of concreteness in language presents
a problem. Why should words ever become abstract? The assumption in
the mathematical example above provides what we feel is the most likely
explanation—abstract words, by the nature of their generality, provide information
about broader classes of phenomena. The word fruit, being more abstract than items
subordinate to that category, e.g. apple, can efficiently communicate information
about categories of truth that would otherwise take many individual instances of
more concrete examples—“fruit can prevent scurvy when eaten on long ocean
voyages.” Or to take another example, the word essentially is one of the most
abstract words in concreteness norms (see Brysbaert et al., 2013); however the
sentence “My office hours are essentially on Wednesday at noon” provides a degree
of probabilistic hedging that “My office hours are on Wednesday at noon” does
not.
Besides the theory of crowding proposed above, we know of no prior theories
that speak directly to evolutionary changes in concreteness at the level of word
distributions. Nonetheless, some evidence of cultural change may speak indirectly
to this issue. The most prominent is associated with an explanation for the Flynn
effect. The Flynn effect is the observation that intelligence scores, associated with
both crystallized and fluid intelligence, have risen steadily from approximately the
middle of the last century (Flynn, 2012). Flynn noted that nutrition has failed to
explain the observed effects and, in the absence of evidence for other biological
theories, more cognitive theories have risen to the foreground (Flynn, 2012). In
particular, numerous theories speak to an increase in computational, symbolic, and
potentially more abstract processing abilities (Greenfield, 1998; Fox & Mitchum,
2013). One implication of knowledge-based accounts is that language may change
its composition to reflect our capacity for more abstract processing, and thus show
an increase in abstract words.
However, the causal arrow may point in the other direction. An increase
in concrete language may enhance our individual capacity to process complex
information. By this interpretation, language should have become more concrete,
and where it is concrete people should tend to learn and process more about it
(Sadoski, 2001).

The Rise in Concreteness


We combined multi-billion word diachronic language corpora (e.g. the Google
Ngram corpora and the Corpus of Historical American English) with a
recent collection of concreteness norms composed of concreteness ratings for
approximately 40,000 words (Brysbaert et al., 2013). The Google Ngram corpus
of American English provides a collection of over 355 billion words published
in books digitized by Google (Michel et al., 2011). We limit our analysis


to years following 1800 to ensure appropriate representation. The Corpus of
Historical American English, collected independently of the Google Ngram
corpus, represents a balanced and representative corpus of American English
containing more than 400 million words of text from 1810 to 1990, by decade,
and composed of newspaper and magazine articles (Davies, 2009). In addition,
we included presidential inaugural addresses (Bird, Klein, & Loper, 2009), with
measures computed from the entire text of each speech. All of the data sources are
freely available online.
We tracked changes in concreteness over time by computing a measure of
average concreteness. The concreteness index, C_y, for each year, y, in a corpus was computed as follows,

$$C_y = \sum_{i}^{n} c_i \, p_{i,y}$$

where c_i is the concreteness for word i as found in the Brysbaert et al. concreteness norms and p_{i,y} is the proportion of word i in year y. The proportion is computed
only over the n words in the concreteness norms, or the appropriate comparison
set (as described in the caption for Figure 12.2). We also computed concreteness
on a per document basis, as opposed to per word, with similar results.
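In code, the index is a frequency-weighted average of concreteness ratings over the words covered by the norms. The sketch below is a minimal illustration with invented counts and ratings; the assumed table layout is not the format of the actual corpora.

```python
import pandas as pd

def concreteness_index(counts, norms):
    """Concreteness index C_y = sum_i c_i * p_{i,y}, where the proportions
    are computed only over words present in the norms.

    counts: DataFrame with columns word, year, count (hypothetical layout)
    norms:  dict mapping word -> concreteness rating
    """
    rated = counts[counts["word"].isin(norms.keys())].copy()
    rated["c"] = rated["word"].map(norms)
    # Proportion of each rated word within its year.
    rated["p"] = rated["count"] / rated.groupby("year")["count"].transform("sum")
    return (rated["c"] * rated["p"]).groupby(rated["year"]).sum()

# Toy usage with made-up counts and ratings.
counts = pd.DataFrame({
    "word":  ["banana", "sympathy", "banana", "sympathy"],
    "year":  [1850, 1850, 2000, 2000],
    "count": [10, 90, 60, 40],
})
norms = {"banana": 5.0, "sympathy": 2.0}
print(concreteness_index(counts, norms))
```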
As shown in Figure 12.2, we found a steady rise in concreteness across multiple
corpora, including books (Google Ngrams), newspapers and magazines (the
Corpus of Historical American English), and presidential speeches. The Google
Ngrams also provide a corpus based on English fiction, which shows the same
pattern, with a slightly more dramatic rise in concreteness from approximately
2.35 in 1800 to 2.57 in 2000 (data not shown).
We also found that changes in concreteness occurred within word classes and
were therefore not strictly due to changes in syntax (e.g. by a reduction in the
use of articles). Figure 12.3 shows that, over the same time period, concreteness
increases within nouns, verbs, and prepositions. Together these findings show that
the change is not only systematic but systemic, permeating many different aspects
of the language.
This observation is consistent with a number of hypotheses, including crowding
and the influence of second language learners. In what follows, we examine a
number of these hypotheses in an effort to discriminate among a wide set of
possibilities.

Semantic Bleaching
It has been proposed that word evolution follows a fairly predictable pattern over
time, from specific to abstract. This takes a variety of forms including semantic
bleaching, desemanticization, and grammaticalization (Hopper & Traugott, 2003;
Aitchison & Lewis, 2003). An example is the word very, which derives from the
French word vrai. In French, vrai did and still does mean “true.” In Middle English,
the word very also meant “true,” as in a very knight, meaning a real knight. However, over time the word became a means to emphasize the strength of a relationship, in a probabilistic way. For example, nowadays we say that something is very true, meaning that there are degrees of truth and this particular thing may have more of it than others.

FIGURE 12.2 Concreteness in the Google Ngrams (GN), the Corpus of Historical American English (COHA), and presidential inaugural addresses. Also shown is the Google Ngram corpus using only words that were present in the 1800 lexicon or only words that were known by more than 95 percent of participants in the concreteness norms (Brysbaert et al., 2013). Pearson correlations with time (95 percent confidence interval): Google Ngrams, r = 0.86 (0.82, 0.89); COHA, r = 0.89 (0.73, 0.96); inaugural addresses, r = 0.77 (0.63, 0.86). Figure taken from Hills & Adelman (2015).
Although semantic bleaching is proposed to be unidirectional, this is not
without debate (Hollmann, 2009). Moreover, it is certainly the case that not all
diachronic linguistic patterns are associated with loss of concreteness. Metonymy
[Figure 12.3 comprises three panels (Nouns, Verbs, Prepositions) plotting concreteness against year, 1800–2000, for the Google Ngrams (GN) and COHA.]
FIGURE 12.3 Changes in concreteness within word classes. Figure taken from Hills &
Adelman (2015).

is a common figure of speech where some specific aspect of an abstract concept is


used in its place, as in the phrases the pen is mightier than the sword and the Pentagon.
However, if bleaching were sufficiently strong, it could explain the observed
rise in concreteness. Language could have been perceived as more concrete by its
users at the time, but appear less concrete now because the norms are based on
present-day perceptions of word concreteness.
Concreteness norms were not collected as far back as the 1800s. However, the
existence of the Paivio norms (Paivio et al., 1968) provides a 45-year window of

FIGURE 12.4 Comparison of the Paivio and Brysbaert norms. The Paivio concreteness
norms (Paivio et al., 1968) consist of 925 nouns, collected in the laboratory and using
a 7-point scale. The Brysbaert norms were collected on a 5-point scale. Both norms
are normalized to be between zero and one. (a) Shows the change in concreteness
over the 45-year span between collection. (b) Shows the histogram of concreteness
differences per word. Figure taken from Hills & Adelman (2015).

comparison with the Brysbaert norms and provides the only basis we know of for a
quantitative test of semantic bleaching. Normalizing the ratings for both shows that
there are no systematic changes in word concreteness over the approximately 900
words used for comparison (Figure 12.4). The median change is centered around
zero and a paired t-test finds no significant difference in concreteness (t(876) =
−0.79, p = 0.45). This suggests that a systematic loss of concreteness is unlikely to
explain the apparent rise in concreteness we see in the data.
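A minimal sketch of this check is given below, assuming the two norm sets are available as word-to-rating dictionaries (the Paivio ratings on a 7-point scale, the Brysbaert ratings on a 5-point scale); the variable names and toy ratings are ours, not the published data.

```python
import numpy as np
from scipy import stats

def to_unit_range(ratings):
    """Rescale a dict of word -> rating to lie between zero and one."""
    values = np.array(list(ratings.values()), dtype=float)
    lo, hi = values.min(), values.max()
    return {w: (r - lo) / (hi - lo) for w, r in ratings.items()}

def concreteness_change(paivio, brysbaert):
    """Per-word change (Brysbaert minus Paivio) on the normalized scales,
    plus a paired t-test over the words shared by the two norm sets."""
    p, b = to_unit_range(paivio), to_unit_range(brysbaert)
    shared = sorted(set(p) & set(b))
    diffs = {w: b[w] - p[w] for w in shared}
    t, pval = stats.ttest_rel([b[w] for w in shared], [p[w] for w in shared])
    return diffs, t, pval

# Toy illustration (made-up ratings):
paivio = {"devil": 6.0, "dreamer": 3.5, "plain": 4.0, "panic": 5.0}
brysbaert = {"devil": 2.6, "dreamer": 2.9, "plain": 3.2, "panic": 2.4}
diffs, t, pval = concreteness_change(paivio, brysbaert)
print(diffs, round(t, 2), round(pval, 3))
```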
These results also provide a large-scale approach to better understanding the
unidirectionality of semantic bleaching, which to our knowledge has not been
possible in the past. As a preliminary step in that direction, in Table 12.1 we provide
the ten words that have increased or decreased the most in concreteness over the
last 45 years. Importantly, the words from each side of the distribution offer an
effective demonstration that semantic bleaching may be mirrored by an equally
powerful semantic enrichment. A dreamer may have become more concrete—but
the devil, although he may have been in the details in the past, resides more in
abstraction today.

Reductions in Surface Complexity


The conceptual evolution described above is consistent with our information
crowding hypothesis, but it is also consistent with a learner-centered hypothesis
(e.g. Lupyan & Dale, 2010). Indeed, concreteness may represent one of many
changes reflecting an overall adaptation in English to selective forces driven
TABLE 12.1 Lists of words that have increased or decreased the most in concreteness
between the Paivio and Brysbaert concreteness norms. Positive values indicate an
increase in concreteness over time.

Word           Change     Word            Change

sovereign       0.56      facility        −0.45
plain           0.40      month           −0.43
dreamer         0.39      devil           −0.41
originator      0.36      competition     −0.37
master          0.35      death           −0.35
outsider        0.35      vacuum          −0.33
habitation      0.32      demon           −0.32
antitoxin       0.30      panic           −0.31
connoisseur     0.30      length          −0.31
captive         0.29      sensation       −0.30

by language learners. A hypothesis based on language learners—assuming it is
operating over the timescale in which we investigate American English—should
also predict surface simplification in the language. Surface simplification would
include features such as shorter word length and preference for words with earlier
age of acquisition. The absence of these features changing alongside concreteness
in no way discounts prior evidence for learner-centered change in language more
generally. However, it would indicate that the changes driving concreteness in
American English may not be learner-centered or a result of language speakers
becoming more childlike, but may instead be driven by factors more specifically
associated with conceptual clarity induced by crowding.

Word Length
More concrete words tend to be phonologically and orthographically shorter.
Among the words in the Brysbaert norms (Brysbaert et al., 2013), the correlation
between concreteness and word length is β = −0.40, p < 0.001. If the selective
forces driving concreteness are more directly driven by preferences for shorter
words, then word length should change in tandem with concreteness. However,
Figure 12.5 shows that the general trends found in concreteness are not preserved
across corpora in relation to word length. In general, word length does not change
much across the three corpora until the 1900s, and then the direction of change
appears to depend on the corpora. Words in presidential speeches get shorter, while
words in books tend to get longer. Words in newspapers and magazines, on the
FIGURE 12.5 Changes in average word length over time for words in the Google
Ngrams (GN), Corpus of Historical American English (COHA), and presidential
inaugural addresses.

other hand, first show a trend towards reduced length but then increase in length,
though only back up to approximately the level they were at in 1800.
One can also look at concreteness within words of a given length, and ask if the
rise in concreteness is independent of word length. Figure 12.6 shows that this is
largely the case. Although words of one, two, or three characters in length show
roughly no change in concreteness over time, words of four or more characters
consistently show a rise in concreteness over time (ps < 0.001).
Additional evidence of the independence between concreteness and word
length is found in Figure 12.7, which shows that within a range of concreteness
words tend to grow in length, especially following the 1900s. This is also mirrored
by an increase in word length across the full corpus. This would appear to be
counter to a potential selective force imposed by language learners. In sum, changes
in concreteness do not appear to be driven by changes in word length—on the
contrary, concreteness appears to rise despite an overall trend towards longer words.
FIGURE 12.6 Changes in concreteness within words of a given length in characters.
Data are taken from the Google Ngrams.

Age of Acquisition
Age of acquisition provides an additional, and possibly more direct, measure of
evidence for a learner-centered hypothesis. In a comparison of the 23,858 words
that are shared between the Brysbaert concreteness norms and the Kuperman age
of acquisition norms (Kuperman et al., 2012), age of acquisition is correlated with
concreteness, β = −0.35, p < 0.001. Moreover, previous work has established
that words with earlier age of acquisition are more resilient to age-related decline
and are retrieved more quickly in lexical decision tasks than words acquired later
in life (Juhasz, 2005; Hodgson & Ellis, 1998). If language change in American
English is driven by the influence of language learners—who may only show partial
learning—or the influence of an aging population—who produce earlier acquired
words preferentially—then the language as a whole may move towards words of
earlier age of acquisition.
To evaluate changes in preference for words of earlier acquisition over time, we
used the Kuperman age of acquisition norms (Kuperman et al., 2012) to compute
FIGURE 12.7 Changes in word length within narrow ranges of concreteness. Data are
taken from the Google Ngrams.

a weighted age of acquisition value for each corpus, as was done for concreteness.
Figure 12.8 shows that age of acquisition tends to follow a similar pattern as that
found for word length, but not concreteness. Changes in age of acquisition and
word length are also highly correlated across the three corpora (Google Ngram:
β = 0.96, p < 0.001; COHA: β = 0.66, p < 0.001; inaugural addresses: β = 0.95,
p < 0.001). On the other hand, age of acquisition is not well correlated with
changes in concreteness (e.g. Google Ngram: β = 0.33, p < 0.001).

Discussion of the Absence of Reduction


in Surface Complexity
The above evidence for an absence of systematic changes in word length and
age of acquisition suggests that the observed changes in concreteness are not
the result of factors that might also lead to surface simplification. Specifically,
the evidence suggests that concreteness is not being driven by second language
FIGURE 12.8 Changes in average age of acquisition for words in the Google Ngrams
(GN), Corpus of Historical American English (COHA), and presidential inaugural
addresses.

learners, who we would predict would also show a preference for shorter words
and words with earlier age of acquisition. Furthermore, the results also suggest
that the change in concreteness is not being driven by a rising tide of more
children or lower IQ individuals entering the language market. Again, if this
were the case, we would expect language to also show systematic changes in
surface complexity. In the next section we examine more directly the relationship
between crowding and concrete language by looking at another source of Big Data:
Twitter.

Population Density and Concreteness in US States


As a final look into the relationship between crowding and concreteness, we
investigated the concreteness of tweets across the 50 US states. Here our prediction
is that states with a higher population density should also produce more concrete
language. To investigate this, from 24 February 2014 till 8 April 2014, we
[Figure 12.9 is a scatter plot of mean tweet concreteness (roughly 2.66 to 2.72) against log population density for the 50 US states; Louisiana, Florida, and Hawaii lie near the top of the concreteness range, while Idaho, Utah, and South Dakota lie near the bottom.]
FIGURE 12.9 Concreteness in language increases with the population density across US
states. The data are taken from approximately 1.5 million tweets, with 30,000 tweets
per state.

collected 66,348,615 tweets, made within 50 states of the USA, using Twitter’s
streaming API. The collected tweets exclude retweets (i.e. repetition of tweets
previously made). The number of collected tweets varies between the states from
39,397 (Wyoming) to 8,009,114 (California), and thus to achieve similarity in
measurement accuracy, we randomly sampled 30,000 tweets from each state. Then,
after removing hashtags, non-ASCII characters, and punctuation marks (e.g. quotation
marks), we calculated the concreteness for each tweet and then averaged these values
for each state.
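As an illustration of this processing step, here is a rough Python sketch, assuming the tweets are already grouped by state as plain strings and that per-tweet concreteness is the mean rating of the tweet's words that appear in the norms; the cleaning rules shown are approximations of those described above, not the exact code we used.

```python
import re

def tweet_concreteness(text, norms):
    """Mean concreteness of the rated words in one tweet (None if none)."""
    text = re.sub(r"#\w+", " ", text)                      # drop hashtags
    text = text.encode("ascii", errors="ignore").decode()  # drop non-ASCII
    words = re.findall(r"[a-z']+", text.lower())           # drop punctuation
    rated = [norms[w] for w in words if w in norms]
    return sum(rated) / len(rated) if rated else None

def state_means(tweets_by_state, norms):
    """Average the per-tweet scores within each state."""
    means = {}
    for state, tweets in tweets_by_state.items():
        scores = [s for s in (tweet_concreteness(t, norms) for t in tweets)
                  if s is not None]
        if scores:
            means[state] = sum(scores) / len(scores)
    return means
```

The resulting state means can then be related to log population density with an ordinary regression (e.g. scipy.stats.linregress), which is the relationship plotted in Figure 12.9.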
Figure 12.9 shows the relationship between log population density and tweet
concreteness for states (β = 0.36, p < 0.01). There is a clear pattern of rising
concreteness with population density. There are many potential confounds here, as
styles of writing (e.g. syntax and tweet length) may change across states. However,
as we note above, concreteness is but one of many ways that conceptual efficiency
may change and thus we see it as an indicator, which may in turn be driven by other
aspects of language use. One factor that is unlikely to be influenced by crowding,
however, is IQ, which may in turn be associated with concreteness, as we note
in the introduction. In our data, IQ is inversely correlated with concreteness
(β = −0.003, p < 0.001), but this may not be particularly surprising as the
McDaniel (2006) measure is based partly on reading comprehension. However,
the relationship between concreteness and population density is preserved after
partialing out the variance accounted for by changes in IQ (McDaniel, 2006),
with population density still accounting for approximately 12 percent of the variance
(p < 0.01).

Conclusions
Culture is a marketplace for competing ideas. This leads to the prediction that
any broad medium of communication should evolve over time to have certain
properties that facilitate communication. Certain aspects of these signals should
be enhanced as competition (i.e. crowding) increases. In particular, aspects of
information that increase speed of processing and memorability should be favored
as information markets become more crowded. As we have shown, concreteness
facilitates these cognitive demands and has risen systematically in American English
for at least the last 200 years. We have also shown that these changes are not
consistent with a learner-centered hypothesis, because we would expect additional
changes in language associated with a reduction in surface complexity, such as
reduced word length and preference for words with earlier age of acquisition,
which we do not observe. The lack of evidence for these changes also indicates
that the change in concreteness is not due to a general simplifying of the language,
which one might predict if language were being influenced by, for example, a
younger age of entry into the language marketplace or a general dumbing down of
the language.
The work we present here is preliminary in many respects. We have taken
a bird’s eye view of language and focused on psycholinguistic change, but such
measures necessarily require some assumptions on our part and do not address other
factors in language change. It is very likely that there are other changes in writing
and speech conventions that one could document. To see how these align with
our present investigation, one would also need to investigate the causes of these
changes. If writing was once meant to provide additional information about the
intelligence of the author, this may have been lost in modern language—but the
natural question is why? When there are but few authors, the authors may compete
along different dimensions than when there are many, and conventions may change
accordingly.

The present work demonstrates the capacity for data analytic approaches to
language change that can discriminate among alternative hypotheses and even
combine data from multiple sources to better inform hypotheses. Naturally, we
also hope this work leads to future questions and research on the influence of
concreteness and language evolution. In particular, we find it intriguing
to ask whether the rise in concrete language may be associated with the rise in IQ
known as the Flynn effect (Flynn, 2012). Compared with the writing of
several hundred years ago, the examples we provide in the introduction suggest
that today’s writing is more succinct, often to the point of being terse. It is difficult
to deny the comparative ease with which modern language conveys its message.
Indeed, we suspect that more memorable language (such as aphorisms) shares a
similar property of making its point clearly and efficiently.
The research we present also poses questions about the influence of competition
in language. Is language produced in a competitive environment more memorable
in general, or is the increased memorability for some passages just a consequence of
a larger degree of variance among the language produced? If the former, this may
suggest that something about competitive language environments facilitates the
production of more memorable messages, and that this is something that humans
are potentially aware of and capable of modulating. Such a capacity would explain
the enhanced memorability of Facebook status updates relative to other forms
of language (Mickes et al., 2013). If such competition is driving language, and
language change maintains its current course, we may all be speaking Twitterese
in the next 100 to 500 years (compare the y-axis on Figure 12.2 and Figure 12.9).
Finally, this work may also provide applications in relation to producing more
memorable information in learning environments, for example, via a mechanism
for concretizing text or competitive writing. Although we recognize that these
questions are speculative, we hope that this work provides some inspiration for
their further investigation.

References
Adorni, R., & Proverbio, A. M. (2012). The neural manifestation of the word concreteness
effect: an electrical neuroimaging study. Neuropsychologia, 50, 880–891.
Aitchison, J., & Lewis, D. M. (2003). Polysemy and bleaching. In B. Nerlich, Z. Todd, V.
Hermann, & D. D. Clarke (Eds.), Polysemy: Flexible patterns of meaning in mind and language
(pp. 253–265). Berlin: Walter de Gruyter.
Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python. Sebastopol,
CA: O’Reilly.
Blanshard, C. (1873). Evolution as applied to the chemical elements. Nature, 9, 6–8.
Brysbaert, M., Warriner, A. B., & Kuperman, V. (2013). Concreteness ratings for 40
thousand generally known English word lemmas. Behavior Research Methods, 46, 1–8.
Butler, J. (1997). Further reflections on conversations of our time. Diacritics, 27, 13–15.
Christiansen, M. H., & Chater, N. (2008). Language as shaped by the brain. Behavioral and
Brain Sciences, 31, 489–509.
Davenport, T. H., & Beck, J. C. (2001). The attention economy: Understanding the new currency
of business. Brighton, MA: Harvard Business Press.
Davies, M. (2009). The 385+ million word corpus of contemporary American English
(1990–2008+): Design, architecture, and linguistic insights. International Journal of Corpus
Linguistics, 14, 159–190.
De Groot, A., & Keijzer, R. (2000). What is hard to learn is easy to forget: The roles of
word concreteness, cognate status, and word frequency in foreign-language vocabulary
learning and forgetting. Language Learning, 50, 1–56.
Deutscher, G. (2010). The unfolding of language. London: Random House.
Duret, L., & Mouchiroud, D. (2000). Determinants of substitution rates in mammalian
genes: Expression pattern affects selection intensity but not mutation rate. Molecular
Biology and Evolution, 17, 68–670.
Eppler, M. J., & Mengis, J. (2004). The concept of information overload: A review of
literature from organization science, accounting, marketing, MIS, and related disciplines.
The Information Society, 20, 325–344.
Fliessbach, K., Weis, S., Klaver, P., Elger, C., & Weber, B. (2006). The effect of word
concreteness on recognition memory. Neuroimage, 32, 1413–1421.
Flynn, J. R. (2012). Are we getting smarter? Rising IQ in the twenty-first century. Cambridge,
UK: Cambridge University Press.
Fox, M. C., & Mitchum, A. L. (2013). A knowledge-based theory of rising scores on
‘culture-free’ tests. Journal of Experimental Psychology: General, 142, 979–1000.
Grant, B. R., & Grant, P. R. (2010). Songs of Darwin’s finches diverge when a new species
enters the community. Proceedings of the National Academy of Sciences, 107, 20156–20163.
Greenfield, P. M. (1998). The cultural evolution of IQ. In U. Neisser (Ed.), The rising
curve: Long-term gains in IQ and related measures (pp. 81–123). Washington, DC: American
Psychological Association.
Hansen, J., & Wänke, M. (2010). Truth from language and truth from fit: The impact of
linguistic concreteness and level of construal on subjective truth. Personality and Social
Psychology Bulletin, 36, 1576–1588.
Hansen, M. T., & Haas, M. R. (2001). Competing for attention in knowledge markets:
Electronic document dissemination in a management consulting company. Administrative
Science Quarterly, 46, 1–28.
Hawthorne, N. (2004). The scarlet letter. New York: Simon and Schuster.
Hills, T. T., & Adelman, J. S. (2015). Recent evolution of learnability in American English
from 1800 to 2000. Cognition, 143, 87–92.
Hodgson, C., & Ellis, A. W. (1998). Last in, first to go: Age of acquisition and naming in
the elderly. Brain and Language, 64, 146–163.
Hollmann, W. B. (2009). Semantic change. In J. Culpeper, F. Katamba, P. Kerswill, &
T. McEnery (Eds.), English language: Description, variation and context (pp. 525–537).
Basingstoke: Palgrave.
Hopper, P. J., & Traugott, E. C. (2003). Grammaticalization. Cambridge, UK: Cambridge
University Press.
Huang, H.-W., Lee, C.-L., & Federmeier, K. D. (2010). Imagine that! ERPs provide
evidence for distinct hemispheric contributions to the processing of concrete and abstract
concepts. Neuroimage, 49, 1116–1123.
James, C. T. (1975). The role of semantic information in lexical decisions. Journal of
Experimental Psychology: Human Perception and Performance, 1, 130–136.
Juhasz, B. J. (2005). Age-of-acquisition effects in word and picture identification.
Psychological Bulletin, 131, 684–712.
Kaushanskaya, M., & Rechtzigel, K. (2012). Concreteness effects in bilingual and
monolingual word learning. Psychonomic Bulletin & Review, 19, 935–941.
Kousta, S.-T., Vigliocco, G., Vinson, D. P., Andrews, M., & Del Campo, E. (2011). The
representation of abstract words: Why emotion matters. Journal of Experimental Psychology:
General, 140, 14–34.
Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition
ratings for 30,000 English words. Behavior Research Methods, 44, 978–990.
Labov, W. (1980). Locating language in time and space. New York: Academic Press.
Lieberman, E., Michel, J.-B., Jackson, J., Tang, T., & Nowak, M. A. (2007). Quantifying
the evolutionary dynamics of language. Nature, 449, 713–716.
Lupyan, G., & Dale, R. (2010). Language structure is partly determined by social structure.
PLoS One, 5, e8559.
Luther, D. (2009). The influence of the acoustic community on songs of birds in a
neotropical rain forest. Behavioral Ecology, 20, 864–871.
McDaniel, M. A. (2006). Estimating state IQ: Measurement challenges and preliminary
correlates. Intelligence, 34, 607–619.
MacKay, D. J. (2003). Information theory, inference, and learning algorithms. Cambridge, UK:
Cambridge University Press.
McWhorter, J. H. (2007). Language interrupted: Signs of non-native acquisition in standard
language grammars. Oxford, UK: Oxford University Press.
Michel, J. B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., The Google Books
Team, ... Aiden, E. L. (2011). Quantitative analysis of culture using millions of digitized
books. Science, 331, 176–182.
Mickes, L., Darby, R. S., Hwe, V., Bajic, D., Warker, J. A., Harris, C. R., & Christenfeld,
N. J. (2013). Major memory for microblogs. Memory & Cognition, 41, 481–489.
Miller, L. M., & Roodenrys, S. (2009). The interaction of word frequency and concreteness
in immediate serial recall. Memory & Cognition, 37, 850–865.
Monaghan, P. (2014). Age of acquisition predicts rate of lexical evolution. Cognition, 133,
530–534.
Murphy, G. L., & Smith, E. E. (1982). Basic-level superiority in picture categorization.
Journal of Verbal Learning and Verbal Behavior, 21, 1–20.
Pagel, M., Atkinson, Q. D., & Meade, A. (2007). Frequency of word-use predicts rates of
lexical evolution throughout Indo-European history. Nature, 449, 717–720.
Paivio, A. (1971). Imagery and verbal processes. Hillsdale, NJ: Holt, Rinehart & Winston.
Paivio, A., Walsh, M., & Bons, T. (1994). Concreteness effects on memory: When and
why? Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 1196–1204.
Paivio, A., Yuille, J. C., & Madigan, S. A. (1968). Concreteness, imagery, and meaningfulness
values for 925 nouns. Journal of Experimental Psychology, 76, 1–25.
Pinker, S. (2014). The sense of style: The thinking person’s guide to writing in the 21st century.
London: Penguin.
Ratcliff, R., Clark, S. E., & Shiffrin, R. M. (1990). List-strength effect: I. Data and
discussion. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16, 163–178.
Rice, W. (1996). Sexually antagonistic male adaptation triggered by experimental arrest of
female evolution. Nature, 381, 232–234.
Romaine, S., Hogg, R. M., Blake, N. F., Lass, R., Algeo, J., & Burchfield, R. (1992). The
Cambridge history of the English language. Cambridge, UK: Cambridge University Press.
Romani, C., McAlpine, S., & Martin, R. C. (2008). Concreteness effects in different
tasks: Implications for models of short-term memory. The Quarterly Journal of Experimental
Psychology, 61, 292–323.
Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyesbraem, P. (1976). Basic
objects in natural categories. Cognitive Psychology, 8, 382–439.
Sadoski, M. (2001). Resolving the effects of concreteness on interest, comprehension, and
learning important ideas from text. Educational Psychology Review, 13, 263–281.
Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A Bayesian approach to
filtering junk e-mail. In Learning for text categorization: Papers from the 1998 workshop (Vol.
62, pp. 98–105).
Schwanenflugel, P. J., Harnishfeger, K. K., & Stowe, R. W. (1988). Context availability
and lexical decisions for abstract and concrete words. Journal of Memory and Language, 27,
499–520.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana, IL:
University of Illinois Press.
Sedaris, D. (1994). Barrel fever. New York: Little, Brown and Company.
Simon, H. A. (1971). Designing organizations for an information-rich world. In M. Green-
berge (Ed.), Computers, communications, and the public interest (pp. 37–52). Baltimore, MD:
Johns Hopkins Press.
Tokowicz, N., & Kroll, J. F. (2007). Number of meanings and concreteness: Consequences
of ambiguity within and across languages. Language and Cognitive Processes, 22, 727–779.
Trudgill, P. (2002). Sociolinguistic variation and change. Baltimore, MD: Georgetown
University Press.
Varian, H. R., & Lyman, P. (2000). How much information. University of California,
Berkeley School of Information Management & Systems Report. Retrieved from
www.sims.berkeley.edu/how-much-info.
Young, R. W., Morgan, W., & Midgette, S. (1992). Analytical lexicon of Navajo. Albuquerque,
NM: University of New Mexico Press.
Zipf, G. K. (1949). Human behavior and the principle of least effort. Oxford, UK:
Addison-Wesley Press.
13
DECISION BY SAMPLING
Connecting Preferences to Real-World
Regularities

Christopher Y. Olivola and Nick Chater

Abstract
Decision by sampling theory (DbS) offers a unique example of a cognitive science theory in
which the role of Big Data goes beyond providing high-powered tests of hypotheses: Within
DbS, Big Data can actually form the very basis for generating those hypotheses in the first
place. DbS is a theory of decision-making that assumes people evaluate decision variables,
such as payoffs, probabilities, and delays, by comparing them to relevant past observations
and experiences. The theory’s core reliance on past experiences as the starting point in the
decision-making process sets it apart from other decision-making theories and allows it to
form a priori predictions about the patterns of preferences that people will exhibit. To do so,
however, the theory requires good proxies for the relevant distributions of comparison values
(i.e. past observations and experiences) that people are likely to hold in memory. In this
chapter, we summarize the theory of DbS and describe several examples of Big Data being
successfully used as rich proxies for memory distributions that form the foundations of the
theory. We show how, using these Big Data sets, researchers were able to independently predict
(i.e. without fitting choice data) the shapes of several important psychoeconomic functions
that describe standard preference patterns in risky and intertemporal decision-making. These
novel uses of Big Data reveal that well-known patterns of human decision-making, such as
loss aversion and hyperbolic discounting (among others), originate from regularities in the
world.

Introduction
The study of human decision-making has made great strides over the past several
decades: The normatively appealing, but descriptively lacking, axiomatic theories
of expected utility maximization (e.g. von Neumann & Morgenstern, 1947) were
successfully challenged, and gave way to more behaviorally inspired approaches,
such as prospect theory (Kahneman & Tversky, 1979; Tversky & Kahneman,
1992), regret theory (Loomes & Sugden, 1982), and other variants (for reviews,
see Schoemaker, 1982; Starmer, 2000). Another major step forward has been the
development of dynamic theories, which attempt to capture, not just the output
of decision-making, but also the process of deliberation that ultimately leads to
observed choices (e.g. Busemeyer & Townsend, 1993; Usher & McClelland, 2004).
And there is no sign of slowing down: Even the past couple of years have witnessed
the birth of new decision-making theories (e.g. Bhatia, 2013; Dai & Busemeyer,
2014).
Yet, for all of the progress that has been made, most decision-making theories
remain fundamentally tethered to the utility-based approach—that is, they are built
on the core assumption (i.e. have as their starting point) that there are underlying
utility or “value” functions1 that govern our preferences, and thus our choices.
In this way, they have not really escaped the shadow of expected utility theory
(EUT; von Neumann & Morgenstern, 1947), which they sought to challenge
and replace. By comparison, there have been far fewer attempts to conceptualize
the decision-making process as being “free” of utility (or value) functions. Rare
examples include reason-based choice (Shafir, Simonson, & Tversky, 1993), query
theory (Appelt, Hardisty, & Weber, 2011; Johnson, Häubl, & Keinan, 2007; Weber
et al., 2007), and a few simple choice heuristics (e.g. Brandstätter, Gigerenzer, &
Hertwig, 2006; Thorngate, 1980).
More generally, nearly all utility-based and utility-absent theories have one
fundamental thing in common: They conceptualize the decision-making process as
an interaction between a pre-existing set of preferences (or decision rules) and the
attribute values of the choice options under consideration. The decision maker’s
past experiences and observations, however, are either totally absent from the
decision-making process (e.g. Brandstätter et al., 2006), or their role is limited
to a very short time-window (e.g. Gonzalez, Lerch, & Lebiere, 2003; Plonsky,
Teodorescu, & Erev, 2015), or they merely shape his or her beliefs about the
likelihoods of outcomes occurring (e.g. Fudenberg & Levine, 1998). By contrast,
the decision-maker’s extended past typically plays no role in shaping how the
attribute values of choice alternatives are evaluated.2
Thus, a striking feature of most models of decision-making is that each decision
occurs largely in a vacuum, so that only the most recent (if any) environmental cues
are brought to bear on the decision-making process. Indeed, it is typically assumed
that people evaluate the attributes of choice alternatives without reference to their
experience of such attributes in the past. Yet, even intuitively, this assumption
seems implausible. Consider, for example, the decision to purchase a car: Doing
so requires weighing and trading-off relevant attributes, such as fuel consumption
and safety features. However, we are not able to directly evaluate how different
absolute values of these attributes will impact our well-being (e.g. we are not
able to quantify the absolute boost in utility associated with a particular reduction
in the probability of a fatal collision). Instead, we often evaluate the attributes
of one car in relation to the attributes of other cars we have seen in the past.
Thus, we are positively disposed towards a car if it performs better in terms of fuel
consumption, acceleration, and safety than most other models we have previously
examined. In short, we evaluate attributes in comparative, rather than absolute,
terms. Consequently, the distribution of attribute values to which we compare
a given attribute will be crucial to our evaluations. If, for example, we have
been recently, or habitually, exposed to racing cars, then the fuel economy of
a typical sedan will seem remarkably impressive, while its acceleration will seem
lamentable. From this perspective, environmental distributions of attribute values
are of great importance in determining how people evaluate those attributes, as we
shall demonstrate in this chapter.
Utility-based theories also fail to offer a solid foundation for understanding
many general patterns in people’s preferences. The fact that most decision-makers
exhibit a diminishing sensitivity to payoff amounts, for example, is typically
modeled in a post hoc way (e.g. via the inclusion of a utility curvature parameter).
In other words, utility-based theories do not really explain patterns of preferences,
but instead attempt to parsimoniously re-describe these patterns in terms of
mathematical functions. This leads to a circular exercise, whereby decision theorists
first need to observe people’s preferences before they can infer the shapes of the
utility, probability weighting, and time discounting functions that are supposed to
account for those very preferences. Moreover, absent from these classic attempts to
infer utilities, probability weights, and discount rates from observed preferences, is
a theory that explains the origin of these preference patterns (or, equivalently, that
explains the shapes of utility, weighting, and discounting functions).

Breaking Free of Utility: Decision by Sampling


Decision by sampling (DbS) theory (Stewart, 2009; Stewart, Brown, & Chater,
2006; see also Kornienko, 2013) offers a novel framework for conceptualizing, not
only the decision-making process, but the very origin of people’s preferences—and
takes as its starting point the comparative nature of the cognitive machinery with
which the brain presumably constructs processes of judgment and decision-making.
At its core, DbS attempts to explain the pattern of preferences that people exhibit
for varying amounts (or magnitudes) of a given dimension—i.e. how they translate
objective outcome and attribute magnitudes into subjective values or weights.
The theory builds on fundamental psychological principles to explain how we
evaluate variables that are relevant to decision-making, such as monetary payoffs,
time delays, and probabilities of occurrence. According to DbS, the evaluation
process is governed by a few simple cognitive operations: To evaluate the size
or (un)desirability of a target outcome or attribute, we first draw upon a sample
of comparison values from our memory. Specifically, we sample from a mixture
of two memory sources to obtain a set of relevant exemplars for comparison:
(i) outcome or attribute values we have observed throughout our life, which
are stored in long-term memory, and (ii) outcome or attribute values that we
observe(d) in our recent and/or immediate context, which are stored in short-term
memory.3 Having drawn a sample of comparison values, we then compare the
target value to all those in the sampled set. Specifically, we carry out a series of
simple pairwise comparisons between the target outcome or attribute and each
comparison value. With each comparison, the cognitive decision-making system
simply determines whether the target value is larger than, equal to, or smaller than
the given comparison value. The system then tallies up the proportion of favorable
and unfavorable comparisons. The subjective magnitude we assign to the target
outcome or attribute is simply the proportion of pair-wise comparisons in which
it dominates (or ties),4 which is equivalent to its percentile rank among the sampled
events (i.e. the proportion of sampled comparison values that are smaller than or
equal to the target). Consequently, regardless of its actual (or absolute) magnitude,
the target value is considered large if it ranks above most sampled comparison values
(i.e. if it has a high percentile rank within the comparison sample), of medium size
if it ranks somewhere in the middle (i.e. if it is close to the median), and small if it
ranks below most of them (i.e. if it has a low percentile rank).
To illustrate how DbS theory works, consider a person who parks her car
by a meter but returns too late and finds that she has incurred a $40 parking
ticket. How unhappy would she be (i.e. in the language of economics: how
much disutility does she get) from losing $40? The answer, according to DbS,
is that it depends on the comparison values—in this case, monetary losses—that
she draws from her memory caches. Which comparison values are most likely to
come to her mind depends, in turn, on the distribution of values she holds in
memory. From her long-term memory cache, she might sample various bills she
has paid, previous parking tickets she has incurred, or other monetary losses and
payments that she has experienced in the past. Let’s assume her sampling system
draws the following comparison values from long-term memory: a $132.54 credit
card bill, a $30 parking ticket, and $52 that she lost playing poker with friends.
From her short-term memory cache, she might sample the $1.50 she recently lost
to a dysfunctional vending machine and the $3.50 she spent purchasing coffee on
her way back to her car. Thus, her comparison set would consist of the following
monetary losses (in ascending order of magnitude): $1.50, $3.50, $30.00, $52.00,
and $132.54. Within this comparison set, the $40 parking ticket is larger than three
of the five sampled values (i.e. it falls at the 60th percentile). In this case, therefore, the loss of $40 will
seem relatively large to her, and she will likely be quite upset. If, on the other
hand, she was used to experiencing much larger losses (e.g. if she were a heavy
gambler or big spender), she would have likely sampled larger comparison values,
so the $40 parking ticket would not seem as big of a loss and would thus be less
upsetting to her.5
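The evaluation step in this example reduces to a few lines of code. The sketch below is ours and simply takes the sampled comparison set as given; per DbS, the subjective magnitude is the proportion of sampled values that the target dominates or ties.

```python
def dbs_subjective_value(target, comparison_sample):
    """Proportion of sampled comparison values that the target dominates
    or ties, i.e. its percentile rank within the sample."""
    favorable = sum(1 for c in comparison_sample if c <= target)
    return favorable / len(comparison_sample)

# Parking-ticket example: losses sampled from long- and short-term memory.
sampled_losses = [1.50, 3.50, 30.00, 52.00, 132.54]
print(dbs_subjective_value(40.00, sampled_losses))  # 0.6, the 60th percentile
```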
The process described above can be repeated for other values (e.g. other
monetary loss amounts) that a person might be trying to evaluate. In fact, DbS
allows us to map out the relationship between each possible outcome or attribute
magnitude and its subjective value (or weight) in the eyes of the decision-maker (as
we will illustrate, in the next section). The result is a percentile function relating
each outcome/attribute to its corresponding percentile rank, and therefore to its
subjective value. This percentile function can be used to model and predict people’s
preferences and choices in the same way that utility (or value) functions are used
to predict decisions (Olivola & Sagara, 2009; Stewart, 2009; Stewart, Chater, &
Brown, 2006; Walasek & Stewart, 2015). The key difference between DbS and
utility-based theories is that in DbS the “value functions” emerge locally from the
interactions between the memory-sampling-plus-binary-comparison process and
the distributions of relevant magnitudes that one has experienced or observed over
time. For example, according to DbS, a person’s value function for financial gains
is essentially the function relating each monetary payoff she might encounter to
its percentile rank among all the sums of money that she has earned (or observed
others earning) in the past.
Critically, this implies that the “utility” a person ascribes to an outcome (e.g.
winning a particular sum of money) will be determined by her accumulated
experiences, as these govern the distribution of comparison magnitudes (e.g.
past monetary gains) that she can draw from. Therefore, if we can know (or
at least approximate) the distribution of relevant outcomes that a person has
observed in her lifetime then, using DbS, we can predict the typical shape of
her corresponding value function(s).6 The same logic applies for other attribute
values, such as the probabilities or timings of outcomes: DbS allows us to predict
the shapes of a person’s probability weighting and time-discounting functions
from the distributions of probabilities and delays she has previously observed and
experienced. This contrasts sharply with utility-based approaches, which have no
way of independently predicting a person’s preferences in advance (i.e. without
first observing some of their past choices). Of course, being able to test this
important distinguishing feature of DbS requires access to rich and representative
data concerning the occurrence of these outcomes or attributes. Fortunately, as
we’ll show, the growing availability of Big Data has made it possible for researchers
to estimate a variety of relevant distributions.
Like any theory, DbS rests on a few key assumptions—namely, that
decision-makers sample from their memory, engage in binary comparisons, and
finally tally up the proportion of favorable (versus unfavorable) comparisons. The
first assumption (memory sampling) is supported by evidence that humans and
other animals automatically encode and recall frequencies (Sedlmeier & Betsch,
2002). The second assumption (binary comparison) is supported by the simplicity
of the mechanism and by extensive evidence that people are much more adept at
providing relative, rather than absolute, judgments (Stewart et al., 2005). Finally,
the assumption that decision-makers tally up the proportion of (un)favorable
binary comparisons finds support in research showing that people are quite
good at estimating summary statistics, such as averages (Ariely, 2001) and ratios
(McCrink & Wynn, 2007). In sum, the core mechanisms underlying DbS are both
psychologically plausible and well supported by the cognitive psychology literature.

From Real-World Distributions to (E)valuations


As previously mentioned, one of the most appealing features of DbS, which forms
the core focus of this chapter, is that it allows us to make predictions about people’s
preferences before we collect any choice data. This is in stark contrast to most
utility-based theories, which require some choice data to be collected in order
to fit the parameters of the underlying value, probability weighting, and time
discounting functions (e.g. Gonzalez & Wu, 1999; Olivola & Wang, in press;
Prelec, 1998; Stott, 2006; Tversky & Kahneman, 1992; Wu & Gonzalez, 1996).
Specifically, as we explained above, the DbS framework allows us to model a
person’s value, weighting, and discounting functions; however, doing so requires
that we know, or can approximate, the distribution of relevant magnitudes that
he or she would have encountered throughout his or her life. Being able to
approximate these distributions would have been a daunting, if not impossible,
task for researchers throughout most of the twentieth century. Fortunately, the
recent growth of Big Data on the Internet and elsewhere has made this increasingly
feasible. In what follows, we describe several different examples of how Big Data
has been used to model people’s value functions for money and for human lives, as
well as their probability weighting and time discounting functions.

The Subjective Value of Monetary Gains and Losses


More than any other variable, the valuation of money has received considerable
attention from economists, decision theorists, and psychologists studying human
decision-making. There are several reasons for this. Standard economic theory
assumes that nearly every outcome or state of the world can be translated into an
equivalent monetary value (Hanemann, 1994; Porter, 2011; Viscusi & Aldy, 2003;
Weyl, in press). Moreover, even non-economists often implicitly buy into this
assumption when they ask their participants how much they would be willing to
pay to obtain desirable outcomes (or avoid undesirable ones), or the amount they
would need to be compensated to accept undesirable outcomes (or forgo desirable
ones) (e.g. Kahneman, Ritov, Jacowitz, & Grant, 1993; Olivola & Shafir, 2013;
Ritov & Baron, 1994; Stewart, Chater, Stott, & Reimers, 2003; Walker, Morera,
Vining, & Orland, 1999). All utility-based theories of decision-making assume
that decision-makers assign subjective values to monetary gains and losses, and this
relationship between the amount of money won (or lost) and its subjective value is
typically non-linear. One of these, prospect theory (Kahneman & Tversky, 1979;
Tversky & Kahneman, 1992), has been particularly successful at accounting for
much of the data observed in the laboratory and the field (e.g. Camerer, 2004).
According to prospect theory (and the copious evidence that supports it), people
treat monetary gains and losses separately and they exhibit a diminishing sensitivity
to both dimensions, such that initial gains (losses) have a larger impact on their
(dis)utility than subsequent gains (losses). Consequently, people tend to be risk
averse for monetary gains and risk seeking for monetary losses (Kahneman &
Tversky, 1979; Tversky & Kahneman, 1992). Another defining feature of prospect
theory is that the value function for losses is assumed to be steeper than the
one for gains, so that people derive more disutility from losing a given amount
of money than utility from winning that same amount. In other words, losing
an amount of money (e.g. –$50) typically feels worse than winning the same
amount (e.g. +$50) feels good. These various assumptions are beautifully and
succinctly represented in prospect theory’s S-shaped and kinked value function
(see Figure 13.1(a)). Although it is descriptively accurate, the prospect theory
value function fails to explain why or how people come to perceive monetary
gains and losses as they do. Fortunately, DbS offers a theoretical framework for not
only predicting, but also explaining these patterns of behaviors. As Stewart et al.
(2006) showed, the diminishing sensitivity that people exhibit for monetary gains
and losses, as well as the tendency to react more strongly to losses than equivalent
gains, emerge from the relevant distributions of values in the world, as predicted
by DbS.

Monetary Gains
DbS assumes that people evaluate the utility (subjective positive value) of a
monetary gain by comparing it to other monetary gains they have experienced
or observed. These comparison values could be past sums of money that they
received or won (e.g. previous earnings or lotteries wins), past sums of money
that they observed others winning (e.g. learning about a colleague’s salary), or
other sums of money that are currently on offer (e.g. when given a choice
between several different payoffs). Together, these various comparison values form
a distribution in memory that a decision-maker samples in order to evaluate the
target payoff. The values that people ascribe to monetary gains therefore depend on
the distribution of earnings and winnings that they typically observe. To estimate
the shape of this distribution, Stewart et al. (2006) analyzed a random sample of
hundreds of thousands of bank deposits (i.e. money that people added to their
own accounts) made by customers of a leading UK bank. Unsurprisingly, the
sums of money that people receive (and deposit into their accounts) follow a
power-law like function, such that small deposits are more frequent than large
ones (Figure 13.1(c)). Consequently, the percentile function for monetary gains
FIGURE 13.1 The value function for monetary gains and losses. (a) Shows the standard S-shaped value function from prospect theory, with a
steeper curve for monetary losses. (b) Shows the value functions (black dots) for gains (top-right quadrant) and losses (bottom-left quadrant)
predicted by DbS. These predictions are derived from data on the occurrence frequencies of bank deposits (c) and bank withdrawals (d),
reported in Stewart et al. (2006). The grey dots in the top-right quadrant of (b) represent a 180° rotation of the DbS predictions for
monetary losses (bottom-left quadrant), and show that a steeper curve for monetary losses (compared to monetary gains) emerges naturally
from the frequency distributions in (c) and (d).
is concave (Figure 13.1(b)), implying diminishing sensitivity and risk-aversion for monetary gains.

Monetary Losses
The same logic applies to monetary losses. DbS assumes that people evaluate the
disutility (subjective negative value) of a monetary loss by comparing it to other
monetary losses they have previously experienced or observed. These comparison
values could be past payments they have made (e.g. previous purchases or debts
paid), past sums of money that they observed others losing (e.g. learning about
the sum that someone lost to a friendly bet), or other potential losses under
consideration (e.g. several different ways to pay a bill). As with monetary gains,
these various (negative) comparison values form a distribution in memory that a
decision-maker can sample in order to evaluate the seriousness of a given loss.
The disutilities that people ascribe to monetary losses therefore depend on the
distribution of costs and payments that they typically observe. To estimate the
shape of this second distribution, Stewart et al. (2006) analyzed a random sample
of more than one million bank debits (i.e. money that people withdrew from
their own accounts) made by the same population of UK bank customers. As
with gains, the sizes of payments that people make (and withdraw from their
accounts) follow a power-law like function, such that small payments are more
frequent than large ones (Figure 13.1(d)). Consequently, the percentile function for
monetary losses is concave when plotted against disutility and convex when plotted
against utility (Figure 13.1(b)), implying diminishing sensitivity and risk-seeking
preferences for monetary losses. As another proxy for the monetary losses that
people typically experience and observe, Stewart and his colleagues (Stewart &
Simpson, 2008; Stewart et al., 2006) looked at the distribution of prices for various
goods. Across a wide variety of goods, prices followed similar distributions, such
that cheaper products were more frequent than their more expensive counterparts.
Thus, sampling prices to generate the comparison set would also lead to convex
utility (concave disutility) evaluations.

Monetary Gains versus Losses


Critically, the bank transaction data showed that small losses (debits) were
proportionally more frequent than small gains (deposits), while large losses
were proportionally less frequent than gains of the same (absolute) magnitude
(Figure 13.1(c) versus 13.1(d)). This makes sense given the way spending and
earnings tend to play out in real life: We frequently make small purchases (and thus
experience lots of small losses), whereas many of our earnings tend to come in
big chunks (e.g. birthday checks, monthly salaries, etc.). Thus, even though small
monetary gains or losses are more frequent than large ones, the inverse relationship
between magnitude and frequency (or likelihood) of occurrence is stronger for
losses than it is for gains. As a result, the percentile function is steeper for losses
than it is for gains (Figure 13.1(b)), which explains the origin of loss aversion.
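The logic of these three subsections can be illustrated with a short simulation. The sketch below builds DbS "value functions" as empirical percentile (rank) functions over simulated transaction amounts; the heavy-tailed distributions and their parameters are stand-ins chosen purely for illustration, not the bank data analysed by Stewart et al. (2006).

```python
import numpy as np

def percentile_function(observed):
    """Map an amount to its percentile rank (its DbS subjective value)
    within the distribution of observed comparison amounts."""
    observed = np.sort(np.asarray(observed, dtype=float))
    def rank(amount):
        return np.searchsorted(observed, amount, side="right") / observed.size
    return rank

rng = np.random.default_rng(0)
# Simulated stand-ins: small amounts dominate both distributions, but losses
# (debits) are skewed towards small amounts more strongly than gains (credits).
gains = rng.pareto(1.2, 100_000) * 10
losses = rng.pareto(1.5, 100_000) * 5

value_of_gain = percentile_function(gains)
value_of_loss = percentile_function(losses)
for amount in (10, 50, 100, 500):
    print(amount, round(value_of_gain(amount), 3), round(value_of_loss(amount), 3))
# Both rank functions are concave in the amount, and the loss function rises
# more steeply over small amounts, mirroring the DbS account of loss aversion.
```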

The Subjective Value of Human Lives


Research shows that the valuation of human lives follows a similar pattern to
that observed for money (Figure 13.2(a)): The disutility people assign to human
fatalities and the utility they assign to saving lives do not increase linearly with the
number of lives at risk, but instead exhibit diminishing sensitivity (or diminishing
marginal [dis]utility) (Olivola, 2015; Slovic, 2007). As a result, people tend to
be risk seeking when they focus on preventing human fatalities but risk averse
when they focus on saving lives (Tversky & Kahneman, 1981). Moreover, there is
also evidence suggesting that people may be loss averse for human lives (Guria,
Leung, Jones-Lee, & Loomes 2005; McDaniels, 1992). These features of the
(dis)utility function that characterize the valuation of human lives have a number
of important implications for the way people and policy-makers perceive and
respond to potential large-scale deadly events, such as disasters and wars, or
to frequent small-scale mortality risks, such as auto-accidents and fires (Olivola,
2015).
Inspired by Stewart et al.’s (2006) approach, Olivola and Sagara (2009) used
DbS to explain how human lives are valued; in particular, the disutility (or
psychological shock) people experience when they hear or read about human
fatalities caused by wars, epidemics, accidents, etc. According to DbS, people
evaluate the disutility (psychological shock) of a particular death toll by comparing
it to other deadly events they have previously observed. The news media, for
example, provide a steady stream of information about the loss (or rescue)
of human lives during armed conflicts and natural disasters. People can also
learn about these events from reading books or talking to family, friends, and
colleagues. Every death toll they learn about provides a comparison value
and, together, these comparison values form a distribution in memory that
people can later sample to evaluate new events. Thus, the utility that people
ascribe to saving a given number of human lives depends on the distribution
of lives saved that they have observed (in the past) and hence can sample from
memory; similarly, the disutility they ascribe to a given death toll depends on the
distribution of fatalities that they had previously observed and can thus sample
from.
Olivola and Sagara used several sources of data to model these distributions:
(1) Data on the actual numbers of fatalities associated with natural and industrial
disasters; (2) the quantity of hits returned by Google searches for news articles
reporting the numbers of human lives lost or saved in dangerous events; and (3) the
responses of participants who were asked to recall eight deadly events that they had
heard or read about, and to estimate the number of fatalities associated with each

FIGURE 13.2 The value function for lives saved and lost. (a) Shows the standard S-shaped value function from prospect theory, with a steeper
curve for lives lost. (b) Shows the value functions (black dots) for lives saved (top-right quadrant) and lives lost (bottom-left quadrant)
predicted by DbS. These predictions are derived from data on the frequency of media reporting of events involving human lives saved
(c) and lives lost (d), reported in Olivola and Sagara (2009). The grey dots in the top-right quadrant of (b) represent a 180◦ rotation of the
DbS predictions for lives lost (bottom-left quadrant), and show that a steeper curve for lives lost (compared to lives saved) emerges naturally
from the frequency distributions in (c) and (d).
Decision by Sampling 305

one. Olivola and Sagara showed, in line with the predictions of DbS, that people’s
diminishing sensitivity to human lives emerges from the distributions of human
fatalities (or lives saved) that they learn about from the news, reading books, talking
to their friends and colleagues, etc.

Human Fatalities
To proxy the distribution of death tolls that people are likely to observe and thus
hold in memory, Olivola and Sagara (2009) looked at three types of data. First,
they examined a large archival dataset that tracked the occurrence of epidemics,
natural disasters, and industrial disasters, and recorded each event’s associated death
toll. This provided a proxy for the frequency of death tolls, as they actually occur.
Next, they iteratively queried Google News Archives—a massive online repository
of published news stories—for news articles about human fatalities. Specifically,
they counted the number of articles reporting a given number of fatalities, using
search terms that specified the number of fatalities and contained keywords related
to deaths (e.g. “3 died,” “4 killed,” etc.). The resulting counts of “hits” provided a
proxy for the relative frequency with which various death tolls are reported in the
media (Figure 13.2(d)). Finally, Olivola and Sagara asked a sample of respondents
to recall eight past deadly events (the first eight that came to mind) and to estimate
each one’s death toll. These recollections provided a proxy for the distribution
of death tolls that people hold in memory and can access. It turns out that
all three distributions follow a similar power-law-like pattern (Olivola & Sagara,
2009): The larger the death toll (actual, reported, or recalled), the less frequent it
was. Consequently, the percentile functions for all three distributions are concave,
implying a diminishing sensitivity to human fatalities and a preference for higher
variance intervention options that offer a chance of preventing a greater number
of fatalities but also risk failing.
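As a rough illustration of the mechanics involved, the sketch below (ours, not the
authors' code, and using invented hit counts rather than the actual archival or
Google News Archives data) shows how counts of reported death tolls can be turned
into a DbS-style percentile function, and how the concavity of that function
expresses diminishing sensitivity:

```python
# Minimal sketch (illustrative numbers only): turning counts of reported death
# tolls into a DbS-style percentile function.
import numpy as np

death_tolls = np.array([1, 2, 3, 5, 10, 20, 50, 100, 500, 1000])
hit_counts = np.array([90000, 41000, 26000, 14000, 6200,
                       2700, 950, 420, 70, 30], dtype=float)  # hypothetical counts

# Percentile of each death toll = proportion of reported tolls at or below it
percentile = np.cumsum(hit_counts) / hit_counts.sum()

# Diminishing sensitivity: the rise in percentile per additional fatality shrinks
# as tolls get larger, so the function is concave over the observed range.
slopes = np.diff(percentile) / np.diff(death_tolls)
print(np.round(percentile, 3))
print(bool(np.all(np.diff(slopes) < 0)))  # True for these power-law-like counts
```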
Olivola, Rheinberger, and Hammitt (2015) also looked at the distributions of
death tolls produced by low-magnitude, frequent events; namely, auto-accidents
and avalanches. Specifically, they looked at three sources of data: (i) government
statistics on the occurrence of fatalities caused by auto-accidents (or avalanches);
(ii) the frequencies of news stories reporting deaths caused by auto-accidents (or
avalanches); and (iii) the responses of participants who were asked to estimate the
percentages of auto-accidents (or avalanches) that cause a given death toll. Again,
these proxy distributions all followed a power-law-like pattern: The larger the
death toll, the smaller its frequency (actual, reported, or estimated). Consequently,
the percentile function for smaller-scale, frequent events is also concave, implying a
diminishing sensitivity and a preference for risky rescue strategies when it comes to
preventing fatalities in the context of auto-accidents and avalanches (Olivola et al.,
2015).
Lives Saved
Compared to human fatalities, finding a proxy for the distribution of lives saved is
considerably trickier since there are few good statistics available and the media tends
to focus on the loss of lives associated with deadly events. However, Olivola and
Sagara (2009) were able to obtain such a proxy using the Google News Archives
database by modifying their search terms to focus on lives saved (e.g. “3 saved,” “4
rescued,” etc.) rather than lives lost. Doing so allowed them to capture news stories
that reported on the numbers of lives saved during (potentially) deadly events.
The resulting distribution also follows a power-law-like function (Figure 13.2(c)):
There are more news stories reporting small numbers of lives saved than there are
reporting large numbers of lives saved. The resulting percentile function is thus
concave (Figure 13.2(b)), implying a diminishing sensitivity (i.e. diminishing joy)
and an aversion to risky rescue strategies when it comes to saving human lives.

Lives Saved versus Lost


A comparison of the percentile distributions for news stories reporting the loss
versus saving of human lives reveals an asymmetry akin to what has been found for
monetary gains and losses: Although news stories were more likely to report a small
number of lives saved or lost, than large numbers of either, the inverse relationship
between the number of lives involved and the frequency (or likelihood) of news
stories is stronger for lives lost than it is for lives saved. As a result, the percentile
function is steeper for losses than it is for gains (Figure 13.2(b)), producing loss
aversion in the domain of human lives (Guria et al., 2005; McDaniels, 1992). This
implies, for example, that people will reject a medical or policy intervention that
might, with equal likelihood, cost 100 lives (e.g. if the intervention has deadly
side-effects) or save 100 lives (e.g. if it successfully prevents an epidemic). The fear
of losing 100 lives will outweigh the attractiveness of the possible 100 lives saved,
in people’s minds, due to this loss aversion.

The Weighting of Probabilities


Another important dimension of decision-making involves the perception and
interpretation of probabilities (when these are known or estimated). Rationally
speaking, decision-makers should simply weight outcomes (or their associated
utilities) by their assumed probabilities of occurrence. This implies that probabilities
should be perceived or “weighted” in a linear, one-to-one fashion. However,
prospect theory, and the evidence that supports it, suggests that the perception
of probabilities is not linear, but rather follows an inverse S-shaped function
(Figure 13.3(a)), whereby small probabilities are overweighted, large probabilities
are underweighted, and people are least sensitive to changes in medium
probabilities (i.e. the 0.2 to 0.8 range of probability values) (Gonzalez & Wu,
1999; Prelec, 1998; Tversky & Kahneman, 1992; Wu & Gonzalez, 1996;
although see Stott, 2006).

FIGURE 13.3 The probability weighting function. (a) Shows the standard inverse
S-shaped probability weighting function from prospect theory. (b) Shows the
probability weighting function (black dots + connecting grey lines) predicted by DbS.
These predictions are derived from data on the usage frequency of likelihood terms
(c), reported in Stewart et al. (2006).

The non-linear weighting of probabilities incorporated
in prospect theory accurately predicts a number of tendencies in decisions under
risk (Camerer & Ho, 1994; Gonzalez & Wu, 1999; Kahneman & Tversky, 1979;
Tversky & Kahneman, 1992; Wu & Gonzalez, 1996), yet fails to explain the
origin of this tendency. DbS explains this pattern in terms of the distribution of
probability-related terms that people are typically exposed to in the real world. We
say “probability-related” terms because most people are less frequently exposed to
probabilities and other numerical expressions of likelihood (e.g. “0.2 probability,”
“50/50 odds,” “100% chance”) than they are to verbal descriptors that denote
likelihoods (e.g. “unlikely,” “possible,” “certain”). Therefore, one needs to consider
the relative frequency of verbal terms that convey different likelihoods when trying
to proxy the distribution of probability magnitudes that people can draw from to
evaluate a given probability of occurrence.
To find a proxy for this distribution, Stewart et al. (2006) searched the British
National Corpus (BNC) for verbal terms that people typically use to communicate
308 C. Y. Olivola and N. Chater

likelihoods (e.g. “small chance,” “maybe,” “likely,” etc.; see Karelitz & Budescu,
2004). The BNC provides a large corpus of spoken and written English words and
phrases, along with their frequencies of usage. Stewart et al. were therefore able
to estimate the occurrence frequency of each likelihood term. Next, following
the approach used in previous studies (Beyth-Marom, 1982; Budescu & Wallsten,
1985; Clarke, Ruffin, Hill, & Beamen, 1992; Reagan, Mosteller, & Youtz,
1989; for a review, see Budescu & Wallsten, 1995), they asked a sample of
40 participants to translate these verbal likelihood terms into their equivalent
numerical probabilities. This was done to identify the probability magnitudes
that people typically associate with each verbal term. For example, the median
participant in their study judged the word “likely” to indicate a 70 percent
chance of occurrence. The translation of likelihood terms into their equivalent
probabilities allowed Stewart et al. to estimate the frequencies with which people
(implicitly) refer, and are thus exposed, to various probability magnitudes in their
day-to-day lives (Figure 13.3(c)).
When Stewart et al. (2006) plotted the cumulative usage frequencies (i.e.
percentile ranks within the BNC) of probability magnitudes that people associate
with verbal likelihood terms they obtained an inverse S-shaped curve that
mimics the basic features of the prospect theory probability weighting function
(Figure 13.3(b)). This resemblance, between the predictions of DbS and prospect
theory, was more than just qualitative: Stewart et al. fit their data to a commonly
used single-parameter probability weighting function and found that the resulting
estimate (β = 0.59) was close to previous ones obtained from risk preference
data (β = 0.56, 0.61, 0.71, reported by Camerer & Ho, 1994; Tversky &
Kahneman, 1992; Wu & Gonzalez, 1996, respectively). In other words, likelihood
terms that denote very small or very large probabilities are more commonly used
than those that denote mid-range probabilities. As a result, people are more
frequently exposed, and therefore more sensitive, to extreme probabilities than
they are to most medium-sized probabilities. Thus, DbS can explain the classic
pattern of probability weighting that has been observed in the literature without
resorting to a priori assumptions about the shape of an underlying probability
weighting function. Specifically, it suggests that human perceptions of probability
magnitudes, and the peculiar inverse S-shape that they seem to follow, are
governed by the distribution of probability terms that people are exposed to
in their daily lives. People are more sensitive to variations in extremely low or
high probabilities (including departures from certainty and impossibility) because
those are the kinds of likelihoods they most frequently refer to and hear (or read)
others refer to. By contrast, the general hesitance (at least within the Western,
English-speaking world) to communicate mid-range probabilities means there is
far less exposure to these values, leading to a diminished sensitivity outside of the
extremes.
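The logic of this analysis can be sketched in a few lines of code. The numbers below
are invented for illustration (they are not the BNC frequencies or the translations
collected by Stewart et al.), and the one-parameter function shown is one common
choice from the literature (Tversky & Kahneman, 1992), included only for comparison:

```python
# Minimal sketch (illustrative numbers only): a usage-weighted percentile
# function for the probability magnitudes implied by verbal likelihood terms.
import numpy as np

# Hypothetical (term, median numeric translation, corpus usage frequency) triples
terms = [("no chance", 0.00, 3500), ("unlikely", 0.20, 5200),
         ("possible", 0.45, 9100), ("fifty-fifty", 0.50, 800),
         ("likely", 0.70, 7400), ("certain", 1.00, 6000)]
probs = np.array([p for _, p, _ in terms])
freqs = np.array([f for _, _, f in terms], dtype=float)

order = np.argsort(probs)
dbs_weight = np.cumsum(freqs[order]) / freqs.sum()  # percentile rank of each magnitude

def one_param_weight(p, gamma=0.6):
    """A common single-parameter weighting function (Tversky & Kahneman, 1992)."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

for p, w in zip(probs[order], dbs_weight):
    print(f"p = {p:.2f}  DbS percentile = {w:.2f}  weight(gamma=0.6) = {one_param_weight(p):.2f}")
```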
The Perception of Time Delays


Decision-makers are regularly faced with tradeoffs between various consumption
opportunities that occur at different points in time (Read, Olivola, & Hardisty,
in press). Examples of such intertemporal tradeoffs include: Spending money
now versus investing it in order to have more money later; consuming delicious
but unhealthy foods now versus enjoying greater health in the future; enduring
a slightly painful visit to the dentist this week to prevent tooth decay versus
undergoing a much more painful dental procedure in a few years when the
tooth decay has set in. Navigating these intertemporal tradeoffs requires being
able to properly weight outcomes by their delay periods, in much the same way
that probabilistic outcomes should be weighted by their associated likelihoods. In
the case of outcomes occurring over time, the expected pleasures (or pains) of
anticipated future outcomes should be discounted relative to similar outcomes that
would occur in the present. The further into the future an outcome is expected
to occur, the more it should be discounted. There are many reasons to discount
delayed outcomes relative to immediate ones (Soman et al., 2005), including the
greater uncertainty associated with outcomes in the future, the ability to invest
immediate outcomes and earn interest, and the natural preference that most people
have to speed up pleasurable outcomes and delay undesirable ones. According to
standard economic theory, rational decision-makers should discount outcomes at
an exponential rate, as doing so guarantees consistency in preference orderings
over time. However, numerous studies have shown that people typically violate
the assumption of exponential discounting, by exhibiting strong inconsistencies
in their consumption preferences over time. To accommodate these intertemporal
preference inconsistencies, psychological and behavioral economic theories have
instead assumed that people discount delayed outcomes according to a hyperbolic
or quasi-hyperbolic function (Frederick, Loewenstein, & O'Donoghue, 2002;
Olivola & Wang, in press; Read, 2004) (Figure 13.4(a)). As with prospect
theory, however, these alternative functional forms do little more than describe
the patterns of preferences that have been observed in research studies and they
fail to explain the origin of (quasi-)hyperbolic discounting or the underlying
psychological processes that govern the perception of delays.
According to DbS, people evaluate the size of a delay (e.g. how “far away”
next month seems) by comparing it to a set of exemplar delays sampled from
memory. These sampled delays, in turn, are drawn from the distribution of delays
that people have previously experienced, considered experiencing, or observed
others experiencing. To proxy the distribution of delays that people are typically
faced with, Stewart et al. (2006) repeatedly queried the Google search engine
for terms indicating various delay lengths (e.g. “1 day,” “1 year,” etc.) and used
the total number of “hits” associated with each delay length as a measure of its
relative frequency of occurrence (Figure 13.4(c)).

FIGURE 13.4 The time discounting function. (a) Shows a standard hyperbolic
time-discounting function. (b) Shows the time discounting function (black dots)
predicted by DbS. These predictions are derived from data on the frequency with
which delay lengths are referenced on the Internet (c), reported in Stewart et al.
(2006).

When they then plotted the resulting cumulative percentile function for delay
lengths, they found that it is close to hyperbolic in shape (Figure 13.4(b)). Thus,
DbS is able to predict, and also explain, how people perceive time delays.
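One informal way to see why this happens (a back-of-the-envelope argument of ours,
not a derivation reported by Stewart et al.) is to note that if mentions of delays of
length d occur with a power-law-like density f(d) proportional to d^(-alpha) for
d at or above some minimum delay and alpha > 1, then the proportion of sampled
delays longer than d is

$$
1 - F(d) \;=\; \left(\frac{d_{\min}}{d}\right)^{\alpha - 1},
$$

which for alpha near 2 falls off roughly as 1/d, the same long-delay behavior as the
hyperbolic discount curve V(d) = 1/(1 + kd). Reading the discount weight on a delayed
outcome as the proportion of sampled delays that exceed its delay is only a simplified
gloss on the DbS prediction, and alpha, d_min, and k here are illustrative symbols
rather than estimates.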

Recap: Using Big Data to Explain Preference Patterns


This chapter illustrates how researchers can utilize Big Data to provide rich
proxies for people’s past experiences and thus the contents of their memory.
This knowledge can then be used to model the predictions of (long-term)
memory-based theories, such as decision by sampling (DbS), which assumes that
people evaluate outcomes, probabilities, and delays by comparing them to relevant
past values (i.e. outcomes, probabilities, and delays that they have previously
experienced or observed). Decision by sampling’s core reliance on past experiences
as the starting point in the decision-making process sets it apart from other
decision-making theories. Rather than assuming that people have underlying
value, weighting, and discounting functions that drive their preferences, DbS
actually specifies the origin of people’s preference evaluations, both in terms of
their sources (the distributions of events that people observe and experience in
their lives) and their underlying mechanisms (a combination of memory sampling
and pairwise binary comparisons). Consequently, and in stark contrast to most
choice theories, DbS can be used to form a priori predictions about the patterns
of preferences that people will exhibit—that is, without collecting any choice data.
All one needs is a good proxy for the relevant distribution of comparison values
(i.e. past observations and experiences) that a typical person is likely to hold in
memory.
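In symbols (restating Note 4, under the convention that ties count as smaller than or
equal to the target), DbS predicts that the subjective value of a target outcome x,
given N comparison values x_1, ..., x_N sampled from memory, is simply

$$
v(x) \;=\; \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[x_i \le x\right] \;=\; \hat{F}_N(x),
$$

the empirical cumulative distribution function of the sampled comparison values.
This is why, to a first approximation, the curves predicted by DbS in Figures 13.1 to
13.4 are cumulative (percentile) functions of the relevant proxy data.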
Here, we illustrated this approach by showing how, using a variety of Big
Data sources, we and our colleagues were able to identify the predictions of DbS
and, moreover, demonstrate that these predictions closely follow the shapes of
important psychoeconomic functions that describe standard preference patterns
in risky and intertemporal decisions. Stewart et al. (2006) used a random (and
anonymized) dataset of bank deposits and withdrawals to estimate the shapes of
the value functions for monetary gains and losses, predicted by DbS. They found
that the resulting figure is strikingly similar, in shape, to the classic prospect theory
value function (Figure 13.1). Consequently, DbS can explain why people exhibit
diminishing sensitivity to monetary gains and losses (leading them to be risk averse
for monetary gains and risk seeking for monetary losses), and also why they are
loss averse for monetary outcomes. Next, Stewart et al. estimated the shape of the
probability weighting function predicted by DbS, by querying the British National
Corpus (BNC) for phrases indicating varying levels of probabilistic belief (e.g.
“no chance,” “extremely doubtful,” “fifty-fifty chance,” “possible,” “likely,” etc.).
The resulting figure resembles the inverse S-shaped function of prospect theory
(Figure 13.3). Thus, DbS can explain how people evaluate probabilities, and why
they tend to overweight small probabilities and underweight large probabilities (in
decisions from description—see Hertwig, 2012). Finally, Stewart et al. estimated
the shape of the time-discounting function predicted by DbS, by querying the
Internet for delay-related terms (e.g. “1 day,” “1 year,” etc.). The resulting figure
is approximately hyperbolic in shape (Figure 13.4), which means that DbS can
explain how people perceive time delays, and why they tend to discount outcomes
at a hyperbolic (rather than exponential) rate.
Moving beyond money, probabilities, and time, Olivola and Sagara (2009)
examined the predictions of DbS concerning the way people evaluate the loss
(or saving) of human lives. To do so, they used two sources of naturally occurring
data to proxy the distribution of death tolls that people would be exposed to. First,
to estimate the actual distribution of death tolls, they analyzed archival data on the
death tolls resulting from past large-scale natural and industrial disasters. Second, to
estimate the amount of media attention given to various death tolls, they repeatedly
queried the Google News Archives search engine to calculate the number of news
articles reporting a specified death toll (e.g. articles with titles such as “4 died” or
“5 killed”). They found that, regardless of which distribution they used to proxy
people’s past observations (the real-world distribution or the distribution of news
reporting), the resulting disutility function predicted by DbS implied a diminishing
sensitivity to human fatalities, in line with existing evidence (Olivola, 2015; Slovic,
2007; Tversky & Kahneman, 1981). They also queried Google News Archives to
calculate the number of news articles reporting a specified number of lives saved
during a (potentially) deadly event (e.g. articles with titles such as “4 saved” or
“5 rescued”), and they found that DbS predicted a diminishing sensitivity when it
comes to lives saved, but also that the resulting curvature was less steep than the
one for lives lost. Taken together, the percentile ranks obtained from the Google
News Archives queries produced an S-shaped function with all the key properties
of prospect theory (Figure 13.2). Thus, DbS was shown to successfully predict,
and explain, how people perceive and respond to the prospect of losing or saving
human lives. Importantly, Olivola and Sagara (2009) discovered that the shapes
of the real-world and media-reported distributions of human fatalities were not
only similar to each other: They were also similar to the distribution of death tolls
that people recall from memory (when asked to do so). Their studies therefore
demonstrated an important correspondence between (i) the relative frequencies
with which death tolls actually occur in the real world, (ii) the relative frequencies
with which they are reported in the news, and (iii) the likelihoods that they
are recalled from memory. This suggests that the memory samples people would
draw from to evaluate death tolls resemble the real-world distributions of human
fatalities.
In sum, different sources of Big Data have allowed us to find proxies for
the distribution of outcomes that people might observe and/or experience
across a variety of dimensions, such as payoffs, probabilities, delays, and human
fatalities. By “feeding” these data into the DbS “engine,” we were not only
able to examine that theory’s predictions; doing so also showed that many of
the “psychoeconomic” functions (Stewart et al., 2006) that describe regularities
in human valuation and decision-making can actually be explained as emerging
from the interactions between real-world regularities and a set of simple cognitive
operations (memory sampling and binary comparisons) designed to evaluate
magnitudes. Thus, well-known regularities in human decision-making (e.g. risk
aversion for gains, loss aversion, etc.) seem to originate from regularities in the
world.

Causality and Coincidence


In keeping with the theme of this book, we have focused our discussion on
correlational studies that utilized Big Data sets to test the predictions of DbS
and demonstrate the relationship between real-world regularities and people’s
preference patterns. However, two related concerns about the conclusions we have
drawn (that DbS’s predictions are well supported and that real-world regularities
causally contribute to people’s preferences) need to be addressed.
One concern is that the direction of the relationship between the regularities
we observe and people’s preferences might actually be reversed. For example, it
might be the case that people’s preferences for monetary gains and losses shape
their spending and savings decisions, rather than the other way around. At least
two pieces of evidence attenuate this concern. First, this reverse causality is harder
to argue in the case of death tolls from natural disasters (i.e. it seems implausible that
people’s preferences would determine the magnitudes of natural disasters). As such,
reverse causality of this sort fails to explain why distributions of death tolls predict
various features of the value functions for human lives (e.g. Study 1A in Olivola &
Sagara, 2009). Second, and perhaps more to the point, several studies have shown
that experimentally manipulating the relevant distribution(s) people are exposed
to alters their preferences in the direction(s) predicted by DbS (e.g. Walasek &
Stewart, 2015; Study 2 in Olivola & Sagara, 2009). In other words, there also exists
causal evidence in support of DbS (and the hypothesis that real-world distributions
can impact preferences), although we did not focus on these studies in this chapter
since they did not utilize Big Data.
A second concern is that most of the Big Data distributions that have been
used (so far) to test the predictions of DbS share an important common feature:
An inverse relationship between event frequency and event magnitude. Therefore,
one could speculate that DbS is benefitting from the (merely coincidental) fact that
taking the cumulative of these kinds of distributions yields curvatures that would
predict a diminishing relationship between objective outcomes and subjective
evaluations. Such a coincidence could allow the theory to correctly predict
that people will be risk averse for gains, risk seeking for losses, and exhibit
hyperbolic time preferences. Again, several pieces of evidence attenuate this second
concern. First, this coincidence fails to explain why DbS successfully predicts loss
aversion (Olivola & Sagara, 2009; Stewart et al., 2006). Second, not all of the
distributions we examined exhibited the inverse frequency–magnitude relationship.
In particular, the usage frequency of likelihood terms was not inversely related to
their magnitude; indeed, had that been the case, DbS would not predict the inverse
S-shaped probability weighting function. Third, as Olivola and Sagara (2009)
demonstrated (in their Study 3), the predictions of DbS go beyond qualitative
statements about people’s broad preference tendencies. Specifically, they compared
the distributions of death tolls across several different countries and found that DbS
successfully predicted variations in the extent to which the people in each country
were risk seeking when it came to choosing between different outcomes involving
human fatalities. In doing so, Olivola and Sagara clearly showed that DbS’s capacity
to predict patterns of preferences (and even, in this case, cross-country differences
in these patterns) goes well beyond the mere fact that event frequencies are
often inversely related to their magnitudes. In sum, the ability to make nuanced
quantitative predictions with DbS undermines concerns that its predictive power
mainly derives from a general property of all real-world distributions.
Conclusion
The rapid growth and accessibility of Big Data, over the last decade, seems to hold
enormous promise for the study of human behavior (Moat, Olivola, Chater, &
Preis, 2016; Moat, Preis, Olivola, Liu, & Chater, 2014). Indeed, a steady stream of
studies has demonstrated creative uses of Big Data sources, such as predicting
human behavior on a large scale (e.g. Choi & Varian, 2012; Ginsberg et al., 2009;
Goel, Hofman, Lahaie, Pennock, & Watts, 2010), studying network dynamics
(e.g. Calcagno et al., 2012; Szell, Lambiotte, & Thurner, 2010), verifying the
robustness of empirical laws (e.g. Klimek, Bayer, & Thurner, 2011; Thurner,
Szell, & Sinatra, 2012), or providing new macro-level indices (e.g. Noguchi,
Stewart, Olivola, Moat, & Preis, 2014; Preis, Moat, Stanley, & Bishop, 2012;
Saiz & Simonsohn, 2013). However, the contribution of Big Data to cognitive
science has been noticeably smaller than in other areas. One likely reason is that
most existing Big Data sets within the social sciences (with the exception of brain
imaging data) tend to focus on human behaviors and thus lack variables related to
mental processes, making it difficult to form insights about cognition. Decision
by sampling theory (DbS) therefore offers an exceptional example of a theoretical
framework in which Big Data don’t merely provide high-powered tests of existing
hypotheses; they form the basis for developing the hypotheses in the first place.

Acknowledgments
The authors would like to thank Mike Jones for his guidance and support as we
prepared our manuscript, and two anonymous reviewers for useful suggestions that
helped us further improve it. We also thank Neil Stewart and Rich Lewis for
providing us with the data for Figures 13.1, 13.3, and 13.4. Finally, we thank
Aislinn Bohren, Alex Imas, and Stephanie Wang for helpful feedback on the
discussion of economic theories that incorporate experience and memory.

Notes
1 Most dynamic models of decision-making, such as Decision Field Theory
(Busemeyer & Townsend, 1993) and the leaky competing accumulator model
(Usher & McClelland, 2004), also assume the existence of one or more
(typically non-linear) functions that transform objective attribute values into
subjective evaluations. As such, even these dynamic theories could be
considered “utility”-based approaches, to some extent, since they require
utility-like transformations to account for preferences. However, these models
differ from the more classic types of utility theories in that they do not assign
utilities to choice alternatives as a whole (only to their attributes).
2 Although there have been attempts to explicitly model the role of experience
(e.g. Becker & Murphy, 1988) and memory (e.g. Mullainathan, 2002)
in shaping the decision-making process, these theories still start from the
assumption that there are stable, inherent utility functions that interact with
experiences and/or memory to shape preferences. By contrast, the theory
we discuss here (decision by sampling or “DbS”), assumes that the value
functions themselves are inherently malleable and shaped by past experiences,
via memory.
3 This chapter mainly focuses on the role of long-term memory sampling (i.e. the
stored accumulation of past experiences over a person’s lifetime). However, the
presence of two memory sources raises an interesting question concerning the
relative contribution of each one to the final comparison sample. Prior evidence
suggests the predictions of DbS can be surprisingly robust to a wide range of
assumptions about the proportion of sampling that derives from long-term vs.
short-term memory sources (Stewart & Simpson, 2008). At the same time,
it is clear that short-term memory needs to play some role in order for DbS
to explain certain context effects (e.g. Study 2 in Olivola & Sagara, 2009;
Walasek & Stewart, 2015), while long-term memory sampling is necessary for
the theory to also explain systematic preference tendencies (e.g. Studies 1 and
3 in Olivola & Sagara, 2009; Stewart, Chater, & Brown, 2006). The relative
role of these two memory systems remains an open question and (we believe) a
fruitful topic for future research.
4 Counting the proportion of comparison values that are smaller than or equal
to a target value has a nice property: doing so over the entire range of
(potential) target values yields the cumulative distribution function. In other
words, under this approach (for treating ties), DbS predicts that the evaluation
function relating objective (target) values to their subjective percepts (i.e.
utilities or weights) is equivalent to the cumulative distribution function (CDF)
of relevant comparison values. Thus, one can estimate the predictions of
DbS by integrating over the frequency distribution of previously observed
comparison values. Alternatively, one could treat ties differently from other
outcomes. Olivola and Sagara (2009) compared different ways of treating ties
and found that these did not considerably influence the broad predictions of
DbS (at least when it comes to how people evaluate the potential loss of human
lives).
5 Alternatively, she might routinely experience large losses from gambling or
day trading, but nonetheless sample her memory in a narrow fashion when
evaluating the parking ticket; for example, by comparing it only to previous
parking penalties. In this latter case, DbS predicts that the ticket could still seem
quite large, and thus be upsetting to her, if it exceeds most of her previous fines.
6 Of course, the precise shape of her value function for a given decision will
also depend on attentional factors and her most recent (and/or most salient)
experiences, as these jointly shape the sampling process.
References
Appelt, K. C., Hardisty, D. J., & Weber, E. U. (2011). Asymmetric discounting of gains and
losses: A query theory account. Journal of Risk and Uncertainty, 43(2), 107–126.
Ariely, D. (2001). Seeing sets: Representation by statistical properties. Psychological Science,
12(2), 157–162.
Becker, G. S., & Murphy, K. M. (1988). A theory of rational addiction. Journal of Political
Economy, 96(4), 675–700.
Beyth-Marom, R. (1982). How probable is probable: A numerical translation of verbal
probability expressions. Journal of Forecasting, 1, 257–269.
Bhatia, S. (2013). Associations and the accumulation of preference. Psychological Review,
120(3), 522–543.
Brandstätter, E., Gigerenzer, G., & Hertwig, R. (2006). The priority heuristic: Making
choices without trade-offs. Psychological Review, 113(2), 409–432.
Budescu, D. V., & Wallsten, T. S. (1985). Consistency in interpretation of probabilistic
phrases. Organizational Behavior and Human Decision Processes, 36, 391–405.
Budescu, D. V., & Wallsten, T. S. (1995). Processing linguistic probabilities: General
principles and empirical evidence. In J. Busemeyer, D. L. Medin, & R. Hastie (Eds.),
Decision making from a cognitive perspective (pp. 275–318). San Diego, CA: Academic Press.
Busemeyer, J. R., & Townsend, J. T. (1993). Decision field theory: A dynamic-cognitive
approach to decision-making in an uncertain environment. Psychological Review, 100(3),
432–459.
Calcagno, V., Demoinet, E., Gollner, K., Guidi, L., Ruths, D., & de Mazancourt, C.
(2012). Flows of research manuscripts among scientific journals reveal hidden submission
patterns. Science, 338(6110), 1065–1069.
Camerer, C. F. (2004). Prospect theory in the wild: Evidence from the field. In C. F.
Camerer, G. Loewenstein, & M. Rabin (Eds.), Advances in behavioral economics (pp.
148–161). Princeton, NJ: Princeton University Press.
Camerer, C. F., & Ho, T. H. (1994). Violations of the betweenness axiom and non-linearity
in probability judgment. Journal of Risk and Uncertainty, 8, 167–196.
Choi, H., & Varian, H. (2012). Predicting the present with Google Trends. Economic Record,
88, 2–9.
Clarke, V. A., Ruffin, C. L., Hill, D. J., & Beamen, A. L. (1992). Ratings of orally
presented verbal expressions of probability by a heterogeneous sample. Journal of Applied
Social Psychology, 22, 638–656.
Dai, J., & Busemeyer, J. R. (2014). A probabilistic, dynamic, and attribute-wise model of
intertemporal choice. Journal of Experimental Psychology: General, 143(4), 1489–1514.
Frederick, S., Loewenstein, G., & O’Donoghue, T. (2002). Time discounting and time
preference: A critical review. Journal of Economic Literature, 40(2), 351–401.
Fudenberg, D., & Levine, D. K. (1998). The theory of learning in games. Cambridge, MA:
MIT Press.
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L.
(2009). Detecting influenza epidemics using search engine query data. Nature, 457,
1012–1014.
Goel, S., Hofman, J. M., Lahaie, S., Pennock, D. M., & Watts, D. J. (2010). Predicting
consumer behavior with web search. Proceedings of the National Academy of Sciences, 107,
17486–17490.
Gonzalez, C., Lerch, J. F., & Lebiere, C. (2003). Instance-based learning in dynamic
decision-making. Cognitive Science, 27, 591–635.
Gonzalez, R., & Wu, G. (1999). On the shape of the probability weighting function.
Cognitive Psychology, 38(1), 129–166.
Guria, J., Leung, J., Jones-Lee, M., & Loomes, G. (2005). The willingness to accept value
of statistical life relative to the willingness to pay value: Evidence and policy implications.
Environmental and Resource Economics, 32(1), 113–127.
Hanemann, W. M. (1994). Valuing the environment through contingent valuation. The
Journal of Economic Perspectives, 8(4), 19–43.
Hertwig, R. (2012). The psychology and rationality of decisions from experience. Synthese,
187(1), 269–292.
Johnson, E. J., Häubl, G., & Keinan, A. (2007). Aspects of endowment: A query theory
of value construction. Journal of Experimental Psychology: Learning, Memory, and Cognition,
33(3), 461–474.
Kahneman, D., & Tversky, A. (1979). Prospect theory. Econometrica, 47, 263–292.
Kahneman, D., Ritov, I., Jacowitz, K. E., & Grant, P. (1993). Stated willingness to pay for
public goods: A psychological perspective. Psychological Science, 4(5), 310–315.
Karelitz, T. M., & Budescu, D. V. (2004). You say “probable” and I say “likely”: Improving
interpersonal communication with verbal probability phrases. Journal of Experimental
Psychology: Applied, 10, 25–41.
Klimek, P., Bayer, W., & Thurner, S. (2011). The blogosphere as an excitable social
medium: Richter’s and Omori’s Law in media coverage. Physica A: Statistical Mechanics
and its Applications, 390(21), 3870–3875.
Kornienko, T. (2013). Nature’s measuring tape: A cognitive basis for adaptive utility (Working
paper). Edinburgh, Scotland: University of Edinburgh.
Loomes, G., & Sugden, R. (1982). Regret theory: An alternative theory of rational choice
under uncertainty. Economic Journal, 92, 805–824.
McCrink, K., & Wynn, K. (2007). Ratio abstraction by 6-month-old infants. Psychological
Science, 18(8), 740–745.
McDaniels, T. L. (1992). Reference points, loss aversion, and contingent values for auto
safety. Journal of Risk and Uncertainty, 5(2), 187–200.
Moat, H. S., Olivola, C. Y., Chater, N., & Preis, T. (2016). Searching choices: Quantifying
decision-making processes using search engine data. Topics in Cognitive Science, 8,
685–696.
Moat, H. S., Preis, T., Olivola, C. Y., Liu, C., & Chater, N. (2014). Using big data to
predict collective behavior in the real world. Behavioral and Brain Sciences, 37, 92–93.
Mullainathan, S. (2002). A memory-based model of bounded rationality. Quarterly Journal of
Economics, 117(3), 735–774.
Noguchi, T., Stewart, N., Olivola, C. Y., Moat, H. S., & Preis, T. (2014). Characterizing
the time-perspective of nations with search engine query data. PLoS One, 9, e95209.
Olivola, C. Y. (2015). The cognitive psychology of sensitivity to human fatalities:
Implications for life-saving policies. Policy Insights from the Behavioral and Brain Sciences,
2, 141–146.
Olivola, C. Y., & Sagara, N. (2009). Distributions of observed death tolls govern sensitivity
to human fatalities. Proceedings of the National Academy of Sciences, 106, 22151–22156.
Olivola, C. Y., & Shafir, E. (2013). The martyrdom effect: When pain and effort increase
prosocial contributions. Journal of Behavioral Decision Making, 26, 91–105.
Olivola, C. Y., & Wang, S. W. (in press). Patience auctions: The impact of time versus
money bidding on elicited discount rates. Experimental Economics.
Olivola, C. Y., Rheinberger, C. M., & Hammitt, J. K. (2015). Sensitivity to fatalities from
frequent small-scale deadly events: A Decision-by-Sampling account. Unpublished manuscript,
Carnegie Mellon University.
Plonsky, O., Teodorescu, K., & Erev, I. (2015). Reliance on small samples, the wavy recency
effect, and similarity-based learning. Psychological Review, 122(4), 621–647.
Porter, E. (2011). The price of everything. New York, NY: Penguin.
Preis, T., Moat, H. S., Stanley, H. E., & Bishop, S. R. (2012). Quantifying the advantage of
looking forward. Scientific Reports, 2, 350.
Prelec, D. (1998). The probability weighting function. Econometrica, 66(3), 497–527.
Read, D. (2004). Intertemporal choice. In D. J. Koehler & N. Harvey (Eds.), Blackwell
handbook of judgment and decision-making (pp. 424–443). Oxford, UK: Blackwell.
Read, D., Olivola, C. Y., & Hardisty, D. J. (in press). The value of nothing: Asymmetric
attention to opportunity costs drives intertemporal decision making. Management Science.
Reagan, R. T., Mosteller, F., & Youtz, C. (1989). Quantitative meaning of verbal probability
expressions. Journal of Applied Psychology, 74, 433–442.
Ritov, I., & Baron, J. (1994). Judgements of compensation for misfortune: The role of
expectation. European Journal of Social Psychology, 24(5), 525–539.
Saiz, A., & Simonsohn, U. (2013). Proxying for unobservable variables with Internet
document frequency. Journal of the European Economic Association, 11, 137–165.
Schoemaker, P. J. (1982). The expected utility model: Its variants, purposes, evidence and
limitations. Journal of Economic Literature, 20(2), 529–563.
Sedlmeier, P. E., & Betsch, T. E. (2002). ETC: Frequency processing and cognition. Oxford,
UK: Oxford University Press.
Shafir, E., Simonson, I., & Tversky, A. (1993). Reason-based choice. Cognition, 49(1),
11–36.
Slovic, P. (2007). “If I look at the mass I will never act”: Psychic numbing and genocide.
Judgment and Decision Making, 2, 79–95.
Soman, D., Ainslie, G., Frederick, S., Li, X., Lynch, J., Moreau, P., . . . , Wertenbroch,
K. (2005). The psychology of intertemporal discounting: Why are distant events valued
differently from proximal ones? Marketing Letters, 16(3–4), 347–360.
Starmer, C. (2000). Developments in non-expected utility theory: The hunt for a
descriptive theory of choice under risk. Journal of Economic Literature, 38(2), 332–382.
Stewart, N. (2009). Decision by sampling: The role of the decision environment in risky
choice. The Quarterly Journal of Experimental Psychology, 62, 1041–1062.
Stewart, N., & Simpson, K. (2008). A decision-by-sampling account of decision under risk.
In N. Chater & M. Oaksford (Eds.), The probabilistic mind: Prospects for Bayesian cognitive
science (pp. 261–276). Oxford, UK: Oxford University Press.
Stewart, N., Brown, G. D., & Chater, N. (2005). Absolute identification by relative
judgment. Psychological Review, 112, 881–911.
Stewart, N., Chater, N., & Brown, G. D. A. (2006). Decision by sampling. Cognitive
Psychology, 53, 1–26.
Stewart, N., Chater, N., Stott, H. P., & Reimers, S. (2003). Prospect relativity: How choice
options influence decision under risk. Journal of Experimental Psychology: General, 132,
23–46.
Stott, H. P. (2006). Cumulative prospect theory’s functional menagerie. Journal of Risk and
Uncertainty, 32(2), 101–130.
Szell, M., Lambiotte, R., & Thurner, S. (2010). Multirelational organization of large-scale
social networks in an online world. Proceedings of the National Academy of Sciences, 107(31),
13636–13641.
Thorngate, W. (1980). Efficient decision heuristics. Behavioral Science, 25(3), 219–225.
Thurner, S., Szell, M., & Sinatra, R. (2012). Emergence of good conduct, scaling and Zipf
laws in human behavioral sequences in an online world. PLoS One, 7, e29796.
Tversky, A., & Kahneman, D. (1981). The framing of decisions and the psychology of
choice. Science, 211, 453–458.
Tversky, A., & Kahneman, D. (1992). Advances in prospect theory: Cumulative
representation of uncertainty. Journal of Risk and Uncertainty, 5, 297–323.
Usher, M., & McClelland, J. L. (2004). Loss aversion and inhibition in dynamical models
of multialternative choice. Psychological Review, 111(3), 757–769.
Viscusi, W. K., & Aldy, J. E. (2003). The value of a statistical life: A critical review of market
estimates throughout the world. Journal of Risk and Uncertainty, 27(1), 5–76.
von Neumann, J., & Morgenstern, O. (1947). Theory of games and economic behavior.
Princeton, NJ: Princeton University Press.
Walasek, L., & Stewart, N. (2015). How to make loss aversion disappear and reverse: Tests of
the decision by sampling origin of loss aversion. Journal of Experimental Psychology: General,
144, 7–11.
Walker, M. E., Morera, O. F., Vining, J., & Orland, B. (1999). Disparate WTA–WTP
disparities: The influence of human versus natural causes. Journal of Behavioral Decision
Making, 12(3), 219–232.
Weber, E. U., Johnson, E. J., Milch, K. F., Chang, H., Brodscholl, J. C., & Goldstein, D.
G. (2007). Asymmetric discounting in intertemporal choice: A query-theory account.
Psychological Science, 18(6), 516–523.
Weyl, E. G. (in press). Price theory. Journal of Economic Literature.
Wu, G., & Gonzalez, R. (1996). Curvature of the probability weighting function.
Management Science, 42, 1676–1690.
14
CRUNCHING BIG DATA WITH
FINGERTIPS
How Typists Tune Their Performance Toward
the Statistics of Natural Language

Lawrence P. Behmer Jr. and Matthew J. C. Crump

Abstract
People have the extraordinary ability to control the order of their actions. How people
accomplish sequencing and become skilled at it with practice is a long-standing problem
(Lashley, 1951). Big Data techniques can shed new light on these questions. We used the
online crowd-sourcing service, Amazon Mechanical Turk, to measure typing performance
from hundreds of typists who naturally varied in skill level. The large dataset allowed us to
test competing predictions about the acquisition of serial-ordering ability that we derived
from computational models of learning and memory. These models suggest that the time
to execute actions in sequences will correlate with the statistical structure of actions in the
sequence, and that the pattern of correlation changes in particular ways with practice. We
used a second Big Data technique, n-gram analysis of large corpuses of English text, to
estimate the statistical structure of letter sequences that our typists performed. We show
the timing of keystrokes correlates with sequential structure (letter, bigram, and trigram
frequencies) in English texts, and examine how this sensitivity changes as a function of
expertise. The findings hold new insights for theories of serial-ordering processes, and how
serial-ordering abilities emerge with practice.

Introduction
The infinite monkey theorem says that a room full of monkeys typing letters on
a keyboard for an infinite amount of time will eventually produce any text, like
the works of Shakespeare or this chapter (Borel, 1913). This small space of natural
texts has more predictable structure than the vastly larger space of random texts
produced by the typing monkeys.
with particular frequencies, some high and some low. The present work examines
whether typists, who routinely produce letters by manipulating a keyboard,
become sensitive to these statistical aspects of the texts they type. Answering
this question addresses debate about how people learn to produce serially ordered
actions (Lashley, 1951).

The Serial-Order Problem


The serial-order problem cuts across human performance, including walking and
talking, routine activities like tying a shoe, making coffee, and typing an email,
to crowning achievements in the arts and sports where performers dazzle us
with their musical, visual, and physical abilities. Performers exhibit extraordinary
serial-ordering abilities. They can produce actions in sequences that accomplish
task goals.
Three general features of serial-ordering ability requiring explanation are
specificity, flexibility, and fluency. Performers can produce highly specific
sequences, as in memorizing a piece of music. Performers can flexibly produce
different sequences, as in language production. And, performers can produce
sequences with great speed and accuracy. The serial-order problem is to articulate
the processes enabling these abilities across domains of human performance, which
is an overbroad task in itself, so it is not surprising that a complete account of the
problem has remained elusive. However, theoretical landscapes for examining the
problem have been put forward. We review them below.

Associative Chain Theory


Prior to the cognitive revolution (pre-1950s) serial-ordering ability was explained
by associative chains whereby one action triggers the next by association (Watson,
1920). This domino-like process assumes that feedback from movement n triggers
the next movement (n+1), which triggers the next movement (n+2), and so on.
Karl Lashley (1951; see Rosenbaum, Cohen, Jax, Weiss, & van der Wel, 2007,
for a review) famously critiqued chaining theories on several fronts. The idea that
feedback triggers upcoming actions does not explain action sequences that can be
produced when feedback is eliminated, or rapid action sequences (like lightning
fast musical arpeggios) where the time between actions is shorter than the time for
feedback to return and trigger the next action.
The biggest nail in the coffin was that associative chains could not flexibly
produce meaningful sequences satisfying grammatical rules for ordering—a
requirement for language production. Consider an associative chain for ordering
letters by associations with preceding letters. The associative strength between
two letters could reflect the number of times or likelihood that they co-occur.
Bigram co-occurrence can be estimated from large corpuses of natural text. For
example, taking several thousand e-books from Project Gutenberg (see Methods)
and counting the occurrence of all bigrams estimates the associations between
letter pairs in English. Generating sequences of letters using these bigram statistics
will produce sequences with the same bigram frequency structure as the English
language, but will rarely produce words, let alone meaningful sentences. So an
associative chain theory of letter sequencing would take a slightly shorter amount
of infinite time to produce the works of Shakespeare, compared to random
monkeys.
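A rough sketch of this kind of bigram bookkeeping is shown below. It is not the
authors' pipeline, and the file name is a placeholder for whatever corpus is on hand
(for example, a Project Gutenberg e-book saved as plain text):

```python
# Minimal sketch: estimate bigram statistics from a text file and generate a
# letter sequence that reproduces those bigram frequencies.
import random
from collections import Counter, defaultdict

text = open("corpus.txt", encoding="utf-8").read().lower()  # placeholder corpus file
letters = [c for c in text if c.isalpha() or c == " "]

bigram_counts = Counter(zip(letters, letters[1:]))  # adjacent-letter co-occurrences

# For each letter, tally which letters follow it and how often
transitions = defaultdict(Counter)
for (first, second), n in bigram_counts.items():
    transitions[first][second] = n

def generate(start="t", length=60):
    """Sample letters one at a time in proportion to their bigram counts."""
    out = [start]
    for _ in range(length - 1):
        followers = transitions[out[-1]]
        if not followers:            # dead end (rare once spaces are included)
            break
        choices, weights = zip(*followers.items())
        out.append(random.choices(choices, weights=weights)[0])
    return "".join(out)

print(generate())  # English-like bigram structure, but rarely actual words
```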
Although associative chains fail to explain serial-ordering behavior in complex
domains like language, the more general goal of explaining serial ordering in terms
of basic learning and memory processes has not been abandoned. For example,
Wickelgren (1969) suggested that associative chains could produce sequences
of greater complexity by allowing contextual information to conditionalize
triggering of upcoming actions. And, as we will soon discuss, contemporary
neural network approaches (Elman, 1990), which are associative in nature, have
been successfully applied as accounts of serial-ordering behavior in a number of
tasks.
Lashley’s critique inspired further development of associative theories, and
opened the door for new cognitive approaches to the serial-order problem, which
we broadly refer to as hierarchical control theories.

Hierarchical Control Theory


Hierarchical control theories of serial-ordering invoke higher-order planning
and monitoring processes that communicate with lower-order motor processes
controlling movements. Hierarchical control is aptly described by Miller,
Galanter, and Pribram's (1960) concept of TOTE (test, operate, test, exit) units.
TOTEs function as iterative feedback loops during performance. For example, a
TOTE for pressing the space key on a keyboard would test the current environment:
has the space key been pressed? No; then, engage in the operation: press the space
key; then, re-test the environment: has the space key been pressed? Yes; and, finally
exit the loop. TOTEs can be nested within other TOTEs. For example, a TOTE
to type a word would control sub-TOTEs for producing individual letters, and
sub-sub-TOTEs, for controlling finger movements. Similarly, higher-level TOTEs
would guide sentence creation, and be nested within even higher level TOTEs for
producing whole narratives.
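The TOTE idea is easy to caricature in code. The toy below is our illustration, not a
model from the typing literature; it nests a letter-level TOTE inside a word-level
TOTE, and the "environment" is just a string of typed characters:

```python
# Toy sketch of nested TOTE (test-operate-test-exit) units for typing a word.
class Environment:
    def __init__(self):
        self.typed = ""  # what has appeared on the "screen" so far

def letter_tote(env, letter):
    pressed = False
    while not pressed:       # test: has this key been pressed yet?
        env.typed += letter  # operate: press the key
        pressed = True       # re-test now passes
    # exit

def word_tote(env, word):
    while not env.typed.endswith(word):  # test: has the whole word been typed?
        for letter in word:              # operate: run the nested letter TOTEs
            letter_tote(env, letter)
    # exit

env = Environment()
word_tote(env, "space")
print(env.typed)  # -> "space"
```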
A critical difference between hierarchical control and associative theories is
the acknowledgment of more complex cognitive functions like planning and
monitoring. These assumptions enable hierarchical control frameworks to describe
more complicated serial-ordering behaviors, but require further explanation of
planning and monitoring processes. Hierarchical control theories are common in
many areas, including language (Dell, Burger, & Svec, 1997), music (Palmer &
Pfordresher, 2003), routine everyday actions (Cooper & Shallice, 2000), and skilled
typing (Logan & Crump, 2011).
Emergence versus Modularism (A Short Coffee-Break)


The history of theorizing about the serial-order problem is a dance between
emergent and modularistic explanations. Emergent explanations seek parsimony by
examining whether basic learning and memory processes produce serial-ordering
abilities “for free,” without specialized systems for sequencing actions. Modules
(Fodor, 1983) refer to specialized processes tailored for the demands of sequencing
actions. Associative chain theory uses commonly accepted rules of association to
argue for a parsimonious account of serial-ordering as an emergent by-product
of a general associative process. Hierarchical control theories assume additional
processes beyond simple associations, like those involved in plan construction,
implementation, and monitoring.
The interplay between emergent and modularistic accounts was featured
recently in a series of papers examining errors made by people brewing cups of
coffee. Cooper and Shallice (2000, 2006) articulated computational steps taken
by a planning process to activate higher-level goals (make coffee), and series of
lower level sub-goals (grab pot, fill with water, etc.), and sub-sub-goals (reach and
grasp for handle), and then described how their algorithms could brew coffee,
and make human-like errors common to coffee-brewing. Then, Botvinick and
Plaut (2004, 2006) showed how a general associative process, modeled using a
serial recurrent neural network (Elman, 1990), could be trained to make coffee
as accurately and error-prone as people. Their modeling efforts show that some
complex routine actions can be explained in an emergent fashion; and that
representational units for higher-order plans (i.e. TOTE units) are functionally
equivalent to distributed collections of lower-level associative weights in a neural
network. Thus, a non-hierarchical learning process can be substituted for a more
complicated hierarchical process as an explanation of serial-ordering behavior in
the coffee-making domain.
Coffee-making is a difficult task for discriminating between theories. Although
plan-based theories are not required, they may be necessary elsewhere, for example
in tasks that require a larger, flexible repertoire of sequences as found in skilled
typewriting (Logan & Crump, 2011). Perhaps the two approaches share a nexus,
with general associative learning principles tuning aspects of the construction,
activation, and implementation of plans for action. So the dance continues, and in
this chapter, leads with the movements of fingers across computer keyboards.

Hierarchical Control and Skilled Typing


Skilled typing is a convenient tool for examining hierarchical control processes
and is naturally suited to studying serial-ordering abilities. Prior work shows that
skilled typing is controlled hierarchically (for a review see Logan & Crump, 2011).
Hierarchically controlled processes involve at least two distinguishable levels in
which elements from higher levels contain a one-to-many mapping with elements
in the lower levels. Levels are encapsulated. The labor of information processing
is divided between levels, and one level may not know the details of how another
level accomplishes its goals. Because of the division of labor, different levels should
respond to different kinds of feedback. Finally, although levels are divided, they
must be connected or coordinated to accomplish task goals.
The terms outer and inner loop are used to refer to the hierarchically nested
processes controlling typing. The outer loop relies on language production and
comprehension to turn ideas into sentences and words, passing the result one word
at a time to the inner loop. The inner loop receives words as plans from the outer
loop, and translates each word into constituent letters and keystrokes. The outer
loop does not know how the inner loop produces keystrokes. For example, typists
are poor at remembering where keys are located on the keyboard (Liu, Crump, &
Logan, 2010; Snyder, Ashitaka, Shimada, Ulrich, & Logan, 2014), and their typing
speed slows when attending to the details of their actions (Logan & Crump, 2009).
The outer and inner loop rely on different sources of feedback, with the outer
loop using visual feedback from the computer screen to detect errors, and the
inner loop using tactile and kinesthetic feedback to guide normal typing, and
to independently monitor and detect errors (Crump & Logan, 2010b; Logan &
Crump, 2010). Finally, word-level representations connect the loops, with words
causing parallel activation of constituent letters within the inner loops’ response
scheduling system (Crump & Logan, 2010a).

Developing a Well-Formed Inner Loop


We know that expert typists’ inner loops establish a response scheduling process that
is precise, flexible, and fluid: capable of accurately executing specific sequences,
flexibly producing different sequences, and fluidly ordering keystrokes with speed.
However, we do not know how these abilities vary with expertise and change
with practice. The outer/inner loop framework and other computational theories
of skilled typing have not addressed these issues.
Rumelhart and Norman’s (1982) computational model of skilled typing
provides a starting point for understanding inner loop development. Their model
used word representations to activate constituent letter nodes, which then drove
finger movements. Because video recordings of skilled typists showed fingers
moving in parallel across the keyboard (Gentner, Grudin, & Conway, 1980), their
word units caused parallel activation of letter units and finger movements. This
created a problem for outputting the letters in the correct order. Their solution was
a dynamic inhibition rule (Estes, 1972): Each letter inhibits every other letter in
series, letter activation moves fingers to keys (with distance moved proportional to
activation), a key is pressed when a finger reaches its target and its letter activation
value is higher than the others, and letter units are de-activated when keys are
pressed. This clever rule is specialized for the task of response-scheduling, but
has emergent qualities because the model types accurately without assuming a
monitoring process.
The model explains expert typing skill, but says nothing about learning and
skill-development. It remains unclear how associations between letters and specific
motor movements develop with practice, or how typing speed and accuracy for
individual letters changes with practice. Addressing these issues is the primary goal
of the present work.

Becoming a Skilled Typist


The processes underlying the development of typing skill enable people to proceed
from a novice stage where keystrokes are produced slowly, to an expert stage
where they are produced quickly and accurately. Novices with no previous typing
experience scan the keyboard to locate letters, and use visually targeted movements
to press intended keys. Their outer loop goes through the motions of (1) intending
to type a letter, (2) looking for the letter on the keyboard, (3) finding the letter,
(4) programming a motor movement to the found key, and (5) repeating these steps
in a loop until all planned letters are produced. Experts can type without looking at
the keyboard, with fingers moving in parallel, and with impressive speed (100 ms
per keystroke and faster). Experts use their outer loop for planning words and
phrases, and their inner loop for producing individual keystrokes in series.
We divide the problem of how inner loops are acquired into questions
about how response-scheduling ability changes with practice, and how the
operations of response-scheduling processes work and change with practice.
Abilities will be measured here in terms of changes in speed and accuracy
for typing individual letters. Operations refer to processing assumptions about
how one response is ordered after another. Here, we test theories about the
development of response-scheduling abilities, and leave tests of the operations of
the response-scheduling process to future work.

More Than One Way to Speed Up a Typist


We consider three processes that could enable a slow typist to improve their speed
with practice. Each makes different predictions for how typing times for individual
letters would change with practice.

Adjust a Global Response Scheduling Timing Parameter


If keystroke speed is controlled by adjustable timing parameters, then faster typing
involves changing timing parameters to reduce movement and inter-movement
times. Normatively speaking, if a typist could choose which letters to type more

FIGURE 14.1 Simulations showing that mean letter typing time to type a text is shorter
for simulated typists whose typing times for individual letters negatively correlate with
letter frequency in the typed text. (Two panels, Crafted and Random; y-axis: simulated
typing time; x-axis: correlation between letter frequency and simulated IKSI.)

quickly, they would benefit from knowing letter frequencies in texts they type. All
things being equal, and assuming that typists are typing non-random texts, a typist
whose letter typing times are negatively correlated with letter frequency norms for
the text (i.e. faster times for more frequent than for less frequent letters) will finish typing the
same text faster than a typist whose letter typing times do not negatively correlate
with letter frequency.
To illustrate, we conducted the simulations displayed in Figure 14.1. First,
we created a vector of 26 units populated with the same number (e.g. 150 ms)
representing mean keystroke times for letters in the alphabet. This scheme assumes
that all letters are typed at the same speed. Clearly, overall speed will be increased
by decreasing the value of any of those numbers. However, to show that sensitivity
to letter frequency alone increases overall speed, we crafted new vectors that could
be negatively or positively correlated with letter frequency counts consistent with
English texts (taken from Jones & Mewhort, 2004). We created new vectors using
the following rules: (1) randomly pick a unit and subtract X, then randomly pick a
different unit, and add the same value of X; (2) compute the correlation between
the new vector and the vector of letter frequencies; (3) keep the change to the
vector only if the correlation increases; (4) repeat 1–3. All of the crafted vectors
summed to the same value, but were differentially correlated to letter frequencies
through the range of positive and negative values. The random panel shows a
second simulation where the values for simulated letter typing speed were simply
randomly selected, with the constraint that they sum to the same value. Figure 14.1
shows that time to finish typing a text is faster for vectors that are more negatively
correlated with letter frequencies in the text.
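
To make the crafting procedure concrete, the following sketch (in Python) implements the four rules above under stated assumptions: the letter-frequency vector is a hypothetical stand-in for the Jones and Mewhort (2004) counts, the 150 ms starting value and step size are arbitrary, and a guard against negative typing times has been added for safety. It illustrates the logic rather than reproducing the exact code behind Figure 14.1.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical letter-frequency counts; the actual simulation used the counts
# reported by Jones and Mewhort (2004).
letter_freqs = rng.integers(1_000, 1_000_000, size=26).astype(float)

def corr(x, y):
    # Correlation that returns 0 when x has no variance (e.g. at the start).
    return 0.0 if np.std(x) == 0 else np.corrcoef(x, y)[0, 1]

def craft(times, freqs, target_sign=-1, step=1.0, iters=20_000):
    """Rules 1-4: nudge one letter faster and another slower by the same amount,
    keeping the change only if the correlation with letter frequency moves in the
    target direction (target_sign = -1 for negative, +1 for positive)."""
    times = times.copy()
    best_r = corr(times, freqs)
    for _ in range(iters):
        i, j = rng.choice(26, size=2, replace=False)
        candidate = times.copy()
        candidate[i] -= step
        candidate[j] += step
        if candidate.min() <= 0:          # added guard: keep times physically plausible
            continue
        r = corr(candidate, freqs)
        if target_sign * r > target_sign * best_r:
            times, best_r = candidate, r
    return times, best_r

# All letters start at the same speed (150 ms), so the crafted vectors stay sum-matched.
crafted, r = craft(np.full(26, 150.0), letter_freqs)

# Mean time per keystroke when typing a text with these letter frequencies;
# more negative correlations yield shorter simulated typing times.
mean_time = np.dot(crafted, letter_freqs) / letter_freqs.sum()
print(round(r, 2), round(mean_time, 1))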
The simulation shows that typists have an opportunity to further optimize their
typing speed by modifying individual letter typing speeds in keeping with the
frequencies of individual letters in the text they are typing. Indeed, there is some
existing evidence that, among skilled typists, letter typing times do negatively
correlate with letter and bigram frequencies (Grudin & Larochelle, 1982).
However, it is unclear how these micro-adjustments to the timing of individual
keystrokes take place. If typists are simply changing the timing parameters for each
keystroke whenever they can, without prioritizing the changes as a function of
letter frequency, then we would not expect systematic correlations to exist between
letter typing times and letter frequencies. The next two hypotheses assume that
typists become sensitive to the statistics of their typing environment “for free,”
simply by using the same general learning or memory processes they always use
when learning a new skill.

Use a General Learning or Memory Process


The task of typing has processing demands equivalent to laboratory sequencing
tasks, such as artificial grammar learning (Reber, 1969) and the serial-reaction
time (SRT) task (Nissen & Bullemer, 1987). In those tasks, participants respond
to unfamiliar patterned sequences of characters or shapes defined by a grammar.
The grammar controls the probability that particular characters follow others.
By analogy, the task of typing words is very similar. The letters in words in
natural language occur in patterned fashion, with specific letters, bigrams, and
trigrams (and higher-order n-grams) occurring with specific frequencies. In the
artificial grammar task, as participants experience sequences they develop the
ability to discriminate sequences that follow the grammar from those that do not.
In the SRT task, as participants gain experience with sequences they become
faster at responding to each item in the sequence. In the task of typing, the
question we are interested in asking is how typists become sensitive to the
frequencies of letters, bigrams, and trigrams, as they progress in acquiring skill in
typing.
Computational models of learning and memory can account for performance in
the artificial grammar and SRT task. For example, exemplar-based memory models
(Jamieson & Mewhort, 2009a, 2009b) and serial recurrent neural network learning
models (Cleeremans & McClelland, 1991; Cleeremans, 1993) explain sensitivity
to sequential structure as the result of general learning and memory processes
operating on experiences containing order information. The serial recurrent
network (Elman, 1990) applied to the SRT task involves the same architecture
used by Botvinick and Plaut (2006) to model action sequences for making coffee.
These models make specific predictions about how people become sensitive to
statistical structure in sequences (i.e. n-gram likelihoods) with practice.
Serial Recurrent Network Predictions for Acquiring Sensitivity to Sequential Statistics

A serial recurrent neural network (see Elman, 1990) is a modified feed-forward
neural network with input units, a hidden layer, and output units. Input patterns
are unique numerical vectors describing elements in a pattern, like letters in a
sequence. After learning, the distributed pattern of weights in the hidden layer can
represent multiple input patterns, such that a particular input generates a learned
output pattern. Learning occurs in the model by presenting input patterns, and
making incremental adjustments to the weights that reduce errors between input
and output patterns. The model learns sequential structure because the hidden
layer is fed back into itself as part of the input units. This provides the model
with the present and preceding input patterns at each time step. If the model
is trained on patterned sequences of letters, then the model gradually learns to
predict upcoming letters based on preceding letters. In other words, it becomes
sensitive to the n-gram structure of the trained sequences.
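
As a rough illustration of this architecture, the sketch below trains a minimal Elman-style network to predict the next letter in a toy sequence. Everything here is an assumption for demonstration purposes (the toy text, the 20 hidden units, the learning rate, and the one-step backpropagation scheme that treats the copied context as frozen input); it is not the Cleeremans (1993) or Botvinick and Plaut (2006) implementation.

import numpy as np

rng = np.random.default_rng(0)

# Toy training text; in the work described here the sequences would come from
# patterned stimuli or English text rather than this illustrative string.
text = "the cat sat on the mat " * 200
alphabet = sorted(set(text))
n = len(alphabet)
idx = {c: i for i, c in enumerate(alphabet)}

def one_hot(i):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

H = 20                                    # hidden units (arbitrary choice)
Wxh = rng.normal(0, 0.1, (H, n))          # input-to-hidden weights
Whh = rng.normal(0, 0.1, (H, H))          # context (hidden-to-hidden) weights
Why = rng.normal(0, 0.1, (n, H))          # hidden-to-output weights
bh, by, lr = np.zeros(H), np.zeros(n), 0.1

for epoch in range(5):
    h = np.zeros(H)                       # context starts empty on each pass
    for t in range(len(text) - 1):
        x, target = one_hot(idx[text[t]]), one_hot(idx[text[t + 1]])
        h_prev = h
        h = np.tanh(Wxh @ x + Whh @ h_prev + bh)   # hidden state is fed back as context
        p = softmax(Why @ h + by)                  # predicted next-letter distribution

        # One-step backpropagation; the copied context is treated as fixed input,
        # so no gradient flows further back in time.
        dy = p - target
        dh = (Why.T @ dy) * (1 - h ** 2)
        Why -= lr * np.outer(dy, h)
        Wxh -= lr * np.outer(dh, x)
        Whh -= lr * np.outer(dh, h_prev)
        bh -= lr * dh
        by -= lr * dy

# After training, the network assigns higher probability to letters that are likely
# given the preceding context; that is, it has absorbed n-gram structure.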
In the context of learning sequences in the SRT task, Cleeremans (1993)
showed that participants and network simulations gradually accrue sensitivity to
increasingly higher orders of sequential statistics with training. Early in training,
most of the variance in reaction times to identify a target is explained by
target frequency, but with practice target frequency explains less variance, and
higher-order frequencies (e.g. bigrams, trigrams, etc.) explain more variance. The
model makes two important predictions. First, sensitivity to sequential structure
is scaffolded: Novices become sensitive to lower-order statistics as a prerequisite
for developing sensitivity to higher-order statistics. Second, as sensitivity to each
higher-order statistic increases, sensitivity to the previous lower-order statistics
decreases. The model makes the second prediction because the weights in the
hidden layer overwrite themselves as they tune toward higher-level sequential
structure, resulting in a loss of sensitivity to previously represented lower-level
sequential structure. This is an example of catastrophic interference (McCloskey &
Cohen, 1989) whereby newly acquired information disrupts older representations
by changing the weights of those representations.

Instance Theory Predictions for Acquiring Sensitivity to Sequential Statistics

Instance theories of memory make similar predictions for the development of
sensitivity to sequential structure. Instance theories assume individual experiences
are stored in memory and later retrieved in the service of performance. For
example, the instance theory of automatization (Logan, 1988) models the
acquisition of routine behavior and the power law of learning with distributional
assumptions for sampling memory. Response time is a race between a process that
remembers how to perform an action, and an algorithm that computes the correct
action, with the faster process winning the race and controlling action. Instances
in memory are not created equal, and some can be retrieved faster than others. As
memory for a specific action is populated with more instances, that action is more
likely to be produced by the memory process because one of the instances will tend
to have a faster retrieval time than the algorithmic process. So, memory speeds up
responding as a function of the likelihood of sampling increasingly extreme values
from increasingly large distributions. More simply, faster memory-based responses
are more likely when the instance pool is larger rather than smaller.
We simulated predictions of instance theory for acquiring sensitivity to letter
and bigram frequencies in English texts (Jones & Mewhort, 2004) with practice.
Response times to a letter or a bigram were sampled from normal distributions,
with the number of samples constrained by the number of instances in memory
for that letter or bigram. The response time for each was the fastest time sampled
from the distribution.
To simulate practice, we repeated this process between the ranges of 50
experiences with letters and bigrams, up to 1,000,000 experiences. To determine
sensitivity to letter and bigram frequency, we correlated the vectors of retrieval
times for presented letters and bigrams with the vectors of letter and bigram
frequencies from the corpus counts.
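
A minimal version of this instance-theory simulation might look like the sketch below. The letter-frequency proportions, the 500 ms/100 ms retrieval-time distribution, and the 150 ms floor are illustrative assumptions standing in for the values used to produce Figure 14.2.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical letter-frequency proportions; the actual simulation used the
# corpus counts from Jones and Mewhort (2004).
letter_p = rng.dirichlet(np.ones(26))

def sensitivity(total_experiences, floor=None, mu=500, sd=100):
    """Retrieval time for each letter is the fastest of N samples from a normal
    distribution, where N is how often that letter has been experienced."""
    counts = np.maximum(1, np.round(letter_p * total_experiences).astype(int))
    rts = np.array([rng.normal(mu, sd, size=c).min() for c in counts])
    if floor is not None:
        rts = np.maximum(rts, floor)      # physical limit on movement time
    if rts.std() == 0:                    # all times at floor: no sensitivity left
        return 0.0
    # Sensitivity = correlation between simulated retrieval times and frequency.
    return np.corrcoef(rts, counts)[0, 1]

for practice in (50, 1_000, 100_000, 1_000_000):
    print(practice, round(sensitivity(practice), 2),
          round(sensitivity(practice, floor=150), 2))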
The No floor panel in Figure 14.2 shows increasingly negative correlations
between retrieval time and corpus count frequency as a function of practice
for letters and bigrams. The correlations plateau with practice, and letter and
bigram sensitivity develop roughly in parallel. Bigram sensitivity is delayed because

FIGURE 14.2 Simulated instance theory (Logan, 1988) predictions for how correlations
between letter and bigram frequency, and their simulated typing times, would change
as a function of practice. Floor versus No floor refers to whether simulated typing
times were limited by some value reflecting physical limitations for movement time.
(Two panels, Floor and No floor; separate lines for letter and bigram; y-axis: n-gram
frequency by speed correlation; x-axis: simulated practice, number of experiences.)
experience with specific bigrams occurs more slowly than letters (which repeat
more often within bigrams). So, different from the SRN model, the instance
model does not predict sensitivity to lower-order statistics decreases with increasing
sensitivity to higher-order statistics. However, a modified version of this simulation
that includes a floor on retrieval times to represent the fact that reaction times will
eventually hit physical limitations does show waxing and waning of sensitivities,
with the value of letter and bigram correlations increasing to a maximum, and
then slowly decreasing toward zero as all retrieval times become more likely to
sample the same floor value.

Testing the Predictions in Skilled Typing


The present work tests the above predictions in the real-world task of skilled typing.
The predictions are tested by analyzing whether typists of different skill levels are
differentially sensitive to letter, bigram, and trigram frequencies. This analysis is
accomplished by (1) having typists of different skill levels type texts, (2) recording
typing times for all keystrokes and for each typist computing mean keystroke time
for each n-gram, (3) for each typist, sensitivity to sequential structure is measured
by correlating mean typing times for each letter, bigram, and trigram with their
respective frequencies in the natural language, and (4) by ordering typists in terms
of their skill level, we can determine whether sensitivity to n-gram frequency
changes as a function of expertise.
To foreshadow our analysis, steps three and four involve two different
correlational measures. Step three computes several correlations for each individual
typist. Each correlation relates n-gram frequency to mean keystroke times for each
n-gram typed by each typist. This results in three correlations for each n-gram level
(letter, bigram, and trigram) per typist. Because skill increases with practice, we
expect faster keystrokes (decreasing value) for more frequent n-grams (increasing
value). For example, typists should type high frequency letters faster than low
frequency letters, and so on for bigrams and trigrams. So in general, all of these
correlations should be negative. We take the size of the negative correlation as
a measure of sensitivity to n-gram structure, with larger negative values showing
increasing sensitivity.
Step four takes the correlations measuring sensitivity to n-gram frequency from
each typist and examines them as a function of typing expertise. For example,
one measure of typing expertise is overall typing speed, with faster typists showing
more expertise than slower typists. According to the learning and memory models
novices should show stronger sensitivity to letter frequency than experts. Similarly,
experts should show stronger sensitivity to bigram and trigram frequencies than
novices. These results would license consideration of how general learning and
memory processes participate in hierarchically controlled skilled performance
domains like typing.
Using Big Data Tools to Answer the Question


To evaluate the predictions we needed two kinds of Big Data. First, we needed
access to large numbers of typists that varied in skill level. Second, we needed
estimates of sequential structure in a natural language such as English.
We found our typists online using Amazon Mechanical Turk (mTurk), an
Internet crowdsourcing tool that pays people small sums of money to complete
HITS (human intelligence tasks) in their web browser. HITs are tasks generally
easier for people than computers, like listing keywords for images, or rating
websites. The service is also becoming increasingly popular as a method for
conducting behavioral experiments because it provides fast and inexpensive access
to a wide population of participants. More important, modern web-browser
technology has reasonably fine-grained timing abilities, so it is possible to measure
the timing of keypress responses at the level of milliseconds. For example, Crump,
McDonnell, and Gureckis (2013) showed that browser-based versions of several
classic attention and performance procedures requiring millisecond control of
display presentation and response time collection could easily be reproduced
through mTurk. We followed the same approach and created a paragraph typing
task using HTML and JavaScript, loaded the task onto mTurk, and over the course
of a few days asked 400 people to type our paragraphs.
To estimate sequential structure in natural language (in this case English) we
turned to n-gram analysis techniques. N-grams are identifiable and unique units
of sequences, such as letters (a, b, c), bigrams (ab, bc, cd), and trigrams (abc,
bcd, cde). Letters, bigrams, and trigrams appear in English texts with consistent
frequencies. These frequencies can be estimated by counting the occurrence of
specific n-grams in large corpuses of text. For example, Jones and Mewhort (2004)
reported letter and bigram frequency counts from several different corpuses, and
the google n-gram project provides access to n-gram counts taken from their
massive online digitized repository of library books. Generally speaking, larger
corpuses yield more accurate n-gram counts (Kilgarriff & Grefenstette, 2003).
Because we were also interested in examining typists’ sensitivity to trigram
frequencies, we conducted our own n-gram analysis by randomly selecting
approximately 3000 English language eBooks from Project Gutenberg, and counting
the frequency of each lowercase letter (26 unique), bigram (676 unique), and
trigram (17576 unique) from that corpus. We restricted our analysis to n-grams
containing lowercase letters and omitted all other characters because uppercase
characters and many special characters require the use of the shift key, which
produces much slower typing times than keystrokes for lowercase letters. Both
single letter (r = 0.992) and bigram frequencies (r = 0.981) between the New York
Times (Jones & Mewhort, 2004) and Gutenberg corpuses were highly correlated
with one another. Additionally, inter-text letter, bigram, and trigram counts
from the Gutenberg corpus were highly correlated with one another, showing that
the sequential structure that typists may be learning is fairly stable across English
texts.
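
The counting step itself is simple; a sketch along the following lines would reproduce it, assuming books is a hypothetical list of paths to downloaded Gutenberg plain-text files and that n-grams are counted within runs of lowercase letters (uppercase and special characters break the sequence, following the restriction described above).

import re
from collections import Counter

def ngram_counts(text, n):
    # Count n-grams within runs of lowercase letters only; any other character
    # (uppercase, punctuation, whitespace) ends the current run.
    counts = Counter()
    for run in re.findall(r"[a-z]+", text):
        for i in range(len(run) - n + 1):
            counts[run[i:i + n]] += 1
    return counts

letter_counts, bigram_counts, trigram_counts = Counter(), Counter(), Counter()
for path in books:                         # 'books': hypothetical list of file paths
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    letter_counts += ngram_counts(text, 1)
    bigram_counts += ngram_counts(text, 2)
    trigram_counts += ngram_counts(text, 3)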

Methodological Details
All typists copy-typed five normal paragraphs from the Simple English Wiki, a
version of the online encyclopedia Wikipedia written in basic English. Four
of the paragraphs were from the entry about cats (http://simple.wikipedia.
org/wiki/Cat), and one paragraph was from the entry for music (http://simple.
wikipedia.org/wiki/Music). Each normal paragraph had an average of 131 words
(range 124–137). The paragraphs were representative of English texts and highly
correlated with Gutenberg letter (26 unique letters, 3051 total characters, r = 0.98),
bigram (267 unique bigrams, 2398 total bigrams, r = 0.91), and trigram frequencies
(784 unique trigrams, 1759 total trigrams, r = 0.75).
As part of an exploratory analysis we also had typists copy two paragraphs of
non-English text, each composed of 120 five-letter strings. The strings in the
bigram paragraph were generated according to bigram probabilities from our corpus
counts, resulting in text that approximated the bigram structure of English text (i.e.
a first-order approximation to English; Mewhort, 1966, 1967). This paragraph was
generally well correlated with the Gutenberg letter counts (24 unique, 600 total,
r = 0.969), bigram counts (160 unique, 480 total, r = 0.882), and trigram counts
(276 unique, 360 total, r = 0.442).
The strings in the random letter paragraph were constructed by sampling each
letter from the alphabet randomly with replacement. This paragraph was not well
correlated with Gutenberg letter counts (26 unique, 600 total, r = 0.147), bigram
counts (351 unique, 479 total, r = −0.056), or trigram counts (358 unique,
360 total, r = −0.001).
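
Text of this kind can be generated by sampling each successive letter from the conditional probabilities implied by the bigram counts. The sketch below assumes the bigram_counts dictionary from the counting sketch above; the fallback to a uniform choice for unseen contexts is an added convenience, not necessarily part of the original paragraph-construction procedure.

import numpy as np

rng = np.random.default_rng(7)
letters = "abcdefghijklmnopqrstuvwxyz"

def bigram_string(bigram_counts, length=5):
    """Generate a letter string whose transitions follow corpus bigram
    probabilities (a first-order approximation to English)."""
    # First letter drawn from the marginal frequencies implied by the bigrams.
    first_p = np.array([sum(v for k, v in bigram_counts.items() if k[0] == c)
                        for c in letters], dtype=float)
    s = str(rng.choice(list(letters), p=first_p / first_p.sum()))
    while len(s) < length:
        p = np.array([bigram_counts.get(s[-1] + c, 0) for c in letters], dtype=float)
        if p.sum() == 0:                   # unseen context: fall back to uniform
            p = np.ones(len(letters))
        s += rng.choice(list(letters), p=p / p.sum())
    return s

# e.g. 120 five-letter strings, as in the bigram paragraph described above
bigram_paragraph = " ".join(bigram_string(bigram_counts) for _ in range(120))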
Workers (restricted to people from the USA, with over 90% HIT completion rate)
on mTurk found our task, consented, and then completed the task. The procedure
was approved by the institutional review board at Brooklyn College of the City
University of New York. Four hundred individuals started the task; however data
were only analyzed for the 346 participants who successfully completed the task
(98 men, 237 women, 11 no response). Participants reported their age within
5-year time bins, ranging from under 20 to over 66 years old (mean bin = 35
to 40 years old, +/− 2 age bins). Two hundred and ninety-six participants were
right-handed (33 left-handed, 11 ambidextrous, 6 no response). One hundred
and thirty-five participants reported normal vision (202 corrected, 5 reported
“vision problems,” 4 no response). Three hundred and twenty-nine participants
reported that English was their first language (7 reported English being their second
language, 10 no response). Participants reported that they had been typing between
1 and 60 years (M = 20.2 years, SE = 9.3), and had started typing at between
3 and 49 years old (M = 13.3 years old, SE = 5.5). Two hundred and eighty
participants reported being touch typists (63 not touch typists, 3 no response), and
187 reported having formal training (154 no formal training, 5 no response).
During the task, participants were shown each of the seven different paragraphs
in a text box on their monitor (order randomized). Paragraph text was black,
presented in 14 pt, Helvetica font. Participants were instructed to begin typing
with the first letter in the paragraph. Correctly typed letters turned green, and
typists could only proceed to the next by typing the current letter correctly. After
completing the task, participants were presented with a debriefing, and a form to
provide any feedback about the task. The task took around 30 to 45 minutes to
complete. Participants who completed the task were paid $1.

The Data
We collected inter-keystroke interval times (IKSIs; in milliseconds), for every
correct and incorrect keystroke for each subject and each paragraph. Each IKSI
is the difference between the timestamp for typing the current letter and the
timestamp for typing the preceding letter. IKSIs for each letter were also coded in terms of their associated
bigrams and trigrams. Consider typing the word cat. The IKSI for typing letter t
(timestamp of t – timestamp of a) has the letter level t, the bigram level at, and
the trigram level cat. In addition, each letter, bigram, and trigram has a frequency
value from the corpus count.
In this way, for each typist we compiled three signatures of sensitivity to letter,
bigram, and trigram frequency. For letters, we computed the vector of mean IKSIs
for all unique correctly typed letters and correlated it with the vector of letter
frequencies. The same process was repeated for the vector of mean IKSIs for all
unique correctly typed bigrams and trigrams. The resulting correlation values for
each typist appear as individual dots in the figures that follow.
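
In analysis code, the pipeline from raw keystrokes to the per-typist sensitivity scores plotted below might look roughly like the following sketch. The column names (subject, paragraph, timestamp, letter, bigram, trigram) and the use of the n-gram count dictionaries from the earlier sketch are assumptions about how the data are organized, not the actual analysis scripts; in practice one would also restrict the rows to correct keystrokes in the paragraphs of interest.

import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def sensitivity(df, level, freqs):
    """Spearman correlation between a typist's mean IKSIs per n-gram and the
    corpus frequencies of those n-grams."""
    means = df.groupby(level)["iksi"].mean()
    f = np.array([freqs.get(g, np.nan) for g in means.index], dtype=float)
    ok = ~np.isnan(f)
    rho, _ = spearmanr(means.values[ok], f[ok])
    return rho

# IKSIs: difference between successive keystroke timestamps within each paragraph.
keystrokes = keystrokes.sort_values(["subject", "paragraph", "timestamp"])
keystrokes["iksi"] = keystrokes.groupby(["subject", "paragraph"])["timestamp"].diff()
keystrokes = keystrokes.dropna(subset=["iksi"])

rows = []
for subject, df in keystrokes.groupby("subject"):
    rows.append({
        "subject": subject,
        "mean_iksi": df["iksi"].mean(),    # overall typing speed, the skill proxy
        "letter_r": sensitivity(df, "letter", letter_counts),
        "bigram_r": sensitivity(df, "bigram", bigram_counts),
        "trigram_r": sensitivity(df, "trigram", trigram_counts),
    })
typist_sensitivity = pd.DataFrame(rows)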

Answering Questions with the Data


Are Typists Sensitive to Sequential Structure?
Our first goal was to determine whether typing times for individual keystrokes are
correlated with letter, bigram, and trigram frequencies. Looking at performance
on the normal English paragraphs, for each subject we found the mean IKSIs for
each unique correctly typed letter, bigram, and trigram, and then correlated (using
Spearman’s rank correlation coefficient, which tests for any monotonic relationship)
these vectors with their respective frequency counts. A one-way analysis of variance
performed on the correlations revealed a main effect for n-gram type [F(2, 1035)
= 209, p = 0.001]. Post-hoc t-tests (p = 0.016) revealed that each mean was
significantly different from one another, with mean correlations being greatest for
letter (r = −0.410, SE = 0.01), bigram (r = −0.280, SE = 0.004), and then trigram
(r = −0.220, SE = 0.003). Additionally, one-sample t-tests revealed that the mean
correlation of each n-gram type was significantly different from zero. All of the
mean correlations were significant and negative, showing that in general, typing
times are faster for higher than lower frequency n-grams. And, the size of the
negative correlation decreases with increasing n-gram order, showing that there
is more sensitivity to lower than higher-order structure. The major take home
finding is that typists are indeed sensitive to sequential structure in the texts they
type.

Does Sensitivity to Sequential Structure Change with Expertise?


The more important question was whether or not sensitivity to n-gram frequency
changes with expertise in the manner predicted by the learning and memory
models. We addressed this question with a cross-sectional analysis. Our first step
was to index the skill of our typists. As a simple proxy for skill we used the mean
IKSI for each typist (overall typing speed), which assumes that faster typists are
more skilled than slower typists. Overall typing speed is the x-axes in the following
figures. The fastest typists are closer to the left because they have the smallest mean
IKSIs, and the slowest typists are closer to the right because they have the largest
mean IKSIs. Because we have many typists we expected to cover a wide range of
skill, and indeed the figures show a nice spread across the x-axes.
Next, we plotted the previous measures of sensitivity to n-gram frequency for
each typist as a function of their overall typing speed. So, the y-axes in the following
graphs are correlation values for individual typists between the vector of mean
IKSIs for individual n-grams and their frequencies. A score of 0 on the y-axis
shows that a given typist was not sensitive to n-gram frequency. A negative value
shows that a given typist was sensitive to n-gram frequency, and that they typed
high frequency n-grams faster than low frequency n-grams. Positive values indicate
the reverse. In general, most of the typists show negative values and very few show
positive values.
Figure 14.3 shows the data from the normal paragraph condition. The first panel
shows individual typists’ sensitivity to letter frequency. We expected that typists
should be sensitive to letter frequency, and we see that most of the typists show
negative correlations. Most important, sensitivity to letter frequency changes with
typing skill. Specifically, the fastest typists show the smallest correlations, and the
slowest typists show the largest correlations. In other words, there was a significant
negative correlation between sensitivity to letter frequency and skill measured by
overall typing speed (r = −0.452, p < 0.001). This finding fits with the prediction
that sensitivity to lower-order statistics decreases over the course of practice. Our
novices showed larger negative correlations with letter frequency than our experts.
Turning to the second and third panels, showing individual typist sensitivities
to bigram and trigram frequencies as a function of mean typing speed, we see a
FIGURE 14.3 Scatterplots of individual typist correlations between n-gram frequency
and IKSIs for each as a function of mean typing speed for normal paragraphs.
(Normal paragraph correlations; panels: Letters, y = −0.14 − 0.0014x, r² = 0.184;
Bigrams, y = −0.33 + 0.00027x, r² = 0.0254; Trigrams, y = −0.26 + 0.00022x,
r² = 0.0298; y-axis: frequency/IKSI correlations; x-axis: mean typing speed.)

qualitatively different pattern. Here we see that the faster typists on the left show
larger negative correlations than the slower typists on the right. In other words,
there was a small positive correlation between sensitivity to bigram frequency
and skill (r = 0.144, p < 0.007), and trigram frequency and skill (r = 0.146,
p < 0.006). Again, predictions of the learning and memory models are generally
consistent with the data, which show that highly skilled typists are more sensitive
to higher-order sequential statistics than poor typists.

Does Sensitivity to Sequential Structure Change when Typing


Unfamiliar Letter Strings?
Our typists also copy-typed two paragraphs of unfamiliar, non-word letter strings.
The bigram paragraph was constructed so that letters appeared in accordance
with their bigram-based likelihoods from the corpus counts, whereas the random
paragraph was constructed by picking all letters randomly. A question of interest
was whether our measures of typists’ sensitivity to n-gram structure in English
would vary depending on the text that typists copied. If they do, then we can infer
that utilization of knowledge about n-gram likelihoods can be controlled by typing
context.
Figure 14.4 shows scatterplots of individual typists’ correlations between IKSIs
and letter (r = −0.214, p < 0.001; r = −0.080, p = 0.13), bigram (r = 0.345,
FIGURE 14.4 Scatterplots of individual typist correlations between n-gram frequency
and IKSIs for each as a function of mean typing speed for the bigram and random
paragraphs. (Bigram paragraph correlations; panels: Letters, y = −0.07 − 0.00072x,
r² = 0.0606; Bigrams, y = −0.3 + 0.00061x, r² = 0.127; Trigrams, y = −0.2 + 0.00023x,
r² = 0.0412. Random paragraph correlations; panels: Letters, y = −0.63 − 0.000079x,
r² = 0.00749; Bigrams, y = −0.5 + 0.00029x, r² = 0.164; Trigrams, y = −0.36 + 0.00028x,
r² = 0.102. y-axes: frequency/IKSI correlations; x-axes: mean typing speed.)
p < 0.001; r = 0.340, p < 0.001), and trigram (r = 0.171, p < 0.001; r = 0.244,
p < 0.001) frequencies, respectively, for both the bigram and random paragraphs,
and as a function of mean IKSI or overall typing speed. In general, we see the same
qualitative patterns as before. For the bigram paragraph, the slower typists are more
negatively correlated with letter frequency than the faster typists, and the faster
typists are more negatively correlated with bigram and trigram frequency than the
slower typists. For the random paragraph, the slope of the regression line relating
mean typing speed to letter correlations was not significantly different from 0,
showing no differences between faster and slower typists. However, the figure
shows that all typists were negatively correlated with letter frequency. Typing
random strings of letters disrupts normal typing (Shaffer & Hardwick, 1968), and
appears to have turned our faster typists into novices, in that the faster typists’
pattern of correlations looks like the novice signature pattern. It is noteworthy
that even though typing random letter strings disrupted normal typing by slowing
down mean typing speed, it did not cause a breakdown of sensitivity to sequential
structure. We return to this finding in the general discussion.

General Discussion
We examined whether measures of typing performance could test predictions
about how learning and memory participate in the acquisition of skilled
serial-ordering abilities. Models of learning and memory make straightforward
predictions about how people become sensitive to sequential regularities in actions
that they produce. Novices become tuned to lower-order statistics, like single
letter frequencies, then with expertise develop sensitivity to higher-order statistics,
like bigram and trigram frequencies, and in the process appear to lose sensitivity
to lower-order statistics. We saw clear evidence of these general trends in our
cross-sectional analysis of a large number of typists.
The faster typists showed stronger negative correlations with bigram and trigram
frequencies than the slower typists. This is consistent with the prediction that
sensitivity to higher-order sequential structure develops over practice. We also
found that faster typists showed weaker negative correlations with letter frequency
than the slower typists. This is consistent with the prediction that sensitivity to
lower-order sequential structure decreases with practice.

Discriminating Between SRN and Instance Theory Models


The instance theory and SRN model predictions that we were testing are globally
similar. Instance theory predicts that sensitivity to higher-order sequential structure
develops in parallel with sensitivity to lower-order sequential structure, albeit
at a slower rate because smaller n-gram units are experienced more frequently
than larger n-gram units. The SRN model assumes a scaffolding process, with
sensitivity to lower-order structure as a prerequisite for higher-order structure. So,
both models assume that peaks in sensitivity to lower-order sequential structure
develop before peaks in sensitivity to higher-order sequential structure. The data
from our cross-sectional analyses are too coarse to evaluate fine differences in rates
of acquisition of n-gram structure across expertise.
However, the models can be evaluated on the basis of another prediction. The
SRN model assumes that experts who have become sensitive to higher-order
sequential statistics have lost their sensitivity to lower-order statistics. Instance
theory assumes sensitivity to lower-order statistics remains and grows stronger with
practice, but does not influence performance at high levels of skill. Our data from
the normal typing paragraphs show that faster typists had weaker sensitivity to letter
frequency than slower typists. However, we also found that all typists showed strong
negative correlations with letter frequency when typing the random paragraph.
So, the fastest typists did not lose their sensitivity to lower-order structure. This
finding is consistent with the predictions of instance theory. Experts show less
sensitivity to letter frequency when typing normal words because their typing
speed hits the floor and they are typing at maximum rates; however their sensitivity
to letter frequency is revealed when a difficult typing task forces them to slow
down.

Relation to Response-Scheduling Operations


We divided our questions about serial-ordering into issues of abilities and operations.
Our data speak to the development of serial-ordering abilities in skilled typing.
They are consistent with the hypothesis that general learning and memory
processes do participate in hierarchically controlled skills, like typing. However,
our data do not speak directly to the nature of operations carried out by a
response-scheduling process controlling the serial-ordering of keystrokes. A central
question here is to understand whether/how learning and memory processes,
which clearly bias keystroke timing as a function of n-gram regularities, link
in with theories of the response-scheduling process. We took one step toward
addressing this question by considering the kinds of errors that our typists
committed.
The operations of Rumelhart and Norman’s (1982) response-scheduling process
involve a buffer containing activated letter–keystroke schemas for the current word
that is being typed. The word activates all of the constituent letter schemas in parallel,
and then a dynamic inhibition rule weights these activations, enabling generally
accurate serial output. We considered the possibility that learning and memory
representations sensitive to n-gram statistics could bias the activation of action schemas
in the buffer as a function of n-gram context, by increasing activation strength of
letters that are expected versus unexpected according to sequential statistics. We
assumed that such an influence would produce what we term “statistical” action
FIGURE 14.5 Histograms of the distribution of differences between the maximum
likelihoods for the typed error letter and the planned correct letter based on the bigram
context of the correct preceding letter. (Separate histograms for the Normal, Bigram,
and Random paragraphs; y-axis: number of errors; x-axis: typed error MLE minus
planned correct MLE.)

slips (Norman, 1981), which might be detected in typists’ errors. For example,
consider having to type the letters “qi.” Knowledge of sequential statistics should
lead to some activation of the letter “u,” which is more likely to follow “q” than “i.”
We examined all of our typists’ incorrect keystrokes, following the intuition that
when statistical slips occur, the letters typed in error should have a higher maximum
likelihood expectation given the prior letter than the letter that was supposed to
be typed according to the plan. We limited our analyses to erroneous keystrokes
that were preceded by one correct keystroke. For each of the 38,739 errors, we
subtracted the maximum likelihood expectation for the letter that was typed in
error given the correct predecessor, from the maximum likelihood expectation for
the letter that was supposed to be typed given the correct predecessor.
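
Concretely, the difference score for each error can be computed from the bigram counts as in the sketch below; the errors table, with columns for the correct preceding letter, the letter typed in error, and the planned letter, is an assumed data layout rather than the actual analysis code.

def mle(prev, nxt, bigram_counts):
    """Maximum-likelihood expectation of 'nxt' given 'prev': count(prev + nxt)
    divided by the count of all bigrams starting with prev."""
    total = sum(v for k, v in bigram_counts.items() if k[0] == prev)
    return 0.0 if total == 0 else bigram_counts.get(prev + nxt, 0) / total

# One row per analyzed error (errors preceded by exactly one correct keystroke).
errors["diff"] = [
    mle(prev, typed, bigram_counts) - mle(prev, planned, bigram_counts)
    for prev, typed, planned in zip(errors["prev"], errors["typed"], errors["planned"])
]
# Statistical action slips would appear as a positive shift in this distribution.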
Figure 14.5 shows the distributions of difference scores for errors produced
by all typists, separately for each paragraph typing condition. If knowledge of sequential statistics
biases errors, then we would expect statistical action slips to occur. Letters typed
in error should have higher likelihoods than the planned letter, so we would
expect the distributions of difference scores to be shifted away from 0 in a positive
direction. None of the distributions for errors from typing any of the paragraphs
are obviously shifted in a positive direction. So, it remains unclear how learning and
memory processes contribute to the operations of response-scheduling. They do
influence the keystroke speed as a function of n-gram frequency, but apparently
do so without causing a patterned influence on typing errors that would be
expected if n-gram knowledge biased activation weights for typing individual
letters. Typists make many different kinds of errors for other reasons, and larger
datasets could be required to tease out statistical slips from other more common
errors like hitting a nearby key, transposing letters within a word, or missing letters
entirely.
Big Data at the Fingertips


We used Big Data tools to address theoretical issues about how people develop
high-level skill in serial-ordering their actions. We asked how typists “analyze”
Big Data that comes in the form of years of experience of typing, and apply
knowledge of sequential structure in that data to their actions when they are typing.
Our typists’ learning and memory processes were crunching Big Data with their
fingertips. More generally, Big Data tools relevant for experimental psychology
are literally at the fingertips of researchers in an unprecedented fashion that is
transforming the research process. We needed to estimate the statistical structure
of trigrams in the English language, and accomplished this task in a couple of
days by downloading and analyzing freely available massive corpuses of natural
language, which were a click away. We needed hundreds of people to complete
typing tasks to test our theories, which we accomplished in a couple of days using
freely available programming languages and the remarkable mTurk service. We
haven’t figured out how to use Big Data to save time thinking about our results and
writing this paper. Nevertheless, the Big Data tools we used dramatically reduced
the time needed to collect the data needed to test the theories, and they also
enabled us to ask these questions in the first place. We are excited to see where
they take the field in the future.

Acknowledgment
This work was supported by a grant from NSF (#1353360) to Matthew Crump.
The authors would like to thank Randall Jamieson, Gordon Logan, and two
anonymous reviewers for their thoughtful comments and discussion in the
preparation of this manuscript.

References
Borel, É. (1913). La mécanique statique et l’irréversibilité. Journal de Physique Theorique et
Appliquee, 3, 189–196.
Botvinick, M. M., & Plaut, D. C. (2004). Doing without schema hierarchies: A recurrent
connectionist approach to normal and impaired routine sequential action. Psychological
Review, 111, 395–429.
Botvinick, M. M., & Plaut, D. C. (2006). Such stuff as habits are made on: A reply to
Cooper and Shallice (2006). Psychological Review, 113, 917–927.
Cleeremans, A. (1993). Mechanisms of implicit learning connectionist models of sequence processing.
Cambridge, MA: MIT Press.
Cleeremans, A., & McClelland, J. L. (1991). Learning the structure of event sequences.
Journal of Experimental Psychology: General, 120, 235–253.
Cooper, R. P., & Shallice, T. (2000). Contention scheduling and the control of routine
activities. Cognitive Neuropsychology, 17, 297–338.
Cooper, R. P., & Shallice, T. (2006). Hierarchical schemas and goals in the control of
sequential behavior. Psychological Review, 113, 887–916.
Crump, M. J. C., & Logan, G. D. (2010a). Hierarchical control and skilled typing: Evidence
for word-level control over the execution of individual keystrokes. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 36, 1369–1380.
Crump, M. J. C., & Logan, G. D. (2010b). Warning: This keyboard will deconstruct—The
role of the keyboard in skilled typewriting. Psychonomic Bulletin & Review, 17, 394–399.
Crump, M. J. C., McDonnell, J. V., & Gureckis, T. M. (2013). Evaluating Amazon’s
Mechanical Turk as a tool for experimental behavioral research. PLoS One, 8, e57410.
Dell, G. S., Burger, L. K., & Svec, W. R. (1997). Language production and serial order: A
functional analysis and a model. Psychological Review, 104, 123–147.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Estes, W. K. (1972). An associative basis for coding and organization in memory. In A. W.
Melton & E. Martin (Eds.), Coding processes in human memory (pp. 161–190). Washington,
DC: V. H. Winston & Sons.
Fodor, J. A. (1983). The modularity of mind: An essay on faculty psychology. Cambridge, MA:
MIT Press.
Gentner, D. R., Grudin, J., & Conway, E. (1980). Finger movements in transcription typing.
DTIC document. San Diego, CA: University of California, San Diego, La Jolla Center
for Human Information Processing.
Grudin, J. T., & Larochelle, S. (1982). Digraph frequency effects in skilled typing. DTIC
document. San Diego, CA: University of California, San Diego, La Jolla Center for
Human Information Processing.
Jamieson, R. K., & Mewhort, D. J. K. (2009a). Applying an exemplar model to the
artificial-grammar task: Inferring grammaticality from similarity. The Quarterly Journal
of Experimental Psychology, 62, 550–575.
Jamieson, R. K., & Mewhort, D. J. K. (2009b). Applying an exemplar model to the serial
reaction-time task: Anticipating from experience. The Quarterly Journal of Experimental
Psychology, 62, 1757–1783.
Jones, M. N., & Mewhort, D. J. (2004). Case-sensitive letter and bigram frequency counts
from large-scale English corpora. Behavior Research Methods, Instruments, & Computers, 36,
388–396.
Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the special issue on the web as
corpus. Computational Linguistics, 29, 333–347.
Lashley, K. S. (1951). The problem of serial order in behavior. In L. A. Jeffress (Ed.), Cerebral
mechanisms in behavior (pp. 112–136). New York: Wiley.
Liu, X., Crump, M. J. C., & Logan, G. D. (2010). Do you know where your fingers have
been? Explicit knowledge of the spatial layout of the keyboard in skilled typists. Memory
& Cognition, 38, 474–484.
Logan, G. D. (1988). Toward an instance theory of automatization. Psychological Review, 95,
492–527.
Logan, G. D., & Crump, M. J. C. (2009). The left hand doesn’t know what the right hand is
doing: The disruptive effects of attention to the hands in skilled typewriting. Psychological
Science, 20, 1296–1300.
Logan, G. D., & Crump, M. J. C. (2010). Cognitive illusions of authorship reveal hierarchical
error detection in skilled typists. Science, 330, 683–686.
Logan, G. D., & Crump, M. J. C. (2011). Hierarchical control of cognitive processes: The
case for skilled typewriting. In B. H. Ross (Ed.), The Psychology of Learning and Motivation
(Vol. 54, pp. 1–27). Burlington: Academic Press.
McCloskey, M., & Cohen, N. (1989). Catastrophic interference in connectionist networks:
The sequential learning problem. Psychology of Learning and Motivation, 24, 109–165.
Mewhort, D. J. (1966). Sequential redundancy and letter spacing as determinants of
tachistoscopic recognition. Canadian Journal of Psychology, 20, 435.
Mewhort, D. J. (1967). Familiarity of letter sequences, response uncertainty, and the
tachistoscopic recognition experiment. Canadian Journal of Psychology, 21, 309.
Miller, G. A., Galanter, E., & Pribram, K. H. (1960). Plans and the structure of behavior.
New York, NY: Adams-Bannister-Cox.
Nissen, M. J., & Bullemer, P. (1987). Attentional requirements of learning: Evidence from
performance measures. Cognitive Psychology, 19, 1–32.
Norman, D. A. (1981). Categorization of action slips. Psychological Review, 88, 1–15.
Palmer, C., & Pfordresher, P. Q. (2003). Incremental planning in sequence production.
Psychological Review, 110, 683–712.
Reber, A. S. (1969). Transfer of syntactic structure in synthetic languages. Journal of
Experimental Psychology, 81, 115–119.
Rosenbaum, D. A., Cohen, R. G., Jax, S. A., Weiss, D. J., & van der Wel, R. (2007).
The problem of serial order in behavior: Lashley’s legacy. Human Movement Science, 26,
525–554.
Rumelhart, D. E., & Norman, D. A. (1982). Simulating a skilled typist: A study of skilled
cognitive-motor performance. Cognitive Science, 6, 1–36.
Shaffer, L. H., & Hardwick, J. (1968). Typing performance as a function of text. The
Quarterly Journal of Experimental Psychology, 20, 360–369.
Snyder, K. M., Ashitaka, Y., Shimada, H., Ulrich, J. E., & Logan, G. D. (2014). What skilled
typists don’t know about the QWERTY keyboard. Attention, Perception, & Psychophysics,
76, 162–171.
Watson, J. B. (1920). Is thinking merely action of language mechanism? British Journal of
Psychology, General Section, 11, 87–104.
Wickelgren, W. A. (1969). Context-sensitive coding, associative memory, and serial order
in (speech) behavior. Psychological Review, 76, 1–15.
15
CAN BIG DATA HELP US UNDERSTAND HUMAN VISION?
Michael J. Tarr and Elissa M. Aminoff

Abstract
Big Data seems to have an ever-increasing impact on our daily lives. Its application
to human vision has been no less impactful. In particular, Big Data methods
have been applied to both content and data analysis, enabling a new, more
fine-grained understanding of how the brain encodes information about the
visual environment. With respect to content, the most significant advance has
been the use of large-scale, hierarchical models—typically “convolutional neural
networks” or “deep networks”—to explicate how high-level visual tasks such
as object categorization can be achieved based on learning across millions of
images. With respect to data analysis, complex patterns underlying visual behavior
can be identified in neural data using modern machine-learning methods or
“multi-variate pattern analysis.” In this chapter, we discuss the pros and cons
of these applications of Big Data, including limitations in how we can interpret
results. In the end, we conclude that Big Data methods hold great promise
for pursuing the challenges faced by both vision scientists and, more generally,
cognitive neuroscientists.

Introduction
With its inclusion in the Oxford English Dictionary (OED), Big Data has come
of age. Beyond according “Big Data” an entry in their lexicon, the quotes
that the OED chose to accompany the definition are telling, painting a picture
from skepticism in 1980, “None of the big questions has actually yielded to the
bludgeoning of the big data people,” to the realization of Big Data’s value in
2003, “The recognition that big data is a gold mine and not just a collection
of dusty tapes” (Oxford English Dictionary, 2016). Today Big Data is being
used to predict flu season severity (Ginsberg et al., 2009; sometimes imperfectly;
Lazer, Kennedy, King, & Vespignani, 2014), guide game time decisions in sports
(Sawchik, 2015), and precisely call elections before they happen (Clifford, 2008).
These and other high-profile Big Data applications typically rely on quantitative
or textual data—people’s preferences, atmospheric measurements, online search
behavior, hitting and pitching statistics, etc. In contrast, Big Data applied to vision
typically involves image statistics; that is, what kinds of visual information across
1,000,000s of images support object categorization, scene recognition, or other
high-level visual tasks. In this vein, perhaps the most well-known result over the
past decade is the finding by a Google/Stanford team that YouTube videos are
frequently populated by cats (Le et al., 2012). Although the cats-in-home-movies finding
was certainly not the first application of Big Data to images, this paper’s notoriety
signaled that Big Data had come to vision.
The vision community’s deployment of Big Data mirrors similar applications
across other domains of artificial and biological intelligence (e.g., natural language
processing). Such domains are unique in that they often attempt to link artificial
systems to biological systems performing the same task. As such, Big Data is
typically deployed in two distinct ways.
First, Big Data methods can be applied to content. That is, learning systems
such as convolutional or “deep” neural networks (LeCun, Bengio, & Hinton,
2015) can be trained on millions of images, movies, sounds, text passages, and
so on. Of course, before the rise of the Internet accessing such large datasets
was nearly impossible, hence the term “web-scale,” which is sometimes used
to denote models relying on this sort of data (Mitchell et al., 2015; Chen,
Shrivastava, & Gupta, 2013). Recent demonstrations of the application of Big
Data to image classification are numerous, exemplified by the generally strong
interest in the ImageNet competition (Russakovsky et al., 2015), including systems
that automatically provide labels for the content of images drawn from ImageNet
(Deng et al., 2014). However, a more intuitive demonstration of the popularity
and application of Big Data to image analysis can be found in Google Photos.
Sometime in the past few years Google extended Photos’ search capabilities to
support written labels and terms not present in any of one’s photo labels (Rosenberg
& Image Search Team, 2013). For example, the first author almost never labels his
uploaded photos, yet entering the term “cake” into the search bar correctly yielded
five very different cakes from his photo collection (Figure 15.1). Without knowing
exactly what Google is up to, one presumes that they have trained a learning model
on millions of images from their users’ photos and that there are many labeled and
unlabeled images of cakes in this training set. Given this huge training set, when
presented with the label “cake,” Google Photos is able to sift through an unlabeled
photo collection and pick the images most likely to contain a cake.
FIGURE 15.1 Big Data methods applied to visual content. Images returned by
the search term “cake” from the first author’s personal Google Photos collection
(https://photos.google.com).

Second, Big Data methods can be applied to data analysis. That is, a set of
neural (typically) or behavioral (occasionally) data collected in a human experiment
may be analyzed using techniques drawn from machine learning or statistics.
In cognitive neuroscience, a family of such approaches is often referred to as
multi-voxel or multivariate pattern analysis or “MVPA” (Norman, Polyn, Detre, &
Haxby, 2006; Haxby et al., 2001). MVPA is appealing in that it takes into account
the complex pattern of activity encoded across a large number of neural units
rather than simply assuming a uniform response across a given brain region. MVPA
methods are often used to ask where in the brain is information encoded with
respect to a specific stimulus contrast. More specifically, one of many different
classifiers (typically linear) will be trained on a subset of neural data and then
used to establish which brain region(s) are most effective for correctly classifying
new data (i.e. the best separation of the data with respect to the contrast of
interest). However, when one looks more closely at MVPA methods, it is not
clear that the scientific leverage they provide is really Big Data. That is, while
it is certainly desirable to apply more sophisticated models to neuroimaging data,
acknowledging, for example, that neural codes may be spatially distributed in a
non-uniform manner, there is little in MVPA and related approaches that suggests
that they employ sufficient numbers of samples to enable the advantages that come
with Big Data. As we review below, we hold that all present-day neuroimaging
methods are handicapped with respect to how much data can practically be
collected from a given individual or across individuals.
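
For readers unfamiliar with the approach, a bare-bones MVPA-style decoding analysis looks something like the sketch below (in Python, using scikit-learn). The random matrices stand in for a hypothetical trials-by-voxels response matrix from one region of interest and the condition labels for each trial; this is a generic illustration of the logic, not the analysis from any particular study cited here.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
voxels = rng.normal(size=(200, 500))        # hypothetical trials x voxels matrix
labels = rng.integers(0, 2, size=200)       # hypothetical stimulus contrast labels

# Train a linear classifier on a subset of trials and test it on held-out trials;
# above-chance accuracy implies the contrast is decodable from the distributed
# pattern of activity in this region.
clf = LinearSVC(C=1.0)
scores = cross_val_score(clf, voxels, labels, cv=5)
print("mean decoding accuracy:", scores.mean())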
We believe that the real advantage of Big Data applied to understanding human
vision will come from a third approach—that of using large-scale artificial models as
proxy models of biological processing. That is, as we will argue below, well-specified
computational models of high-level biological vision are scarce. At the same time,
progress in computer vision has been dramatic over the past few years (Russakovsky
et al., 2015). Although there is no way to know, a priori, that a given artificial
vision system either represents or processes visual information in a manner similar
to the human visual system, one can build on the fact that artificial systems rely on
visual input that is often similar to our own visual world—for example, images of
complex scenes, objects, or movies. At the same time, the output of many artificial
vision systems is likewise coincident with the apparent goals of the human visual
system—object and/or scene categorization and interpretation.

What is Big Data?


In assessing the potential role of Big Data methods in the study of vision, we ask
how much of the data collected by brain scientists is really “big” rather than simply complicated or ill-understood. To even pose this question, we need to come up with some definition of what we mean by big. As a purely ad hoc definition for the purposes of this chapter,¹ we will consider datasets on the order of 10⁶ samples
as being officially big. With respect to both visual input and the analysis of neural
data, the number of samples qua samples is not the only issue. For example, when
one thinks about image stimuli or brain data it is worth considering the degree to
which individual samples, typically single image frames or voxels, are correlated
with one another. Approaches to Big Data in the artificial vision domain, for
instance, convolutional neural networks (LeCun et al., 2015), assume images that
are bounded by being neither entirely independent of one another nor overly
correlated with one another. That is, if one’s samples—no matter how many of
them we obtain—are either too independent or too correlated, then no amount of
data will suffice to allow inference about the overall structure of the visual world
or its mental representation. Indeed, some degree of commonality across images
is leveraged by most Big Data models. Of course, to the extent that complete
independence between natural images or voxels is highly unlikely, almost any large
collection of images or set of voxels should suffice (e.g. Russakovsky et al., 2015;
Haxby et al., 2001). On the other hand, the bound of being overly correlated may
be a concern under some circumstances: If all of our image samples are clustered
around a small number of regions in a space, such as considering only images of
cats, our model is unlikely to make many effective visual inferences regarding other
object categories. Put another way, we need samples that cover cats, dogs, and a
lot more.
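One crude way to make this intuition concrete is to measure how redundant a sample set is, for example via the mean pairwise correlation of image descriptors. The sketch below uses random vectors purely for illustration; any real image embedding could be substituted.

    # Illustrative check of sample redundancy: mean pairwise correlation of
    # image descriptors. Values near 1 suggest an over-clustered sample
    # (e.g. all cats); values near 0 suggest little shared structure to learn.
    import numpy as np

    def mean_pairwise_correlation(features):
        # features: (n_images, n_dims) array of image descriptors or embeddings
        corr = np.corrcoef(features)
        return corr[~np.eye(len(corr), dtype=bool)].mean()

    rng = np.random.default_rng(1)
    diverse = rng.normal(size=(500, 128))                                      # unrelated samples
    clustered = rng.normal(size=(1, 128)) + 0.1 * rng.normal(size=(500, 128))  # "all cats"
    print("diverse set:  ", round(float(mean_pairwise_correlation(diverse)), 2))    # ~0
    print("clustered set:", round(float(mean_pairwise_correlation(clustered)), 2))  # ~1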
Of course, the visual world we experience is highly non-independent: For
some period of time, we might see our cat playing with a ball of twine. That
is, our experience is sequential and governed by the causal dynamics of our
environment. Thus, many visual samples over shorter time windows are at least locally
non-independent, consisting of a series of coherent “shots.” The most likely image
following a given image in dynamic sequence is almost always another image
with nearly the same content. Indeed, the human visual system appears to rely
on this fact to build up three-dimensional representations of objects over time
(Wallis, 1996). At the same time, more globally, there is likely to be significant
independence between samples taken over longer time lags. That is, we do see a
variety of cats, dogs, and much more as we go about our everyday lives. These
sorts of issues are important to consider in terms of stimuli for training artificial
vision systems or testing human vision in the era of Big Data. When systems
and experiments only included a handful of conditions, variation across stimuli
was difficult to achieve and one often worried about the generalizability of one’s
results.
Now, however, with the use of many more stimuli, the issue is one of variation:
We need sufficient variation to support generalization, but sufficient overlap to
allow statistical inferences. That is, with larger stimulus sets, it is important to
consider how image variation is realized within a given set. For example, movies
(the YouTube kind, not the feature-length kind) as stimuli are a great source of
large-scale visual data. By some estimates there are nearly 100,000,000 videos on
YouTube averaging about 10 minutes each in length. In terms of video frames,
assuming a 30 fps frame rate, we have about 1,800,000,000,000 available image
samples on YouTube alone. However, frames taken from any given video are
likely to be highly correlated with one another (Figure 15.2). Thus, nominally
“Big Data” models or experiments that rely on movie frames as input may be
overestimating the actual scale of the employed visual data. Of course, vision
researchers realize that sequential samples from a video sequence are typically
non-independent, but in the service of nominally larger datasets, this consideration
is sometimes overlooked.
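The gap between nominal and effective sample sizes can be made explicit with the back-of-the-envelope arithmetic used above. The video counts are the rough estimates from the text; the shots-per-minute figure is an arbitrary assumption introduced only to illustrate the discounting.

    # Back-of-the-envelope: nominal vs. "effective" frame counts for video data.
    videos = 100_000_000              # rough YouTube estimate used in the text
    minutes_per_video = 10
    frames_per_second = 30

    nominal_frames = videos * minutes_per_video * 60 * frames_per_second
    print(f"nominal frame count: {nominal_frames:.1e}")       # ~1.8e+12

    shots_per_minute = 5              # assumed: roughly independent "shots" per minute
    effective_samples = videos * minutes_per_video * shots_per_minute
    print(f"effective sample count: {effective_samples:.1e}")  # ~5.0e+09, far smaller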
Modulo the issue of sample variance, it is the high dimensionality of the visual world and of our mental representation of that world that makes one think “Aha!


FIGURE 15.2 Three non-adjacent frames from a single movie. Although each frame of
the movie might be considered a distinct sample under some approaches, the content
contained in each frame is strongly correlated with the content shown in other frames
from the same movie (adapted from www.beachfrontbroll.com).

Vision is clearly big data—people see millions of different images every day and
there are about 3 × 10⁹ cortical neurons involved in this process” (Sereno &
Allman, 1991). Even better, there are probably about 5,000–10,000 connections
per neuron. Thus, both visual content and biological vision appear to be
domains of Big Data. Moreover, much of the leverage that Big Data provides
for vision is in densely sampling these spaces, thereby providing good coverage
of almost all possible images or vision-related brain states. However, we should
note that although the vision problem is both big and complicated, it is not the
problem itself that determines whether we can find solutions to both artificial and
biological vision problems using Big Data approaches. Rather, it is the kind of data
we can realistically collect that determines whether Big Data provides any leverage
for understanding human vision.
In particular, within the data analysis domain, sample sizes are necessarily
limited by sampling methods, including their bandwidth limitations and their cost,
and by the architecture of the brain itself. With respect to this latter point, almost all
human neuroimaging methods unavoidably yield highly correlated samples driven
by the same stimulus response, rather than samples from discrete responses. This is
true regardless of whether we measure neural activity at spatial locations in the
brain or at time points in the processing stream (or both).
To make this point more salient, consider the methodological poster child
of modern cognitive neuroscience: functional magnetic resonance imaging
(fMRI). fMRI is a powerful, non-invasive method for examining task-driven,
function-related neural activity in the human brain. The strength of fMRI is spatial
localization—where in the brain differences between conditions are reflected in
neural responses. The unit for where is a “voxel”—the minimal volume of brain
tissue across which neural activity may be measured. While the size of voxels in
fMRI has been continually shrinking, at present, the practical volume limit² for
imaging the entire brain is about 1.0 mm³—containing about 100,000 cortical
neurons (Sereno & Allman, 1991). As a consequence, the response of a voxel in
fMRI is actually the aggregate response across 100,000 or more neurons. This level
of resolution has the effect of blurring any fine-grained neural coding for visual
information and, generally, creating samples that are more like one another than
they otherwise might be if the brain could be sampled at a finer scale. Reinforcing
this point, there is evidence that individual visual neurons have unique response
profiles reflecting high degrees of selectivity for specific visual objects (Woloszyn
& Sheinberg, 2012). A second reason why spatially adjacent voxels tend to exhibit
similar neural response profiles is that the brain is organized into distinct, localized
neural systems that realize different functions (Fodor, 1983). In vision this means
that voxels within a given region of the visual system hierarchy (e.g. V1, V2, V4,
MT, etc.) respond in a like fashion to visual input (which is how functional regions
are defined in fMRI). That is, adjacent voxels are likely to show similarly strong
responses to particular images and similarly weak responses to other images
(a fact codified in almost all fMRI analysis pipelines whereby there is a minimum
cluster or volume size associated with those regions of activity that are considered
to be significant). By way of example, voxels in a local neighborhood often appear
to be selective for specific object categories (Kanwisher, McDermott, & Chun,
1997; Gauthier, Tarr, Anderson, Skudlarski, & Gore, 1999). Thus, regardless of
how many voxels we might potentially sample using fMRI, their potentially high
correlation implies that it may be difficult to obtain enough data to densely sample
the underlying fine-grained visual representational space.
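A toy simulation makes the blurring argument concrete: average a large pool of sparsely tuned "neurons" into a single voxel, and the selectivity visible at the single-unit level largely disappears. All parameters below are arbitrary, and the pool is far smaller than a real voxel's roughly 100,000 neurons, purely for speed.

    # Toy simulation: pooling many selective units into a voxel washes out selectivity.
    import numpy as np

    rng = np.random.default_rng(2)
    n_images, pool_size = 500, 20_000

    responses = rng.normal(0.0, 1.0, size=(pool_size, n_images))
    responses += 20.0 * (rng.random((pool_size, n_images)) < 0.01)   # each unit prefers ~1% of images

    def selectivity(unit):
        # crude index: how far the best image stands out from the typical response
        return (unit.max() - unit.mean()) / unit.std()

    voxel = responses.mean(axis=0)           # aggregate "voxel" response to each image
    neuron_index = np.median([selectivity(r) for r in responses[:500]])
    print("median single-unit selectivity:", round(float(neuron_index), 1))       # ~10
    print("pooled voxel selectivity:      ", round(float(selectivity(voxel)), 1)) # ~3, close to noise alone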
Similar limitations arise in the temporal domain. First, within fMRI there is a
fundamental limit based on the “hemodynamic response function” or HRF. Since
fMRI measures changing properties of blood (oxygenation), the rate at which
oxygenated blood flows into a localized brain region limits the temporal precision
of fMRI. A full HRF spans some 12–16 seconds; however, methodological
cleverness has allowed us to reduce temporal measurements using fMRI down
to about 2–3 seconds (Mumford, Turner, Ashby, & Poldrack, 2012). Still, given
that objects and scenes may be categorized in about 100 ms (Thorpe, Fize, &
Marlot, 1996), 2–3 seconds is a relatively coarse sampling rate that precludes
densely covering many temporal aspects of visual processing. As discussed below,
there are also practical limits on how many experimental trials may be run in a
typical one-hour study.
Alternatively, vision scientists have used electroencephalography (EEG; as well
as its functional variant, event-related potentials or ERPs) and magnetoencephalog-
raphy (MEG) to explore the fine-grained temporal aspects—down to the range of
milliseconds—of visual processing. With such techniques, the number of samples
that may be collected in a relatively short time period is much greater than with
fMRI. However, as with spatial sampling, temporal samples arising from neural
activity are likely to be highly correlated with one another and probably should
not be thought of as discrete samples—the measured neural response functions are
typically quite smooth (much as in the movie-frame example shown in Figure 15.2).
Thus, the number of discrete temporal windows that might be measured during
visual task performance is probably much smaller than the raw number of samples.
At the same time, the spatial sampling resolution of EEG and MEG is quite
poor in that they both measure summed electrical potentials (or their magnetic
effects) using a maximum of 256–306 scalp sensors. Not only is the dimensionality
of the total sensor space small as compared to the number of potential neural
sources in the human brain, but source reconstruction methods must be used to
estimate the putative spatial locations generating these signals (Yang, Tarr, & Kass,
2014). Estimation of source locations in EEG or MEG is much less reliable than in
fMRI—on the order of 1 cm at best given current methods (at least 100,000,000
neurons per reconstructed source). As such, it is difficult to achieve any sort
of dense sampling of neural units using either technique.
In trying to leverage Big Data for the study of human vision, we are also
faced with limitations in experimental power. By power, we mean the minimum
sample size required to reliably detect an effect between conditions (i.e. correctly
rejecting the null). Power is constrained by three different factors that necessarily
limit the amount of data we can collect from both individuals and across a
population.
First, although different neuroimaging methodologies measure different
correlates of neural activity, they are all limited by human performance. That is, we
can only show so many visual stimuli and collect so many responses in the typical
vision experiment. Assuming a minimum display time of between 100 and 500 ms
per stimulus and a minimum response time of between 200 and 500 ms per subject
response, adding in a fudge factor for recovery between trials, we ideally might
be able to run 2,400 experimental trials of 1.5 sec each during a one-hour experiment.
However, prep time, consent, rests between runs, etc. rapidly eat into the total
time during which we can actually collect data. Moreover, 1.5 sec is an extremely rapid pace that
is likely to fatigue subjects. Realistically, 1,000 is about the maximum number of
discrete experimental trials that might be run in one hour.
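The arithmetic behind these trial counts is simple enough to state explicitly. The per-trial timings below follow the ranges given above; the overhead and relaxed pacing values are assumptions chosen only to reproduce the ballpark figures in the text.

    # Worked version of the trial-count arithmetic above.
    display_s, response_s, recovery_s = 0.5, 0.5, 0.5   # top of the stated ranges plus recovery
    trial_s = display_s + response_s + recovery_s       # 1.5 s per trial
    session_s = 60 * 60

    print("upper bound:", round(session_s / trial_s), "trials per hour")   # 2400

    overhead_s = 15 * 60        # assumed: prep, consent, rests between runs
    relaxed_trial_s = 2.7       # assumed: slower pacing to limit subject fatigue
    print("realistic:  ", round((session_s - overhead_s) / relaxed_trial_s), "trials per hour")  # 1000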
Second, as we have already discussed, different neuroimaging methodologies are
limited by what they actually measure. That is, the particular correlates of neural
activity measured by each method have specific spatial or temporal limitations.
Spatially, MRI methods provide the highest resolution using non-invasive
techniques.³ As mentioned above, at present, the best generally deployed MRI
systems can achieve a resolution of about 1 mm³ when covering the entire brain.
More realistically, most vision scientists are likely to use scanning parameters that
will produce a functional brain volume of about 700,000 voxels at one time
point assuming the frequently used 2 mm³ voxel size.⁴ In contrast, as already
discussed, because EEG and MEG measure electrical potentials at the scalp, the
spatial localization of neural activity requires source reconstruction, whereby the
highest resolution that can be achieved is on the order of 1 cm³. In either case, the
sampling density is quite low relative to the number of neurons in the visual
cortex.
The challenge of limited sample sizes—in absolute terms or relative to the
dimensionality of the domain-to-be-explained—is exacerbated by the third, highly
practical factor: cost. That is, regardless of the efficacy of a given method,
experimental power is inherently limited by the number of subjects that can be
run. In this vein, both fMRI and MEG tend to be relatively expensive—costing
in the neighborhood of $500/hour. Moreover, this cost estimate typically reflects
only operating costs, but not acquisition costs, which are in the neighborhood of
$1,000,000 a tesla (e.g. a 3T scanner would cost about $3,000,000 to purchase
and install). As such, modern vision science, which has enthusiastically adopted
neuroimaging tools in much of its experimental research, appears to be suffering
from a decrease in experimental power relative to earlier methods. Whether this
matters for understanding human vision is best addressed by asking what makes
Big Data “work”—the question we turn to next.

Why Does Big Data Work?


Assuming we are dealing with truly Big Data (e.g. reasonably independent samples
and good coverage of the entire space of interest), it is worth asking why Big Data
methods seem to be so effective, particularly for many visual pattern recognition
problems (Russakovsky et al., 2015). To help answer this question, consider
Pinker’s (1999) hypothesis as to how we mentally encode word meanings. He
suggests that most word forms can be generated by applying a set of rules to base
word meanings (e.g. if “walk” is the present tense, the rule of adding an “ed”
generates the past tense form “walked”). However, such rules are unlikely to
handle every case for a given language—what is needed is the equivalent of a
“look-up table” or large-scale memory storage to handle all of the non-rule-based
forms (e.g. “run” is the present tense, but “ran” is the past tense). We suggest that
a similar sort of rules+memory structure underlies many domains in biological
intelligence, and, in particular, how our visual system represents the world. That
is, we posit that high-level scene and object recognition is based, in part, on
reusable parts or features that can be recombined to form new instances of a
known class (Barenholtz & Tarr, 2007). It is important to note that we use the
term “rule” here to mean the default application of a high-frequency, statistically
derived inference (e.g. Lake, Salakhutdinov, & Tenenbaum, 2015), not a rule
in the more classical, symbolic sense (e.g. Anderson, 1993). At the same time,
such a visual “compositional” architecture cannot cover all of the possible visual
exemplars, images, scenes, and object categories humans encounter, learn, and
represent. Thus, a memory component is necessary to encode all of those images
that cannot be represented in terms of the compositional features of the system.
Based on this logic, we suggest that human object and scene representation is best
understood as a particular instance of a rules+memory mental system.
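A toy version of the rules+memory idea, using Pinker's past-tense example, makes the division of labor explicit: a small exception memory is consulted first, and a default rule covers the open-ended remainder, including novel forms. The word lists are illustrative only.

    # Minimal "rules + memory" illustration in the spirit of Pinker's account.
    IRREGULAR_PAST = {"run": "ran", "go": "went", "eat": "ate", "sing": "sang"}

    def past_tense(verb):
        if verb in IRREGULAR_PAST:          # memory: stored exceptions
            return IRREGULAR_PAST[verb]
        if verb.endswith("e"):              # rule: add -d / -ed by default
            return verb + "d"
        return verb + "ed"

    for v in ("walk", "smile", "run", "blick"):   # "blick" is a novel verb
        print(v, "->", past_tense(v))
    # walk -> walked, smile -> smiled, run -> ran, blick -> blicked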
Our hypothesis is that Big Data applied to content works well precisely because
Big Data methods effectively capture both of these aspects of the visual world.
For example, deep neural networks for vision typically have architectures that
include 1,000,000s of parameters and they are typically trained on 1,000,000s of
images. At the same time they are “deep” in the sense that they are composed
of a many-level—often more than 20—hierarchy of networks. Considering the
number of parameters in these models, there is little mystery regarding the ability
of such models to learn large numbers of distinct examples—they have memory to
spare and are provided with a very large number of training examples. Thus, they
are able to densely sample the target domain—the space of all visual images we
are likely to encounter (or at least that appear on Instagram or Flickr). Moreover,
this dense sampling of image space enables similarity-based visual reasoning: New
images are likely to be visually similar to some known images and it is often correct
to apply the same semantics to both. At the same time, large-scale data enable better
inferences about the statistical regularities that are most prevalent across the domain.
That is, across 1,000,000s of objects, particular image regularities will emerge and
can be instantiated as probabilistic visual “rules”—that is, default inferences that
are highly likely to hold across most contexts. Of note, because of the scale
of the training data and number of available parameters within the model, the
number of learned regularities can be quite large (i.e. much larger than the posited
number of grammatical rules governing human language) and can be quite specific
to particular categories. What is critical is that this set of visual rules can be applied
across collections of objects and scenes, even if the rules are not so general that they
are applicable across all objects and scenes. Interestingly, this sort of compositional
structure, while learnable by these models (perhaps because of their depth), may
be otherwise difficult to intuit or derive through formal methods.
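The memory side of this story is essentially nearest-neighbor label transfer over a densely sampled space, which can be sketched in a few lines. The features below are random stand-ins for whatever representation a large-scale model might learn.

    # Sketch of similarity-based label transfer: a new image inherits the
    # majority label of its nearest neighbours in a densely sampled "memory".
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(3)
    prototypes = rng.normal(size=(5, 64))                   # five visual categories
    X = np.vstack([p + 0.3 * rng.normal(size=(2000, 64)) for p in prototypes])
    y = np.repeat(np.arange(5), 2000)                       # dense memory: 10,000 labeled exemplars

    memory = KNeighborsClassifier(n_neighbors=15).fit(X, y)

    new_image = prototypes[2] + 0.3 * rng.normal(size=64)   # an unseen exemplar of category 2
    print("label transferred to the new image:", memory.predict(new_image.reshape(1, -1))[0])   # 2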
In sum, Big Data may be an effective tool for studying many domains of
biological intelligence, and, in particular, vision, because it is often realized in
models that are good at both ends of the problem. That is, the sheer number of
parameters in these approaches means that the visual world can be densely learned
in a memory-intensive fashion across 1,000,000s of training examples. At the same
time, to the extent visual regularities exist within the domain of images, such
regularities—not apparent with smaller sample sizes—will emerge as the number
of samples increases. Of course, these benefits presuppose large-scale, sufficiently
varying, discrete samples within training data—something available when studying
vision in the content domain, but, as we have reviewed above, less available or
possible in the analysis of neural data (using present-day methodologies). That is,
it is rarely the case that data arising from neuroimaging studies are of sufficient
scale to enable sufficient sampling and clear signals about how the human visual
system encodes images and makes inferences about them with respect to either raw
memorial processes or compositional representations.

Applications of Big Data to Human Vision


How then might Big Data methods be applied to the study of human vision?
As one example where neural data is sampled much more densely, Gallant and
colleagues (Huth, Nishimoto, Vu, & Gallant, 2012; Agrawal, Stansbury, Malik, &
Gallant, 2014; Stansbury, Naselaris, & Gallant, 2013) used fMRI to predict or
“decode” the mental states of human observers looking at frames drawn from
thousands of images presented in short, 10–20 second, movie clips. Each frame
of each movie was labeled for object and action content, thereby providing a
reasonably dense sampling of image space without any a priori hypotheses of
dimensionality or structure. Huth et al. (2012) then adopted a model based
on WordNet (Miller, 1995), which provides lexical semantics for both objects
and actions. The total number of lexical entries derived from WordNet was
1,705—thereby forming a 1,705-parameter model. Using the neural responses for
each image frame and its labeled contents, Huth et al. then used linear regression
to find parameter weights for the WordNet-based model for the response of each
individual voxel in the brain. The key result of this analysis was a semantic map
across the whole brain, showing which neural units responded to which of the
objects and actions (in terms of the 1,705 lexical labels). Interestingly, this map
took the form of a continuous semantic space, organized by category similarity,
contradicting the idea that visual categories are represented in highly discrete brain
regions.
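In schematic form, the voxelwise analysis described above amounts to regressing each voxel's responses on a wide matrix of content labels (1,705 WordNet features in the original study). The sketch below uses random data, and ridge regression is one common regularized choice rather than a claim about the exact fitting procedure Huth et al. used.

    # Schematic voxelwise encoding model with random stand-in data.
    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(4)
    n_timepoints, n_features, n_voxels = 3000, 1705, 500    # 500 voxels only for illustration

    stimulus_features = rng.normal(size=(n_timepoints, n_features))   # labeled movie frames
    true_weights = rng.normal(size=(n_features, n_voxels)) * (rng.random((n_features, n_voxels)) < 0.02)
    voxel_responses = stimulus_features @ true_weights + rng.normal(size=(n_timepoints, n_voxels))

    encoder = Ridge(alpha=10.0).fit(stimulus_features[:2500], voxel_responses[:2500])
    predicted = encoder.predict(stimulus_features[2500:])
    r = [np.corrcoef(predicted[:, v], voxel_responses[2500:, v])[0, 1] for v in range(n_voxels)]
    print("median held-out prediction r:", round(float(np.median(r)), 2))
    # encoder.coef_ is (n_voxels, n_features): one 1,705-dimensional semantic profile per voxel.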
As an alternative to decoding, proxy models allow the leverage provided by Big
Data in the content domain to be applied to behavioral or neural data. Many
examples of this sort of approach can be seen in the recent, and rapidly growing,
trend of applying models drawn from computer vision, and, in particular, deep
neural network models, to fMRI and neurophysiological data (Agrawal et al., 2014;
Yamins et al., 2014).
A somewhat more structured approach has been adopted by directly using
artificial vision models to account for variance in brain data. As alluded to
earlier, deep neural networks or convolutional neural networks have gained
rapid popularity as models for content analysis in many domains of artificial
intelligence (LeCun et al., 2015). One of the more interesting characteristics of
such models is that they are hierarchical: Higher layers represent more abstract, or
high-level visual representations such as object or scene categories, while lower
layers represent low-level visual information, such as lines, edges, or junctions
localized to small regions of the image. This artificial hierarchical architecture
appears quite similar to the low-to-high-level hierarchy realized in the human
visual system. Moreover, these artificial models appear to have similar goals to
human vision: Taking undifferentiated points of light from a camera or a retina and
generating high-level visual representations that capture visual category structure,
including highly abstract information such as living/non-living or functional roles.
Recently, studies involving both human fMRI (Agrawal et al., 2014) and
monkey electrophysiology (Yamins et al., 2014) have found that, across the visual
perception of objects and scenes, deep neural networks are able to successfully
predict and account for patterns of neural activity in high-level visual areas
(Khaligh-Razavi & Kriegeskorte, 2014). Although the deep network models
employed in these studies are not a priori models of biological vision, they serve, as
we have argued, as proxy models whereby progress will be made by incorporating
and testing the efficacy of biologically derived constraints. For example, based
on present results we can confirm, not surprisingly, that the primate visual
system is indeed hierarchical. Perhaps a little less obviously, Yamins et al. (2014)
used a method known as hierarchical modular optimization to search through
a space of convolutional neural networks and identify which model showed the
best—from a computer vision point of view—object categorization performance.
What is perhaps surprising is that there was a strong correlation between model
performance and a given model’s ability to predict neuron responses recorded from
monkey IT. That is, the model that performed best on the object categorization
task also performed best at predicting the responses of IT neurons. This suggests
that when one optimizes a convolutional neural network to perform the same task
for which we assume the primate ventral stream is optimized, similar intermediate
visual representations emerge in both the artificial and biological systems. Yamins
et al. (2014) support this claim with the finding that the best performing model was
also effective at predicting V4 neuron responses. At the same time, many challenges
remain. In particular, how do we understand such intermediate-level visual
representations—the “dark matter” of both deep networks and human vision?
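The proxy-model logic can be sketched as a layer-by-layer comparison: given activations from several layers of a pretrained network (assumed here to have been extracted offline) and responses to the same images from a brain region, ask which layer best predicts that region. Everything below is a random stand-in with plausible shapes, and the ridge mapping is an illustrative choice.

    # Sketch of a layer-to-region comparison using fabricated feature and voxel data.
    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(5)
    n_images, n_voxels = 1000, 100
    layer_features = {                              # assumed extracted offline from a deep net
        "early convolutional layer":  rng.normal(size=(n_images, 2000)),
        "mid convolutional layer":    rng.normal(size=(n_images, 1500)),
        "late fully connected layer": rng.normal(size=(n_images, 4096)),
    }
    # Fake "high-level ROI" responses driven by the late layer, purely for illustration.
    weights = rng.normal(size=(4096, n_voxels)) * (rng.random((4096, n_voxels)) < 0.01)
    roi = layer_features["late fully connected layer"] @ weights + rng.normal(size=(n_images, n_voxels))

    train, test = slice(0, 800), slice(800, None)
    for name, X in layer_features.items():
        predicted = Ridge(alpha=100.0).fit(X[train], roi[train]).predict(X[test])
        r = np.mean([np.corrcoef(predicted[:, v], roi[test, v])[0, 1] for v in range(n_voxels)])
        print(name, "mean held-out prediction r:", round(float(r), 2))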
To demonstrate how we might pursue the question of visual representation
in neural systems, we introduce an example from our own lab regarding how
the human brain represents complex, natural scenes. As in the studies discussed
above, we applied an artificial vision model trained on Big Data—1,000,000s of
images—to fMRI data. However, in this case, rather than predicting IT or V4
neural responses, we used the model to account for responses in three brain regions
already known to be involved in scene processing. Our goal was not to identify
new “scene-selective” areas, but to articulate how scenes are neurally represented
in terms of mid-level scene attributes. That is, we still know very little about how
scenes are neurally encoded and processed. Scenes are extremely complex stimuli,
rich with informative visual features at many different scales. What is unknown is
the “vocabulary” of these features—at present, there is no model for articulating
and defining these visual features that may then be tested against neural scene
processing data.
As mentioned above, we are interested in those mid-level visual features
that are built up from low-level features, and are combined to form high-level
features. In particular, such intermediate features seem likely to play a critical role
in visual recognition (Ullman, Vidal-Naquet, & Sali, 2002). By way of example,
although intuitively we can distinguish a “contemporary” apartment from a “rustic”
apartment, possibly based on the presence of objects in each scene, there are also
many non-semantic, mid-level visual features that may separate these categories.
Critically, such features are difficult to label or define. As such, it is at this
mid-level that the field of human visual science has fallen short in articulating
clear hypotheses. Why are mid-level features difficult to define? When trying to
articulate potential mid-level visual features we, as theorists, are biased and limited
in two ways. First, we are limited in that we define only those features that we can
explicitly label. However, useful visual features may not be easy to describe and
label, and therefore may not be obvious. This leads to the second limitation: We
are limited to defining those features that we think are important. Yet, introspection
does not provide conscious access to much (or even most) of our visual processing.
In order to move beyond these limitations and define a set of mid-level features
that may correlate with higher-level semantics, we have adopted artificial vision
models trained on “Big Data.” Big Data is likely to be useful here because it
leverages large-scale image analysis. Similarly, humans have a lifetime of exposure
to the image regularities that make up our visual environment. By analyzing
1,000,000s of images, artificial vision systems mimic this experience. At the same
time, relying on an artificial vision model to define mid-level visual features
removes the two biases discussed above. In particular, artificial vision models are
not restricted to labeled or intuitive visual features, but can, instead, build up
feature basis sets through statistical regularities across many images.
NEIL (“never ending image learner,” http://neil-kb.com/; Chen et al., 2013),
the particular artificial vision system we adopted, is actually not a deep neural
network. NEIL is a large-scale image analysis system that, using only weak
supervision, automatically extracts underlying statistical regularities from millions
of scene images and constructs intuitively correct scene categories and mid-level
attributes. As such, NEIL reduces the need for human intuitions and allows us
to explore the processing of features that are potentially important in moving
from low-level to high-level representations. NEIL is a particularly advantageous
model to use because NEIL learns explicit relationships between low, mid, and
high-level features and uses these relationships to better recognize a scene. For
example, it learns that jail cells have vertical lines (bars), and that a circus tent is
cone-shaped. By using these relationships, NEIL can limit the set of mid-level
features or “attributes” to those that carry meaningful information with respect
to characterizing a given scene. Critically, each attribute is one that accounts for
some variance across the scene space. Our conjecture is that NEIL’s attribute
representation is akin to how the human visual system learns which mid-level
features are optimally represented—those that best aid in characterizing scenes.
Such inferences can only be obtained using large-scale image data. Of note, this
aspect of NEIL represents a third advantage: Both NEIL and the human visual
system analyze images based on the common end goal of understanding scenes.
To explore NEIL’s ability to account for the neural representation of visual
scenes, we had participants view 100 different scenes while measuring their
brain activity using fMRI (Aminoff et al., 2015). Of particular interest was the
performance of mid-level visual attributes derived from NEIL as compared to
another high-level artificial vision model, SUN (Patterson, Xu, Su, & Hays,
2014). In contrast to NEIL, the features used in SUN were intuitively chosen
by humans as important in scene understanding (e.g. materials, affordances,
objects). If NEIL were to exhibit equivalent, or even better, performance
relative to SUN, this would indicate that data-driven models that derive critical
features as part of learning statistics across large numbers of images are effective,
outperforming hand-tuned, intuition-based models. Of course, we expect that
many different large-scale artificial vision models might do well at this task—the
future challenge will be understanding which model parameters are critical and
developing a comprehensive account of biological scene processing. As a first step,
to ensure that any mid-level features derived through NEIL were not simply highly
correlated with low-level features, we also compared NEIL’s performance to several
alternative artificial vision models that are considered effective for characterizing
features such as edges, lines, junctures, and colors (HOG, SIFT, SSIM, Hue
Histogram, GIST).
The performance of all of the artificial models was assessed by first generating a
scene similarity matrix under each model—that is, measured pairwise similarity
across the 100 scenes—and then comparing these matrices to scene similarity
matrices derived from our neuroimaging data or from behavioral judgments of
the same scenes (Figure 15.3). For matrices arising from the artificial models, we
calculated a Pearson coefficient for each cell by correlating the vectors of feature
weights for a given pair of scenes. For the matrix arising from fMRI data, we
calculated the Pearson coefficient by correlating the vectors of responses for a given
pair of scenes using voxels within a given brain region of interest (Figure 15.3(b)).
For the matrix arising from behavior, we used mTURK-derived data, averaged
over participants, in which those participants judged the similarity for each pair of
scenes. These correlation matrices allow us to examine the ability of the different
models to account for patterns of variation in neural responses across scenes or
patterns of variation in similarity judgments across scenes. To examine scene
processing in the human visual system, we selected regions of interest (ROI) that
have been previously found to be selective for scenes: The parahippocampal/lingual
region (PPA), the retrosplenial complex (RSC), and the transverse occipital cortex
(TOS, also known as the occipital place area, OPA; Figure 15.3(a)). As controls, we
also selected two additional regions: An early visual region as a control for general
visual processing; and the right dorsolateral prefrontal cortex (DLPFC) as a control
for general, non-visual responses during scene processing.
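The similarity-matrix comparison works as sketched below: build a scenes-by-scenes correlation matrix from each feature space (model features or ROI voxel patterns), then correlate the off-diagonal cells of those matrices. Random arrays stand in for the real model outputs and fMRI responses.

    # Schematic version of the scene-space comparison with stand-in data.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(6)
    n_scenes = 100

    def scene_space(features):
        # features: (n_scenes, n_dims) -> scenes-by-scenes Pearson similarity matrix
        return np.corrcoef(features)

    def compare_spaces(space_a, space_b):
        off_diagonal = ~np.eye(n_scenes, dtype=bool)        # ignore the trivial diagonal
        return pearsonr(space_a[off_diagonal], space_b[off_diagonal])[0]

    roi_voxels = rng.normal(size=(n_scenes, 400))                       # stand-in for an ROI's voxel patterns
    model_a = roi_voxels[:, :50] + rng.normal(size=(n_scenes, 50))      # shares some structure with the ROI
    model_b = rng.normal(size=(n_scenes, 300))                          # unrelated control model

    print("model A vs. ROI scene space:", round(float(compare_spaces(scene_space(model_a), scene_space(roi_voxels))), 2))
    print("model B vs. ROI scene space:", round(float(compare_spaces(scene_space(model_b), scene_space(roi_voxels))), 2))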
We then correlated the scene space derived from fMRI with the scene spaces
derived from each of the artificial models and with the scene space derived from
behavior (Figure 15.3(c)). These analyses revealed that the neural patterns of
activity within scene-selective ROIs are most strongly correlated with the SUN
and NEIL models. Interestingly, NEIL was more effective in accounting for
neural response patterns than were behavioral judgments of similarity, suggesting
that NEIL is tapping into neural structures to which observers do not have
conscious access. Consistent with our expectations, the SIFT model, which
captures lower-level features, accounted for more pattern variance in the early
visual control regions as compared with the scene-selective regions of interest;
in contrast, for the DLPFC, all models performed about equally well (Aminoff
et al., 2015). Next, we used a hierarchical regression to examine whether NEIL
accounted for unique variance over and above the SUN model and any of several
low-level visual models. These included GIST, which captures spatial frequency
patterns in scenes; as well as HOG, RGB SIFT, Hue SIFT, SSIM, and Hue
Histogram (Aminoff et al., 2015). Finally, we included an artificial model (GEOM)
that divides scenes into probability maps of likely scene sections, such as the

[FIGURE 15.3 graphic: (a) brain regions of interest: Transverse Occipital Sulcus (TOS), Retrosplenial Complex (RSC), and Parahippocampal Place Area (PPA); (b) schematic of a scenes-by-voxels (or scenes-by-features) feature space converted into a scenes-by-scenes similarity matrix ("scene space"); (c) bar plot, "Correlations between Scene Spaces," showing correlations with the NEIL, SUN, GIST, GEOM, SIFT, and low-level models and with behavior for LH/RH PPA, RSC, and TOS, early visual cortex, and DLPFC; (d) bar plot, "Hierarchical Regression for each ROI"; (e) table, "Significant Changes in R," reproduced below.]

    Change in R    Low-level  GIST   GEOM   SUN    NEIL   Behavior
    LH PPA         0.312      0.004  0.002  0.031  0.007  0
    RH PPA         0.359      0.002  0.005  0.045  0.015  0
    LH RSC         0.218      0.033  0.01   0.004  0      0.003
    RH RSC         0.179      0.024  0.011  0.015  0      0.001
    LH TOS         0.176      0.003  0.007  0.032  0.023  0
    RH TOS         0.193      0.005  0.002  0.013  0.012  0.002
    Early Vis      0.384      0.003  0.01   0.006  0.065  0
    DLPFC          0.229      0      0.005  0.011  0.002  0.006
FIGURE 15.3 fMRI results. (a) Regions of interest in scene-selective cortex—PPA, RSC, and TOS. (b) Activity within an ROI varies across voxels. We create a feature space using fMRI data with the responses from each voxel within an ROI for each scene. This feature space is then cross-correlated to create a similarity, or correlation, matrix that represents the “scene space” for that set of data. For the computer vision models, the feature space consists of the different features of the model instead of voxels, as illustrated here. (c) Correlations of the fMRI similarity matrix with those of the different computer vision models and with behavior. As can be seen, NEIL does just about as well as SUN, whereas SIFT, a low-level visual model, does not do nearly as well. (d) A hierarchical regression was run to examine what unique variance can be accounted for by NEIL. The order of blocks was 1—all low-level models, 2—GIST (a low-level model of spatial frequency that has been shown to be important in scene perception), 3—GEOM (a probability map of scene sections), 4—SUN, 5—NEIL, and then 6—behavioral data. (e) A table representing the change in R with each sequential block. NEIL significantly accounted for unique variance, above all other computer vision models used, in the PPA and TOS (adapted from Aminoff et al., 2015).

location of the sky and ground, where these sections capture very broad features
that intuitively seem important for scene representation. Again, consistent with
our expectations, we observed that NEIL accounted for unique variance in the
responses seen within PPA and TOS; somewhat puzzling was the fact that NEIL
also accounted for unique variance in early visual control regions (Figure 15.3(d, e)).
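The unique-variance logic is that of a blockwise (hierarchical) regression: predictors are entered in a fixed order, and each block is credited with the increase in R² beyond the blocks already entered. The sketch below uses random feature blocks and ordinary least squares purely to illustrate the bookkeeping, not the specific predictors or fitting details of Aminoff et al. (2015).

    # Schematic hierarchical (blockwise) regression with random stand-in data.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(7)
    n_cells = 4950                               # e.g. the unique scene pairs of a 100 x 100 matrix

    blocks = {                                   # random stand-ins for each model's predictors
        "low-level models": rng.normal(size=(n_cells, 6)),
        "GIST":             rng.normal(size=(n_cells, 1)),
        "GEOM":             rng.normal(size=(n_cells, 1)),
        "SUN":              rng.normal(size=(n_cells, 2)),
        "NEIL":             rng.normal(size=(n_cells, 2)),
    }
    # Fabricated "neural scene space" with some variance tied to a few of the blocks.
    y = (0.3 * blocks["low-level models"][:, 0] + 0.5 * blocks["SUN"][:, 0]
         + 0.4 * blocks["NEIL"][:, 0] + rng.normal(size=n_cells))

    X, previous_r2 = None, 0.0
    for name, block in blocks.items():           # enter blocks in a fixed order
        X = block if X is None else np.hstack([X, block])
        r2 = LinearRegression().fit(X, y).score(X, y)
        print("+", name, "delta R^2 =", round(r2 - previous_r2, 3))
        previous_r2 = r2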
To be clear, we are not under the illusion that NEIL is a model of human
vision or that the features that emerge from NEIL are the ideal candidate features
for understanding human vision. At the same time, in terms of both inputs and
goals, NEIL and human vision share a great deal. As such, NEIL may serve as
a proxy model—a first step in elucidating a comprehensive account of how we
learn and represent visual information. To the extent that we find NEIL to be
effective in serving this role, we believe that much of its power lies in its use of
large-scale data—learning over millions of images. Data of this scale enables NEIL
to derive features and attributes from emergent statistical regularities that would
otherwise be unavailable to vision scientists. Thus, although our fMRI data is not
“big,” we are able to take advantage of Big Data approaches. In particular, here we
examined neural scene representation and found that NEIL significantly accounted
for variance in patterns of neural activity within scene-selective ROIs. NEIL’s
performance was equivalent or near-equivalent to another artificial model, SUN,
in which features were selected based on intuition (Figure 15.3(c)). Moreover,
NEIL was able to account for unique variance over and above all other artificial
vision models (Figure 15.3(d, e)).
One of the most important aspects of our results using NEIL is that they are
likely to be both scalable and generalizable. Hand-tuned models such as SUN are
only effective when the right features are chosen for a particular set of images.
When the set of images changes, for example, shifting from scenes to objects
or from one subset of scenes to another, models such as SUN may need to be
“reseeded” with new features. In contrast, NEIL learns and makes explicit features
and attributes that are likely to support the recognition of new classes of images.
As such, deploying Big Data artificial vision models such as NEIL or deep neural
networks moves us a step closer to developing successful models of human vision.

Conclusions
At present, both within and outside science, Big Data is, well . . . big. The
question is whether this degree of excitement is warranted—will heretofore
unavailable insights and significant advances in the study of human vision emerge
through novel applications of these new approaches? Or will vision scientists be
disappointed as the promise of these new methods dissipates without much in the
way of real progress (Figure 15.4)? Put another way, are we at the peak of inflated
expectations or the plateau of productivity (Figure 15.5)?

FIGURE 15.4 The Massachusetts Institute of Technology Project MAC Summer Vision
Project. An overly optimistic view of the difficulty of modeling human vision circa
1966. Oops.

It is our speculation that the application of Big Data to biological vision is more
likely at the peak of inflated expectations than at the plateau of productivity. At the
same time, we are somewhat optimistic that the trough will be shallow and that the
toolbox afforded by Big Data will have a significant and lasting impact on the study
of vision. In particular, Big Data has already engendered dramatic advances in our
ability to process and organize visual content and build high-performing artificial
vision systems. However, we contend that, as of 2015, Big Data has actually had
little direct impact on visual cognitive neuroscience. Rather, advances have come
from the application of large-scale content analysis to neural data. That is, given a
shortage of well-specified models of human vision, Big-Data models that capture
both the breadth and the structure of the visual world can serve to help increase our
understanding of how the brain represents and processes visual images. However,
even this sort of application is data-limited due to the many constraints imposed
by present-day neuroimaging methods: The dimensionality of the models being
applied is dramatically larger than the dimensionality of the currently available

[FIGURE 15.5 graphic: the hype cycle curve plots Visibility against Time, rising from the Technology Trigger to the Peak of Inflated Expectations, falling into the Trough of Disillusionment, and then climbing the Slope of Enlightenment toward the Plateau of Productivity.]
FIGURE 15.5 The Gartner Hype Cycle. Only time will tell whether Big Data is sitting
at the peak of inflated expectations or at the plateau of productivity. (Retrieved
October 19, 2015 from https://commons.wikimedia.org/wiki/File:Gartner_Hype_Cycle.svg).

neural data. As such, it behooves us, as vision scientists, to collect larger-scale
neural datasets that will provide a much denser sampling of brain responses across
a wide range of images. At the same time, any attempt to collect increased-scale
neural data is likely to continue to be constrained by current methods—what we
ultimately need are new neuroimaging techniques that enable fine-grained spatial
and temporal sampling of the human brain, thereby enabling genuinely Big Data
in visual neuroscience.

Acknowledgments
This work was supported by the National Science Foundation, award 1439237,
and by the Office of Naval Research, award MURI N000141010934.

Notes
1 A decade from now this definition may seem quaint and the concept of “big”
might be something on the order of 10⁹ samples. Of course, our hedge is
likely to be either grossly over-optimistic or horribly naive, with the actual
conception of big in 2025 being much more or less.
2 This limit is based on using state-of-the-art, 7T MRI scanners. However, most
institutions do not have access to such scanners. Moreover, high-field scanning
introduces additional constraints. In particular, many individuals suffer from
nausea, headaches, or visual phosphenes if they move too quickly within the
magnetic field. As a consequence, even with access to a high-field scanner,
most researchers choose to use lower-field, 3T systems where the minimum
voxel size is about 1.2 to 1.5 mm³.
3 Within the neuroimaging community, there has been a strong focus on
advancing the science by improving the spatial resolution of extant methods.
This has been particularly apparent in MRI, where bigger magnets, better
broadcast/antenna combinations, and innovations in scanning protocols have
yielded significant improvements in resolution.
4 At the extreme end of functional imaging, “resting state” or “functional
connectivity” MRI (Buckner et al., 2013) allows, through the absence of
any task, a higher rate of data collection. Using state-of-the-art scanners,
about 2,300 samples of a 100,000 voxel volume (3 mm³ voxels) can be
collected in one hour using a 700 ms sample rate (assuming no rest periods
or breaks). Other, non-functional, MRI methods may offer even higher
spatial sampling rates. For example, although less commonly employed as a
tool for understanding vision (but see Pyles, Verstynen, Schneider, & Tarr,
2013; Thomas et al., 2009), diffusion imaging may provide
as many as 650 samples per 2 mm³ voxel, where there are about 700,000
voxels in a brain volume. Moreover, because one is measuring connectivity
between these voxels, the total number of potential connections that could be
computed is 490,000,000,000. At the same time, as with most neuroimaging
measurements, there is a lack of independence between samples in structural
diffusion imaging and the high dimensionality of such data suggests complexity,
but not necessarily “Big Data” of the form that provides leverage into solving
heretofore difficult problems.

References
Agrawal, P., Stansbury, D., Malik, J., & Gallant, J. L. (2014). Pixels to voxels: modeling visual
representation in the human brain. arXiv E-prints, arXiv: 1407.5104v1 [q-bio.NC].
Retrieved from http://arxiv.org/abs/1407.5104v1.
Aminoff, E. M., Toneva, M., Shrivastava, A., Chen, X., Misra, I., Gupta, A., & Tarr,
M. J. (2015). Applying artificial vision models to human scene understanding. Frontiers in
Computational Neuroscience, 9(8), 1–14. doi: 10.3389/fncom.2015.00008.
Anderson, J. R. (1993). Rules of the mind. Hillsdale, NJ: Erlbaum.
Barenholtz, E., & Tarr, M. J. (2007). Reconsidering the role of structure in vision. In
A. Markman & B. Ross (Eds.), Categories in use (Vol. 47, pp. 157–180). San Diego, CA:
Academic Press.
Buckner, R. L., Krienen, F. M., & Yeo, B. T. (2013). Opportunities and limitations
of intrinsic functional connectivity MRI. Nature Neuroscience, 16(7), 832–837. doi:
10.1038/nn.3423.
Chen, X., Shrivastava, A., & Gupta, A. (2013). NEIL: Extracting visual knowledge from
web data. In Proceedings of the International Conference on Computer Vision (ICCV). Sydney:
IEEE.
Clifford, S. (2008). Finding fame with a prescient call for Obama. New York Times (online,
November 9). Retrieved from www.nytimes.com/2008/11/10/business/media/10silver.
html.
Deng, J., Russakovsky, O., Krause, J., Bernstein, M. S., Berg, A., & Fei-Fei, L. (2014).
Scalable multi-label annotation. In CHI ’14: Proceedings of the SIGCHI conference on human
factors in computing systems (pp. 3099–3102). ACM. doi:10.1145/2556288.2557011.
Fodor, J. A. (1983). Modularity of mind. Cambridge, MA: MIT Press.
Gauthier, I., Tarr, M. J., Anderson, A. W., Skudlarski, P., & Gore, J. C. (1999). Activation
of the middle fusiform ‘face area’ increases with expertise in recognizing novel objects.
Nature Neuroscience, 2(6), 568–573. doi: 10.1038/9224.
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L.
(2009). Detecting influenza epidemics using search engine query data. Nature, 457(7232),
1012–1014. doi:10.1038/nature07634.
Haxby, J. V., Gobbini, M. I., Furey, M. L., Ishai, A., Schouten, J. L., & Pietrini, P. (2001).
Distributed and overlapping representations of faces and objects in ventral temporal
cortex. Science, 293(5539), 2425–2430. doi:10.1126/science.1063736.
Huth, A. G., Nishimoto, S., Vu, A. T., & Gallant, J. L. (2012). A continuous semantic
space describes the representation of thousands of objects and action categories across the
human brain. Neuron, 76(6), 1210–1224. doi: 10.1016/j.neuron.2012.10.014.
Kanwisher, N., McDermott, J., & Chun, M. M. (1997). The fusiform face area: A module
in human extrastriate cortex specialized for face perception. Journal of Neuroscience, 17(11),
4302–4311.
Khaligh-Razavi, S. M., & Kriegeskorte, N. (2014). Deep supervised, but not unsupervised,
models may explain IT cortical representation. PLoS Computational Biology, 10(11),
e1003915. doi: 10.1371/journal.pcbi.1003915.
Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept
learning through probabilistic program induction. Science, 350(6266), 1332–1338. doi:
10.1126/science.aab3050.
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google Flu:
Traps in Big Data analysis. Science, 343(6176), 1203–1205. doi: 10.1126/science.1248506.
Le, Q. V., Ranzato, M. A., Monga, R., Devin, M., Chen, K., Corrado, G. S., . . . Ng,
A. Y. (2012). Building high-level features using large scale unsupervised learning. In
International Conference on Machine Learning.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
doi: 10.1038/nature14539.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM,
38(11), 39–41.
Mitchell, T., Cohen, W., Hruschka, E., Talukdar, P., Betteridge, J., Carlson, A.,
. . . Welling, J. (2015). Never-ending learning. In Proceedings of the Twenty-Ninth AAAI
Conference on Artificial Intelligence, AAAI.
Mumford, J. A., Turner, B. O., Ashby, F. G., & Poldrack, R. A. (2012). Deconvolving
BOLD activation in event-related designs for multivoxel pattern classification analyses.
Neuroimage, 59(3), 2636–2643. doi: 10.1016/j.neuroimage.2011.08.076.
Norman, K. A., Polyn. S. M., Detre, G. J., & Haxby, J. V. (2006). Beyond mind-reading:
Multi-voxel pattern analysis of fMRI data. Trends in Cognitive Science, 10(9), 424–430.
doi: 10.1016/j.tics.2006.07.005.
Oxford English Dictionary. (2016). Oxford University Press. Retrieved from www.oed.com/view/Entry/18833.
Patterson, G., Xu, C., Su, H., & Hays, J. (2014). The SUN attribute database:
Beyond categories for deeper scene understanding. International Journal of Computer Vision,
108(1–2), 59–81.
Pinker, S. (1999). Words and rules: The ingredients of language (pp. xi, 348). New York, NY:
Basic Books Inc.
Pyles, J. A., Verstynen, T. D., Schneider, W., & Tarr, M. J. (2013). Explicating
the face perception network with white-matter connectivity. PLoS One, 8(4). doi:
10.1371/journal.pone.0061611.
Rosenberg, C., & Image Search Team. (2013). Improving photo search: A step
across the semantic gap. Google research blog. Retrieved from http://googleresearch.
blogspot.com/2013/06/improving-photo-search-step-across.html.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., . . . Fei-Fei, L. (2015).
ImageNet large scale visual recognition challenge. International Journal of Computer Vision,
115(3), 211–252. doi: 10.1007/s11263-015-0816-y.
Sawchik, T. (2015). Big Data baseball: Math, miracles, and the end of a 20-year losing streak.
New York, NY: Flatiron Books.
Sereno, M. I., & Allman, J. M. (1991). Cortical visual areas in mammals. In A. G. Leventhal
(Ed.), The Neural Basis of Visual Function (pp. 160–172). London: Macmillan.
Stansbury, D. E., Naselaris, T., & Gallant, J. L. (2013). Natural scene statistics account for
the representation of scene categories in human visual cortex. Neuron, 79(5), 1025–1034.
doi: 10.1016/j.neuron.2013.06.034.
Thomas, C., Avidan, G., Humphreys, K., Jung, K. J., Gao, F., & Behrmann, M. (2009).
Reduced structural connectivity in ventral visual cortex in congenital prosopagnosia.
Nature Neuroscience, 12(1), 29–31. doi: 10.1038/nn.2224.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system.
Nature, 381(6582), 520–522. doi: 10.1038/381520a0.
Ullman, S., Vidal-Naquet, M., & Sali, E. (2002). Visual features of intermediate complexity
and their use in classification. Nature Neuroscience, 5(7), 682–687.
Wallis, G. (1996). Using spatio-temporal correlations to learn invariant object recognition.
Neural Networks, 9(9), 1513–1519.
Woloszyn, L., & Sheinberg, D. L. (2012). Effects of long-term visual experience on
responses of distinct classes of single units in inferior temporal cortex. Neuron, 74(1),
193–205. doi: 10.1016/j.neuron.2012.01.032.
Yamins, D. L. K., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J.
(2014). Performance-optimized hierarchical models predict neural responses in higher
visual cortex. Proceedings of the National Academy of Sciences of the United States of America,
111(23), 8619–8624.
Yang, Y., Tarr, M. J., & Kass, R. E. (2014). Estimating learning effects: A short-time
Fourier transform regression model for MEG source localization. In Springer lecture notes
on artificial intelligence: MLINI 2014: Machine learning and interpretation in neuroimaging.
New York: Springer.
INDEX

Bold page numbers indicate figures, italic numbers indicate tables.


ACT-R, 40–1 Ames, M., 121–2, 148
Airport Scanner, 8 Anderson, J. R., 40–1
alignment in web-based dialogue Andrews, S., 206
adaptation as window into the mind, Aslin, R. N., 7
261–2 association networks, 178–9
alignment defined, 249 associative chain theory, 321
analysis of data, 258, 258–61, 259, 260 associative priming, 187
and attention, 255 attention economies, 273
Big Data, 246, 247, 248, 263–4 attentional control, 207–8, 210, 211,
Cancer Survivors Network dataset, 218–19, 219–20
256–7, 259, 260, 260 automatic spreading activation, 204
computational psycholinguistics, 248, average conditional information (ACI), 96,
261–3 96
data and methods for study, 256–7 average unigram information (AUI), 96, 96
as driver of dialogue, 249 Aylett, M. P., 94
information-theoretic models of text and
dialogue, 263 Balota, D. A., 206, 207
integrated models of language Barsalou, L., 154, 155–6
acquisition, 262–3 Bayesian methods
integration of psycholinguistics and Bayes factor, 19
cognitive modeling, 247–9 Big Data applications of, 21–2, 30
issues in study, 260–1 combining cognitive models, 28–9
as mechanistic, 253–4 conjugate priors, 21, 26, 29
memory-based explanation, 255 frequentist comparison, 14–16
priming, 249–54 LATER (linear approach to threshold
Reddit web forum dataset, 256–7, 258, with ergodic data) model, 23–8, 24,
258–9, 259 27, 28
research questions, 257–8 MindCrowd Project, 22–3
social modulation of alignment, 254–6 posterior distribution, 16–19, 18
Amazon Mechanical Turk, 331 principles of, 16–17
Index 365

prior distribution, 17 factors influencing, 272


sequential updating, 19–21 Flynn effect, 278
stationary assumption, 29 future research, 290
stochastic approaches, 18–19 information crowding, 273–5, 275
structural approaches, 18 information markets, 273
Bayesian teaching linguistic niche hypothesis, 272–3
complexity in Baysian statistics, 67–9 morphological complexity, 272
data selection and learning, 66–7 noise, impact of on, 274–5
Dirichlet process Gaussian mixture population density, 287–9
model (DPGMM), 73, 81 semantic bleaching, 279–82, 283
Gaussian category models, 71–80, 81–2 surface complexity, 275–7
importance sampling, 68 as systematic, 271
infant-directed speech, 71–80, 76, 77, types of change, 272
79 word length, 283–4, 284
likelihood, 69 classical school of statistical inference,
Metropolis-Hastings algorithm, 68–9 14–16
Monte Carlo approximation, advances Cleeremans, A., 327
in, 69–71 Cleland, A. A., 255
natural scene categories, 80–5, 85, 86 cognitive development, Big Data use in, 7
orientation distribution in the natural cognitive science
world, 80–5, 85, 86 Big Data, use of in, 4–6
pseudo-marginal MCMC, 70–1, 84 research, changes in due to Big Data,
simulated annealing, 83 6–8
BD2K (Big Data to Knowledge) Project, 4 collaborative filtering, 42–4, 43
BEAGLE model, 230–1 collaborative tagging, memory cue
Beaudoin, J., 148, 155 hypothesis in
Beckner, C., 92 academic interest in, 120
Benjamin, A. S., 124 analytic approaches, 128–9
Big Data application of, 118
definitions, 2–4 audience for tagging, 122
evolution of application of, 344 Big Data issues, 135–6, 139
expansion of, 1–2 causal analyses, 136–8, 138
intertwining with theory, 8–9 clustering, 136–7
theory evaluation using, 2 cued recall research, 122–4
use of in cognitive science, 4–6 dataset, 125–7, 127
Bock, J. K., 250 definition, 119
Bodner, G. E., 204 entropy, tag, 133–6, 134
Bolognesi, M., 149 evidence for motivation, lack of, 118,
Botvinick, M. M., 323 125
Branigan, H. P., 255 folksonomies, 119–20
future listening to tagged item, 127,
Cancer Survivors Network dataset, 256–7, 136–8, 138
259, 260, 260 hypotheses, 127–8
change in language information theoretic analyses, 132–6,
age of acquisition, 285–6 134
attention economies, 273 Last.fm, 125–6
Big Data, 271 motivation for tagging, 120–2
competition, influence of, 290 purpose of tagging, 117
complexity, surface v conceptual, 275–7 recommendation systems, 139
conceptual length, 275–7 research question, 125
concreteness, 277–9, 280, 281 retrieval function, 120–1
data analysed, 278–9 specificity, tag, 127, 132–6, 134
366 Index

collaborative tagging, memory cue hypothesis in (cont.)
  time series analysis, 128–32, 130, 131, 132, 133
  Web 2.0, impact of, 118
  see also Flickr Distributional Tagspace (FDT)
Collins, A. M., 178, 179
complexity in Bayesian statistics, 67–9
computational psycholinguistics, 248, 261–3
  see also alignment in web-based dialogue
conceptual length, 275–7
concreteness in language
  age of acquisition, 285–6
  conceptual efficiency, 277–8
  data analysed, 278–9
  future research, 290
  learner-centred change, 282–7, 284, 285, 286, 287
  population density, 287–9, 288
  rise of in American English, 278–9
  word length, 283–4, 284
conditional informational variability (CIV), 96, 96–7
content tagging, memory cue hypothesis in collaborative, 119
  academic interest in, 120
  analytic approaches, 128–9
  application of, 118
  audience for tagging, 122
  Big Data issues, 135–6, 139
  causal analyses, 136–8, 138
  clustering, 136–7
  cued recall research, 122–4
  dataset, 125–7, 127
  entropy, tag, 133–6, 134
  evidence for motivation, lack of, 118, 125
  folksonomies, 119–20
  future listening to tagged item, 127, 136–8, 138
  hypotheses, 127–8
  information theoretic analyses, 132–6, 134
  Last.fm, 125–6
  motivation for tagging, 120–2
  purpose of tagging, 117
  recommendation systems, 139
  retrieval function, 120–1
  specificity, tag, 127, 132–6, 134
  time series analysis, 128–32, 130, 131, 132, 133
  Web 2.0, impact of, 118
  see also Flickr Distributional Tagspace (FDT)
convolutional neural networks, 353–4
Cooper, R. P., 322
creativity, 183
crowd-sourced data
  LATER (linear approach to threshold with ergodic data) model, 23–8, 24, 27, 28
  MindCrowd Project, 22–3
crowding hypothesis, 273–5, 275
  age of acquisition, 285–6
  attention economies, 273
  conceptual length, 275–7
  concreteness, 277–9, 280, 281
  data analysed, 278–9
  Flynn effect, 278
  future research, 290
  information markets, 273
  noise, impact of, 274–5
  population density, 287–9, 288
  reduction in surface complexity, 282–7, 284, 285, 286, 287
  semantic bleaching, 279–82, 283
  word length, 283–4, 284
cued recall, 188
  see also memory cue hypothesis in collaborative tagging
Dale, R., 93
Danescu-Niculescu-Mizil, C., 254–5
data analysis, Big Data and, 344
data capture, Big Data use in, 7–8
data mining
  collaborative filtering, 42–4, 43
  see also statistical inference
Davis, E., 4
decision by sampling theory (DbS)
  assumptions of, 298–9
  Big Data, 299–312, 301, 304, 307, 310, 314
  causality, 312–13
  coincidence, 313
  comparison values in memory, use of in, 296–8, 310–11
  human lives, subjective value of, 303–6, 304, 311–12
  monetary gains and losses, 299–303, 301, 311
  time delays, perception of, 309–10, 310
  and utility-based/absent theories, 294–6, 298
  weighting of probabilities, 306–8, 307, 311
deep neural networks, 351, 353, 355, 358
distributional semantics
  Big Data, 144–5
  folksonomies, 145
  hybrid models, 145–7
  image-based information, 146–7
  models, 145–7
  see also Flickr Distributional Tagspace (FDT)
dual-coding theory, 277
e-commerce recommender systems, 42–3
Earhard, M., 123, 124
electroencephalography (EEG), 349
entropy, tag, 133–6, 134
Estes, W. K., 5–6
false memories, 188
Flickr Distributional Tagspace (FDT)
  categorization of concepts, 160–8, 161, 162, 163, 165, 166, 167, 168
  classification of tags, 148
  cluster analysis, 153–4, 164, 168, 168
  colour terms, distribution across tags, 149–50
  dataset, 150–3, 152
  distributional models, 145–7
  as distributional semantic space, 147, 169
  environment, 148–9
  human-generated features comparison, 153–9, 158
  implementing, 150–3, 152
  McRae’s features norms comparison, 153–9, 158, 160
  minimum variance clustering, 153–4
  motivation for tagging, 148–9
  Ward method, 153–4
  WordNet comparison, 159, 159–60
Flynn, J. R., 278, 290
folksonomies, 119–20, 145
forgetting, 37–8
  case study, 44–9, 46, 47, 48, 58, 59–60
frequentist school of statistical inference, 14–16
functional magnetic resonance imaging (fMRI), 348
Gallant, J. L., 352
Gaussian category models, 71–80, 81–2
Golder, S. A., 120
Goodman, N. D., 66
Google Photos, 344, 345
Granger causality, 136
Griffiths, T. L., 193
Gupta, M., 121
Hammitt, J. K., 305
Han, J., 121
Heckner, M., 121, 148
Heileman, M., 148
hierarchical control theory, 322
hierarchical modular optimization, 353
Hinton, G. E., 262–3
Hotho, A., 120
Huberman, B. A., 120
Human Connectome Project, 3
human lives, subjective value of, 303–6, 304, 311–12
Hutchison, K. A., 205, 208, 213, 218, 219
Huth, A. G., 352
image analysis, 344, 345
implicit memory literature, 230
importance sampling, 68
infant development, Big Data use in, 7
infant-directed speech, 71–80, 76, 77, 79
information crowding, 273–5, 275
  age of acquisition, 285–6
  attention economies, 273
  conceptual length, 275–7
  concreteness, 277–9, 280, 281
  data analysed, 278–9
  Flynn effect, 278
  future research, 290
  information markets, 273
  noise, impact of, 274–5
  population density, 287–9, 288
  reduction in surface complexity, 282–7, 284, 285, 286, 287
  semantic bleaching, 279–82, 283
  word length, 283–4, 284
information markets, 273
information theory, 93–4
instance theories of memory, 328, 329
intersession intervals, 38
item-response theory (IRT), 43–4
Jaeger, F. T., 94
Jäschke, R., 120
Jones, M., 156
knowledge state, 35–7, 37
Körner, C., 122
Kuhl, P. K., 71
language, social network influences on
  adaptivity and complexity of language, 111–12
  average conditional information (ACI), 96, 96
  average unigram information (AUI), 96, 96
  bias in analysis, 101
  Big Data, and analysis of, 92
  Big Data approach, benefits of, 109–11, 112
  complex measures, 107–9, 110
  conditional informational variability (CIV), 96, 96–7
  connectivity, 110–11
  expectations and predictions from using Big Data approach, 102–3
  Gini coefficient, 101–2
  individuals, impact on networks, 111
  influences changing, 93
  information theory, 93–4
  linguistic measures, 95–7, 96
  network measures, 98, 99–100, 100, 101
  network view of the mental lexicon, 182–3
  random review baseline, 102
  reviewer-internal entropy (RI-Ent), 95–6, 96
  sample social networks, 97, 97–8
  simple measures, 103–7, 104, 105, 106
  social-network structure, 94
  study aims and method, 95
  total of individuals reviews, analysis of, 100–1
  uniform information density, theory of, 94
  unigram informational variability (UIV), 96, 96–7
  variability in, explanations for, 91–2
  see also change in language
Lashley, Karl, 320
Last.fm, 125–6
latent semantic analysis (LSA), 177
LATER (linear approach to threshold with ergodic data) model, 23–8, 24, 27, 28
learning, see memory retention
Lee, L., 254–5
length of words, 283–4
lexical quality hypothesis, 208, 220
Li, R., 121
Liberman, M., 9
linear approach to threshold with ergodic data (LATER) model, 23–8, 24, 27, 28
linguistic alignment, see alignment in web-based dialogue
linguistic labels, see semantic labels
linguistic niche hypothesis, 272–3
linguistic variability, see change in language; language, social network influences on
Loftus, E. F., 178, 179
Lupyan, G., 93
machine learning approaches, 9
magnetic resonance imaging (MRI), 348
magnetoencephalography (MEG), 349
Marcus, G., 4
Marlow, C., 148
Masson, M. E. J., 204
McClelland, J. L., 229, 230–1, 235
McLean, J. F., 255
MCM (Multiscale Context Model), 41–2
McRae’s features norms, 158
mediated priming, 187
memory
  instance theories of, 328–30, 329
  see also forgetting; memory cue hypothesis in collaborative tagging; memory retention; mental lexicon, network view of
memory cue hypothesis in collaborative tagging
  academic interest in, 120
  analytic approaches, 128–9
  application of, 118
  audience for tagging, 122
  Big Data issues, 135–6, 139
  causal analyses, 136–8, 138
  clustering, 136–7
  cued recall research, 122–4
  dataset, 125–7, 127
  definition, 119
  entropy, tag, 133–6, 134
  evidence for motivation, lack of, 118, 125
  folksonomies, 119–20
  future listening to tagged item, 127, 136–8, 138
  hypotheses, 127–8
  information theoretic analyses, 132–6, 134
  Last.fm, 125–6
  motivation for tagging, 120–2
  purpose of tagging, 117
  recommendation systems, 139
  research question, 125
  retrieval function, 120–1
  specificity, tag, 127, 132–6, 134
  time series analysis, 128–32, 130, 131, 132, 133
  Web 2.0, impact of, 118
memory retention
  ACT-R, 40–1
  collaborative filtering, 42–4, 43
  computational models, 40–2
  and e-commerce recommender systems, 42–3
  forgetting, 37–8, 44–9, 46, 47, 48, 58, 59–60
  human-memory phenomena, 37–40, 39
  integration of psychological theory with Big Data methods, 44–60, 46, 47, 48, 54, 55, 57
  intersession intervals, 38–9, 39
  item-response theory (IRT), 43–4
  knowledge state, 35–7, 37
  machine learning, 42, 43
  MCM (Multiscale Context Model), 41–2
  network view of the mental lexicon, 188–9
  personalized review, 49–58, 54, 55, 57
  3PL/2PL models, 44
  psychological theories, 37–42, 39
  retention intervals, 38, 39
  simulation methodology, 58, 59–60
  spacing effects, 38–42, 39
  strengths of theory/data-driven approaches, 58
mental lexicon, network view of
  association networks, 178–81
  centrality measures, 189–91
  challenges of using, 193–5
  clinical populations, structural differences in, 183–4
  clinical populations, study of, 193
  clusters of words, 184–5, 186
  corpus-based methods, 177–8
  and creativity, 183
  dictionary metaphor, 175
  directionality of network, 187
  individuals, networks of, 193
  insights from, 181–91
  language development, 182–3
  macroscopic level insights, 181–4, 191–2
  memory retrieval and search, 188–9
  mesoscopic level insights, 184–9, 186, 191–2
  microscopic level insights, 189–90, 191–2
  multilevel network view, 191–2
  node centrality, 189–90
  priming, 187–8
  relatedness of words, 185–7
  representation of semantic similarity, 179–80, 180
  research into, 175–6
  rigid/chaotic structure, 192
  small scale studies, 175–6
  small world structure, 181–2
  specific groups, use with, 192–3
  spreading activation, 180–1
  thematic organization of the mental lexicon, 184–5, 186
  thesaurus model, 176–7
  WordNet, 176–7
Metropolis-Hastings algorithm, 68–9
Miller, G. A., 5–6
MindCrowd Project, 22–3
Mitroff, S. R., 8
Monte Carlo approximation, advances in, 69–71
Morais, A. S., 193
morphological complexity, 272
Multiscale Context Model (MCM), 41–2
multivariate pattern analysis (MVPA), 345
n-gram analysis, 331
Naaman, M., 121–2, 148
natural scene categories, 80–5, 85, 86
NEIL (never ending image learner), 355–7, 357
network measures, 98, 99–100, 100, 101
network science, 9
network view of the mental lexicon
  association networks, 178–81
network view of the mental lexicon (cont.)
  centrality measures, 189–91
  challenges of using, 193–5
  clinical populations, structural differences in, 183–4
  clinical populations, study of, 193
  clusters of words, 184–5, 186
  corpus-based methods, 177–8
  and creativity, 183
  dictionary metaphor, 175
  directionality of network, 187
  individuals, networks of, 193
  insights from, 181–91
  language development, 182–3
  latent semantic analysis, 177
  macroscopic level insights, 181–4, 191–2
  memory retrieval and search, 188–9
  mesoscopic level insights, 184–9, 186, 191–2
  microscopic level insights, 189–90, 191–2
  multilevel network view, 191–2
  priming, 187–8
  relatedness of words, 185–7
  representation of semantic similarity, 179–80, 180
  research into, 175–6
  rigid/chaotic structure, 192
  small scale studies, 175–6
  small world structure, 181–2
  specific groups, use with, 192–3
  spreading activation, 180–1
  thematic organization of the mental lexicon, 184–5, 186
  thesaurus model, 176–7
  WordNet, 176–7
neuroimaging methods, 348–50
Norman, D. A., 324, 338–9
Nov, O., 148
Olivola, C. Y., 303, 305–6, 311–12
orientation distribution in the natural world, 80–5, 85, 86
Osindero, S., 262–3
Paivio, A., 277
past experience and decision-making, see decision by sampling theory (DbS)
Pavlik, P. I., 40–1
personalized review, 49–58, 54, 55, 57
phonemes, teaching, 71–2
Pickering, M. J., 255
Pinker, S., 351
2PL/3PL models, 44
Plaut, D. C., 323
population density, 287–9, 288
posterior distribution, 16–19, 18
priming
  alignment in web-based dialogue, 249–54, 252
  network view of the mental lexicon, 187–8
  see also semantic priming
prior distribution, 17
probabilities, weighting of, 306–8, 307, 311
Project Gutenberg, 331–2
prospect theory, 300, 306–7
pseudo-marginal MCMC, 70–1, 84
psychological theories of memory retention, 37–42, 39
  integration with Big Data methods, 44–60, 46, 47, 48, 54, 55, 57, 58
Ratcliff, R., 206
reaction time, LATER model, 23–8, 24, 27, 28
Recchia, G., 156
Reddit web forum dataset, 256–7, 258, 258–9, 259
Reitter, D., 262
relatedness of words, 185–7
remote association/triad tasks, 188–9
research, changes in due to Big Data, 6–8
response-scheduling process, 338–9, 339
retention intervals, 38, 39
review, and memory retention, 49–58, 54, 55, 57
reviewer-internal entropy (RI-Ent), 95–6, 96
Rheinberger, C. M., 305
Robertson, S., 147
Rogers, T. T., 230–1, 235
Rumelhart, D. E., 324, 338
Sagara, N., 303, 305–6, 311–12
Schmitz, C., 120
semantic bleaching, 279–82, 283
semantic cognition theory, 230
semantic labels
  Big Data, 144–5
  distributional models, 145–7
  folksonomies, 145
  hybrid models, 145–7
  image-based information, 146–7
  see also Flickr Distributional Tagspace (FDT)
semantic priming
  analysis of data, 211–17, 213, 214, 215, 215, 216
  attentional control, 207–8, 210, 211, 218–19, 219–20
  automatic spreading activation, 204
  dataset for study, 210–11
  expectancy, 204
  importance of, 203–4
  individual differences in, 206–8, 210, 214, 214–17, 215, 215, 216, 219–20
  individuals/groups, 204–5
  isolated v prime, 220–1
  lexical quality hypothesis, 208, 220
  limitations of study, 222
  megastudy data, 205
  memory-recruitment, 204
  methodology for study, 210–11
  network view of the mental lexicon, 187–8
  present study, 209–10
  reliability of, 208–9, 210, 212, 212–14, 213, 217–19
  research questions, 205–6
  semantic matching, 204
  semantic priming project (SPP), 205, 210
  stability of effects, 205, 208–9, 210, 212, 212–14, 213, 217–19
  uniformity assumption, 205
  vocabulary knowledge, 206–7, 210, 211
  weakness at longer SOA, 221
sequential Bayesian updating
  combining cognitive models, 19–21, 28–9
  Big Data applications, 21–2
  LATER (linear approach to threshold with ergodic data) model, 23–8, 24, 27, 28
  MindCrowd Project, 22–3
  conjugate priors, 21, 26, 29
  stationary assumption, 29
serial ordering processes, 320–2
serial recurrent network predictions, 328
Shafto, P., 66
Shallice, T., 322
Shipstead, Z., 219
Sibley, D. E., 206
Simon, H. A., 273
simplification assumption of cognitive modeling
  analysis of a small world, 230–1, 242
  appeal of, 227–8, 228
  BEAGLE model, 230–1
  Big Data analysis of small-world structure, 236–42, 239, 240, 241
  Big Data approach comparison, 228–9, 242–3
  clarity as central to, 229
  complexity, informational, 243
  complexity/clarity tradeoff, 229
  implicit memory literature, 230
  learning a small world, 231–6, 233, 234, 243
  semantic cognition theory, 230
simulated annealing, 83
small-world approach, see simplification assumption of cognitive modeling
social modulation of alignment, 254–6
social-network influences on language use
  adaptivity and complexity of language, 111–12
  average conditional information (ACI), 96, 96
  average unigram information (AUI), 96, 96
  behaviour of language users, 94
  bias in analysis, 101
  Big Data approach, benefits of, 109–11, 112
  complex measures, 107–9, 110
  conditional informational variability (CIV), 96, 96–7
  connectivity, 110–11
  Gini coefficient, 101–2
  individuals, impact on networks, 111
  linguistic measures, 95–7, 96
  network measures, 98, 99–100, 100, 101
  random review baseline, 102
  reviewer-internal entropy (RI-Ent), 95–6, 96
  sample networks, 97, 97–8
  simple measures, 103–7, 104, 105, 106
  study aims and method, 95
  total of individuals reviews, analysis of, 100–1
  unigram informational variability (UIV), 96, 96–7
spacing effects, 38–42, 39
specificity, tag, 127, 132–6, 134
spreading activation, 204
statistical inference
  Bayes factor, 19
  Bayesian school, compared to frequentist, 14–16
  Big Data applications of sequential updating, 21–2
  combining cognitive models, 28–9
  conjugate priors, 21, 26, 29
  frequentist school, 14–16
  LATER (linear approach to threshold with ergodic data) model, 23–8, 24, 27, 28
  MindCrowd Project, 22–3
  model-driven, 14
  posterior distribution, 16–19, 18
  principles of Bayesian methods, 16–17, 18
  prior distribution, 17
  sequential Bayesian updating, 19–21
  stationary assumption, 29
  stochastic approaches, 18–19
  structural approaches, 18
Stewart, N., 300, 302, 307–8, 309–10, 311
Stolz, J. A., 209, 210, 214, 217
Stumme, G., 120
syntactic priming, and alignment in web-based dialogue, 249–52, 252
tagging, memory cue hypothesis in collaborative
  academic interest in, 120
  analytic approaches, 128–9
  application of, 118
  audience for tagging, 122
  Big Data issues, 135–6, 139
  causal analyses, 136–8, 138
  clustering, 136–7
  cued recall research, 122–4
  dataset, 125–7, 127
  definition, 119
  entropy, tag, 133–6, 134
  evidence for motivation, lack of, 118, 125
  folksonomies, 119–20
  future listening to tagged item, 127, 136–8, 138
  hypotheses, 127–8
  information theoretic analyses, 132–6, 134
  Last.fm, 125–6
  motivation for tagging, 120–2
  recommendation systems, 139
  research question, 125
  retrieval function, 120–1
  specificity, tag, 127, 132–6, 134
  time series analysis, 128–32, 130, 131, 132, 133
  Web 2.0, impact of, 118
  see also Flickr Distributional Tagspace (FDT)
teaching, Bayesian
  complexity in Bayesian statistics, 67–9
  data selection and learning, 66–7
  Gaussian category models, 71–80, 81–2
  importance sampling, 68
  infant-directed speech, 71–80, 76, 77, 79
  likelihood, 69
  Metropolis-Hastings algorithm, 68–9
  Monte Carlo approximation, advances in, 69–71
  natural scene categories, 80–5, 85, 86
  orientation distribution in the natural world, 80–5, 85, 86
  pseudo-marginal MCMC, 70–1, 84
  simulated annealing, 83
Teh, Y.-W., 262–3
theory
  evaluation of using big data, 2
  intertwining with methods, 8–9
time delays, perception of, 309–10, 310
TOTE (test, operate, test, exit), 322
toy model approach, see simplification assumption of cognitive modeling
Tse, C.-S., 207
Tullis, J. G., 124
typing performance
  Amazon Mechanical Turk, 331
  associative chain theory, 321
  Big Data, 330–6, 335, 336, 339
  data sources, 331
  emergence v. modularism, 323
  expertise and sensitivity to sequential structure, 334–5, 335
  hierarchical control and skilled typing, 323–4
  hierarchical control theory, 322
  inner loop development, 324–5
  instance theories of memory, 328–30, 329
  inter-keystroke interval times, data on, 333
  letter frequency, 325–7, 326
  methodology, 332–3
  n-gram analysis, 331
  Project Gutenberg, 331–2
  response-scheduling process, 338–9
  sensitivity to sequential structure, 327–30, 329, 335, 336
  serial ordering processes, 320–2
  serial recurrent network predictions, 327, 367–8
  skill development, 325
  speed improvement, 325–30, 326, 329
  testing predictions, 330
  unfamiliar letter strings and sensitivity to sequential structure, 335, 336, 337
uniform information density, theory of, 94
uniformity assumption, 205
unigram informational variability (UIV), 96, 96–7
variety
  as characteristic of Big Data, 3
  sequential Bayesian updating, 22
velocity
  as characteristic of Big Data, 3
  sequential Bayesian updating, 22
vision, human
  artificial vision models, 354–5, 357, 358
  Big Data, use of, 343–5, 345
  content analysis, Big Data and, 343–4, 345
  data analysis, Big Data and, 345
  deep neural networks, 351, 353–5
  defining Big Data for, 346–51, 347
  effectiveness of Big Data, 351–8, 357
  experiential power, limitations in, 349–50
  future regarding Big Data, 358–60, 359, 360
  hierarchical modular optimization, 353–4
  image analysis, Big Data and, 344–5, 345
  mid-level features, 354–5
  multivariate pattern analysis (MVPA), 345
  NEIL (never ending image learner), 355–6, 357, 358
  neuroimaging methods, 348–50
  as non-independent, 346–7
  proxy models, 346, 353
  variation in data, 347
visual attention, Big Data use in, 8
vocabulary knowledge, 206–7, 210, 211
Vojnovic, M., 147
volume
  as characteristic of Big Data, 3
  sequential Bayesian updating, 21–2
Weber, I., 147
weighting of probabilities, 306–8, 307, 311
Wickelgren, 322
Wolfe, J. M., 8
Wolff, C., 148
word association networks, 178–81
WordNet, 159–60, 176–7
Wu, L., 154, 155–6
Yamins, D. L. K., 353–4
Yap, M. J., 206, 207
Ye, C., 148
Yin, Z., 121
Zipf, G. K., 276
Zollers, A., 121