Habibi et al. 2020 Classifiers Preprint
Abstract
We explore how linguistic categories extend over time as novel items are assigned
to existing categories. As a case study we consider how Chinese numeral classifiers
were extended to emerging nouns over the past half century. Numeral classifiers are
common in East and Southeast Asian languages, and are prominent in the cognitive
linguistics literature as examples of radial categories. Each member of a radial category
is linked to a central prototype, and this view of categorization therefore contrasts with
exemplar-based accounts that deny the existence of category prototypes. We explore
these competing views by evaluating computational models of category growth that draw
on existing psychological models of categorization. We find that an exemplar-based
approach closely related to the Generalized Context Model provides the best account
of our data. Our work suggests that numeral classifiers and other categories previously
described as radial categories may be better understood as exemplar-based categories,
and thereby strengthens the connection between cognitive linguistics and psychological
models of categorization.
Keywords: Category growth; Semantic chaining; Radial categories; Categorization;
Exemplar theory; Numeral classifiers
∗ Corresponding author
Email address: yangxu@cs.toronto.edu (Yang Xu)
Language users routinely face the challenge of categorizing novel items. Over the
past few decades, items such as emojis, blogs and drones have entered our lives and
we have found ways to talk about them. Sometimes we create new categories for novel
items, but in many cases we assign them to existing categories. Here we present a
computational analysis of the cognitive process by which categories extend in meaning
over time.
Lakoff and other scholars [1, 2, 3, 4] have suggested that linguistic categories grow
over time through chaining, a process that links novel items to existing items that are
semantically similar, hence forming chain-like structures of meaning [1]. Although
Lakoff briefly suggests how chaining applies to semantic categories (e.g. the concept
of “climbing”), his two most prominent examples of chaining involve grammatical
categories. The first example is the classifier system of Dyirbal (an Australian Aboriginal
language), which groups together nouns that may not seem closely related on the surface.
For instance, the word balan may precede nouns related to women, fire and dangerous
things. The second example is the Japanese classifier hon, which can be applied to a
variety of long thin objects such as pencils, sticks and trees. Where an English speaker
might say “one pencil,” a Japanese speaker must insert the appropriate classifier (here
hon) between the numeral and the noun. Although hon is most typically applied to
long thin objects, it can also be applied to martial arts contests using swords (which
are long thin objects), and to medical injections (which are carried out using long, thin
needles). Martial arts contests and medical injections have little in common, but both
can be connected to central members of the hon category through a process of chaining.
In Lakoff’s work the notion of chaining is coupled with the notion of centrality,
which proposes that a category is organized around a central core. Combining chaining
with centrality leads to the notion of a radial category, or one that can be characterized
as a network of chains that radiate out from a center [1, 5]. Subsequent work in cognitive
linguistics relaxes the idea of a single center and allows that radial categories may have
“several centers of comparable importance” (Palmer & Woodman, 2000, p. 230), but is
still committed to the idea that some members of a radial category are privileged by
virtue of their centrality. In principle, however, the notions of chaining and centrality
can be decoupled. Consider, for example, a category that is constructed by starting with
one element and repeatedly adding a new element that is similar to a randomly chosen
member of the category. This generative process seems consistent with the notion of
chaining, but the categories it produces may take the form of sprawling networks rather
than collections of chains radiating out from a center.
Many discussions of chaining within cognitive linguistics are heavily influenced
by Rosch and her prototype theory of categorization (e.g., Geeraerts, 1997), but this
literature has been largely separate from the psychological literature on computational
models of categorization [7, 8]. The modeling literature includes many comparisons
between exemplar models and prototype models of categorization, and the question of
whether categories have a central core lies at the heart of the difference between the
two approaches. Exemplar models propose that the representation of a category is no
more than an enumeration of all members of the category, but prototype models propose
that category representations incorporate some additional element such as a prototype,
a central tendency or a set of core examples.1 Decoupling chaining from centrality
means that the process of chaining is potentially compatible with both prototype-based
and exemplar-based accounts of categorization, and opens up the possibility of formal
accounts of chaining that build on exemplar models like the Generalized Context Model
(GCM, Nosofsky, 1986) that have achieved notable success as psychological models of
categorization. Here we evaluate a suite of formal models, including a prototype model
and a family of exemplar models, and find that an exemplar model closely related to the
GCM provides the best account of category growth over time. Our results are broadly
consistent with previous work on computational models of categorization, which often
finds that exemplar theory outperforms prototype theory when instances of the two are
put to the test.
Following Lakoff we focus on grammatical categories, and as a case study we
1 In her later work Rosch explicitly suggested that “prototypes do not constitute a theory of representation
for categories” (Rosch, 1978, p. 40). Much of the literature on prototype theory, however, does make
representational claims.
consider how Chinese numeral classifiers have been applied to novel nouns over the
past fifty years. As with Japanese classifiers, Chinese classifiers are obligatory when
a noun is paired with a numeral, e.g., one [classifier_x] person or two [classifier_y]
documents. Although we focus on Chinese classifiers, numeral classifiers are found in
many other languages around the world, and have been extensively studied by cognitive
psychologists, linguists, and anthropologists [11, 1, 12, 13, 14]. For instance, Allan
(1977) has suggested that classifiers across languages often capture perceptual properties
such as shape and size, and Aikhenvald (2000) has suggested that classifiers also
capture more abstract features such as animacy. Although previous scholars have
explored how people assign classifiers to nouns [15, 16], most of this work has not been
computational. Our approach goes beyond the small amount of existing computational
work [17, 18, 19, 20, 21] by analyzing historical data and focusing on the application of
classifiers to novel nouns.
There are at least three reasons why numeral classifiers provide a natural venue for
testing computational theories of category extension. First, they connect with classic
examples such as Lakoff’s analysis of hon that are central to the cognitive linguistics
literature on chaining and category extension. Second, classifiers are applied to nouns,
which form a broad and constantly-expanding part of the lexicon, and therefore offer
many opportunities to explore how linguistic categories are applied to novel items.
Third, the item classified by a term like hon is typically the noun phrase that directly
follows the classifier, which makes it relatively simple to extract category members from
a historical corpus (e.g., via part-of-speech tags).
Our work goes beyond Lakoff’s treatment of classifiers in three important ways. First,
we present a computational framework that allows us to evaluate precise hypotheses
about the mechanism responsible for chaining. Second, we test these hypotheses
broadly by analyzing a large set of classifiers and their usage in natural contexts,
instead of considering a handful of isolated examples. Third, as mentioned already our
space of models includes exemplar-based approaches that have not been explored in
depth by previous computational accounts of chaining. Previous scholars have given
exemplar-based accounts of several aspects of language including phonetics, phonology,
morphology, word senses, and constructions [22, 23, 24, 8, 25, 26], and our approach
builds on and contributes to this tradition.
Our approach also builds on recent computational work that explores formal models
of chaining in the historical emergence of word meanings. In particular, Ramiro et al.
(2018) demonstrated that neighbourhood-based chaining algorithms can recapitulate
the emerging order of word senses recorded in the history of English. This work found
that the best-performing algorithm was a nearest-neighbour model that extends the
semantic range of a word by connecting closely related senses. Two earlier studies
report that the same nearest-neighbour model also accounts for container naming across
languages [28, 29]. This paper compares a suite of models including the nearest-
neighbour model that was successful in previous work. We find that our historical
data on the growth of Chinese classifiers is best explained by a model that adjusts the
nearest-neighbour approach in several ways that are consistent with the GCM [10], an
influential exemplar-based model of categorization. Our results therefore suggest that
the same categorization mechanisms studied in lab-based tests of the GCM may help to
explain how real-world linguistic categories extend over time.
2. Theoretical framework
Figure 1 illustrates how semantic chaining might influence which Chinese classifier
is applied to a novel noun. We begin by assuming that nouns correspond to points in a
semantic space. Given a novel noun, the classifier for that noun can then be predicted
based on classifiers previously applied to nearby nouns in the space. In Figure 1 the
novel noun is referendum, which entered the Chinese lexicon around the year 2000.
Nearby nouns in the space have two different classifiers: 次 (cì) is used for nouns like
“employment,” “funding” and “speech” (shown as orange circles) and 项 (xiàng) is
used for nouns like “extension” and “estimate” (shown as blue triangles). The year in
which each noun emerged has been estimated from a corpus described later, and in this
corpus the first appearance of “referendum” happens to be paired with cì (the orange
classifier).
The notion of chaining suggests that “referendum” is classified by linking it with
one or more previously encountered nouns that are similar in meaning. In Figure 1,
“referendum” has been linked with 11 nearby nouns. According to the corpus, the nouns
closest to “referendum” tend to be paired with cì, which may explain why cì is also
used for “referendum.” Iterating this process through time leads to chaining because the
classification of “referendum” influences classifications of subsequently encountered
nouns – in particular, assigning cì to “referendum” means that the same classifier is
more likely to be used for novel nouns near “referendum.”
The informal characterization of chaining just presented leaves many details un-
specified, and the following sections attempt to fill in some of these gaps. The next
section presents a formal framework for modelling category growth over time. We
then specify a set of competing hypotheses about the function that determines how the
classifications of nearby nouns influence the classification of a novel noun. A subsequent
section discusses the nature and origin of the semantic space that captures similarity
relationships between the nouns.
Figure 1: An illustration of chaining in Chinese classifiers. “Referendum” entered the language around 2000,
and nearby nouns in semantic space are shown as orange circles or blue triangles depending on which of two
classifiers our corpus pairs them with. The closest nouns belong to the orange category, and “referendum”
is also assigned to this category. For visual clarity only selected nouns in the space have been labeled. The
background colors indicate how strongly each classifier is favored in each region. The blue category is favored
in the darker regions near the top, and the orange category is favored elsewhere in the space.
Equation 1 casts category extension as sequential probabilistic inference, where the
goal is to predict future category labels at time t+ given the likelihood f(x* | c)_t and prior
p(c)_t at the current time t. This formulation postulates that the probability of assigning
x* to category c is jointly influenced by the probability of seeing that item given the
category, and the prior probability of choosing that category.
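Written out (a reconstruction from the description above), Equation 1 takes the Bayesian form

$$p(c \mid x^*)_{t^+} \;\propto\; f(x^* \mid c)_t \; p(c)_t \tag{1}$$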
The general framework in Equation 1 can be used to explore how categories from
many parts of speech (including nouns, adjectives, verbs, and adverbs) are applied to
novel items. Here we focus on classifiers, and therefore assume that category c is a
classifier and that item x* is a noun.
Most previous computational models of chaining [28, 29, 27] rely on a nearest-
neighbour (1nn) approach that assigns a novel item to the same category as the nearest
familiar item. Let n^k_c denote the number of items with category label c among the k
items most similar to x*. 1nn can then be formulated using the function

$$f(x^* \mid c) = \begin{cases} 1 & \text{if } n^1_c = 1 \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$
1nn corresponds exactly to previous computational work on chaining [28, 29, 27],
but we suspected that the 1-neighbor assumption might be too strict. We therefore
evaluated a set of k-nearest-neighbor classifiers that assign a category label to x* that
matches the most common label among the k items most similar to x*:

$$f(x^* \mid c) = \begin{cases} 1 & \text{if } n^k_c = \max_{c' \in C} n^k_{c'} \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$

where C is the set of all categories and max_{c' ∈ C} n^k_{c'} is the frequency of the most common
category among the k items most similar to x*. We evaluated a total of 10 different
models (including 1nn) that set k to all integers between 1 and 10 inclusive.
Although some exemplar-based models rely on a nearest neighbor approach, the
dominant exemplar-based approach considers relationships between a novel item and all
previously encountered items, weighting each one by its similarity to the novel item:
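Following the GCM [10] and Shepard's analysis of generalization [30], we take similarity to decay exponentially with distance d in semantic space (a reconstruction of the two displayed likelihood functions):

$$f(x^* \mid c) = \sum_{x \in c} e^{-d(x^*,\, x)} \tag{5}$$

Introducing a sensitivity parameter s that scales semantic distances gives the general form

$$f(x^* \mid c) = \sum_{x \in c} e^{-s \, d(x^*,\, x)} \tag{6}$$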
Large values of s mean that similarity falls off rapidly with distance, which in turn
means that only the nearest exemplars to a novel item influence how it is classified.
Smaller values of s lead to broader regions of influence. We will refer to the likelihood
function in Equation 6 as the exemplar approach, and the function in Equation 5 as the
exemplar (s=1) approach.
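As a concrete illustration, the nearest-neighbour and exemplar likelihoods might be implemented as follows (a minimal Python sketch, assuming nouns are embedding vectors compared with Euclidean distance; the function names are ours, not from the paper's released code):

```python
import numpy as np

def knn_likelihood(x_new, exemplars, labels, category, k=1):
    # Equations 3-4: likelihood is 1 if `category` is (one of) the most
    # common label(s) among the k exemplars nearest to x_new, else 0.
    order = np.argsort(np.linalg.norm(exemplars - x_new, axis=1))
    nearest = labels[order[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return float(category in values[counts == counts.max()])

def exemplar_likelihood(x_new, exemplars, labels, category, s=1.0):
    # Equation 6: sum of exponentially decaying similarities between
    # x_new and every previously encountered exemplar of the category.
    members = exemplars[labels == category]
    dists = np.linalg.norm(members - x_new, axis=1)
    return float(np.exp(-s * dists).sum())
```

Combining either likelihood with a prior via Equation 1 and taking the classifier with highest posterior yields the model's prediction.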
All likelihood functions introduced so far are broadly compatible with the exemplar-
based view of categories. As mentioned earlier, however, many cognitive linguists view
chaining as a mechanism for generating radial categories, and the notion of a radial
category is derived from Rosch’s prototype theory. Ideally we would like to evaluate a
prototype model with a likelihood function that captures Lakoff’s views about radial
categories, and in particular his view of classifier categories like Japanese “hon.” To
our knowledge such a model has never been formulated, but the psychological literature
does include simple prototype models of categorization. Here we evaluate one such
model, which assumes that the prototype of a category is the average of all exemplar
types that belong to the category [33]:

$$\text{prototype}_c = \frac{1}{|c|} \sum_{x \in c} x \tag{7}$$
This approach allows the prototype of a category to change over time as new
exemplars are added to the category, and postulates that category extension occurs by
linking a novel item to the prototype that is closest in semantic space. Even if a novel
item lies closer to the prototype of category A than that of category B, the handful of
exemplars closest to the item may belong to category B, which means that the prototype
and exemplar models sometimes make different predictions. Although the prototype
model evaluated here is useful as a starting point, developing and evaluating more
sophisticated computational models of prototype theory is an important challenge for
future work, and we return to this issue in the general discussion.
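For illustration, a moving prototype of this kind can be maintained incrementally (a sketch under the same assumptions as above):

```python
import numpy as np

class MovingPrototypes:
    # Equation 7: each prototype is the running mean of the exemplar
    # types assigned to its category, so it drifts as the category grows.
    def __init__(self):
        self.sums, self.counts = {}, {}

    def add(self, x, category):
        self.sums[category] = self.sums.get(category, 0.0) + x
        self.counts[category] = self.counts.get(category, 0) + 1

    def predict(self, x_new):
        # link the novel item to the category whose prototype is nearest
        return min(self.sums, key=lambda c: np.linalg.norm(
            self.sums[c] / self.counts[c] - x_new))
```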
Although the thirteen likelihood functions capture different assumptions about
chaining, they are comparable in model complexity. The only parameter tuned in
our model comparison is s, the sensitivity parameter used by the exemplar model.
To avoid giving this model an unfair advantage we set this parameter based on held-
out data. In principle one could consider sensitivity-weighted versions of the other
likelihood functions, but for our purposes these variants turn out to be equivalent to
the versions without sensitivity weights. We will evaluate our models based on the
proportion of correct classifications that they predict, and adding sensitivity weights
to the nearest-neighbour and prototype models changes the confidence associated with
their classifications but not the classifications themselves.
[Figure 2: four panels: (a) 1NN, (b) 5NN, (c) Exemplar, (d) Prototype.]
Figure 2: Illustrations of four likelihood functions. Each panel assumes that there are two categories shown
as circles and triangles, and that a novel item shown as a question mark must be assigned to one of these
categories. Edges show relationships between the novel item (question mark) and previously encountered
items. In (c), the edges differ in thickness because items closer to the novel item are weighted more heavily. In
(d), the two nodes labelled “P” are prototypes of the two categories.
We also evaluated two baseline models: one assigns a novel noun at time t+ to a
classifier chosen uniformly at random, and the other assigns it to the classifier with
maximum type frequency up to time t (i.e. the classifier that has been paired with the
greatest number of different nouns). These baselines can
be interpreted as models that use either a uniform or a size-based prior but assume that
the likelihood function in Equation 1 is constant.
Although the exemplar and prototype models are formally different, it is possible
that they lead to categories with similar statistical properties. For example, even though
an exemplar-based category includes no central core, it is still possible that categories
grown according to the exemplar model tend to end up roughly convex in shape with
members arranged around a central region. To examine whether and how the exemplar
and prototype models produce different kinds of categories, we compared these models
using simulated data.
Simulation procedure.
We simulate category growth in a continuous two-dimensional space bounded by
[0,1] along each dimension. Each run begins with three randomly chosen points that
serve as seed exemplars for three categories. We then generate additional random points,
one at a time, and record the category labels assigned to each point by the exemplar
and prototype models. In addition to the prototype model described above, we also
consider a static prototype model where the category prototypes are fixed throughout
time and correspond to the three seed exemplars. Figure 3 illustrates one simulation run
and shows category growth according to the three models over 100 iterations. Although
all three models are given the same sequence of points, they produce different category
systems by the end of the run. We used two quantitative measures to compare systems
produced by the models: category size and category discriminability.
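For concreteness, one run of this procedure might be implemented as follows (a sketch; the classification callback, seed, and point count are illustrative):

```python
import numpy as np

def simulate(classify, n_points=100, seed=0):
    # One run: three random seed exemplars found three categories, then
    # uniformly random points are labelled one at a time by `classify`.
    rng = np.random.default_rng(seed)
    points = [p for p in rng.uniform(0, 1, size=(3, 2))]
    labels = [0, 1, 2]
    for _ in range(n_points):
        x = rng.uniform(0, 1, size=2)
        labels.append(classify(x, np.array(points), np.array(labels)))
        points.append(x)
    return np.array(points), np.array(labels)

def exemplar_classify(x, points, labels, s=1.0):
    # Equation 6 turned into a decision rule: pick the category with
    # the largest summed exponential similarity to x.
    sims = np.exp(-s * np.linalg.norm(points - x, axis=1))
    return max(set(labels.tolist()), key=lambda c: sims[labels == c].sum())
```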
Expected category size.
The first measure quantifies the average size of categories generated by each model.
The prototype models are consistent with the notion of radial categories, and we expected
that they would tend to produce compact categories with members arranged around a
central prototype. The exemplar model, however, allows more scope for categories that
consist of elongated chains or other arbitrary shapes.
We measured category size as the area of the convex hull that includes all members
of a category. Expected category size is then computed as the average of this quantity
across the three categories in the simulation. Figure 3 shows that expected category size
is greater for the exemplar model than for the two prototype models, supporting the
intuition that exemplar-based categories tend to be less compact than radial categories.
Figure 4 (left panel) confirms this finding across 500 simulated runs. We found that
the exemplar model generally produces an expected category size that is substantially
greater than the prototype model with a moving core, and both of these models generate
categories that are larger on average than those produced by the static prototype model.
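This measure can be computed with scipy's ConvexHull (a sketch; categories with fewer than three members are skipped):

```python
import numpy as np
from scipy.spatial import ConvexHull

def expected_category_size(points, labels):
    # average convex-hull area across categories; a hull needs at
    # least three (non-collinear) member points
    areas = []
    for c in set(labels.tolist()):
        members = points[labels == c]
        if len(members) >= 3:
            areas.append(ConvexHull(members).volume)  # .volume is area in 2D
    return float(np.mean(areas)) if areas else 0.0
```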
Category discriminability.
The second measure quantifies the degree to which categories are discriminable (or
separable) under each model. High discriminability means that there are relatively few
ambiguous cases near the boundary between two categories, and near-optimal systems
of categories will tend to have high discriminability. If exemplar-based categories tend
to be elongated, one might expect that they intertwine in complex ways and are therefore
less discriminable than the more convex categories produced by the prototype models.
We quantify category discriminability using an extension of Fisher’s linear discriminant
that allows for more than two categories. Given k = 3 categories with category
means m1, m2, m3 and covariances Σ1, Σ2, Σ3, we compute Fisher’s discriminant
ratio r by weighing the cross-category separability (of the means) against the pooled
within-category variabilities (based on the covariance determinants):
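One form consistent with this description (a reconstruction; the exact weighting used in the original is an assumption) is

$$r \;=\; \frac{\sum_{i<j} d(m_i, m_j)}{\sum_{i=1}^{k} \det(\Sigma_i)} \tag{8}$$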
Here d() represents Euclidean distance. A high discriminability value indicates that the
categories are highly separable, and is achieved, for example, if inter-category distances
are high and within-category variability is low.
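Under this assumed form, the ratio can be computed as follows (a sketch; assumes every category has at least two members):

```python
import numpy as np
from itertools import combinations

def fisher_ratio(points, labels):
    # cross-category separability of the means over pooled
    # within-category variability (covariance determinants)
    groups = [points[labels == c] for c in set(labels.tolist())]
    means = [g.mean(axis=0) for g in groups]
    between = sum(np.linalg.norm(a - b) for a, b in combinations(means, 2))
    within = sum(np.linalg.det(np.cov(g.T)) for g in groups)
    return between / within
```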
Figure 4 (right panel) shows that the exemplar and prototype models both produce
categories with equally high discriminability, and that both models produce more
discriminable categories than the static prototype model. Even though exemplar-based
categories are less compact than prototype-based categories, Figure 4 suggests that this difference in
compactness has no implications for discriminability, which is consistent with previous
findings from container naming that neighbourhood-based chaining tends to yield
categories that are near-optimally structured for efficient communication [29].
Taken together, our simulations suggest two general messages. First, the fact
that exemplar and prototype models produce category systems with similar levels of
discriminability suggests that the two models lead to outcomes that are similar in key
respects. As a result, careful analyses may be needed to distinguish between these two
competing models of category growth. Second, the results for category size reveal that
exemplar and prototype models do lead to patterns of category growth with statistically
different properties. This finding means that analyses of real-world categories (e.g.
Chinese classifiers) can plausibly aim to determine whether the process underlying the
growth of these categories is closer to an exemplar model or a prototype model.
Figure 3: Simulated category growth under the exemplar and prototype models. A static version of the
prototype model is also considered where the prototype remains fixed (as opposed to dynamic) over time.
Figure 4: Category compactness and discriminability analysis of the exemplar and prototype models. Category
size (left panel) and Fisher discriminant ratio (right panel) are calculated under each model over multiple
simulation runs with random initial points. Shaded areas correspond to 95% confidence bands.
We next applied the models to the growth of Chinese classifiers through time. Doing
so required three primary sources of data: 1) a large repository of web-scraped Chinese
(classifier, noun) pairs; 2) historical time stamps that record the first attested usage of
each (classifier, noun) pair; and 3) a semantic space capturing similarity relationships
between nouns.
over the period 1940–2003. We specifically searched for (classifier, noun) pairs that had
a “_NOUN” tag for their noun part.
Each of the models described previously was evaluated based on its ability to predict
classifiers assigned to novel nouns over the period 1951 to 2003. We assessed these
predictions incrementally over time: for each historical year where a novel classifier-
noun usage appeared according to the time stamps, we compared the held-out true
classifier with the model-predicted classifier that had the highest posterior probability
(i.e., the term on the left of Equation 1) given the novel noun. In cases where a noun
appeared with multiple classifiers, we excluded classifiers that had previously appeared
with the noun when computing model predictions (i.e., we only make predictions about
classifiers that are paired with a noun for the first time). This procedure ensures that
there are no repeated predictions from any of the models.

2 We thank an anonymous reviewer for pointing out that Chinese embeddings smuggle in information about
noun-classifier pairings.

3 Three native speakers of Mandarin Chinese independently inspected a sample of 100 Chinese-English noun
pairs and considered 98, 97, and 95 of those translations to be acceptable, respectively.
To estimate the sensitivity parameter s of the exemplar models, for each year we
used data from all preceding years (i.e. data from 1941 until that year) and performed
an optimization within the range of 0.1 to 100 to identify the s that maximized the
performance of the model on the nouns that emerged during the previous year.4
Appendix A includes the estimated values of the sensitivity parameter for these models.
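Concretely, the per-year search described in footnote 4 might look as follows (a sketch; the held-out scoring function is a placeholder):

```python
import numpy as np

# grid from footnote 4: step 0.1 over 0.1-1.0, step 1.0 over 1.0-100.0
CANDIDATE_S = np.concatenate([np.arange(0.1, 1.0, 0.1),
                              np.arange(1.0, 101.0, 1.0)])

def fit_sensitivity(score):
    # score(s) -> accuracy of the exemplar model with sensitivity s on
    # the nouns that emerged during the previous year (placeholder)
    return max(CANDIDATE_S, key=score)
```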
4.5. Results
Figure 5 summarizes the overall predictive accuracies of the models. The best
performing model overall was the exemplar model (s = 1) with size-based prior. All
models are based on types rather than tokens: for example, P(c) is proportional to the
number of types that classifier c is paired with rather than the combined count of tokens
of these types. Appendix B includes results for token-based models and shows that they
perform uniformly worse than their type-based equivalents. We return to this finding in
the general discussion, but focus here on the results for type-based models, and begin
by considering the contribution made by each component of these models.
Contribution of the prior.
The baseline model with size-based prior and constant likelihood achieved an
accuracy of 29.6%, which is substantially better than random choice (accuracy of 1.6%
among 127 classifiers). Figure 5 shows that the size-based prior led to better performance
than the uniform prior. In 12 out of 13 cases, a size-based model performed better than
the same model with uniform prior (p < 0.002, n = 13, k = 12 under a binomial test).5
4 A line search was performed with a step size of 0.1 in the range 0.1–1.0 and a step size of 1.0 in the range
1.0–100.0.

5 In all k-nearest-neighbor models, we used the size-based prior to break ties among classifier categories
that share the same number of nearest neighbors to a noun. In the uniform-prior case, we randomly choose a
classifier if there is a tie.
Our results therefore support the idea that being paired with many types of nouns in the
past makes a classifier especially likely to be applied to novel nouns.
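The reported p-value is consistent with a one-sided binomial test, which can be checked directly (assuming scipy):

```python
from scipy.stats import binomtest

# 12 of 13 model pairs favored the size-based prior; under a null of no
# preference (p = 0.5), a one-sided test gives ~0.0017, i.e. p < 0.002
print(binomtest(12, n=13, p=0.5, alternative='greater').pvalue)
```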
Figure 5: Summary of predictive accuracies achieved by all models under the two priors.
The improvement in performance from 1 neighbor to higher-order neighbors suggests that approximation of
neighborhood density matters in the process of chaining. The exemplar model can be
considered as a soft but more comprehensive version of the k-nearest-neighbor model
class, where all semantic neighbors are considered and weighted by distance to the
novel item in prediction.
Figure 6 confirms our findings by showing the time courses of predictive accuracy
for the models, highlighting three aspects: 1) models with the size-based prior
generally achieved better performance than models with a uniform prior; 2) the best
overall exemplar model (s = 1) with the size-based prior is consistently superior to the
other competing models (including the prototype model) through the time period of
investigation; 3) increasing the order of nearest neighbors improves model prediction.
Our results therefore support a key theoretical commitment of the GCM, which proposes
that categorization judgments are made by computing a weighted sum over all previous
category exemplars.
[Figure 6: two panels (uniform and size-based priors); models shown: Baseline, 1nn, 10nn, Exemplar, Exemplar (s=1), Prototype; y-axis: predictive accuracy; x-axis: years 1955–2000.]
Figure 6: Predictive accuracies of representative models in the use of 127 Chinese classifiers at 5 year intervals
between 1955 and 2000.
involving transactions or business-related things) in our data, and the model predicts 起
, which is a classifier used for describing events.
Figure 7 shows precision and recall for individual classifiers based on the same
model. The model achieves high precision for a classifier if it is mostly correct when it
chooses that classifier. For example, 尊 is typically applied to sculpture, and the model
is always correct when it chooses this classifier. High recall is achieved for a classifier if
the model chooses that classifier in most cases in which it is actually correct. The recall
for 尊 is low, suggesting that the model fails to apply this classifier in many cases in
which it is correct.
The classifier with highest recall is 个, which is a generic classifier that is extremely
common. Recall tends to be greatest for the most frequent classifiers, which is expected
given that the model uses a frequency-based prior. The classifiers with highest precision
are specialized classifiers that are relatively rare, which means that they are rarely chosen
by the model in cases where they do not apply.
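Per-classifier precision and recall can be computed directly from the prediction log (a sketch; variable names are ours):

```python
from collections import Counter

def precision_recall(y_true, y_pred):
    # precision_c: fraction of the model's picks of c that are correct;
    # recall_c: fraction of c's true occurrences that the model picks
    hits, picks, truths = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        picks[p] += 1
        truths[t] += 1
        if t == p:
            hits[t] += 1
    return {c: (hits[c] / picks[c] if picks[c] else 0.0,
                hits[c] / truths[c]) for c in truths}
```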
Table 1: Examples of novel nouns, English translations, ground-truth Chinese classifiers and predictions of
the exemplar model (s = 1) with size-based prior.
Figure 7: Precision and recall of individual classifiers based on the best exemplar model. Marker size is
proportional to category size (i.e., number of different nouns paired with a classifier).
5. Discussion
Previous treatments of chaining have typically relied on qualitative analyses of a handful
of examples. Our work builds on these treatments by considering a set of computational
models of chaining and evaluating them across a
relatively large historical corpus.
To our knowledge, our work is the first to apply a computational model of chaining
to the domain of numeral classifiers, but previous papers have used formal models
of chaining to study container naming [28, 29] and word senses in a historical dictio-
nary [27]. Each of these contributions evaluates several formal models including a
weighted exemplar approach and finds that a nearest-neighbour approach performs best.
In contrast, we found that a weighted exemplar approach closely related to the GCM
provided the best account of our data. The reasons for these different conclusions are
not entirely clear. As suggested earlier, the weighted exemplar approach reduces to
a nearest-neighbour approach when the sensitivity parameter s becomes large, which
means that the weighted exemplar approach should always perform at least as well as
the nearest-neighbour approach for some value of s. For generic values of s, it seems
possible that the nature of the semantic representation influences the performance of
the weighted exemplar approach. Previous models of chaining used semantic represen-
tations based on human similarity judgments [28, 29] and a taxonomy constructed by
lexicographers [27], and it is possible that the word embeddings used in our work are
especially well suited to a weighted exemplar approach.
The literature on cognitive linguistics suggests some directions in which our work
can be extended. Lakoff presents chaining as a mechanism that leads to radial categories,
which are organized around one or more central cores and therefore qualify as prototype
categories. We evaluated a simple prototype model drawn from the psychological litera-
ture, and found that this model performed worse than an exemplar-based approach. This
result provides some initial support for the idea that numeral classifiers are best under-
stood as exemplar-based categories, but definitive support would require the evaluation
of more sophisticated prototype models that better capture the way in which linguists
think about radial categories. A key challenge is to develop semantic representations
that better capture the full richness of word meanings (e.g., multi-modal representations
that combine linguistic and extra-linguistic cues such as visual and conceptual relations).
For example, consider Lakoff’s proposal that Japanese hon is extended from long thin
objects to medical injections because injections are given using long, thin needles. Our
work represents nouns as points in a semantic space, and although useful for some
purposes this representation does not adequately capture the way in which multiple
concepts (e.g. the purpose of an injection, the setting in which it might occur, and
the instrument by which it is administered) come together to create a richly-textured
meaning. Developing improved semantic representations is a major research challenge,
but one possible approach is to combine the word embeddings used in our work with
more structured representations [38] that identify specific semantic relations (e.g. agent,
patient, instrument) between concepts.
A second important direction is to extend our approach to accommodate mechanisms
other than chaining that lead to meaning extension over time. For example, metaphor
(e.g., grasp: “physical action”→“understanding”) has been proposed as a key cognitive
force in semantic change [39], and recent work provides large-scale empirical evidence
of this force operating throughout the historical development of English [40]. An
apparent difference between chaining and metaphor is that chaining operates within
localized neighborhoods of semantic space, but metaphoric extension may link items
that are relatively remote (as in the case of grasp). Metaphorical extension (e.g., mouse:
“rodent”→“computer device”) could also rely on perceptual information that is beyond
430 the scope of our current investigation. As suggested already, a richer representation of
semantic space will be needed, and it is possible that the chaining mechanisms proposed
here will capture some aspects of metaphorical extension when operating over that
richer representational space.
Our work is grounded in the psychological literature on categorization, and joins a
number of previous projects [41, 42, 43] in demonstrating how computational models
can be taken out of the laboratory and used to study real-world categories. Our best
performing model is a weighted-exemplar approach that is closely related to the GCM
and that goes beyond nearest-neighbor models in two main respects. First, it classifies a
novel item by comparing it to many previously-observed exemplars, not just a handful
of maximally-similar exemplars. Second, it uses a prior that favors classifiers that have
previously been applied to many different items. Both ideas are consistent with the
GCM, and our results suggest that both are needed in order to account for our data as
well as possible.
Our best model, however, differs from the GCM in at least one important respect.
Throughout we focused on type frequency rather than token frequency. For example, the
size-based prior in our models reflects the number of types a classifier was previously
paired with, not the number of previous tokens of the classifier. Models like the GCM can
be defined over types or tokens [44], but it is more common and probably more natural
to work with tokens rather than types. The empirical evidence from the psychological
literature on type versus token frequencies is mixed: some studies find an influence of
type frequency [45], but others suggest that token-based models perform better than
type-based models [44, 46]. It seems likely that type frequencies and token frequencies
both matter, but predicting how the two interact in any given situation is not always
straightforward.
Our finding that the exemplar model performed better given type frequencies rather
than token frequencies is broadly compatible with an extensive linguistic literature on
the link between type frequency and the productivity of a construction [47, 48, 49, 50].
For example, consider two past-tense constructions that both include a slot for a verb. If
the two constructions occur equally often in a corpus (i.e. token frequency is equal) but
one construction occurs with more different verbs (i.e. has higher type frequency) than
the other, then the construction with higher type frequency is more likely to be extended
to a novel verb. The link between type frequency and productivity is supported by
both corpus analyses and modeling work. For example, our results parallel the work of
Albright & Hayes (2003), who present a model of morphological learning that achieves
better performance given type frequencies instead of token frequencies.
Although the link between type frequency and productivity has been clearly es-
tablished, token frequency also affects linguistic generalizations. For instance, Bybee
(1985) suggests that high token frequency is negatively related to productivity, because a
470 construction that appears especially frequently with one particular item may be learned
as an unanalyzed whole instead of treated as a structure with slots that can be filled by
a range of different items. Items with high token frequencies may also be treated as
category prototypes [53, 49], which means that token frequency will be relevant when
developing prototype models more sophisticated than the one evaluated here. Previous
475 theories [47, 54, 48, 55] and computational models [56, 57] of language learning have
incorporated both type frequency and token frequency, and extending our approach in
this direction is a natural goal for future work.
The psychological literature suggests at least two additional directions that future
work might aim to pursue. We considered how an entire speech community handles
new items that emerge over a time scale of years or decades, but psychological models
often aim to capture how individuals learn on a trial-by-trial basis. Accounting for the
classifications made by individual speakers is likely to require ideas that go beyond
our current framework. For example, individuals might be especially likely to reuse
classifiers that have recently occurred in a conversation, and there may be kinds of
selective attention that operate at a timescale of seconds or minutes and are not well
captured by the models used in our work. Psychologists have studied how numeral
classifiers are applied in the lab [15, 16], and there is an opportunity to combine this
experimental approach with the modeling approach that we have developed. A second
important direction is to explore how children acquire numeral classifiers over the
490 course of development. If applied to a corpus of child-directed speech, our model could
potentially make predictions about errors made by children as they gradually learn the
adult system of numeral classifiers.
6. Conclusion
Although we focused on numeral classifiers, our approach is relatively general, and
could be used to explore how other linguistic categories change over time. In recent
years historical corpora have become more accessible than ever before, and we hope
that future work can build on our approach to further explore how linguistic change is
shaped by cognitive processes of learning and categorization.
7. Acknowledgements
We thank Luis Morgado da Costa for sharing the classifier dataset, and three anony-
mous reviewers for helpful comments on the manuscript. This work is supported by an
NSERC Discovery Grant, a SSHRC Insight Grant, and a Connaught New Researcher
Award to YX.
Appendix A

Figures A.8 and A.9 show the estimated values of the sensitivity parameter for the
exemplar models under different choices of prior and semantic space, based on types
and tokens separately.
Figure A.8: Estimated optimal values of the sensitivity parameter (s) from the type-based models.
Figure A.9: Estimated optimal values of the sensitivity parameter (s) from the token-based models.
Appendix B

The models in the main text are based on types rather than tokens, and Figure B.10
shows corresponding results for token-based exemplar and prototype models (keeping
k-nearest-neighbor models the same because token-based results for low-order k’s are
effectively invariant and similar to a type-based 1nn model). For the prototype model,
we defined the prototype of a category as the frequency-weighted average:
$$\text{prototype}_c = E[x \mid c] = \sum_{x \in c} x \, p(x \mid c) = \sum_{x \in c} x \, \frac{\text{freq}(x)}{\sum_{x' \in c} \text{freq}(x')} \tag{B.1}$$
Figure B.10: Summary of predictive accuracies achieved by all token-based models under the two priors.
Code and data used for our analyses are available on GitHub at
https://github.com/AmirAhmadHabibi/ChainingClassifiers. Pre-trained English
word2vec embeddings are available at https://code.google.com/archive/p/word2vec/
[58, 59, 60], and the N-gram data we used from the Google Books corpus are available at
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html [61].
References
[1] G. Lakoff, Women, Fire, and Dangerous Things: What Categories Reveal About
the Mind, University of Chicago Press, Chicago, 1987.
[7] F. Polzenhagen, X. Xia, Language, culture, and prototypicality, in: The Routledge
Handbook of Language and Culture, Routledge, 2014, pp. 269–285.
[9] E. Rosch, Principles of categorization, in: E. Rosch, B. B. Lloyd (Eds.), Cognition
and categorization, Lawrence Erlbaum Associates, New York, 1978, pp. 27–48.
[11] A. Y. Aikhenvald, Classifiers: A typology of noun categorization devices, Oxford
University Press, Oxford, 2000.
[16] J. H. Y. Tai, Chinese classifier systems and human categorization, in: In Honor of
William S.-Y. Wang: Interdisciplinary Studies on Language and Language Change,
Pyramid Press, Taipei, 1994, pp. 479–494.
[17] H. Guo, H. Zhong, Chinese classifier assignment using SVMs, in: Proceedings of
the 4th SIGHAN Workshop on Chinese Language Processing, 2005.
[20] L. Morgado da Costa, F. Bond, H. Gao, Mapping and generating classifiers using
an open Chinese ontology, in: Proceedings of the 8th Global WordNet Conference,
2016.
[21] M. Zhan, R. Levy, Comparing theories of speaker choice using a model of classifier
production in Mandarin Chinese, in: Proceedings of the 17th Annual Conference
of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, 2018, pp. 1997–2005.
[29] Y. Xu, T. Regier, B. C. Malt, Historical semantic chaining and efficient communi-
cation: The case of container names, Cognitive Science 40 (2016) 2081–2094.
[30] R. N. Shepard, Stimulus and response generalization: tests of a model relating gen-
eralization to distance in psychological space, Journal of Experimental Psychology
55 (1958) 509–523.
[31] R. M. Nosofsky, Luce’s choice model and Thurstone’s categorical judgment model
compared: Kornbrot’s data revisited, Attention, Perception, & Psychophysics 37
(1985) 89–91.
[32] F. G. Ashby, L. A. Alfonso-Reese, Categorization as probability density estimation,
Journal of Mathematical Psychology 39 (1995) 216–233.
[36] Y. Luo, Y. Xu, Stability in the temporal dynamics of word meanings, in: Proceed-
ings of the 40th Annual Meeting of the Cognitive Science Society, 2018.
[38] C. F. Baker, C. J. Fillmore, J. B. Lowe, The Berkeley FrameNet project, in: Proceed-
ings of the 17th International Conference on Computational Linguistics – Volume 1,
Association for Computational Linguistics, 1998, pp. 86–90.
[40] Y. Xu, B. C. Malt, M. Srinivasan, Evolution of word meanings through metaphori-
cal mapping: Systematicity over the past millennium, Cognitive Psychology 96
(2017) 41–53.
[42] W. Voorspoels, W. Vanpaemel, G. Storms, Exemplars and prototypes in natural
language concepts: A typicality-based evaluation, Psychonomic Bulletin & Review
15 (3) (2008) 630–637.
[45] A. Perfors, K. Ransom, D. Navarro, People ignore token frequency when deciding
how widely to generalize, in: Proceedings of the Annual Meeting of the Cognitive
Science Society, Vol. 36, 2014.
[47] J. Bybee, Language, usage and cognition, Cambridge University Press, Cambridge,
2010.
[48] J. Barðdal, Productivity: Evidence from case and argument structure in Icelandic,
John Benjamins, Amsterdam, 2008.
[51] A. Albright, B. Hayes, Rules vs. analogy in English past tenses: A computa-
tional/experimental study, Cognition 90 (2) (2003) 119–161.
[52] J. L. Bybee, Morphology: A study of the relation between meaning and form, John
Benjamins, Amsterdam/Philadelphia, 1985.
[53] J. Bybee, D. Eddington, A usage-based approach to Spanish verbs of ‘becoming’,
Language 82 (2006) 323–355.
[55] J. Barðdal, The semantic and lexical range of the ditransitive construction in the
history of (North) Germanic, Functions of Language 14 (1) (2007) 9–30.
[60] T. Mikolov, W.-T. Yih, G. Zweig, Linguistic regularities in continuous space word
representations, in: NAACL-HLT, 2013.