
SCIENCE ADVANCES | RESEARCH ARTICLE

NETWORK SCIENCE

A network approach to topic models

Martin Gerlach1,2*, Tiago P. Peixoto3,4, Eduardo G. Altmann2,5

1Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208, USA. 2Max Planck Institute for the Physics of Complex Systems, D-01187 Dresden, Germany. 3Department of Mathematical Sciences and Centre for Networks and Collective Behaviour, University of Bath, Claverton Down, Bath BA2 7AY, UK. 4Institute for Scientific Interchange Foundation, Via Alassio 11/c, 10126 Torino, Italy. 5School of Mathematics and Statistics, University of Sydney, 2006 New South Wales, Australia.
*Corresponding author. Email: martin.gerlach@northwestern.edu

Copyright © 2018 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).

One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach that infers the latent topical structure of a collection of documents. Despite their success—particularly of the most widely used variant called latent Dirichlet allocation (LDA)—and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, for example, a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. We obtain a fresh view of the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. We achieve this by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods (using a stochastic block model (SBM) with nonparametric priors), we obtain a more versatile and principled framework for topic modeling (for example, it automatically detects the number of topics and hierarchically clusters both the words and documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. Our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields.



INTRODUCTION
The accelerating rate of digitization of information increases the importance and number of problems that require automatic organization and classification of written text. Topic models (1) are a flexible and widely used tool that identifies semantically related documents through the topics they address. These methods originated in machine learning and were largely based on heuristic approaches such as singular value decomposition in latent semantic indexing (LSI) (2) in which one optimizes an arbitrarily chosen quality function. Only a more statistically principled approach, based on the formulation of probabilistic generative models (3), allowed for a deeper theoretical foundation within the framework of Bayesian statistical inference. This, in turn, leads to a series of key developments, in particular, probabilistic LSI (pLSI) (4) and latent Dirichlet allocation (LDA) (5, 6). The latter established itself as the state-of-the-art method in topic modeling and has been widely used not only for recommendation and classification (7) but also for bibliometrical (8), psychological (9), and political (10) analysis. Beyond the scope of natural language, LDA has also been applied in biology (11) [developed independently in this context (12)] and image processing (13).

However, despite its success and overwhelming popularity, LDA is known to suffer from fundamental flaws in the way it represents text. In particular, it lacks an intrinsic methodology to choose the number of topics and contains a large number of free parameters that can cause overfitting. Furthermore, there is no justification for the use of the Dirichlet prior in the model formulation besides mathematical convenience. This choice restricts the types of topic mixtures and is not designed to be compatible with well-known properties of real text (14), such as Zipf's law (15) for the frequency of words. More recently, consistency problems have also been identified with respect to how planted structures in artificial corpora can be recovered with LDA (16). A substantial part of the research in topic models focuses on creating more sophisticated and realistic versions of LDA that account for, for example, syntax (17), correlations between topics (18), meta-information (such as authors) (19), or burstiness (20). Other approaches consist of post-inference fitting of the number of topics (21) or the hyperparameters (22) or the formulation of nonparametric hierarchical extensions (23–25). In particular, models based on the Pitman-Yor (26–28) or the negative binomial process have tried to address the issue of Zipf's law (29), yielding useful generalizations of the simplistic Dirichlet prior (30). While all these approaches lead to demonstrable improvements, they do not provide satisfying solutions to the aforementioned issues because they share the limitations due to the choice of Dirichlet priors, introduce idiosyncratic structures to the model, or rely on heuristic approaches in the optimization of the free parameters.

A similar evolution from heuristic approaches to probabilistic models is occurring in the field of complex networks, particularly in the problem of community detection (31). Topic models and community-detection methods have been developed largely independently from each other, with only a few papers pointing to their conceptual similarities (16, 32, 33). The idea of community detection is to find large-scale structure, that is, the identification of groups of nodes with similar connectivity patterns (31). This is motivated by the fact that these groups describe the heterogeneous nonrandom structure of the network and may correspond to functional units, giving potential insights into the generative mechanisms behind the network formation. While there is a variety of different approaches to community detection, most methods are heuristic and optimize a quality function, the most popular being modularity (34). Modularity suffers from severe conceptual deficiencies, such as its inability to assess statistical significance, leading to detection of groups in completely random networks (35), or its incapacity in finding groups below a given size (36). Methods such as modularity maximization are analogous to the pre-pLSI heuristic approaches to topic models, sharing many conceptual and practical deficiencies with them. In an effort to quench these problems, many researchers moved to probabilistic inference approaches, most notably those based on stochastic block models (SBMs) (32, 37, 38), mirroring the same trend that occurred in topic modeling.


Here, we propose and apply a unified framework to the fields of topic modeling and community detection. As illustrated in Fig. 1, by representing the word-document matrix as a bipartite network, the problem of inferring topics becomes a problem of inferring communities. Topic models and community-detection methods have been previously discussed as being part of mixed-membership models (39). However, this has remained a conceptual connection (16), and in practice, the two approaches are used to address different problems (32): the occurrence of words within and the links/citations between documents, respectively. In contrast, here, we develop a formal correspondence that builds on the mathematical equivalence between pLSI of texts and SBMs of networks (33) and that we use to adapt community-detection methods to perform topic modeling. In particular, we derive a nonparametric Bayesian parametrization of pLSI—adapted from a hierarchical SBM (hSBM) (40–42)—that makes fewer assumptions about the underlying structure of the data. As a consequence, it better matches the statistical properties of real texts and solves many of the intrinsic limitations of LDA. For example, we demonstrate the limitations induced by the Dirichlet priors by showing that LDA fails to infer topical structures that deviate from the Dirichlet assumption. We show that our model correctly infers these structures and thus leads to a better topic model than Dirichlet-based methods (such as LDA) in terms of model selection not only in various real corpora but also in artificial corpora generated from LDA itself. In addition, our nonparametric approach uncovers topical structures on many scales of resolution and automatically determines the number of topics, together with the word classification, and its symmetric formulation allows the documents themselves to be clustered into hierarchical categories.

The goal of our study is to introduce a unified approach to topic modeling and community detection, showing how ideas and methods can be transported between these two classes of problems. The benefit of this unified approach is illustrated by the derivation of an alternative to Dirichlet-based topic models, which is more principled in its theoretical foundation (making fewer assumptions about the data) and superior in practice according to model selection criteria.


RESULTS
Community detection for topic modeling
Here, we expose the connection between topic modeling and community detection, as illustrated in Fig. 2. We first revisit how a Bayesian formulation of pLSI assuming Dirichlet priors leads to LDA and how we can reinterpret the former as a mixed-membership SBM. We then use the latter to derive a more principled approach to topic modeling using nonparametric and hierarchical priors.

Topic models: pLSI and LDA
pLSI is a model that generates a corpus composed of D documents, where each document d has k_d words (4). Words are placed in the documents based on the topic mixtures assigned to both documents and words, from a total of K topics. More specifically, one iterates through all D documents; for each document d, one samples k_d ~ Poi(η_d), and for each word token l ∈ {1, …, k_d}, first, a topic r is chosen with probability θ_dr, and then, a word w is chosen from that topic with probability φ_rw. If n^r_dw is the number of occurrences of word w of topic r in document d (summarized as n), then the probability of a corpus is

    P(n | η, θ, φ) = \prod_d \eta_d^{k_d} e^{-\eta_d} \prod_{wr} \frac{(\phi_{rw}\,\theta_{dr})^{n^r_{dw}}}{n^r_{dw}!}    (1)

We denote matrices by boldface symbols, for example, θ = {θ_dr} with d = 1, …, D and r = 1, …, K, where θ_dr is an individual entry; thus, the notation θ_d refers to the vector {θ_dr} with fixed d and r = 1, …, K.
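To make Eq. 1 concrete, the log-likelihood can be evaluated directly once the labeled counts n^r_dw and the parameters are given. The following minimal Python sketch is illustrative only (it is not the paper's code, and it assumes a dense count array n[d, w, r]; names are hypothetical):

```python
import numpy as np
from scipy.special import gammaln

def plsi_log_likelihood(n, eta, theta, phi):
    """Log of Eq. 1 for labeled counts n[d, w, r] (tokens of word w with topic label r in document d).

    eta[d]      -- Poisson rate for the length of document d
    theta[d, r] -- topic mixture of document d
    phi[r, w]   -- word distribution of topic r
    Sketch only: dense arrays, illustrative names.
    """
    k_d = n.sum(axis=(1, 2))                                  # realized document lengths
    log_like = np.sum(k_d * np.log(eta) - eta)                # Poisson term for document lengths
    rates = theta[:, None, :] * phi.T[None, :, :]             # phi_rw * theta_dr, shape (D, V, K)
    # token term: sum_{dwr} [ n log(phi_rw theta_dr) - log n! ], skipping zero counts
    log_like += np.sum(n * np.log(np.where(n > 0, rates, 1.0)) - gammaln(n + 1))
    return log_like
```

For realistic corpora the count tensor is sparse, so an implementation would iterate over nonzero entries instead of dense arrays.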

Fig. 1. Two approaches to extract information from collections of texts. Topic models represent the texts as a document-word matrix (how often each word appears in each document), which is then written as a product of two matrices of smaller dimensions with the help of the latent variable topic. The approach we propose here represents texts as a network and infers communities in this network. The nodes consist of documents and words, and the strength of the edge between them is given by the number of occurrences of the word in the document, yielding a bipartite multigraph that is equivalent to the word-document matrix used in topic models.

Fig. 2. Parallelism between topic models and community detection methods. The pLSI and SBMs are mathematically equivalent, and therefore, methods from community detection (for example, the hSBM we propose in this study) can be used as alternatives to traditional topic models (for example, LDA).


For an unknown text, we could simply maximize Eq. 1 to obtain the best parameters η, θ, and φ, which describe the topical structure of the corpus. However, we cannot directly use this approach to model textual data without a significant danger of overfitting. The model has a large number of parameters that grows as the number of documents, words, and topics is increased, and hence, a maximum likelihood estimate will invariably incorporate a considerable amount of noise. One solution to this problem is to use a Bayesian formulation by proposing prior distributions to the parameters and integrating over them. This is precisely what is performed in LDA (5, 6), where one chooses Dirichlet priors D_d(θ_d | α_d) and D_r(φ_r | β_r) with hyperparameters α and β for the probabilities θ and φ above and one uses instead the marginal likelihood

    P(n | η, β, α) = \int P(n | η, θ, φ) \prod_d \mathcal{D}_d(\theta_d | \alpha_d) \prod_r \mathcal{D}_r(\phi_r | \beta_r) \, d\theta \, d\phi
                   = \prod_d \eta_d^{k_d} e^{-\eta_d} \prod_{wr} \frac{1}{n^r_{dw}!}
                     \times \prod_d \frac{\Gamma(\sum_r \alpha_{dr})}{\Gamma(k_d + \sum_r \alpha_{dr})} \prod_r \frac{\Gamma(\sum_w n^r_{dw} + \alpha_{dr})}{\Gamma(\alpha_{dr})}
                     \times \prod_r \frac{\Gamma(\sum_w \beta_{rw})}{\Gamma(\sum_{dw} n^r_{dw} + \sum_w \beta_{rw})} \prod_w \frac{\Gamma(\sum_d n^r_{dw} + \beta_{rw})}{\Gamma(\beta_{rw})}    (2)

If one makes a noninformative choice, that is, α_dr = 1 and β_rw = 1, then inference using Eq. 2 is nonparametric and less susceptible to overfitting. In particular, one can obtain the labeling of word tokens into topics, n^r_dw, conditioned only on the observed total frequencies of words in documents, ∑_r n^r_dw, in addition to the number of topics K itself, simply by maximizing or sampling from the posterior distribution.
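For illustration, the marginal likelihood of Eq. 2 can be evaluated with log-gamma functions once a labeling of the word tokens into topics is given. As above, this is a hedged sketch with a dense count array n[d, w, r] and illustrative names, not the implementation used in the paper:

```python
import numpy as np
from scipy.special import gammaln

def lda_marginal_log_likelihood(n, eta, alpha, beta):
    """Log of Eq. 2 for labeled counts n[d, w, r], with alpha[d, r], beta[r, w], eta[d]. Sketch only."""
    k_d = n.sum(axis=(1, 2))
    n_dr = n.sum(axis=1)                       # sum_w n^r_dw, shape (D, K)
    n_rw = n.sum(axis=0).T                     # sum_d n^r_dw, shape (K, V)
    log_like = np.sum(k_d * np.log(eta) - eta)         # Poisson document lengths
    log_like -= np.sum(gammaln(n + 1))                  # prod 1 / n^r_dw!
    # document-topic Dirichlet-multinomial term
    log_like += np.sum(gammaln(alpha.sum(axis=1)) - gammaln(k_d + alpha.sum(axis=1)))
    log_like += np.sum(gammaln(n_dr + alpha) - gammaln(alpha))
    # topic-word Dirichlet-multinomial term
    log_like += np.sum(gammaln(beta.sum(axis=1)) - gammaln(n_rw.sum(axis=1) + beta.sum(axis=1)))
    log_like += np.sum(gammaln(n_rw + beta) - gammaln(beta))
    return log_like
```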


The weakness of this approach lies in the fact that the Dirichlet prior is a simplistic assumption about the data-generating process: In its noninformative form, every mixture in the model—both of topics in each document as well as words into topics—is assumed to be equally likely, precluding the existence of any form of higher-order structure. This limitation has prompted the widespread practice of inferring LDA in a parametric manner by maximizing the likelihood with respect to the hyperparameters α and β, which can improve the quality of fit in many cases. But not only does this undermine to a large extent the initial purpose of a Bayesian approach—as the number of hyperparameters still increases with the number of documents, words, and topics, and hence maximizing over them reintroduces the danger of overfitting—but it also does not sufficiently address the original limitation of the Dirichlet prior. Namely, regardless of the hyperparameter choice, the Dirichlet distribution is unimodal, meaning that it generates mixtures that are either concentrated around the mean value or spread away uniformly from it toward pure components. This means that for any choice of α and β, the whole corpus is characterized by a single typical mixture of topics into documents and a single typical mixture of words into topics. This is an extreme level of assumed homogeneity, which stands in contradiction to a clustering approach initially designed to capture heterogeneity.

In addition to the above, the use of nonparametric Dirichlet priors is inconsistent with well-known universal statistical properties of real texts, most notably, the highly skewed distribution of word frequencies, which typically follows Zipf's law (15). In contrast, the noninformative choice of the Dirichlet distribution with hyperparameters β_rw = 1 amounts to an expected uniform frequency of words in topics and documents. Although choosing appropriate values of β_rw can address this disagreement, such an approach, as already mentioned, runs contrary to nonparametric inference and is subject to overfitting. In the following, we will show how one can recast the same original pLSI model as a network model that completely removes the limitations described above and is capable of uncovering heterogeneity in the data at multiple scales.

Topic models and community detection: Equivalence between pLSI and SBM
We show that pLSI is equivalent to a specific form of a mixed-membership SBM, as proposed by Ball et al. (33). The SBM is a model that generates a network composed of i = 1, …, N nodes with adjacency matrix A_ij, which we will assume without loss of generality to correspond to a multigraph, that is, A_ij ∈ ℕ. The nodes are placed in a partition composed of B overlapping groups, and the edges between nodes i and j are sampled from a Poisson distribution with average

    \sum_{rs} \kappa_{ir}\,\omega_{rs}\,\kappa_{js}    (3)

where ω_rs is the expected number of edges between group r and group s, and κ_ir is the probability that node i is sampled from group r. We can write the likelihood of observing Â = {A^rs_ij}, that is, a particular decomposition of A_ij into labeled half-edges (that is, edge end points) such that A_ij = ∑_rs A^rs_ij, as

    P(\hat{A} | κ, ω) = \prod_{i<j} \prod_{rs} \frac{e^{-\kappa_{ir}\omega_{rs}\kappa_{js}} (\kappa_{ir}\omega_{rs}\kappa_{js})^{A^{rs}_{ij}}}{A^{rs}_{ij}!}
                        \times \prod_i \prod_{rs} \frac{e^{-\kappa_{ir}\omega_{rs}\kappa_{is}/2} (\kappa_{ir}\omega_{rs}\kappa_{is}/2)^{A^{rs}_{ii}/2}}{(A^{rs}_{ii}/2)!}    (4)

by exploiting the fact that the sum of Poisson variables is also distributed according to a Poisson.

We can now make the connection to pLSI by rewriting the token probabilities in Eq. 1 in a symmetric fashion as

    \phi_{rw}\,\theta_{dr} = \eta_w\,\theta_{dr}\,\phi'_{wr}    (5)

where φ′_wr ≡ φ_rw / ∑_s φ_sw is the probability that the word w belongs to topic r, and η_w ≡ ∑_s φ_sw is the overall propensity with which the word w is chosen across all topics. In this manner, we can rewrite the likelihood of Eq. 1 as

    P(n | η, φ', θ) = \prod_{dwr} \frac{e^{-\lambda^r_{dw}} (\lambda^r_{dw})^{n^r_{dw}}}{n^r_{dw}!}    (6)

with λ^r_dw = η_d η_w θ_dr φ′_wr. If we choose to view the counts n_dw as the entries of the adjacency matrix of a bipartite multigraph with documents and words as nodes, the likelihood of Eq. 6 is equivalent to the likelihood of Eq. 4 of the SBM if we assume that each document belongs to its own specific group, κ_ir = δ_ir, with i = 1, …, D for document nodes, and by rewriting λ^r_dw = ω_dr κ_rw. Therefore, the SBM of Eq. 4 is a generalization of pLSI that allows the words and the documents to be clustered into groups and includes it as a special case when the documents are not clustered.
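This correspondence is what allows a corpus to be handed to a community-detection method: as in Fig. 1, every document and every word type becomes a node, and every word token becomes an edge. A minimal, self-contained sketch of that construction (illustrative only; real implementations would typically use a graph library):

```python
from collections import Counter

def corpus_to_bipartite(corpus):
    """Build the word-document multigraph as an edge list with multiplicities.

    corpus: list of documents, each a list of word tokens.
    Returns document node ids, a word -> node id map, and a dict {(doc_node, word_node): n_dw}.
    """
    word_index = {}
    edges = Counter()
    for d, tokens in enumerate(corpus):
        for w in tokens:
            if w not in word_index:
                word_index[w] = len(word_index)
            # documents occupy node ids 0..D-1, words occupy D, D+1, ...
            edges[(d, len(corpus) + word_index[w])] += 1
    doc_nodes = list(range(len(corpus)))
    word_nodes = {w: len(corpus) + i for w, i in word_index.items()}
    return doc_nodes, word_nodes, dict(edges)

# Example with three tiny documents
corpus = [["network", "topic", "model"], ["topic", "model", "model"], ["network", "community"]]
docs, words, n_dw = corpus_to_bipartite(corpus)
```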


In the symmetric setting of the SBM, we make no explicit distinction between words and documents, both of which become nodes in different partitions of a bipartite network. We base our Bayesian formulation that follows on this symmetric parametrization.

Community detection and the hSBM
Taking advantage of the above connection between pLSI and SBM, we show how we can extend the idea of hSBMs developed in (40–42) such that we can effectively use them for the inference of topical structure in texts. Like pLSI, the SBM likelihood of Eq. 4 contains a large number of parameters that grow with the number of groups and therefore cannot be used effectively without knowing the most appropriate dimension of the model beforehand. Analogously to what is carried out in LDA, we can address this by assuming noninformative priors for the parameters κ and ω and computing the marginal likelihood (for an explicit expression, see section S1.1)

    P(A | \bar{\omega}) = \int P(A | κ, ω) \, P(κ) \, P(ω | \bar{\omega}) \, dκ \, dω    (7)

where ω̄ is a global parameter determining the overall density of the network. We can use this to infer the labeled adjacency matrix {A^rs_ij}, as performed in LDA, with the difference that not only the words but also the documents would be clustered into mixed categories.

However, at this stage, the model still shares some disadvantages with LDA. In particular, the noninformative priors make unrealistic assumptions about the data, where the mixture between groups and the distribution of nodes into groups is expected to be unstructured. Among other problems, this leads to a practical obstacle, as this approach has a "resolution limit" where, at most, O(√N) groups can be inferred on a sparse network with N nodes (42, 43). In the following, we propose a qualitatively different approach to the choice of priors by replacing the noninformative approach with a deeper Bayesian hierarchy of priors and hyperpriors, which are agnostic about the higher-order properties of the data while maintaining the nonparametric nature of the approach. We begin by reformulating the above model as an equivalent microcanonical model (for a proof, see section S1.2) (42) such that we can write the marginal likelihood as the joint likelihood of the data and its discrete parameters

    P(A | \bar{\omega}) = P(A, k, e | \bar{\omega}) = P(A | k, e) \, P(k | e) \, P(e | \bar{\omega})    (8)

with

    P(A | k, e) = \frac{\prod_{r<s} e_{rs}! \, \prod_r e_{rr}!! \, \prod_{ir} k^r_i!}{\prod_{rs} \prod_{i<j} A^{rs}_{ij}! \, \prod_i A^{rs}_{ii}!! \, \prod_r e_r!}    (9)

    P(k | e) = \prod_r \left(\!\!\binom{N}{e_r}\!\!\right)^{-1}    (10)

    P(e | \bar{\omega}) = \prod_{r \le s} \frac{\bar{\omega}^{e_{rs}}}{(\bar{\omega}+1)^{e_{rs}+1}} = \frac{\bar{\omega}^E}{(\bar{\omega}+1)^{E + B(B+1)/2}}    (11)

where e_rs = ∑_ij A^rs_ij is the total number of edges between groups r and s (we used the shorthand e_r = ∑_s e_rs and k^r_i = ∑_js A^rs_ij), P(A | k, e) is the probability of a labeled graph A where the labeled degrees k and edge counts between groups e are constrained to specific values (and not their expectation values), P(k | e) is the uniform prior distribution of the labeled degrees constrained by the edge counts e, and P(e | ω̄) is the prior distribution of edge counts, given by a mixture of independent geometric distributions with average ω̄.

The main advantage of this alternative model formulation is that it allows us to remove the homogeneous assumptions by replacing the uniform priors P(k | e) and P(e | ω̄) by a hierarchy of priors and hyperpriors that incorporate the possibility of higher-order structures. We could achieve this in a tractable manner without the need of solving complicated integrals that would be required if introducing deeper Bayesian hierarchies in Eq. 7 directly.

In a first step, we follow the approach of (41) and condition the labeled degrees k on an overlapping partition b = {b_ir}, given by

    b_{ir} = \begin{cases} 1 & \text{if } k^r_i > 0 \\ 0 & \text{otherwise} \end{cases}    (12)

such that they are sampled by a distribution

    P(k | e) = P(k | e, b) \, P(b)    (13)

The labeled degree sequence is sampled conditioned on the frequency of degrees n^b_k inside each mixture b, which itself is sampled from its own noninformative prior

    P(k | e, b) = \prod_b P(k_b | n^b_k) \, P(n^b_k | e_b, b) \, P(e_b | e, b)    (14)

where e_b is the number of incident edges in each mixture (for detailed expressions, see section S1.3).

Because the frequencies of the mixtures and those of the labeled degrees are treated as latent variables, this model admits group mixtures that are far more heterogeneous than the Dirichlet prior used in LDA. In particular, as was shown in (42), the expected degrees generated in this manner follow a Bose-Einstein distribution, which is much broader than the exponential distribution obtained with the prior of Eq. 10. The asymptotic form of the degree likelihood will approach the true distribution as the prior washes out (42), making it more suitable for skewed empirical frequencies, such as Zipf's law or mixtures thereof (44), without requiring specific parameters—such as exponents—to be determined a priori.

In a second step, we follow (40, 42) and model the prior for the edge counts e between groups by interpreting it as an adjacency matrix itself, that is, a multigraph where the B groups are the nodes. We then proceed by generating it from another SBM, which, in turn, has its own partition into groups and matrix of edge counts. Continuing in the same manner yields a hierarchy of nested SBMs, where each level l = 1, …, L clusters the groups of the levels below. This yields a probability [see (42)] given by

    P(e | E) = \prod_{l=1}^{L} P(e_l | e_{l+1}, b_l) \, P(b_l)    (15)


with

    P(e_l | e_{l+1}, b_l) = \prod_{r<s} \left(\!\!\binom{n^l_r n^l_s}{e^{l+1}_{rs}}\!\!\right)^{-1} \prod_r \left(\!\!\binom{n^l_r (n^l_r + 1)/2}{e^{l+1}_{rr}/2}\!\!\right)^{-1}    (16)

    P(b_l) = \frac{\prod_r n^l_r!}{B_{l-1}!} \binom{B_{l-1} - 1}{B_l - 1}^{-1} \frac{1}{B_{l-1}}    (17)

where the index l refers to the variable of the SBM at a particular level; for example, n^l_r is the number of nodes in group r at level l. The use of this hierarchical prior is a strong departure from the noninformative assumption considered previously while containing it as a special case when the depth of the hierarchy is L = 1. It means that we expect some form of heterogeneity in the data at multiple scales, where groups of nodes are themselves grouped in larger groups, forming a hierarchy. Crucially, this removes the "unimodality" inherent in the LDA assumption, as the group mixtures are now modeled by another generative level, which admits as much heterogeneity as the original one. Furthermore, it can be shown to significantly alleviate the resolution limit of the noninformative approach, since it enables the detection of at most O(N/log N) groups in a sparse network with N nodes (40, 42).

Given the above model, we can find the best overlapping partitions of the nodes by maximizing the posterior distribution

    P(\{b_l\} | A) = \frac{P(A, \{b_l\})}{P(A)}    (18)

with

    P(A, \{b_l\}) = P(A | k, e_1, b_0) \, P(k | e_1, b_0) \, P(b_0) \prod_l P(e_l | e_{l+1}, b_l) \, P(b_l)    (19)

which can be efficiently inferred using Markov chain Monte Carlo, as described in (41, 42). The nonparametric nature of the model makes it possible to infer (i) the depth of the hierarchy (containing the "flat" model in case the data do not support a hierarchical structure) and (ii) the number of groups for both documents and words directly from the posterior distribution, without the need for extrinsic methods or supervised approaches to prevent overfitting. We can see the latter by interpreting Eq. 19 as a description length (see discussion after Eq. 22).
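In practice, this posterior optimization is what the SBM routines of the graph-tool library (used for the numerical implementations in Materials and Methods) carry out. The sketch below is an assumption-laden illustration rather than the authors' pipeline: the exact keyword arguments differ between graph-tool versions, and the overlapping, bipartite variants of the hSBM used in this work require additional options not shown here.

```python
import graph_tool.all as gt  # assumes graph-tool is installed

def fit_hsbm(edges, num_nodes):
    """Fit a nested (hierarchical) SBM to a word-document multigraph.

    edges: iterable of (doc_node, word_node, count) triples, node ids in [0, num_nodes).
    Sketch only -- see the graph-tool documentation for the options that reproduce
    the overlapping, bipartite hSBM described in the text.
    """
    g = gt.Graph(directed=False)
    g.add_vertex(num_nodes)
    for u, v, count in edges:
        for _ in range(count):                  # one parallel edge per word token
            g.add_edge(u, v)
    # minimize the description length of the nested SBM (cf. Eq. 19)
    state = gt.minimize_nested_blockmodel_dl(g)
    print("description length (nats):", state.entropy())
    return state
```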
The model above generates arbitrary multigraphs, whereas text is represented as a bipartite network of words and documents. Since the latter is a special case of the former, where words and documents belong to distinct groups, we can use the model as it is, as it will "learn" the bipartite structure during inference. However, a more consistent approach for text is to include this information in the prior, since we should not have to infer what we already know. We can perform this via a simple modification of the model, where one replaces the prior for the overlapping partition appearing in Eq. 13 by

    P(b) = P_w(b_w) \, P_d(b_d)    (20)

where P_w(b_w) and P_d(b_d) now correspond to a disjoint overlapping partition of the words and documents, respectively. Likewise, the same must be carried out at the upper levels of the hierarchy by replacing Eq. 17 with

    P(b_l) = P_w(b^w_l) \, P_d(b^d_l)    (21)

In this manner, by construction, words and documents will never be placed together in the same group.

Comparing LDA and hSBM in real and artificial data
Here, we show that the theoretical considerations discussed in the previous section are relevant in practice. We show that hSBM constitutes a better model than LDA in three classes of problems. First, we construct simple examples that show that LDA fails in cases of non-Dirichlet topic mixtures, while hSBM is able to infer both Dirichlet and non-Dirichlet mixtures. Second, we show that hSBM outperforms LDA even in artificial corpora drawn from the generative process of LDA. Third, we consider five different real corpora. We perform statistical model selection based on the principle of minimum description length (45) by computing the description length Σ (the smaller the better) of each model (for details, see the "Minimum description length" section in Materials and Methods).

Failure of LDA in the case of non-Dirichlet mixtures
The choice of the Dirichlet distribution as a prior for the topic mixtures θ_d implies that the ensemble of topic mixtures P(θ_d) is assumed to be either unimodal or concentrated at the edges of the simplex. This is an undesired feature of this prior because there is no reason why data should show these characteristics. To explore how this affects the inference of LDA, we construct a set of simple examples with K = 3 topics, which allow for easy visualization. Besides real data, we consider synthetic data constructed from the generative process of LDA [in which case P(θ_d) follows a Dirichlet distribution] and from cases in which the Dirichlet assumption is violated [for example, by superimposing two Dirichlet mixtures, resulting in a bimodal instead of a unimodal P(θ_d)].

The results summarized in Fig. 3 show that SBM leads to better results than LDA. In Dirichlet-generated data (Fig. 3A), LDA self-consistently identifies the distribution of mixtures correctly. The SBM is also able to correctly identify the Dirichlet mixture, although we did not explicitly specify Dirichlet priors. In the non-Dirichlet synthetic data (Fig. 3B), the SBM results again closely match the true topic mixtures, but LDA completely fails. Although the result inferred by LDA no longer resembles the Dirichlet distribution after being influenced by data, it is significantly distorted by the unsuitable prior assumptions. Turning to real data (Fig. 3C), LDA and SBM yield very different results. While the "true" underlying topic mixture of each document is unknown in this case, we can identify the negative consequence of the Dirichlet priors from the fact that the results from LDA are again similar to the ones expected from a Dirichlet distribution (thus, likely an artifact), while the SBM results suggest a much richer pattern.

Together, the results of this simple example visually show that LDA not only struggles to infer non-Dirichlet mixtures but also shows strong biases in the inference toward Dirichlet-type mixtures. On the other hand, SBM is able to capture a much richer spectrum of topic mixtures due to its nonparametric formulation. This is a direct consequence of the choice of priors: While LDA assumes a priori that the ensemble of topic mixtures, P(θ_d), follows a Dirichlet distribution, SBM is more agnostic with respect to the type of mixtures while retaining its nonparametric formulation.
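For reference, the non-Dirichlet ensembles of Fig. 3B can be generated by superimposing two Dirichlet distributions, for example, with the hyperparameters of the right panel. A minimal sketch (the exact generation pipeline used for the figure may differ):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_bimodal_mixtures(num_docs, alphas=((10.0, 20.0, 70.0), (10.0, 70.0, 20.0))):
    """Sample topic mixtures theta_d from a 50/50 superposition of two Dirichlet distributions.

    The defaults correspond to alpha_d in {100 x (0.1, 0.2, 0.7), 100 x (0.1, 0.7, 0.2)},
    giving a bimodal ensemble P(theta_d) on the K = 3 simplex that violates the LDA prior.
    """
    theta = np.empty((num_docs, 3))
    for d in range(num_docs):
        a = alphas[rng.integers(len(alphas))]   # pick one of the two modes uniformly
        theta[d] = rng.dirichlet(a)
    return theta

theta = sample_bimodal_mixtures(1000)
```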


Fig. 3. LDA is unable to infer non-Dirichlet topic mixtures. Visualization of the distribution of topic mixtures log P(θ_d) for different synthetic and real data sets in the two-simplex using K = 3 topics. We show the true distribution in the case of the synthetic data (top) and the distributions inferred by LDA (middle) and SBM (bottom). (A) Synthetic data sets with Dirichlet mixtures from the generative process of LDA with document hyperparameters α_d = 0.01 × (1/3, 1/3, 1/3) (left) and α_d = 100 × (1/3, 1/3, 1/3) (right) leading to different true mixture distributions log P(θ_d). We fix the word hyperparameter β_rw = 0.01, D = 1000 documents, V = 100 different words, and text length k_d = 1000. (B) Synthetic data sets with non-Dirichlet mixtures from a combination of two Dirichlet mixtures, respectively: α_d ∈ {100 × (1/3, 1/3, 1/3), 100 × (0.1, 0.8, 0.1)} (left) and α_d ∈ {100 × (0.1, 0.2, 0.7), 100 × (0.1, 0.7, 0.2)} (right). (C) Real data sets with unknown topic mixtures: Reuters (left) and Web of Science (right), each containing D = 1000 documents. For LDA, we use hyperparameter optimization. For SBM, we use an overlapping, non-nested parametrization in which each document belongs to its own group such that B = D + K, allowing for an unambiguous interpretation of the group membership as topic mixtures in the framework of topic models.

Artificial corpora sampled from LDA
We consider artificial corpora constructed from the generative process of LDA, incorporating some aspects of real texts (for details, see the "Artificial corpora" section in Materials and Methods and section S2.1). Although LDA is not a good model for real corpora (as the Dirichlet assumption is not realistic), it serves to illustrate that even in a situation that favors LDA, the hSBM frequently provides a better description of the data.

From the generative process, we know the true latent variable of each word token. Therefore, we are able to obtain the inferred topical structure from each method by simply assigning the true labels without using approximate numerical optimization methods for the inference. This allows us to separate intrinsic properties of the model itself from external properties related to the numerical implementation.

To allow for a fair comparison between hSBM and LDA, we consider two different choices in the inference of each method, respectively. LDA requires the specification of a set of hyperparameters α and β used in the inference. While, in this particular case, we know the true hyperparameters that generated the corpus, in general, these are unknown. Therefore, in addition to the true values, we also consider a noninformative choice, that is, α_dr = 1 and β_rw = 1. For the inference with hSBM, we only use the special case where the hierarchy has a single level such that the prior is noninformative. We consider two different parametrizations of the SBM: (i) Each document is assigned to its own group, that is, documents are not clustered, and (ii) different documents can belong to the same group, that is, they are clustered. While the former is motivated by the original correspondence between pLSI and SBM, the latter shows the additional advantage offered by the possibility of clustering documents due to its symmetric treatment of words and documents in a bipartite network (for details, see section S2.2).

In Fig. 4A, we show that hSBM is consistently better than LDA for synthetic corpora of almost any text length k_d = m ranging over four orders of magnitude. These results hold for asymptotically large corpora (in terms of the number of documents), as shown in Fig. 4B, where we observe that the normalized description length of each model converges to a fixed value when increasing the size of the corpus. We confirm that these results hold across a wide range of parameter settings varying the number of topics, as well as the values and base measures of the hyperparameters (section S3 and figs. S1 to S3).

The LDA description length Σ_LDA does not depend strongly on the considered prior (true or noninformative) as the size of the corpora increases (Fig. 4B). This is consistent with the typical expectation that in the limit of large data, the prior washes out. However, note that for smaller corpora, the Σ of the noninformative prior is significantly worse than the Σ of the true prior.


Fig. 4. Comparison between LDA and SBM for artificial corpora drawn from LDA. Description length Σ of LDA and hSBM for an artificial corpus drawn from the generative process of LDA with K = 10 topics. (A) Difference in Σ, ΔΣ = Σ_i − Σ_LDA-trueprior, compared to the LDA with true priors—the model that generated the data—as a function of the text length k_d = m and D = 10^6 documents. (B) Normalized Σ (per word) as a function of the number of documents D for fixed text length k_d = m = 128. The four curves correspond to different choices in the parametrization of the topic models: (i) LDA with noninformative (noninf) priors (light blue, ×), (ii) LDA with true priors, that is, the hyperparameters used to generate the artificial corpus (dark blue, •), (iii) hSBM without clustering of documents (light orange, ▲), and (iv) hSBM with clustering of documents (dark orange, ▼).

In contrast, the hSBM provides much shorter description lengths than LDA for the same data when allowing documents to be clustered as well. The only exception is for very small texts (m < 10 tokens), where we have not converged to the asymptotic limit in the per-word description length. In the limit D → ∞, we expect hSBM to provide a similarly good or better model than LDA for all text lengths. The improvement of the hSBM over LDA in an LDA-generated corpus is counterintuitive because, for sufficient data, we expect the true model to provide a better description for it. However, for a model such as LDA, the limit of sufficient data involves the simultaneous scaling of the number of documents, words, and topics to very high values. In particular, the generative process of LDA requires a large number of documents to resolve the underlying Dirichlet distribution of the topic-document distribution and a large number of topics to resolve the underlying word-topic distribution. While the former is realized by growing the corpus by adding documents, the latter aspect is nontrivial because the observed size of the vocabulary V is not a free parameter but is determined by the word-frequency distribution and the size of the corpus through the so-called Heaps' law (14). This means that, as we grow the corpus by adding more and more documents, initially, the vocabulary increases linearly and only at very large corpora does it settle into an asymptotic sublinear growth (section S4 and fig. S4). This, in turn, requires an ever larger number of topics to resolve the underlying word-topic distribution. This large number of topics is not feasible in practice because it renders obsolete the whole goal and concept of topic models, which is to compress the information by obtaining an effective, coarse-grained description of the corpus at a manageable number of topics.

In summary, the limits in which LDA provides a better description, that is, either extremely small texts or a very large number of topics, are irrelevant in practice. The observed limitations of LDA are due to the following reasons: (i) The finite number of topics used to generate the data always leads to an undersampling of the Dirichlet distributions, and (ii) LDA is redundant in the way it describes the data in this sparse regime. In contrast, the assumptions of the hSBM are better suited for this sparse regime and hence lead to a more compact description of the data, despite the fact that the corpora were generated by LDA.

Real corpora
We compare LDA and SBM for a variety of different data sets, as shown in Table 1 (for details, see the "Data sets for real corpora" and "Numerical implementations" sections in Materials and Methods). When using LDA, we consider both noninformative priors and fitted hyperparameters for a wide range of numbers of topics. We obtain systematically smaller values for the description length using the hSBM. For real corpora, the difference is exacerbated by the fact that the hSBM is capable of clustering documents, capitalizing on a source of structure in the data that is completely unavailable to LDA.

As our examples also show, LDA cannot be used in a direct manner to choose the number of topics, as the noninformative choice systematically underfits (Σ_LDA increases monotonically with the number of topics) and the parametric approach systematically overfits (Σ_LDA decreases monotonically with the number of topics). In practice, users are required to resort to heuristics (46, 47) or to more complicated inference approaches based on the computation of the model evidence, which not only are numerically expensive but can only be performed under onerous approximations (6, 22). In contrast, the hSBM is capable of extracting the appropriate number of topics directly from its posterior distribution while simultaneously avoiding both under- and overfitting (40, 42).

Table 1. hSBM outperforms LDA in real corpora. Each row corresponds to a different data set (for details, see the "Data sets for real corpora" section in Materials and Methods). We provide basic statistics of each data set in the column "Corpus." The models are compared on the basis of their description length Σ (see Eq. 22); the smallest Σ for each corpus (always Σ_hSBM) indicates the best model. Results for LDA with noninformative and fitted hyperparameters are shown in the columns "Σ_LDA" and "Σ_LDA (hyperfit)" for different numbers of topics K ∈ {10, 50, 100, 500}. Results for the hSBM are shown in the column "Σ_hSBM" and the inferred number of groups (documents and words) in "hSBM groups."

Corpus (Doc. / Words / Word tokens) | Σ_LDA (K = 10 / 50 / 100 / 500) | Σ_LDA hyperfit (K = 10 / 50 / 100 / 500) | Σ_hSBM | hSBM groups (Doc. / Words)
Twitter (10,000 / 12,258 / 196,625) | 1,231,104 / 1,648,195 / 1,960,947 / 2,558,940 | 1,040,987 / 1,041,106 / 1,037,678 / 1,057,956 | 963,260 | 365 / 359
Reuters (1000 / 8692 / 117,661) | 498,194 / 593,893 / 669,723 / 922,984 | 463,660 / 477,645 / 481,098 / 496,645 | 341,199 | 54 / 55
Web of Science (1000 / 11,198 / 126,313) | 530,519 / 666,447 / 760,114 / 1,056,554 | 531,893 / 555,727 / 560,455 / 571,291 | 426,529 | 16 / 18
New York Times (1000 / 32,415 / 335,749) | 1,658,815 / 1,673,333 / 2,178,439 / 2,977,931 | 1,658,815 / 1,673,333 / 1,686,495 / 1,725,057 | 1,448,631 | 124 / 125
PLOS ONE (1000 / 68,188 / 5,172,908) | 10,637,464 / 10,964,312 / 11,145,531 / 13,180,803 | 10,358,157 / 10,140,244 / 10,033,886 / 9,348,149 | 8,475,866 | 897 / 972


In addition to these formal aspects, we argue that the hierarchical nature of the hSBM and the fact that it clusters words and documents make it more useful in interpreting text. We illustrate this with a case study in the next section.

Case study: Application of hSBM to Wikipedia articles
We illustrate the results of the inference with the hSBM for articles taken from the English Wikipedia in Fig. 5, showing the hierarchical clustering of documents and words. To make the visualization clearer, we focus on a small network created from only three scientific disciplines: chemical physics (21 articles), experimental physics (24 articles), and computational biology (18 articles). For clarity, we only consider words that appear more than once so that we end up with a network of 63 document nodes, 3140 word nodes, and 39,704 edges.

The hSBM splits the network into groups on different levels, organized as a hierarchical tree. Note that the number of groups and the number of levels were not specified beforehand but automatically detected in the inference. On the highest level, hSBM reflects the bipartite structure into word and document nodes, as is imposed in our model.

In contrast to traditional topic models such as LDA, hSBM automatically clusters documents into groups. While we considered articles from three different categories (one category from biology and two categories from physics), the second level in the hierarchy separates documents into only two groups corresponding to articles about biology (for example, bioinformatics or K-mer) and articles on physics (for example, rotating wave approximation or molecular beam). For lower levels, articles become separated into a larger number of groups; for example, one group contains two articles on Euler's and Newton's laws of motion, respectively.

For words, the second level in the hierarchy splits nodes into three separate groups. We find that two groups represent words belonging to physics (for example, beam, formula, or energy) and biology (assembly, folding, or protein), while the third group represents function words (the, of, or a). We find that the latter group's words show a close-to-random distribution across documents by calculating the dissemination coefficient (right side of Fig. 5; see caption for definition). Furthermore, the median dissemination of the other groups is substantially less random, with the exception of one subgroup (containing and, for, or which). This suggests a more data-driven approach to dealing with function words in topic models. The standard practice is to remove words from a manually curated list of stopwords; however, recent results question the efficacy of these methods (48). In contrast, the hSBM is able to automatically identify groups of stopwords, potentially rendering these heuristic interventions unnecessary.

Fig. 5. Inference of hSBM for articles from Wikipedia. Articles from three categories (chemical physics, experimental physics, and computational biology). The first hierarchical level reflects the bipartite nature of the network with document nodes (left) and word nodes (right). The grouping on the second hierarchical level is indicated by solid lines. We show examples of nodes that belong to each group on the third hierarchical level (indicated by dotted lines): For word nodes, we show the five most frequent words; for document nodes, we show three (or fewer) randomly selected articles. For each word, we calculate the dissemination coefficient U_D, which quantifies how unevenly words are distributed among documents (60): U_D = 1 indicates the expected dissemination from a random null model; the smaller U_D (0 < U_D < 1), the more unevenly a word is distributed. We show the 5th, 25th, 50th, 75th, and 95th percentiles for each group of word nodes on the third level of the hierarchy. Intl. Soc. for Comp. Biol., International Society for Computational Biology; RRKM theory, Rice-Ramsperger-Kassel-Marcus theory.


DISCUSSION
The underlying equivalence between pLSI and the overlapping version of the SBM means that the "bag-of-words" formulation of topical corpora is mathematically equivalent to bipartite networks of words and documents with modular structures. From this, we were able to formulate a topic model based on the hSBM in a fully Bayesian framework, alleviating some of the most serious conceptual deficiencies in current approaches to topic modeling such as LDA. In particular, the model formulation is nonparametric, and model complexity aspects, such as the number of topics, can be inferred directly from the model's posterior distribution.


Furthermore, the model is based on a hierarchical clustering of both words and documents, in contrast to LDA, which is based on a nonhierarchical clustering of the words alone. This enables the identification of structural patterns in text that are unavailable to LDA while, at the same time, allowing for the identification of patterns at multiple scales of resolution.

We have shown that hSBM constitutes a better topic model compared to LDA not only for a diverse set of real corpora but also for artificial corpora generated from LDA itself. It is capable of providing better compression—as a measure of the quality of fit—and a richer interpretation of the data. Moreover, the hSBM offers an alternative to the Dirichlet priors used in virtually any variation of current approaches to topic modeling. While motivated by their computational convenience, Dirichlet priors do not reflect prior knowledge compatible with the actual usage of language. Our analysis suggests that Dirichlet priors introduce severe biases into the inference result, which, in turn, markedly hinder its performance in the event of even slight deviations from the Dirichlet assumption. In contrast, our work shows how to formulate and incorporate different (and, as we have shown, more suitable) priors in a fully Bayesian framework, which are completely agnostic to the type of inferred mixtures. Furthermore, it also serves as a working example that efficient numerical implementations of non-Dirichlet topic models are feasible and can be applied in practice to large collections of documents.

More generally, our results show how we can apply the same mathematical ideas to two extremely popular and mostly disconnected problems: the inference of topics in corpora and of communities in networks. We used this connection to obtain improved topic models, but there are many additional theoretical results in community detection that should be explored in the topic model context, for example, fundamental limits to inference such as the undetectable-detectable phase transition (49) or the analogy to Potts-like spin systems in statistical physics (50). Furthermore, this connection allows the many extensions of the SBM, such as multilayer (51) and annotated (52, 53) versions, to be readily used for topic modeling of richer text including hyperlinks, citations between documents, etc. Conversely, the field of topic modeling has long adopted a Bayesian perspective to inference, which, until now, has not seen widespread use in community detection. Thus, insights from topic modeling about either the formulation of suitable priors or the approximation of posterior distributions might catalyze the development of improved statistical methods to detect communities in networks. Furthermore, the traditional application of topic models in the analysis of texts leads to classes of networks usually not considered by community detection algorithms. The word-document network is bipartite (words-documents), the topics/communities can be overlapping, and the number of links (word tokens) and nodes (word types) are connected to each other through Heaps' law. In particular, the latter aspect results in dense networks, which have been largely overlooked by the networks community (54). Thus, topic models might provide additional insights into how to approach these networks, as it remains unclear how these properties affect the inference of communities in word-document networks. More generally, Heaps' law constitutes only one of numerous statistical laws in language (14), such as the well-known Zipf's law (15). While these regularities are well studied empirically, few attempts have been made to incorporate them explicitly as prior knowledge, for example, by formulating generative processes that lead to Zipf's law (27, 28). Our results show that the SBM provides a flexible approach to deal with Zipf's law, which constitutes a challenge to state-of-the-art topic models such as LDA. Zipf's law also appears in genetic codes (55) and images (26), two prominent fields in which LDA-type models have been extensively applied (12, 29), suggesting that the block-model approach we introduce here is also promising beyond text analysis.


MATERIALS AND METHODS
Minimum description length
We compared both models based on the description length Σ, where smaller values indicate a better model (45). We obtained Σ for LDA from Eq. 2 and Σ for hSBM from Eq. 19 as

    \Sigma_{\mathrm{LDA}} = -\ln P(n | η, β, α) P(η)    (22)

    \Sigma_{\mathrm{hSBM}} = -\ln P(A, \{b_l\})    (23)

We noted that Σ_LDA is conditioned on the hyperparameters β and α and, therefore, is exact for noninformative priors (α_dr = 1 and β_rw = 1) only. Otherwise, Eq. 22 is only a lower bound for Σ_LDA because it lacks the terms involving hyperpriors for β and α. For simplicity, we ignored this correction in our analysis, and therefore, we favored LDA. The motivation for this approach was twofold.

On the one hand, it offers a well-founded approach to unsupervised model selection within the framework of information theory, as it corresponds to the amount of information necessary to simultaneously describe (i) the data when the model parameters are known and (ii) the parameters themselves. As the complexity of the model increases, the former will typically decrease, as it fits more closely to the data, while at the same time, it is compensated by an increase of the latter term, which serves as a penalty that prevents overfitting. In addition, given data and two models M1 and M2 with description lengths Σ_M1 and Σ_M2, we could relate the difference ΔΣ ≡ Σ_M1 − Σ_M2 to the Bayes factor (56). The latter quantifies how much more likely one model is compared to the other given the data

    \mathrm{BF} \equiv \frac{P(M_1 | \mathrm{data})}{P(M_2 | \mathrm{data})} = \frac{P(\mathrm{data} | M_1) P(M_1)}{P(\mathrm{data} | M_2) P(M_2)} = e^{-\Delta\Sigma}    (24)

where we assumed that each model is a priori equally likely, that is, P(M_1) = P(M_2).

On the other hand, the description length allows for a straightforward model comparison without the introduction of confounding factors. Commonly used supervised model selection approaches, such as perplexity, require additional approximation techniques (22), which are not readily applicable to the microcanonical formulation of the SBM. It is thus not clear whether any difference in predictive power would result from the model and its inference or from the approximation used in the calculation of perplexity. Furthermore, we noted that it was shown recently that supervised approaches based on the held-out likelihood of missing edges tend to overfit in key cases, failing to select the most parsimonious model, unlike unsupervised approaches, which are more robust (57).
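As a worked illustration of Eq. 24, the description lengths reported in Table 1 translate directly into (log) Bayes factors; for example (values taken from the Reuters row):

```python
def log_bayes_factor(sigma_m1, sigma_m2):
    """ln BF of model M1 against M2 from their description lengths (Eq. 24); positive favors M1."""
    return -(sigma_m1 - sigma_m2)

# hSBM versus the best LDA fit (hyperfit, K = 10) on the Reuters corpus of Table 1
print(log_bayes_factor(341199, 463660))   # ~1.2e5 nats in favor of the hSBM
```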
Artificial corpora
For the construction of the artificial corpora, we fixed the parameters in the generative process of LDA, that is, the number of topics K, the hyperparameters α and β, and the length of individual articles m. The α (β) hyperparameters determine the distribution of topics (words) in each document (topic).


The generative process of LDA can be described in the following way. For each topic r ∈ {1, …, K}, we sampled a distribution over words φ_r from a V-dimensional Dirichlet distribution with parameters β_rw for w ∈ {1, …, V}. For each document d ∈ {1, …, D}, we sampled a topic mixture θ_d from a K-dimensional Dirichlet distribution with parameters α_dr for r ∈ {1, …, K}. For each word position l_d ∈ {1, …, k_d} (k_d is the length of document d), we first sampled a topic r* = r_{l_d} from a multinomial with parameters θ_d and then sampled a word w from a multinomial with parameters φ_{r*}.

We assumed a parametrization in which (i) each document has the same topic-document hyperparameter, that is, α_dr = α_r for d ∈ {1, …, D}, and (ii) each topic has the same word-topic hyperparameter, that is, β_rw = β_w for r ∈ {1, …, K}. We fixed the average probability of occurrence of a topic, p_r (word, p_w), by introducing scalar hyperparameters α (β), that is, α_dr = αK p_r for r ∈ {1, …, K} [β_rw = βV p_w for w = 1, …, V]. In our case, we chose (i) equiprobable topics, that is, p_r = 1/K, and (ii) empirically measured word frequencies from the Wikipedia corpus, that is, p_w = p_w^emp with w = 1, …, 95,129, yielding a Zipfian distribution (section S5 and fig. S5), shown to be universally described by a double power law (44).
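A minimal sketch of this generative process (illustrative parameter values; the empirical Wikipedia word frequencies of the actual study would be passed in as p_w):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_lda_corpus(D, V, K, m, alpha=1.0, beta=0.01, p_r=None, p_w=None):
    """Generate an artificial corpus from the LDA generative process.

    Following the parametrization in the text, the Dirichlet parameters are
    alpha*K*p_r per document and beta*V*p_w per topic; p_r and p_w default to uniform.
    Every document has fixed length m. Returns a list of D documents of word indices.
    """
    p_r = np.full(K, 1.0 / K) if p_r is None else p_r
    p_w = np.full(V, 1.0 / V) if p_w is None else p_w
    phi = rng.dirichlet(beta * V * p_w, size=K)      # word distribution of each topic
    corpus = []
    for d in range(D):
        theta_d = rng.dirichlet(alpha * K * p_r)     # topic mixture of document d
        topics = rng.choice(K, size=m, p=theta_d)    # one topic per word position
        words = [rng.choice(V, p=phi[r]) for r in topics]
        corpus.append(words)
    return corpus

corpus = generate_lda_corpus(D=100, V=1000, K=10, m=128)
```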
Data sets for real corpora
For the comparison of hSBM and LDA, we considered different data sets of written texts varying in genre, time of origin, average text length, number of documents, and language, as well as data sets used in previous works on topic models, for example, (5, 16, 58, 59):
(1) "Twitter," a sample of Twitter messages obtained from www.nltk.org/nltk_data/;
(2) "Reuters," a collection of documents from the Reuters financial newswire service denoted as "Reuters-21578, Distribution 1.0," obtained from www.nltk.org/nltk_data/;
(3) "Web of Science," abstracts from physics papers published in the year 2000;
(4) "New York Times," a collection of newspaper articles obtained from http://archive.ics.uci.edu/ml;
(5) "PLOS ONE," full text of all scientific articles published in 2011 in the journal PLOS ONE obtained via the PLOS API (http://api.plos.org/).
In all cases, we considered a random subset of the documents, as detailed in Table 1. For the New York Times data, we did not use any additional filtering since the data were already provided in the form of prefiltered word counts. For the other data sets, we used the following filtering: (i) We decapitalized all words, (ii) we replaced punctuation and special characters (for example, ".", ",", or "/") by blank spaces so that we could define a word as any substring between two blank spaces, and (iii) we kept only those words that consisted of the letters a to z.
18. W. Li, A. McCallum, Pachinko allocation: DAG-structured mixture models of topic
correlations, in Proceedings of the 23rd International Conference on Machine Learning
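This filtering can be expressed compactly in Python. The sketch below is our own illustration of the three rules stated above, not the scripts used in the paper; the function name and the regular expressions are assumptions.

import re

def filter_text(raw_text):
    # (i) decapitalize all words
    text = raw_text.lower()
    # (ii) replace punctuation and special characters by blank spaces;
    #      a word is then any substring between two blank spaces
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    # (iii) keep only words consisting of the letters a to z
    return [w for w in tokens if re.fullmatch(r"[a-z]+", w)]

print(filter_text("Topic models, e.g. LDA, were proposed in 2003."))
# ['topic', 'models', 'e', 'g', 'lda', 'were', 'proposed', 'in']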
Numerical implementations
For inference with LDA, we used the package mallet (http://mallet.cs.umass.edu/). The algorithm for inference with the hSBM shown in this work was implemented in C++ as part of the graph-tool Python library (https://graph-tool.skewed.de). We provided code on how to use hSBM for topic modeling in a GitHub repository (https://github.com/martingerlach/hSBM_Topicmodel).
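As a rough illustration of how such an analysis can be set up, the sketch below builds the bipartite word-document network and runs nested SBM inference with graph-tool. It is a minimal outline under the assumption of a recent graph-tool version, not the exact pipeline of the released repository; in particular, constraints that keep document nodes and word nodes in separate groups are omitted here, exact function options may differ across versions, and filter_text refers to the preprocessing sketch above.

import graph_tool.all as gt

def fit_hsbm(docs):
    # Fit a hierarchical SBM to the bipartite word-document network;
    # `docs` is a list of tokenized documents (lists of words).
    g = gt.Graph(directed=False)
    name = g.new_vertex_property("string")
    kind = g.new_vertex_property("int")   # 0 = document node, 1 = word node
    g.vp["name"], g.vp["kind"] = name, kind
    word_vertex = {}
    for i_d, doc in enumerate(docs):
        v_d = g.add_vertex()
        name[v_d], kind[v_d] = f"doc_{i_d}", 0
        for w in doc:
            if w not in word_vertex:
                v_w = g.add_vertex()
                name[v_w], kind[v_w] = w, 1
                word_vertex[w] = v_w
            g.add_edge(v_d, word_vertex[w])   # one edge per word token
    # Nested (hierarchical) SBM inference by minimizing the description length
    state = gt.minimize_nested_blockmodel_dl(g)
    return g, state

# g, state = fit_hsbm([filter_text(t) for t in raw_texts])
# state.print_summary()   # hierarchy of groups of documents and words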
SUPPLEMENTARY MATERIALS
Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/4/7/eaaq1360/DC1
Section S1. Marginal likelihood of the SBM
Section S2. Artificial corpora drawn from LDA
Section S3. Varying the hyperparameters and number of topics
Section S4. Word-document networks are not sparse
Section S5. Empirical word-frequency distribution
Fig. S1. Varying the hyperparameters α and β in the comparison between LDA and SBM for artificial corpora drawn from LDA.
Fig. S2. Varying the number of topics K in the comparison between LDA and SBM for artificial corpora drawn from LDA.
Fig. S3. Varying the base measure of the hyperparameters α and β in the comparison between LDA and SBM for artificial corpora drawn from LDA.
Fig. S4. Word-document networks are not sparse.
Fig. S5. Empirical rank-frequency distribution.
Reference (61)

REFERENCES AND NOTES
1. D. M. Blei, Probabilistic topic models. Commun. ACM 55, 77–84 (2012).
2. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. Harshman, Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 391–407 (1990).
3. Z. Ghahramani, Probabilistic machine learning and artificial intelligence. Nature 521, 452–459 (2015).
4. T. Hofmann, Probabilistic latent semantic indexing, in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’99), Berkeley, CA, 15 to 19 August 1999, pp. 50–57.
5. D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003).
6. T. L. Griffiths, M. Steyvers, Finding scientific topics. Proc. Natl. Acad. Sci. U.S.A. 101, 5228–5235 (2004).
7. C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval (Cambridge Univ. Press, 2008).
8. K. W. Boyack, D. Newman, R. J. Duhon, R. Klavans, M. Patek, J. R. Biberstine, B. Schijvenaars, A. Skupin, N. Ma, K. Börner, Clustering more than two million biomedical publications: Comparing the accuracies of nine text-based similarity approaches. PLOS ONE 6, e18029 (2011).
9. D. S. McNamara, Computational methods to extract meaning from text and advance theories of human cognition. Top. Cogn. Sci. 3, 3–17 (2011).
10. J. Grimmer, B. M. Stewart, Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Polit. Anal. 21, 267–297 (2013).
11. B. Liu, L. Liu, A. Tsykin, G. J. Goodall, J. E. Green, M. Zhu, C. H. Kim, J. Li, Identifying functional miRNA–mRNA regulatory modules with correspondence latent Dirichlet allocation. Bioinformatics 26, 3105–3111 (2010).
12. J. K. Pritchard, M. J. Stephens, P. J. Donnelly, Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).
13. L. Fei-Fei, P. Perona, A Bayesian hierarchical model for learning natural scene categories, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2005 (CVPR’05), San Diego, CA, 20 to 25 June 2005, vol. 2, pp. 524–531.
14. E. G. Altmann, M. Gerlach, Statistical laws in linguistics, in Creativity and Universality in Language, M. Degli Esposti, E. G. Altmann, F. Pachet, Eds. (Springer, 2016), pp. 7–26.
15. G. K. Zipf, The Psycho-Biology of Language (Routledge, 1936).
16. A. Lancichinetti, M. I. Sirer, J. X. Wang, D. Acuna, K. Körding, L. A. N. Amaral, A high-reproducibility and high-accuracy method for automated topic classification. Phys. Rev. X 5, 011007 (2015).
17. T. L. Griffiths, M. Steyvers, D. M. Blei, J. B. Tenenbaum, Integrating topics and syntax, in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, L. Bottou, Eds. (MIT Press, 2005), pp. 537–544.
18. W. Li, A. McCallum, Pachinko allocation: DAG-structured mixture models of topic correlations, in Proceedings of the 23rd International Conference on Machine Learning (ICML’06), Pittsburgh, PA, 25 to 29 June 2006, pp. 577–584.
19. M. Rosen-Zvi, T. L. Griffiths, M. Steyvers, P. Smyth, The author-topic model for authors and documents, in Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI’04), Banff, Canada, 7 to 11 July 2004, pp. 487–494.
20. G. Doyle, C. Elkan, Accounting for burstiness in topic models, in Proceedings of the 26th Annual International Conference on Machine Learning (ICML’09), Montreal, Canada, 14 to 18 June 2009, pp. 281–288.
21. W. Zhao, J. J. Chen, R. Perkins, Z. Liu, W. Ge, Y. Ding, W. Zou, A heuristic approach to determine an appropriate number of topics in topic modeling. BMC Bioinformatics 16, S8 (2015).
22. H. M. Wallach, I. Murray, R. Salakhutdinov, D. Mimno, Evaluation methods for topic models, in Proceedings of the 26th Annual International Conference on Machine Learning (ICML’09), Montreal, Canada, 14 to 18 June 2009, pp. 1105–1112.
23. Y. W. Teh, M. I. Jordan, M. J. Beal, D. M. Blei, Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101, 1566–1581 (2006).
24. D. M. Blei, T. L. Griffiths, M. I. Jordan, The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM 57, 7 (2010).
25. J. Paisley, C. Wang, D. M. Blei, M. I. Jordan, Nested hierarchical Dirichlet processes. IEEE Trans. Pattern Anal. Mach. Intell. 37, 256–270 (2015).
26. E. B. Sudderth, M. I. Jordan, Shared segmentation of natural scenes using dependent Pitman-Yor processes, in Advances in Neural Information Processing Systems 21 (NIPS 2008), D. Koller, D. Schuurmans, Y. Bengio, L. Bottou, Eds. (Curran Associates Inc., 2009), pp. 1585–1592.
27. I. Sato, H. Nakagawa, Topic models with power-law using Pitman-Yor process, in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’10), Washington, DC, 25 to 28 July 2010, pp. 673–682.
28. W. L. Buntine, S. Mishra, Experiments with non-parametric topic models, in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’14), New York, NY, 24 to 27 August 2014, pp. 881–890.
29. T. Broderick, L. Mackey, J. Paisley, M. I. Jordan, Combinatorial clustering and the beta negative binomial process. IEEE Trans. Pattern Anal. Mach. Intell. 37, 290–306 (2015).
30. M. Zhou, L. Carin, Negative binomial process count and mixture modeling. IEEE Trans. Pattern Anal. Mach. Intell. 37, 307–320 (2015).
31. S. Fortunato, Community detection in graphs. Phys. Rep. 486, 75–174 (2010).
32. E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. P. Xing, Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9, 1981–2014 (2008).
33. B. Ball, B. Karrer, M. E. J. Newman, Efficient and principled method for detecting communities in networks. Phys. Rev. E 84, 036103 (2011).
34. M. E. J. Newman, M. Girvan, Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004).
35. R. Guimerà, M. Sales-Pardo, L. A. N. Amaral, Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E 70, 025101 (2004).
36. A. Lancichinetti, S. Fortunato, Limits of modularity maximization in community detection. Phys. Rev. E 84, 066122 (2011).
37. P. W. Holland, K. B. Laskey, S. Leinhardt, Stochastic blockmodels: First steps. Soc. Networks 5, 109–137 (1983).
38. B. Karrer, M. E. J. Newman, Stochastic blockmodels and community structure in networks. Phys. Rev. E 83, 016107 (2011).
39. E. M. Airoldi, D. M. Blei, E. A. Erosheva, S. E. Fienberg, Eds., Handbook of Mixed Membership Models and Their Applications (CRC Press, 2014).
40. T. P. Peixoto, Hierarchical block structures and high-resolution model selection in large networks. Phys. Rev. X 4, 011047 (2014).
41. T. P. Peixoto, Model selection and hypothesis testing for large-scale network models with overlapping groups. Phys. Rev. X 5, 011033 (2015).
42. T. P. Peixoto, Nonparametric Bayesian inference of the microcanonical stochastic block model. Phys. Rev. E 95, 012317 (2017).
43. T. P. Peixoto, Parsimonious module inference in large networks. Phys. Rev. Lett. 110, 148701 (2013).
44. M. Gerlach, E. G. Altmann, Stochastic model for the vocabulary growth in natural languages. Phys. Rev. X 3, 021006 (2013).
45. J. Rissanen, Modeling by shortest data description. Automatica 14, 465–471 (1978).
46. R. Arun, V. Suresh, C. E. V. Madhavan, M. N. N. Murthy, On finding the natural number of topics with latent Dirichlet allocation: Some observations, in Advances in Knowledge Discovery and Data Mining, M. J. Zaki, J. X. Yu, B. Ravindran, V. Pudi, Eds. (Springer, 2010), pp. 391–402.
47. J. Cao, T. Xia, J. Li, Y. Zhang, S. Tang, A density-based method for adaptive LDA model selection. Neurocomputing 72, 1775–1781 (2009).
48. A. Schofield, M. Magnusson, D. Mimno, Pulling out the stops: Rethinking stopword removal for topic models, in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3 to 7 April 2017, vol. 2, pp. 432–436.
49. A. Decelle, F. Krzakala, C. Moore, L. Zdeborová, Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett. 107, 065701 (2011).
50. D. Hu, P. Ronhovde, Z. Nussinov, Phase transitions in random Potts systems and the community detection problem: Spin-glass type and dynamic perspectives. Philos. Mag. 92, 406–445 (2012).
51. T. P. Peixoto, Inferring the mesoscale structure of layered, edge-valued, and time-varying networks. Phys. Rev. E 92, 042807 (2015).
52. M. E. J. Newman, A. Clauset, Structure and inference in annotated networks. Nat. Commun. 7, 11863 (2016).
53. D. Hric, T. P. Peixoto, S. Fortunato, Network structure, metadata, and the prediction of missing nodes and annotations. Phys. Rev. X 6, 031038 (2016).
54. O. T. Courtney, G. Bianconi, Dense power-law networks and simplicial complexes. Phys. Rev. E 97, 052303 (2018).
55. R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, H. E. Stanley, Linguistic features of noncoding DNA sequences. Phys. Rev. Lett. 73, 3169–3172 (1994).
56. R. E. Kass, A. E. Raftery, Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).
57. T. Vallès-Català, T. P. Peixoto, R. Guimerà, M. Sales-Pardo, Consistencies and inconsistencies between model selection and link prediction in networks. Phys. Rev. E 97, 026316 (2018).
58. H. M. Wallach, D. M. Mimno, A. McCallum, Rethinking LDA: Why priors matter, in Advances in Neural Information Processing Systems 22 (NIPS 2009), Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, A. Culotta, Eds. (Curran Associates Inc., 2009), pp. 1973–1981.
59. A. Asuncion, M. Welling, P. Smyth, Y. W. Teh, On smoothing and inference for topic models, in Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI’09), Montreal, Canada, 18 to 21 June 2009, pp. 27–34.
60. E. G. Altmann, J. B. Pierrehumbert, A. E. Motter, Niche as a determinant of word fate in online groups. PLOS ONE 6, e19009 (2011).
61. M. Gerlach, thesis, Technical University Dresden, Dresden, Germany (2016).

Acknowledgments: We thank M. Palzenberger for the help with the Web of Science data. E.G.A. thanks L. Azizi and W. L. Buntine for the helpful discussions. Author contributions: M.G., T.P.P., and E.G.A. designed the research. M.G., T.P.P., and E.G.A. performed the research. M.G. and T.P.P. analyzed the data. M.G., T.P.P., and E.G.A. wrote the manuscript. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. Additional data related to this paper may be requested from the authors.

Submitted 5 October 2017
Accepted 5 June 2018
Published 18 July 2018
10.1126/sciadv.aaq1360

Citation: M. Gerlach, T. P. Peixoto, E. G. Altmann, A network approach to topic models. Sci. Adv. 4, eaaq1360 (2018).
