Here, we propose and apply a unified framework to the fields of topic modeling and community detection. As illustrated in Fig. 1, by representing the word-document matrix as a bipartite network, the problem of inferring topics becomes a problem of inferring communities. Topic models and community-detection methods have been previously discussed as being part of mixed-membership models (39). However, this has remained a conceptual connection (16), and in practice, the two approaches are used to address different problems (32): the occurrence of words within and the links/citations between documents, respectively. In contrast, here, we develop a formal correspondence that builds on the mathematical equivalence between pLSI of texts and SBMs of networks (33) and that we use to adapt community-detection methods to perform topic modeling. In particular, we derive a nonparametric Bayesian parametrization of pLSI—adapted from a hierarchical SBM (hSBM) (40–42)—that makes fewer assumptions about the underlying structure of the data. As a consequence, it better matches the statistical properties of real texts and solves many of the intrinsic limitations of LDA. For example, we demonstrate the limitations induced by the Dirichlet priors by showing that LDA fails to infer topical structures that deviate from the Dirichlet assumption. We show that our model correctly infers these structures and thus leads to a better topic model.

In this unified framework, ideas and methods can be transported between these two classes of problems. The benefit of this unified approach is illustrated by the derivation of an alternative to Dirichlet-based topic models, which is more principled in its theoretical foundation (making fewer assumptions about the data) and superior in practice according to model selection criteria.

RESULTS
Community detection for topic modeling
Here, we expose the connection between topic modeling and community detection, as illustrated in Fig. 2. We first revisit how a Bayesian formulation of pLSI assuming Dirichlet priors leads to LDA and how we can reinterpret the former as a mixed membership SBM. We then use the latter to derive a more principled approach to topic modeling using nonparametric and hierarchical priors.

Topic models: pLSI and LDA
pLSI is a model that generates a corpus composed of D documents, where each document d has k_d words (4). Words are placed in the documents based on the topic mixtures assigned to both documents and words, from a total of K topics. More specifically, one iterates through all D documents; for each document d, one samples k_d
For an unknown text, we could simply maximize Eq. 1 to obtain the best parameters η, θ, and φ, which describe the topical structure of the corpus. However, we cannot directly use this approach to model textual data without a significant danger of overfitting. The model has a large number of parameters that grows as the number of documents, words, and topics is increased, and hence, a maximum likelihood estimate will invariably incorporate a considerable amount of noise.

One solution to this problem is to use a Bayesian formulation by proposing prior distributions for the parameters and integrating over them. This is precisely what is performed in LDA (5, 6), where one chooses Dirichlet priors D_d(θ_d|α_d) and D_r(φ_r|β_r) with hyperparameters α and β for the probabilities θ and φ above and one uses instead the marginal likelihood

P(n \mid \eta, \beta, \alpha) = \int P(n \mid \eta, \theta, \phi) \prod_d D_d(\theta_d \mid \alpha_d) \prod_r D_r(\phi_r \mid \beta_r) \, d\theta \, d\phi \qquad (2)

uniformly from it toward pure components. This means that for any choice of α and β, the whole corpus is characterized by a single typical mixture of topics into documents and a single typical mixture of words into topics. This is an extreme level of assumed homogeneity, which stands in contradiction to a clustering approach initially designed to capture heterogeneity.

In addition to the above, the use of nonparametric Dirichlet priors is inconsistent with well-known universal statistical properties of real texts, most notably, the highly skewed distribution of word frequencies, which typically follows Zipf's law (15). In contrast, the noninformative choice of the Dirichlet distribution with hyperparameters β_rw = 1 amounts to an expected uniform frequency of words in topics and documents. Although choosing appropriate values of β_rw can address this disagreement, such an approach, as already mentioned, runs contrary to nonparametric inference and is subject to overfitting. In the following, we will show how one can recast the same original pLSI model as a network model that completely removes the limitations described above and is capable of uncovering heterogeneity in the data at multiple scales.
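To make this mismatch concrete, the following minimal numpy sketch (not from the paper; vocabulary size and sample count are illustrative) contrasts the essentially flat word frequencies implied by a noninformative Dirichlet prior with a Zipf-like rank-frequency curve.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 1000  # vocabulary size (illustrative choice, not from the paper)

# Word-topic distributions under the noninformative Dirichlet prior beta_rw = 1:
# every point of the simplex is equally likely, so the expected frequency of
# each word within a topic is simply 1/V.
phi_flat = rng.dirichlet(np.ones(V), size=100)        # 100 sampled topics
flat_sorted = np.sort(phi_flat.mean(axis=0))[::-1]    # close to uniform, ~1/V

# Zipf's law: empirical word frequencies decay roughly as 1/rank.
zipf = 1.0 / np.arange(1, V + 1)
zipf /= zipf.sum()

# The two rank-frequency curves differ by orders of magnitude in their tails.
print(flat_sorted[:3], flat_sorted[-3:])  # all close to 1/V = 1e-3
print(zipf[:3], zipf[-3:])                # spans roughly three orders of magnitude
```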
Topic models and community detection: Equivalence between pLSI and SBM
We show that pLSI is equivalent to a specific form of a mixed-membership SBM, as proposed by Ball et al. (33). The SBM is a model that generates a network composed of i = 1,…, N nodes with adjacency matrix A_ij, which we will assume without loss of generality to correspond to a multigraph, that is, A_ij ∈ ℕ. The nodes are placed in a partition composed of B overlapping groups, and the edges between nodes i and j are sampled from a Poisson distribution with average

\sum_{rs} \kappa_{ir}\, \omega_{rs}\, \kappa_{js} \qquad (3)
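For concreteness, the following numpy sketch samples a multigraph from this Poisson parametrization; the group propensities κ and the group-to-group rates ω are arbitrary illustrative choices, not values from the paper, and self-loops are dropped for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
N, B = 200, 4                                  # illustrative sizes

# kappa[i, r]: propensity of node i to connect through group r (Eq. 3).
kappa = rng.gamma(shape=0.5, scale=1.0, size=(N, B))
# omega[r, s]: rate of edges between groups r and s (assortative toy choice).
omega = 0.02 * (np.eye(B) + 0.05)

# Expected number of edges between i and j: lambda_ij = sum_rs kappa_ir omega_rs kappa_js.
lam = kappa @ omega @ kappa.T

# Multigraph adjacency: each entry is Poisson-distributed with that mean;
# we keep a symmetric matrix with an empty diagonal as simple bookkeeping.
A = rng.poisson(lam)
A = np.triu(A, 1) + np.triu(A, 1).T
print(A.sum() // 2, "edges sampled")
```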
In this formulation, the pLSI likelihood takes the form

P(n \mid \eta, \phi', \theta) = \prod_{dwr} \frac{\left(\lambda^r_{dw}\right)^{n^r_{dw}} e^{-\lambda^r_{dw}}}{n^r_{dw}!} \qquad (6)

with λ^r_dw = η_d η_w θ_dr φ′_wr. If we choose to view the counts n_dw as the entries of the adjacency matrix of a bipartite multigraph with documents and words as nodes, the likelihood of Eq. 6 is equivalent to the likelihood of Eq. 4 of the SBM if we assume that each document belongs to its own specific group, κ_ir = δ_ir, with i = 1,…, D for document nodes, and by rewriting λ^r_dw = ω_dr κ_rw. Therefore, the SBM of Eq. 4 is a generalization of pLSI that allows the words and the documents to be clustered into groups and includes it as a special case when the documents are not clustered.
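The correspondence is easy to make concrete: the sketch below (toy corpus, not from the paper) builds the bipartite multigraph in which each word token contributes one edge between its document and its word type, so that the edge multiplicities are exactly the counts n_dw.

```python
from collections import Counter

# Toy corpus (illustrative). In the formulation above, n_dw is the number of
# occurrences of word w in document d, i.e., the multiplicity of the edge (d, w)
# in a bipartite multigraph with documents on one side and words on the other.
corpus = [
    "network community detection network",
    "topic model topic inference",
    "community structure network inference",
]

edges = []  # one (document index, word) entry per edge, i.e., per word token
for d, doc in enumerate(corpus):
    for w, n_dw in Counter(doc.split()).items():
        edges.extend([(d, w)] * n_dw)   # n_dw parallel edges between d and w

print(len(edges), "edges (word tokens) in the bipartite multigraph")
print(sorted(set(w for _, w in edges)))  # the word nodes
```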
In the symmetric setting of the SBM, we make no explicit distinction between words and documents, both of which become nodes in different partitions of a bipartite network. We base our Bayesian formulation that follows on this symmetric parametrization.

Community detection and the hSBM
Taking advantage of the above connection between pLSI and SBM, we show how we can extend the idea of hSBMs developed in (40–42) such that we can effectively use them for the inference of topical structure in texts. Like pLSI, the SBM likelihood of Eq. 4 contains a large number of parameters that grow with the number of groups and therefore cannot be used effectively without knowing the most appropriate dimension of the model beforehand. Analogously to what is carried out in LDA, we can address this by assuming noninformative priors for the parameters κ and ω and computing the marginal likelihood (for an explicit expression, see section S1.1)

P(A \mid \bar{\omega}) = \int P(A \mid \kappa, \omega)\, P(\kappa)\, P(\omega \mid \bar{\omega}) \, d\kappa \, d\omega \qquad (7)

where ω̄ is a global parameter determining the overall density of the network. We can use this to infer the labeled adjacency matrix {A^rs_ij}, with

P(A \mid k, e) = \frac{\prod_{r<s} e_{rs}!\, \prod_r e_{rr}!!\, \prod_{ir} k^r_i!}{\prod_{rs}\prod_{i<j} A^{rs}_{ij}!\, \prod_i A^{rs}_{ii}!!\, \prod_r e_r!} \qquad (9)

P(k \mid e) = \prod_r \left(\!\!\binom{N}{e_r}\!\!\right)^{-1} \qquad (10)

P(e \mid \bar{\omega}) = \prod_{r \le s} \frac{\bar{\omega}^{\,e_{rs}}}{(\bar{\omega}+1)^{e_{rs}+1}} = \frac{\bar{\omega}^{\,E}}{(\bar{\omega}+1)^{E+B(B+1)/2}} \qquad (11)

where e_rs = Σ_ij A^rs_ij is the total number of edges between groups r and s (we used the shorthand e_r = Σ_s e_rs and k^r_i = Σ_js A^rs_ij), P(A|k, e) is the probability of a labeled graph A where the labeled degrees k and edge counts between groups e are constrained to specific values (and not their expectation values), P(k|e) is the uniform prior distribution of the labeled degrees constrained by the edge counts e [the double parentheses in Eq. 10 denote the multiset coefficient, counting the ways of distributing the e_r degree endpoints among the N nodes], and P(e|ω̄) is the prior distribution of edge counts, given by a mixture of independent geometric distributions with average ω̄.
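The geometric form quoted above can be verified directly. Assuming an exponential prior with mean ω̄ on each rate ω_rs (the excerpt does not state this prior explicitly, but it is the standard choice that yields geometric marginals), the marginal probability of each edge count is

\begin{aligned}
P(e_{rs}\mid\bar{\omega})
  &= \int_0^{\infty} \frac{\omega^{e_{rs}} e^{-\omega}}{e_{rs}!}\;\frac{1}{\bar{\omega}}\,e^{-\omega/\bar{\omega}} \, d\omega \\
  &= \frac{1}{\bar{\omega}\, e_{rs}!}\int_0^{\infty} \omega^{e_{rs}}\, e^{-\omega(1+1/\bar{\omega})}\,d\omega
   = \frac{1}{\bar{\omega}}\left(\frac{\bar{\omega}}{\bar{\omega}+1}\right)^{e_{rs}+1}
   = \frac{\bar{\omega}^{\,e_{rs}}}{(\bar{\omega}+1)^{e_{rs}+1}},
\end{aligned}

which is a geometric distribution with mean ω̄; taking the product over the B(B+1)/2 pairs r ≤ s reproduces Eq. 11.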
The main advantage of this alternative model formulation is that it allows us to remove the homogeneous assumptions by replacing the uniform priors P(k|e) and P(e|ω̄) by a hierarchy of priors and hyperpriors that incorporate the possibility of higher-order structures. We could achieve this in a tractable manner without the need of solving complicated integrals that would be required if introducing deeper Bayesian hierarchies in Eq. 7 directly.

In a first step, we follow the approach of (41) and condition the labeled degrees k on an overlapping partition b = {b_ir}, given by

b_{ir} = \begin{cases} 1 & \text{if } k^r_i > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (12)

such that they are sampled by a distribution

empirical frequencies, such as Zipf's law or mixtures thereof (44), without requiring specific parameters—such as exponents—to be determined a priori.

In a second step, we follow (40, 42) and model the prior for the edge counts e between groups by interpreting it as an adjacency matrix itself, that is, a multigraph where the B groups are the nodes. We then proceed by generating it from another SBM, which, in turn, has its own partition into groups and matrix of edge counts. Continuing in the same manner yields a hierarchy of nested SBMs, where each level l = 1,…, L clusters the groups of the levels below. This yields a probability [see (42)] given by

P(e \mid E) = \prod_{l=1}^{L} P(e_l \mid e_{l+1}, b_l)\, P(b_l) \qquad (15)

with

P(e_l \mid e_{l+1}, b_l) = \prod_{r<s} \left(\!\!\binom{n^l_r n^l_s}{e^{l+1}_{rs}}\!\!\right)^{-1} \prod_r \left(\!\!\binom{n^l_r (n^l_r+1)/2}{e^{l+1}_{rr}/2}\!\!\right)^{-1} \qquad (16)

P(b_l) = \frac{\prod_r n^l_r!}{B_{l-1}!} \binom{B_{l-1}-1}{B_l-1}^{-1} \frac{1}{B_{l-1}} \qquad (17)

where the index l refers to the variable of the SBM at a particular level; for example, n^l_r is the number of nodes in group r at level l.

The use of this hierarchical prior is a strong departure from the noninformative assumption considered previously while containing it as a special case when the depth of the hierarchy is L = 1. It means that we expect some form of heterogeneity in the data at multiple scales, where groups of nodes are themselves grouped in larger groups, forming a hierarchy. Crucially, this removes the "unimodality" inherent in the LDA assumption, as the group mixtures are now modeled by another generative level, which admits as much heterogeneity as the original one. Furthermore, it can be shown to significantly alleviate the resolution limit.

Since the network is bipartite, word and document nodes are never mixed within groups at the first level of the hierarchy; the same must be carried out at the upper levels of the hierarchy by replacing Eq. 17 with

P(b_l) = P_w(b^w_l)\, P_d(b^d_l) \qquad (21)

In this manner, by construction, words and documents will never be placed together in the same group.
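In practice, a model of this family can be fit with the nested SBM implementation in the graph-tool library. The following is a minimal sketch only: it uses the non-overlapping nested SBM and does not impose the bipartite word/document constraint or the overlapping degree parametrization described above, both of which require additional state arguments whose names depend on the graph-tool version. The variable `edge_list` (integer pairs, with documents and words mapped to distinct vertex indices) is assumed to have been built beforehand.

```python
import graph_tool.all as gt

# Build the bipartite word-document multigraph: one vertex per document and
# per word, one edge per word token (parallel edges encode the counts n_dw).
g = gt.Graph(directed=False)
g.add_edge_list(edge_list)  # assumed: list of (doc_index, word_index) pairs

# Fit a nested (hierarchical) SBM by minimizing the description length.
# The number of groups and the depth of the hierarchy are inferred, not fixed.
state = gt.minimize_nested_blockmodel_dl(g)

state.print_summary()   # group counts at each level of the inferred hierarchy
print(state.entropy())  # description length of the fit (smaller is better)
```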
Comparing LDA and hSBM in real and artificial data
Here, we show that the theoretical considerations discussed in the previous section are relevant in practice. We show that hSBM constitutes a better model than LDA in three classes of problems. First, we construct simple examples that show that LDA fails in cases of non-Dirichlet topic mixtures, while hSBM is able to infer both Dirichlet and non-Dirichlet mixtures. Second, we show that hSBM outperforms LDA even in artificial corpora drawn from the generative process of LDA. Third, we consider five different real corpora. We perform statistical model selection based on the principle of minimum description length (45) by computing the description length Σ (the smaller the better) of each model (for details, see "Minimum description length" section in Materials and Methods).
Fig. 3. LDA is unable to infer non-Dirichlet topic mixtures. Visualization of the distribution of topic mixtures log P(θ_d) for different synthetic and real data sets in the two-simplex using K = 3 topics. We show the true distribution in the case of the synthetic data (top) and the distributions inferred by LDA (middle) and SBM (bottom). (A) Synthetic data sets with Dirichlet mixtures from the generative process of LDA with document hyperparameters α_d = 0.01 × (1/3, 1/3, 1/3) (left) and α_d = 100 × (1/3, 1/3, 1/3) (right), leading to different true mixture distributions log P(θ_d). We fix the word hyperparameter β_rw = 0.01, D = 1000 documents, V = 100 different words, and text length k_d = 1000. (B) Synthetic data sets with non-Dirichlet mixtures from a combination of two Dirichlet mixtures, respectively: α_d ∈ {100 × (1/3, 1/3, 1/3), 100 ×
Artificial corpora sampled from LDA
We consider artificial corpora constructed from the generative process of LDA, incorporating some aspects of real texts (for details, see "Artificial corpora" section in Materials and Methods and section S2.1). Although LDA is not a good model for real corpora (as the Dirichlet assumption is not realistic), it serves to illustrate that even in a situation that favors LDA, the hSBM frequently provides a better description of the data.

From the generative process, we know the true latent variable of each word token. Therefore, we are able to obtain the inferred topical structure from each method by simply assigning the true labels without using approximate numerical optimization methods for the inference. This allows us to separate intrinsic properties of the model itself from external properties related to the numerical implementation.

To allow for a fair comparison between hSBM and LDA, we consider two different choices in the inference of each method, respectively. LDA requires the specification of a set of hyperparameters α and β used in the inference. While, in this particular case, we know the true hyperparameters that generated the corpus, in general, these are unknown. Therefore, in addition to the true values, we also consider a noninformative choice, that is, α_dr = 1 and β_rw = 1. For the inference with hSBM, we only use the special case where the hierarchy has a single level such that the prior is noninformative. We consider two different parametrizations of the SBM: (i) Each document is assigned to its own group, that is, they are not clustered, and (ii) different documents can belong to the same group, that is, they are clustered. While the former is motivated by the original correspondence between pLSI and SBM, the latter shows the additional advantage offered by the possibility of clustering documents due to its symmetric treatment of words and documents in a bipartite network (for details, see section S2.2).

In Fig. 4A, we show that hSBM is consistently better than LDA for synthetic corpora of almost any text length k_d = m ranging over four orders of magnitude. These results hold for asymptotically large corpora (in terms of the number of documents), as shown in Fig. 4B, where we observe that the normalized description length of each model converges to a fixed value when increasing the size of the corpus. We confirm that these results hold across a wide range of parameter settings varying the number of topics, as well as the values and base measures of the hyperparameters (section S3 and figs. S1 to S3).

The LDA description length Σ_LDA does not depend strongly on the considered prior (true or noninformative) as the size of the corpora increases (Fig. 4B). This is consistent with the typical expectation that in the limit of large data, the prior washes out. However, note that for smaller corpora, the Σ of the noninformative prior is significantly worse than the Σ of the true prior.

In contrast, the hSBM provides much shorter description lengths than LDA for the same data when allowing documents to be clustered as well. The only exception is for very small texts (m < 10 tokens), where we have not converged to the asymptotic limit in the per-word description length. In the limit D → ∞, we expect hSBM to provide a similarly good or better model than LDA for all text lengths. The improvement of the hSBM over LDA in an LDA-generated corpus is counterintuitive because, for sufficient data, we expect the true model to provide a better description for it. However, for a model such as LDA, the limit of sufficient data involves the simultaneous scaling of the number of documents, words, and topics to very high values. In particular, the generative process of LDA requires a large number of documents to resolve the underlying Dirichlet distribution of the topic-document distribution and a large number of topics to resolve the underlying word-topic distribution. While the former is realized by growing the corpus by adding documents, the latter aspect is nontrivial because the observed size of the vocabulary V is not a free parameter but is determined by the word-frequency distribution and the size of the corpus through the so-called Heaps' law (14). This means that, as we grow the corpus by adding more and more documents, initially, the vocabulary increases linearly and only at very large corpora does it settle into an asymptotic sublinear growth
(section S4 and fig. S4). This, in turn, requires an ever larger number of
topics to resolve the underlying word-topic distribution. This large
number of topics is not feasible in practice because it renders obsolete the whole goal and concept of topic models, which is to compress the information by obtaining an effective, coarse-grained description of the corpus at a manageable number of topics.
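The vocabulary growth just described can be measured directly; the sketch below (illustrative, not the paper's code) tracks the number of distinct words V as a function of the number of word tokens N while documents are added.

```python
def heaps_curve(documents):
    """Vocabulary size V(N) versus number of word tokens N,
    measured while growing the corpus one document at a time."""
    seen = set()
    n_tokens = 0
    curve = []
    for doc in documents:
        tokens = doc.split()
        n_tokens += len(tokens)
        seen.update(tokens)
        curve.append((n_tokens, len(seen)))
    return curve

# Illustrative usage with the toy corpus from the earlier sketch:
# for N, V in heaps_curve(corpus): print(N, V)
# On real corpora, V initially grows almost linearly with N and only becomes
# sublinear for very large corpora, as discussed above.
```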
In summary, the limits in which LDA provides a better description,
that is, either extremely small texts or a very large number of topics, are
irrelevant in practice. The observed limitations of LDA are due to the
following reasons: (i) The finite number of topics used to generate the
data always leads to an undersampling of the Dirichlet distributions,
and (ii) LDA is redundant in the way it describes the data in this sparse
regime. In contrast, the assumptions of the hSBM are better suited for
this sparse regime and hence lead to a more compact description of the
data, despite the fact that the corpora were generated by LDA.
Real corpora
We compare LDA and SBM for a variety of different data sets, as shown
in Table 1 (for details, see “Data sets for real corpora” or “Numerical
implementations” section in Materials and Methods). When using
LDA, we consider both noninformative priors and fitted hyperparam-
eters for a wide range of numbers of topics. We obtain systematically
Table 1. hSBM outperforms LDA in real corpora. Each row corresponds to a different data set (for details, see "Data sets for real corpora" section in Materials and Methods). We provide basic statistics of each data set under "Corpus." The models are compared on the basis of their description length Σ (see Eq. 22); the smallest Σ for each corpus (here always Σ_hSBM) indicates the best model. Results for LDA with noninformative and fitted hyperparameters are given under "Σ_LDA" and "Σ_LDA (hyperfit)" for numbers of topics K ∈ {10, 50, 100, 500}. Results for the hSBM are given under "Σ_hSBM", together with the inferred numbers of groups (documents and words) under "hSBM groups." For each corpus, Σ_LDA and Σ_LDA (hyperfit) are listed for K = 10 / 50 / 100 / 500.

Twitter: 10,000 docs; 12,258 words; 196,625 word tokens.
  Σ_LDA: 1,231,104 / 1,648,195 / 1,960,947 / 2,558,940
  Σ_LDA (hyperfit): 1,040,987 / 1,041,106 / 1,037,678 / 1,057,956
  Σ_hSBM: 963,260; hSBM groups (docs/words): 365 / 359

Reuters: 1000 docs; 8692 words; 117,661 word tokens.
  Σ_LDA: 498,194 / 593,893 / 669,723 / 922,984
  Σ_LDA (hyperfit): 463,660 / 477,645 / 481,098 / 496,645
  Σ_hSBM: 341,199; hSBM groups (docs/words): 54 / 55

Web of Science: 1000 docs; 11,198 words; 126,313 word tokens.
  Σ_LDA: 530,519 / 666,447 / 760,114 / 1,056,554
  Σ_LDA (hyperfit): 531,893 / 555,727 / 560,455 / 571,291
  Σ_hSBM: 426,529; hSBM groups (docs/words): 16 / 18

New York Times: 1000 docs; 32,415 words; 335,749 word tokens.
  Σ_LDA: 1,658,815 / 1,673,333 / 2,178,439 / 2,977,931
  Σ_LDA (hyperfit): 1,658,815 / 1,673,333 / 1,686,495 / 1,725,057
  Σ_hSBM: 1,448,631; hSBM groups (docs/words): 124 / 125

PLOS ONE: 1000 docs; 68,188 words; 5,172,908 word tokens.
  Σ_LDA: 10,637,464 / 10,964,312 / 11,145,531 / 13,180,803
  Σ_LDA (hyperfit): 10,358,157 / 10,140,244 / 10,033,886 / 9,348,149
  Σ_hSBM: 8,475,866; hSBM groups (docs/words): 897 / 972
make it more useful in interpreting text. We illustrate this with a case study in the next section.

Case study: Application of hSBM to Wikipedia articles
We illustrate the results of the inference with the hSBM for articles taken from the English Wikipedia in Fig. 5, showing the hierarchical clustering of documents and words. To make the visualization clearer, we focus on a small network created from only three scientific disciplines: chemical physics (21 articles), experimental physics (24 articles), and computational biology (18 articles). For clarity, we only consider words that appear more than once so that we end up with a network of 63 document nodes, 3140 word nodes, and 39,704 edges.

The hSBM splits the network into groups on different levels, organized as a hierarchical tree. Note that the number of groups and the number of levels were not specified beforehand but automatically detected in the inference. On the highest level, hSBM reflects the bipartite structure into word and document nodes, as is imposed in our model.

In contrast to traditional topic models such as LDA, hSBM automatically clusters documents into groups. While we considered articles from three different categories (one category from biology and two

For words, the second level in the hierarchy splits nodes into three separate groups. We find that two groups represent words belonging to physics (for example, beam, formula, or energy) and biology (assembly, folding, or protein), while the third group represents function words (the, of, or a). By calculating the dissemination coefficient (right side of Fig. 5; see caption for definition), we find that the words in the latter group show a close-to-random distribution across documents. Furthermore, the median dissemination of the other groups indicates a substantially less random distribution, with the exception of one subgroup (containing and, for, or which). This suggests a more data-driven approach to dealing with function words in topic models. The standard practice is to remove words from a manually curated list of stopwords; however, recent results question the efficacy of these methods (48). In contrast, the hSBM is able to automatically identify groups of stopwords, potentially rendering these heuristic interventions unnecessary.
DISCUSSION
The underlying equivalence between pLSI and the overlapping version of the SBM means that the "bag-of-words" formulation of topical corpora is mathematically equivalent to bipartite networks of words
Fig. 5. Inference of the hSBM for articles from the English Wikipedia. Articles from three categories (chemical physics, experimental physics, and computational biology). The first hierarchical level reflects the bipartite nature of the network, with document nodes (left) and word nodes (right). The grouping on the second hierarchical level is indicated by solid lines. We show examples of nodes that belong to each group on the third hierarchical level (indicated by dotted lines): For word nodes, we show the five most frequent words; for document nodes, we show three (or fewer) randomly selected articles. For each word, we calculate the dissemination coefficient U_D, which quantifies how unevenly words are distributed among documents (60): U_D = 1 indicates the expected dissemination from a random null model; the smaller U_D (0 < U_D < 1), the more unevenly a word is distributed. We show the 5th, 25th, 50th, 75th, and 95th percentiles for each group of word nodes on the third level of the hierarchy. Intl. Soc. for Comp. Biol., International Society for Computational Biology; RRKM theory, Rice-Ramsperger-Kassel-Marcus theory.
clustering of both words and documents, in contrast to LDA, which is based on a nonhierarchical clustering of the words alone. This enables the identification of structural patterns in text that are unavailable to LDA while, at the same time, allowing for the identification of patterns at multiple scales of resolution.

We have shown that hSBM constitutes a better topic model compared to LDA not only for a diverse set of real corpora but also for artificial corpora generated from LDA itself. It is capable of providing better compression—as a measure of the quality of fit—and a richer interpretation of the data. Moreover, the hSBM offers an alternative to the Dirichlet priors used in virtually any variation of current approaches to topic modeling. While motivated by their computational convenience, Dirichlet priors do not reflect prior knowledge compatible with the actual usage of language. Our analysis suggests that Dirichlet priors introduce severe biases into the inference result, which, in turn, markedly hinder its performance in the event of even slight deviations from the Dirichlet assumption. In contrast, our work shows how to formulate and incorporate different (and, as we have shown, more suitable) priors in a fully Bayesian framework, which are completely agnostic to the type of inferred mixtures. Furthermore, it also serves as a working example that efficient numerical implementations of non-Dirichlet topic models are possible.

block-model approach we introduce here is also promising beyond text analysis.

MATERIALS AND METHODS
Minimum description length
We compared both models based on the description length Σ, where smaller values indicate a better model (45). We obtained Σ for LDA from Eq. 2 and Σ for hSBM from Eq. 19 as

\Sigma_{\mathrm{LDA}} = -\ln P(n \mid \eta, \beta, \alpha)\, P(\eta) \qquad (22)

\Sigma_{\mathrm{hSBM}} = -\ln P(A, \{b_l\}) \qquad (23)

We noted that Σ_LDA is conditioned on the hyperparameters β and α and, therefore, it is exact only for noninformative priors (α_dr = 1 and β_rw = 1). Otherwise, Eq. 22 is only a lower bound for Σ_LDA because it lacks the terms involving hyperpriors for β and α. For simplicity, we ignored this correction in our analysis, and therefore, we favored LDA. The motivation for this approach was twofold.
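As a practical note (not part of the paper's text): for the hSBM, the description length of a fit obtained with the graph-tool library can be read off directly from the fitted state, while Σ_LDA requires evaluating Eq. 22 for the fitted LDA parameters. The helper below only sketches the final comparison step, and the numbers in the usage example are placeholders of the same order of magnitude as the Reuters row of Table 1.

```python
# Sketch of the model-selection step: the model with the smaller description
# length provides the better (more compressive) fit, as in Table 1.
# For the hSBM, graph-tool's NestedBlockState.entropy() returns the description
# length of the fitted model (in nats); obtaining Sigma_LDA from Eq. 22 depends
# on the LDA implementation and is not shown here.

def pick_better_model(sigma_lda, sigma_hsbm, n_tokens):
    """Return the preferred model and the per-token description lengths."""
    per_token = {"LDA": sigma_lda / n_tokens, "hSBM": sigma_hsbm / n_tokens}
    return min(per_token, key=per_token.get), per_token

# Illustrative values only (roughly the magnitude of the Reuters corpus):
best, per_token = pick_better_model(498_194.0, 341_199.0, 117_661)
print(best, per_token)
```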
Artificial corpora
The generative process of LDA can be described in the following way. For each topic r ∈ {1,…, K}, we sampled a distribution over words φ_r from a V-dimensional Dirichlet distribution with parameters β_rw for w ∈ {1,…, V}. For each document d ∈ {1,…, D}, we sampled a topic mixture θ_d from a K-dimensional Dirichlet distribution with parameters α_dr for r ∈ {1,…, K}. For each word position l_d ∈ {1,…, k_d} (k_d is the length of document d), we first sampled a topic r* = r_{l_d} from a multinomial with parameters θ_d and then sampled a word w from a multinomial with parameters φ_{r*}.

We assumed a parametrization in which (i) each document has the same topic-document hyperparameter, that is, α_dr = α_r for d ∈ {1,…, D}, and (ii) each topic has the same word-topic hyperparameter, that is, β_rw = β_w for r ∈ {1,…, K}. We fixed the average probability of occurrence of a topic, p_r (word, p_w), by introducing scalar hyperparameters a (b), that is, α_dr = aKp_r for r ∈ {1,…, K} [β_rw = bVp_w for w = 1,…, V]. In our case, we chose (i) equiprobable topics, that is, p_r = 1/K, and (ii) empirically measured word frequencies from the Wikipedia corpus, that is, p_w = p_w^emp with w = 1,…, 95,129, yielding a Zipfian distribution (section S5 and fig. S5), shown to be universally described by a double power law (44).
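The sketch below is a direct numpy transcription of this generative process; all sizes and the scalar hyperparameters are illustrative, and the empirical Wikipedia word frequencies p_w^emp are replaced by a simple Zipf-like stand-in.

```python
import numpy as np

rng = np.random.default_rng(42)
K, V, D, k_d = 3, 100, 50, 200      # topics, vocabulary, documents, doc length (illustrative)

# Average occurrence probabilities: equiprobable topics and Zipf-like word frequencies.
p_r = np.full(K, 1.0 / K)
p_w = 1.0 / np.arange(1, V + 1)
p_w /= p_w.sum()

a, b = 1.0, 0.01                    # scalar hyperparameters (illustrative values)
alpha = a * K * p_r                 # alpha_dr = a K p_r
beta = b * V * p_w                  # beta_rw = b V p_w

phi = rng.dirichlet(beta, size=K)   # one word distribution per topic
corpus = []
for d in range(D):
    theta_d = rng.dirichlet(alpha)                      # topic mixture of document d
    topics = rng.choice(K, size=k_d, p=theta_d)         # topic of each word position
    words = [rng.choice(V, p=phi[r]) for r in topics]   # word at each position
    corpus.append(words)

print(len(corpus), "documents,", sum(len(w) for w in corpus), "word tokens")
```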
Data sets for real corpora

SUPPLEMENTARY MATERIALS
Section S4. Word-document networks are not sparse
Section S5. Empirical word-frequency distribution
Fig. S1. Varying the hyperparameters α and β in the comparison between LDA and SBM for artificial corpora drawn from LDA.
Fig. S2. Varying the number of topics K in the comparison between LDA and SBM for artificial corpora drawn from LDA.
Fig. S3. Varying the base measure of the hyperparameters α and β in the comparison between LDA and SBM for artificial corpora drawn from LDA.
Fig. S4. Word-document networks are not sparse.
Fig. S5. Empirical rank-frequency distribution.
Reference (61)

REFERENCES AND NOTES
1. D. M. Blei, Probabilistic topic models. Commun. ACM 55, 77–84 (2012).
2. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. Harshman, Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 391–407 (1990).
3. Z. Ghahramani, Probabilistic machine learning and artificial intelligence. Nature 521, 452–459 (2015).
4. T. Hofmann, Probabilistic latent semantic indexing, in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), Berkeley, CA, 15 to 19 August 1999, pp. 50–57.
5. D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003).
6. T. L. Griffiths, M. Steyvers, Finding scientific topics. Proc. Natl. Acad. Sci. U.S.A. 101, 5228–5235 (2004).
25. J. Paisley, C. Wang, D. M. Blei, M. I. Jordan, Nested hierarchical Dirichlet processes. IEEE Trans. Pattern Anal. Mach. Intell. 37, 256–270 (2015).
26. E. B. Sudderth, M. I. Jordan, Shared segmentation of natural scenes using dependent Pitman-Yor processes, in Advances in Neural Information Processing Systems 21 (NIPS 2008), D. Koller, D. Schuurmans, Y. Bengio, L. Bottou, Eds. (Curran Associates Inc., 2009), pp. 1585–1592.
27. I. Sato, H. Nakagawa, Topic models with power-law using Pitman-Yor process, in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'10), Washington, DC, 25 to 28 July 2010, pp. 673–682.
28. W. L. Buntine, S. Mishra, Experiments with non-parametric topic models, in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'14), New York, NY, 24 to 27 August 2014, pp. 881–890.
29. T. Broderick, L. Mackey, J. Paisley, M. I. Jordan, Combinatorial clustering and the beta negative binomial process. IEEE Trans. Pattern Anal. Mach. Intell. 37, 290–306 (2015).
30. M. Zhou, L. Carin, Negative binomial process count and mixture modeling. IEEE Trans. Pattern Anal. Mach. Intell. 37, 307–320 (2015).
31. S. Fortunato, Community detection in graphs. Phys. Rep. 486, 75–174 (2010).
32. E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. P. Xing, Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9, 1981–2014 (2008).
33. B. Ball, B. Karrer, M. E. J. Newman, Efficient and principled method for detecting communities in networks. Phys. Rev. E 84, 036103 (2011).
34. M. E. J. Newman, M. Girvan, Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004).
35. R. Guimerà, M. Sales-Pardo, L. A. N. Amaral, Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E 70, 025101 (2004).
48. A. Schofield, M. Magnusson, D. Mimno, Pulling out the stops: Rethinking stopword removal for topic models, in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3 to 7 April 2017, vol. 2, pp. 432–436.
49. A. Decelle, F. Krzakala, C. Moore, L. Zdeborová, Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett. 107, 065701 (2011).
50. D. Hu, P. Ronhovde, Z. Nussinov, Phase transitions in random Potts systems and the community detection problem: Spin-glass type and dynamic perspectives. Philos. Mag. 92, 406–445 (2012).
51. T. P. Peixoto, Inferring the mesoscale structure of layered, edge-valued, and time-varying networks. Phys. Rev. E 92, 042807 (2015).
52. M. E. J. Newman, A. Clauset, Structure and inference in annotated networks. Nat. Commun. 7, 11863 (2016).
53. D. Hric, T. P. Peixoto, S. Fortunato, Network structure, metadata, and the prediction of missing nodes and annotations. Phys. Rev. X 6, 031038 (2016).
54. O. T. Courtney, G. Bianconi, Dense power-law networks and simplicial complexes. Phys. Rev. E 97, 052303 (2018).
55. R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, H. E. Stanley, Linguistic features of noncoding DNA sequences. Phys. Rev. Lett. 73, 3169–3172 (1994).
56. R. E. Kass, A. E. Raftery, Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).
57. T. Vallès-Català, T. P. Peixoto, R. Guimerà, M. Sales-Pardo, Consistencies and inconsistencies between model selection and link prediction in networks. Phys. Rev. E 97, 026316 (2018).
58. H. M. Wallach, D. M. Mimno, A. McCallum, Rethinking LDA: Why priors matter, in Advances