A Biterm Topic Model For Short Texts Slide
Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng
Institute of Computing Technology,
Chinese Academy of Sciences
1
Short Texts Are Prevalent on Today's Web
2
Background
Understanding the topics of short texts is
important in many tasks
content characterization
content recommendation
user interest profiling
emerging topic detection
semantic analysis
...
3
Topic Models
From Blei
5
Previous Approaches
LDA with document aggregation
e.g. aggregating the tweets published by the same user
heuristic, not general
Mixture of unigrams
each document has only one topic
the assumption is too strict and results in peaked posteriors P(z|d)
6
Key Ideas
Topics are basically groups of correlated words and
the correlation is revealed by word co-occurrence
patterns in documents
why not directly model the word co-occurrences for topic
learning?
7
Biterm Topic Model (BTM)
BTM models the generation of word co-occurrences in a
corpus
A biterm is an unordered word pair co-occurring in the same
short context (document)
Training data includes all the biterms in the corpus
Generative description
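A minimal sketch of the generative process, assuming the formulation commonly given for BTM: one corpus-level topic distribution theta drawn from a Dirichlet(alpha), one topic-word distribution phi_z drawn from a Dirichlet(beta) per topic, and for each biterm a topic z drawn from theta followed by two words drawn independently from phi_z. The function names, hyperparameter values, and toy vocabulary are illustrative assumptions, not taken from the slides.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma samples."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_biterms(vocab, num_topics, num_biterms, alpha=1.0, beta=0.5, seed=0):
    """Sample biterms from the assumed BTM generative process."""
    rng = random.Random(seed)
    # One topic distribution for the whole corpus (not one per document).
    theta = sample_dirichlet(rng, alpha, num_topics)
    # One word distribution per topic.
    phi = [sample_dirichlet(rng, beta, len(vocab)) for _ in range(num_topics)]
    biterms = []
    for _ in range(num_biterms):
        # Draw a topic for the biterm, then both words from that topic.
        z = rng.choices(range(num_topics), weights=theta)[0]
        w_i, w_j = rng.choices(vocab, weights=phi[z], k=2)
        biterms.append((z, w_i, w_j))
    return theta, phi, biterms


if __name__ == "__main__":
    theta, phi, biterms = generate_biterms(["apple", "store", "music", "phone"], 2, 5)
    for z, w_i, w_j in biterms:
        print(z, w_i, w_j)
```

Drawing theta once for the corpus, rather than once per document, is what lets the model pool co-occurrence evidence across all short texts.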
8
Biterm Topic Model (BTM)
where
10
Parameter Inference
Gibbs Sampling
sample topic for each biterm
[Figure: LDA vs. BTM, Memory (M) against Topic Number K (50 to 250)]
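The per-biterm sampling step can be sketched as a collapsed Gibbs sampler. This is a sketch under stated assumptions: the conditional is taken to be proportional to (n_z + alpha)(n_{wi|z} + beta)(n_{wj|z} + beta) / (sum_w n_{w|z} + M*beta)^2, with M the vocabulary size, and words are assumed to be integer ids. The function name and default hyperparameters are hypothetical.

```python
import random


def btm_gibbs(biterms, vocab_size, num_topics, alpha=1.0, beta=0.01,
              iterations=50, seed=0):
    """Collapsed Gibbs sampling over biterms given as (word_id, word_id) pairs."""
    rng = random.Random(seed)
    n_z = [0] * num_topics                                  # biterms per topic
    n_wz = [[0] * vocab_size for _ in range(num_topics)]    # word counts per topic
    assignments = []

    # Random initialization of topic assignments.
    for w_i, w_j in biterms:
        z = rng.randrange(num_topics)
        assignments.append(z)
        n_z[z] += 1
        n_wz[z][w_i] += 1
        n_wz[z][w_j] += 1

    for _ in range(iterations):
        for idx, (w_i, w_j) in enumerate(biterms):
            # Remove the current assignment from the counts.
            z = assignments[idx]
            n_z[z] -= 1
            n_wz[z][w_i] -= 1
            n_wz[z][w_j] -= 1

            # Unnormalized conditional for each topic; each biterm
            # contributes two words, so topic k holds 2 * n_z[k] words.
            weights = []
            for k in range(num_topics):
                denom = 2 * n_z[k] + vocab_size * beta
                weights.append((n_z[k] + alpha)
                               * (n_wz[k][w_i] + beta)
                               * (n_wz[k][w_j] + beta)
                               / (denom * denom))

            # Sample a new topic and restore the counts.
            z = rng.choices(range(num_topics), weights=weights)[0]
            assignments[idx] = z
            n_z[z] += 1
            n_wz[z][w_i] += 1
            n_wz[z][w_j] += 1

    # Point estimates of the corpus topic distribution and topic-word distributions.
    theta = [(n_z[k] + alpha) / (len(biterms) + num_topics * alpha)
             for k in range(num_topics)]
    phi = [[(n_wz[k][w] + beta) / (2 * n_z[k] + vocab_size * beta)
            for w in range(vocab_size)]
           for k in range(num_topics)]
    return theta, phi, assignments
```

The sampler keeps only the per-topic counts and one assignment per biterm, with no per-document topic counts.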
11
Experiments: Datasets
                 Tweets2011    Question    20Newsgroup
#users           2,039,877     -           -
#categories      -             35          20
avg doc length   5.21          3.94        97.20
12
Experiments: Tweets2011 Collection
Topic quality
evaluation metric: average coherence score
(Mimno'11) on the top T words
A larger coherence score means the topics are more coherent
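As an illustration, the coherence score of a single topic can be computed as below, assuming the definition from Mimno et al. (2011): the sum over pairs of top words of log((D(v_m, v_l) + 1) / D(v_l)), where D counts the documents containing the given word or words. The function name and toy data are hypothetical; the average coherence score is the mean of this value over all topics.

```python
import math


def topic_coherence(top_words, documents):
    """Coherence of one topic given its top words, most probable first.

    Assumes every top word occurs in at least one document, so that
    the document frequency in the denominator is never zero.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_freq = doc_freq(top_words[m], top_words[l])
            score += math.log((co_freq + 1) / doc_freq(top_words[l]))
    return score


if __name__ == "__main__":
    docs = [["apple", "banana"], ["apple", "banana"], ["apple", "cherry"]]
    print(topic_coherence(["apple", "banana"], docs))
    print(topic_coherence(["apple", "cherry"], docs))
```

Word pairs that often appear in the same documents push the score toward zero, while pairs that rarely co-occur make it more negative.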
14
Experiments: Tweets2011 Collection
Quality of topical representation of documents (i.e.
P(z|d))
select the 50 most frequent and meaningful hashtags as class labels
organize documents with the same hashtag into a cluster
measure: H score
a smaller value indicates better agreement with human-labeled classes
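The slide does not define the H score. The sketch below assumes it is the ratio of the average intra-cluster distance to the average inter-cluster distance between the documents' topic distributions P(z|d), and it assumes Jensen-Shannon divergence as the distance. Both choices, and the function names, are assumptions made for illustration.

```python
import math
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def h_score(clusters):
    """Assumed H score: mean intra-cluster / mean inter-cluster distance.

    `clusters` is a list of clusters, each a list of P(z|d) vectors.
    Needs at least two clusters and one cluster with two documents.
    """
    intra = []
    for docs in clusters:
        pairs = list(combinations(docs, 2))
        if pairs:
            intra.append(sum(js_divergence(p, q) for p, q in pairs) / len(pairs))

    inter = []
    for c1, c2 in combinations(clusters, 2):
        dists = [js_divergence(p, q) for p in c1 for q in c2]
        inter.append(sum(dists) / len(dists))

    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))


if __name__ == "__main__":
    good = [[[0.9, 0.1], [0.8, 0.2]], [[0.1, 0.9], [0.2, 0.8]]]
    print(h_score(good))
```

Under these assumptions, documents sharing a hashtag should have similar topic distributions, which keeps the numerator small and the score low.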
15
Experiments: Question Collection
Evaluated by document classification (linear SVM)
16
Experiments: 20Newsgroup Collection
(Normal Texts)
Biterm extraction
two words co-occur within a context window with range no larger
than a threshold r
clustering result
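The windowed biterm extraction described above can be sketched as follows. The function name is hypothetical, and reading "range no larger than a threshold r" as a position distance of at most r tokens is an assumption. Passing `window=None` treats the whole document as a single context, as is done for short texts.

```python
def extract_biterms(tokens, window=None):
    """Return unordered word pairs co-occurring within `window` positions.

    With window=None every pair of tokens in the document forms a biterm.
    Each pair is stored in sorted order so that (a, b) and (b, a) coincide.
    """
    biterms = []
    for i in range(len(tokens)):
        end = len(tokens) if window is None else min(len(tokens), i + window + 1)
        for j in range(i + 1, end):
            biterms.append(tuple(sorted((tokens[i], tokens[j]))))
    return biterms


if __name__ == "__main__":
    print(extract_biterms(["visit", "apple", "store"]))
    # [('apple', 'visit'), ('store', 'visit'), ('apple', 'store')]
    print(extract_biterms(["visit", "apple", "store"], window=1))
    # [('apple', 'visit'), ('apple', 'store')]
```

A small window stops distant, unrelated words in a long document from forming biterms.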
17
Summary
A practical but not well-studied problem
topic modeling on short texts
conventional topic models suffer from severe data sparsity
Future work
better ways to infer topic proportions for short text documents
explore BTM in real-world applications
18
Thank You
19