A Biterm Topic Model For Short Texts Slide

This document proposes a Biterm Topic Model (BTM) to address challenges in modeling topics for short texts. BTM directly models word co-occurrence patterns through "biterms" (word pairs that co-occur in documents) rather than word counts. This exploits rich global word co-occurrence information to overcome data sparsity issues in short texts. Experiments on tweet, question, and newsgroup collections show BTM produces more coherent topics and better represents document content than conventional topic models. The authors suggest future work to improve inferring topic proportions for short documents and applying BTM in real-world applications.


A Biterm Topic Model for Short Texts
Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng
Institute of Computing Technology,
Chinese Academy of Sciences

Short Texts Are Prevalent on Today's Web

Background
Understanding the topics of short texts is
important in many tasks
content characterizing
content recommendation
user interest profiling
emerging topic detecting
semantic analysis
...

Topic Models

From Blei

Model the generation of documents with latent topic structure

a topic ~ a probability distribution over words
a document ~ a mixture of topics
a word ~ a sample drawn from one topic

Previous studies mainly focus on normal texts

Problem on Short Texts: Data Sparsity
Word counts are not discriminative
normal doc: topical words occur frequently
short doc: most words only occur once

Not enough context to identify the senses of ambiguous words
normal doc: rich context, many topic-related words
short doc: scarce context, few topic-related words

Previous Approaches
LDA with document aggregation
e.g. aggregating the tweets published by the same user
heuristic, not general

Mixture of unigrams
each document has only one topic
overly strict assumption, resulting in peaked posteriors P(z|d)

Sparse topic models

each document maintains a sparse distribution over topics, e.g. Focused Topic Models
too complex, prone to overfitting

Key Ideas
Topics are basically groups of correlated words and
the correlation is revealed by word co-occurrence
patterns in documents
why not directly model the word co-occurrences for topic
learning?

Topic models on short texts suffer from severely sparse
patterns in short documents
why not use the rich global word co-occurrence patterns
to better reveal topics?

Biterm Topic Model (BTM)
BTM models the generation of word co-occurrences in a
corpus
A biterm is an unordered word pair co-occurring in the same
short context (document)
Training data includes all the biterms in the corpus
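
As a minimal sketch of biterm extraction, the Python below treats each short document as one context and emits every unordered word pair in it. The function names `extract_biterms` and `corpus_biterms` and the toy documents are illustrative assumptions, not part of the slides.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair in one short document."""
    # Sort each pair so that (a, b) and (b, a) map to the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterms(docs):
    """Count all biterms across a corpus of tokenized documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return counts


# Toy corpus (hypothetical data).
docs = [["apple", "iphone", "release"], ["apple", "pie", "recipe"]]
print(extract_biterms(docs[0]))
print(corpus_biterms(docs).most_common(3))
```

A document with n tokens yields n(n-1)/2 biterms, so even very short documents contribute several training instances to the corpus-level set.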

Generative description

Biterm Topic Model (BTM)

Model the generation of biterms with latent topic structure

a topic ~ a probability distribution over words
a corpus ~ a mixture of topics
a biterm ~ two i.i.d. samples drawn from one topic
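
The generative story above can be simulated in a few lines. The sketch below assumes symmetric Dirichlet priors (`alpha` for the corpus-level topic mixture, `beta` for each topic's word distribution), which the slide text does not show; all function names and default values are hypothetical.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_biterms(vocab, num_topics, num_biterms, alpha=1.0, beta=0.5, seed=0):
    """Simulate biterm generation: one topic per biterm, two words per topic draw."""
    rng = random.Random(seed)
    # a topic ~ a probability distribution over words
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    # a corpus ~ a mixture of topics (one mixture for the whole corpus)
    theta = sample_dirichlet(alpha, num_topics, rng)
    biterms = []
    for _ in range(num_biterms):
        # Pick one topic, then draw both words independently from it.
        z = rng.choices(range(num_topics), weights=theta)[0]
        w_i, w_j = rng.choices(vocab, weights=phi[z], k=2)
        biterms.append((z, w_i, w_j))
    return theta, phi, biterms


theta, phi, biterms = generate_biterms(["game", "team", "vote", "law"], 2, 5)
print(biterms)
```

The contrast with a document-level topic model is the single `theta`: topic proportions are attached to the corpus, not to each short document.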
Inferring Topics in a Document
Assumption
the topic proportions of a document equal the expectation
of the topic proportions of the biterms in it
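
Written out, this assumption reads as follows. The symbols are assumptions taken from the usual BTM notation rather than from the extracted slide text: θ is the corpus-level topic mixture, φ the topic-word distributions, and n_d(b) the number of times biterm b = (w_i, w_j) occurs in document d.

```latex
P(z \mid d) = \sum_{b} P(z \mid b)\, P(b \mid d),
\qquad
P(z \mid b) = \frac{\theta_z\, \phi_{w_i \mid z}\, \phi_{w_j \mid z}}
                   {\sum_{z'} \theta_{z'}\, \phi_{w_i \mid z'}\, \phi_{w_j \mid z'}},
\qquad
P(b \mid d) = \frac{n_d(b)}{\sum_{b'} n_d(b')}
```

The first identity is the expectation stated in the assumption; the second follows from Bayes' rule applied to a biterm whose two words are drawn independently from one topic; the third is the empirical biterm distribution of the document.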

Parameters Inference
Gibbs Sampling
sample topic for each biterm
parameters estimate

[Figure: time cost (s/iteration) vs. topic number K (50 to 250), LDA vs. BTM]
[Figure: memory (M) vs. topic number K (50 to 250), LDA vs. BTM]
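
The two Gibbs sampling steps named on this slide (sample a topic for each biterm, then estimate the parameters) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: it assumes symmetric Dirichlet priors `alpha` and `beta`, a collapsed conditional proportional to (n_z + alpha)(n_{w_i|z} + beta)(n_{w_j|z} + beta) / (sum_w n_{w|z} + M*beta)^2, and biterms given as pairs of integer word ids. All names and defaults are hypothetical.

```python
import random


def btm_gibbs(biterms, vocab_size, num_topics, alpha=1.0, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling over biterms given as (word_id, word_id) pairs."""
    rng = random.Random(seed)
    n_z = [0] * num_topics                                # biterms assigned to each topic
    n_wz = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    assign = []

    # Random initialization of one topic per biterm.
    for w_i, w_j in biterms:
        z = rng.randrange(num_topics)
        assign.append(z)
        n_z[z] += 1
        n_wz[z][w_i] += 1
        n_wz[z][w_j] += 1

    for _ in range(iters):
        for idx, (w_i, w_j) in enumerate(biterms):
            # Remove the current assignment from the counts.
            z = assign[idx]
            n_z[z] -= 1
            n_wz[z][w_i] -= 1
            n_wz[z][w_j] -= 1

            # Unnormalized conditional for each topic.
            weights = []
            for k in range(num_topics):
                # Each biterm adds two words, so sum_w n_wz[k][w] == 2 * n_z[k].
                denom = 2 * n_z[k] + vocab_size * beta
                weights.append(
                    (n_z[k] + alpha)
                    * (n_wz[k][w_i] + beta)
                    * (n_wz[k][w_j] + beta)
                    / (denom * denom)
                )

            # Sample a new topic and restore the counts.
            z = rng.choices(range(num_topics), weights=weights)[0]
            assign[idx] = z
            n_z[z] += 1
            n_wz[z][w_i] += 1
            n_wz[z][w_j] += 1

    # Point estimates of the corpus topic mixture and topic-word distributions.
    total = len(biterms)
    theta = [(n_z[k] + alpha) / (total + num_topics * alpha) for k in range(num_topics)]
    phi = [
        [(n_wz[k][w] + beta) / (2 * n_z[k] + vocab_size * beta) for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    return theta, phi


# Toy input (hypothetical): word ids 0-1 co-occur, and word ids 2-3 co-occur.
theta, phi = btm_gibbs([(0, 1), (0, 1), (2, 3), (2, 3)], vocab_size=4, num_topics=2)
print(theta)
```

The sampler keeps only K topic counts and a K x M word-count table, with no per-document topic counts.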

Experiments: Datasets

                  Tweets2011      Question        20Newsgroup
                  (short text)    (short text)    (normal text)
#documents        4,230,578       189,080         18,828
#words            98,857          26,565          42,697
#users            2,039,877       -               -
#categories       -               35              20
avg doc length    5.21            3.94            97.20

Experiments: Tweets2011 Collection
Topic quality
evaluation metric: average coherence score
(Mimno'11) on the top T words
A larger coherence score means the topics are more coherent

D. Mimno, H. Wallach, E. Talley, M. Leenders, and A. McCallum. Optimizing semantic coherence in topic models. EMNLP 2011
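
For reference, the coherence score of Mimno et al. sums, over ordered pairs of a topic's top T words, log((D(v_m, v_l) + 1) / D(v_l)), where D counts the documents containing the given word or word pair. The sketch below is a straightforward reading of that definition; the function name and toy corpus are hypothetical, and it assumes every top word occurs in at least one document (otherwise D(v_l) would be zero).

```python
import math


def coherence(top_words, docs):
    """Coherence of one topic's top words (most probable first) over tokenized docs."""
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_freq = doc_freq(top_words[m], top_words[l])
            score += math.log((co_freq + 1) / doc_freq(top_words[l]))
    return score


# Toy corpus (hypothetical): "game" and "team" co-occur, "game" and "law" do not.
docs = [["game", "team"], ["game", "team"], ["law"]]
print(coherence(["game", "team"], docs), coherence(["game", "law"], docs))
```

Averaging this value over all topics gives the "average coherence score" used as the evaluation metric.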
Experiments: Tweets2011 Collection

The colored words were judged irrelevant by human annotators

Experiments: Tweets2011 Collection
Quality of topical representation of documents (i.e.
P(z|d))
select the 50 most frequent and meaningful hashtags as class labels
organize documents with the same hashtag into a cluster
measure: H score
a smaller value indicates better agreement with human-labeled classes
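
The slide does not define the H score. The sketch below assumes it is the ratio of the average intra-cluster distance to the average inter-cluster distance between documents' topic distributions P(z|d), which is consistent with smaller values being better. The choice of Jensen-Shannon divergence as the distance, the function names, and the toy vectors are all assumptions.

```python
import math
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def h_score(clusters):
    """clusters: list of clusters, each a list of P(z|d) vectors (two or more docs each)."""
    # Average pairwise distance inside each cluster.
    intra = []
    for cluster in clusters:
        pairs = list(combinations(cluster, 2))
        if pairs:
            intra.append(sum(js_divergence(p, q) for p, q in pairs) / len(pairs))

    # Average pairwise distance between documents of different clusters.
    inter = []
    for c1, c2 in combinations(clusters, 2):
        dists = [js_divergence(p, q) for p in c1 for q in c2]
        inter.append(sum(dists) / len(dists))

    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))


# Hypothetical topic distributions grouped by hashtag.
well_separated = [[[0.9, 0.1], [0.8, 0.2]], [[0.1, 0.9], [0.2, 0.8]]]
print(h_score(well_separated))
```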

Experiments: Question Collection
Evaluated by document classification (linear SVM)

Experiments: 20Newsgroup Collection
(Normal Texts)
Biterm extraction
two words co-occur within a context window with range no larger
than a threshold r
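
For normal-length documents, the windowed extraction described above can be sketched as follows. The sketch assumes "range no larger than r" means the two token positions differ by at most `r`; the function name is hypothetical.

```python
def extract_biterms_window(tokens, r):
    """Return unordered word pairs whose positions differ by at most r."""
    biterms = []
    for i in range(len(tokens)):
        # Pair token i with each of the next r tokens, if present.
        for j in range(i + 1, min(i + r + 1, len(tokens))):
            biterms.append(tuple(sorted((tokens[i], tokens[j]))))
    return biterms


print(extract_biterms_window(["a", "b", "c", "d"], 1))
# [('a', 'b'), ('b', 'c'), ('c', 'd')]
```

With `r` at least the document length minus one, this reduces to taking every word pair in the document, which is the short-text case.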

clustering result

Summary
A practical but not well-studied problem
topic modeling on short texts
conventional topic models suffer from severe data sparsity

A novel way: Biterm Topic Model

model word co-occurrence to uncover topics
fully exploit the rich global word co-occurrences
effective on short texts (and normal texts)

Future work
better ways to infer topic proportions for short text documents
explore BTM in real-world applications
Thank You
