ME314 Day 11
Topic Models
Jack Blumenau
ME314
Today’s lecture
Topic Models
Latent Dirichlet Allocation (LDA)
Extensions
Structural Topic Model (STM)
Validating Topic Models
Conclusion
Topic Models
Topic Models
Topic models allow us to cluster similar documents in a corpus together.
Wait. Don’t we already have tools for that?
Yes! Dictionaries and supervised learning.
So what do topic models add?
Do you know the categories in which you want to place documents?
  Yes → Do you know the rule for placing documents in categories?
    Yes → Dictionaries
    No → Supervised Learning
  No → Topic Models
Topic Models
Pause for motivating material!
Topic Models
Topic models offer an automated procedure for discovering the main “themes” in an
unstructured corpus
They require no prior information, training set, or labelling of texts before estimation
They allow us to automatically organise, understand, and summarise large archives of
text data.
Latent Dirichlet Allocation (LDA) is the most common approach (Blei et al., 2003), and
one that underpins more complex models
Topic models are an example of mixture models:
Documents can contain multiple topics
Words can belong to multiple topics
Topic Models as Language Models
In the last lecture, we introduced the idea of a probabilistic language model
These models tell a probabilistic story about how documents are generated
A language model is represented by a probability distribution over words in a
vocabulary
The Naive Bayes text classification model is one example of a generative language
model where
We estimate separate probability distributions for each category of interest
Each document is assigned to a single category
Topic models are also language models
We estimate separate probability distributions for each topic
Each document is described as belonging to multiple topics
What is a “topic”?
A “topic” is a probability distribution over a fixed word vocabulary.
These estimated distributions can then be used to organise documents by topic, assess how topics vary
across documents, etc.
Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA)
LDA is a probabilistic language model.
Each document d in the corpus is generated as follows:
1. Draw the document's topic proportions θd from a Dirichlet distribution with parameter α
2. For each word slot in the document:
(a) Draw a topic assignment z from a multinomial distribution with proportions θd
(b) Draw the word w from the chosen topic's distribution over the vocabulary, βz
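To make this generative story concrete, here is a toy simulation of a single document in base R (a sketch, not from the slides; the vocabulary, K, α, and β values are all made up):

## Toy simulation of the LDA generative process for one document
set.seed(1)
vocab <- c("tax", "budget", "school", "teacher", "nhs", "doctor")
K <- 2
beta <- rbind(c(0.40, 0.40, 0.05, 0.05, 0.05, 0.05),  # topic 1: economy words
              c(0.05, 0.05, 0.30, 0.30, 0.15, 0.15))  # topic 2: education/health words
alpha <- rep(0.5, K)

## 1. Draw the document's topic proportions from Dirichlet(alpha) (via normalised gammas)
g <- rgamma(K, shape = alpha)
theta_d <- g / sum(g)

## 2. For each word slot: draw a topic assignment, then draw a word from that topic
words <- replicate(10, {
  z <- sample(1:K, 1, prob = theta_d)
  sample(vocab, 1, prob = beta[z, ])
})
words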
## Estimate LDA
library(topicmodels)

ldaOut <- LDA(pmq_tm_dfm, k = 40, method = "Gibbs")

save(ldaOut, file = "../data/scripts/ldaOut_40.Rdata")
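A quick way to inspect the fitted model (a sketch, not from the slides) is to look at the highest-probability terms in each topic and the estimated document-topic proportions, using the terms() and posterior() functions from topicmodels:

terms(ldaOut, 10)                        # ten highest-probability terms for each topic
topic_props <- posterior(ldaOut)$topics  # documents-by-topics matrix of topic proportions
head(topic_props[, 1:5])                 # proportions of the first five topics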
LDA example
We will make use of the following score to visualise the posterior topics:
$$\text{term-score}_{k,v} = \hat{\beta}_{k,v}\,\log\!\left(\frac{\hat{\beta}_{k,v}}{\left(\prod_{j=1}^{K}\hat{\beta}_{j,v}\right)^{1/K}}\right)$$
The first term, $\hat{\beta}_{k,v}$, is the probability of term v in topic k and is akin to the term
frequency
The second term down-weights terms that have high probability under all topics
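A sketch of computing this score from the fitted model (not from the slides; it assumes the ldaOut object estimated earlier):

beta_hat <- posterior(ldaOut)$terms  # K x V matrix of topic-word probabilities
## Subtract each term's log geometric mean across topics, then weight by beta_hat
term_score <- beta_hat * sweep(log(beta_hat), 2, colMeans(log(beta_hat)))
## Ten terms with the highest term-score in topic 1
head(sort(term_score[1, ], decreasing = TRUE), 10)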
Disadvantages
“Our results strongly suggest that the imprint of social class will be found in even the
fuzziest of application materials.”
Break
Extensions
Extending LDA
LDA can be embedded in more complicated models, embodying further intuitions
about the structure of the texts.
E.g., it can be used in models that account for syntax, authorship, word sense,
dynamics, correlation, hierarchies, and other structure.
The data generating distribution can be changed. We can apply mixed-membership
assumptions to many kinds of data.
E.g., we can build models of images, social networks, music, purchase histories,
computer code, genetic data, and other types.
The posterior can be used in creative ways.
E.g., we can use inferences in information retrieval, recommendation, similarity,
visualization, summarization, and other applications.
LDA Extensions
1. Correlated Topic Model (CTM)
LDA assumes that topics are uncorrelated across the corpus
The correlated topic model allows topics to be correlated
Closer approximation to true document structure
Disadvantage: estimation is considerably slower than for LDA
2. Structural Topic Model (STM)
The STM allows topic prevalence to vary with document-level covariates (here, the party of the speaker)
We can use the STM to analyse how topic prevalence varies by party
The γk coefficients give the estimated difference in topic proportions between Labour and
Conservative legislators for each topic
Structural Topic Model Application
library(stm)

## Estimate STM
stmOut <- stm(
  documents = pmq_dfm,
  prevalence = ~party.reduced,
  K = 30,
  seed = 123
)

save(stmOut, file = "stmOut.Rdata")
Structural Topic Model Application
labelTopics(stmOut)
Topic 3:
I suspect that many Members from all parties in this House will agree that mental health services have for
too long been treated as a poor cousin a Cinderella service in the NHS and have been systematically underfunded
for a long time. That is why I am delighted to say that the coalition Government have announced that we will be
introducing new access and waiting time standards for mental health conditions such as have been in existence
for physical health conditions for a long time. Over time, as reflected in the new NHS mandate, we must ensure
that mental health is treated with equality of resources and esteem compared with any other part of the NHS.
I am sure that the Prime Minister will join me in congratulating Cheltenham and Tewkesbury primary care
trust on never having had a financial deficit and on living within its means. Can he therefore explain to the
professionals, patients and people of Cheltenham why we are being rewarded with the closure of our 10-year-old
purpose-built maternity ward, the closure of our rehabilitation hospital, cuts in health promotion, cuts in
community nursing, cuts in health visiting, cuts in access to acute care and the non-implementation of new NICE-
prescribed drugs such as Herceptin?
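Passages like those above can be retrieved with stm's findThoughts() function. A minimal sketch, assuming pmq_texts holds the original speech texts in the same order as the rows of pmq_dfm:

## Retrieve the three documents most strongly associated with topic 3
thoughts <- findThoughts(stmOut, texts = pmq_texts, topics = 3, n = 3)
plotQuote(thoughts$docs[[1]], width = 80)  # print the retrieved passages as quotes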
Structural Topic Model Application
dim(stmOut$theta)
[1] 27885 30
Each row of θ gives a document's estimated topic proportions: 27,885 documents by 30 topics.
Structural Topic Model Application
Do MPs from different parties speak about healthcare at different rates?
stm_effects <- estimateEffect(formula = c(3) ~ party.reduced,
                              stmobj = stmOut,
                              metadata = docvars(pmq_dfm))

plot.estimateEffect(stm_effects,
                    covariate = "party.reduced",
                    method = "pointestimate",
                    xlim = c(0.025, 0.045))
Structural Topic Model Application
On which topics do Conservative and Labour MPs differ the most?
stm_effects <- estimateEffect(formula = c(1:30) ~ party.reduced,
                              stmobj = stmOut,
                              metadata = docvars(pmq_dfm))
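One way to answer this question (a sketch, not from the slides; it assumes party.reduced has only two levels, so the second row of each topic's regression table holds the party coefficient) is to extract the estimated party difference for every topic and rank topics by its absolute size:

eff_summary <- summary(stm_effects)
## One regression table per topic; pull the party coefficient from each
party_gap <- sapply(eff_summary$tables, function(tab) tab[2, "Estimate"])
## Topics where the estimated Labour-Conservative difference is largest
head(order(abs(party_gap), decreasing = TRUE), 5)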
Structural Topic Model Application – Content
library(stm)

## Estimate STM
stmOut2 <- stm(
  documents = pmq_dfm,
  content = ~party.reduced,
  K = 30,
  seed = 123
)

save(stmOut2, file = "../data/scripts/stmOut2.Rdata")
Structural Topic Model Application – Content
plot(stmOut2,
     topics = c(3),
     type = "perspectives",
     plabels = c("Conservative", "Labour"),
     main = topic_labels[3])
STM Application
Do liberal and conservative newspapers report on the
economy in different ways?
Validating Topic Models
Problems:
Reforms to the banking system are an essential part of dealing with the crisis, and
delivering lasting and sustainable growth to the economy. Without these changes, we
will be weaker, we will be less well respected abroad, and we will be poorer.
Topic w1 w2 w3 w4 w5 w6
1 bank financ regul england fiscal market
2 plan econom growth longterm deliv sector
3 school educ children teacher pupil class
Assumption: When humans find it easy to locate the “intruding” topic, the mappings are
more sensible.
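To illustrate, here is a toy sketch (not from the slides) of building a single topic-intrusion item from the STM fitted earlier: a coder sees the top words of a document's three most probable topics plus one very improbable "intruder" topic, and is asked which topic does not belong.

## Build one topic-intrusion item for document d
set.seed(3)
d <- 1
topic_order <- order(stmOut$theta[d, ], decreasing = TRUE)
shown <- c(topic_order[1:3], topic_order[length(topic_order)])  # three likely topics + one intruder
top_words <- labelTopics(stmOut, topics = shown, n = 6)$prob    # top words for the selected topics
rownames(top_words) <- paste("Topic", shown)
top_words[sample(nrow(top_words)), ]                            # shuffle rows before showing to a coder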
Semantic validity (Chang et al. 2009)
Conclusion:
“Topic models which perform better on held-out likelihood may infer less
semantically meaningful topics.” (Chang et al., 2009)
Validating Topic Models – Substantive approaches
Semantic validity
Does a topic contain coherent groups of words?
Does a topic identify a coherent group of texts that are internally homogeneous but
distinctive from other topics?
Predictive validity
How well does variation in topic usage correspond to known events?
Construct validity
How well does our measure correlate with other measures?
Implication: All these approaches require careful human reading of texts and topics, and
comparison with sensible metadata.
Conclusion
Summing Up
Topic models offer an approach to automatically inferring the substantive themes that
exist in a corpus of texts
A topic is described as a probability distribution over words in the vocabulary
Documents are described as a mixture of corpus-wide topics
Topic models require very little up-front effort, but extensive interpretation and
validation afterwards