
TOPIC MODELING USING LDA
SESSION – 22-23
AGENDA
• Basics of how text data is seen in Natural Language Processing
• What are topics?
• What is topic modeling?
• What are the applications of topic modeling?
• Topic Modeling Tools and Types of Models
• Discriminative Models
• Generative Models
Sample Problem

• Let’s say you have a client who runs a publishing house. The client comes to you with two tasks: first, to categorize all the books or research papers received weekly under a common theme or topic, and second, to encapsulate large documents into smaller, bite-sized texts. Is there a technique and tool available that can do both of these tasks?
What are Topics?

• Topics or themes are groups of statistically significant “tokens” or words in a “corpus”.
• Refreshing the terminology:
• A corpus is the group of all the text documents whereas a document
is a collection of paragraphs.
• A paragraph is a collection of sentences and a sentence is a sequence
of words (or tokens) in a grammatical construct.
Topics

• So basically, a book or research paper, which collectively has pages full of sentences, can be broken down into words. In the world of Natural Language Processing (NLP), these words are known as tokens: the single, smallest units of text. The vocabulary is the set of unique tokens.
NLP for Topics
• The first step in working with any text data is to split the text into tokens. The process of splitting text into smaller units or words is known as tokenization.
• As humans, we can easily read through a text, review, or book and, based on the context, tell what topic it is referring to, right? Yes! However, how would a machine tell us the topic of the book? How can you tell whether a machine rightly classifies a book or text into the correct category? The only way to interpret what a machine builds for us is the language of statistics.
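• As a rough, illustrative sketch of tokenization using plain Python regular expressions (libraries such as NLTK or spaCy do this more robustly):

import re

text = "I want to watch a movie this weekend."
tokens = re.findall(r"[a-z]+", text.lower())   # split the lower-cased text into word tokens
print(tokens)        # ['i', 'want', 'to', 'watch', 'a', 'movie', 'this', 'weekend']
vocabulary = set(tokens)                       # the set of unique tokens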
NLP for Topics

• The next question that arises is to unravel what we mean by statistical significance in the context of text data.
• Statistically significant words are collections of words that are similar to each other, and we see that within text data in the following way:
Words of common topics

In the table of example words, we have three different topics: Topic 1 is about food, Topic 2 talks about games, and Topic 3 has words related to neuroscience. In each case, the words that are similar to each other come together as a topic.
• Topic modeling is the process of automatically finding the hidden topics in textual data.
• It is also referred to as a text or information mining technique that aims to find recurring patterns in the words present in the corpus.
• It is an unsupervised learning method, as we do not need to supply labels to the topic modeling algorithm for it to identify the themes or topics. Topics are automatically identified and classified by the model.
Topic Modelling
• Essentially, topic modeling can be seen as a clustering methodology,
wherein the small groups (or clusters) that are formed based on the
similarity of words are known as topics.
• Additionally, topic modeling returns another set of clusters which are
the group of documents collated together on the similarity of the
topics.
• It is an optimization technique.
Topic Modelling
Illustration
• We have a corpus with the following five documents:
• Document 1: I want to watch a movie this weekend.
• Document 2: I went shopping yesterday. New Zealand won the World Test
Championship by beating India by eight wickets at Southampton.
• Document 3: I don’t watch cricket. Netflix and Amazon Prime have very good
movies to watch.
• Document 4: Movies are a nice way to chill however, this time I would like to
paint and read some good books. It’s been so long!
• Document 5: This blueberry milkshake is so good! Try reading Dr. Joe
Dispenza’s books. His work is such a game-changer! His books helped to learn
so much about how our thoughts impact our biology and how we can all
rewire our brains.
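• For the small hands-on sketches in the later slides, this corpus can be held in a plain Python list (the variable name documents is ours, used only for illustration):

documents = [
    "I want to watch a movie this weekend.",
    "I went shopping yesterday. New Zealand won the World Test Championship "
    "by beating India by eight wickets at Southampton.",
    "I don't watch cricket. Netflix and Amazon Prime have very good movies to watch.",
    "Movies are a nice way to chill however, this time I would like to paint "
    "and read some good books. It's been so long!",
    "This blueberry milkshake is so good! Try reading Dr. Joe Dispenza's books. "
    "His work is such a game-changer! His books helped to learn so much about "
    "how our thoughts impact our biology and how we can all rewire our brains.",
]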
Illustration – Output Goal
• Here, P implies that the respective topic is present in the current document and 0 indicates the absence of the topic in the document.
• And, if a topic is present in a document, then the value assigned to it (random as of now) conveys how much weight that topic has in that particular document.
Illustration
• A document may be a combination of many topics. Our intention with topic modeling is to find the main, dominant topic or theme.
• We will be working with the same set of documents in the parts that follow.
Uses of Topic Modelling
• Document Categorization: The goal is to categorize or classify a large
set of documents into different categories based on the common
underlying theme.
• Document Summarization: It is a very handy tool for generating
summaries of large documents; say in our case we want to summarize
the large stack of research papers.
• Intent Analysis: Intent analysis determines what each sentence (or tweet, post, or complaint) refers to; it tells what the text in a particular document is trying to convey.
Topic Modeling Tools and Types of Models

• There are many methods for topic modeling, such as:
Latent Dirichlet Allocation (LDA)
Latent Semantic Analysis (LSA)
Non-negative Matrix Factorization (NNMF)
• Of the above techniques, we will dive into LDA as it is a very popular
method for extracting topics from textual data.
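• A minimal end-to-end sketch with scikit-learn (one of the libraries named in the references); the number of topics and other parameter values below are illustrative assumptions, not part of the original material:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# documents is the five-document illustration corpus defined earlier
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)              # document-term matrix

lda = LatentDirichletAllocation(n_components=3, random_state=42)
doc_topic = lda.fit_transform(dtm)                     # document-topic weights per document

# print the top 5 words of each extracted topic
terms = vectorizer.get_feature_names_out()             # get_feature_names() in older scikit-learn
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"Topic {k}: {top}")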
Types of Models
• There are two types of model available:
• Discriminative models: Discriminative models, also called conditional models, are mostly used for supervised learning problems. This type of model uses conditional probabilities to predict: the model learns by estimating the conditional probability distribution P(Y|X), the probability of Y given X, i.e., the chances of event Y occurring given event X. It is applied in business cases related to regression and classification.
• Discriminative models draw boundaries and differentiate the classes in the observed data, such as defect or no defect, having a disease or no disease.
Discriminative Models

• These models are applied in all spheres of artificial intelligence:
• Logistic Regression
• Decision Tree
• Random Forest
• Support Vector Machine (SVM)
• Traditional Neural Network
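• A minimal discriminative-model sketch using scikit-learn's LogisticRegression on synthetic data (purely illustrative); predict_proba returns the conditional probability P(Y|X) discussed above:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])   # observed feature X
y = np.array([0, 0, 0, 1, 1, 1])                           # class label Y (e.g. defect / no defect)

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[3.5]]))   # [P(Y=0 | X=3.5), P(Y=1 | X=3.5)]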
Generative Models
• On the other hand, generative models use statistics to generate or create new data. These models estimate probabilities using the joint probability distribution P(X, Y). They not only estimate the probabilities but also model the data points and differentiate the classes based on these computed probabilities of the class labels.
• Compared to discriminative models, generative models can handle more complicated tasks and have the ability to create more data to build the model on. They are unsupervised learning techniques used to discover the hidden patterns within the data.
Generative Models
• Examples of generative models include:
• Gaussian Mixture Model (GMM)
• Hidden Markov Model (HMM)
• Linear Discriminant Analysis
• Generative Adversarial Networks (GANs)
• Autoencoders
• Boltzmann Machines
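• As a minimal sketch of the “generate new data” ability described above, a Gaussian Mixture Model from the list can be fitted with scikit-learn and then sampled (synthetic data, illustrative values):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),      # two synthetic clusters of points
               rng.normal(5, 1, (50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
new_points, components = gmm.sample(10)        # generate 10 new points from the learned joint distribution
print(new_points[:3])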
LDA

• The topic modeling technique Latent Dirichlet Allocation (LDA) is also a breed of generative probabilistic model. It generates probabilities that help extract topics from the words and collate documents with similar topics.
Agenda – Part 2
• A Little Background about LDA
• Latent Dirichlet Allocation (LDA) and its Process
• How does LDA work and how will it derive the particular distributions?
• Vector Space of LDA
• How will LDA optimize the distributions?
• LDA is an Iterative Process and thus obtained through optimization
A Little Background about LDA
• Latent Dirichlet Allocation (LDA) is a popular topic modeling technique to
extract topics from a given corpus.
• The term latent conveys something that exists but is not yet developed. In other words, latent means hidden or concealed.
• The topics that we want to extract from the data are also “hidden topics” that are yet to be discovered; hence the term “latent” in LDA.
• The Dirichlet allocation is named after the Dirichlet distribution and Dirichlet process.
• Named after the German mathematician Peter Gustav Lejeune Dirichlet, Dirichlet processes in probability theory are “a family of stochastic processes whose realizations are probability distributions.”
LDA
• The Dirichlet model describes the pattern of words that repeat together, occur frequently, and are similar to each other.
• In the case of topic modeling, the process helps estimate the chances that the words spread over a document will occur again. This enables the model to build data points and estimate probabilities, which is why LDA is a breed of generative probabilistic model.
• LDA generates probabilities for the words, from which the topics are formed, and eventually the topics are classified into the documents.
LDA and its Process
• A tool and technique for topic modeling, Latent Dirichlet Allocation (LDA) classifies or categorizes the text into documents and the words per topic; these are modeled based on Dirichlet distributions and processes.
• The LDA makes two key assumptions:
Documents are a mixture of topics, and
Topics are a mixture of tokens (or words)
• In statistical language, the documents are known as the probability
density (or distribution) of topics and the topics are the probability
density (or distribution) of words.
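• In symbols (a standard way of writing these assumptions): for a document d and a word w, P(w | d) = Σ over topics k of P(w | topic k) × P(topic k | d); that is, each document's word distribution is a mixture of topic-word distributions weighted by that document's topic proportions.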
How does LDA work and how will it derive the particular distributions?

LDA applies the above two important assumptions to the given corpus
We have the corpus with the following five documents:
• Document 1: I want to watch a movie this weekend.
• Document 2: I went shopping yesterday. New Zealand won the World Test
Championship by beating India by eight wickets at Southampton.
• Document 3: I don’t watch cricket. Netflix and Amazon Prime have very good movies to
watch.
• Document 4: Movies are a nice way to chill however, this time I would like to paint and
read some good books. It’s been so long!
• Document 5: This blueberry milkshake is so good! Try reading Dr. Joe Dispenza’s books.
His work is such a game-changer! His books helped to learn so much about how our
thoughts impact our biology and how we can all rewire our brains.
How does LDA work and how will it derive the particular distributions?

• Any corpus, which is the collection of documents, can be represented as a document-word matrix (or document-term matrix), also known as a DTM.
• We know the first step with text data is to clean, preprocess, and tokenize the text into words. After preprocessing the documents, we get the following document-word matrix, where:
• D1, D2, D3, D4, and D5 are the five documents, and
• the words are represented by Ws; say there are 8 unique words, from W1 to W8.
How does LDA work and how will it derive the particular distributions?

• Hence, the shape of the matrix is 5 * 8 (five rows and eight columns).
• So, the corpus is now essentially this preprocessed document-word matrix, in which every row is a document and every column is a token or word.
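• A short sketch of building this document-word matrix for the five illustration documents with scikit-learn's CountVectorizer (the real vocabulary will be larger than the 8 words of the toy matrix above):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)          # documents = the five-document corpus defined earlier
print(dtm.shape)                                   # (5, V): 5 rows (documents), V columns (unique words)
print(vectorizer.get_feature_names_out())          # the vocabulary, i.e. the columns of the DTM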
How does LDA work and how will it derive the particular distributions?
• LDA converts this document-word matrix into two other matrices: a Document-Topic matrix and a Topic-Word matrix.
How does LDA work and how will it derive the particular distributions?

• The Document-Topic matrix already contains the possible topics (represented by K) that the documents can contain. Here, suppose we have 6 topics and 5 documents, so this matrix has dimension 5*6.
• The Topic-Word matrix has the words (or terms) that those topics can contain. We have 6 topics and 8 unique tokens in the vocabulary, hence this matrix has a shape of 6*8.
• The LDA model has two parameters that control the distributions:
Alpha (ɑ) controls the per-document topic distribution, and
Beta (ꞵ) controls the per-topic word distribution
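• A minimal gensim sketch (gensim is one of the libraries named in the references): its LdaModel exposes these two priors as the alpha and eta arguments (eta corresponds to beta above); the values below are illustrative assumptions:

from gensim import corpora
from gensim.models import LdaModel

tokenized = [doc.lower().split() for doc in documents]     # naive tokenization of the five documents
dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=6,
               alpha="auto", eta="auto",                   # per-document topic prior and per-topic word prior
               passes=10, random_state=42)

print(lda.print_topics(num_words=5))                       # topic-word distributions
print(lda.get_document_topics(bow_corpus[0]))              # document-topic distribution for D1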
How will LDA optimize the distributions?

• The end goal of LDA is to find the most optimal representation of the Document-Topic matrix and the Topic-Word matrix, i.e., the most optimized Document-Topic and Topic-Word distributions.

• As LDA assumes that documents are a mixture of topics and topics are a mixture of words, LDA backtracks from the document level to identify which topics would have generated these documents and which words would have generated those topics.
How will LDA optimize the distributions?

• Now, our corpus has 5 documents (D1 to D5), each with its respective words:
• D1 = (w1, w2, w3, w4, w5, w6, w7, w8)
• D2 = (w'1, w'2, w'3, w'4, w'5, w'6, w'7, w'8, w'9, w'10)
• D3 = (w''1, w''2, w''3, w''4, w''5, w''6, w''7, w''8, w''9, w''10, w''11, w''12, w''13, w''14, w''15)
• D4 = (w'''1, w'''2, w'''3, w'''4, w'''5, w'''6, w'''7, w'''8, w'''9, w'''10, w'''11, w'''12)
• D5 = (w''''1, w''''2, w''''3, w''''4, w''''5, w''''6, w''''7, w''''8, w''''9, w''''10, …, w''''32, w''''33, w''''34)
LDA is an iterative process
• The first iteration of LDA:
• In the first iteration, it randomly assigns the topics to each word in the document.
The topics are represented by the letter k. So, in our corpus, the words in the
documents will be associated with some random topics like below:
• D1 = (w1 (k5), w2 (k3), w3 (k1), w4 (k2), w5 (k5), w6 (k4), w7 (k7), w8 (k1))
• D2 = (w'1 (k2), w'2 (k4), w'3 (k2), w'4 (k1), w'5 (k2), w'6 (k1), w'7 (k5), w'8 (k3), w'9 (k7), w'10 (k1))
• D3 = (w''1 (k3), w''2 (k1), w''3 (k5), w''4 (k3), w''5 (k4), w''6 (k1), …, w''13 (k1), w''14 (k3), w''15 (k2))
• D4 = (w'''1 (k4), w'''2 (k5), w'''3 (k3), w'''4 (k6), w'''5 (k5), w'''6 (k3), …, w'''10 (k3), w'''11 (k7), w'''12 (k1))
• D5 = (w''''1 (k1), w''''2 (k7), w''''3 (k2), w''''4 (k8), w''''5 (k1), w''''6 (k8), …, w''''32 (k3), w''''33 (k6), w''''34 (k5))
LDA is an iterative process
• This gives the output as documents with a composition of topics and topics composed of words:
• The documents are mixtures of the topics:
• D1 = k5 + k3 + k1 + k2 + k5 + k4 + k7 + k1
• D2 = k2 + k4 + k2 + k1 + k2 + k1 + k5 + k3 + k7 + k1
• D3 = k3 + k1 + k5 + k3 + k4 + k1 + … + k1 + k3 + k2
• D4 = k4 + k5 + k3 + k6 + k5 + k3 + … + k3 + k7 + k1
• D5 = k1 + k7 + k2 + k8 + k1 + k8 + … + k3 + k6 + k5
LDA in First Iteration Process

• The topics are mixtures of the words:
• K1 = w3 + w8 + w'4 + w'6 + w'10 + w''2 + w''6 + … + w''13 + w'''12 + w''''1 + w''''5
• K2 = w4 + w'1 + w'3 + w''15 + … + w''''3 + …
• K3 = w2 + w'8 + w''1 + w''4 + w''14 + w'''3 + w'''6 + … + w'''10 + w''''32 + …
• Similarly, LDA will give the word combinations for the other topics.
Post the first iteration of LDA
• After the first iteration, LDA provides initial document-topic and topic-word matrices. The task at hand is to optimize these results, which LDA does by iterating over all the documents and all the words.
• LDA makes another assumption: all the topic assignments made so far are correct except for the current word. So, based on those already-made topic-word assignments, LDA tries to correct and adjust the topic assignment of the current word with a new assignment, for which:
• LDA will iterate over each document ‘D’ and each word ‘w’.
Post the first iteration of LDA
• How does it do that? It computes two probabilities, p1 and p2, for every topic (k), where:
• p1: the proportion of words in the document (D) that are currently assigned to the topic (k)
• p2: the proportion of those documents in which the word (w) is also assigned to the topic (k)
• Now, using these probabilities p1 and p2, LDA estimates a new probability, the product p1*p2, and through this product probability LDA identifies the new topic, which is the most relevant topic for the current word.
Completion Process of LDA
• Reassignment of word ‘w’ of the document ‘D’ to a new topic ‘k’ happens via the product probability p1 * p2.
• LDA repeats this step of choosing a new topic ‘k’ for a large number of iterations until a steady state is obtained. The convergence point of LDA is where it gives the most optimized representation of the document-topic matrix and the topic-word matrix.
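• A schematic Python sketch of the reassignment loop described above (collapsed-Gibbs style). It is illustrative only: p2 is computed here as the proportion of all other occurrences of the word currently assigned to topic k (one common reading of the step above), and real implementations such as gensim or MALLET add the alpha/beta smoothing priors and many efficiency tricks:

import random

def lda_iterate(docs, K, iterations=50):
    # docs: list of token lists; K: number of topics
    # first iteration: randomly assign a topic to every word
    assignment = [[random.randrange(K) for _ in doc] for doc in docs]
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                scores = []
                for k in range(K):
                    # p1: proportion of the other words in document d assigned to topic k
                    same_doc = [z for j, z in enumerate(assignment[d]) if j != i]
                    p1 = sum(z == k for z in same_doc) / len(same_doc) if same_doc else 1.0 / K
                    # p2: proportion of the other occurrences of word w (in any document) assigned to topic k
                    other = [assignment[d2][j2]
                             for d2, doc2 in enumerate(docs)
                             for j2, w2 in enumerate(doc2)
                             if w2 == w and not (d2 == d and j2 == i)]
                    p2 = sum(z == k for z in other) / len(other) if other else 1.0 / K
                    scores.append(p1 * p2)
                # reassign word w to the topic with the highest p1 * p2
                assignment[d][i] = max(range(K), key=scores.__getitem__)
    return assignment

# usage on the illustration corpus (naive whitespace tokenization):
# topics = lda_iterate([doc.lower().split() for doc in documents], K=6, iterations=20)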
References
• https://www.analyticsvidhya.com/blog/2021/06/topic-modeling-and-latent-dirichlet-allocationlda-using-gensim-and-sklearn-part-1/
• https://www.analyticsvidhya.com/blog/2021/06/part-2-topic-modeling-and-latent-dirichlet-allocation-lda-using-gensim-and-sklearn/
• https://www.analyticsvidhya.com/blog/2021/06/part-3-topic-modeling-and-latent-dirichlet-allocation-lda-using-gensim-and-sklearn/
