Topic Modelling
LDA
SESSION – 22-23
AGENDA
• Basics of how text data is seen in Natural Language Processing
• What are topics?
• What is topic modeling?
• What are the applications of topic modeling?
• Topic Modeling Tools and Types of Models
• Discriminative Models
• Generative Models
Sample Problem
• Let’s say you have a client who owns a publishing house. Your client
comes to you with two tasks: first, he wants to categorize all the books
or research papers he receives weekly by a common theme or topic;
second, he wants to encapsulate large documents into smaller,
bite-sized texts. Is there any technique or tool available that can do
both of these tasks?
What are Topics?
LDA rests on two important assumptions: every document is a mixture of
topics, and every topic is a mixture of words. It applies these two
assumptions to the given corpus.
We have the corpus with the following five documents:
• Document 1: I want to watch a movie this weekend.
• Document 2: I went shopping yesterday. New Zealand won the World Test
Championship by beating India by eight wickets at Southampton.
• Document 3: I don’t watch cricket. Netflix and Amazon Prime have very good movies to
watch.
• Document 4: Movies are a nice way to chill however, this time I would like to paint and
read some good books. It’s been so long!
• Document 5: This blueberry milkshake is so good! Try reading Dr. Joe Dispenza’s books.
His work is such a game-changer! His books helped to learn so much about how our
thoughts impact our biology and how we can all rewire our brains.
How does LDA work and how will it derive the particular
distributions?
• After preprocessing, the corpus contains eight unique tokens; hence, the
shape of the document-word matrix is 5 × 8 (five rows and eight columns).
• So, the corpus is now this preprocessed document-word matrix, in which
every row is a document and every column is a token (word); see the
sketch below.
How does LDA work and how will it derive the particular
distributions?
• LDA converts this document-word matrix into two other matrices: a
Document-Topic matrix and a Topic-Word matrix, as sketched below:
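A minimal sketch of this factorization, reusing dtm from the sketch above and scikit-learn's LatentDirichletAllocation; the choice of library and of two topics are assumptions made for illustration.

from sklearn.decomposition import LatentDirichletAllocation

n_topics = 2  # assumed number of topics
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)

doc_topic = lda.fit_transform(dtm)  # Document-Topic matrix: (5, n_topics)
topic_word = lda.components_        # Topic-Word matrix: (n_topics, vocab size)

print(doc_topic.shape, topic_word.shape)

Each row of doc_topic is one document's distribution over topics, while each row of topic_word scores every vocabulary token for one topic.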
How does LDA work and how will it derive the particular distributions?
• The end goal of LDA is to find the optimal Document-Topic and
Topic-Word matrices, i.e., the Document-Topic and Topic-Word
distributions that best explain the observed corpus.
• As LDA assumes that documents are a mixture of topics and topics are
a mixture of words, it backtracks from the document level to identify
which topics could have generated these documents and which words
could have generated those topics (see the decomposition below).
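In other words, under these two assumptions the probability of word w appearing in document d decomposes over the latent topics k (the notation here is ours, added for clarity, not the slide's):

p(w | d) = Σk p(w | k) · p(k | d)

where p(w | k) is an entry of the Topic-Word matrix and p(k | d) an entry of the Document-Topic matrix; optimizing the two matrices means making this reconstruction of the corpus as likely as possible.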
How will LDA optimize the distributions?
• Recall that our corpus has five documents (D1 to D5), each with its own
words (a tokenization sketch follows the list below):
• D1 = (w1, w2, w3, w4, w5, w6, w7, w8)
• D2 = (w'1, w'2, w'3, w'4, w'5, w'6, w'7, w'8, w'9, w'10)
• D3 = (w''1, w''2, w''3, w''4, w''5, w''6, w''7, w''8, w''9, w''10, w''11,
w''12, w''13, w''14, w''15)
• D4 = (w'''1, w'''2, w'''3, w'''4, w'''5, w'''6, w'''7, w'''8, w'''9, w'''10,
w'''11, w'''12)
• D5 = (w''''1, w''''2, w''''3, w''''4, w''''5, w''''6, w''''7, w''''8, w''''9,
w''''10, …, w''''32, w''''33, w''''34)
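A minimal sketch of splitting each document into its word list, reusing corpus from the first sketch. The slide's word counts (8, 10, 15, 12, 34) reflect its own preprocessing, so this deliberately simple tokenizer will generally give different lengths.

import re

# Tokenize each document into lowercase words (a simple, assumed scheme).
docs = {f"D{i + 1}": re.findall(r"[a-z']+", text.lower())
        for i, text in enumerate(corpus)}

for name, words in docs.items():
    print(name, len(words), words[:5])  # document id, word count, first tokens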
LDA is an iterative process
• The first iteration of LDA:
• In the first iteration, LDA randomly assigns a topic to each word in each
document. The topics are represented by the letter k. So, in our corpus, the
words in the documents will be associated with random topics, like below
(a sketch of this step follows the list):
• D1 = (w1(k5), w2(k3), w3(k1), w4(k2), w5(k5), w6(k4), w7(k7), w8(k1))
• D2 = (w'1(k2), w'2(k4), w'3(k2), w'4(k1), w'5(k2), w'6(k1), w'7(k5), w'8(k3),
w'9(k7), w'10(k1))
• D3 = (w''1(k3), w''2(k1), w''3(k5), w''4(k3), w''5(k4), w''6(k1), …, w''13(k1),
w''14(k3), w''15(k2))
• D4 = (w'''1(k4), w'''2(k5), w'''3(k3), w'''4(k6), w'''5(k5), w'''6(k3), …,
w'''10(k3), w'''11(k7), w'''12(k1))
• D5 = (w''''1(k1), w''''2(k7), w''''3(k2), w''''4(k8), w''''5(k1), w''''6(k8), …,
w''''32(k3), w''''33(k6), w''''34(k5))
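A minimal sketch of this first iteration, reusing docs from the tokenization sketch. K = 8 is an assumption matching the k1–k8 labels on the slide; since the assignments are random, they will not reproduce the slide's example exactly.

import random

K = 8           # assumed number of topics, matching labels k1..k8
random.seed(0)  # fixed seed so the run is repeatable

# Pair every word of every document with a uniformly random topic id.
assignments = {name: [(w, random.randint(1, K)) for w in words]
               for name, words in docs.items()}

print(assignments["D1"])  # e.g. [('i', 5), ('want', 3), ...]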
LDA is an iterative process
• This gives, as output, documents composed of topics and topics composed
of words (the sketch after this list tallies the compositions):
• The documents are mixtures of the topics:
• D1 = k5 + k3 + k1 + k2 + k5 + k4 + k7 + k1
• D2 = k2 + k4 + k2 + k1 + k2 + k1 + k5 + k3 + k7 + k1
• D3 = k3 + k1 + k5 + k3 + k4 + k1 + … + k1 + k3 + k2
• D4 = k4 + k5 + k3 + k6 + k5 + k3 + … + k3 + k7 + k1
• D5 = k1 + k7 + k2 + k8 + k1 + k8 + … + k3 + k6 + k5
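A minimal sketch of reading off each document's topic composition from the random assignments above: tallying how often each topic id occurs per document gives exactly these mixtures.

from collections import Counter

# Count topic occurrences per document; each Counter is the document's
# (unnormalized) topic mixture for this iteration.
for name, pairs in assignments.items():
    topic_counts = Counter(topic for _, topic in pairs)
    print(name, dict(topic_counts))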
LDA in the First Iteration