Unit-06
DADS402
UNSTRUCTURED DATA ANALYSIS
Unit 6
Topic Modelling
Introduction:
Topic modelling is a technique used in natural language processing (NLP) and machine
learning to identify and extract latent topics or themes within a collection of textual data.
The goal is to discover the underlying structure and meaning of the text, by grouping
together similar words and phrases that are likely to appear in the same context.
The most commonly used algorithm for topic modelling is Latent Dirichlet Allocation (LDA),
which assumes that each document is a mixture of topics, and each topic is a probability
distribution over a set of words. LDA works by iteratively assigning words to topics and
adjusting the probabilities until the model converges.
Topic modelling can be applied in various domains, such as social media analysis, content
recommendation, and customer feedback analysis. It allows researchers and analysts to gain
insights into the main themes and trends within a large volume of unstructured text data,
and to explore the relationships and patterns between different topics.
Learning Objectives:
By the end of unit 6, the learners should be able to:
Topic modelling is a quick and straightforward way to begin analysing data because it requires no training. However, there is no guarantee that the results you get will be accurate, which is why many companies choose to invest time in training a topic classification model instead.
Data is manually tagged with these topics so that a topic classifier can learn from it and subsequently make predictions on its own.
Say, for instance, that you work for a software company and want to examine customer feedback on a new data analysis tool you recently released. The first step would be to compile a list of categories (topics) that apply to the new functionality. You would then need to use data samples to train your topic classifier on exactly how to tag each text with these predetermined topic tags, such as Data Analysis, Features, and User Experience.
Although topic classification requires more work, it produces more accurate results than unsupervised methods, which means you will gain access to more useful insights that can support data-driven decisions. You might say that unsupervised approaches are a temporary fix, whereas supervised techniques are more of a long-term solution that will help your company grow.
To help you better understand the distinctions between topic classification and automatic topic modelling, let's look at a few examples below.
By identifying patterns and recurring words, topic modelling can be used to infer the topics of a collection of customer reviews. Let's look at how the Eventbrite review below might be grouped using an "unsupervised" method, for instance:
When you aren't charging for the event, Eventbrite is nice because it is free to use. If you intend to charge for the event, there would be a fee of 7.5% + $0.98 per transaction.
Topic modelling can associate this review with other reviews that discuss related topics by
detecting phrases and terms like "free to use," "charge," "charging," and "7.5% plus 98 cents
transaction fee" (these may or may not be about pricing).
A topic classification model can also be used to find out what topics customers are discussing in open-ended survey responses, social media posts, and customer reviews, to name a few sources. These supervised procedures, however, take a different approach. Rather than attempting to infer which similarity cluster a review belongs to, classification models automatically tag it with predefined topic tags. Consider this review of SurveyMonkey:
We use our gold level plan extensively and adore its features. It offers the greatest value for
the money.
A topic classification model trained to understand expressions such as "gold level plan," "love the features," and "best bang for the buck" would categorize this review under the topics Features and Price.
In summary, topic modelling algorithms produce collections of expressions and words that
they believe to be related, leaving you to decipher the meaning of these relationships,
whereas topic classification algorithms produce topics that are neatly packaged, with labels
like Price and Features that take the guesswork out of the equation.
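As an illustration of the supervised approach described above, a minimal topic classifier can be sketched with scikit-learn. The tiny training set and the Price/Features labels below are invented for illustration; a real classifier would need far more labelled data.

```python
# A minimal supervised topic-classification sketch using scikit-learn.
# The training texts and the Price/Features labels are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "The subscription fee is too high for what you get",
    "Great value for the money, pricing is fair",
    "The export feature saves me hours every week",
    "I love the new dashboard features",
]
train_labels = ["Price", "Price", "Features", "Features"]

# Vectorize the texts with tf-idf, then fit a simple linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

# Tag a new, unseen review with one of the predefined topic labels.
print(clf.predict(["We adore the gold plan features"]))
```

Because the model was trained on labelled examples, its output is one of the predefined topic tags rather than an unlabelled cluster.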
Self-Assessment Questions:
1. In topic modelling, the goal is to identify and extract ___________ within a collection of
textual data.
2. The most used algorithm for topic modelling is ___________.
3. Topic modelling can be applied in various domains, such as ___________.
3. TOPIC MODELLING VS TOPIC CLASSIFICATION
Topic modelling and topic classification have one thing in common: they are the two methods most frequently applied for topic analysis. Apart from that, they are very different, and your choice between them will likely be influenced by several variables.
At the end of your topic modelling investigation, you will obtain collections of documents that the algorithm has grouped together, as well as the groups of words and expressions that it used to infer these relations.
On the other hand, supervised machine learning algorithms provide neatly packaged findings with topic labels like Price and UX. They do require more setup time, because you must train them by labelling datasets with a predefined list of topics. However, if you label your texts precisely and refine your criteria, you will be rewarded with a model that can correctly categorize unseen texts by topic, along with useful findings.
You'll probably be satisfied with a topic modelling method if you don't have a lot of time to
analyze texts or if you don't need a fine-grained analysis and merely want to know what
topics several texts are discussing.
We'll go into more depth about how each of these machine learning algorithms functions
now that we've clarified the distinctions between topic modelling and topic categorization.
Be prepared for things to get a little more technical.
Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are two topic
modelling techniques.
4. LATENT SEMANTIC ANALYSIS (LSA)
Latent semantic analysis (LSA) is one of the topic modelling techniques that analysts employ most frequently. It is founded on the so-called distributional hypothesis, which claims that the semantics of words can be understood by examining the contexts in which the words appear.
In other words, according to this hypothesis, two words will have similar meanings if they frequently appear in similar contexts.
Accordingly, LSA computes word frequencies throughout the documents and the entire corpus and assumes that documents on a similar subject will generally have a similar distribution of word frequencies. Each document is then treated as a collection of words, with any syntactic information (such as word order) and semantic information (such as the multiple possible meanings of a given word) being disregarded.
Tf-idf is the method most commonly used to calculate these word frequencies. It weights each term by considering both its frequency within a specific document and its frequency across the entire corpus. Terms that appear often in a given document but are rare across the corpus receive high weights and are good candidates for representing that document, whereas terms that are common throughout the corpus are down-weighted regardless of how many times they appear in a single document. Tf-idf representations are therefore substantially better than representations that only consider word frequencies at the document level.
Once tf-idf frequencies have been calculated, we may build a Document-term matrix that records the tf-idf value of each term in each document. This matrix has one row for every document in the corpus and one column for every term under consideration.
Ref: https://monkeylearn.com/blog/introduction-to-topic-modeling/
Using singular value decomposition (SVD), this Document-term matrix can be factored into the product of three matrices, U, S, and V. The U matrix is known as the Term-topic matrix, and the V matrix as the Document-topic matrix.
Fig 2: Sub-division of the row and column into the various documents.
Ref: https://monkeylearn.com/blog/introduction-to-topic-modeling/
Because linear algebra guarantees that the S matrix is diagonal, LSA treats each singular value, that is, each number on the main diagonal of S, as a potential topic found in the documents.
If we keep the largest t singular values together with the first t columns of U and the first t rows of V, we obtain the t most prominent topics in our original Document-term matrix. Because it does not retain all the singular values of the original matrix, this is called a truncated SVD, and to use it for LSA we must set the value of t as a hyperparameter.
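As a sketch of truncated SVD in practice, scikit-learn's `TruncatedSVD` keeps the t largest singular values of a Document-term matrix; the toy corpus and the choice t = 2 below are arbitrary.

```python
# LSA via truncated SVD: keep only the t largest singular values of the
# Document-term matrix. The corpus and t = 2 are illustrative choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "event tickets and fees",
    "ticket pricing for events",
    "survey data analysis",
    "analysing survey responses",
]

dtm = TfidfVectorizer().fit_transform(docs)

svd = TruncatedSVD(n_components=2)   # t, the number of topics to keep
doc_topic = svd.fit_transform(dtm)   # Document-topic matrix (4 x 2)
term_topic = svd.components_         # Topic-term matrix (2 x n_terms)

print(doc_topic.shape, term_topic.shape)
```

The rows of `doc_topic` give each document's position in the reduced topic space, and the rows of `term_topic` show which terms weigh most heavily in each topic.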
By examining the vectors that make up the U and V matrices, it is possible to evaluate, using various methodologies, the quality of the topic assignment for each document and the quality of the words assigned to each topic.
5. LATENT DIRICHLET ALLOCATION (LDA)
The distributional hypothesis (i.e., similar topics use similar words) and the statistical mixture hypothesis (i.e., documents discuss a variety of topics, for which a statistical distribution can be determined) serve as the foundation for both Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). LDA maps each document in our corpus to a collection of topics that covers most of its words.
To map the documents to a list of topics, LDA assigns topics to combinations of words, such as "best player" for a topic relating to sports.
This is based on the presumption that word choice and word placement reflect the topics of documents. Like LSA, LDA treats documents as collections of words and ignores syntactic information. Additionally, it presupposes that each word in a document can be assigned a probability of belonging to a particular topic. With that said, LDA's objective is to identify the mixture of topics that a text contains.
LDA thus assumes that topics and documents have the following structure:
Ref: https://monkeylearn.com/blog/introduction-to-topic-modeling/
The major distinction between LDA and LSA is that LDA assumes that the distribution of topics within a document and the distribution of words within topics are Dirichlet distributions. Since LSA makes no assumptions about these distributions, its vector representations of topics and documents are more opaque.
Two hyperparameters, alpha and beta, control document and topic similarity, respectively. A low alpha value assigns fewer topics to each document, whereas a high value assigns more. Likewise, a low beta value models each topic with fewer words than a high value, which makes the topics less similar to one another.
The number of topics the algorithm will detect must be set as a third hyperparameter when
LDA is used because the algorithm cannot choose the number of topics on its own.
The algorithm's output is a vector containing the coverage of each topic for the document being modelled. It will look like [0.2, 0.5, ...], where the first value represents the proportion of the document devoted to the first topic, and so on. When properly compared, these vectors can help you understand the topical properties of your corpus.
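A minimal sketch of fitting LDA and reading off these topic-coverage vectors, using scikit-learn; the toy corpus, the number of topics, and the alpha/beta values below are all illustrative.

```python
# Fitting LDA and inspecting the per-document topic-coverage vector.
# n_components (number of topics), the priors, and the corpus are toy
# choices for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "ticket fees and event pricing",
    "free events and ticket sales",
    "survey data analysis tools",
    "analysing survey responses with data tools",
]

counts = CountVectorizer().fit_transform(docs)  # LDA works on raw counts

lda = LatentDirichletAllocation(
    n_components=2,        # number of topics, set by hand
    doc_topic_prior=0.5,   # alpha: low -> fewer topics per document
    topic_word_prior=0.1,  # beta: low -> fewer dominant words per topic
    random_state=0,
)
doc_topics = lda.fit_transform(counts)  # one coverage vector per document

# Each row sums to 1 and gives that document's topic proportions.
print(doc_topics[0])
```

Note that all three hyperparameters discussed above appear in the constructor: `n_components` is the number of topics, while `doc_topic_prior` and `topic_word_prior` correspond to alpha and beta.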
You can consult the original LDA paper for further details on how those probabilities are
calculated, the statistical distributions that the algorithm takes for granted, or how to use
LDA.
You may also want to read about cosine similarity and other similarity measures to learn more about comparing vector representations, whether to gain insights into document similarity or into the distribution of topics across a document corpus. The output vectors of both LSA and LDA can be compared using any of these measures to determine how similar they are.
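For instance, cosine similarity between two topic-coverage vectors can be computed directly from its definition; the two vectors below are made-up LDA-style outputs.

```python
# Cosine similarity between two topic-coverage vectors, computed from
# the definition: dot(u, v) / (|u| * |v|). The vectors are made up.
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

doc_a = [0.2, 0.5, 0.3]   # topic proportions for one document
doc_b = [0.1, 0.6, 0.3]   # topic proportions for another document

# A value near 1 means the two documents have very similar topic mixes.
print(round(cosine_similarity(doc_a, doc_b), 3))
```

Identical vectors score 1.0, orthogonal ones 0.0, so the measure directly answers how alike two documents' topic distributions are.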
Summary
Terminal Questions
Answers
Terminal Answers
1. All about topic modelling: Topic modelling is a statistical technique used in natural
language processing to uncover hidden topics or themes within a collection of text
documents. (Refer to section 1 for more details.)
9. Working of LSA: LSA uses singular value decomposition (SVD) to convert a large
matrix of term-document frequencies into a lower-dimensional space, where the
rows represent the documents and the columns represent the terms. This allows LSA
to identify the underlying semantic relationships between terms and documents,
even when they do not share exact word matches. (Refer to section 4 for more
details).
10. Applications of LSA: LSA has a wide range of applications in information retrieval,
document classification, and text mining. It can be used for tasks such as document
clustering, topic modeling, and recommendation systems. It has also been applied to
fields such as biology, chemistry, and finance. (Refer to section 4 for more details).
References:
1. Topic Modeling: An Introduction (monkeylearn.com)
2. The Complete Practical Guide to Topic Modelling, by Kajal Yadav (Towards Data Science)
3. Topic Modeling: Algorithms, Techniques, and Application (DataScienceCentral.com)