Topic Analysis Presentation
AHMED GHAZI - 2020380065
SHOAIB MUHAMMAD - 2020380043
AMER GHALEB - 2020380072
NADER AL-FAKEEH - 2021380082
QOTBEDDINE - 2021380096
1. INTRODUCTION
2. OVERVIEW
3. TECHNICAL REPORT
4. WORKING MODEL
5. DESCRIPTION OF DATA
6. HIGHLIGHTS OF ALGORITHMS
TABLE OF CONTENTS
INTRODUCTION
Topic analysis, also known as topic identification, topic modelling, or topic extraction, is a machine learning approach that organizes and makes sense of large volumes of text data by assigning "tags" or categories based on the subject or theme of each paragraph.
Topic analysis uses natural language processing (NLP) to interpret human language, allowing you to identify patterns and reveal semantic structures in texts, extract insights, and support data-driven decision-making. The two most commonly used machine learning methods for topic analysis are NLP topic modelling and NLP topic classification.
• Our dataset consists of texts by three horror writers: Mary Wollstonecraft Shelley, H. P. Lovecraft, and Edgar Allan Poe. Using three classification algorithms, we achieved strong results.
• We applied Naïve Bayes classifiers, logistic regression, and XGBoost classifiers in an effective manner that uses little GPU power and memory while producing excellent results. We used the wordcloud library for data visualization.
OVERVIEW
When analyzing large volumes of text data quickly and affordably, topic tagging comes in handy. This includes text data
from internal papers, consumer correspondence, and the internet.
There are many scope levels at which topic analysis can be used:
Document level: the topic model extracts the various subjects from a whole text, for instance the subjects of a news story or an email.
Sentence level: the topic model obtains the subject of a single sentence, for instance the headline of a news item.
Sub-sentence level: the topic model retrieves the subjects of sub-expressions within a sentence, for instance varying subjects within a single product review sentence.
CONCERNING THIS TOPIC
Requirements:
- A GPU with a high CUDA core count (dedicated graphics is a plus)
- NLTK
- TensorFlow
- Matplotlib
- Keras
- XGBoost
- wordcloud
- Pandas
- NumPy
- IDE: Anaconda (3)
TECHNICAL REPORT
A word cloud is a data visualization technique for text data in which the size of each word indicates its relevance or frequency. A word cloud can be used to emphasize important textual data points, and word clouds are frequently used to analyze data from social networking sites.
Matplotlib, Pandas, and wordcloud are the modules required to generate a word cloud in Python. The following commands install these packages:
• pip install matplotlib
• pip install pandas
• pip install wordcloud
The word cloud generator uses a dataset that was gathered from the Kaggle data repository.
Note:
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners.
Python | Lemmatization with NLTK
Lemmatization is the process of grouping a word's many inflected forms into a single item so that they can be analysed as one unit. Lemmatization is comparable to stemming, but lemmatization brings context to words; it therefore maps terms with similar meanings to a single word.
Examples of lemmatization:
-> rocks : rock
-> corpora : corpus
-> better : good
What is Vectorization?
Vectorization speeds up Python code by avoiding explicit loops; making effective use of vectorized functions can substantially reduce running time.
Many operations are carried out on vectors: element-wise multiplication, which multiplies elements with the same indexes and keeps the vector's dimension unchanged; the outer product, which yields a square matrix with dimension equal to length × length of the vectors; and the dot product of vectors, also known as the scalar product because it yields a single number.
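The three vector operations just described can be sketched with NumPy:

```python
# The vector operations described above, vectorized with NumPy.
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

elementwise = a * b     # same-index products, same shape: [4, 10, 18]
outer = np.outer(a, b)  # 3x3 square matrix: outer[i, j] = a[i] * b[j]
dot = np.dot(a, b)      # single scalar: 1*4 + 2*5 + 3*6 = 32

print(elementwise, dot)
```

None of these require a Python loop; NumPy performs the iteration in compiled code.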
[Figure: logistic regression curve showing the probability of passing an exam versus hours studied]
WORKING MODEL
NLTK is the most popular library for topic modelling and analysis (we downloaded the packages stopwords, wordnet, and punkt). After that, we imported every library we need.
DESCRIPTION OF DATA
The competition dataset contains text from horror fiction by well-known writers: Mary Shelley, Edgar Allan Poe, and H. P. Lovecraft. Because the data was produced by fragmenting longer texts into sentences with CoreNLP's MaxEnt sentence tokenizer, we may occasionally encounter a non-sentence. Our goal is to identify the true author of each sentence in the test set.
File descriptions:
Data fields:
HIGHLIGHTS OF ALGORITHMS
2. Logistic regression: the aim of logistic regression is to estimate the probability of events, including establishing a connection between attributes and the likelihood of particular outcomes.
Advantages:
1. Logistic regression is easier to implement, interpret, and very efficient to train.
2. It makes no assumptions about distributions of classes in feature space.
3. It extends easily to multiple classes (multinomial regression) and offers a natural probabilistic view of class predictions.
4. It provides not only a measure of how relevant a predictor is (coefficient size), but also its direction of association (positive or negative).
5. It is very fast at classifying unknown records.
6. Good accuracy for many simple data sets and it performs well when the dataset is linearly separable.
7. It can interpret model coefficients as indicators of feature importance.
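A minimal sketch of logistic regression for author classification, in the spirit of the pipeline above. The tiny training set here is made up for illustration; the real project used the Kaggle dataset.

```python
# Hedged sketch: TF-IDF features + logistic regression author classifier.
# The four training sentences and labels below are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "The raven croaked nevermore above the chamber door",
    "A nameless horror stirred beneath the cyclopean ruins",
    "The raven's shadow lay upon the floor",
    "Eldritch whispers rose from the ancient ruins",
]
authors = ["EAP", "HPL", "EAP", "HPL"]

# chain vectorizer and classifier so raw text goes in, a label comes out
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, authors)

print(clf.predict(["the raven sat upon the chamber door"]))
```

The pipeline's coefficients can then be inspected for feature importance and direction of association, as noted in advantage 7 above.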
3. XGBoost: recently, XGBoost has been the winning algorithm in Kaggle competitions on structured or tabular data and in applied machine learning. XGBoost is a gradient-boosted decision tree implementation designed for speed and performance. It is sometimes called an "all-in-one" algorithm: a combination of hardware and software optimisation methods that delivers strong results with minimal processing power in the shortest time.
Advantages:
- It is highly flexible.
- It uses the power of parallel processing.
- It is faster than plain gradient boosting.
- It supports regularization.
- It is designed to handle missing data with its built-in features.
- The user can run a cross-validation after each iteration.
- It works well on small to medium datasets.
REFERENCES
Data and XGBoost:
https://www.kaggle.com/
• Standardised criteria
Because automated topic analysis is built on natural language processing (NLP), a blend of computer science, statistics, and computational linguistics, you can count on exceptional accuracy and high-quality findings.
CONCLUSIONS