Topic Analysis Presentation

The document discusses topic analysis and summarizes key machine learning algorithms used for topic analysis including Naive Bayes, logistic regression, and XGBoost. It also outlines the dataset and requirements for topic modeling.

Uploaded by

Nader AlFakeeh
Copyright © All Rights Reserved

TOPIC ANALYSIS
AHMED GHAZI - 2020380065
SHOAIB MUHAMMAD - 2020380043
AMER GHALEB - 2020380072
NADER AL-FAKEEH - 2021380082
QOTBEDDINE - 2021380096
TABLE OF CONTENTS
1. INTRODUCTION
2. OVERVIEW
3. TECHNICAL REPORT
4. WORKING MODEL
5. DESCRIPTION OF DATA
6. HIGHLIGHTS OF ALGORITHMS

INTRODUCTION
Topic analysis (also known as topic identification, topic modelling, or
topic extraction) is a machine learning approach that organizes and
makes sense of large volumes of text data by assigning "tags" or
categories according to the subject or theme of each individual piece
of text.
Topic analysis uses natural language processing (NLP) to interpret
human language, identifying patterns and semantic structures in texts
so that you can extract insights and support data-driven
decision-making. The two most common machine learning approaches to
topic analysis are NLP topic modelling and NLP topic classification.

• Why did we choose topic analysis?

Real-time topic analysis lets us monitor our brand image and act
quickly on customer feedback, whether that means responding to a
crisis or making an improvement. It also helps us keep an eye on our
competitors and identify the latest trends in our sector.
• For our research we applied three algorithms that are widely used for topic analysis: Naïve Bayes classifiers,
logistic regression, and XGBoost classifiers.

• We used a dataset of text by three well-known authors: Mary Wollstonecraft Shelley, H.P. Lovecraft, and
Edgar Allan Poe.

• The implementations are our own work. They are designed to use little GPU power and memory while still
producing strong results. For data visualization we used the wordcloud library.
OVERVIEW
Topic tagging is useful when large volumes of text data need to be analyzed quickly and affordably. This includes text
from internal documents, customer correspondence, and the web.

Topic analysis can be applied at several levels of scope:

Document level: the topic model extracts the various topics from a whole text. For instance, the topics of a news story
or an email.

Sentence level: the topic model extracts the topic of a single sentence. For instance, the topic of a news article
headline.

Sub-sentence level: the topic model extracts the topics of sub-expressions within a sentence. For instance, the
different topics within a single sentence of a product review.
CONCERNING THIS TOPIC

Requirements:
- GPU with a high CUDA core count
- Dedicated graphics card (a plus)
- NLTK
- TensorFlow
- Matplotlib
- Keras
- xgboost library
- wordcloud
- Pandas
- NumPy
- Anaconda 3 as the development environment
TECHNICAL REPORT
A word cloud is a data visualization technique for text in which the size of each word reflects its frequency or
importance. It can be used to highlight the key terms in a body of text, and word clouds are frequently used to analyze
data from social networking sites.

Matplotlib, Pandas, and Wordcloud are the modules required to generate a word cloud in Python. The following
commands have to be executed in order to install these packages:
• pip install matplotlib
• pip install pandas
• pip install wordcloud

The word cloud generator uses a dataset that was gathered from the Kaggle data repository.
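
The idea that word size follows word frequency can be sketched in a few lines. This is only an illustration, not the code used in the project: the stop-word list, the sample sentence, and the linear font scaling are assumptions made for the example, and the wordcloud package applies its own layout rules. The `render_cloud` helper shows where the package would come in and is not called here.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list for illustration; NLTK ships a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "was", "it", "is"}

def word_frequencies(text):
    """Lower-case the text, tokenize on letters, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def font_sizes(freqs, min_size=10, max_size=60):
    """Scale each count linearly into a font size (an assumed, simplified rule)."""
    top = max(freqs.values())
    return {w: min_size + (max_size - min_size) * c / top for w, c in freqs.items()}

def render_cloud(freqs):
    """Build the image with the wordcloud package (imported lazily, needs `pip install wordcloud`)."""
    from wordcloud import WordCloud
    cloud = WordCloud(width=800, height=400, background_color="white")
    return cloud.generate_from_frequencies(freqs)

# Placeholder text, not taken from the dataset.
sample = "The raven sat and the raven spoke; the night was dark and the raven was still."
freqs = word_frequencies(sample)
sizes = font_sizes(freqs)
```

The most frequent word receives the largest size, which is the property a word cloud makes visible.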

Note:
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners.
Python | Lemmatization with NLTK

Lemmatization is the process of grouping the different inflected forms of a word so that they can be analyzed as a
single item. It is similar to stemming, but it takes the context of a word into account, so it links words with
similar meanings to a single base form (the lemma).
Examples of lemmatization:
-> rocks : rock
-> corpora : corpus
-> better : good
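
The difference between stemming and lemmatization can be illustrated with a toy comparison. This is a sketch under stated assumptions: the `LEMMAS` table is a hypothetical stand-in for a real lexical database (in NLTK this role is played by WordNet through `WordNetLemmatizer`), and `naive_stem` is a deliberately crude suffix stripper, not a real stemming algorithm.

```python
# Hypothetical lookup table standing in for a lexical database such as WordNet.
LEMMAS = {"rocks": "rock", "corpora": "corpus", "better": "good"}

def naive_stem(word):
    """Crude suffix stripping with no knowledge of the vocabulary."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def lemmatize(word):
    """Return the dictionary form if known, otherwise the lower-cased word."""
    word = word.lower()
    return LEMMAS.get(word, word)

# Compare both approaches on the three examples from the list above.
comparison = {w: (naive_stem(w), lemmatize(w)) for w in ("rocks", "corpora", "better")}
```

Suffix stripping handles "rocks" but leaves "corpora" and "better" unchanged, whereas the dictionary-based approach maps all three to their base forms.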

What is Vectorization?

Vectorization speeds up Python code by replacing explicit loops with operations on whole arrays. Used well, it can
substantially reduce the running time of the code.
Common operations on vectors include the dot product, also known as the scalar product because it yields a single
number; the outer product, which for two vectors of length n yields a square n × n matrix; and element-wise
multiplication, which multiplies the elements at the same index and leaves the dimensions unchanged.
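
A minimal sketch of these three operations, assuming NumPy (listed in the requirements) and two made-up vectors. The loop version of the dot product is included only for comparison with the vectorized call.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Explicit loop version of the dot product, for comparison.
dot_loop = 0.0
for x, y in zip(a, b):
    dot_loop += x * y

dot = np.dot(a, b)       # single number: 1*4 + 2*5 + 3*6
outer = np.outer(a, b)   # 3 x 3 matrix with outer[i, j] = a[i] * b[j]
elementwise = a * b      # same shape as the inputs
```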
Figure: Logistic regression curve showing the probability of passing an exam versus hours spent studying.
WORKING MODEL
NLTK is one of the most popular libraries for topic modelling and text analysis. We downloaded its stopwords, wordnet,
and punkt packages and then imported every library we needed.
DESCRIPTION OF DATA
The competition dataset contains text from works of horror fiction by three well-known writers: Mary Shelley, Edgar
Allan Poe, and HP Lovecraft. The data was prepared by splitting longer texts into sentences with CoreNLP's MaxEnt
sentence tokenizer, so an occasional non-sentence may appear. Our goal is to identify the true author of each sentence
in the test set.

File descriptions:

- train.csv - the training set
- test.csv - the test set
- sample_submission.csv - a sample submission file in the correct format

Data fields:

- id - a unique identifier for each sentence
- text - some text written by one of the authors
- author - the author of the sentence (EAP: Edgar Allan Poe; HPL: HP Lovecraft; MWS: Mary Wollstonecraft Shelley)
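
Reading a file with these fields might look like the sketch below, assuming pandas. The rows in `SAMPLE_CSV` are invented placeholders rather than real dataset rows, and the `author_name` column is a hypothetical convenience added for the example.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv with placeholder text (not real dataset rows).
SAMPLE_CSV = io.StringIO(
    "id,text,author\n"
    'id00001,"placeholder sentence one",EAP\n'
    'id00002,"placeholder sentence two",HPL\n'
    'id00003,"placeholder sentence three",MWS\n'
    'id00004,"placeholder sentence four",EAP\n'
)

# Author codes as given in the data field description.
AUTHOR_NAMES = {
    "EAP": "Edgar Allan Poe",
    "HPL": "HP Lovecraft",
    "MWS": "Mary Wollstonecraft Shelley",
}

train = pd.read_csv(SAMPLE_CSV)  # with the real file: pd.read_csv("train.csv")
train["author_name"] = train["author"].map(AUTHOR_NAMES)
class_counts = train["author"].value_counts()  # sentences per author
```

Checking `class_counts` early is a simple way to see whether the three authors are evenly represented before training a classifier.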
HIGHLIGHTS OF ALGORITHMS
1. Naive Bayes: a classification method based on Bayes' Theorem that assumes the predictors are independent. Put
simply, the Naive Bayes classifier assumes that, within a class, the presence of one feature is independent of the
presence of any other feature.
Advantages:
• It is fast to train and to apply, which can save a lot of time.
• It is suitable for multi-class prediction problems.
• If its assumption of feature independence holds, it can perform better than other models and needs much less
training data.
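
To make the independence assumption concrete, here is a small multinomial Naive Bayes classifier written from scratch. It is a sketch, not the project's implementation: the training sentences are invented, whitespace tokenization is a simplification, and add-one (Laplace) smoothing is one common choice among several.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(model, doc):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for word in doc.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one smoothing avoids zero probabilities for unseen pairs.
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Invented placeholder sentences and labels, for illustration only.
docs = [
    "the raven croaked nevermore",
    "the monster fled the laboratory",
    "the raven perched above the door",
]
labels = ["EAP", "MWS", "EAP"]
model = train_nb(docs, labels)
prediction = predict_nb(model, "a raven at the door")
```

Each word contributes its own term to the score independently of the others, which is exactly the "naive" assumption described above.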

2. Logistic regression: the aim of logistic regression is to estimate the probability of an outcome by modelling the
relationship between the input features and the likelihood of that outcome.
Advantages:
1. It is easy to implement and interpret, and very efficient to train.
2. It makes no assumptions about the distribution of classes in feature space.
3. It extends easily to multiple classes (multinomial regression) and gives a natural probabilistic view of class predictions.
4. Its coefficients indicate not only how strongly a predictor is associated with the outcome (coefficient size) but also the
direction of the association (positive or negative).
5. It is very fast at classifying unknown records.
6. It gives good accuracy on many simple datasets and performs well when the data is linearly separable.
7. Its coefficients can be interpreted as indicators of feature importance.
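
A minimal sketch of how logistic regression turns a feature into a probability, using plain gradient descent on a single feature. The data below (hours studied versus passing an exam) is invented for illustration, and the learning rate and number of epochs are assumptions chosen so that this small example converges.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Invented toy data: hours studied and whether the exam was passed (1) or not (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(hours, passed)
p_low = sigmoid(w * 1.0 + b)    # estimated pass probability after 1 hour
p_high = sigmoid(w * 5.0 + b)   # estimated pass probability after 5 hours
```

The sign of `w` gives the direction of the association and its size the strength, which is the interpretability advantage listed above.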
3. XGBoost: XGBoost is an implementation of gradient-boosted decision trees designed for speed and efficiency. In recent
years it has won many Kaggle competitions on structured or tabular data and is widely used in applied machine learning.
It is sometimes described as an "all-in-one" algorithm: it combines software and hardware optimisation techniques to
produce strong results with little computing power in a short time.
Advantages:

• It is highly flexible.
• It uses parallel processing.
• It is faster than standard gradient boosting.
• It supports regularization.
• It handles missing data with built-in features.
• The user can run cross-validation at each iteration.
• It works well on small to medium datasets.
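
The core idea behind gradient-boosted trees can be sketched with one-split trees (stumps) and squared error. This is plain gradient boosting written for illustration, not XGBoost itself, which adds regularization, second-order gradient information, and parallelized tree construction. The toy data, the number of rounds, and the learning rate are assumptions.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold split on x that minimizes squared error on the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left, right = residual[x <= threshold], residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, learning_rate=0.3):
    """Each round fits a stump to the residuals, the negative gradient of squared loss."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - prediction)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, learning_rate, stumps

def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Invented toy data with a step-like target.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.2, 0.9, 3.1, 2.9, 3.0])
model = gradient_boost(x, y)
fitted = predict(model, x)
baseline_error = ((y - y.mean()) ** 2).mean()
boosted_error = ((y - fitted) ** 2).mean()
```

Each new stump corrects what the previous ones left unexplained, so the training error falls well below that of the constant baseline.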
REFERENCES
Data and XGBoost:
https://www.kaggle.com/

Algorithms and code solutions:
https://www.geeksforgeeks.org/
https://www.hackerrank.com/
https://stackoverflow.com/
https://leetcode.com/
https://coderbyte.com/
Why Topic Analysis Matters and What Are Our Future Objectives?
Businesses produce and collect enormous volumes of data every day. By using automated topic analysis to process this data, they
can raise their productivity and efficiency through better decision-making, streamlined internal operations, trend identification,
and more. Machine learning models are essential for sifting through all of this data. Thanks to topic detection, we can quickly
scan very large document collections and determine what our customers are talking about.

Benefits of topic modeling include:

• Analyzing data at scale
Manually combing through a large database to find themes would be extremely time-consuming and prohibitively costly. With
automated topic analysis using machine learning, you can scan as much data as you want, opening up new possibilities for
gaining insights.

• Analysis in real time
By combining topic tagging with sentiment analysis and other natural language processing techniques, you can get a real-time
picture of what your customers are saying about your product. Most importantly, you can use that data to make data-driven
decisions in real time, around the clock.

• Consistent criteria
Automated topic analysis is built on natural language processing (NLP), a blend of computer science, statistics, and
computational linguistics. It applies the same criteria to every text, so you can expect consistent, high-quality results.
CONCLUSIONS

Working on this project was both very challenging and very enjoyable.
We have only scratched the surface of natural language processing, but
we learned a great deal from one another along the way. As a group of
machine learning enthusiasts, we hope that this project is the start
of our journey. We are grateful to our beloved Professor Liang Yun Ji
for taking the time and effort to explain everything to us thoroughly
and for always supporting us.
THANK YOU!
