
Real-Time Text Classification and Summarization Model by Enhancing TF-IDF Extractive Summarization Method

2024 International Conference on Intelligent Computing and Sustainable Innovations in Technology (IC-SIT) | 979-8-3503-6917-5/24/$31.00 ©2024 IEEE | DOI: 10.1109/IC-SIT63503.2024.10861972


Shreenita Saha
School of Computing Science Engineering and AI
VIT Bhopal University
Sehore, India
shreenitasaha2022@vitbhopal.ac.in

Chandan Kumar Behera
School of Computing Science and Engineering
VIT Bhopal University
Sehore, India
chandank.behera@vitbhopal.ac.in

Abstract— The rapid growth of textual data across various platforms necessitates efficient real-time text classification and summarization methods. This paper presents a novel approach for developing a real-time text classification and summarization model that addresses the limitations of traditional methods, such as TFIDF, in extractive summarization. The proposed algorithm implements advanced preprocessing techniques and a similarity matrix for enhanced pattern recognition, which leads to more accurate and contextually relevant summaries. Additionally, the model integrates logistic regression to ensure precise topic identification.

Keywords— Real-Time Text Classification, Extractive Summarization, Similarity Matrix, Information Retrieval, Text Summarization Model

I. INTRODUCTION
The exponential growth of textual data has created remarkable challenges for accurate processing of content. From social media posts to academic publications, the need for efficient methods to classify and summarize text in real time has become more critical than ever. Traditional approaches, such as Term Frequency-Inverse Document Frequency (TFIDF), have long been the standard for extractive summarization. However, this method cannot capture the significant relationships between words and fails to adapt to the dynamic nature of real-time data.

The objective of this work is to provide an efficient real-time text classification and summarization framework that extracts accurate, concise information for the user. This paper therefore introduces a novel framework that deals with the previous limitations of TFIDF by employing an advanced approach for text classification and summarization. The proposed model implements advanced preprocessing techniques, including tokenization, stop-word removal, and stemming, to prepare the text for analysis. It then utilizes a self-similarity matrix to enhance pattern recognition, resulting in more accurate and contextually relevant summaries. Unlike conventional methods, this framework also integrates a machine learning-based classification system that ensures precise topic identification, thereby improving the overall quality of the extracted summaries.

The paper is organized into five sections. Section I presents the introduction. Section II reviews related research work, summarized in Table I. Section III describes the methodologies implemented in this research work. Section IV discusses the results and compares them with other methodologies. Section V concludes the paper and is followed by the references.
II. LITERATURE REVIEW
TABLE I. SUMMARY OF PREVIOUS WORK
1. A Ranking based Language Model for Automatic Extractive Text Summarization (Pooja Gupta, Swati Nigam, Rajiv Singh, 2022)
Problem: Extractive text summarization using sentence ranking.
Approach: Introduces a new language model based on n-grams (unigrams, bigrams, trigrams) for sentence ranking to generate summaries.
Pros: The proposed n-gram-based language model for sentence ranking improved summarization accuracy, achieving 44% on the BBC News dataset and 36% on the CNN dataset.
Cons: The evaluation covered only two datasets, which may not fully represent the method's effectiveness across diverse text sources. The model can be computationally intensive, requiring significant processing power and time.

2. An Explorative Study on Extractive Text Summarization through k-means, LSA, and TextRank (Kasarapu Ramani, K. Bhavana, A. Akshaya, K. Sai Harshita, C. R. Thoran Kumar, M. Srikanth, 2023)
Problem: Extractive text summarization techniques applied to medical documents, comparing k-means, TextRank, and Latent Semantic Analysis.
Approach: Evaluates the techniques using ROUGE metrics. Pre-processing steps include tokenization, stop-word removal, and lemmatization; post-processing involves clustering and ranking sentences.
Pros: k-means outperformed TextRank and Latent Semantic Analysis, achieving high precision (94.52%), recall (90.98%), and F1-score (91.25%).
Cons: The study did not implement or evaluate hybrid approaches, which could potentially offer better summarization results.

3. An overview extractive text summarization (Shohreh Rad Rahimi, Mohamad Abdolahi, Ali Toofanzadeh Mozhdehi, 2017)
Problem: Text summarization, i.e., automatically creating a condensed version of a given document while preserving its information content.
Approach: Reviews various summarization techniques, including fuzzy text summarization, statistical text summarization, and text clustering.
Pros: Provides a thorough review of text summarization approaches, including statistical, fuzzy logic, and clustering methods.
Cons: The study did not implement or evaluate any approaches; it offers only a comparative study.

4. Analyzing Extractive Text Summarization Techniques and Classification Algorithms: A Comparative Study (Nitish Pandey, Sanjog Kumar, Vishwa Ranjan, Masood Ahamed, Abhaya Kumar Sahoo, 2024)
Problem: Processing large volumes of news articles by developing a system for text summarization and classification.
Approach: Uses Random Forest for classification and three extractive summarization algorithms: TextRank, frequency-based, and TF-IDF clustering.
Pros: The Random Forest algorithm achieved an accuracy of 96% for text classification.
Cons: Dataset dependency may be an important issue for classification performance.

5. Comparative Analysis of Different Text Summarization Techniques Using Enhanced Tokenization (Tanzirul Islam, Mofazzal Hossain, Fahim Arefin, 2021)
Problem: Creating concise and fluent summaries for Bangla text documents.
Approach: Uses techniques such as tokenization, stop-word removal, and POS tagging, with models such as cosine similarity and TextRank for summarization.
Pros: Achieved a high ROUGE-1 score of 0.45, indicating effective summarization performance.
Cons: Accuracy varied significantly across datasets, dropping to a ROUGE-1 score of 0.30 on less structured data.

6. Comparative Study on Extractive Summarization Using Sentence Ranking Algorithm and Text Ranking Algorithm (Mansoora Majeed, Kala M T, 2023)
Problem: Creating accurate and succinct summaries of lengthy documents using automatic text summarization techniques.
Approach: Compares two extractive summarization methods: the sentence ranking algorithm and the TextRank algorithm.
Pros: Achieved a high ROUGE-1 recall score of 0.86, outperforming the sentence ranking method.
Cons: Involves higher computational complexity.

7. Extractive Automatic Text Summarization using SpaCy in Python and NLP (Swaranjali Jugran, Ashish Kumar, Bhupendra Singh Tyagi, Vivek Anand, 2021)
Problem: Extracting meaningful summaries from large text documents using Natural Language Processing techniques.
Approach: Utilizes SpaCy for NLP tasks, comparing it with CoreNLP and NLTK in terms of statistical and grammatical performance.
Pros: The proposed method is efficient in processing large datasets.
Cons: It is complex to implement and needs higher computational resources.

8. Extractive Text Summarization: An effective approach to extract information from Text (Asha Rani Mishra, V. K. Panchal, Pawan Kumar, 2019)
Problem: Text summarization using extractive methods.
Approach: Employs machine learning, LDA, TextRank, and topic modelling to rank and extract key sentences.
Pros: Achieved an accuracy of 85% in summarizing text.
Cons: Struggles with highly diverse datasets, leading to lower recall rates.

9. Implementation of Novel Test Rank Algorithm for Effective Text Summarization (D. Ganesh, Mungara Kiran Kumar, Jasti Varsha, Kimavath Jayanth Naik, K. Pranusha, 2023)
Problem: Text summarization using extractive methods.
Approach: Employs machine learning, LDA (Latent Dirichlet Allocation), TextRank, and topic modeling, with analysis, extraction, and generation phases to create summaries.
Pros: Achieved a precision of 0.9557 for ROUGE-1 and ROUGE-3, indicating accurate summarization.
Cons: The recall score for ROUGE-2 was 0.1812, indicating some difficulty in retrieving relevant information for this metric.

10. Query-oriented Text Summarization using Sentence Extraction Technique (Mahsa Afsharizadeh, Hossein Ebrahimpour-Komleh, 2018)
Problem: Extracting the most important information from large volumes of text efficiently.
Approach: Uses methods such as TF-IDF, fuzzy logic, graph-based methods, and Latent Semantic Analysis (LSA) for feature extraction and sentence scoring.
Pros: Achieved a higher ROUGE-2 average recall of 0.07579, compared to the previous method's 0.06887.
Cons: Still faces challenges in accurately identifying the most informative sentences due to the complexity of feature extraction.

11. Towards Extractive Text Summarization using Multidimensional Knowledge Representation (Johannes Zenkert, André Klahold, Madjid Fathi, 2018)
Problem: Extractive text summarization using multidimensional knowledge representation (MKR) to handle the complexity of natural language and the vast amount of information available online.
Approach: Integrates named entity recognition (NER), sentiment analysis (SA), and topic detection (TD) to create a structured knowledge base for summarization.
Pros: The MKR framework allows efficient filtering and selection of relevant information, improving the accuracy and relevance of summaries.
Cons: Requires extensive preprocessing and the integration of multiple text mining methods, which can be computationally intensive.

12. Query-oriented Text Summarization using Sentence Extraction Technique (Mahsa Afsharizadeh, Hossein Ebrahimpour-Komleh, Ayoub Bagheri, 2018)
Problem: Summarizing large volumes of text quickly and effectively, focusing on query-oriented text summarization.
Approach: Key techniques include text pre-processing (tokenization, stop-word removal, stemming, POS tagging), feature extraction, and sentence scoring.
Pros: Evaluated on the DUC 2007 corpus, showing improved performance in ROUGE metrics compared to previous methods.
Cons: Requires significant computational resources and is complex to implement.

III. METHODOLOGY
The methodology of this research work is centred on the development and evaluation of a real-time text classification and summarization model, with enhancements to the traditional TFIDF extractive summarization method. The approach is divided into several key phases: data collection, preprocessing, feature extraction, similarity matrix construction, text classification, and summarization.
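As a concrete illustration of the preprocessing phase, the sketch below tokenizes the input, removes stop words, and stems the surviving tokens. NLTK's stop-word list and Porter stemmer, and the simple regex tokenizer, are assumptions made for illustration; the paper does not name a specific toolkit.

# Minimal sketch of the preprocessing phase (tokenization, stop-word
# removal, stemming). NLTK is an assumption; the paper does not name
# a specific library.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

def preprocess(text):
    """Tokenize, drop stop words and punctuation, and stem each token."""
    stemmer = PorterStemmer()
    stops = set(stopwords.words("english"))
    # A simple regex tokenizer stands in for a full word tokenizer.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stops]

print(preprocess("The rapid growth of textual data creates challenges."))
# -> ['rapid', 'growth', 'textual', 'data', 'creat', 'challeng']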
A. Self-Similarity
Self-similarity refers to the property of a distribution where it exhibits similar patterns at different scales. In the context of a word frequency distribution, self-similarity means that the distribution of word frequencies shows similar patterns at different frequency ranges. For example, consider a word frequency distribution with the following properties: the most common words (e.g., "the", "and", "a") have very high frequencies (e.g., 10, 8, 6); the next most common words (e.g., "of", "to", "in") have lower frequencies (e.g., 5, 4, 3); and the least common words (e.g., "rare", "uncommon", "obscure") have very low frequencies (e.g., 1, 2, 3).

In this distribution, we can observe self-similarity because the pattern of decreasing frequencies is repeated at different scales:

• The top 10 words have a similar pattern of decreasing frequencies as the top 100 words.
• The top 100 words have a similar pattern of decreasing frequencies as the top 1000 words.
B. Fractional Dimension
The fractional dimension is a measure of the self-similarity of a distribution. In the context of a word frequency distribution, the fractional dimension (D) is a value that quantifies the degree of self-similarity. A high fractional dimension indicates a high degree of self-similarity, meaning that the distribution has a strong repeating pattern at different scales. This is often observed in natural language texts, where the frequency distribution of words follows a power-law distribution. A low fractional dimension (D → 0) indicates a low degree of self-similarity, meaning that the distribution has a more random or uniform pattern.

C. Procedure for Capturing Self-Similarity from Fractional Dimension
The fractional dimension captures self-similarity in the word frequency distribution by analyzing the scaling behavior of the distribution. Specifically, it measures how the box count N(ε) changes as a function of the box size ε. In the box-counting dimension method, we divide the frequency range into boxes of size ε and count the number of boxes that contain at least one word frequency, N(ε). The fractional dimension is then calculated as the slope of the best-fitting line of log(N(ε)) against log(ε). The fractional dimension:

• captures the repeating pattern of decreasing frequencies at different scales;
• quantifies the degree of self-similarity, which can help identify important words and sentences.

To calculate the fractal dimension with the box-counting method, suppose we have a word frequency distribution with the following values:

TABLE II: CALCULATION OF THE FRACTAL DIMENSION

Word    Frequency
the     10
and     8
a       6
of      5
to      4
...     ...

We then follow these steps:
Step 1: Divide the frequency range into boxes of size ε (e.g., ε = 1, 2, 4, 8, ...).
Step 2: Count the number of boxes that contain at least one word frequency, N(ε).
Step 3: Plot log(N(ε)) against log(ε) and fit a straight line. The slope of the line is the fractional dimension (D).

TABLE III: EXAMPLE OF THE PLOT

ε      N(ε)    log(N(ε))    log(ε)
1      10      1.0          0.0
2      6       0.778        0.301
4      4       0.602        0.602
8      2       0.301        0.903
...    ...     ...          ...

The slope of the fitted line is approximately -1.2, which is taken as the fractional dimension (D) of the word frequency distribution.
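The box-counting estimate described above can be sketched in a few lines of Python. This is a minimal illustration assuming numpy; the frequency list extends Table II with an illustrative tail, and the exact fitted slope depends on which box sizes enter the fit.

# Minimal sketch of the box-counting estimate of the fractional
# dimension. The word frequencies below are the illustrative values
# from Table II plus a hypothetical tail, not experimental data.
import numpy as np

def box_counting_dimension(frequencies, box_sizes=(1, 2, 4, 8)):
    """Estimate the fractional dimension of a word frequency distribution.

    For each box size eps, the frequency axis is split into bins of
    width eps and N(eps) counts the bins holding at least one observed
    frequency. The result is the slope of the best-fit line of
    log10 N(eps) against log10 eps, as in Step 3 above."""
    freqs = np.asarray(frequencies, dtype=float)
    log_eps, log_n = [], []
    for eps in box_sizes:
        boxes = np.unique(np.floor(freqs / eps))  # distinct occupied boxes
        log_eps.append(np.log10(eps))
        log_n.append(np.log10(len(boxes)))
    # The slope is negative for a decaying distribution, matching the
    # sign of D used in the text.
    slope, _ = np.polyfit(log_eps, log_n, 1)
    return slope

word_freqs = [10, 8, 6, 5, 4, 3, 2, 1]  # Table II values + illustrative tail
print(f"D = {box_counting_dimension(word_freqs):.2f}")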
By weighting the TF-IDF scores with the fractional dimension, we can incorporate the self-similarity of the word frequency distribution into the summarization process. This can help identify important words and sentences that are not captured by traditional TF-IDF methods.

D. Problem Formulation
The traditional TF-IDF formula is:

TF-IDF = TF * IDF    (1)

where TF (Term Frequency) is the frequency of a word in a document, and IDF (Inverse Document Frequency) is a measure of how rare a word is across all documents. Instead of using the traditional TF-IDF formula, we can weight the TF-IDF scores by the fractional dimension of the word frequency distribution. This approach takes into account the self-similarity of the word frequency distribution, which can help to identify important words and sentences. The weighted TF-IDF formula is:

Weighted TF-IDF = TF * IDF * D    (2)

where D is the fractal dimension of the word frequency distribution.

Suppose we want to calculate the weighted TF-IDF score for the word "the", which occurs 10 times in a document and appears in 5 of the 10 documents in the collection. The traditional TF-IDF score is:

TF-IDF = 10 * log(10/5)
       = 10 * 0.301
       = 3.01

Using the fractional dimension (D = -1.2), we weight the TF-IDF score as follows:

Weighted TF-IDF = 10 * log(10/5) * (-1.2)
                = 3.01 * (-1.2)
                = -3.61

The negative sign indicates that the word "the" is less important than its high frequency alone would suggest. This is because the fractional dimension takes into account the self-similarity of the word frequency distribution, which can help to identify important words and sentences. Note that this is a simplified example; in practice, the fractional dimension would be calculated for each word frequency distribution in the document collection.
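Equations (1) and (2) and the worked example translate directly into code. The sketch below assumes base-10 logarithms, as in the example above; the helper names and document counts are illustrative, not the authors' implementation.

# Sketch of equations (1) and (2). Base-10 logs and D = -1.2 follow the
# worked example above; the function names and counts are illustrative.
import math

def tf_idf(tf, total_docs, docs_with_term):
    """Traditional TF-IDF, eq. (1): TF * IDF, with IDF = log10(N / n_t)."""
    return tf * math.log10(total_docs / docs_with_term)

def weighted_tf_idf(tf, total_docs, docs_with_term, d):
    """Weighted TF-IDF, eq. (2): the TF-IDF score scaled by the
    fractional dimension D of the word frequency distribution."""
    return tf_idf(tf, total_docs, docs_with_term) * d

# Reproduce the example for "the": 10 occurrences, 5 of 10 documents.
score = tf_idf(10, 10, 5)                    # 10 * log10(2) ≈ 3.01
weighted = weighted_tf_idf(10, 10, 5, -1.2)  # ≈ -3.61
print(f"TF-IDF = {score:.2f}, weighted = {weighted:.2f}")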

E. Work Flow
The workflow begins with the manual implementation of the Term Frequency (TF) and Inverse Document Frequency (IDF) calculations. Following this, TF-IDF scores for each term in the document are computed and used to rank sentences for summarization. To further refine the model, the box-counting fractional dimension method is incorporated, which estimates the complexity of the word frequency distribution. This fractional dimension is used to adjust the TF-IDF scores, making them more representative of the text's intrinsic structure. Finally, cosine similarity is employed to measure the effectiveness of the enhanced summarization against the original text.
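The final step of this workflow, comparing the enhanced summary against the original text, can be sketched as follows. The paper implements TF and IDF manually; scikit-learn's vectorizer is used here only as a stand-in to keep the sketch short, and the two strings are placeholders.

# Sketch of the workflow's final step: scoring a candidate summary
# against the original text with cosine similarity. scikit-learn is an
# assumption; the paper computes TF and IDF manually.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summary_similarity(original, summary):
    """Cosine similarity between the TF-IDF vectors of a document
    and its extractive summary."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(
        [original, summary])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

# Placeholder strings; the paper reports scores of 0.17 (traditional
# TFIDF) and 0.38 (enhanced TFIDF) on its test content.
original = "The rapid growth of textual data demands efficient real-time summarization."
summary = "Textual data growth demands efficient summarization."
print(f"Similarity: {summary_similarity(original, summary):.2f}")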

Figure 1. Framework for the proposed model

IV. RESULTS
The experimental results show that the proposed enhanced algorithm outperforms the traditional TFIDF approach in terms of both accuracy and relevance, and demonstrates its effectiveness for real-time applications. The summary produced by the existing TFIDF model achieves a similarity score of 0.17 against the original text, whereas the proposed enhanced TFIDF model achieves 0.38. The addition of sentiment analysis has further enriched the summarization process by enabling better interpretation of textual data.

Figure 2. The screenshot of the input of the content
Figure 3. Original and innovative summary of the content
Figure 4. Comparative result of different similarity scores of the original summary and the innovative summary

V. CONCLUSION
In this paper, a novel framework has been introduced for real-time text classification and summarization, with a particular focus on enhancing the traditional TFIDF-based extractive summarization method. The proposed approach utilizes a similarity matrix to address the limitations inherent in the traditional TFIDF method, resulting in more accurate and contextually relevant summaries. By integrating voice-to-text input and sentiment analysis, the model offers a versatile solution capable of handling diverse text processing needs across various platforms.
REFERENCES
[1] P. Gupta, S. Nigam and R. Singh, "A Ranking based Language Model for Automatic Extractive Text Summarization," 2022 First International Conference on Artificial Intelligence Trends and Pattern Recognition (ICAITPR), Hyderabad, India, 2022, pp. 1-5, doi: 10.1109/ICAITPR51569.2022.9844187.
[2] K. Ramani, K. Bhavana, A. Akshaya, K. S. Harshita, C. R. Thoran Kumar and M. Srikanth, "An Explorative Study on Extractive Text Summarization through k-means, LSA, and TextRank," 2023 International Conference on Wireless Communications Signal Processing and Networking (WiSPNET), Chennai, India, 2023, pp. 1-6, doi: 10.1109/WiSPNET57748.2023.10134303.
[3] Jain, D., Borah, M. D., & Biswas, A. (2021). Summarization of legal documents: Where are we now and the way forward. Computer Science Review, 40, 100388. https://doi.org/10.1016/j.cosrev.2021.100388
[4] N. Pandey, S. Kumar, V. Ranjan, M. Ahamed and A. K. Sahoo, "Analyzing Extractive Text Summarization Techniques and Classification Algorithms: A Comparative Study," 2024 International Conference on Advancements in Smart, Secure and Intelligent Computing (ASSIC), Bhubaneswar, India, 2024, pp. 1-5, doi: 10.1109/ASSIC60049.2024.10508020.
[5] T. Islam, M. Hossain and M. F. Arefin, "Comparative Analysis of Different Text Summarization Techniques Using Enhanced Tokenization," 2021 3rd International Conference on Sustainable Technologies for Industry 4.0 (STI), Dhaka, Bangladesh, 2021, pp. 1-6, doi: 10.1109/STI53101.2021.9732589.
[6] M. Majeed and K. M. T, "Comparative Study on Extractive Summarization Using Sentence Ranking Algorithm and Text Ranking Algorithm," 2023 International Conference on Power, Instrumentation, Control and Computing (PICC), Thrissur, India, 2023, pp. 1-5, doi: 10.1109/PICC57976.2023.10142314.
[7] S. Jugran, A. Kumar, B. S. Tyagi and V. Anand, "Extractive Automatic Text Summarization using SpaCy in Python & NLP," 2021 International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), Greater Noida, India, 2021, pp. 582-585, doi: 10.1109/ICACITE51222.2021.9404712.
[8] D. Ganesh, M. K. Kumar, J. Varsha, K. J. Naik, K. Pranusha and J. Mallika, "Implementation of Novel Test Rank Algorithm for Effective Text Summarization," 2023 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI), Chennai, India, 2023, pp. 1-6, doi: 10.1109/ACCAI58221.2023.10201008.
[9] M. Afsharizadeh, H. Ebrahimpour-Komleh and A. Bagheri, "Query-oriented text summarization using sentence extraction technique," 2018 4th International Conference on Web Research (ICWR), Tehran, Iran, 2018, pp. 128-132, doi: 10.1109/ICWR.2018.8387248.
[10] J. Zenkert, A. Klahold and M. Fathi, "Towards Extractive Text Summarization Using Multidimensional Knowledge Representation," 2018 IEEE International Conference on Electro/Information Technology (EIT), Rochester, MI, USA, 2018, pp. 0826-0831, doi: 10.1109/EIT.2018.8500186.
[11] Quillo-Espino, J., Romero-González, R. M. and Herrera-Navarro, A.-M. (2021). A Deep Look into Extractive Text Summarization. Journal of Computer and Communications, 9, pp. 24-37, doi: 10.4236/jcc.2021.96002.
[12] Geetha C Megharaj and Varsha Jituri, "TFIDF Model based Text Summerization," International Journal of Engineering Research & Technology (IJERT), RTCSIT – 2022, Volume 10, Issue 12, 2022.
[13] R. Haruna, A. Obiniyi, M. Abdulkarim and A. A. Afolorunsho, "Automatic Summarization of Scientific Documents Using Transformer Architectures: A Review," 2022 5th Information Technology for Education and Development (ITED), Abuja, Nigeria, 2022, pp. 1-6, doi: 10.1109/ITED56637.2022.10051602.
[14] Watanangura, P., Vanichrudee, S., Minteer, O. et al. A Comparative Survey of Text Summarization Techniques. SN COMPUT. SCI. 5, 47 (2024). https://doi.org/10.1007/s42979-023-02343-6

