Applied Text Mining

Usman Qamar
National University of Sciences and Technology (NUST)
Islamabad, Pakistan

Muhammad Summair Raza
Department of Software Engineering
University of Sargodha
Sargodha, Pakistan
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the
whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or informa-
tion storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does
not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give
a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that
may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword
It is my pleasure to provide a foreword for this book, which comes at the right time given the need for such a textbook. Professor Qamar's devotion to research and to the quality of education shows throughout the book.
The book can serve as a textbook for young researchers and students who wish to study and learn more about this active discipline, which is considered an important milestone in artificial intelligence. In a unique style, the book focuses on how generative AI can generate, understand, and interpret text.
The book consists of three parts and starts with an introduction to the basic concepts of text mining in Part I. Part II then provides analytics on text classification, with source code examples that help the reader better understand each approach; this part also highlights deep learning for text processing and parsing using machine learning techniques. In Part III, the book discusses text mining based on deep learning concepts, exploited for processing textual data in clustering and classification. This part includes examples of how deep learning can be applied in the text mining context, together with Python-based source code that allows the reader to study these new concepts in practice.
These three parts together provide a comprehensive coverage of modern techniques in
natural language understanding and text mining, enhanced by numerous examples helping
the reader to better comprehend the essential topics.
Hamido Fujita
Iwate Prefectural University, Japan
Editor-in-Chief of “Applied Intelligence”
Preface
This textbook covers the concepts, theories, and implementations of text mining and natu-
ral language processing (NLP). There are a few factors that provided motivation for writ-
ing this book. The biggest one is generative AI. By using text mining and NLP, generative
AI can understand, generate, and interpret humanlike text. It is able to grasp the meaning
behind words and consider their context just like a human being. This integration of text
mining into AI represents a paradigm shift.
We have divided the book into three parts. In Part I, we have provided details about basic
concepts and applications of text mining. We hope this part will set a strong foundation for
the reader. In Part II, we will cover text analytics. All the core concepts like text classifica-
tion, text clustering, text summarization, topic mapping, text visualization, etc. have been
included along with the complete source code. Finally, in Part III, deep learning-based text
mining is covered. Nowadays, deep learning is a dominating method applied to practically
all text mining tasks. Here we have covered deep learning approaches to text mining
including deep learning models for processing text, lexical analysis, and parsing using
deep learning and deep learning-based machine translation.
“Part I: Text Mining Basics” consists of three chapters. Chapter 1 details the textual
data, text mining operations, structure of the text information systems, and other basic
concepts. Chapter 2 will cover details about what types of tasks are performed during the
preprocessing of text. Chapter 3 will discuss the two common applications of text mining,
i.e., sentiment analysis and opinion mining. Both applications are discussed with exam-
ples, the implementation in Python, and an explanation of the complete code.
“Part II: Text Analytics” consists of Chaps. 4 to 9. In Chap. 4, the process of feature
engineering will be discussed in detail along with examples, Python implementation, and
a complete explanation of the source code. Chapter 5 is on text classification. Text classi-
fication, also known as categorization, is one of the important text mining tasks. The entire
process of text classification is based on supervised learning, where the text is categorized
on the basis of training data. In this chapter, the task of text classification will be discussed
in detail. Each and every step will be explained with the help of examples. Python code
and a complete description will also be provided. Chapter 6 on the other hand is on text
clustering. Similar to text classification, text clustering is another important task that is
performed in the context of textual analysis. In the clustering process, the text is organized
in the form of relevant groups and subgroups for further processing. The chapter will
explain the clustering process in detail along with examples and implementation of each
step in Python. Chapter 7 covers text summarization and topic modeling. Text summariza-
tion and topic modeling are other tasks that have become critically important, especially
in the era of social media. Chapter 8 deals with taxonomy generation and dynamic docu-
ment organization. Taxonomy generation refers to automatic category predefinition of the
text. It is the process of generating topics or concepts and their relations from a given
corpus. This chapter will explain each process’s details and will provide examples and a
complete description of the accompanying Python source code. Finally, Chap. 9 covers
visualization approaches. In the context of human-centric text mining, the interaction of
the user with text mining systems has critical importance.
Finally, “Part III: Deep Learning in Text Mining,” consists of three chapters, i.e.,
Chaps. 10–12. Deep learning has obtained great importance for processing textual data,
especially in text clustering and classification. Chapter 10 will explain how deep learning
can be used in the context of text mining with examples and a complete description of the
accompanying Python source code. Chapter 11 will explain the concepts related to deep
learning in lexical analysis and parsing with practical examples and accompanying code.
Chapter 12 will introduce the concepts of machine translation (MT) using deep learning
models and techniques with examples and a complete description of the accompanying
Python source code.
The textbook was specifically written to enable the teaching of both the basic and advanced
concepts from a single book. The textbook is an all-in-one source for anyone working in
the domain of text mining. It covers both the theory and the practical implementation,
where each and every concept is explained with simple and easy-to-understand examples.
Finally, the book also covers the latest deep learning-based text mining concepts. The
implementation of each text mining task is also part of the book. The implementation is
done using Python as the programming language and Spacy and NLTK as natural lan-
guage processing libraries. The book is suitable for both undergraduate and postgraduate
students, as well as those carrying out research in text mining. It can be used as a textbook
for both undergraduate and postgraduate students in computer science and engineering. It
is also accessible to students from other areas with adequate backgrounds.
The courses that could be offered with various chapters are as follows:
• Introductory course: If you are teaching an introductory course to text mining and natu-
ral language processing (NLP), Chaps. 1–7 will comprehensively cover the core con-
cepts of text mining.
• Advanced course: For more advanced courses, where the students already have the
basic knowledge of text mining and NLP, Chaps. 5–12 can be covered.
Individual chapters in the textbook can also be used for tutorials. Also, each chapter
ends with a set of exercises.
The largest volume of data that exists so far is definitely in text format. And with genera-
tive AI, the integration of text mining into AI represents a paradigm shift. This textbook
aims to introduce the concepts of text mining both at the fundamental level and application
level. The textbook has been written using a simple and easy-to-understand approach so
that the readers do not face any difficulty with any concept. The textbook covers both the
theory and the practical implementation. Each and every concept is explained with simple
and easy-to-understand examples. The implementation of each text mining task is also part
of the book. The implementation is done using Python as the programming language and
Spacy and NLTK as natural language processing libraries. No prior knowledge of Python,
Spacy, and NLTK is required.
Although this book primarily serves as a textbook, it will also appeal to industrial practi-
tioners and researchers due to its focus on applications. The textbook covers a wide range
of topics in text mining and NLP and can be an excellent handbook. Each chapter is self-
contained, so the reader can choose to read those chapters that are of particular interest.
About the Authors
Usman Qamar is Professor of Data Science and Head of Department, Computer and
Software Engineering, National University of Sciences and Technology (NUST), Pakistan.
He has over 15 years of experience in data sciences both in academia and industry. He has
a Masters in Computer Systems Design from the University of Manchester Institute of
Science and Technology (UMIST), UK. His MPhil and PhD are from the University of
Manchester, UK. He has published extensively in the domain of AI and data science,
including 27 book chapters, 50+ impact factor journal publications, and over 100 confer-
ence publications. He has also written four books which have all been published by
Springer, including two textbooks on data science. The textbook titled Data Science Concepts and Techniques with Applications, published by Springer, was written as an accessible textbook with the aim of presenting both introductory and advanced concepts of the emerging and interdisciplinary field of data science. Recently, the textbook was recognized among the top publications at Springer for the United Nations Sustainable Development Goal SDG4: Quality Education. He has also written a second edition of the textbook, which includes five additional chapters as well as over 300 exercise questions. He is
Editor-in-Chief (EiC) of the journal Informatics in Medicine Unlocked and Associate
Editor of Information Sciences, Applied Soft Computing, Engineering Applications in AI,
Applied Intelligence, AI and Ethics, and Computers in Biology and Medicine. He has suc-
cessfully supervised 6 PhD students and over 100 master’s students. He has received mul-
tiple research awards, including Best Book Award 2017/18 by Higher Education
Commission (HEC), Pakistan, and Best Researcher of Pakistan 2015/16 by HEC, Pakistan.
Muhammad Summair Raza is Associate Professor and Chairman, Department of
Software Engineering, University of Sargodha, Pakistan. He has a PhD specialization in
Computer Software Engineering from the National University of Sciences and Technology,
Pakistan. He completed his MS in Software Engineering from the International Islamic
University, Islamabad, Pakistan, in 2009. He has published various papers in international-
level journals and conferences with a focus on rough set theory (RST). He is also the
author of four internationally published books. His research interests include feature
selection, rough set theory, trend analysis, text mining, software architecture, and non-
functional requirements.
Part I
Text Mining Basics
1 Introduction to Text Mining
The chapter will provide details of the textual data, text mining operations, and structure
of text information systems, along with some other basic concepts. It will lay the founda-
tions for text mining.
In simple words, a text is a group of words, sentences, and paragraphs written in some
language. The language may be a natural language, for example, English, French, Japanese,
etc., or it may be some artificial language designed for a specific purpose, e.g., a program-
ming language for writing a computer program or a formal specification language used to
write the pseudocode of a program before actually translating it into a formal language.
The largest volume of data that exists so far is definitely in text format. So, once we have a brief understanding of the text and how to deal with it, this textual data can be a substantial source of information. The format and the method of storing this textual data may differ: in some cases, the textual data may be stored simply in plain documents, e.g., in MS Word or in a database, whereas in other cases it may have a more complex representation, for example, the XML format.
In this section, we will discuss different components of the textual data and some com-
mon formats that are used these days to store the text.
Textual data may comprise a number of paragraphs. Each paragraph may contain a number of sentences, and each sentence may contain a number of words. From this structure, paragraphs, sentences, and words can be considered the components of textual data. The following is an example of textual data:
People in Europe like sports. A lot of games are played there. From football to tennis,
each game has a lot of fan base. Different leagues and championship matches are arranged
every year. Competitive leagues like the English Premier League and Spain’s La Liga are
very famous. Similarly, people anxiously wait for the Wimbledon tennis championships,
Tour de France cycling race, and the prestigious Formula One Grand Race. This continent
is a hub for both traditional sports and athletic excellence.
As far as weather in Europe is concerned, it varies from country to country and region
to region. In Northern areas of Europe, the weather is cold whereas the southern parts
enjoy warmer climates. Western Europe has moderate weather with rain throughout the
year. Central Europe can have all four seasons distinctly, with cold winters and hot sum-
mers. This means that you can find everything from snowy landscapes to sunny beaches
in Europe.
The following components are present in the above textual data:
Paragraphs: 2
Sentences: 12
Words: 153
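As a rough illustration, these components can also be counted programmatically. The short sketch below uses NLTK's sentence and word tokenizers on a shortened version of the sample text above (assuming NLTK and its "punkt" tokenizer data are installed); note that the exact counts depend on the conventions used, e.g., whether punctuation marks are counted as words.

from nltk.tokenize import sent_tokenize, word_tokenize

text = ("People in Europe like sports. A lot of games are played there."
        "\n\n"
        "As far as weather in Europe is concerned, it varies from country "
        "to country and region to region.")

# Paragraphs are separated by blank lines in this simple representation.
paragraphs = [p for p in text.split("\n\n") if p.strip()]
sentences = sent_tokenize(text)
# Keep only alphanumeric tokens so that punctuation is not counted as a word.
words = [t for t in word_tokenize(text) if t.isalnum()]

print("Paragraphs:", len(paragraphs))
print("Sentences:", len(sentences))
print("Words:", len(words))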
Consider, for example, the two sentences "The cat chased the rat" and "The rat was chased by the cat". Both sentences convey the same meaning; however, the position of the words is different according to the rules of grammar. Note that the first sentence is in the active voice, whereas the second is in the passive voice. It should also be noted that in plain text, a paragraph normally ends with a carriage return, whereas a sentence ends with a punctuation mark.
There are a number of formats for representing and storing the textual data with plain text
as the most common format. In this section, we will discuss some common formats used
these days to store and present the textual data.
As mentioned earlier, the plain text format is the simplest and most commonly used format for textual data. In this format, the text is stored simply as sentences and paragraphs, e.g., in a text file with the ".txt" extension (an MS Word file with the ".docx" extension also carries text, but wraps it in additional document structure). Although plain text is the simplest text format, processing data in this format is the most difficult task, mainly due to its unstructured nature. The sample textual data shown in Sect. 1.1.1 is an example of plain text.
Another commonly used format is the Extensible Markup Language (XML) data format. This is a semi-structured format used to store textual data. It is designed to be self-descriptive, as the data semantics are embedded along with the data itself. This makes processing the data much easier as compared to the plain text format. For example, consider the following data stored in XML format:
<Email>
<To>John</To>
<From>Smith</From>
<Subject>Meeting Cancellation</Subject>
<Content>Dear John! The already communicated meeting has been cancelled. New
schedule will be announced later.</Content>
</Email>
Above is a sample script that stores the information of an email. As you can see, different tags have been introduced to mark the semantics of the contents they enclose. This makes processing of the data much easier. One of the important features of XML is that it provides a common standard for exchanging information independently of any particular software and hardware. All you need to know is the structure of the document.
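Since the tags describe the meaning of each field, such a document can be processed with any standard XML parser. The following minimal sketch uses Python's built-in xml.etree.ElementTree module to read the email fields from the script above; the variable names simply mirror the example and carry no special significance.

import xml.etree.ElementTree as ET

xml_data = """<Email>
<To>John</To>
<From>Smith</From>
<Subject>Meeting Cancellation</Subject>
<Content>Dear John! The already communicated meeting has been cancelled. New schedule will be announced later.</Content>
</Email>"""

email = ET.fromstring(xml_data)
for child in email:
    # Each tag name tells us what the enclosed text means.
    print(child.tag, ":", child.text)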
Another commonly used format for storing and presenting data is Portable Document
Format (PDF). It can be used to store both text and image data. The format stores the data
in a form independent of the underlying hardware, software, and operating system. The
format can also be used to store structural information through the use of different annotations. A lot of software support is available these days for processing textual data in PDF format. We can convert data from different formats to PDF and vice versa. Normally, a PDF file is a combination of vector graphics, text, and bitmap graphics.
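As a simple illustration of such software support, the sketch below extracts the plain text from a PDF file using the third-party pypdf package (installable with "pip install pypdf"); the file name report.pdf is only a placeholder for any PDF document available locally.

from pypdf import PdfReader

reader = PdfReader("report.pdf")  # placeholder file name
text = ""
for page in reader.pages:
    # extract_text() may return None for pages without extractable text.
    text = text + (page.extract_text() or "")
print(text[:500])  # show the first 500 characters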
Textual data is the core of text mining, so it is important to know the sources of text data.
These days, there are a number of sources that generate huge volumes of textual data on a
daily basis.
Here we will discuss a few important sources:
• Online libraries: Online digital libraries are one of the important sources of textual data.
Now, the soft copy of a book has become an essential part along with the hard copy.
Online libraries contain a major portion of text data.
• Social media: Social media is the core source of textual data these days. This data pro-
vides various opportunities for performing different text mining tasks, e.g., sentiment
analysis, product ratings, digital marketing, etc. A majority of these platforms also
provide sources to access and analyze this data using APIs. Some of these APIs include
Twitter APIs, YouTube APIs, Facebook APIs, etc.
• The Web: The Web is also one of the major contributors of textual data. We can conve-
niently conclude that almost all Web pages contain some sort of text that can be used as
source for performing different text mining tasks. A sufficient amount of textual data on
the Web comes from blogs. These days, hundreds of thousands of blogs exist on almost
every topic, which can be mined for domain-related text mining tasks. Another important source of textual data is Wikipedia, which has also been used as a source of textual data in many artificial intelligence and text mining-related tasks.
• Language datasets: Language datasets (corpora) have been another source of informa-
tion especially after the emergence of natural language processing. A number of lan-
guage corpora exist containing information like POS (Part-of-Speech) tags, entity
recognitions, stop words, etc. Furthermore, language translation has become an impor-
tant part of natural language processing these days. Such tasks are based on parallel
language corpora comprising sentences from one language and their translations in another. Table 1.1 shows a parallel language corpus containing sentences in English and the corresponding translations in French:
Table 1.1 Parallel corpus containing English sentences and the corresponding French translations

English | French
Hello, how are you? | Bonjour comment allez-vous?
Good bye! | Au revoir!
This is a car | C'est une voiture
Please let me know when you arrive at London, so that I may come to receive you | S'il vous plaît, prévenez-moi de votre arrivée à Londres, afin que je puisse venir vous recevoir.
We will be happy if you join us at the birthday party | Nous serons heureux si vous nous rejoignez à la fête d'anniversaire
I am going for a week's trip to the United States | Je pars en voyage d'une semaine aux Etats-Unis
This will definitely not impact the performance | Cela n'aura certainement pas d'impact sur les performances
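In its simplest form, such a parallel corpus can be represented in code as a list of sentence pairs. The sketch below, using a few rows of Table 1.1, is only meant to illustrate the structure; real machine translation corpora contain millions of such aligned pairs.

# Each entry pairs an English sentence with its French translation.
parallel_corpus = [
    ("Hello, how are you?", "Bonjour comment allez-vous?"),
    ("Good bye!", "Au revoir!"),
    ("This is a car", "C'est une voiture"),
]

for english, french in parallel_corpus:
    print(english, "->", french)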
Text mining deals with the use of different tools to extract information from the textual data stored in documents. Just like data mining, text mining aims to find patterns of interest in the textual data. A major difference, and perhaps the biggest issue of text mining, is that it extracts these patterns from unstructured textual data rather than from structured datasets. Just like data mining, text mining performs preprocessing on the textual data; once the data is ready for further processing, different information extraction algorithms are applied to extract the information, and finally, the extracted information is presented using visualization tools. In this sense, we can consider text mining as comprising the following steps (a minimal code sketch illustrating them follows the list):
• Data collection: This process consists of collecting the appropriate data according to
the requirements; for example, to process the sports-related data, sports blogs may be
one of the appropriate data sources. Similarly, to summarize some current political situ-
ations, newspapers may be considered as data sources.
• Data preprocessing: Note that the data collected may not be in the appropriate form to
start processing. We will have to apply various preprocessing steps to convert the data
into an appropriate format so that further analysis steps could be performed.
• Data processing and analysis: This is the core step that deals with the processing of the
formatted data and extraction of the relevant information, for example, the patterns of
interest. Various text mining tasks, e.g., classification, clustering, and sentiment analy-
sis, can be performed at this stage.
• Data visualization: Once the information is extracted, it can be presented using differ-
ent data visualization techniques appropriate to the nature of the information. Several
text visualization techniques are available. In the upcoming chapters, we will discuss
text visualization techniques in more detail.
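A minimal end-to-end sketch of these four steps is shown below. It uses only the Python standard library, hard-codes two tiny "collected" documents, and prints term frequencies as a crude text chart; it is meant purely to make the flow of the steps concrete, not to represent a realistic system.

import re
from collections import Counter

# 1. Data collection: here the "collected" documents are simply hard-coded.
documents = [
    "The match was exciting and the players were brilliant.",
    "The new policy was discussed in the parliament yesterday.",
]

# 2. Data preprocessing: lowercase, tokenize, and remove a few stop words.
stop_words = {"the", "was", "and", "were", "in", "a", "an"}
tokens = []
for doc in documents:
    for tok in re.findall(r"[a-z]+", doc.lower()):
        if tok not in stop_words:
            tokens.append(tok)

# 3. Data processing and analysis: count term frequencies.
term_counts = Counter(tokens)

# 4. Data visualization: print the most frequent terms as a simple text "chart".
for term, count in term_counts.most_common(5):
    print(term.ljust(12), "*" * count)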
Note that the abovementioned process just provides an abstract-level view of the over-
all text-mining process. There are several other details that will be discussed in depth in
forthcoming chapters.
As you can see, apparently, the text mining process seems similar to data mining; how-
ever, the major difference is the format of the data that both mining processes deal with.
This inherently changes the nature of the different processing tasks. For example, in data mining, since the majority of the data is already structured, the preprocessing steps emphasize the normalization of the data, for instance, by applying joins. In text mining, on the other hand, the preprocessing steps deal with converting the data into features so that the data can be processed appropriately.
Once the data is preprocessed and is ready to be analyzed for the further knowledge dis-
covery process, normally, three types of operations are performed.
1.4.1 Distribution
Given a set of documents D labelled with concepts from a set C, the subset of documents labelled with a single concept c ∈ C is written as D/c. So, D/Politics will contain all the documents that correspond to D/Local-Politics or D/International-Politics.
Concept Proportion:
If
D = set of documents
C = set of concepts
then F(D/C) is the ratio of the number of documents labelled with C to the total number of documents, i.e.:

F(D/C) = |D/C| / |D|
If
D = set of documents
C1 = a set of concepts
C2 = another set of concepts
then F(D/C1 | C2) = F(D/C2 | C1) = the proportion of documents labelled with both C1 and C2.
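These proportions are straightforward to compute once each document is represented by its set of concept labels. The sketch below uses a small hypothetical corpus of labelled documents; the labels and the helper function name are illustrative only.

# Hypothetical corpus: each document is represented only by its concept labels.
documents = [
    {"Politics", "Local-Politics"},
    {"Politics", "International-Politics"},
    {"Sports"},
    {"Politics", "Sports"},
]

def proportion(docs, concepts):
    # F(D/C): fraction of documents labelled with every concept in `concepts`.
    labelled = [d for d in docs if concepts <= d]
    return len(labelled) / len(docs)

print(proportion(documents, {"Politics"}))            # 0.75
print(proportion(documents, {"Politics", "Sports"}))  # co-occurrence proportion: 0.25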
1.4.2 Frequent Concept Sets

A frequent concept set comprises a set of concepts that appear together in documents at least alpha times. Here, alpha is a threshold value that may be provided by the user. Frequent concept set selection helps in various text mining tasks, for example, in association rule mining. In conventional data mining, market basket analysis is the best-known example of association rule mining, and the Apriori algorithm is the most famous algorithm for it.
1.4.3 Associations
The frequent concept selection helps in finding association rules. A rule may take the following form:

A → B

The above rule states that the transactions that contain "A" will also contain "B".
In the context of text mining, it can be stated as the documents that are labelled with the
concept “A” can also be labelled with the concept “B”.
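The sketch below illustrates both ideas on a small hypothetical collection of labelled documents: it first counts which pairs of concepts co-occur at least alpha times, and then scores a candidate rule by its confidence (the fraction of documents labelled with the antecedent that are also labelled with the consequent). The labels and the threshold are made up for illustration; a real system would use an algorithm such as Apriori.

from itertools import combinations
from collections import Counter

documents = [
    {"Politics", "Election"},
    {"Politics", "Election", "Economy"},
    {"Sports", "Football"},
    {"Politics", "Economy"},
]
alpha = 2  # user-provided support threshold

# Count how often each pair of concepts labels the same document.
pair_counts = Counter()
for doc in documents:
    for pair in combinations(sorted(doc), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= alpha}
print(frequent_pairs)  # {('Election', 'Politics'): 2, ('Economy', 'Politics'): 2}

def confidence(docs, a, b):
    # How often documents labelled with concept a are also labelled with b.
    with_a = [d for d in docs if a in d]
    return sum(1 for d in with_a if b in d) / len(with_a)

print(confidence(documents, "Election", "Politics"))  # 1.0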
There are various challenges faced by text mining algorithms that may not be present in the case of traditional data mining algorithms. The major challenge is the unstructured nature of the data that forms the input of text mining algorithms. In conventional data mining algorithms, the input may contain missing, ambiguous, or duplicate values; however, the data normally has a well-defined structure. For example, consider the Student dataset given in Table 1.2, containing the records of the students of a school class.
Note that Elizbeth's physics grades are not known, which may create problems for algorithms using this dataset; however, the dataset has a well-defined structure. The algorithm may know in advance the semantics of the data, e.g., that the first column contains the student's name, the second column contains the grades in math, and so on. Furthermore, the algorithm may also have exact information about the format of the data.
However, this may not be the case for textual data. For example, consider the following
sentence:
To login to the system, the user must provide a valid username and password.
Now, the following are the different ways the same sentence can be written:
Format-1: The user must provide a valid username and password to login to the system
Format-2: The user will have to provide the valid username and password to login to
the system
Format-3: To login to the system, a valid username and password will be provided by
the user
Format-4: The user cannot login to the system until he/she provides a valid username and
password
Format-5: A valid username and password is mandatory to login to the system
Note that this is just one example of writing an English sentence in different formats,
all of which convey the same meaning. We have hundreds of languages, and each has its
own formats and grammar.
The unstructured nature of textual data means that the same dataset cannot (at least without preprocessing) be used for two different algorithms. Overcoming this problem requires a lot of preprocessing, which means additional tasks have to be performed by the system.
Apart from the syntactical mistakes, e.g., incorrect grammar or spelling mistakes, the
ambiguities are the other common issues that a text mining algorithm has to face. For
example, consider the following sentence:
John saw a boy with a telescope.
Does the above sentence mean that when John saw the boy, he (the boy) had a telescope? Or does it mean that John had a telescope, with the help of which he saw the boy?
Similarly, consider the following sentence:
All the beautiful men and women went to the seminar.
Does the adjective “beautiful” apply to both men and women or does it simply apply to
“men” as it is concatenated with “men” only?
Following are some of the examples of the ambiguities that may exist in a sentence:
Example-1: The user will provide a username and password not less than eight characters.
Ambiguity: Does the condition of not being less than eight characters apply to the username, the password, or both?
Example-4: Please take your time but note that we will have to meet the deadline.
Ambiguity: The sentence contains two parts that suggest conflicting expectations: take your time, yet the deadline must be met.
As mentioned earlier, it is not possible for text mining systems to process raw text due to its unstructured nature. The only possible way to feed the text to a text mining algorithm is to convert it into a proper format, or at least into the form of single words, so that the algorithm can understand these words. The text indexing process is the method of converting textual sentences into words or tokens that an algorithm can process. Figure 1.1 shows the text indexing process. Now we will discuss each of these processes one by one.
1.6.1 Tokenization
Tokenization is the process of splitting a sentence into single words. Each word is called a token. This is necessary because each sentence is formed by a set of words, where each word plays a specific role in giving proper semantics to the sentence. So, before feeding a sentence to the mining algorithm, it is necessary to convert the sentence into tokens.
Consider the following example:
Before the boarding starts, you should ensure that you have purchased all the neces-
sary amenities.
Before we feed this sentence to the mining algorithm, we will have to separate each
word into a token. This is the step where tokenization helps. In the next chapter, we will
provide implementation of this step as well.
The tokens of the abovementioned example will be:
‘Before’, ‘the’, ‘boarding’, ‘starts’, ‘,’, ‘you’, ‘should’, ‘ensure’, ‘that’, ‘you’, ‘have’,
‘purchased’, ‘all’, ‘the’, ‘necessary’, ‘amenities’, ‘.’
Similarly, consider the following sentence:
A cat is chasing the rat. It was in the house.
The above example will be tokenized as follows:
‘A’, ‘cat’, ‘is’, ‘chasing’, ‘the’, ‘rat’, ‘.’ ‘It’, ‘was’, ‘in’, ‘the’, ‘house’, ‘.’
It should be noted that in all the above sentences, the tokenization is performed on the basis of the "Space" character (with punctuation marks split off as separate tokens). It means that the tokenizer will identify two words as separate tokens if there is a space between them. However, there are many languages, for example, Chinese and Japanese, where the "Space" character is not used to separate words. For such languages, the tokenizer has to be specially trained to segment the words.
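As a small preview of the implementation covered in the next chapter, the sketch below tokenizes the first example sentence with spaCy (assuming the en_core_web_sm model has been downloaded); NLTK's word_tokenize would give a very similar result.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Before the boarding starts, you should ensure that you have "
          "purchased all the necessary amenities.")

# Each token, including the punctuation marks, becomes a separate item.
tokens = [token.text for token in doc]
print(tokens)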
1.6.2 Stemming
Once tokenization is performed, we get the single words that the entire sentence comprises. However, in the majority of cases, the words need to be converted to their roots in order to get the exact semantics. It is common in the English language that words are not used in their root form. For example, consider the following sentence:
I am eating bananas.
Now, the tokens will be “I”, “am”, “eating”, and “bananas”.
However, we need to convert the words “eating” and “bananas” to their root form in
order to get their actual semantics.
So the words “eating” will be converted to “eat” and the word “bananas” will be con-
verted to “banana”.
The figure below shows the stemming process (Fig. 1.2).
Normally, the following grammatical components are converted to their root words:
• Nouns
• Verbs
• Adjectives
In the case of nouns, we need to convert plural nouns into singular nouns, e.g., consider the following sentence:
Students will register for the seminar.
In the above example, the plural noun “Students” will be converted to “Student”.
Similarly, the verbs need to be converted to their root. For example, consider the fol-
lowing sentence:
John eats the bananas.
In the above sentence, the word “eats” needs to be converted to “eat”, so the “s” will be
removed.
The following are examples of some words converted to their root form:
Applies → Apply
Systems → System
Surprisingly → Surprise
Boarding → Board
Generalization → Generalize
Simplest → Simple
Specifies → Specify
Goes → Go
Reading → Read
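The sketch below applies NLTK's Porter stemmer to a few of the words above and then shows spaCy's lemmatizer on a short sentence (assuming the en_core_web_sm model is available). Note that a stemmer works by truncating suffixes, so it may produce stems that are not dictionary words (e.g., "goe" for "goes"), whereas a lemmatizer returns proper root forms closer to the table above.

import spacy
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["boarding", "eating", "bananas", "reading", "goes"]:
    print(word, "->", stemmer.stem(word))

# Lemmatization maps each word to its dictionary root form.
nlp = spacy.load("en_core_web_sm")
doc = nlp("John eats bananas while the students are reading.")
for token in doc:
    print(token.text, "->", token.lemma_)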
Sometimes, in a sentence, there are some words that contribute nothing to the semantics of
the sentence, e.g., “a”, “the”, and “,”. So, before processing a sentence, all such words are
removed. This is called stop-word removal. For example, consider the following sentence:
John will take a train to New York after he gets the clearance from the office.
Now, in the above sentence, the words “a”, “to”, and “the” are words that do not con-
tribute much to the semantics of the sentence, so we can safely remove them. It should be
noted that there is no universal mechanism to decide which word should be a stop word. In order to remove stop words, a model is trained by feeding it different words as stop words. This is normally done by preparing a list of stop words in the form of a corpus. It should be noted that stop-word removal and stemming can be swapped with each other, i.e., either step can be performed before the other.
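A common way to do this in practice is to use the stop-word list shipped with NLTK, as in the sketch below (it assumes the NLTK "stopwords" and "punkt" resources have been downloaded); which words end up being removed depends entirely on the list that is used.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sentence = ("John will take a train to New York after he gets the clearance "
            "from the office.")

stop_words = set(stopwords.words("english"))
filtered = [w for w in word_tokenize(sentence) if w.lower() not in stop_words]
print(filtered)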
Each term or token has a specific importance, e.g., we can determine the category of the
document on the basis of the tokens used in a document. Term weighting is the process of
assigning relevant weights to the tokens on the basis of their importance in the sentence.
Different measures can be used in this regard.
One of the simplest methods of weighting the terms is the measure called term fre-
quency (TF). Term frequency assigns weights on the basis of the occurrence of a term in a
document. However, the problem with this method may be the overestimation or underes-
timation of a term on the basis of the size of the document. Another measure may be the
relative term frequency, which refers to the ratio of a term in a document to the max-
imal one.
Another important term that is used to measure how common or rare a word is in a
corpus or set of documents is Inverse Document Frequency (IDF).
This is calculated by dividing the total number of documents by the number of docu-
ments that contain the term and finally taking the log.
Mathematically:

IDF(t) = log(N / n)
Here “N” is the total number of documents in a corpus, and “n” represents the docu-
ments that contain the term “t”.
From the above formula, it can be noted that the terms that appear in a greater number
of documents will have lesser value of IDF and vice versa.
Multiplying TF and IDF gives us the measure called Term Frequency-Inverse Document Frequency (TF-IDF). A higher TF-IDF value indicates that the term is more relevant to the document in question.
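The sketch below computes these quantities directly from the definitions, using raw term counts for TF and IDF(t) = log(N/n); the three toy documents are made up for illustration, and libraries such as scikit-learn provide ready-made (and slightly differently normalized) TF-IDF implementations.

import math

documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
tokenized = [doc.split() for doc in documents]
N = len(tokenized)

def tf(term, doc_tokens):
    # Raw number of occurrences of the term in the document.
    return doc_tokens.count(term)

def idf(term):
    # log(N / n), where n is the number of documents containing the term.
    n = sum(1 for doc in tokenized if term in doc)
    return math.log(N / n)

for term in ["the", "cat", "pets"]:
    print(term, "tf-idf in document 0:", tf(term, tokenized[0]) * idf(term))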
Any text information system can provide various functions including information access,
information processing and analysis, and result visualization. Here, we will discuss some
basic functions of the text information system.
One of the basic functions of the text information system is to provide access to informa-
tion to the user for his/her needs. For this, an information system should have access to the
sources of information. This process starts when the user provides some input query
describing the nature of the information needed. The system then tries to find the relevant
information from the available repositories and provides this information to the user as
feedback. A simple example of such a system may be the search engines that provide rel-
evant information to the user in the form of Web pages.
Beyond simply retrieving information, a text information system should be able to analyze the extracted information and provide the required knowledge after analysis. Most of the search engines these
days lack this ability. It should be noted that knowledge acquisition does not only mean the
extraction of the relevant information from repositories, but it also means that the system
should be able to perform different analysis tasks and provide the required knowledge to
the user.
Once the information is extracted and the results are analyzed, the presentation of the
results is another important task of text information systems. This task requires the infor-
mation to be structured in a manner that could be easy for the user to navigate and could
provide the insight the user requires. A simple example may be to structure the informa-
tion in the form of the three-dimensional cubes or trees that the user could expand to get
the required information.
Each text information system takes some input from the user in the form of a query, ana-
lyzes that query, and extracts information from different sources relevant to the query.
Once the information is extracted from different sources, the information is analyzed and
the results are returned to the user. Figure 1.3 shows the conceptual framework of a text
information system. Now, we will discuss a few of the components of the system:
• User Interface Module: This is the module that the user uses to interact with the system.
The module provides two types of interfaces: the first one for the input, where the user
can input the query to the system, and the second one that displays the extracted infor-
mation from the system. Note that we have already discussed that the information is
presented to the user in a structured format. This is the interface that is responsible for
providing this structured information.
• Information Extraction Module: This module receives the input query and retrieves the
relevant information from different textual data repositories. Once the relevant infor-
mation is retrieved, the information is sent to the output interface, which is then dis-
played to the user.
• Information Processing and Analysis Module: If the information extracted from differ-
ent data repositories need to be analyzed as per request of the user, this module per-
forms the task. Different mining-related tasks, e.g., classification, clustering, rule
mining, similarity checking, etc., are performed by this module. Once the results are
prepared, the results are sent to the output interface for visualization.
Fig. 1.3 Conceptual framework of a text information system, comprising the user interface, the information extraction module, the data repositories, and the information processing and analysis module
A pattern specifies a certain sequence. In textual data, there may exist certain sequences
that can be of particular interest to a user for certain requirements. For example, consider
the following pattern: noun-verb-noun.
This pattern captures significant information, such as which action was performed, by whom, and on what. Patterns can be identified in different ways; for example, natural lan-
guage processing libraries provide a method to find patterns in the form of regular expres-
sions. However, here, we will specify a manual method and specify the logic using the
Spacy NLP library. The source code below takes user requirements as input and identifies
the class diagram components from the requirements. However, before proceeding, note
that the patterns are specified in the following format:

NNS:C, VB:ASOC, JJ:M-R, NNS:C

The first "NNS:C" specifies the first "Class" that participates in a relation, and
“VB:ASOC” specifies the association that exists between the classes. The “JJ:M-R” speci-
fies the multiplicity. “NNS:C” specifies the second class that participates in the relation.
So now consider the sentence:
Customer purchases many products.
The sentence specifies that there is one-to-many association between the customer and
product.
The following source code shows the implementation of this pattern identification:
# Excerpt of the class components finder. The database connection (pyodbc cursor),
# the GUI text widgets (req, poslog, cd), and the parseprimary(...) function are
# defined in parts of the listing that are not reproduced here.
import pparser  # helper module of the full application (not used in this excerpt)
import spacy
import pyodbc
import tkinter
from tkinter import scrolledtext

nlp = spacy.load("en_core_web_sm")

def createposlog(sentence):
    # Show each token of the sentence along with its part-of-speech tag
    # in the "poslog" text control on the interface.
    tmp = ""
    for t in sentence:
        tmp = tmp + t.text + ":" + t.tag_ + ", "
    tmp = tmp + "\n"
    a = sentence.text + "\n" + tmp
    poslog.insert(tkinter.INSERT, a)
    poslog.insert(tkinter.INSERT, '\n')

def createclassdiagram(sentence):
    # Load the stored patterns from the database and match them against the
    # sentence (the matching code is omitted in this excerpt).
    cursor.execute('select * from Pat order by ID')
    plength = 0
    foundedpatterns = 0
    rows = cursor.fetchall()
    # ... remainder of the function omitted ...

def Convert():
    # Handler for the "Convert" button: POS-tag each sentence of the
    # requirements text and extract class diagram components from it.
    poslog.delete('1.0', tkinter.END)
    cd.delete('1.0', tkinter.END)
    txt = nlp(req.get("1.0", tkinter.END))
    for sentence in txt.sents:
        createposlog(sentence)
        createclassdiagram(sentence)

master = tkinter.Tk()
master.title("Class Components Finder")
master.geometry("1000x750")

# The following fragment belongs to the parseprimary(...) function, whose
# earlier lines are not reproduced in this excerpt:
#     if pcounter == len(pat):
#         return True, r       # pattern found
#     else:
#         return False, None   # pattern not found

def getpatant(pp):
    # Split a pattern element such as "NNS:C" into its POS tag and annotation.
    p = pp.split(":")
    return p[0], p[1]

def parsesecondary(pcd, sp, nlp):
    # Re-interpret the components found by the primary pattern (pcd) according
    # to a secondary pattern string (sp), e.g. "NN:C, [JJ:M-L], VBD:ASOC, ...".
    r = []
    sptemp = sp.split(",")
    pcdcounter = 0
    for i in range(0, len(sptemp)):
        temptxt1 = sptemp[i]
        if temptxt1[0] != "[":
            sptemp1 = sptemp[i].split(":")
            x = pcd[pcdcounter].split("->")
            if nlp(x[0])[0].tag_ == sptemp1[0]:
                r.append(nlp(x[0])[0].lemma_ + "->" + x[1])
            pcdcounter = pcdcounter + 1
        else:
            # Elements enclosed in brackets are fixed annotations that are
            # copied directly into the result.
            totext = sptemp[i]
            totext = totext[1:len(totext) - 1]
            sptemp1 = totext.split(":")
            r.append(sptemp1[0] + "->" + sptemp1[1])
    return r
Note that patterns are saved in a database, which is accessed using Microsoft Access
Driver. The function “createposlog(…)” takes a single sentence as input and shows its
tokens along with the part of speech tag of the token. On interface, this is displayed in a
list control named “poslog”. The function “createclassdiagram(…)” takes as input a sen-
tence and parses the relevant patterns from the database. Note that there are two parsing
functions, namely, “parseprimary(…)” and “parsesecondary(…)”. Sometimes, a primary
pattern may not contain a certain detail, so the secondary pattern has also been parsed. For
example, consider the following pattern:
NNP:C, VBD, NNP
This pattern specifies that there are many subjects and objects, for example, consider
the following sentence:
Customers purchase products.
This specifies that there are multiple customers and multiple products. Now, this pat-
tern needs to be specified as:
NN:C, JJ:M-L, VBD: ASOC, JJ:M-R, NN:C
In a text processing system, documents and corpus are the important components. A cor-
pus is a collection of documents, whereas a document contains textual data. For example,
the corpus named “FIFA World Cup” may contain a number of documents, each document
containing the details of each World Cup tournament.
A text information system may process the documents in a number of ways. For example,
it can load the data saved in the documents and can perform certain analysis tasks as per
requirements of the user. For this purpose, there are a certain set of tasks that a system may
have to perform. For example, for the system that has been mentioned in Sect. 1.9, the fol-
lowing tasks are performed on the data:
In many cases, a corpus may be used as a baseline on the basis of which different tasks are performed by the algorithm. Such a corpus acts in the same way as the training data in conventional machine learning algorithms. The information that may be obtained from such a corpus may
be the POS tags, named entities, patterns, class labels, etc. One of the important uses of a
corpus as a baseline is when it is used for machine translation, in which case it contains the
sentences from one language and the corresponding sentences from the other language.
Natural language libraries such as Spacy and NLTK provide a number of corpora for
performing basic operations. In fact, all the basic tasks, e.g., POS tagging, entity recog-
nitions, etc., are performed using these corpora. In Sect. 1.9, we have used pattern cor-
pus for storing the pattern and the corresponding class diagram components the pattern
identifies.
In Sect. 1.9, we have discussed pattern matching on the basis of the patterns stored in a
corpus. Regular expressions are another, more efficient way of performing pattern matching. We specify a pattern, and the sentence is chunked on the basis of the specified pattern. In Python, the standard module "re" provides the functionality for extracting text on the basis of regular expressions (and it is commonly used alongside NLTK's tokenizers). For example, consider the following code:
import re
from nltk.tokenize import word_tokenize  # requires the NLTK "punkt" data

sentence = "now is the time for all good men to help others"
tokens = word_tokenize(sentence)
for t in tokens:
    if re.search("e$", t):
        print(t)

The output of the above code is:

the
time
Here, we have used the module "re" and specified the regular expression "e$", which matches all the words that end with the letter "e". In this way, we can form more complex regular expressions on the basis of which different types of tokens can be extracted from the text.
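Besides plain regular expressions over strings, spaCy also offers a rule-based Matcher that works over token attributes such as part-of-speech tags, which is convenient for patterns like the noun-verb-noun pattern of Sect. 1.9. The sketch below is a simplified alternative to the database-driven matching shown earlier and assumes the en_core_web_sm model; whether a given sentence matches depends, of course, on how the tagger labels its tokens.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# A noun-verb-noun pattern over coarse part-of-speech tags.
matcher.add("NOUN_VERB_NOUN", [[{"POS": "NOUN"}, {"POS": "VERB"}, {"POS": "NOUN"}]])

doc = nlp("Customers purchase products.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)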
1.12 Summary
In this chapter, we have provided the basic concepts of textual data and the different stor-
age formats. We discussed different sources of data. We have provided details of what text
mining is all about and the types of challenges involved. We have discussed in detail about
different text operations that a basic text information system has to perform. We also dis-
cussed what is meant by a corpus and what are the different ways to use it. Finally, we
discussed about the regular expressions and the way to extract information from the text
using these regular expressions.
1.13 Exercises
Q1: Consider the following sentence and identify the ambiguity in it.
There should be sufficient balance in the account to withdraw the money.
Q2: Write down five sentences and try to find the stop words in each of these sentences.
Q4: Consider the following text and perform the tokenization process.
Tokenization is the process of converting a sentence in the form of single words. Each
word is called a token. This is necessary because each sentence is formed by a set of
words where each word plays a specific role to give proper semantics to the sentence.
Q5: Search at least five commonly used text corpus from the Internet and mention
their names.
Q6: Prepare a corpus of your own specifying at least 100 words of your native language
and their semantic in the form of a sentence.
Q7: Prepare a corpus specifying at least ten sentences of your native language and the
corresponding translation in English.
Note: If your native language is English, you can translate it in some other language. You
can use Google Translate service for this purpose.
Q9: Write down the names of different components of textual data and diagrammatically
show their hierarchy.
Q10: Consider the following text and identify the number of paragraphs, sentences, and
words from this.
I am Smit. I live in UK. I am 18 years old.
Recommended Reading
• Mining Text Data by Charu C. Aggarwal and ChengXiang Zhai
Publisher: Springer
Publication Year: 2014
The book is one of the best resources for researchers and students at the same time. It
provides basic concepts along with examples. The book covers a wide range of topics
starting from basic to advanced concepts like natural language processing and informa-
tion retrieval.
It also discusses advanced topics like text summarization, topic mapping, text segmen-
tation, etc.
2 Text Processing

This chapter will provide technical details about what types of tasks are performed during the processing of text. As natural language is one of the major sources of textual data, things will be discussed from the natural language point of view.
A natural language is a natural way of communication. Just like the formal communication
methods that exist these days, natural languages are also a way to communicate between
humans. In this section, we will discuss different aspects of the natural language and how
the data generated from the natural language can be processed.
Communication methods have been evolving with the passage of time, right from the emergence of modern humans. From simple gestures to modern natural languages, natural language has always been a convenient medium of communication for humans.
These days, we can safely claim that natural languages are one of the sources that gen-
erate enormous amount of data. Each country, region, and culture has its own language,
which adds to these enormous data volumes generated on a daily basis.
English, French, Japanese, and Chinese are among the hundreds of the natural lan-
guages that exist these days.
The important aspect of the natural language comes from the word “natural”, which
means that a natural language is something that evolves itself with the passage of time
rather than some other language (e.g., a computer programming language) that is artifi-
cially created by humans.
The overall communication process may take the following shape (Fig. 2.1):
As shown in the figure, the thoughts of the person are translated into symbols. This
translation process happens in the context of the semantics the source person has in his/her
mind. After the thoughts are translated into symbols, these symbols are then communi-
cated to the audience. The reverse process is performed on the side of the audience. Once
the symbols are received, the audience transforms these symbols into the attached seman-
tics. Once the semantics are clear, the target audience understands what the source wants
to communicate.
The important aspect of this communication is the understanding of the symbols and
the common semantics. It is quite a common phenomenon that a person from one region
Fig. 2.1 The communication process: the thought of a person is transformed into symbols, and the symbols are transmitted through some medium
may not understand the language of the other region. For example, a person who is listen-
ing or reading Japanese for the first time may not understand the meaning. The actual
reason is that he does not understand the symbols and the attached semantics. These days,
natural languages have become one of the biggest sources of data that need to be processed
on a daily basis.
The above-discussed phenomenon refers to the process of language acquisition, that is,
how language can be used to communicate. Now, we will discuss some of the usages of
language, i.e., we will discuss different types of semantics that can be communicated
using language.
One of the usages of language is to communicate some information. It should be noted that the information may be true or false. Similarly, the information may already be known to the receiver. For example, consider the following sentence:
The earth rounds the sun in twenty-four hours.
The sentence communicates the information to the receiver. It is not mandatory for the
receiver to already know the information. Furthermore, as mentioned above, the informa-
tion may be true or false.
Secondly, language can be used to communicate orders where the sender directs the
receiver to perform some action. For example, consider the following sentence:
Contact me at 4 PM today.
The above sentence shows the directions given to the receiver. It should be noted that
not only the orders but requests can also be communicated in this way. For example, con-
sider the following sentence:
Please send me some money.
In the above sentence, the sender is requesting the receiver to send him some amount.
Thirdly, the sender may communicate future actions, including the promises, pledges,
oaths, etc. For example, consider the following sentence:
I promise that I will send you one thousand dollars tomorrow.
In the above sentence, the sender binds himself to send that amount of money to the receiver.
Fourthly, the sender may communicate personal expressions to the receiver on the basis
of the current actions that have happened. For example, if the son has passed an exam, the
father may congratulate him as follows:
Congratulations dear, you have done a wonderful job.
In the above sentence, the happiness of the father is communicated to the son.
Finally, some strong declarations may be communicated where a person may mention
his/her final decisions to some actions, e.g.:
You are not honest, so I can’t trust you.
In the above sentence, the sender is giving his/her final decision on what he thinks
about the receiver.
2.2 Linguistics
So far, we have studied about what a natural language is and how it can be used. However,
in academia, languages are formally studied and researched. This refers to the domain of
linguistics. Linguistics may be formally defined as the study of the structure and the
semantics of a language. We discuss the structure of sentences, the attached semantics, and
the context in which these sentences can be used.
To some extent, we can compare linguistics with philosophy, in the sense that in linguistics we mainly study the philosophy of a language.
Table 2.1 shows different aspects studied in the domain of linguistics.
Although the table below provides some common topics that are studied in linguistics,
linguistics is a much bigger domain and covers many other aspects. In this chapter, we will
keep our discussion focused on the syntax and semantics of the natural language. We will
study these aspects for the English language only.
Table 2.1 The common list of topics studied in the domain of linguistics
Phonetics: The acoustic properties of sounds are studied
Phonology: Sound patterns are studied
Syntax: The structure of sentences is studied
Semantics: The meanings of words and sentences are studied
Morphology: Morphemes are studied, where a morpheme is the smallest unit that has a meaning
Lexicon: The properties of words and phrases are studied
Pragmatics: The impact of the context on the meaning of a sentence is studied
Stylistics: Different communication aspects like tone, accent, voice types, etc. are studied
Each language has a proper syntax and structure that specify the way the sentences of the language are formed. This syntax has to be followed for correct communication, and it must likewise be followed by any natural language processing system that is to correctly process text in that language.
For example, consider the following English language sentence.
It is better if we have a meeting this weekend to discuss all the issues.
The abovementioned sentence has a proper structure that is mandatory to follow for
someone to correctly communicate the semantics. If, however, we do not follow the
sequence, the sentence will not only be incorrect but also make no sense. For example,
consider the following sentence where the sequence is totally shuffled.
Better if have we this weekend a it issues is all the to meeting this all discuss.
It is apparent that the above sentence does not make any sense. The same is the case
with the syntax and structure in other languages.
In the English language, sentences comprise clauses, clauses in turn comprise phrases, and phrases comprise words.
Now, we will discuss each of these one by one.
2.2.2 Words
Words are the smallest units that may have a distinct meaning. In the following sentence:
This text mining book explains some basic concepts.
Each individual unit separated by a space is a word. Normally, in a sentence, the words
can correspond to the following categories:
Nouns: A noun represents some real-world entity or object. For example, the word "Smith", which represents a real-world person, is a noun. Similarly, the word "student" may represent a real-world person studying in some institute. Consider the sentence given below:
Smith will register a seminar.
In the above sentence, the words “Smith” and “seminar” are the nouns. Nouns can be
further categorized into two categories called common nouns and proper nouns.
Verbs: Verbs represent the actions in a sentence, for example, consider the following
sentence:
I am reading a book.
In the above sentence, the word “reading” is a verb.
Other examples may be write, eat, play, etc.
Adjectives: Adjectives qualify other words. They typically describe the qualities of
other words, for example, nouns. Consider the following sentence:
He is a tall boy.
In the above sentence, the word “tall” is the adjective that describes the quality of the
noun boy.
2.2.3 Phrases
Although a word can be considered a single unit that gives a unique meaning, words can be combined to form phrases. A phrase is a combination of words that provides some meaning. A phrase normally contains two or more words; however, a phrase may also contain only one word. The following are some phrase categories:
• Noun phrase
• Verb phrase
• Adjective phrase
• Adverb phrase
• Prepositional phrase
2.2.4 Clauses
Clauses can be categorized as follows:
• Declarative: These clauses have a neutral tone and provide some simple information, for example, "The sky is blue". This is one of the most common clause types used in daily life.
• Imperative: This clause contains some request, order, some advice, etc., for example,
“Please leave the room.”
• Relative: As the name implies, these clauses refer to some other part of the sentence for
providing the meaning, for example, consider the following sentence:
Smith says, he will go to London.
In the second part of the sentence, i.e., “he will go to London”, the word “he” refers
to Smith.
• Interrogative: These clauses mention some question, for example, consider the follow-
ing sentence:
Will you go to London tomorrow?
• Exclamative: These clauses provide the expression, e.g., happiness, sorrow, shock, etc.
For example, consider the following sentence:
What a movie!
2.2.5 Grammar
Grammar provides a set of rules that are applied to the sentences of a specific language. For correct communication, these rules should be followed. Each language has its own grammar. For example, in the English language, when a sentence is written in the simple present tense with a third-person singular subject, the verb takes the suffix "s" or "es". Consider the following sentence:
"John eats a banana".
Note that the verb "eat" takes the suffix "s". Similarly, if the sentence mentions an event that is in progress right now, i.e., the sentence is in the present continuous tense, we add "ing" to the verb. Consider the following sentence:
“I am reading a book”.
As the action of reading the book is currently happening, "ing" has been added to the verb "read". Similarly, there are other rules, e.g., the subject usually occurs at the start of the sentence, and the object occurs at the end or in the last part of the sentence.
For example, consider the following sentence:
Smith reads the book.
In the above sentence, the subject, i.e., “Smith”, appears at the start, and the object, i.e.,
“book”, appears at the end.
Typology refers to how languages can be classified on the basis of the syntax and structure of their sentences. Word-order typology is the classification scheme in which languages are classified on the basis of the order of words. The English language follows the "Subject-Verb-Object" order. However, this is not the case in all languages; e.g., the Japanese language follows the "Subject-Object-Verb" order. For example, the English sentence "Smith reads the book" (Subject-Verb-Object) would follow the order "Smith the book reads" (Subject-Object-Verb) in Japanese.
Semantics refers to the study of meaning. These meanings are obtained from the relationships between words and phrases. However, it is not only the sentence that conveys meaning; facial expressions and body language are also part of semantics.
Lexical semantics deals with lexical units, which are the smallest units of a language that carry meaning. Lexical analysis involves various related concepts, for example:
1. Lemma and word forms: A lemma is the base word from which other words are derived; the derived words are called word forms. For example, the word "Live" is the base word, i.e., the lemma, and its other forms, e.g., "Lived" and "Living", are the word forms.
2. Homonyms: Homonyms are words that have the same spelling or pronunciation but different meanings. For example, the word "Bank" can refer to a riverbank or to a commercial bank. The intended meaning is then derived from the context of the sentence.
3. Homographs: Homographs are words that have the same spelling (and may or may not have the same pronunciation) but different meanings. For example, the word "Lead" can refer to a metal or to the act of guiding someone. Again, the meaning of the word is derived from the context in which it is used.
4. Homophones: Homophones have the same pronunciation but different spellings and meanings. For example, consider the words "Steel" and "Steal", or "Pair" and "Pear".
5. Synonyms: Words that have different spellings and pronunciations but the same meaning, for example, "Usually" and "Often". Both of these words refer to something that occurs frequently.
6. Antonyms: Words that have opposite meanings. For example, an antonym of both "Usually" and "Often" is "Rarely".
In semantic networks, we relate lexical units and form connections between them to represent a concept. This lets us represent a concept in the form of a graph or a hierarchical structure. The nodes of the graph represent the concepts, and the edges represent the relationships between them. Typical semantic relationships include, for example, "is-a" and "part-of" relations.
Figure 2.2 shows a sample graph that represents the semantic model of a vehicle.
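As a small illustration (the relation names "is-a" and "used-for" below are assumptions made for this sketch, not taken from the figure), the vehicle network of Fig. 2.2 could be represented in Python as a list of labelled edges:

# A tiny semantic network for the vehicle example, stored as labelled edges.
semantic_network = [
    ("Car", "is-a", "Vehicle"),
    ("Truck", "is-a", "Vehicle"),
    ("Vehicle", "used-for", "Travelling"),
    ("Truck", "used-for", "Cargo"),
]

# List every concept that is a kind of Vehicle.
print([s for s, rel, o in semantic_network if rel == "is-a" and o == "Vehicle"])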
So far, we have discussed how to derive semantics from lexical units. In everyday conversation, deriving the exact semantics is rarely a problem, as the intended meaning is usually well understood. The situation is not so simple, however, when it comes to formal communication. We need to ensure that a sentence conveys exactly the semantics that are intended. For this purpose, we need a formal representation of the semantics. There are different ways to formally specify semantics; here, we will discuss propositional logic and first-order logic.
(Fig. 2.2: A semantic network relating the concepts Vehicle, Car, Truck, Cargo, and Travelling.)
Propositional logic combines propositions using logical operators such as AND, OR, and NOT. Besides these, there are other operators, e.g., the If-Then (implication) operator and the Iff (if and only if) operator.
Now, we will see examples of these operators, but before that, let's look at the symbols used for them (Table 2.2).
First, let’s check the example of an AND operator. Consider the following sentence:
Tomorrow is Monday and there will be a football match.
The above sentence has two parts: "Tomorrow is Monday" and "there will be a football match". As can be seen, there is an AND operator between the two parts, so both parts must be true for the proposition to hold. We can write it as follows:
P = Tomorrow is Monday
Q = There will be a football match
P∧Q
Logically, for P∧Q to be true, both P and Q must be true. Table 2.3 shows the truth table for the AND operator.
Now, let’s consider the OR operator. Consider the following sentence:
Smith will go to London or New York.
Again, this sentence has two parts: "Smith will go to London" and "Smith will go to New York". There is an OR operator between these two parts. The overall proposition can be represented as follows:
P∨Q
For this proposition to be true, either P or Q (or both) should be true. However, if Smith goes to neither London nor New York, the proposition will be false. Table 2.4 shows the truth table of the OR operator.
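As a small Python sketch, the truth tables for the AND and OR operators can be generated as follows:

from itertools import product

# Print the truth tables for the AND and OR operators over propositions P and Q.
print("P      Q      P AND Q   P OR Q")
for p, q in product([True, False], repeat=2):
    print(f"{str(p):<6} {str(q):<6} {str(p and q):<9} {str(p or q)}")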
The NOT operator negates the proposition. For example, consider the following
sentence:
Smith will go to London.
Now, if we want to convey that Smith will not go to London, we can do the following:
¬P
Next is the implication operator. As shown, it represents the concept of “Implies”, i.e.,
if one thing (independent variable) is true, then the second thing (dependent variable) will
also be true. For example, consider the following sentence:
If it rains, Smith will take the soup.
Now, this sentence has a condition, i.e., “to take the soup” depends on whether it will
rain or not. If it will rain, the soup will be taken. This can be shown in propositional logic
as follows:
P = It rains
Q = Smith will take the soup
P→Q
The IFF (if and only if) operator ties the dependent variable strictly to the independent variable, i.e., it puts a stronger bound on it. For example, consider the following sentence:
Smith will take soup only if it rains.
Now, a restriction has been put on taking the soup: the soup will be taken only if it
rains and not in any other case. This can be represented as follows:
P = It rains
Q = Smith will take the soup
P↔Q
Although propositional logic seems easy and convenient, it has certain limitations, the major one being that it cannot easily represent objects, their properties, and quantified facts. These issues are resolved by first-order logic. The following are some common components of first-order logic:
• Objects
• Properties
• Functions
• Relations
• Quantifiers
• Connectives
∀s ∈ S, ∃m ∈ M : register(s, m)
In the above representation, "s" is a member of "S", where "S" represents the set of students. The symbol "∀" means "for all", so whatever action is mentioned later applies to all members of the set "S". "M" is the set of seminars, and "m" is one of its members. The symbol "∃" means that at least one such member exists. Finally, "register(s, m)" states that each student s registers for at least one seminar m.
As can be seen, first-order logic is more expressive in nature, handling properties and quantification, and it allows us to state conditions explicitly.
Corpora is the plural of the word corpus. A text corpus is a collection of text that can be used as a base or reference for performing various text processing tasks. A text corpus can serve the purpose of training a model; on the basis of the information provided by the training corpus, an algorithm can then perform the related tasks on unseen text.
A text corpus can contain text from a single language or from multiple languages. The benefit of using text from multiple languages is that language translators can be developed that use such corpora to translate sentences from one language to another. Such corpora are also called parallel corpora. A parallel corpus pairs each sentence in one language with its translation in another; for example, the English sentence "I am in London" can be paired with the French sentence "je suis à Londres". Now, wherever a sentence like "I am in London" appears, it can be translated to "je suis à Londres".
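As a minimal illustration (the second sentence pair below is an added example for this sketch), a small parallel corpus can be represented in Python as a simple mapping:

# A toy parallel corpus: English sentences paired with their French translations.
parallel_corpus = {
    "I am in London": "je suis à Londres",
    "I am reading a book": "je lis un livre",
}

print(parallel_corpus["I am in London"])   # je suis à Londres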
One of the major benefits of corpora is that they can provide a lot of metadata. This information is provided in the form of annotations, which act as a guideline for the algorithms that use these corpora. The following is a list of annotations that can
be provided in a corpus:
• Translation
• POS
• Stop words
• Named entity recognition
• Grammar
• Semantic types
These days, a number of corpora are available that can be used for many types of tasks. These corpora can be specialized, e.g., containing the text and metadata of a specific domain. The following is a list of corpora that are freely available with NLTK:
• Gutenberg Corpus
• Web and Chat Text
• Brown Corpus
• Reuters Corpus
• Inaugural Address Corpus
We can access text corpora in our Python code and use them as required. A number of APIs are already available in different libraries for accessing these corpora. For example, consider the following Python code, which uses NLTK to list the file IDs of the Brown corpus included with the NLTK library:
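The code itself is not reproduced in this extract; a minimal version, assuming the Brown corpus data has been downloaded, could look like this:

import nltk
# nltk.download('brown')  # uncomment on first use to download the Brown corpus
from nltk.corpus import brown

print(brown.fileids())   # list the file IDs of the Brown corpus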
The above code displays the file IDs of the Brown corpus as follows:
['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10', 'ca11', 'ca12', 'ca13',
'ca14', 'ca15', 'ca16', 'ca17', 'ca18', 'ca19', 'ca20', 'ca21', 'ca22', 'ca23', 'ca24', 'ca25',
'ca26', 'ca27', 'ca28', 'ca29', 'ca30', 'ca31', 'ca32', 'ca33', 'ca34', 'ca35', 'ca36', 'ca37',
'ca38', 'ca39', 'ca40', 'ca41', 'ca42', 'ca43', 'ca44', 'cb01', 'cb02', 'cb03', 'cb04', 'cb05',
'cb06', 'cb07', 'cb08', 'cb09', 'cb10', 'cb11', 'cb12', 'cb13', 'cb14', 'cb15', 'cb16', 'cb17',
'cb18', 'cb19', …, 'cr04', 'cr05', 'cr06', 'cr07', 'cr08', 'cr09']
Similarly, we can list the words of the corpus:
print(brown.words())
As mentioned earlier, one of the major challenges of textual data is its unstructured form, which makes it difficult for text processing applications to handle. So, just as in other data mining tasks, the text needs to be preprocessed to bring it into a form appropriate for processing. This may involve steps like sentence tokenization, word tokenization, POS tagging, etc. Here, we will discuss each of these in detail with examples.
In textual documents, we first need to separate the individual sentences so that we can then move down the hierarchy to the smallest textual units that provide distinct meaning. For this, we perform sentence tokenization as follows:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
# nltk.download('punkt')  # uncomment on first use to download the tokenizer models

text = ("Europe is a continent with a rich and diverse cultural heritage. This continent, "
        "comprising 44 countries, spans from the icy fjords of Scandinavia to the sunny shores "
        "of the Mediterranean, offering a vast range of landscapes, languages, and traditions. "
        "One of Europe's unique qualities is its commitment to sustainability and environmental "
        "conservation. Many European countries have made significant strides in green energy, "
        "recycling, and eco-friendly transportation. Europe's stunning natural landscapes, from "
        "the Swiss Alps to the Scottish Highlands, are preserved and protected for future "
        "generations to enjoy. The European Union, a political and economic union of 27 member "
        "states, plays a vital role in promoting cooperation and peace across the continent.")

sentences = sent_tokenize(text)
for s in sentences:
    print(s)
Now, we can process each sentence and extract information from it. The following code
reads each sentence and shows words as the tokens.
for s in sentences:
    print(word_tokenize(s))
['Europe', 'is', 'a', 'continent', 'with', 'a', 'rich', 'and', 'diverse', 'cultural', 'heritage', '.']
['This', 'continent', ',', 'comprising', '44', 'countries', ',', 'spans', 'from', 'the', 'icy', 'fjords', 'of',
'Scandinavia', 'to', 'the', 'sunny', 'shores', 'of', 'the', 'Mediterranean', ',', 'offering', 'a',
'vast', 'range', 'of', 'landscapes', ',', 'languages', ',', 'and', 'traditions', '.']
['One', 'of', 'Europe', "'s", 'unique', 'qualities', 'is', 'its', 'commitment', 'to', 'sustainability',
'and', 'environmental', 'conservation', '.']
['Many', 'European', 'countries', 'have', 'made', 'significant', 'strides', 'in', 'green', 'energy', ',',
'recycling', ',', 'and', 'eco-friendly', 'transportation', '.']
['Europe', "'s", 'stunning', 'natural', 'landscapes', ',', 'from', 'the', 'Swiss', 'Alps', 'to', 'the',
'Scottish', 'Highlands', ',', 'are', 'preserved', 'and', 'protected', 'for', 'future', 'generations',
'to', 'enjoy', '.']
['The', 'European', 'Union', ',', 'a', 'political', 'and', 'economic', 'union', 'of', '27', 'member',
'states', ',', 'plays', 'a', 'vital', 'role', 'in', 'promoting', 'cooperation', 'and', 'peace', 'across',
'the', 'continent', '.']
In the above code, we have used word_tokenize. In each iteration of the loop, it takes one sentence as input and tokenizes it into words. We can also access each individual word as follows:
for w in word_tokenize(sentences[0]):
    print(w)
Europe
is
a
…
In the above code, we have taken the first sentence from the sentences list. Note that the sentences list stores each individual sentence, and the first sentence is stored at index sentences[0]. We took this sentence and printed each individual word in it.
Once each individual word has been identified as a token, the next task is to apply the Part-
of-Speech (POS) tagging. A POS tag specifies the category the words belong to. This helps
in performing different types of analysis, e.g., which action is performed by whom and on
whom. All NLP-based libraries provide trained models for assigning tags.
import nltk
# nltk.download('averaged_perceptron_tagger')  # needed once for the default POS tagger

sentences = sent_tokenize(text)   # reuses text and sent_tokenize from the earlier example
for sentence in sentences:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
Note that we first tokenized the entire paragraph into sentences, then tokenized each sentence into words, and finally obtained the POS tags by applying the NLTK function pos_tag(…).
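As an illustration (the exact tags depend on the tagger model used), the output for the first sentence would look roughly like the following:

[('Europe', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('continent', 'NN'), ('with', 'IN'), ('a', 'DT'), ('rich', 'JJ'), ('and', 'CC'), ('diverse', 'JJ'), ('cultural', 'JJ'), ('heritage', 'NN'), ('.', '.')]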
Once the POS tags are identified, we can perform various types of analyses. Sometimes, however, we also need to perform named entity recognition, which is discussed in the next section.
Sometimes in text, we need to identify the real-world objects and entities in order to get
the context of the information. These entities or objects are called named entities. For
example, consider the following sentence:
A bridge collapsed in New York.
The word "collapsed" tells us that some event has occurred; however, we can determine where this event occurred only if we recognize "New York" as a location. This is where named entity recognition helps; note that identifying "New York" merely as a noun does not help much. The following code prints the list of entities and related information found in the given paragraph:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Europe is a continent with a rich and diverse cultural heritage. This continent, "
          "comprising 44 countries, spans from the icy fjords of Scandinavia to the sunny shores "
          "of the Mediterranean, offering a vast range of landscapes, languages, and traditions. "
          "One of Europe's unique qualities is its commitment to sustainability and environmental "
          "conservation. Many European countries have made significant strides in green energy, "
          "recycling, and eco-friendly transportation. Europe's stunning natural landscapes, from "
          "the Swiss Alps to the Scottish Highlands, are preserved and protected for future "
          "generations to enjoy. The European Union, a political and economic union of 27 member "
          "states, plays a vital role in promoting cooperation and peace across the continent.")

# Print each recognized entity together with its label, e.g., Europe/LOC.
for ent in doc.ents:
    print(f"{ent.text}/{ent.label_}")
Europe/LOC
44/CARDINAL
Scandinavia/LOC
Mediterranean/LOC
One/CARDINAL
Europe/LOC
European/NORP
Europe/LOC
Swiss/NORP
Scottish/NORP
The European Union/ORG
27/CARDINAL
We have used the named entity recognizer of spaCy, which comes as a pre-trained model.
So far, the sentences we have considered contain a number of words that may be unnecessary for the analysis, e.g., "to", "the", "a", "of", "with", etc. All these words need to be removed before performing further tasks in the analysis process. These words are called stop words, and removing them saves a lot of processing.
Consider the following code, which removes the stop words from a given paragraph:
import spacy
import nltk
from nltk import word_tokenize
# nltk.download('punkt')  # uncomment on first use to download the tokenizer models

nlp = spacy.load('en_core_web_sm')
stopwordlist = nlp.Defaults.stop_words

text = ("the communication methods started evolving with passage of time right from the "
        "commencement of the modern human being. Starting from the simple gestures to modern "
        "natural languages, the natural language has always been the convenient medium of "
        "communication for the humans. These days we can safely claim the natural languages "
        "are one of the sources that generate enormous amount of data. Each country, region "
        "and culture have its own language which adds to these enormous data volumes that is "
        "generated on daily basis.")

toks = word_tokenize(text)
# Note: the membership test below is case-sensitive; tokens could be lower-cased first.
stopwordsremoved = [tok for tok in toks if tok not in stopwordlist]
print(stopwordsremoved)
In real-world communication, sentences may be ambiguous, so the analysis may lead to a result different from the semantics that were originally intended. For example, consider the following sentence:
I saw a boy with my telescope.
This sentence can be interpreted in two ways: either I used my telescope to see the boy, or the boy I saw was carrying my telescope.
In order to interpret such sentences correctly, we have to define a grammar. As discussed earlier, a grammar is a set of rules that governs the correct interpretation of sentences.
In NLTK, we can define a grammar using a context-free grammar (CFG) and specify the grammatical components in the form of productions. For example, consider the following production:
S -> NN VB NN
This production says that a sentence comprises three components, i.e., a noun, a verb, and another noun.
Now, we will present a simple example of how grammar can be used to correctly inter-
pret the sentences.
Consider the following code:
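The code referred to above is not shown in this extract; a minimal sketch using NLTK's CFG and chart parser, with an illustrative toy grammar for the ambiguous sentence, could look as follows:

import nltk
from nltk import CFG

# A toy grammar in which the prepositional phrase "with my telescope"
# can attach either to the verb phrase or to the noun phrase "a boy".
grammar = CFG.fromstring("""
S -> NP VP
NP -> PRP | DT NN | NP PP
VP -> VBD NP | VP PP
PP -> IN NP
PRP -> 'I'
DT -> 'a' | 'my'
NN -> 'boy' | 'telescope'
VBD -> 'saw'
IN -> 'with'
""")

parser = nltk.ChartParser(grammar)
tokens = "I saw a boy with my telescope".split()
for tree in parser.parse(tokens):
    print(tree)   # two different parse trees, one per interpretation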
Two different sentence structures are produced, one for each interpretation of the sentence. Figure 2.3 shows the structure of these sentences.
Information extraction is the core objective of text mining. All text information systems tend to perform this task in one way or another. In this section, we will discuss information extraction from the information-flow point of view.
Each information extraction system takes input from the user in the form of queries and
provides results on the basis of the available data and the analysis process. This is similar
to the knowledge discovery process in conventional data mining. Once the required infor-
mation is available, it is provided to the user by following some appropriate information
structure.
The information extraction process typically consists of the following steps:
• Preprocessing
• Morphological and lexical analysis
• Syntactic analysis
• Domain analysis
Figure 2.4 shows these steps. As you can see, these steps are performed in the order
provided above. Now, we will discuss these steps one by one.
2.8.1 Tokenization
We have already discussed the tokenization process in detail. A token is the minimum lexical unit that can provide a meaning. A document is tokenized into sentences, and each sentence is tokenized into words. These words are then considered for analysis, for example, to find out the action that is mentioned in the sentence. Consider the following sentence:
Smith is reading a book.
After analyzing the word "reading", we can see that a task related to reading is being performed.
Once the tokens have been identified, we can further perform morphological and lexical analysis to find out the context of the information. This is necessary because we may need not only the exact information in a sentence but also its meta-information. This may correspond to finding named entities or identifying proper names. For example, consider the sentence:
A plane crash in London.
After the analysis, we know not only what event is mentioned in the sentence but also that its context is the city of London.
The third step of information processing is syntactic analysis. After performing the basic morphological and lexical analysis, the next step is to relate the lexical components with one another in order to extract information on the basis of their relationships. For example, consider the following sentence:
Smith told Maria that he will go to London with his son.
In the above sentence, apart from the information that someone is travelling, relating the tokens with each other tells us, for example, who is speaking, to whom, and who will travel with whom. This is the main purpose of syntactic analysis: by relating the lexical tokens with each other, we can extract the exact information.
Up to this point, the information extraction system performs only generic tasks. However, the analysis always depends on the domain, and information is extracted on the basis of domain rules. This is what domain analysis is about: we apply domain rules to extract domain-related information. For example, in the context of software engineering, consider the following sentence:
A student can register for multiple seminars.
The above sentence represents a requirement for the learning management system of a university. After performing domain analysis in terms of object-oriented design, we can find out, for example, that "Student" and "Seminar" are candidate classes and that there is an association between them, with a student able to register for many seminars.
As you can see, this information belongs to the domain of software engineering and could not have been extracted without applying the rules of software design specific to object-oriented design. Similarly, consider the following sentence:
If the transaction amount is greater than fifty thousand, tax will be deducted.
Performing domain analysis on the above sentence in the context of software engineering, from a use-case point of view, yields, for example, a conditional step: tax is deducted only when the transaction amount exceeds fifty thousand.
2.9 Summary
In this chapter, we have discussed the essential concepts of natural language processing.
We discussed all the steps that are necessary to perform the task of information extraction
from the natural language text. Natural language is one of the major sources of data these
days. In fact, large volumes of data are generated on a daily basis just from natural lan-
guage. We discussed various sources of natural language data. We also discussed how textual data can be stored and how we can use it as a base for various natural language
processing-related tasks. We provided examples of these tasks and the implementation
using Python. Two natural language processing libraries, namely, Spacy and NLTK, were
used for implementation purposes.
2.10 Exercises
Q1: Write down at least one example of each of the following clauses apart from those
given in the chapter:
• Imperative
• Declarative
• Relative
• Interrogative
Q2: Consider the following paragraph and perform the sentence segmentation
using Python.
A corpus consists of documents. A document consists of paragraphs. A paragraph
consists of sentences. A sentence consists of words.
Q3: Consider the following paragraph and perform POS tagging using Python.
Once each individual word has been identified as a token, the next task is to apply
the Parts-of-Speech (POS) tagging. A POS tag specifies the category the words
belong to. This helps in performing different types of analysis, e.g., which action is
performed by whom and on whom. All NLP-based libraries provide trained models
for assigning tags.
Q4: Consider the following text and identify the named entities using Python.
Google is one of the leading information technology companies in the world in 2023.
With its headquarters in the USA, the company offers various IT products and services.
Q5: Write down the Python code to print the file IDs of the NLTK Gutenberg corpus.
Q7: What is the difference between POS tagging and named entity recognition?
Q9: Consider the following word forms, and identify the lemma of each.
Going, Leaving, Sitting, ate, Got, felt
Q10: Write down at least five POS tags of your native language.
Recommended Reading
• Natural Language Processing and Information Systems by Elisabeth Métais, Farid
Meziane, Helmut Horacek, Philipp Cimiano
Publisher: Springer Cham
Publication Year: 2020
The book provides conference proceedings about natural language processing. The
papers presented in the book discuss various concepts including classification, informa-
tion retrieval, and question answer generation. The book is a good source for anyone
interested in the literature related to the applications of natural language processing.
• Natural Language Processing in Action by Hobson Lane, Cole Howard, Hannes Hapke
Publisher: Manning
Publication Year: 2019
The book discusses topics related to natural language processing and provides techniques for understanding natural language text. It uses Python and several of its libraries as the implementation tools.
Text mining has a wide range of applications in data analytics, as the majority of data is text-based. This chapter discusses two common applications of text mining, i.e., sentiment analysis and opinion mining. Both applications are discussed with proper examples.
Opinions shape our behavior and influence our decisions. Sentiment analysis, commonly also referred to as opinion mining, is the branch of research that examines how people react to products, personalities, affairs, events, services, and other topics by analyzing their opinions, decisions, interpretations, and feelings. Today, we have the largest collection of social media data ever available; without such opinionated data, it would not have been possible to do much research in this field.
Opinions play a vital role in almost all aspects of human activity because opinions are
major determinants of our behavior. We seek out other people’s perspectives whenever we
need to make a decision. Businesses and organizations constantly seek public or customer
feedback on their goods and services. Before deciding to purchase a product or to vote for a political party, people usually want to hear what the product's current users think about it and what others think about the political parties and their candidates. When someone needed advice in the past, they
typically turned to their friends and family. Surveys are conducted by organizations and
businesses when they need opinions of the general public or customers. Content available
on different social media is being utilized by people and organizations to make intelligent
decisions.
Nowadays, we can find a bulk of user reviews and user debates on social media, so a person is no longer limited to asking his/her family or friends for advice before purchasing a product. Likewise, gathering information solely through surveys, focus groups, and opinion polls is no longer necessary, because a wealth of information is publicly available on the Internet.
There are so many different sites, and finding them, keeping an eye on them, and extract-
ing the information they contain remain challenging tasks. Each Web site normally has a
massive number of customers’ opinions, which are sometimes difficult to understand. The
typical human reader will have trouble locating such Web sites and extracting and sum-
marizing the people/customer thoughts contained therein. Therefore, it is required to auto-
mate the sentiment analysis process, and we need automated systems to analyze data. In
recent years, we have seen that people's opinions about products, issues, topics, and services shared on social media have helped organizations make their services better and improve their products.
Of course, there are documents containing people’s opinions that are found on the
Internet and are referred to as external data. Organizations maintain their internal data as
well, such as consumer feedback gathered through online surveys and emails or the out-
comes of surveys that the organizations have conducted.
There are three main levels that have been the focus of sentiment analysis research:
• Document level: This analysis considers the overall sentiment of an entire document, like an article, a review, or a tweet. It is about figuring out whether the text as a whole is positive, negative, or neutral, instead of looking at individual sentences or phrases. This task is commonly referred to as document-level sentiment classification.
• Sentence level: Sentence-level sentiment analysis involves breaking down larger texts, such as articles or reviews, into individual sentences and analyzing the sentiment expressed in each one. Each sentence is then classified to estimate whether it conveys a positive or negative sentiment.
• Entity and aspect level: When it comes to determining what exactly people like or dis-
like, document-level and sentence-level analyses don’t quite hit the mark. This is where
aspect-level sentiment analysis comes in handy. It used to be referred to as feature level,
and unlike the other two methods, aspect-level analysis cuts right to the chase and tar-
gets the opinions themselves. It’s constructed on the idea that opinions consist of two
parts: a sentiment, whether it’s positive (+) or negative (-), and a target. The kind of
opinion is useless without recognizing the target of such opinion.
We can better understand the sentiment analysis issue if we are aware of the signifi-
cance of opinion targets. For example, consider the sentence, “Although the service is not
that great, I still love this restaurant.” Sure, it has a positive tone overall, but it’s not com-
pletely positive.
Mostly, the entities or the aspects that belong to those entities are the opinion targets.
So, aspect-level sentiment analysis aims to identify sentiments on the entities and/or their
aspects. For instance, the sentence “The iPhone’s call quality is good, but its battery time
is less” evaluates two aspects: quality of the call and the battery time of the iPhone. The
review of the customer on the iPhone’s quality of call is positive (+); however, the cus-
tomer’s sentiment on its battery time is negative (-). In this sentiment, the quality of the
call and battery time are known as the opinion targets. By doing this level of analysis, we
can produce a comprehensive summary of opinions regarding entities and their respective
aspects, which belong to those entities. It will help turn unstructured textual information
into structured information, which can then be used for both types of analyses: qualitative
and quantitative. Document-level and sentence-level classifications are difficult. Aspect-
level sentiment analysis poses even greater challenges.
It’s not surprising that sentiment words, or words that express opinions, are the key
indicators of emotions. These kinds of words are frequently used to convey positive and
negative types of sentiments. For instance, good, awesome, and wonderful tend to express
positive emotions, while very bad and poor typically indicate negative emotions of the
consumer. Besides individual words, idiomatic phrases like “cost someone an arm and a
leg” also convey emotions. Sentiment words as well as phrases play a crucial role in senti-
ment analysis, as they provide insights into people’s feelings. Researchers have come up
with different algorithms to compile lists of such kinds of phrases and words, which are
known as sentiment (opinion) lexicons.
Social media provides a platform for people to express their opinions and views freely
and anonymously, without the fear of any repercussions. This has made opinions on social
media highly valuable as they represent the voice of the public. However, this anonymity
also creates a loophole for individuals with malicious intent to manipulate the system by
casting fake votes to promote or oppose any candidate for general elections. These indi-
viduals, commonly referred to as opinion spammers, engage in activities known as opin-
ion spamming. This poses a significant challenge to sentiment analysis as it can lead to the
analysis of inaccurate and misleading data.
Opinions and feelings are a matter of personal perspective, unlike factual information. So,
it’s important to consider a variety of opinions from different people instead of relying on
just one. A single opinion only reflects the views of that person and isn’t enough to make
decisions. Since there are so many opinions available online, it’s necessary to have some
kind of summary or overview to make sense of it all.
To help us get to the heart of the matter, we will be using a review of a Samsung LCD
TV. We’ve assigned an identification number to each sentence in the review so that we can
easily refer back to specific points.
The review contains both negative and positive opinions about the Samsung LCD TV. The
second sentence represents a positive opinion as a whole. The third sentence represents
a positive opinion about its video quality. The fourth sentence also expresses a positive
opinion regarding its remote-control response time. However, the fifth sentence repre-
sents a negative opinion regarding its audio quality.
The attributes of an entity can also be further broken down into more specific details, like the length and breadth of the TV screen.
The second example is about a topic that can also be considered an entity, such as a tax
increase. In this case, the entity can be divided into various parts or categories, such as tax
increase for the poor, middle class, and rich.
Basically, what this definition is saying is that we can break down an entity into smaller
parts and analyze opinions about each of those parts individually. So, whether we’re talk-
ing about a specific model of TV or a topic like tax increase, we can examine opinions and
sentiments about different aspects of that entity to gain a deeper understanding.
In another definition, an opinion is made up of five elements, namely, the entity name
(ei), aspect of the entity (aij), sentiment about the aspect (sijkl), holder of opinion (hk), and
the time opinion written (tl). The sentiment can be positive, negative, or neutral and may
have different levels of intensity. For instance, some review sites use a rating scale of 1–5
stars to express the strength of the sentiment. When the opinion is referring to the whole
entity, the aspect “GENERAL” is used to refer to it. The entity name and aspect together
form the opinion target.
All five components are crucial, and their absence could lead to issues. For instance, if
the time component is missing, it would be challenging to analyze opinions over time,
which is often critical in practice because an opinion expressed years ago may not hold the
same value as a current opinion. Similarly, not having an opinion holder would also cause
problems.
With the above definitions, the most important tasks and objectives can be represented
as follows:
Given a document d having opinions, the objective of the sentiment analysis is to detect
all opinion quintuples which are present in d.
One of the important tasks in analyzing opinions is to identify the entities that are being
talked about in the text. It resembles the process of “named-entity recognition”. However,
this task is not always straightforward, as people may refer to the same entity in multiple
ways. As an example, the word “Motorola” could be written as “Mot”, “Moto”, and
“Motorola” in different contexts. Therefore, a crucial step in entity extraction is to catego-
rize the extracted entities and identify when they refer to the same entity. This is important
for accurately tracking and analyzing opinions about a particular entity over time.
• Entity categorization and extraction: Entities are unique items that can be referred to using
various expressions in a given text. Entity categorization refers to the mechanism of
grouping these expressions into distinct entity categories. Each category of entities
must have a different name in a given application. Similarly, aspects of entities can also
be expressed in different ways. As an example, “picture”, “image”, and “photo” can all
point to the aspects of a camera. Therefore, aspect expressions must also be extracted
and categorized to accurately analyze opinions on the corresponding entities.
In conclusion, to summarize the analysis process for sentiments that are constructed on
a given collection of opinion documents, there are basically six main tasks that need to be
performed:
• Task 1 involves identifying and grouping all the different ways an entity is referred to
in a set of opinion documents and categorizing them into unique entity clusters. Every
cluster represents a unique entity.
• Task 2 involves identifying all the different aspects related to each entity and categoriz-
ing them into clusters. Every aspect cluster for a given entity shows a unique aspect.
• Task 3 involves identifying and categorizing the individuals or groups expressing opin-
ions in the text.
• Task 4 involves identifying and standardizing the different formats in which time is
expressed in the text.
• Task 5 involves determining the sentiment expressed in the text or structured data
toward each aspect. The sentiment could be positive (+), negative (−), neutral, or
assigned a numeric rating.
• Task 6 involves formulating all the quintuples (ei, aij, sijkl, hk, tl) represented in the text
or structured data, which is based on the results obtained in the tasks.
Task 6 then assembles the opinion quintuples as the final output; for the review discussed earlier, four opinion quintuples would be produced.
Opinions are a tricky thing because they are subjective by nature. Often, a single opinion
from one person is not enough to make decisions. That’s why analyzing a large number of
opinions is important in many applications. To make sense of all these opinions, it’s help-
ful to have a summary that includes opinions regarding various entities and the related
aspects. This summary should also contain a numerical perspective because the difference
between 20% and 80% positive opinions about a product is significant. There are many
ways to create an opinion summary, but these key components should always be included.
The opinion quintuple outlined earlier is a valuable source and serves as a framework
for producing quantitative and qualitative summaries. There are two fundamental catego-
ries of opinions:
• Regular opinions
• Comparative opinions
As a result, most of the research in the field has concentrated on explicit opinions, while
implicit opinions have received less attention.
The study of sentiment and opinion is closely intertwined with two significant con-
cepts: subjectivity and emotion. Objective sentences state information about the world in
a neutral and factual manner, while subjective sentences convey personal feelings, beliefs,
or opinions. For instance, “The sky is blue” is an objective sentence, whereas “I think the
sky is beautiful today” is subjective. Subjective expressions can take various forms, such
as opinions, beliefs, and desires.
Emotions have been a subject of study in various fields such as philosophy, psychology,
and sociology. Researchers have explored a broad range of emotional responses, including
physiological reactions, facial expressions, gestures, and postures, as well as an individu-
al’s subjective experiences. Different categorizations of emotions have been proposed, but
a universally agreed set of basic emotions has yet to be established. These primary emotions can be further divided into many emotions, each with varying degrees of intensity.
The relationship between emotions and sentiments is close. When people have strong
feelings, those feelings can often be linked to the strength of their opinions. For instance,
if someone is really happy or really angry about something, their opinion on the matter is
likely to be stronger than if they feel more neutral. The opinions that are analyzed in senti-
ment analysis are basically evaluations. Research on consumer behavior suggests that
evaluations could be divided into two main categories:
• Rational evaluations
• Emotional evaluations
Rational evaluations are typically based on logical reasoning and practical beliefs. Such
evaluations often focus on the tangible features and benefits of an entity. For instance,
sentences like “The picture quality of this TV is excellent”, “This blender is very effi-
cient”, and “I am satisfied with this hotel room” express rational evaluations. On the other
hand, emotional evaluations are based on subjective and intangible responses to entities,
which reflect people’s deeper emotions and feelings. These evaluations often relate to
people’s personal preferences, desires, and emotional states. Examples of emotional eval-
uations include “I absolutely love this perfume”, “I am so frustrated with their customer
service”, and “This is the most comfortable couch ever”.
To practically utilize the two types of evaluations discussed earlier, we can create a
system of sentiment ratings. These ratings are based on emotions and rational thinking.
Emotional negative is rated −2, while rational negative is rated −1. Neutral is rated 0, and
rational positive is rated +1. Emotional positive is rated +2. However, sometimes, a neutral
rating may mean that no opinion or sentiment has been expressed. It’s important to under-
stand that the ideas of emotion and opinion are different from each other. Rational opin-
ions convey no emotions, such as the sentence “The voice of this phone is clear”, whereas
various emotional statements declare no opinion or sentiment toward anything, such as the
sentence “I am so surprised to see you here”. Additionally, emotions do not necessarily
have a target but rather indicate an individual’s internal feelings, as expressed in the sen-
tence “I am so sad today”.
Sentiment classification is an area of study that has garnered a significant amount of atten-
tion. The primary objective of sentiment classification is to conclude whether an opinion
represented in a document is positive (+) or negative (-). Such a task is often referred to as
“document-level sentiment classification” because it contemplates the entire document as
the fundamental unit of analysis. Most of the research in this field has been focused on
classifying online reviews, though the definition and methods are applicable to other simi-
lar contexts as well. In practical situations, opinions about multiple entities in a document
may differ. For instance, the opinion holder might express positivity about some entities
and negativity about others, making it impractical to assign a single sentiment orientation
to the entire document. Likewise, if many people share their opinions in one piece of writ-
ing, their viewpoints could vary. This is usually not the case with product and service reviews, as they typically assess only one product and are written by a single customer. However, in
a forum or blog post, the writer may write opinions on multiple things and use comparative
sentences to compare them, making it challenging to determine the overall sentiment of
the post.
In this context, we talk about the task of categorizing text to predict class labels and the
task of predicting numerical rating scores. Usually, techniques for categorizing text use
supervised learning, although unsupervised methods are also used. For predicting rating
scores, supervised learning is mainly used. In recent times, researchers have expanded this
area of study to include cross-domain and cross-language sentiment classification.
In traditional text classification, words are used as the primary features for classification. In sentiment classification, however, words that convey opinions or feelings, such as great, excellent, amazing, horrible, bad, and worst, are given more weight in determining the text's sentiment
orientation. As sentiment classification is essentially a text-classification problem, we can
apply any supervised learning algorithm to it. Some popular methods are available like
naive Bayes and support vector machines, abbreviated as SVM.
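As a minimal sketch of this supervised setup (the tiny training set and its labels below are purely illustrative, and scikit-learn is assumed to be available), a naive Bayes classifier over TF-IDF features could be trained as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A tiny, purely illustrative training set of labelled reviews.
reviews = ["The picture quality is excellent",
           "The battery life is horrible",
           "Amazing sound and great value",
           "Worst purchase ever, very bad service"]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF features over unigrams and bigrams, fed into a naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(reviews, labels)

print(model.predict(["The service was great"]))   # e.g., ['positive']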
In subsequent research, many researchers experimented with numerous features and machine learning methods. To classify sentiment successfully, as in any other supervised machine learning application, an effective set of features must be developed. A
few examples of such features are mentioned below.
The first type of features that have been commonly used in sentiment classification is
based on the frequency of individual words and their n-grams. This approach is similar to
what is used in traditional text classification, where individual words and their combina-
tions are used to identify the topic of a document. In addition to frequency, other weight-
ing schemes like TF-IDF can also be applied to these features to give more importance to
rare and informative words. This type of feature engineering has been proven to be very
effective.
Another type of feature that has been explored is based on the part-of-speech in the text.
Words of various parts of speech can provide different cues about the sentiment expressed
in the text. For example, adjectives are often used to express opinions and evaluations and
thus can be treated as special features. However, it is also possible to use all the part-of-
speech tags and combinations as features. Such an approach can capture more subtle
nuances in the language used to express sentiment.
Sentiment analysis involves identifying and categorizing expressions in text as positive,
negative, or neutral. One approach to do this is using sentiment phrases and words, which
are commonly adjectives and adverbs that convey positive or negative emotions, such as
“amazing” or “terrible”. In addition to individual words, sentiment can also be expressed
through phrases and idioms. Other factors that can affect sentiment analysis include rules of opinions, sentiment shifters, and syntactic dependency. Sentiment shifters are
expressions that alter sentiment orientation, such as negation words like “not”, which can
flip the sentiment from positive to negative. Syntactic dependency refers to the relation-
ships between words in a sentence, which can be analyzed through parsing or dependency
trees to generate features for sentiment analysis.
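As a minimal sketch of a lexicon-based approach with a simple negation shifter (the small word lists below are illustrative only, not a real sentiment lexicon), consider the following:

# Tiny illustrative sentiment lexicon and negation list.
POSITIVE = {"good", "great", "excellent", "amazing", "wonderful"}
NEGATIVE = {"bad", "poor", "terrible", "horrible", "worst"}
NEGATIONS = {"not", "never", "no"}

def lexicon_score(sentence):
    score = 0
    tokens = sentence.lower().split()
    for i, tok in enumerate(tokens):
        polarity = 1 if tok in POSITIVE else -1 if tok in NEGATIVE else 0
        # A negation word directly before a sentiment word flips its polarity.
        if polarity and i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score

print(lexicon_score("The picture quality is great"))   # 1
print(lexicon_score("The battery is not good"))        # -1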
Given that sentiment words play a dominant role in sentiment classification, researchers
have explored unsupervised methods that utilize sentiment words and phrases. One such
approach involves fixed syntactic patterns that are likely to be used for expressing opinions. These patterns are defined in terms of part-of-speech tags.
In the first step, the algorithm extracts two consecutive words whose part-of-speech tags match one of a set of predefined patterns. For instance, one pattern requires an adjective followed by a noun; another requires an adverb followed by an adjective, where the word that follows must not be a noun. As an example, the phrase "beautiful sounds" is extracted from the sentence "This piano produces beautiful sounds", as it fulfills the adjective-noun pattern. These patterns were selected because adjectives (JJ), adverbs (RB), comparative adverbs (RBR), and superlative adverbs (RBS) are commonly used to convey opinions. The accompanying noun or verb provides context, since the sentiment of an adjective or adverb can vary depending on the context in which it is used.
For instance, the word “unpredictable” can have a negative sentiment in a car review when
used to describe “unpredictable steering”, but it can have a positive meaning when review-
ing a movie when used to describe an “unpredictable plot”.
Aspect-based sentiment analysis (ABSA) is a machine learning technique used to identify and assign sentiment to the various aspects, features, and topics present in a body of text. Unlike other sentiment analysis techniques that provide a general overview of sentiment, ABSA provides more in-depth and detailed insights. Document-based sentiment analy-
sis typically identifies the sentiment of a text by scanning for certain keywords, while
topic-based sentiment analysis provides sentiment for a particular topic. However, aspect-
based sentiment analysis (ABSA) is a more advanced technique that extracts detailed
information from the text, giving a comprehensive overview of a customer’s emotions
toward different aspects. It can be a customer review on a product or a service. ABSA goes
beyond merely identifying the overall sentiment of the text and instead focuses on specific
aspects of the product or service that the customer is discussing. By breaking down the text
into smaller, more manageable components, ABSA can provide a more nuanced under-
standing of the customer’s opinions and experiences.
For instance, consider a sentence such as “The appetizers were okay, the drinks were
flat, and the ambiance was very bad.” When we use document-based sentiment analysis, it
only considers the overall sentiment of the text and may label it as negative. This approach
doesn’t take into account the finer details and complexities of the language used in the text.
A topic-based sentiment analysis tells you that the sentiment for “food” is neutral and
“ambiance” is negative. But ABSA does more by breaking down the text and identifying
specific aspects like food, drink, and atmosphere. By using semantics and contextual pro-
cessing, ABSA can tell you that the customer sentiment was neutral toward the “food”
aspect but leaning toward negative for “drinks”. Additionally, ABSA can also tell you that
the customers had a negative sentiment toward the “atmosphere” aspect of the restaurant.
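As a rough sketch of how such aspect-opinion pairs might be pulled out of the restaurant example (assuming spaCy and its en_core_web_sm model are installed; the dependency labels used here are those of spaCy's English models, and the parse is not guaranteed to be perfect), consider:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The appetizers were okay, the drinks were flat, "
          "and the ambiance was very bad.")

# For each adjectival complement ("okay", "flat", "bad"), take the nominal
# subject of the same verb as the aspect it describes.
for tok in doc:
    if tok.dep_ == "acomp":
        for child in tok.head.children:
            if child.dep_ == "nsubj":
                print(child.text, "->", tok.text)   # e.g., appetizers -> okay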
There are several benefits of using ABSA in analyzing customer feedback. By identify-
ing specific aspects and characteristics of an item or any service that customers feel posi-
tively or negatively about, ABSA provides actionable insights to businesses that can be
used to improve their offerings. Moreover, by understanding the sentiment of customers
toward different aspects, businesses can tailor their marketing strategies and communica-
tion to better address the customers’ needs and preferences.
The task here is to assess whether the customer's opinion on each aspect is positive, negative, or neutral. For instance, in the first example mentioned earlier, the customer's opinion on the aspect "voice quality" is positive, and in the second example, the opinion on the aspect "GENERAL" is also positive.
In aspect-based sentiment analysis, researchers generally concentrate on two main
approaches:
1. Supervised learning
2. Lexicon-based techniques
Aspect extraction is a crucial task in sentiment analysis, as it helps identify the specific aspects or top-
ics that are being discussed in a given text. Unlike traditional information extraction tasks,
aspect extraction is unique in that it involves identifying not only the opinion or sentiment
expressed in a text but also the target or aspect of that sentiment. This is because every
opinion or sentiment is associated with a specific target, which is often the aspect or topic
that is being discussed.
If you want to get the sentiment analysis right, it’s super important to be able to pick out
all the different opinion expressions and their targets from a sentence. However, it’s worth
noting that some expressions can play multiple roles in a text. For example, a word like
“expensive” can serve both as a sentiment word and an implicit aspect, indicating that the
price is the target of the sentiment. Identifying such dual-role expressions is essential for
accurate aspect extraction and sentiment analysis.
Aspect extraction can be approached in several ways; four approaches are commonly used in the literature.
Once the aspects have been extracted, the next step is to group together expressions that
describe the same aspect into synonymous aspect categories. This is crucial in opinion
analysis as people frequently use various words and phrases to refer to the same features.
For instance, in the context of phones, “call quality” and “voice quality” are synonymous
and referring to the same aspect. While tools like WordNet and other thesaurus dictionar-
ies can assist in this task, they are often inadequate due to domain dependency. There are
lots of ways to talk about different things in a sentence, and some of them are more than
one word long, which can make it tough to keep track of everything. Plus, just because two
words both relate to the same thing doesn’t mean they’re the same word or mean exactly
the same thing. For example, “expensive” and “cheap” both have to do with how much
something costs, but they’re not exactly the same as each other or the word “price”. So it
can be tricky to sort it all out and figure out what people really mean when they talk about
various features of a good or service.
Researchers developed a new way to match aspects to a specific domain by comparing
the similarity of the phrases used. They looked at how close the words were in meaning,
whether they had the same definition or were related to one another. They tested this out
on reviews for digital cameras and DVDs and found it worked pretty well. Then they took
it a step further by using public aspect hierarchies and real reviews to make sure the final
aspect categories made sense. They also made sure to pick the best method to compare the
phrases.
There was another idea to make things easier for users. They suggested a method that
lets users group aspect expressions into their own custom categories. The process goes like
this: users start by labeling a few seeds for each category they want. Then, the system takes
care of the rest by using a special learning method with labelled and unlabeled examples
to assign other aspect expressions to the right categories.
The approach used in this study made use of two types of information to improve the
performance of the expectation-maximization (EM) algorithm: (1) aspect expressions that
contain similar words are likely to be part of the same aspect category, such as “battery
life” and “battery power”, and (2) aspect terms like “movie” and “picture” that are classi-
fied as synonyms in a dictionary are likely to belong to the same aspect category. By
incorporating these two types of knowledge, the EM algorithm was able to produce more
accurate classification results.
Word sense disambiguation is an important task in semantic analysis, which involves identifying the intended mean-
ing of a word in a given context. This is a challenging problem, as many words have mul-
tiple possible meanings, and the correct meaning can depend on the surrounding words or
the topic of the text.
Information retrieval and text classification are among the many uses of word sense
disambiguation (WSD) in natural language processing (NLP). For example, in informa-
tion retrieval, WSD can help increase the accuracy of search results by ensuring that
queries are interpreted correctly. In machine translation, WSD can help to improve the
accuracy of translations by selecting the correct meaning of each word in the source
language.
There are many approaches to word sense disambiguation (WSD), including rule-based
methods, supervised machine learning methods, and unsupervised methods. Rule-based
methods involve manually defining rules to identify the correct meaning of a word based
on its context. Supervised machine learning methods involve training a classifier to predict
the correct meaning of a word based on labelled examples. Unsupervised methods involve
clustering similar contexts together to identify the different senses of a word.
While word sense disambiguation (WSD) is a challenging task, it is an important part
of semantic analysis and has many practical applications in NLP.
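As a small illustration of a knowledge-based approach, the following sketch uses the simplified Lesk algorithm shipped with NLTK to pick a WordNet sense of the word “bank”; this is just one possible way to perform WSD, and the example sentence is invented.

import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)      # WordNet is needed by the Lesk algorithm

context = "I deposited the cheque at the bank near the river".split()
sense = lesk(context, "bank", pos="n")    # choose a WordNet noun sense of "bank"
print(sense, "-", sense.definition())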
3.4 Opinion Summarization
When dealing with opinions in sentiment analysis, it’s important to look at multiple opin-
ions from different people, since opinions can be subjective and vary greatly between
individuals. In order to effectively summarize these opinions, many sentiment analysis
frameworks use a variety of techniques that have been developed in the field of NLP. These
techniques are also widely used in industry, with companies such as Microsoft Bing and
Google Product Search relying on them to provide meaningful summaries of user opinions.
These summaries can come in different forms, like structured or unstructured text doc-
uments, and they’re usually made by summarizing multiple texts together. While extensive
research has been done on text summarization in NLP, opinion summarization is a some-
what different task, as it focuses on identifying the different topics or entities being dis-
cussed in a document, along with the sentiments expressed about them. This makes it a
more complex process than traditional single or multi-document summarization, which
simply seeks to identify and collect the most vital information from a body of text. To cre-
ate effective opinion summaries, it’s necessary to have a strong understanding of the topics
and entities being discussed, as well as the various sentiments being expressed about them.
This requires a structured approach to summarization, even when the output is a short text
document.
Aspect-based opinion summarization is a powerful tool with two main characteristics that make it particularly useful. First, it captures the essence of user reviews by locating the review targets, that is, the entities, the aspects related to those entities, and people's opinions about them. Second, it provides quantitative data, such as the number or percentage of people who hold positive or negative sentiments about these targets. This is particularly important given the detailed nature of user reviews and opinions. By using opinion quintuples, structured summaries can be produced that give a clear overview of the opinions expressed.
Moreover, the opinion quintuples can be used in a variety of ways, including visualiza-
tion and data analysis. For example, by extracting time data, it is easy to track the trend of
user opinions about various aspects. Even without sentiment analysis, it is possible to get
a sense of which aspects people are most concerned about, based on the frequency of men-
tions. With the use of database applications and advanced OLAP tools, the available data
can be sliced and diced to do any kind of analysis (i.e., qualitative and quantitative).
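As a small illustration (not taken from the book), the following sketch shows how a table of opinion quintuples can be sliced into an aspect-based summary with pandas; the sample quintuples are invented.

import pandas as pd

quintuples = pd.DataFrame([
    ("PhoneX", "battery", "positive", "u1", "2023-01-05"),
    ("PhoneX", "battery", "negative", "u2", "2023-01-07"),
    ("PhoneX", "screen",  "positive", "u3", "2023-01-09"),
    ("PhoneY", "battery", "negative", "u4", "2023-01-11"),
], columns=["entity", "aspect", "sentiment", "holder", "time"])

# Percentage of positive vs. negative opinions per entity and aspect
summary = (quintuples
           .groupby(["entity", "aspect"])["sentiment"]
           .value_counts(normalize=True)
           .unstack(fill_value=0) * 100)
print(summary.round(1))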
The car industry used aspect-based opinion summarization in one of its applications.
By mining opinion quintuples of various individual cars, comparing opinions on small
automobiles, medium-sized cars, Chinese cars, and Japanese cars, among other things,
was possible. The outcomes were next utilized as unprocessed data for data mining, which
enabled the user to identify interesting segments of the market. For instance, one segment
of customers focused on the beauty and slickness of the car, while another segment talked
about back seats. Such information was crucial because it allowed the user to tailor their
product to the preferences of various customer segments.
A comparative sentence is one that establishes a relationship between two or more entities
based on similarities or differences. Comparisons can be broadly categorized into two main groups: gradable and non-gradable. Gradable comparisons are those that express a degree of difference or
similarity between the compared entities. For instance, a sentence like “The coffee is hot-
ter than the tea” is a gradable comparison because it implies a difference in temperature
between the two drinks. On the other hand, non-gradable comparisons do not express a
degree of difference or similarity. For example, a sentence like “The sun is bigger than the
earth” is a non-gradable comparison because it only highlights the difference in size
between the two entities without implying any degree of comparison.
Comparing entities is a common way to express opinions, and there are different types
of comparisons that people use. One main category of comparison is gradable. Such kinds
of comparisons indicate a relationship between the entities that are being compared.
Within gradable comparisons, there are three subtypes. The first subtype is non-equal
gradable comparison, which compares two sets of entities and ranks them according to
some shared characteristics. For instance, “Coke tastes better than Pepsi” is an example of
this subtype. The second subtype is equative comparison, which claims that due to some
shared characteristics, two or more entities are equal. “Coke and Pepsi taste the same” is
an example of equative comparison. Finally, the third subtype is superlative comparison,
which places one entity above all others. For instance, “Coke tastes the best among all soft
drinks” is an example of superlative comparison.
Another category of comparison is non-gradable, which expresses a relationship among
two or more entities without making any rank. There are three subtypes of non-gradable
comparison. The first subtype occurs when Entity A and Entity B differ or are similar
depending on some shared characteristics. For example, “Coke tastes differently from
Pepsi” is a non-gradable comparison of this type. The second subtype occurs when Entity A has aspect a1 and Entity B has aspect a2, where the two aspects are typically substitutable. For example, “Desktop PCs use external speakers, but laptops use internal speakers” is a non-gradable comparison of this type. Finally, the third subtype occurs when Entity A has an aspect that Entity B does not. For instance, “Nokia phones come with ear-
phones, but iPhones do not” is a non-gradable comparison of this type.
In the English language, we often express comparisons using specific words known as
comparatives and superlatives. These terms are added to base adjectives and adverbs to
indicate degrees of comparison. For instance, in the sentence “The battery life of Nokia
phones is longer than Motorola phones”, the comparative form of the adjective “long” is
“longer”. Here, “longer” and “than” denote that this is a comparative sentence. Similarly,
in “The battery life of Nokia phones is the longest”, since “longest” is the superlative form
of “long”, this is a superlative sentence. Such forms are referred to as type 1 comparatives and superlatives.
However, adjectives and adverbs with two or more syllables that do not end in y do not form comparatives or superlatives by adding -er or -est. Instead, such terms are preceded by more, most, less, or least, for example, more beautiful. These comparisons are
referred to as type 2 comparatives and superlatives. In addition to these regular varieties,
English also features irregular comparatives and superlatives that deviate from the norm,
such as better, best, worse, worst, further/farther, and furthest/farthest. However, because
of the way they behave, they are categorized here along with type 1 comparatives.
Comparisons are not just limited to using standard comparatives and superlatives such
as “-er” and “-est”. In addition to “prefer” and “superior”, there are numerous more words
and expressions that can be used to make comparisons. For instance, a sentence like “The
voice quality of the iPhone is superior to that of the BlackBerry” shows that the iPhone has
features like better voice quality and is preferred over the BlackBerry. Researchers have
compiled a list of such words and phrases that behave similarly to type 1 comparatives. These expressions, together with the traditional comparatives and superlatives, are collectively called comparative keywords. Additionally, comparative keywords can be divided into two
groups based on whether they convey increased or decreased quantities, which can be
useful in sentiment analysis. Comparatives that convey an increase in quantity include
“more” and “longer”, while comparatives that convey a decrease include “less” and “fewer”.
When mining comparative opinions, we look for the elements being compared: the entities involved, what is being said about them, and who is expressing the opinion. In that respect it resembles regular opinion mining, and it is sometimes even easier, because the entities being compared can usually be identified readily. Determining which entity is preferred, however, is a separate problem. Moreover, not every sentence containing a comparative word actually makes a comparison, and some comparative expressions are hard to spot. We therefore focus on two problems: identifying comparative sentences and determining which entity is preferred.
It’s important to note that not all sentences that contain comparative and superlative
keywords are actually comparative sentences. For instance, the sentence “I cannot agree
with you more” doesn’t express any comparison. However, studies have found that most comparative sentences do include a keyword indicating comparison, such as “better” or “superior”. Using a collection of such keywords, researchers were able to identify comparative sentences in their dataset with a recall of 98%. While the precision was only 32%, this still provides a useful starting point for identifying comparative sentences.
Three types of keywords indicate comparison in sentences. The first type includes comparative adjectives and adverbs, such as “more”, “less”, “better”, and words ending in “-er”; because they are treated as word classes rather than listed individually, they are counted as just two keywords. The second type includes superlative adjectives and adverbs, such as “most”, “least”, “best”, and words ending in “-est”; these are likewise counted as two keywords. The third category consists of additional, less common words and expressions that indicate comparison, such as “favor”, “beat”, “win”, “exceed”, “outperform”, “prefer”, “ahead”, “than”, “superior”, “inferior”, “number one”, and “up against”. Each of these is counted as one keyword.
Because such keywords can identify a large proportion of comparative sentences on their own, they can be used to filter out sentences that are unlikely to be comparative; a classifier is then applied to the remaining sentences to improve precision.
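A minimal sketch of this keyword filtering step is shown below; the keyword list is only a small sample and the “-er/-est” suffix heuristic is an assumption, so the filter is deliberately high recall and low precision.

import re

KEYWORDS = {"more", "less", "most", "least", "best", "better", "than",
            "prefer", "superior", "inferior", "beat", "outperform"}
SUFFIX = re.compile(r"\w+(er|est)\b", re.IGNORECASE)   # crude -er / -est check

def is_candidate(sentence: str) -> bool:
    # Flag the sentence if it contains a comparative keyword or suffix
    tokens = re.findall(r"[a-zA-Z]+", sentence.lower())
    return bool(KEYWORDS.intersection(tokens)) or bool(SUFFIX.search(sentence))

print(is_candidate("Coke tastes better than Pepsi"))   # True
print(is_candidate("I bought a new phone yesterday"))  # False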
According to the literature, comparative sentences can be divided into four categories: non-equal gradable, equative, superlative, and non-gradable. To classify them, researchers used keywords and key phrases as features and found that SVM worked best; a sketch of this kind of classifier follows below. Other researchers focused on identifying comparative questions and the entities being compared. Their approach was slightly different, relying on sequential patterns/rules, and it decided at the same time whether a question was a comparative question and which entities were being compared. To learn these patterns, they used a weakly supervised learning method based on bootstrapping. Starting from a user-provided pattern, they extracted a set of seed entity pairs; for each pair, they collected all questions containing both entities and treated those as comparative questions. From these questions and entity pairs, they generated and evaluated candidate sequential patterns.
The learning process involved generalization and specialization, where any words or
phrases that matched the entities in a sentence were considered entities.
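As a hedged sketch of the SVM-based classification of comparative sentence types mentioned above, the following code trains a linear SVM on word n-gram features; the tiny training set and labels are invented purely for illustration and are far smaller than what such a study would use.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

sentences = [
    "Coke tastes better than Pepsi",
    "Nokia battery lasts longer than Motorola",
    "Coke and Pepsi taste the same",
    "Both cameras perform equally well",
    "Coke tastes the best among all soft drinks",
    "This is the cheapest laptop in the store",
    "Desktop PCs use external speakers but laptops use internal speakers",
    "Nokia phones come with earphones but iPhones do not",
]
labels = ["non-equal", "non-equal", "equative", "equative",
          "superlative", "superlative", "non-gradable", "non-gradable"]

# Unigram and bigram counts stand in for the keyword/key-phrase features
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(sentences, labels)
print(clf.predict(["The iPhone camera is better than the Pixel camera"]))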
3.6 Opinion Search and Retrieval
Web search has demonstrated its usefulness as a service, and opinion search is expected to
be equally valuable. There are two common types of query:
1. Finding public opinions about a certain person, thing, or topic, such as consumer
reviews of the picture quality of a digital camera or public perceptions of a political
candidate or issue.
2. Discovering the opinions held by a particular person or organization toward a given entity, aspect, or topic, such as Barack Obama’s viewpoint on abortion. This kind of search is particularly pertinent to news stories, which often report viewpoints directly.
To perform the first type of query, users can provide the entity name or aspect name
along with the entity name. For the second type of query, users can provide the opinion
holder’s name and the entity (or topic) name.
A promising area of development in the context of Web search is opinion search. This
involves finding out what the general public thinks about a certain entity or aspect, or what a particular person's or organization's opinion is on a certain entity or topic. In news items, where opinions are frequently expressed directly, this kind of search is especially helpful. The two main tasks of opinion search are retrieving and ranking the documents and sentences
that are pertinent to the user’s query. Although these tasks resemble conventional Web
searches, there are some important distinctions. In opinion search, in addition to locating
documents or sentences that are pertinent to the query, the retrieval process also entails
identifying whether or not those documents or sentences express opinions on the query
issue and, if so, whether the opinions are positive or negative.
This is where sentiment analysis comes in, a crucial subtask for opinion search that
traditional search does not perform.
Traditional Web search engines rank a page based on its authority and relevance score. However, this paradigm is not suitable for opinion search, especially for the first type of query, where the objective is to reflect the natural distribution of positive and negative sentiments across the population. In most applications, the actual proportions of positive and negative opinions are crucial pieces of information that must be displayed in the search results.
Making two rankings—one for positive and one for negative feedback—and displaying
the proportion of each would be one possible option. This way, users can get a more com-
prehensive view of public opinion on a specific entity. Providing an aspect-based summary
for each search presents another challenge in opinion search. Associating entities with
their aspects, categorizing aspects, and extracting aspects all present difficulties. Until
there are effective solutions for these problems, such summaries will not be possible. In
conclusion, opinion search is an exciting field that has the potential to provide valuable
insights into public opinion. With advances in sentiment analysis and ranking algorithms,
we can expect to see more sophisticated opinion search tools in the future.
Opinion retrieval research often takes a two-stage approach. In the first stage, documents are ranked by topical relevance only. In the second stage, the candidate documents are reranked according to their opinion scores. These scores can be obtained from a machine learning-based sentiment classifier, such as SVM, or from a lexicon-based classifier that uses a sentiment lexicon, sentiment word scores, and the proximity between query terms and sentiment words. In
more sophisticated research, topic relevance and opinion are combined in one step to pro-
duce rankings based on an integrated score. The objective is to create a system capable of
doing opinion search or retrieval, which consists of two elements. For each query, the first
component collects pertinent information, and the second component categorizes the doc-
uments as opinionated or not. The opinionated documents are further divided into those
that are favorable, unfavorable, or mixed.
The algorithm takes both keywords and concepts into account while retrieving perti-
nent documents. Concepts can be specific words or phrases from dictionaries and other
sources, such as Wikipedia entries, or they can be named entities, such as the names of
persons or organizations. The system first recognizes and disambiguates the concepts in the user query, and the query is then expanded using their synonyms. To extend the query further, it uses pseudo-relevance feedback to automatically extract pertinent terms from the top-ranked documents while also recognizing concepts in the retrieved documents.
Finally, it uses both concepts and keywords to determine how similar (or relevant) each
document is to the extended query.
Both categorizing each document into one of the two categories—opinionated or not—
and categorizing each opinionated document as expressing a positive, negative, or mixed
opinion are tasks carried out by the opinion classification component. The system uses
supervised learning to carry out these tasks. It gathers a large amount of opinionated training data from sites like rateitall.com and epinions.com, covering many domains such as consumer products and services, public policy, and political ideologies. Unbiased training data are gathered from Web sites that offer impartial information, such as Wikipedia. An
SVM classifier is built using these training data.
The next step in the opinion retrieval process is to apply a classifier to each retrieved document. Each sentence in the document is classified as opinionated or non-opinionated, and the strength of each opinionated sentence is also estimated. A document is considered opinionated if it contains at least one opinionated sentence. The approach, however, requires that enough query concepts or phrases appear near the sentence to ensure that its viewpoint is related to the query topic. Documents are then ranked based on the number and strength of the opinionated sentences they contain and on how well the document matches the query.
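The following toy sketch illustrates this kind of ranking. The scoring formula (opinion strength weighted by query similarity) and all of the values are assumptions made here for illustration, not the exact scheme used by the system described above.

def rank_documents(docs, query_similarity, opinion_strengths):
    """docs: list of ids; query_similarity: id -> relevance in [0, 1];
    opinion_strengths: id -> strengths of its opinionated sentences."""
    scores = {}
    for d in docs:
        opinion_score = sum(opinion_strengths.get(d, []))
        scores[d] = opinion_score * query_similarity.get(d, 0.0)
    return sorted(scores, key=scores.get, reverse=True)

docs = ["d1", "d2", "d3"]
sim = {"d1": 0.9, "d2": 0.4, "d3": 0.7}
strengths = {"d1": [0.2, 0.5], "d2": [0.9, 0.8, 0.7], "d3": []}
print(rank_documents(docs, sim, strengths))   # d2 and d1 outrank the factual d3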
A second classifier is created to assess whether an opinionated document reflects a favorable, unfavorable, or mixed opinion. Reviews from review Web sites like rateitall.com that
also include review scores make up the training data. A low score in these evaluations
denotes an unfavorable opinion, whereas a high score denotes a favorable opinion. A senti-
ment classifier is developed to categorize each text as conveying a favorable, negative, or
mixed viewpoint using positive and negative reviews as training data.
3.7 Opinion Spam Detection
Opinions have become a crucial factor in our daily decision-making processes. The rise of
social media has given us the opportunity to voice our opinions on a wide range of topics,
from politics to product reviews. However, with the increased importance of opinions,
there has also been a rise in fraudulent practices aimed at influencing the opinions of oth-
ers. These practices are known as opinion spamming, and they can have serious conse-
quences, particularly in the social and political spheres.
Opinion spammers engage in activities such as posting fake reviews, creating fake
social media accounts, and posting biased or misleading content to sway public opinion.
They often do so for personal gain, such as financial rewards or to promote their own
agenda. The rise of opinion spamming has become a significant challenge for social media
platforms, which are struggling to maintain their credibility as a trusted source of public
opinion.
To address this problem, researchers have been exploring ways to detect opinion spam-
ming. However, detecting opinion spamming is not an easy task, as it differs from tradi-
tional forms of spamming. For instance, unlike email spam, opinion spamming does not
involve unsolicited advertisements. Similarly, unlike Web spam, opinion spamming does
not involve adding irrelevant words to a Web page.
Opinion spamming is more subtle and sophisticated, and it requires more advanced
techniques for detection. Researchers have been exploring various methods, such as
machine learning algorithms and NLP, to detect opinion spamming. These methods
involve analyzing the content of reviews and social media posts and looking for patterns
and anomalies that might indicate fraudulent behavior.
Overall, opinion spamming is a serious problem that requires continued attention from
researchers, social media platforms, and users alike. As opinions continue to play an
increasingly important role in our lives, we need to be vigilant in detecting and preventing
fraudulent practices aimed at manipulating public opinion. By working together, we can keep social media a reliable source of public opinion rather than a platform for fake news, lies, and deception.
Opinion spam detection presents a unique challenge because it’s incredibly difficult, if
not impossible, to identify false opinions by reading them. For researchers looking for
opinion spam data to build and test their detection systems, it presents a challenge. Unlike
other kinds of spam, such as email or Web spam, it’s not always easy to recognize opinion
spam at first glance. In some cases, it may even be logically impossible to recognize opinion spam from the text alone. For instance, to promote a substandard restaurant, someone might copy a truthful, glowing review of an excellent restaurant and post it as a review of their own. Without considering information beyond the text itself, it is impossible to distinguish this fake review from a genuine one, because the very same text can serve as either a genuine review or a fake one depending on where and why it is posted.
Thus, researchers face a major challenge in detecting opinion spam, as it’s difficult to
obtain reliable data and identify fraudulent reviews. Nonetheless, it’s crucial to develop
robust detection algorithms to ensure that social media platforms remain trusted sources of
information and opinions, rather than being filled with fake reviews and deceptions.
There are three types of spam that you might come across when looking at online reviews.
The first type is fake reviews. These are basically lies written by people who want to
promote something or damage someone’s reputation. They might say they love a product
they’ve never even tried, or they might say they hate something just because they don’t like
the company that makes it. The second type is when people write reviews about brands
rather than specific products. This isn’t always spam, but it can be if the person doesn’t
actually talk about the product they’re supposed to be reviewing. For example, if someone
writes “I hate HP” when they’re supposed to be reviewing an HP printer, that’s spam. The
third type of spam isn’t really reviews at all. It includes things like advertisements, ques-
tions, answers, and other texts that contain no opinions at all. These are not fake opinions, but since they are not genuine reviews, they are still a form of spam.
According to the literature, identifying and detecting types 2 and 3 spam reviews is
relatively simple, and these types of reviews are uncommon. In the rare cases where these
reviews do go undetected, they pose little threat, as they are easily noticeable by human
readers during manual inspection. Because type 1 spam reviews are fake reviews, this
chapter focuses on them.
Fake reviews are regarded as a type of dishonesty, but they differ from traditional forms
of deception in significant ways. Deception is typically associated with lies about a per-
son’s true feelings or facts. Various studies have identified language patterns that liars tend
to use, such as distancing language (using “you”, “she”, “he”, and “they” instead of “I”,
“myself”, and “mine”) and frequent use of words associated with certainty to hide falsity.
Fake reviews, meanwhile, aren’t always lies in the conventional sense. Fake reviewers
often use first-person pronouns (“I”, “myself”, “mine”) to create the impression that their
opinions are genuine. In some cases, fake reviews may not even be lies; for example, an
author might write a review of their own book under a pseudonym, but the review may
reflect their genuine feelings. Moreover, many fake reviewers may not have even used the
product or service they are reviewing and are instead giving positive or negative opinions
about something they have no knowledge of. As such, they are not lying about their own
experiences or feelings (Table 3.1).
Table 3.1 shows the different types of fake reviews and their impact on the quality of a
product. Regions 1 and 3 represent positive reviews with either undisclosed or disclosed
conflicts of interest, respectively. These reviews are not necessarily harmful, but they are
not completely honest either. Regions 2 and 6, on the other hand, represent negative
reviews with undisclosed or no conflicts of interest, respectively. These reviews are very
harmful to the product’s reputation. Regions 4 and 5 represent negative and positive
reviews, respectively, with disclosed conflicts of interest. These reviews may not be com-
pletely honest, but they are not as harmful as regions 2 and 6. It’s important to note that the
impact of these fake reviews on product quality may vary depending on the number of
reviews and the presence of spammers. Therefore, detecting and filtering out fake reviews
in regions 2, 3, 4, and 5 is crucial in maintaining the accuracy and integrity of online
reviews.
Fake reviews can come from a variety of sources, including friends and family of the
product, company employees, competitors, and even businesses that specialize in writing
fake reviews. Some businesses even incentivize their own customers to write positive
reviews for them by offering discounts or refunds. Additionally, both public and private
organizations, as well as political parties, may hire people to post messages intended to
sway social media discussions and disseminate false information.
There are two types of spammers: individual spammers and group spammers. Individual
spammers work alone and write fake reviews using a unique user ID. On the other hand,
group spammers collaborate to promote a specific product or harm the reputation of a
competitor. The spammers in a group may or may not know one another. Another form of group spamming is when one person registers many user IDs and uses them to write fake reviews, giving the impression of a group effort. This is often referred to as “sock puppeting.”
Because a group involves many people, group spam can be especially harmful: it can completely mislead potential buyers, particularly in the early stages of a product launch. Although a spammer group can also be viewed simply as a set of individual spammers, groups exhibit distinctive collective behaviors that make them particularly destructive. Overall, detect-
ing and preventing both individual and group spamming are crucial to maintaining the
integrity of online reviews.
When it comes to detecting review spam, three basic kinds of data are typically used. The first is the review content, i.e., the actual text of each review. Linguistic features such as word and POS n-grams can be examined to look for possible instances of decep-
tion or dishonesty. However, relying solely on linguistic features may not be enough, as it
would be simple for someone to create a false review that reads exactly like the real thing.
For instance, they could write a glowing review of a “terrible restaurant” based on their
actual experience at a different, much better restaurant.
Metadata pertaining to the review serves as the second kind of data utilized to identify
spam. This information includes the number of stars assigned to each review, the review-
er’s user ID, the date and hour the review was written, how long it took to write, the IP and
MAC addresses of the reviewer’s computer, the reviewer’s location, and the order in which
they clicked on the review site. We can search for unusual patterns of behavior among
reviewers and their reviews by analyzing this data. For instance, it’s suspicious if a reviewer
only writes positive ratings for one brand while writing poor evaluations for a rival. Similar
to this, we can be wary if numerous user IDs from the same machine publish numerous
favorable reviews of a product. A red flag is also raised if a hotel’s only favorable evalua-
tions come from the neighborhood.
Finally, we have product information. This contains information regarding the subject
of the review, such as a description of the product and sales volume and rank. If a product
has many positive reviews but isn’t selling well, that’s a cause for suspicion. Overall, by
combining these three types of data, we can more effectively identify review spam and
protect consumers from being misled.
Opinion spam detection is a tricky task that involves classifying reviews into two catego-
ries—fake and non-fake. While supervised learning seems like a natural choice, it’s chal-
lenging to accurately differentiate between fake and genuine reviews, as spammers can
craft fake reviews that are almost identical to real ones. Due to this, there’s a lack of
labelled data to train machine learning algorithms for detecting fake reviews. However,
despite the difficulties, researchers have proposed and tested various detection algorithms.
Three supervised learning techniques will be covered in this section, and several unsupervised techniques in the next. To address the issue of limited labelled data, one study used duplicate reviews found in an analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com. The analysis found a sizable number of duplicate and near-duplicate reviews, indicating that review spamming was pervasive. Since writing fresh reviews can be time-consuming, many spammers reuse old ones or make minor edits to them for various products.
In Jindal and Liu’s study, duplicates and near-duplicates were categorized into four
groups: those from the same user ID on the same product, those from different user IDs on
the same product, those from the same user ID on different products, and those from dif-
ferent user IDs on different products. The last three categories are likely to be fraudulent reviews, even though duplicates from the same user ID on the same product might arise from unintentional repeated clicks.
For machine learning, the last three categories of duplicates were treated as fake reviews and the rest as non-fake reviews in the training data. Three sets of features were used to identify fake reviews.
Features for detecting opinion spam may be divided into three categories: review-
centric, reviewer-centric, and product-centric features. Review-centric features are charac-
teristics that are exclusive to each review and may include elements like the length of the
review, the frequency of brand mentions, the percentage of opinion words, and the amount of helpful feedback the review received. Reviewer-centric characteristics, on the other hand,
are those that are concerned with the reviewer and might include things like the reviewer’s
average rating, the standard deviation of their ratings, and the proportion of their reviews
that were the first to be submitted. Last but not least, elements that are focused on the
product itself may include data on the product’s cost, sales ranking, and the mean and
standard deviations of its review scores. The study utilized logistic regression to construct
a model for identifying fake reviews. The results obtained from the experiments were
intriguing and highlighted several key findings. Firstly, it was observed that negative out-
lier reviews, which are reviews with notably negative ratings compared to the average
product rating, are frequently targeted for spamming. On the other hand, positive outlier
reviews were not significantly affected by spamming. Another notable observation was
that reviews that were the only ones for a particular product were more likely to be fake.
This could be attributed to sellers attempting to promote unpopular products by fabricating
reviews.
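A hedged sketch of this supervised setup is shown below: a logistic regression model is fitted on a handful of review-, reviewer-, and product-centric features. The feature values and labels are invented, and the feature set is only a small subset of what the study actually used.

import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.DataFrame({
    "review_length":       [45, 300, 28, 150, 33, 60],
    "pct_opinion_words":   [0.40, 0.10, 0.55, 0.15, 0.50, 0.35],
    "reviewer_avg_rating": [5.0, 3.8, 4.9, 3.5, 5.0, 4.2],
    "reviewer_rating_std": [0.0, 1.1, 0.2, 1.3, 0.0, 0.9],
    "product_sales_rank":  [90000, 1200, 85000, 800, 95000, 5000],
    "is_fake":             [1, 0, 1, 0, 1, 0],   # labels, e.g. from duplicates
})

X, y = data.drop(columns="is_fake"), data["is_fake"]
model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict_proba(X)[:, 1].round(2))      # estimated probability of being fake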
Additionally, the study showed that highly ranked reviewers were more likely to write fake reviews. These reviewers wrote a great many reviews, some of them hundreds or even tens of thousands, far more than a typical customer would ever write.
The study also revealed that fake reviews could receive positive feedback, while genuine
reviews might get negative feedback. This highlighted how simple it was for spammers to
create plausible false reviews that receive a lot of positive comments and so trick readers.
Lastly, the study found that products with lower sales ranks were more susceptible to
spamming. This suggested that spammers predominantly targeted low-selling products
that required promotion, as it is difficult to damage the reputation of a well-established
product. Overall, these findings can assist in the development of more effective techniques
for identifying and filtering fake reviews.
Because manually labelling training data is difficult, relying solely on supervised learning to detect fake reviews is challenging. As a result, this section will cover two unsupervised
approaches that are already implemented on various review hosting sites. These tech-
niques have proven effective in addressing the issue of fake reviews without relying on
labelled training data.
The following section delves into various techniques for detecting spam based on
unusual reviewer behaviors. For instance, a reviewer who writes only negative reviews for
a specific brand, while other customers have positive reviews, and writes only positive
reviews for its competitors raises suspicion of being a fake reviewer. The first method
focuses on finding patterns of spamming behavior and rates each reviewer’s level of spam-
ming behavior using a numeric scale. To get the final spam score, these ratings are added
together, making it feasible to identify spammers and false reviewers. The behavior mod-
els for this method include:
(a) Targeting products: Spammers will probably focus their efforts on promoting or harm-
ing a small number of target products. To accomplish this, they closely monitor these
products and write fake reviews to manipulate their ratings at the appropriate time.
(b) Targeting groups: In this scenario, spammers manipulate the ratings of a group of products that share certain characteristics over a short period. For instance, a spammer may target several items of the same brand within a few hours, giving them either very high or very low ratings to achieve the greatest impact.
(c) General rating deviation: Genuine reviewers frequently rate the same products simi-
larly to other reviews. However, because they try to promote or denigrate particular
products, spammers’ ratings differ greatly from those of other reviewers.
(d) Early rating deviation: This behavior captures spammers who post fake reviews soon after a product is launched. Such early reviews are likely to catch the eye of other users, giving spammers the chance to influence the opinions of later reviewers.
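A toy sketch of the scoring idea is given below: each behavior model contributes a score in [0, 1] for a reviewer, and the scores are combined into an overall spam score. The weights and the values are assumptions made here for illustration.

def spam_score(behaviors, weights=None):
    """behaviors: dict mapping behavior name -> score in [0, 1]."""
    weights = weights or {b: 1.0 for b in behaviors}   # equal weights by default
    total = sum(weights.values())
    return sum(weights[b] * s for b, s in behaviors.items()) / total

reviewer = {
    "targeting_products": 0.8,
    "targeting_groups": 0.6,
    "general_rating_deviation": 0.9,
    "early_rating_deviation": 0.7,
}
print(round(spam_score(reviewer), 2))   # higher values suggest a likely spammer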
The second method ranks stores, reviewers, and reviews according to how likely they are to be involved in spamming; reviews associated with the top-ranked stores, reviewers, and reviews were the most probable instances of review spam. The assessment employed human evaluators to compare the resulting store scores with those of the Better Business Bureau (BBB), a reputable US organization that collates reports on business credibility and cautions consumers about business and consumer fraud.
3.8 Summary
Text mining has a wide range of applications. In this chapter, we have discussed the two
common applications of text mining, i.e., sentiment analysis and opinion mining, with
examples. Opinions play a vital role in almost all aspects of human activity because opin-
ions are major determinants of our behavior. A promising area of development in the
context of Web search is opinion search. This involves finding out what the general public
thinks about a certain thing or aspect or what people’s or organization’s opinions are on a
certain thing or subject.
3.9 Exercises
Q4: What is the difference between supervised sentiment classification and unsupervised
sentiment classification?
Q6: Write down different opinion types, and mention at least one example of each.
Q9: What is a sentiment analysis problem? Write down in your own words.
Q10: What is meant by word sense disambiguation? Provide at least three senses of the
word “pen”.
Recommended Reading
• Text Mining Classification, Clustering, and Applications by Ashok N. Srivastava,
Mehran Sahami
Publisher: Chapman & Hall
Publication Year: 2009
The book provides details of the various statistical methods used for text mining. It
discusses the important concepts of classification and clustering and provides details of
how these methods can be applied in various applications. Overall, the book is a good
source for anyone interested in text classification and clustering approaches.
One of the major challenges of text mining is that the textual data is unstructured, so it is
essential to bring the data in a structured format. In this chapter, the process of feature
engineering will be discussed in detail along with examples, Python implementation, and
a complete explanation of the source code.
The extraction of relevant features is thus an important task that must be completed before any algorithm can be applied.
It should be noted that features may be of different types, e.g., nominal, ordinal, numeric, etc. Features can be qualitative or quantitative. Each of these categories has fur-
ther subtypes as follows:
• Qualitative features
–– Nominal features
–– Ordinal features
–– Binary features
• Quantitative features
–– Discrete features
–– Continuous features
Nominal features are also called categorical features. They represent the names of
something, e.g., name of the cities of France, names of the categories of trees, etc. Ordinal
features are the ones that have a meaningful sequence between their values, for example,
grades of students or sizes of shirts. In the case of shirt sizes, we have S, M, L, or XL, where “S” stands for small, “M” for medium, “L” for large, and “XL” for extra-large. As we know, size “M” is
bigger than size “S”; similarly, size “L” is bigger than size “M” and so on.
Binary features are those that have only two values, e.g., Yes or No, True or False, 0
or 1, etc.
Quantitative features, as the name shows, represent some quantity. Discrete features are
those that have finite values. They can be counted, for example, zip code, number of trees,
etc. Continuous features on the other hand can have an infinite set of values. Normally, these
features appear in the form of real-type values, for example, volume, weight, height, etc.
In this chapter, we will discuss only the quantitative features.
As mentioned earlier, features represent the characteristics of real-world objects and enti-
ties. However, just getting the features is not sufficient. The features need to be converted
to an appropriate form so that the algorithm can use these features.
The “?” in a dataset represents a missing value. It is quite common for algorithms to encounter missing values while processing data, and there are a number of reasons for them. For example, a user may not be willing to provide confidential data, such as age or gender. Similarly, when data is collected automatically, e.g., a temperature reading taken every hour by a sensor, there may be cases where the sensor fails to record a value, for instance due to a power failure. There are different ways to handle such values. The simplest is to delete the records containing missing values and continue with the rest of the data. Although easy, this approach has the drawback that data is lost. The alternative is to fill in the missing value with the mean or median of the available values. If we use the mean in the example above, the missing record becomes (17, 4.4). The choice of technique depends on the requirements.
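The following short sketch illustrates these options with pandas; the small age/height dataset is invented so that the mean-imputed record matches the (17, 4.4) example above.

import pandas as pd

df = pd.DataFrame({"Age": [15, 16, 18, 19, None],
                   "Height": [4.2, 4.3, 4.6, 4.5, 4.4]})

df_drop = df.dropna()              # option 1: delete incomplete records
df_mean = df.fillna(df.mean())     # option 2: fill with the column mean
df_median = df.fillna(df.median()) # option 3: fill with the column median
print(df_mean)                     # the last record becomes (17.0, 4.4)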
Another important task performed during feature engineering is the detection of the
outlier data. An outlier value is a value that shows abnormal behavior as compared to the
other values in the data. It is so abnormal that it seems not generated by the way the other
values were generated.
In literature, there are a number of techniques for detection of outliers. Consider the
following example:
As can be seen, in the last record, where the age of the student is 17, the height is given as 7 ft, which is easily recognized as an outlier. One simple way to detect outliers is to compute the average and measure the distance of each value from it. In the example above, the difference is 2.6; depending on the requirements, such a value can be treated as an outlier.
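A minimal sketch of this mean-distance check is shown below; the sample heights and the threshold are invented for illustration.

import pandas as pd

heights = pd.Series([4.2, 4.3, 4.6, 4.5, 7.0])    # the 7 ft entry looks suspicious
deviation = (heights - heights.mean()).abs()       # distance of each value from the mean
threshold = 2.0                                    # application-dependent choice
print(heights[deviation > threshold])              # flags the 7.0 value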
One of the important tasks in feature engineering is the selection of the appropriate
features that could be used in the algorithm for further data processing. In real life, it is
easy to come across the applications that have to process hundreds of thousands of fea-
tures, which is computationally too expensive. In such a case, we use the process of feature
selection. Feature selection can be defined as the process of selecting a set of features from
the entire dataset that still provides the information otherwise provided by the entire data-
set. In text mining, feature selection can help in selecting the relevant documents by pro-
cessing the minimum number of features. For example, consider the following documents
along with their “Type” as feature:
Now, without feature selection, you would have to process all the documents to find those related to “Politics”. However, if you select the feature “Type” and then search only the documents whose type is “Politics”, you can save a lot of time.
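A tiny sketch of this idea with pandas follows; the document texts and the column names are assumptions made here for illustration.

import pandas as pd

docs = pd.DataFrame({"Text": ["Election results announced",
                              "Local team wins the cup",
                              "New budget approved by parliament"],
                     "Type": ["Politics", "Sports", "Politics"]})

politics_docs = docs[docs["Type"] == "Politics"]   # process only these documents
print(politics_docs)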
Another important task performed in feature engineering is that of feature encoding.
For example, consider the following documents:
Suppose we want to preserve the semantics. So, for this purpose, we need to represent
these documents in the form of some numeric values. This raises the issue of representing
the documents in the form of vectors. One of the methods is to define a template in such a
way that the information about the size and the color is preserved. One such tem-
plate may be:
Document-Name(ObjectName, Size_Small, Size_Medium, Size_Large, Color_Red,
Color_Yellow, Color_Blue).
In order to show the presence or absence of a feature, we use the binary values “1” and
“0”. So, the documents will be encoded as:
D1(1, 1, 0, 0, 1, 0, 0)
D2(2, 0, 0, 1, 0, 1, 0)
D3(3, 0, 1, 0, 0, 0, 1)
D4(4, 0, 0, 0, 0, 1, 0)
D5(5, 0, 1, 0, 1, 0, 0)
This technique is called one-hot encoding. The benefit of this technique is its simplic-
ity, as we do not need any complex mechanism to represent the features of the documents.
However, the major issue with this technique is that it requires a lot of storage to represent
the data. As you can see, with increase in the semantics, the length of the vector to repre-
sent the document also increases.
In label encoding, the size does not change, and we can preserve the document seman-
tics easily. For example, we can use the value “0” for “Small”, “1” for “Medium”, and “2”
for “Large”. Similarly, we can use “1” for “Red”, “2” for “Blue”, and “3” for “Yellow”. In
this way, our documents will become:
D1(1, 0, 1)
D2(2, 2, 3)
D3(3, 1, 2)
D4(4, 0, 3)
D5(5, 1, 2)
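The following sketch shows both one-hot and label encoding with pandas and scikit-learn. Since the original document table is not reproduced here, the size/color values are re-created by assumption, and note that LabelEncoder assigns codes alphabetically, which may differ from the ad hoc mapping used above.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

docs = pd.DataFrame({"Size":  ["Small", "Large", "Medium", "Small", "Medium"],
                     "Color": ["Red", "Yellow", "Blue", "Yellow", "Red"]})

one_hot = pd.get_dummies(docs, dtype=int)   # one 0/1 column per (feature, value) pair
label_encoded = docs.apply(lambda col: LabelEncoder().fit_transform(col))
print(one_hot)
print(label_encoded)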
Another encoding scheme is the frequency encoding scheme. In this scheme, the values
are assigned on the basis of the occurrence in the data. For example, in our document col-
lection, there are five documents, and the size “medium” appeared twice, so its frequency
will be: 2/5 = 0.4. On this basis, the frequency of the other semantic terms will be:
Similarly, the frequency values of the colors “Red”, “Blue”, and “Yellow” will be:
Here we have encoded the semantics using the frequency encoding scheme. However,
as you can see, the issue with this scheme may be that two different semantics may have
the same frequency, which means that they have the same encoded value.
Another common encoding scheme is target encoding. Suppose we take the color as the target. Then, for the value “Small”, the sum of the target values is 4 (“1” for Red plus “3” for Yellow). In the same way, we calculate the target sum for every value of the “Size” feature.
In our case, the frequency of the value “Small” is 2, that of “Medium” is 2, and that of “Large” is 1. The target values will be:
Small: 4/2 = 2
Medium: 4/2 = 2
Large: 3/1 = 3
D1(1, 2, 1)
D2(2, 3, 3)
D3(3, 2, 2)
D4(4, 2, 3)
D5(5, 2, 2)
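A hedged sketch of frequency and target encoding for the same assumed size/color data follows; the numeric target column is invented for illustration, so the exact values differ slightly from the worked example above.

import pandas as pd

docs = pd.DataFrame({"Size":  ["Small", "Large", "Medium", "Small", "Medium"],
                     "Color": ["Red", "Yellow", "Blue", "Yellow", "Red"],
                     "Target": [1, 3, 2, 3, 1]})   # e.g. a numeric color code

# Frequency encoding: replace each value by its relative frequency in the data
freq = docs["Size"].map(docs["Size"].value_counts(normalize=True))

# Target encoding: replace each value by the mean target of the rows having it
target = docs["Size"].map(docs.groupby("Size")["Target"].mean())

print(pd.DataFrame({"Size": docs["Size"], "freq": freq, "target": target}))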
4.3 Traditional Feature Engineering Models
In this section, we will discuss some traditional models that are extensively used in the
feature engineering process.
As discussed earlier, one of the major issues of the natural language text is that it is
unstructured and thus it is very difficult for the machine learning algorithms to process
such text. So, we need to convert this text into numbers, specifically the vectors. Once the
text is converted into these vectors, it can be given as input to any algorithm. Bag of words
is one of the models that can be used to convert the raw text into the features and the
vectors.
Bag of words is commonly abbreviated as BoW. It is used to extract features from text.
There are various versions of the bag-of-words model; however, here we will discuss only
the simple one. The core intention of the bag-of-words model is to present the occurrence
of words in a text. The term “Bag” here refers to a collection of words, just like a bag can
store many other things. The model performs in two steps. In the first step, it creates a
vocabulary of the words that are present in the text, and in the second step, it presents the
measure of presence of words.
One of the major features of the bag-of-words model is that it does not require any
information about the structure of the words in the text.
Let’s first discuss the example of the model, and then we will provide its implementa-
tion. Consider the following text:
It is good weather today. Last year it was not so good.
First, we create the vocabulary. A vocabulary is a collection of unique words in the text.
The vocabulary in our case will be “it”, “is”, “good”, “weather”, “today”, “Last”, “Year”,
“was”, “not”, and “so”.
As you can see, all the words are uniquely represented. Now, the next task is to convert
each sentence to the vector form. For convenience, we can use the binary values to mark
the presence and the absence of the words. For example, the first sentence will be con-
verted as follows:
“it” = 1, “is” = 1, “good” = 1, “weather” = 1, “today” = 1, “Last” = 0, “Year” = 0, “was” = 0, “not” = 0, “so” = 0
So our first sentence will be represented through the following vector:
It is good weather today = [1,1,1,1,1,0,0,0,0,0]
The second sentence will be as follows:
Last year it was not so good = [1,0,1,0,0,1,1,1,1,1]
As you can see, we have not followed any proper structure to represent the document
apart from the sequence in which the words appeared in the dictionary. We simply picked
the dictionary words one by one and marked them in the vector with “1” or “0” based on
their presence.
The words we marked in the sentences were those that appear in the vocabulary. A word may occur in a new sentence but not in the vocabulary; such words are simply ignored, which is another aspect of the BoW model's simplicity.
One of the problems that the basic BoW model faces is the large size of the vocabulary. In our example, the vocabulary comprised just ten words, so the vector used to represent each sentence had length 10. In real life, however, a vocabulary may contain many thousands of words; consider, for example, the vocabulary of an entire book. In such cases, it becomes much more difficult to process the vocabulary and the resulting large vectors. A number of techniques can be used to reduce the size
of the vocabulary and consequently the vectors. For example, we can ignore the case and
convert all words to lowercase. In this way, there will be no difference between “The” and “the”, as both will be represented by a single vocabulary entry. Removing the
stop words is another helpful strategy. We can also reduce inflected word forms to their headword; e.g., “eating”, “ate”, and “eaten” can all be replaced with the word “eat”.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,
TfidfVectorizer
#Documents
d1 = "Today Weather is good"
d2= "Yesterday it was not so good"
d3= "It will remain the same throughout this month"
d4 = "It was same Last year"
d5= "Last year it was also good"
d6= "hopefully, Next month it will be good as well"
d7= "good weather in next month will be amazing"
CVec = CountVectorizer(ngram_range=(1,1),
stop_words='english')
CData = CVec.fit_transform([d1, d2, d3, d4, d5, d6, d7])
vocab = CVec.get_feature_names()
print(pd.DataFrame(CData.toarray(),columns=vocab))
Here we have first imported the libraries: pandas and the CountVectorizer and TfidfVectorizer modules from sklearn. Then we defined seven documents that share some common words and contain some different ones. Then we used the CountVectorizer module to convert the text to numbers. We have used the 1-gram model, where each single word is considered a token; in the next section we will see the N-gram model as well.
Then we used the fit_transform method so that the model learns the vocabulary and returns the document-term matrix. Finally, we displayed the vector form of each document: each row of the output shows a document and its corresponding count vector.
The output of the code is as follows:
The index 0–6 in each row represents the document and the corresponding matrix. So
the first document is represented as follows:
Today Weather is good = [0, 1, 0, 0, 0, 1, 1, 0, 0]
You can use the following statement to display the dictionary:
print(vocab)
The output will be as follows:
['amazing', 'good', 'hopefully', 'month', 'remain', 'today', 'weather', 'year', 'yesterday']
In the 2-gram BoW model, each vocabulary entry comprises two consecutive words. We will consider the same documents and implement the BoW model again.
#important imports
from sklearn.feature_extraction.text import CountVectorizer
#Documents
d1 = "Today Weather is good"
d2= "Yesterday it was not so good"
d3= "It will remain the same throughout this month"
d4 = "It was same Last year"
d5= "Last year it was also good"
d6= "hopefully, Next month it will be good as well"
d7= "good weather in next month will be amazing"
CVec = CountVectorizer(ngram_range=(2,2),
stop_words='english')
CData = CVec.fit_transform([d1, d2, d3, d4, d5, d6, d7])
vocab = CVec.get_feature_names()
print(pd.DataFrame(CData.toarray(),columns=vocab))
As you can see, while creating the CountVectorizer, we provided the “ngram_range” parameter as (2,2), which means that both the minimum and the maximum n-gram size are 2. The output of the code will be:
[7 rows × 10 columns]
If we want to see the dictionary, we can use the following statement:
print(vocab)
The output will be as follows:
['good weather', 'hopefully month', 'month amazing', 'month good', 'remain month', 'today
weather', 'weather good', 'weather month', 'year good', 'yesterday good']
As discussed earlier, TF stands for term frequency, and IDF stands for inverse document
frequency. In the first chapter, we discussed these concepts. Before proceeding further,
let’s check the following example:
Banana is a fruit
Mango is a fruit
Both banana and mango are fruits
To keep the vocabulary short, we have removed words such as “is”, “a”, and “are”, so the vocabulary is:
• Fruit
• Banana
• Mango
Table 4.4 shows the term frequency of each term in the vocabulary.
The inverse document frequency shown in Table 4.5 is calculated according to the fol-
lowing formula:
idf(t, d) = Number of Sentences / Number of Sentences Containing the Word
CVec = TfidfVectorizer(ngram_range=(1,1),
stop_words='english')
CData = CVec.fit_transform([d1, d2, d3, d4, d5, d6, d7])
vocab = CVec.get_feature_names()
print(pd.DataFrame(CData.toarray(),columns=vocab))
Note that instead of using the CountVectorizer, this time we have used the TfidfVectorizer.
The output of the code is as follows:
[7 rows × 9 columns]
Again, the rows show the document number, and the vector shows the TF-IDF of each
term against that document. The 1-Gram vocabulary will remain the same.
Now, let’s check the code if we use the 2-Gram model:
# CVec = CountVectorizer(ngram_range=(2,2), stop_words='english')
TVec = TfidfVectorizer(ngram_range=(2,2),
stop_words='english')
TData = TVec.fit_transform([d1, d2, d3, d4, d5, d6, d7])
vocab = TVec.get_feature_names()
print(pd.DataFrame(TData.toarray(),columns=vocab))
[7 rows × 10 columns]
Whenever we need to extract the features of a new document, i.e., a document that was not part of the original collection from which the vocabulary was built, we can do so using the transform(...) method of the vectorizer. The advantage is that we do not need to rebuild the vocabulary or add new terms to it; the new document is simply transformed using the existing vocabulary.
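For example, assuming one of the vectorizers fitted earlier in this chapter is still in scope as CVec, a new, unseen document can be encoded as follows; the example sentence is invented.

new_doc = ["Next year the weather will be good"]   # a document not seen before
new_vec = CVec.transform(new_doc)                   # reuse the fitted vocabulary, no re-fitting
# (newer scikit-learn versions use get_feature_names_out())
print(pd.DataFrame(new_vec.toarray(), columns=CVec.get_feature_names()))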
Information extraction systems have to search the documents that are found similar to the
query of the user. This is the step where the concept of document similarity helps. As the
name implies, document similarity refers to the process of finding how similar or dissimi-
lar two documents are. There are a number of ways we can find this, as different metrics
have been proposed for this purpose, e.g., edit distance, TF-IDF, and so on.
Here is a simple, naive way of deciding whether two documents are similar. Consider the following two documents:
D1: It is good weather today
D2: I like good weather
Now, on the basis of the number of key terms present in both documents, we can judge their similarity. Here, two words, “good” and “weather”, are common to both, so we can set a threshold: whenever the number of shared words exceeds that threshold, we call the two documents similar. This idea is closely related to distance measures such as the Euclidean distance, discussed next.
In case of the documents “D1” and “D2”, the vocabulary will be:
[“It”, “is”, “Good”, “weather”, “today”, “I”, “Like”]
According to this vocabulary, the documents will be translated into vectors as follows:
D1: [1,1,1,1,1,0,0]
D2: [0,0,1,1,0,1,1]
The magnitude (length) of each document vector is:

|d1| = √(1² + 1² + 1² + 1² + 1² + 0² + 0²) = √5 ≈ 2.23
|d2| = √(0² + 0² + 1² + 1² + 0² + 1² + 1²) = √4 = 2

The dot product of the two vectors counts the words they share (“good” and “weather”):

d1 · d2 = 2

so the cosine similarity is:

Similarity(d1, d2) = (d1 · d2) / (|d1| |d2|) = 2 / (2.23 × 2) ≈ 0.45, i.e., about 45%

Here d1ᵢ denotes the value for the i-th vocabulary word in document d1, and d2ᵢ the corresponding value in document d2. The Euclidean distance between D1 and D2 is then:

ed(d1, d2) = √((1 − 0)² + (1 − 0)² + (1 − 1)² + (1 − 1)² + (1 − 0)² + (0 − 1)² + (0 − 1)²)
ed(d1, d2) = √(1 + 1 + 0 + 0 + 1 + 1 + 1) = √5 ≈ 2.23
The greater the distance, the less similar the documents are. Now, we will present the
Python code to calculate the document similarity. First, let’s discuss the simplest method:
import spacy
nlp = spacy.load("en_core_web_sm")
d1 = nlp("Today weather is good")
d2= nlp("Yesterday weather was not so good")
d3= nlp("It will remain the same throughout this month")
d4 = nlp("We live in Asia")
d1_d2_similarity = d1.similarity(d2)
d1_d3_similarity = d1.similarity(d3)
d1_d4_similarity = d1.similarity(d4)
print(d1_d2_similarity)
print(d1_d3_similarity)
print(d1_d4_similarity)
0.6512580243881606
0.3824955568737893
0.22766188049798478
Here, we have used spaCy's similarity function to calculate the similarity. The similarity between d1 and d2 is the highest of all, while the similarity between d1 and d4 is the lowest, which is apparent after reading the contents of these documents. We can also use the Sklearn (scikit-learn) Python library, which provides the cosine_similarity function for this purpose.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Here d1..d7 are the raw text documents of the earlier corpus
# (plain strings, not the spaCy Doc objects used above).
CVec = CountVectorizer(ngram_range=(1,1), stop_words='english')
# TVec = TfidfVectorizer(ngram_range=(2,2), stop_words='english')
TData = CVec.fit_transform([d1, d2, d3, d4, d5, d6, d7])
vocab = CVec.get_feature_names()   # get_feature_names_out() in newer scikit-learn
simmat = cosine_similarity(TData)
print(pd.DataFrame(simmat))
As you can see, we have used the same embeddings obtained through the CountVectorizer for the purpose of calculating the similarity. The transformed data is passed to the cosine_similarity function, which returns the similarity matrix.
The similarity matrix shows the pairwise similarity between the documents as a continuous value between 0 and 1. The output of the function is as follows:
0 1 2 3 4 5 6
0 1 0.408248 0 0 0.408248 0.333333 0.57735
1 0.408248 1 0 0 0.5 0.408248 0.353553
2 0 0 1 0 0 0.408248 0.353553
3 0 0 0 1 0.707107 0 0
4 0.408248 0.5 0 0.707107 1 0.408248 0.353553
5 0.333333 0.408248 0.408248 0 0.408248 1 0.57735
6 0.57735 0.353553 0.353553 0 0.353553 0.57735 1
The entry in row 0 and column 1 shows the similarity of document d0 with document d1, which is "0.408", almost 41%. Note that the value "1" appears along the diagonal; we can ignore these values, as they show the similarity of a document with itself, which is not informative.
As you can see, document d3 has zero similarity with document d5, as there is nothing in common between these documents.
Below is the document similarity based on TfidfVectorizer:
0 1 2 3 4 5 6
0 1 0.182103 0 0 0.209001 0.154463 0.442001
1 0.182103 1 0 0 0.258828 0.191289 0.162603
2 0 0 1 0 0 0.306488 0.260527
3 0 0 0 1 0.838416 0 0
4 0.209001 0.258828 0 0.838416 1 0.219543 0.186621
5 0.154463 0.191289 0.306488 0 0.219543 1 0.37638
6 0.442001 0.162603 0.260527 0 0.186621 0.37638 1
Document similarity can be used in many real-life applications, e.g., search engines,
recommender systems, image matching, etc.
Topic modeling is the process of analyzing the text of the documents, extracting the significant words from them, and finally grouping together the documents that are similar to each other. If you recall, this is similar to the process of clustering that is normally performed in data mining. As it does not require previously assigned labels, topic modeling is an unsupervised approach.
Note that this is totally different from the classification of the documents where each
document is classified into topics on the basis of the previous training data. The topic
modeling process is very helpful, especially in the case of a corpus consisting of a large
number of documents.
For example, consider a corpus consisting of 1000 documents, where each document comprises 100 words. Processing each document means we will have to process 1000 × 100 = 100,000 words. However, if the documents are already clustered and grouped on the basis of topics, we need to process only the documents that belong to a certain topic, which saves a lot of effort.
There are various methods for performing topic modeling, for example:
• Latent Dirichlet allocation (LDA)
• Latent semantic analysis/indexing (LSA/LSI)
• Non-negative matrix factorization (NMF)
The following source code shows the topic modeling process using latent Dirichlet
allocation. For this, we have used the LatentDirichletAllocation module from sklearn.
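As a minimal sketch of this process (the five documents below are illustrative, not the exact listing used here; LatentDirichletAllocation lives in sklearn.decomposition):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Five short documents that share some words.
docs = [
    "The quick brown fox jumped over the lazy dog",
    "The lazy dog is a good dog loved by its friends",
    "The quick fox jumped again at night",
    "Good friends loved the lazy brown dog",
    "The fox slept at night after it jumped",
]

# Build the document-term matrix.
cvec = CountVectorizer(stop_words='english')
dtm = cvec.fit_transform(docs)

# Fit an LDA model with two topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

# Display the top words belonging to each topic.
terms = cvec.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[::-1][:6]]
    print(f"Topic #{idx}:", " ".join(top_terms))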
In the above code, we first introduced five documents having some similar words. Then a CountVectorizer was applied to obtain the document-term matrix. We used two topics for this example. Finally, the top words belonging to each topic were displayed.
The following is the output of the code:
Topic #0:
dog lazy brown good friends loved
Topic #1:
fox quick jumped night slept
So far, we have discussed the basic feature engineering algorithms. From now onwards, we will discuss the advanced feature engineering algorithms used in machine learning.
All machine learning models require training on the basis of an existing corpus. This means we need to load the corpus into a model and train the model on that corpus. To facilitate this, various natural language processing libraries provide ready-made corpora that we can use in our code, along with a number of APIs for importing them. As an example, we can load a Bible corpus as follows.
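A minimal sketch using NLTK, whose Gutenberg corpus ships the King James Bible as 'bible-kjv.txt':

import nltk

nltk.download('gutenberg')          # one-time download of the corpus
from nltk.corpus import gutenberg

# Load the Bible text as a list of word tokens and inspect it.
bible = gutenberg.words('bible-kjv.txt')
print(len(bible), "tokens")
print(bible[:12])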
So far, the methods that we have discussed to represent words consider each word independently of the others, ignoring the context in which it appears. Word embedding models such as Word2Vec address this by learning a word's representation from its surrounding context, using one of two architectures: the continuous bag-of-words (CBOW) model and the skip-gram model.
The continuous bag-of-words (CBOW) model tries to predict the target word from the given context words. The context words can be the words immediately preceding and following the current word, or more than one word on each side, depending on the size of the window. Figure 4.3 shows the diagrammatic representation of CBOW, while Fig. 4.4 shows the representation of the skip-gram model.
The skip-gram model works in the opposite direction to CBOW: it tries to predict the context words from the current word.
The following is the source code that shows the implementation of the CBOW model:
import numpy as np
from keras.layers import Dense, Embedding, Lambda
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
# Define hyperparameters
vocab_size = 5000
embedding_size = 50
window_size = 2
batch_size = 128
epochs = 10
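The remainder of the listing prepares (context, target) pairs and builds the network itself; a minimal sketch of the model-construction step, consistent with the description that follows (an embedding layer, an averaging Lambda layer over the context, and a dense softmax output) and assuming the hyperparameters defined above, could look like this:

from keras import backend as K

cbow = Sequential()
# Map each of the 2*window_size context-word indices to an embedding vector.
cbow.add(Embedding(input_dim=vocab_size, output_dim=embedding_size,
                   input_length=window_size * 2))
# Average the context-word embeddings into a single context vector.
cbow.add(Lambda(lambda x: K.mean(x, axis=1),
                output_shape=(embedding_size,)))
# Predict the center word over the whole vocabulary with a softmax layer.
cbow.add(Dense(vocab_size, activation='softmax'))
cbow.compile(loss='categorical_crossentropy', optimizer='rmsprop')
# plot_model(cbow, to_file='cbow_model.png', show_shapes=True)  # needs pydot/graphviz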
The code first prepares some sample data and feeds it to the model. The model is built with the Keras library and uses an embedding layer, a context-averaging (Lambda) layer, and a dense layer. The softmax function is applied at the dense layer to predict the center word. We have also visualized the model architecture using the plot_model function from Keras. After ten epochs, the model generated the following output:
In the previous section, we generated a Word2Vec model using Keras, following the CBOW architecture. Now, let's generate a Word2Vec model using Gensim. The code is much simpler than the previous one.
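A minimal sketch (the toy corpus below is illustrative, not the exact word list used here):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [
    ["this", "is", "the", "first", "document"],
    ["this", "document", "is", "the", "second", "document"],
    ["and", "this", "is", "the", "third", "one"],
    ["is", "this", "the", "first", "document"],
]

# Initialize the model (CBOW by default), then build the vocabulary and train.
model = Word2Vec(vector_size=50, window=2, min_count=1, sg=0)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=10)

# Embedding of the word "first" and the words most similar to it.
print(model.wv["first"])
print(model.wv.most_similar("first"))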
The code defines a list of tokenized sentences, after which the Word2Vec model is initialized using different parameters, including the window size. We use the build_vocab method to build the vocabulary, and the model is then trained using the train method. Finally, we obtain the embedding of the word "first" and find the words most similar to it. The output includes the embedding vector for "first" and a ranked list of its most similar words.
Word2Vec can be used in a number of machine learning applications. One of the most common is document search, where documents related to a specific word or query can be found; it overcomes a drawback of the conventional BoW model, which carries no semantic information. A typical way of using Word2Vec in a machine learning task is to convert each document into a fixed-length vector, for example by averaging the vectors of its words, and then feed these document vectors to a downstream model, as sketched below:
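A minimal sketch of such a pipeline (the documents and parameter values are illustrative; any downstream classifier or clusterer could consume the resulting feature matrix):

import numpy as np
from gensim.models import Word2Vec

# Toy tokenized corpus.
docs = [
    ["good", "weather", "today"],
    ["the", "weather", "was", "bad", "yesterday"],
    ["we", "like", "good", "food"],
]

model = Word2Vec(docs, vector_size=50, window=2, min_count=1, epochs=20)

def document_vector(tokens, model):
    """Average the Word2Vec vectors of the tokens found in the vocabulary."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)

# Fixed-length feature vectors, one per document.
features = np.array([document_vector(d, model) for d in docs])
print(features.shape)   # (3, 50)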
Word embeddings are much better than the BoW or other feature encoding schemes; how-
ever, one of the drawbacks of Word2Vec is that it considers the context information that is
local to the current word. For example, let’s look at the following sentence:
Smith eats Banana
Here the context words for “eats” are {Smith, Banana}, which are local in the sentence. In
the GloVe (Global Vectors for Word Representation) model, we consider the global informa-
tion. The global information is obtained by a co-occurrence matrix that is built using the entire
corpus. Table 4.7 shows an example of such a co-occurrence matrix for a short sentence.
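The pre-trained vectors themselves have to be loaded first; one convenient option (an assumption here, not the only route) is gensim's downloader API, which exposes the publicly available "glove-wiki-gigaword-100" vectors as KeyedVectors. The exact similarity values printed below depend on which vectors are loaded:

import gensim.downloader as api

# Downloads the pre-trained GloVe vectors on first use and
# returns them as a gensim KeyedVectors object.
model = api.load("glove-wiki-gigaword-100")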
# Example usage
print(model.similarity('king', 'queen'))
print(model.most_similar('france'))
0.65109555
[('spain', 0.7568718194961548), ('italy', 0.7442922596931458), ('germany',
0.7363376617431641), ('england', 0.7192206382751465), ('europe',
0.7071917653083801), ('belgium', 0.7005804777145386), ('netherlands',
0.6887465715408325), ('sweden', 0.676919102191925), ('austria',
0.6685905451774597), ('portugal', 0.654475510597229)]
The applications of the GloVe model in machine learning are the same as those of Word2Vec. However, as discussed earlier, the GloVe model provides broader semantic context when finding the semantic similarity of a word. This is because the GloVe model is built not on local information alone but on global information, collected in the form of a co-occurrence matrix built over the entire corpus.
The FastText model is a further extension of the Word2Vec model. In the BoW model, the semantics of the words are not considered. This problem was addressed by the Word2Vec and GloVe models, where words are transformed into vectors called embeddings, formed in such a way that words with similar meanings have embeddings that are close to each other. However, in the Word2Vec and GloVe models, the internal structure of the words (their morphology) is not considered. Due to this issue, both of these models are less suitable for languages that are morphologically rich. The FastText model overcomes this problem.
Both Word2Vec and GloVe use word grams, i.e., a complete word or set of words is used, whereas the FastText model uses character grams. For example, consider the word "often"; with the boundary markers "<" and ">" that FastText adds, its character 3-grams are "<of", "oft", "fte", "ten", and "en>", together with the whole word "<often>".
Here is part of the Python code for a from-scratch FastText-style model to generate the embeddings:
import numpy as np
from collections import defaultdict

# num_words (the vocabulary size) and embedding_size (10 in this example)
# are set earlier in the full listing.
w1 = np.random.randn(num_words, embedding_size) / np.sqrt(embedding_size)
w2 = np.zeros((embedding_size, num_words))
In this code, we defined a simple corpus. After this, the parameters were defined, and the weights were initialized. The model was then trained for five epochs, with the window size kept at two (2). Finally, we obtained the embedding for the word "Python". The size of the embedding is 10, as specified in the parameters above. Your output can differ depending on the initialization of the parameters and the number of epochs.
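As an alternative to a from-scratch implementation, gensim also provides a FastText class; a minimal sketch with the same settings mentioned in the text (embedding size 10, window 2, five epochs) and an illustrative corpus:

from gensim.models import FastText

# Illustrative tokenized corpus containing the word "Python".
sentences = [
    ["Python", "is", "a", "popular", "programming", "language"],
    ["text", "mining", "with", "Python", "is", "fun"],
    ["character", "grams", "capture", "subword", "information"],
]

model = FastText(vector_size=10, window=2, min_count=1)
model.build_vocab(corpus_iterable=sentences)
model.train(corpus_iterable=sentences,
            total_examples=model.corpus_count, epochs=5)

print(model.wv["Python"])                       # 10-dimensional embedding
print(model.wv.most_similar("Python", topn=3))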
The FastText model can be applied anywhere we use the previously discussed models, i.e., BoW, Word2Vec, and GloVe. It provides embeddings on the basis of character grams instead of word grams, so it can be applied to morphologically rich languages. The embeddings can be used to compute document similarity, which in turn can be applied to search engines. Furthermore, the model can be used for text classification as well, for example, analyzing survey responses, analyzing product reviews to gauge customer sentiment toward a particular product, or recommending movies and songs on the basis of the lyrics of songs that a user already likes.
4.5 Summary
In this chapter, we provided the details of the feature engineering methods for textual data.
We started with the basic definition of the features and discussed different examples. After
this, we provided details of feature engineering and how different methods can be used to
manipulate and manage the features before feeding them to any algorithm that needs to
process this data.
The methods were divided into two categories, i.e., traditional methods and advanced methods. Among the traditional methods, N-Gram-based methods including TF-IDF models, document similarity, and topic models were discussed.
Among the advanced models, embeddings-based models were discussed, including the Word2Vec, GloVe, and FastText models. Source code implementations were also provided in order to make the concepts clear.
4.6 Exercises
Q2: Consider the sentences given in Q1. By using the BoW model, translate them into
feature vectors.
Q3: Provide at least one example of the dataset, and generate a new feature from the
existing features in the dataset.
Write down the Python code to find the cosine similarity between the documents d1 and
d2 and d1 and d3.
Q7: Write down the Python code to implement the Word2Vec model.
Q8: Write down the Python code to implement the GloVe model.
Q10: What is the difference between the GloVe and FastText models?
Recommended Reading
• Feature Engineering for Machine Learning: Principles and Techniques for Data
Scientists by Alice Zheng and Amanda Casari
Publisher: O’Reilly Media
Publication Year: 2018
Feature engineering is a crucial component of the machine learning process, although
it frequently receives less attention in debates. This informative publication highlights
the significance of feature engineering, offering vital knowledge on the methods used
to extract and convert features, which are numerical representations of unprocessed
data, into formats that are appropriate for machine learning models. Every chapter in
this book serves as a comprehensive exploration of a distinct data difficulty, such as the
complex undertaking of encoding textual or visual data. The compilation of these case
studies serves to elucidate the core principles underlying feature engineering.
• Bad Data Handbook: Cleaning Up the Data So You Can Get Back to Work by Q. Ethan
McCallum
Publisher: O’Reilly Media
Publication Year: 2012
Defining bad data goes beyond technical glitches like missing values or jumbled
records; it encompasses a broader spectrum. In this insightful handbook, data expert
Q. Ethan McCallum has curated wisdom from 19 experts across diverse domains of the
data landscape. Together, they unveil their strategies for overcoming challenging
data issues.
Bad data takes various forms, from glitchy storage systems to inadequate representa-
tion and ill-conceived policies. At its core, bad data obstructs progress. This book navi-
gates through these obstacles, offering practical and effective methods to circumvent
the hurdles posed by problematic data. Through the collective experiences shared
within these pages, readers gain valuable insights into tackling the complexities of real-
world data challenges.
• Data Wrangling with Python: Tips and Tools to Make Your Life Easier by Jacqueline
Kazil and Katharine Jarmul
Publisher: O’Reilly Media
Publication Year: 2016
The book helps you enhance your data analysis abilities beyond the limitations of
Excel. This comprehensive guide is specifically designed for individuals without prior
programming experience, providing a structured approach to acquiring proficiency in
Python programming. The guide aims to equip readers with the necessary skills to
effectively handle complex data-related assignments. Rest assured, this guide is specifi-
cally tailored for individuals who are new to Python, even if they do not possess any
prior familiarity with the programming language.
Text classification, also known as categorization, is one of the important text mining tasks.
The entire process of text classification is based on supervised learning, where the text is
categorized based on training data. In this chapter, the task of text classification will be
discussed in detail. Each step will be explained with the help of examples. Python code
and a complete description will also be provided.
One of the first and most important stages in gathering useful information from textual
data is training your system to process and analyze language. Although familiarity with
linguistic syntax, structure, and semantics is crucial, it is insufficient to be able to make the
most of huge amounts of text data by extracting meaningful patterns and insights.
Incorporating language processing expertise with artificial intelligence (AI), machine
learning (ML), and deep learning (DL) principles allows for the development of smart
systems that can use text data to address real-world issues faced by companies and
organizations.
Machine learning encompasses a wide range of subfields, the most recent of which is "deep learning". Deep learning methods can be used in supervised, unsupervised, and reinforcement learning settings and include architectures such as recurrent neural networks. The power of predictive and prescriptive analytics lies in the ability to immediately apply a trained model to fresh and unknown data and get the results.
Text classification, also known as document classification, is one of the most important
and difficult tasks in the field of natural language processing. It involves sorting text docu-
ments into numerous (predefined) groups according to their common characteristics. This
may be used in a variety of industries and academics, from detecting spam emails to
organizing news stories. The theory may seem simple. If the number of documents is few,
it may be possible to examine each one and deduce its subject. Once there are hundreds of
thousands of text documents that need to be categorized, the task becomes increasingly
difficult.
Classifying texts may also be referred to as categorizing texts. However, we use the term "classification" specifically for two reasons. First, it reflects our primary goal of placing documents into classes. Second, it emphasizes that the data is put into different categories based on its characteristics. There are a variety of methods for classifying texts, and classification is used often in fields outside of text analysis, including science, medicine, meteorology, and technology.
Feature extraction and supervised/unsupervised machine learning are two methods that
may be used in this context. The task of document categorization is general; it can be used
in any field, especially in the media industry.
Keep in mind that, at their core, documents are just strings of text; a collection of them is basically a corpus. Our job is to sort documents into the categories that fit them most closely.
There are multiple stages of this task, all of which will be discussed here. To start with, we
require some labelled data to train a text classification model for a supervised classifica-
tion task. This information is based on the already verified labels. Using these labels, we
can categorize the given documents based on their properties.
The process is based on a supervised machine learning technique. Naturally, the data
must be cleaned and standardized before the model can be trained on the basis of this data.
New document’s classes or categories may then be predicted by using the already trained
model as mentioned before. In contrast, in an unsupervised classification task, the text
labels are not available, so as an alternative to get the labels, we can use clustering or docu-
ment similarity to get the labels.
In this chapter, we will treat text document categorization as a supervised machine learning problem and explore its implications. We will also discuss different categorization systems and the information they provide. We will also show a graphical rep-
resentation of the core activities involved in a text categorization process and how these
activities are performed in the entire process. Before defining the actual classification
procedure, we need to know the characteristics of the textual data and what it means by the
term “Classification”. Textual data may be anything from a single word or sentence to a
whole document containing paragraphs of text, and it can come from any number of
sources, including corpora, blogs, the Web, and even a corporate data warehouse. The
word “document” is used to include a wide variety of textual information; hence, the
phrase “text classification” is also often used.
As we have discussed basic ideas about the text classification, we can explicitly describe
the text categorization process and its scope. If we assume that we already have a specified
set of classes, we may say that text categorization is the process of placing texts into those
classes. Inherent features of documents allow a text categorization system to accurately
place them in the appropriate classification(s). It may be mathematically described as C = {c1, c2, …, cn}, where c1, c2, …, cn is the collection of predefined classes or categories and "d" is the description and characteristics of a document D.
Since document D itself may have several essential features, it may be seen as an entity
existing in a multidimensional space. It is expected that a text classification system T can
correctly label a document D to the appropriate class Cx on the basis of the features defined
by d. The mathematical expression for this is T : D → Cx. In the next chapters, we will go
into further depth about the textual organization scheme. A primary conceptual representa-
tion of the text categorization procedure is shown in Fig. 5.1.
Many documents (as shown in Fig. 5.1) may be classified as belonging to different
types, i.e., politics, sports, and movies. These files are initially available as collections,
much as the documents that make up a text corpus. Each document has been assigned to
one class or category after going through a text categorization system. Note that these
classes/categories have already been defined before starting the process. Also, note that
documents in real data are only represented by their names, but they may also contain
detailed information about themselves (such as their specifications, components, and so
on), which can be used to assign the labels to that document.
Classifying texts may be done in several ways. We distinguish between two broad cat-
egories defined by the information included in the documents. These are as follows:
• Request-based classification
• Content-based classification
In content-based classification, we look at the words and ideas inside a piece of text to fig-
ure out what category it belongs to. We use things like keywords, topics, and the way the text
is written to make this decision. We don’t need any extra information besides the text itself.
Fig. 5.1 A conceptual view of text classification: a mixed collection of documents (e.g., National Congress, Democratic Party, Cricket, Baseball, Terminator II, Gravity) is passed through a classification algorithm and grouped into the categories Politics, Sports, and Movie
You can think of text classification as a team of people working together to sort each document. If there are a limited number of documents, the team may perform its task very well. However, if there are hundreds of thousands of documents, they will need help from an automated text categorization system. Automated text categorization is the process of separating texts into predetermined categories by using software. We use several machine learning methods and ideas to fully automate the categorization of text. There are two major categories of machine learning methods that may be used for this purpose:
• Supervised learning
• Unsupervised learning
Other families of learning algorithms exist as well, such as reinforcement learning and
semi-supervised learning. From a machine learning and text document classification point
of view, we’ll examine supervised and unsupervised learning techniques in more depth.
The term “unsupervised learning” is used to describe a subset of machine learning meth-
ods and algorithms that may construct a model without the need for labelled training data.
Instead of focusing on predictive analytics, this method emphasizes pattern mining and the
discovery of hidden substructures in the data. Depending on the nature of the problem at
hand, the data points we work with tend to be either textual or numerical in nature. By
performing feature engineering on each data point, we can feed that feature set into our
algorithm to discover hidden patterns within the data, such as clustering similar data points
together or summarizing documents using topic models.
The term “supervised learning” is used to describe a subset of machine learning meth-
ods and algorithms that are used to make predictions based on examples of data that have
already been categorized and labelled. Each data point has its own feature set and accom-
panying class/label that were retrieved via feature engineering. Using the training data, the
algorithm discovers distinct patterns for each class. The result of this task is a trained
model. Once we input the attributes of new test data samples into the model, we can use it
to forecast the class of those samples. As a result, the computer can learn to find the class
of the unknown dataset based on the training data. The following are descriptions of the
two primary categories of supervised learning algorithms, i.e., classification and regression.
When the class labels are already available, and the result variable is itself a categorical
variable, supervised learning techniques are referred to as classification. For example,
movies and news may be categorized into subcategories. Regression algorithms, on the
other hand, are used in supervised learning techniques when the class variable is a continu-
ous measure. Examples of such predictions include real estate prices and weather reports.
In this chapter, we will deal with categorization by using classification strategies.
We will implement the classification of the textual data in different ways. Supervised
learning algorithms based on various classification models will be used in these imple-
mentations. Let us start with the definition of machine learning-based text classification.
Suppose we have a training dataset that is already labelled. To describe this, we may write TS = {(d1, c1), (d2, c2), …, (dn, cn)}, where d1, d2, …, dn is a list of text documents and c1, c2, …, cn are their corresponding labels. Each document may be assigned to one of the classes in the set of all possible classes, indicated by C. Note that cx is the class label for document dx.
Once the training dataset is available, we can develop a supervised machine learning
algorithm F such that F(TS) = γ that is trained on the training dataset TS. So, given an input
set of (document, class) pairs TS, the supervised learning algorithm F produces a trained
classifier γ that serves as our model. The term “training” describes this stage. This model
may then be used to make predictions about the classification of unseen documents, such
that cND ∈ c. The symbolic representation of this operation, which we call the prediction
process, looks like:
γ : TD → cND
As a result, we may deduce that the process for supervised text categorization consists
of two phases:
• Training
• Prediction
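To make the two phases concrete, the following is a minimal sketch using scikit-learn; the tiny labelled corpus and the choice of a TF-IDF representation with a Naive Bayes classifier are purely illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training phase: a labelled set TS of (document, class) pairs.
train_docs = [
    "the election results were announced by the party",
    "the government passed a new budget",
    "the team won the cricket match",
    "the batsman scored a century in the final",
]
train_labels = ["politics", "politics", "sports", "sports"]

# F(TS) = gamma: fit the feature extractor and the classifier together.
classifier = make_pipeline(TfidfVectorizer(stop_words="english"),
                           MultinomialNB())
classifier.fit(train_docs, train_labels)

# Prediction phase: gamma(TD) = c_ND for an unseen document.
test_docs = ["the party announced the election schedule"]
print(classifier.predict(test_docs))   # likely: ['politics']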
In the majority of cases, the automated classification of text starts by assigning labels
to objects manually. This manual annotation of training data is mandatory for supervised
text classification. Once trained, a classifier may be used to anticipate and recognize new
documents with little assistance from humans. We will talk about learning strategies and
techniques next. These approaches may be used with any form of data after performing the
preprocessing tasks, e.g., data cleansing, feature extraction, etc.
We can use any of the supervised machine learning algorithms. We can also adopt the
hybrid approach where more than one algorithm can be used in combination with each
other. The model is refined using the features extracted from the training and testing data. This helps ensure that the model generalizes to fresh, unseen data rather than merely memorizing the training data. Multiple hyperparameters may be present in the model depending on the
training dataset. Note that the presence of the labelled data is important. In the situation
where the labelled data is not available, we can use some automated technique as well, for
example, we can use clustering to label the documents that belong to same clusters.
After training the model by using the training data, the model is then optimized for
accuracy and performance. This guarantees that the model does not overfit the training
data. The model is continuously improved by cross-validation. By using a random distri-
bution, this approach splits the training dataset into two sets: the training set and the vali-
dation set. The model’s performance is then assessed using a variety of criteria, including
how well it predicted the validation set.
If we talk about the types of text classification with respect to the number of classes, it
can be categorized into three types, i.e., binary classification, multi-class classification,
and multi-label classification. Figure 5.2 shows these types.
In binary classification, documents are placed into one of two groups based on their
content. The objective is to categorize documents into two groups according to their con-
tent or features. Spam detection, sentiment analysis (wherein movie reviews are either
positively or negatively categorized), and fraud detection (wherein financial transactions
are either fraudulent or not fraudulent) are all examples of common binary classification
problems. A labelled dataset including text documents and their associated binary class
labels is needed for binary classification.
Logistic regression, SVM, decision trees, random forests, and neural networks are
some of the machine learning methods that can be used for binary classification. Accuracy,
precision, recall, and F1 Score are a few of the measures used to evaluate binary classification models. Accuracy measures the proportion of predictions that are correct.
When more than two classes or categories need to be assigned to a set of text documents, the type of classification is called multi-class classification. The objective is to categorize each document into the best possible category. Examples of this type of document classification include dividing news articles into categories like politics, sports, and entertainment. Multi-class classification relies on a labelled dataset consisting of text documents and their associated class labels, much as binary classification does. Logistic regression, SVMs, decision trees, random forests, and neural networks are only a few of the popular machine learning techniques used for multi-class categorization.
Fig. 5.2 Types of text classification: binary, multi-class, and multi-label
In multi-label categorization, many class labels may be applied to a single text docu-
ment at the same time. Multiple class labels may be appropriate in many scenarios for a
given document; therefore, it’s not always possible to apply a single label to each docu-
ment. Applications that make use of this kind of categorization include document tagging
(wherein numerous tags are assigned to a single document), topic tagging, and content
classification. For multi-label classification, we have a labelled dataset that includes text
articles and many binary indications representing whether each class label is present.
Some examples of multi-label classification methods developed using machine learning
include binary relevance, classifier chains, and label power set. These algorithms can pre-
dict meaningful class labels.
In terms of their comparison, binary text classification offers simplicity and ease of
implementation. It’s straightforward to train models for binary classification, and it’s com-
putationally efficient. However, it may not fulfil the requirements in case of more complex
situations, where multiple categories or labels are needed.
Multi-class text classification allows for the categorization of text into more than two
classes. It’s more versatile than binary classification as it accommodates a broader range
of possibilities. However, it can be more challenging to train and requires larger datasets,
and class imbalances may create challenges.
Multi-label text classification is even more flexible, allowing a single text document to
belong to multiple categories or labels. It’s suitable for complex scenarios but can be com-
putationally intensive, and labelling large datasets with multiple labels for each text can be
time-consuming. Additionally, it may require more complex evaluation metrics.
In summary, binary classification is simple but may lack complexity; multi-class is
more versatile but requires more data and may suffer from imbalances, while multi-label
offers the most flexibility but demands more computational resources and extensive label-
ling efforts. The choice depends on the specific needs and complexities of the classifica-
tion task at hand.
After understanding the basic concept of automatic text categorization, we can now dis-
cuss the steps necessary to create such a system from scratch. There is a certain sequence
of tasks that must be performed throughout the learning and evaluation stages. The first
step in developing a text categorization system is to locate and obtain the necessary train-
ing data. Suppose we have our dataset downloaded. The following stages explain a typical process for a text categorization system:
• Data retrieval
• Data preprocessing and normalization
• Feature extraction and engineering
• Model training
• Model evaluation and tuning
• Model deployment
When constructing a text classifier, these are the basic steps. The training and predic-
tion phases of a text categorization system are shown in detail in Fig. 5.3.
The two basic steps in developing a text classifier are represented by the two boxes labelled training and prediction in Fig. 5.3. When we have a dataset, we often separate it into three parts: training, validation (optional), and testing. In Fig. 5.3, you can see that the "Text Normalization" and "Feature Extraction" modules are the common ones (in case of
training and testing). This means that we need to apply the same set of changes during
training and prediction to each document we wish to categorize or forecast. First, docu-
ments are preprocessed and normalized before extracting their features. We guarantee the
reliability of our classification model’s predictions by maintaining consistency of features
throughout the “training” and “prediction” phases.
This is because traditional machine learning algorithms cannot process raw unstruc-
tured input such as text. Next, we provide the features to our model for training purpose.
The documents’ feature vectors and their labels are fed into the model during training so
that the algorithm can learn the distinct patterns that are associated with each class or cat-
egory. To ensure the classification method generalizes effectively with the data during
training, it is common practice to employ an additional validation dataset for evaluation.
The final outcome of the “training” process is a classification model that has already learnt
the patterns from the training data. To enhance the performance and accuracy, we may
tune the hyperparameters of the model.
The term "prediction" refers to the act of forecasting classes for new or testing documents. Normalization, feature extraction, and engineering are all applied to the tex-
tual data contained in these test documents. Next, the feature vectors from the test docu-
ments are fed into the “classification model”, which uses its knowledge of prior examples
to make predictions about the documents’ likely categorization.
Next, we need to evaluate our model. Accuracy, precision, recall, and F1 Score are just
a few of the measures that may be used to assess a model’s predictive abilities when com-
pared against manually labelled documents with known true class labels. This would show
you how accurate the model is when making predictions about unknown documents.
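A minimal sketch of computing these measures with scikit-learn, using hypothetical true and predicted labels:

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Hypothetical true labels and model predictions for six test documents.
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham",  "ham", "ham", "spam", "spam"]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, pos_label="spam"))
print("Recall   :", recall_score(y_true, y_pred, pos_label="spam"))
print("F1 Score :", f1_score(y_true, y_pred, pos_label="spam"))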
The last stage, deployment, includes preserving the model and any required dependen-
cies and releasing the model to the world as a service, API, or active application. If used
as a Web service, it can forecast categories for new documents in a batch process or in any
other form specified by the user. Determining how you’d want to access your machine
learning models once they’ve been deployed is a key consideration.
Data extraction from a database or other data storage system is known as data retrieval. It
is an essential process in many applications, including database queries, file searches, and
online data retrieval. Sequential retrieval and random retrieval are the two basic methods
for retrieving data. The process of obtaining data in a consecutive sequence is known as
sequential retrieval. Although this is the most straightforward method of data retrieval, it
may not be effective if the data is not kept in a sorted form.
The technique of obtaining data using its unique identification is called “random
retrieval.” Although it involves sorting the data before storing it, this method of data
retrieval is more effective. The performance of data retrieval may be enhanced in a variety of ways. These methods include the following:
• Indexing
• Caching
• Data partitioning
The application determines the best data-retrieving method. Sequential retrieval, for
instance, can be a wise decision for applications where the data is changed infrequently,
but random retrieval might be a wise one for applications where the data is constantly
updated. The following are a few advantages of data retrieval:
• Efficiency: By decreasing the time required to obtain data, data retrieval may aid in
increasing the efficiency of applications.
• Correctness: By ensuring that the right data is being retrieved, data retrieval may assist
in increasing the correctness of applications.
• Reliability: By ensuring that the data is accessible when required, data retrieval may
assist in increasing the dependability of programs.
The following are some difficulties with data retrieval.
• Complexity: It might be difficult to retrieve large volumes of data.
• Cost: It may be costly to retrieve large volumes of data.
• Security: Data retrieval may expose the data to unauthorized users, which creates a
security concern. In many applications, data retrieval is an essential activity.
Applications’ effectiveness, precision, and dependability may be enhanced by it. Before
incorporating data retrieval into a program, it’s crucial to consider its difficulties.
The following Python code checks the data retrieval status and returns a Boolean value
indicating success or failure.
def check_data_retrieval(data):
    """
    Checks the data retrieval status and returns a Boolean
    value indicating success or failure.

    Args:
        data: The retrieved data

    Returns:
        A boolean value indicating the success of data retrieval
    """
    if data is not None:
        # Check if the data is valid or empty
        if len(data) > 0:
            # Data retrieval is successful
            return True
        else:
            print("Data retrieved is empty.")
    else:
        print("Data retrieval failed.")
    # Data retrieval failed
    return False
The success of the data retrieval is determined in this function by inspecting the data
parameter. If the data object is not “None”, then we verify that it includes at least some
data by ensuring that its length is larger than zero. If the information was successfully
retrieved, we will return True. If the retrieval fails, a suitable error message is shown, and
False is returned. To see whether your data has been successfully retrieved, utilize this
method in your code. Here’s one way to invoke this method.
The retrieved data is assumed to be sent as input in the implementation of the check_
data_retrieval function. Depending on the details of your application’s data retrieval tech-
nique, you may need to adjust this function.
The following is an example of the results from the check_data_retrieval function.
retrieved_data = []
if check_data_retrieval(retrieved_data):
print("Data retrieval successful!")
# Additional processing logic here
else:
print("Data retrieval failed. Please try again.")
Output:
Data retrieved is empty.
Data retrieval failed. Please try again.
Since there is nothing in the retrieved_data list, the method returns false, and the mes-
sage “Data retrieved is empty.” and “Data retrieval failed.” are shown.
There are many standard procedures available for retrieving data, for example, in case
of a database query, the information to be retrieved is specified by the user or application.
A query is a series of instructions that instruct the DBMS what information to return. The
database management system (DBMS) then looks for the requested information. The dis-
covered information is then sent back to the requesting user or program.
If we talk about the types of data retrieval, selective retrieval and complete retrieval are
the two most common approaches to retrieving data. In selective retrieval, just a portion of
the whole database is requested. You may achieve this by restricting your query to just the
information you need. Simple examples may be the retrieval of all consumers from a cer-
tain city or all orders made in a certain month. To do a full retrieval, one must request all
the information stored in the database. This is a common practice for data protection and
restoration, as well as report-making. A sample example may be the following Structured
Query Language (SQL) query that retrieves all the information in a “Students” table.
Select * from Students
The abovementioned query is an example of SQL. SQL is the most used tool for this
purpose. Most Database Management Systems (DBMSs) will allow you to use the SQL
query language. The following are some other data-retrieving methods:
• NoSQL databases: When it comes to storing and retrieving large volumes of unstruc-
tured data, NoSQL databases are the way to go. Although NoSQL databases don’t sup-
port SQL, they do provide many other ways to retrieve data.
• Application programming interfaces (APIs): Data from databases, online services, and
cloud storage systems are just some of the places where APIs may be used.
• Data mining techniques: Patterns and trends in huge datasets may be mined using data
mining methods. After seeing a pattern or trend, you may utilize it to get the informa-
tion you need.
On the surface, the process of data retrieval seems quite simple; however, it can be very challenging. When information is kept in many different places, it is said to be "data frag-
mentation.” Because of this, it may be challenging to get all the relevant information in one
attempt. The second is the security of sensitive information. When retrieving information
from a database, data security is very important. Your information must be secure from
unauthorized access. Scalability of data is the third challenge where your data retrieval
strategies must be able to scale with the size of your database. The fourth challenge of data
retrieval is a precise representation of the data. The accuracy of the information retrieved
is of the utmost importance. This is of great significance for any data that will be utilized
in making decisions or producing reports.
It is possible that the methods and languages used to get and query data from different
sources are different. Defining search criteria and specifying the information to be retrieved
are done using a query language. Query languages range from the general-purpose
Structured Query Language to the more specialized query languages for systems, such as
MongoDB’s query language. The parameters or filters that are used during the data
retrieval process are known as “search criteria.” Keywords, data attribute criteria (such as
“age > 30”), date/time ranges, and geographical limits are all examples of this. The syntax
of the query language specifies the criteria to be used in the search.
To enhance the performance of the data retrieval system, different techniques are used.
One of these techniques is indexing. It is an important method for improving data retrieval
efficiency. To facilitate quicker searching and retrieval based on certain traits or keys, data
structures (indexes) are created. By eliminating the need to scan the full dataset, indexing
may dramatically increase the speed at which queries are processed.
Now, let us discuss about the models available for data retrieval. Various retrieval mod-
els are available for usage, each of which can be used in a certain scenario. Vector space
models (which represent documents and queries as vectors in a high-dimensional space)
and probabilistic retrieval models (which are based on probability theory and ranking
algorithms) are common examples of retrieval models. Data retrieval sometimes entails
sorting the results obtained in order of how relevant they are to the user’s query.
Here are three commonly used methods for information search:
• Keyword search: The most common and basic method is to search using keywords. To
search the information, the user will have to enter the keywords. The data will be
searched and retrieved using these keywords. While effective, keyword search may fail
to fully capture the intended semantic meaning of the query.
• Full-text search: Whole-text search or full-text search goes beyond keyword search by
considering the document’s whole content, not just the keywords inside it. This includes
the document’s text, metadata, and other properties. It allows you to look up terms, find
synonyms, use fuzzy matching, and sort results by relevance ratings. Search engines
and content-based apps all benefit greatly from full-text search.
• Content-based search: The goal of content-based retrieval methods, as opposed to relying on explicit metadata or keywords, is to locate related material based on the content itself. Accurate and efficient retrieval is achieved by systems using state-of-the-art approaches including Web crawling, indexing, relevance rating, and user profiling.
To improve the efficiency of the data retrieval systems, we may use query optimization
methods. The following are some of the techniques that can be used to reduce the time and
resources required to access the data:
• Query rewriting
• Caching
• Parallel processing
• Database indexes
Data partitioning, replication, and distributed indexing are only a few of the methods
used in distributed systems with data stored on numerous nodes or servers for fast distrib-
uted data retrieval. The capacity to scale, tolerate failure, and balance retrieval requests is
built into the distributed data retrieval system. The ability to easily access and extract
required information from data is essential to any data-centric project. Methods, query
languages, and models are used to obtain data from diverse sources according to predeter-
mined criteria set by the user. Data-driven decision-making, information discovery, and
knowledge extraction are all made possible by efficient data retrieval.
So, we can say that the ability to retrieve data is crucial in a large number of applica-
tions. It has applications in many contexts, including reporting, decision-making, and data
analysis. Numerous techniques exist, each with its own set of benefits and drawbacks, for
retrieving data. The optimal data retrieval strategy should be selected according to the
requirements of the application.
The data analysis process begins with the preparation of raw data. Data preparation consists of preparing raw data for analysis by cleaning and converting it. It is a multi-stage procedure that involves operations such as cleaning, integrating, reducing, and transforming the raw data. Data normalization is a preprocessing method for bringing numerical data into a consistent form and scale, so that all characteristics are given an equal chance to contribute to the analysis. Min-max scaling, z-score normalization, and log transformation are only a few examples of available normalizing methods.
Preprocessing data means cleaning, manipulating, and structuring data for analysis. Because
it may boost the effectiveness of machine learning models, it is a crucial part of any data
science endeavor. So, what exactly is normalization? Scaling the values of features in a
dataset to a common range is an example of normalization, a sort of data preparation.
Machine learning algorithms that are particularly vulnerable to changes in data size
may benefit from this. There are a variety of reasons why preprocessing and normalizing
data are crucial.
• To fine-tune the precision of ML models: Models for machine learning are often trained
using massive amounts of data. However, this information is not always dependable
and may result in flawed models. Noise and inconsistencies in the data may be reduced
by data pretreatment and normalization, hence enhancing the precision of machine
learning models.
• To boost the efficiency of ML models: It’s possible for machine learning models to be
sensitive to data size. For example, if you train your model on a large variety of values,
it may not be as effective as one trained on a more limited set of values. Scaling the
values of the data to a common range via data preparation and normalization may aid
in improving the performance of machine learning models. For better data
comparability and consistency, data normalization and preprocessing may also aid in
achieving this goal. Data analysis and visualization are two areas where this might
be useful.
• To prepare the data in different ways: Different ways of preparing data are available, including the removal or imputation of missing values to complete the dataset, and finding and eliminating data outliers. Noise and irregularities may be eliminated using the data cleaning process, while transforming data reworks it so that it can be analyzed effectively. In data normalization, values are scaled to a standard range, and several distinct normalizing procedures are available. For example, data are min-max normalized when their values are scaled to fall between 0 and 1, and z-score normalized using the mean (μ) and standard deviation (σ) of the dataset. The formula to calculate the Z-score for a data point X is as follows:
Z = (X − μ) / σ
• Taking the logarithm of each number is what’s meant by “log normalization.” In con-
clusion, the first stage of every data science project should be to preprocess and normal-
ize the data. It is possible to enhance the effectiveness of machine learning models by
proper data preparation. The most effective method of data preparation and normaliza-
tion depends on the nature of the data at hand and the goals of the project.
Data preparation is an essential procedure for every data analysis or machine learning
project. Preparing data for analysis and modeling comprises changing raw data into a for-
mat that is acceptable and intelligible. The purpose of data preparation is to clean up the
data, make it more amenable to the selected algorithms, and increase its overall quality.
The following are examples of common preparation procedures for data:
• Scrubbing the data: Missing data must be filled in, outliers must be removed, and mis-
takes must be fixed at this stage. Depending on the situation at hand and the quantity of
missing data, it may be necessary to eliminate missing values. There are several meth-
ods for identifying and dealing with outliers or numbers that dramatically depart from
the norm of the dataset.
• Transforming the distributions: To make the data distribution better suited to the selected modeling methodologies, data transformation techniques are used. Among the most often used transformations are the log, power, and Box-Cox transformations. These transformations can be useful for dealing with skewed data distributions and bringing them closer to a normal distribution. To prevent one feature from being overly weighted in the modeling process, feature scaling standardizes the ranges of all features. Standardization (mean normalization) and normalization (min-max scaling) are two of the most frequent types of scaling.
Distance-based algorithms, which are very sensitive to the scale of the input characteristics, highlight the importance of scaling. The process of normalization scales feature values such that they all fall within a specified range, typically between 0 and 1. In other words, it is a kind of min-max scaling.
The range of the feature's values is determined by its minimum and maximum values. The following formula is then applied to the feature values to effect the transformation:
X_norm = (X − X_min) / (X_max − X_min)
Here, X represents the baseline feature value, X_min represents the minimum feature
value, and X_max represents the maximum feature value. As a result, all the feature values
will be scaled to the range from 0 to 1 after normalization. A feature’s value is converted
to 0 if it is less than or equal to the minimum value and to 1 if it is greater than or equal to
the maximum value. In proportion to their original values, the remaining values will fall
between 0 and 1. There are several situations where normalization might be useful.
Note that normalization should be performed independently on both the training data
and the test data. To ensure uniformity between the training and test data, the scaling
parameters (minimum and maximum values) should be calculated using the former. In
general, it is impossible to do effective analysis or modeling without first preprocessing
and normalizing the data. These procedures aid in enhancing the quality and dependability
of the data, which in turn produces more precise and insightful results from a variety of
machine learning tasks, by fixing problems like missing values, outliers, and scaling
difficulties.
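A minimal sketch of min-max scaling with scikit-learn, fitting the scaler on the training data only and reusing its parameters for the test data, as recommended above (the numbers are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative training and test data with two numerical features.
X_train = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]])
X_test = np.array([[15.0, 500.0]])

scaler = MinMaxScaler()
# Learn X_min and X_max from the training data only...
X_train_scaled = scaler.fit_transform(X_train)
# ...and reuse the same parameters to transform the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
print(X_test_scaled)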
The following Python function shows an example of preprocessing the data:
import numpy as np

def check_data_preprocessing_normalization(data):
    """
    Checks the data preprocessing and normalization status
    and returns a boolean indicating success or failure.

    Args:
        data: The preprocessed and normalized data

    Returns:
        A boolean value indicating the success of data
        preprocessing and normalization
    """
    # Check if the data is not None
    if data is not None:
        # Check if the data has more than 0 rows and columns
        if data.shape[0] > 0 and data.shape[1] > 0:
            # Check if the data is within the range [0, 1]
            if np.min(data) >= 0 and np.max(data) <= 1:
                # Data preprocessing and normalization successful
                return True
            else:
                print("Data is not within the range [0, 1].")
        else:
            print("Data has empty dimensions.")
    else:
        print("Data preprocessing and normalization failed.")
    # Data preprocessing and normalization failed
    return False
Here, it validates the data parameter to see whether our data pretreatment and normal-
ization efforts were fruitful. We begin by making sure the row and column counts are
greater than zero and the data is not “None”. The next step is to ensure that the data has
been correctly normalized by checking that the lowest value is higher than or equal to 0
and the highest value is less than or equal to 1. This code snippet may be used to verify whether your data has undergone the necessary preprocessing and normalization steps. One way to invoke this function is shown a little further below.
Keep in mind that the preprocessed and normalized data is assumed in the construction
of the check_data_preprocessing_normalization function. Depending on the data pretreat-
ment and normalization methods used in your application, you may need to adjust this
function.
Following is an example of the output that we may expect when using the check_data_
preprocessing_normalization function:
Data preprocessing and normalization successful!
The data in this case has been preprocessed and normalized, and it can be found in the
preprocessed_normalized_data array. Since this data is not “None”, has dimensions larger
than zero, and contains values that are all within the range [0, 1], the check_data_prepro-
cessing_normalization function returns “True” when given this information. That’s why
you see “Data preprocessing and normalization successful!” The result would vary if the
preprocessed_normalized_data array was null, included values outside the range [0, 1], or
was None.
preprocessed_normalized_data = np.array([])

if check_data_preprocessing_normalization(preprocessed_normalized_data):
    print("Data preprocessing and normalization successful!")
    # Additional processing logic here
else:
    print("Data preprocessing and normalization failed. "
          "Please check your preprocessing steps.")
    # Handle the failure case here
Output
Data has empty dimensions.
Data preprocessing and normalization failed. Please check your preprocessing steps.
The function returns False, and the messages "Data has empty dimensions." and "Data preprocessing and normalization failed. Please check your preprocessing steps." are shown, since the preprocessed_normalized_data array is empty. Changing the inputs to the function allows you to simulate various conditions and evaluate their effects on the result. Beyond such checks, the level of preprocessing applied to the text itself must also be considered. This requires text components like sentences, phrases, and words to be cleaned, preprocessed, and normalized to a consistent format.
This allows for uniformity in our document corpus, which is crucial for the develop-
ment of relevant features and the elimination of the noise brought by any source like
invalid input, incorrect values, etc.
When assessing sentiment, our text often contains irrelevant elements like HTML tags. As a result, we need to make sure they are eliminated before any features are extracted. The BeautifulSoup library does a great job of meeting this need, and you can develop your own customized function to perform this task, as sketched below.
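A minimal sketch of such a custom function, assuming the BeautifulSoup library is installed, might look as follows:

from bs4 import BeautifulSoup

def strip_html_tags(text):
    """Remove HTML tags and return only the visible text."""
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text(separator=" ").strip()

# The tags are stripped and only the review text remains.
print(strip_html_tags("<p>This movie was <b>great</b>!</p>"))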
Next, we perform the removal of accented characters. Our dataset contains English-language reviews; thus, non-ASCII characters, such as those used for accents, must be transformed and standardized into ASCII. As an example, é should be changed to e. Again, a custom-developed function, such as the one sketched below, can be of great help.
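One possible sketch of such a function, using Python's built-in unicodedata module, is shown here:

import unicodedata

def remove_accented_chars(text):
    """Convert accented characters to their closest ASCII equivalents."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("utf-8")

print(remove_accented_chars("Sómě Áccěntěd těxt with café"))
# -> "Some Accented text with cafe"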
We also need to deal with contractions and abbreviations. A contraction is a shortened form of a word or phrase in the English language, derived by eliminating individual letters or phonemes; vowels are often dropped when words are shortened. The contractions of "do not" (don't) and "I would" (I'd) are two good examples. Contractions complicate the process of text normalization since we must deal with punctuation marks like apostrophes and expand each contraction to its full form, as in the sketch below.
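A minimal sketch of a contraction expander follows; the small CONTRACTION_MAP dictionary is a hypothetical stand-in for a fuller mapping (or for a dedicated contraction-handling package):

import re

# A small hypothetical mapping; a much larger one would be used in practice.
CONTRACTION_MAP = {"don't": "do not", "i'd": "i would", "can't": "cannot",
                   "it's": "it is", "won't": "will not"}

def expand_contractions(text, contraction_map=CONTRACTION_MAP):
    """Replace each known contraction with its expanded form."""
    pattern = re.compile("|".join(map(re.escape, contraction_map)),
                         flags=re.IGNORECASE)
    return pattern.sub(lambda m: contraction_map[m.group(0).lower()], text)

print(expand_contractions("i'd say you don't need it"))
# -> "i would say you do not need it"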
When cleaning and normalizing text, it is also vital to get rid of the excess noise that some characters and symbols can introduce. To do this, you can utilize a few basic regexes in a custom function that does just that, as sketched below. Alternatively, you can store the information related to noise in a corpus and, at runtime, have the function check the input text against that corpus to remove the noise.
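As a hedged sketch, a simple regex-based cleaner along the following lines can remove such special characters:

import re

def remove_special_characters(text, remove_digits=False):
    """Strip punctuation and symbols, keeping letters (and optionally digits)."""
    pattern = r"[^a-zA-Z0-9\s]" if not remove_digits else r"[^a-zA-Z\s]"
    return re.sub(pattern, "", text)

print(remove_special_characters("Well this was fun! What do you think? 123#@!"))
# -> "Well this was fun What do you think 123"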
In most cases, a word's stem is the base form to which a variety of affixes, such as prefixes and suffixes, may be added to generate new words. Stemming is the process of reducing a word back to this root form. As a basic illustration, consider the words "watches", "watching", and "watched", all of which have "watch" as their root stem. As far as tool support is concerned, in addition to the PorterStemmer and LancasterStemmer, the NLTK package provides a plethora of other stemmers. Lemmatization is similar to stemming, in that affixes are eliminated to reveal the word's base form; however, here the base form is the root word, or lemma. The difference is that the root stem may not be a valid word, whereas the root word is always a correct dictionary term. To preserve lexically accurate terms, we employ lemmatization exclusively in our normalization process. Both operations are illustrated in the sketch below.
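Both operations can be sketched with NLTK as follows (the WordNet corpus is assumed to be downloadable in your environment):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # required for lemmatization

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["watches", "watching", "watched"]
print([stemmer.stem(w) for w in words])                   # stems: 'watch'
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # lemmas: 'watch'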
Finally, we remove stop words. These are words that are usually ignored when generating significant features from text because of their lack of relevance; they are typically the words that appear most often in a document corpus. Stop words include "a", "an", "the", and so on. While there is no standard list of stop words, we make use of NLTK's standard list of English-language stop words, available through its default "stopwords" corpus, and where necessary you may add your own domain-specific stop words. A sketch follows.
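A minimal sketch of stop word removal with NLTK's default corpus might look like this:

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # the corpus must be available locally
stop_words = set(stopwords.words("english"))

def remove_stopwords(text):
    """Drop common English stop words from the text."""
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

print(remove_stopwords("The stopwords are removed from the text"))
# -> "stopwords removed text"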
Following is an example of a Python function that can retrieve the context around a keyword in a corpus of text. The function converts the words to lowercase as part of the preprocessing task.
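Since the body of the function is not reproduced here, the following is only a minimal sketch of what get_context could look like; it assumes simple whitespace tokenization, so its exact output may differ slightly from the sample output shown further below.

def get_context(corpus, keyword, window_size=2):
    """Return the words surrounding each occurrence of `keyword` in `corpus`."""
    words = corpus.lower().split()
    keyword = keyword.lower()
    contexts = []
    for i, word in enumerate(words):
        # Strip trailing punctuation before comparing with the keyword
        if word.strip(".,!?") == keyword:
            start = max(0, i - window_size)
            end = min(len(words), i + window_size + 1)
            contexts.append(" ".join(words[start:end]))
    return "\n".join(contexts)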
This function may be used by providing a text collection, a desired keyword, and an
optional window size to determine how many words before and after the keyword should
be considered. The keyword’s surrounding context will be returned by the function.
The following code shows the example usage:
corpus = "The quick brown fox jumps over the lazy dog. The
dog is very lazy. The fox is quick and brown."
keyword = "lazy"
context = get_context(corpus, keyword, window_size=3)
print(context)
Output
the lazy dog. The
is very lazy. The
The method returns the surrounding context around each occurrence of the term “lazy”
inside the supplied window size of three words on each side.
Following is an example of a function that checks for empty or null documents after
preprocessing.
def check_empty_documents(corpus):
    """
    Checks for empty or null documents in a corpus after
    preprocessing.

    Args:
        corpus (list): A list of documents.

    Returns:
        bool: True if there are empty or null documents,
        False otherwise.
    """
    # Iterate over the documents in the corpus
    for document in corpus:
        # Check if the document is empty or null after preprocessing
        if not document or not document.strip():
            return True
    # No empty or null documents were found
    return False
We may put this function to use by feeding it a list of documents that make up your
corpus. After the documents have been preprocessed, the function will loop over them to
see whether any of them are empty or null.
Here is how we can use it.
corpus = [
"This is a sample document.",
"",
"Another document.",
None,
" ",
"Yet another document."]
has_empty_documents = check_empty_documents(corpus)
print(has_empty_documents)
Output
True
After analyzing the corpus, the function in this example finds the null or empty documents and returns True. If, after preprocessing, the corpus contained no empty or null documents, the method would instead return False.
5.6 Training and Test Datasets
As the name implies, a training dataset is a collection of data in machine learning that is used to train a model. The model acquires information from the training dataset and applies it when making predictions on new data. After a machine learning model has been trained, it may be tested on a separate collection of data, known as a test dataset, to see how well it performs; to generate predictions on the test dataset, the model must use the learning obtained during training. Why are training and test datasets useful? Because measuring the effectiveness of a machine learning model requires both.
The model's accuracy can seem higher than it really is if we evaluate it solely on the training dataset. This is because the model is likely to produce correct predictions on data it has already seen during training. However, the model is of little utility if it cannot reliably predict the outcome for additional, previously unseen data. Creating test and training sets may be done in several different ways.
Splitting the original dataset in half so that half may be used for training and the other half
for testing is a popular strategy. The easiest approach to do this is to randomly divide the
dataset in half, with about the same number of observations in each half. It’s worth noting,
however, that this strategy isn’t always the best option. When a dataset is randomly divided
into two, one component may include a disproportionately large number of data points
from one class if the original dataset did not have an equal distribution across classes.
K-fold cross-validation is an alternative method for creating training and test datasets.
In K-fold cross-validation, the original dataset is divided into subsets called “folds”, with
“k” commonly ranging from 5 to 10. After that, k−1 folds of the data are used to train the
model, and the final fold is used to test it. Each fold is used as the test set once, and this
procedure is repeated k times. Then, the overall accuracy of the model is estimated as the
mean accuracy of the k tests.
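As a hedged illustration of both strategies, scikit-learn provides train_test_split for the simple random split and KFold for k-fold cross-validation; the tiny arrays below are hypothetical.

import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Hypothetical dataset: ten feature vectors and their labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Random 50/50 split, matching the halving strategy described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=42)

# 5-fold cross-validation: each fold is used exactly once as the test set.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")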
Now, we will discuss some of the guidelines for creating quality tests and training sets:
• Verify that the datasets used to train and evaluate the model are accurate representations
of the data utilized in the actual world.
• The dataset may be randomly divided into train and test sets, or the cross-validation
method can be used.
• Until the model has been trained and assessed, the training and testing datasets should
be kept apart.
• Assess the model's effectiveness using appropriate measurement tools and techniques.
In conclusion, the machine learning method relies heavily on the creation of training
and test datasets. By adhering to the guidelines, you can make sure that the data you use
to train and test your model is an accurate representation of the actual world. This will
allow for a more precise assessment of the model’s performance and verification that it is
ready for deployment.
Machine learning and data analysis rely heavily on the creation of training and test
datasets. The process begins with separating the available information into a training data-
set and a test dataset.
The following are some of the qualities of training and test datasets that are necessary
for dependable model construction and assessment:
• Representativeness: The data in both sets should represent all the possible values. All
applicable groups or categories should be represented, and their relative number should
reflect that of the actual world.
• Independence: Separate datasets should be used for training and evaluation. To provide
a fair assessment of the model’s performance, the test dataset should not be used during
training.
• Generalization: The data used in the test should be representative of what the model
would see in the actual world. It must include data that is representative of the data the
model will use to interpret.
• Unseen Examples: None of the examples used to train the model should appear in the
test dataset. This makes sure the model’s performance is judged on how well it applies
to data it has never seen before.
There are a few methods that are often used to generate datasets for training and testing:
• Holdout method: During the holdout procedure, information is arbitrarily split into a
training set and a test set. In most cases, about 70–80% of the data is utilized for train-
ing, while the rest is saved for testing.
• Cross-validation: It is common practice in cross-validation to divide the data into many
groups, or “folds”. The process includes repeated rounds of model training and testing
on new fold configurations. K-fold cross-validation and stratified k-fold cross-valida-
tion are two popular approaches to the cross-validation process.
• Time-based split: It is usual practice when dealing with time-series data to divide it up
at discrete points in time. The training set consists of instances from before the time
point, while the testing set consists of examples from after. Using this method, the
model may be tested using information that represents realistic conditions.
• Stratified sampling: Stratified sampling is used to ensure that each class or category appears at about the same frequency in both the training and testing sets, which is useful when the dataset is unbalanced. It guarantees that every category is fairly represented in both datasets (a sketch follows this list).
• Randomization: To guarantee that the data used for training and testing is representative
and devoid of any inherent order or bias, randomization approaches, such as shuffling
the data randomly before splitting, may be used.
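The holdout and stratified sampling methods above can be sketched together with scikit-learn's train_test_split; the labels below are hypothetical and deliberately unbalanced.

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical unbalanced labels: class 0 is three times as common as class 1.
y = np.array([0, 0, 0, 1] * 25)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(np.bincount(y_train), np.bincount(y_test))  # -> [60 20] [15  5]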
Keep in mind that the size of the dataset, the kind of data, and the nature of the problem
all play a role in determining the best approach to use. In conclusion, creating a training
and testing dataset is an essential part of the model construction and testing process. It
facilitates efficient model training and permits objective evaluation of the model’s effec-
tiveness. It is important to take measures to guarantee that the datasets are representative,
independent, and generalizable and include no previously encountered occurrences.
To create a machine learning system, we must first construct our models using training data and then optimize them in an evaluation using test data. This is why it is necessary to create separate "train" and "test" datasets. Normally, 67% of the data is used for training, while 33% is used for testing.
Following is an example of a function for validating data in training and evaluation sets.
def check_train_test_datasets(train_dataset, test_dataset):
    """
    Validates the training and test datasets.

    Returns:
        str: A summary of the dataset characteristics.

    Raises:
        ValueError: If the datasets are empty or have
        incompatible dimensions.
    """
    if not train_dataset or not test_dataset:
        raise ValueError("The datasets cannot be empty.")
    if len(train_dataset) != len(test_dataset):
        raise ValueError("The train and test datasets have "
                         "incompatible dimensions.")
    num_train_examples = len(train_dataset)
    num_test_examples = len(test_dataset)
    return (f"Training examples: {num_train_examples}, "
            f"test examples: {num_test_examples}")
The train and test datasets may be used with this function as inputs. If the datasets are
missing data or have different dimensions, the program will flag this. A summary of the
properties of the datasets, including the total number of samples in each dataset, will sub-
sequently be returned. Here’s a possible application.
train_dataset = [1, 2, 3, 4, 5]
test_dataset = [6, 7, 8, 9, 10]
summary = check_train_test_datasets(train_dataset,
test_dataset)
print(summary)
Both the training and testing datasets are examined here, and they each include five
samples. After that, a report detailing the key features of the dataset is generated.
Following is an example of a function that determines the sample size of documents for
model training and evaluation.
def get_train_test_split_info(dataset_size, train_ratio):
    """
    Determines the number of documents for model training
    and evaluation.

    Args:
        dataset_size (int): The total number of documents in
        the dataset.
        train_ratio (float): The ratio of documents to be
        used for training (between 0 and 1).

    Returns:
        tuple: A tuple containing the number of training and
        testing documents.

    Raises:
        ValueError: If the train ratio is not within the
        valid range.
    """
    if not (0 <= train_ratio <= 1):
        raise ValueError("Train ratio should be between 0 and 1.")
    num_train_documents = int(dataset_size * train_ratio)
    num_test_documents = dataset_size - num_train_documents
    return num_train_documents, num_test_documents
You may use this function by providing the overall size of your dataset and the training
ratio you want to use (the percentage of documents you want to utilize for training). The
function will determine how many training and testing documents are needed and return
that amount as a tuple. Here’s a possible application:
dataset_size = 1000
train_ratio = 0.8
train_documents, test_documents = get_train_test_split_info(
dataset_size, train_ratio)
print(f"Number of training documents: {train_documents}")
print(f"Number of testing documents: {test_documents}")
5.7 Feature Engineering Techniques

Before discussing the different feature engineering techniques that can be applied to textual data to obtain relevant features, let's quickly review what we mean when we talk about features, why we need them, and how they might be valuable to us.
A dataset will normally have many data points, which will typically be represented by the
rows of the dataset. The columns, on the other hand, will contain a variety of characteristics
or qualities of the dataset, each of which will provide a special characteristic of the data
point. Here we will investigate several feature engineering strategies and the significance
of these techniques in terms of boosting the accuracy and generalization of models. When
we talk about the importance of feature engineering, there are various reasons why feature
engineering is such an important part of the process of developing good machine learning
models. The main reason is performance enhancement. Carefully crafted features have the
potential to recognize significant links and patterns in the data, which may lead to enhanced
performance of the model. Feature engineering works to improve a model’s ability to pre-
dict outcomes by supplying it with information that is important to those outcomes.
The process of feature engineering takes raw data and converts it into a representation that is better suited for machine learning algorithms. Instead of working with raw data formats like text or images, it allows models to deal with meaningful numerical or categorical attributes. Feature engineering approaches also make it possible to extract complicated associations from data, even when such relationships are not immediately obvious.
By doing so, models can learn and generalize more effectively when complex patterns are captured. Feature engineering may also assist in reducing the dimensionality of the data by helping to select or create a subset of important features. This makes the model easier to understand, boosts the efficiency with which it can be computed, and reduces the impact of the curse of dimensionality.
The following are some feature engineering strategies that are extensively used:
• Feature extraction: The process of collecting useful information from pre-existing data
is the focus of this method, known as “feature extraction.” Text tokenization, picture
feature extraction (e.g., via the use of convolutional neural networks), and audio feature
extraction (e.g., using Mel-frequency cepstral coefficients) are all examples of the types
of methods that might fall under this category.
• Encoding features: Feature encoding is a process that converts category variables into
numerical representations that are appropriate for use in machine learning models.
Techniques such as one-hot encoding, ordinal encoding, and target encoding are exam-
ples of common encoding methods.
• Scaling the features: The goal of feature scaling is to standardize the numerical features
to a scale that is comparable to other characteristics. Techniques such as standardiza-
tion, in which the mean value is zero and the standard deviation is 1, and min-max
scaling, in which values are scaled to a defined range, are used to guarantee that fea-
tures have equal magnitudes and to avoid features with greater values from dominating.
• Transformation of features: The process of applying mathematical or statistical modi-
fications to the data is known as feature transformation. The logarithmic transforma-
tion, the square root transformation, and the Box-Cox transformation are some examples
of transformations. These transformations may assist in the linearization of relations or
in bringing the data into conformity with certain assumptions that are used by machine
learning algorithms.
• Creation of features: The process of deriving new features from existing ones is referred to as "feature creation." This might include domain-specific transformations, interaction terms, polynomial features, or simple mathematical operations such as addition and subtraction, as well as other transformations that capture relevant patterns in the data.
• Handling missing data: The process of dealing with missing data is a crucial task of
feature engineering. Value replacement involves replacing missing data with esti-
mated values.
• Feature selection: The goal of the many strategies used for feature selection is to deter-
mine which characteristics are the most important for the model.
• Selecting specific features: During this step, you choose the characteristics that are most significant for a machine learning algorithm. There are a wide variety of methods for selecting features, such as univariate feature selection, recursive feature elimination, and principal component analysis; these are only some of the options.
The nature of the data and of the machine learning model being used determines which feature engineering strategies are the most appropriate. Nevertheless, the following are some broad guidelines to consider when selecting feature engineering techniques. Take into consideration the kinds of data you have: certain methods of feature engineering are better suited to some kinds of data than others. For instance, data transformation procedures are often used for numerical data, while encoding approaches are frequently utilized for categorical data.
It has been shown that some feature engineering strategies are more compatible with
certain machine learning algorithms than others. For instance, approaches for feature
selection are often used when developing linear regression models, while techniques for
data transformation are frequently utilized when developing decision trees. When it comes
to feature engineering, there is no common approach that can be used in all scenarios.
Experimenting with a variety of approaches to solve your issue and seeing which one
yields the greatest results are the most effective strategies to find possible solutions.
So we can say that the process of machine learning begins with feature engineering,
which is an essential aspect of the process. The performance of machine learning algo-
rithms may be improved by the practice of feature engineering techniques. This involves
converting the data into a format that is more suited for it or developing new features that
are more predictive. There are many various sorts of feature engineering strategies that
may be used, and the methodology that you choose to use will be determined by the
machine learning problem that you are attempting to address.
• TF-IDF model: TF-IDF stands for term frequency-inverse document frequency. The metric is essentially the product of two other metrics, i.e., term frequency and inverse document frequency (a short sketch follows below).
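As a short sketch, TF-IDF features can be computed with scikit-learn's TfidfVectorizer (get_feature_names_out is available in recent versions); the three-sentence corpus is hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the sky is blue",
          "the sun is bright",
          "the sun in the sky is bright"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Each row is a document, each column a term, each cell its TF-IDF weight.
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))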
The method of optimizing raw data into characteristics that are more useful and practical
for machine learning models is known as feature engineering. Beyond the fundamental
methods of changing data types and producing derived features are advanced feature engi-
neering approaches. They may consist of methods like:
• Feature selection: This entails choosing a dataset’s most significant attributes. Statistical
techniques, machine learning algorithms, or a mix of both may be used to do this.
• Feature extraction: This entails consideration of already existing features to produce
new ones. Techniques like feature hashing, independent component analysis (ICA), and
principal component analysis (PCA) may be used for this.
• Feature transformation: This entails changing the format of the features. Techniques
like normalization, standardization, and discretization may be used to accomplish this.
• Reduce noise: Data may include noise in a variety of ways including measurement
mistakes and outliers. By eliminating or changing noisy features, advanced feature
engineering approaches may aid in noise reduction.
• Improve feature relevance: The significance of each feature varies depending on the
machine learning model. The most crucial features may be found and prioritized using
advanced feature engineering approaches.
• Enhance feature interaction: A feature’s value may often be increased by considering
how it interacts with other features. Advanced feature engineering techniques may be
helpful in identifying and modeling feature interactions. The effectiveness of machine
learning models may be considerably increased by using sophisticated feature engi-
neering techniques. However, putting ideas into action might be challenging and time-
consuming. Before using complex feature engineering techniques, it is essential to
carefully assess the specific needs of the current problem at hand.
Applying such complex feature engineering models can, however, be challenging in practice, and their cost should be weighed against the problem at hand. Advanced feature engineering models include methods like the GloVe model, the co-occurrence matrix, and so on. Word embeddings, that is, vector representations of words in a high-dimensional space, are learned with the help of GloVe (Global Vectors for Word Representation), a popular unsupervised learning method. It was first presented in 2014 by researchers at Stanford University, and since then it has become a technique that is often used for natural language processing problems. In this section, we will review the fundamental ideas behind the GloVe model as well as its distinguishing features.
Traditional one-hot encoding representations are superseded by distributed representa-
tions, which make it possible for computers to comprehend and reason about words in a
more meaningful manner. The idea of distributed representations, which claims that words
that have similar meanings or contextual use are likely to have similar vector representa-
tions, is the foundation around which GloVe is built.
GloVe wants to be able to capture the semantic links that exist between words in the
form of learned vectors, and it does this by training a model to learn such representations.
The GloVe model presents numerous fundamental ideas that are necessary to properly
learn word embeddings, including the following:
• Objective function: GloVe defines an objective function that takes the word co-occurrence matrix as its input. The objective is to acquire word vectors that can effectively forecast the likelihood of co-occurrence between different words.
• Word vector space: GloVe expresses words by placing them in a space with many
dimensions as vectors. Different semantic and syntactic links between words are repre-
sented by the various dimensions of the vector space. For instance, words that have
similar connotations or applications in context are represented by vectors that are phys-
ically adjacent to one another in this space.
• Word embeddings: The information obtained from the word co-occurrence matrix is included in both the global and local aspects of GloVe's objective function. GloVe develops meaningful word embeddings by repeatedly adjusting the word vectors to minimize the objective function. These embeddings represent the statistical correlations that exist between words.
• Similarity and analogies: After being trained, the word embeddings that GloVe gener-
ates may be used to determine the degree to which two words are semantically related
to one another. Words that are semantically related will have vectors that are located
extremely near to one another in the embedding space. In addition, doing vector
arithmetic operations on the word embeddings might provide fascinating analogies.
For instance, “king − man + woman” may produce a vector that is somewhat near to
“queen”.
In the field of word embeddings, the GloVe model has various benefits; one of the most practical is the availability of pre-trained embeddings. Users are not required to train the GloVe model from scratch to make use of these pre-trained embeddings, which may easily be applied to NLP tasks. These embeddings often come in a variety of sizes and dimensions, giving users the ability to choose the version that is best suited to the tasks they need to perform. A loading sketch is shown below.
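A minimal loading sketch using gensim's downloader API is shown below; the model name is one of the bundles gensim distributes, and the first call downloads the vectors, so an internet connection is assumed.

import gensim.downloader as api

# 100-dimensional pre-trained GloVe vectors trained on Wikipedia + Gigaword.
glove = api.load("glove-wiki-gigaword-100")

# Semantically related words lie close together in the embedding space.
print(glove.most_similar("computer", topn=3))

# Vector arithmetic can surface analogies such as king - man + woman ~ queen.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))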
In conclusion, GloVe is an effective model for learning word embeddings that can cap-
ture the semantic links that exist between words. The model generates representations for
words that are both efficient and easy to understand by making use of global data on the
co-occurrence of terms. GloVe embeddings have shown their use in a variety of applica-
tions for natural language processing (NLP), making it possible for robots to better com-
prehend human language and reason about it.
5.8 Classification Models

To classify information into distinct groups, researchers often use classification models, a subset of machine learning models. They may be used to make predictions about unlabelled data after being trained on a dataset of labelled data. Each model is, at its core, a mathematical function with a set of parameters. A model also has settings that determine its specific characteristics (such as its complexity, its ability to learn, etc.); these are called hyperparameters, and they must be provided before the model can be trained and run since they cannot be learned simply from the data. The goal of model tuning is to improve prediction accuracy by
choosing a set of hyperparameters that maximizes model performance. This procedure
may be carried out in a number of ways, including grid search and randomized search.
Several aspects of model tuning are considered throughout the implementation pro-
vided in this book. However, given that this is not a book dedicated only to machine learn-
ing and since this chapter’s emphasis is on text classification, it will not go into the
particulars of each classification approach.
In this chapter, we will focus only on classifying texts. While building categorization models from scratch is outside the scope of this chapter, we will discuss the models that are already available. Methods like logistic regression, support vector machines (SVMs), multinomial naive Bayes (MNB), random forests (RF), and gradient boosting (GB) machines may be used for classification. These are some of the most popular text categorization algorithms, although many more exist.
The last two models on our list are ensemble methods like random forests and gradient
boosting. The use of a number of different models for training and prediction is at the heart
of ensemble methods. Deep learning-based methods have also seen significant growth in recent years. To build an effective classification model, these methods use many hidden layers from several neural network models. Let's take a quick look at the principles that form the basis of these algorithms before deciding whether to apply them to our categorization problem.
For problems requiring prediction or classification with more than two classes, we may
turn to a variant of the widely used naive Bayes method. Let’s look at the naive Bayes
algorithm’s definition and formulation before moving on to multinomial naive Bayes.
Using the widely known Bayes theorem, the naive Bayes algorithm facilitates supervised
learning.
According to Bayes' theorem,

P(y \mid x_1, x_2, \ldots, x_n) = \frac{P(y)\, P(x_1, x_2, \ldots, x_n \mid y)}{P(x_1, x_2, \ldots, x_n)}

Here, the naive conditional independence assumption states that

P(x_i \mid y, x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = P(x_i \mid y)

and, for all i between 1 and n, we can therefore write:

P(y \mid x_1, x_2, \ldots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, x_2, \ldots, x_n)}

Now that we know P(x_1, x_2, \ldots, x_n) is a fixed value for a given input, we can write down the model as follows:

P(y \mid x_1, x_2, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)

\hat{y} = \arg\max_{k \in \{1, 2, \ldots, K\}} P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)

For the multinomial variant, the smoothed conditional probability of feature i given class y is estimated as

\hat{p}_{yi} = \frac{F_{yi} + \alpha}{F_y + \alpha n}

where F_{yi} = \sum_{x \in TD} x_i is the frequency with which feature i appears in the training dataset TD for class label y, and F_y = \sum_{i=1}^{n} F_{yi} is the sum of the frequencies of all features associated with the category y. The use of the smoothing prior \alpha allows for a degree of smoothing.
To eliminate problems associated with zero probabilities, we set α > 0; this accounts for features that do not appear in the training data for a class. Efforts have been made to identify appropriate values for this parameter. Smoothing with α = 1 is called Laplace smoothing, whereas smoothing with α < 1 is called Lidstone smoothing. When developing our text classifier, we make use of the Scikit-Learn library's multinomial naive Bayes implementation, found in the class MultinomialNB. Setting the value of α too high will mislead the model owing to excessive smoothing.
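A hedged sketch of using MultinomialNB inside a simple bag-of-words pipeline follows; the documents and labels are hypothetical.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus with hypothetical class labels.
docs = ["the match was thrilling", "elections were held today",
        "the team won the cup", "parliament passed the bill"]
labels = ["sports", "politics", "sports", "politics"]

# Word counts feed the multinomial model; alpha is the smoothing parameter.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(docs, labels)
print(model.predict(["the cup final was held today"]))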
In 1958, statistician David Cox introduced the logistic regression model. Since the logistic (or sigmoid) mathematical function is used in estimating its parameter values, this model goes by several names, including logit and logistic regression. The parameters being estimated are the feature coefficients for our dataset that minimize the loss when making predictions about the target variable.
The logistic function is a useful tool for transforming log-odds into probabilities. The typical sigmoid or logistic function is represented mathematically as

\sigma(x) = \frac{1}{1 + e^{-x}}

Here e is the base of the natural logarithm and x is the value being transformed, in our case the linear combination of features obtained from the regression equation. As can be seen in Fig. 5.4, this function takes the form of an S-shaped curve.
The sigmoid function returns a value between 0 and 1. We can use a threshold that can
help in classifying an item. For example, for threshold greater than 0.5, an item may
belong to one class and alternatively to a second class. The following is a representation of
a common multiple linear regression model:
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n
Here {x_1, x_2, \ldots, x_n} are the features for which we attempt to calculate the coefficients {\beta_1, \beta_2, \ldots, \beta_n}. Using the logit of the probability p, we can express the prediction of the categorical classes as the following log-odds:

\log\text{-odds} = \operatorname{logit}(p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n

The odds of making a correct classification prediction are therefore \frac{p}{1-p}, which is essentially the proportion of positive to negative outcomes. Similarly, the logit of p is equivalent to the log odds, which may be derived mathematically as follows:

\operatorname{logit}(p) = \log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n
The logistic regression model’s central equation is given below. It allows us to get the
class probability values that the model generates:
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n)}}
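As a brief sketch, scikit-learn's LogisticRegression estimates these coefficients and applies the sigmoid internally; the two-feature data below is hypothetical and simply stands in for extracted text features.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical two-feature data; in text classification these columns would
# typically be TF-IDF or embedding features.
X = np.array([[0.1, 1.2], [0.3, 0.9], [1.5, 0.2], [1.8, 0.1]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba applies the sigmoid to beta_0 + beta_1*x1 + beta_2*x2;
# predict then uses the default 0.5 threshold on these probabilities.
print(clf.predict_proba([[0.2, 1.0]]))
print(clf.predict([[0.2, 1.0]]))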
For the purposes of both classification and regression, support vector machines (SVMs)
are popular machine learning models. They are an effective method for addressing many
different issues, and they are also simple to learn and use. SVMs function by identifying
the hyperplane that optimally divides the data into two groups. A hyperplane is a line or plane that divides the feature space into two parts. SVMs seek to identify the hyperplane that yields the largest margin between the two classes. The margin is measured from the hyperplane to the nearest examples of each class; the greater the margin between the two groups, the better. As a
non-parametric model, SVMs do not presume anything about the shape of the data distri-
bution. Because of this, they are a useful resource for tackling a broad range of issues. In
addition to being a very effective model, SVMs need little time to train even when applied
to massive datasets. Some advantages of employing SVMs are as follows:
• Accuracy: When trained on big datasets, SVMs may achieve high levels of accuracy.
• Interpretability: When it comes to debugging and comprehending the findings, SVMs
may be straightforward to understand.
154 5 Text Classification
• Scalability: SVMs are applicable in the real world because they can be scaled to enor-
mous datasets.
• Robustness: Because of its resistance to noise and outliers, SVMs are often used in situ-
ations when the data is not ideal.
SVMs are an effective machine learning technique. They’re straightforward and versa-
tile enough to apply to a broad range of issues. Figure 5.5 shows the pictorial representa-
tion of support vector machine.
To extract rules from a given dataset, we can use decision trees, a supervised machine learning approach. The tree is constructed using measures such as information gain and the Gini index. Since decision trees are non-parametric, the larger the amount of data available, the deeper the tree. The resulting trees may be quite large and deep, but they can suffer from the problem called overfitting: although the model performs well on the training dataset, it fails to generalize to new data and, as a result, produces poor results on the validation dataset. Random forests attempt to solve this issue.
By taking the average of the results of many decision tree classifiers applied to different
subsamples of the dataset, random forests can enhance predicted accuracy while prevent-
ing over-fitting. The sub-sample size is the same as the size of the original input sample, but the samples are drawn with replacement (bootstrap samples). All the trees in a random forest are trained in parallel (bagging, or bootstrap aggregation) rather than sequentially. In addition, when splitting a node during tree construction, the best split across all features is no longer selected; instead, the best split is chosen from a random subset of the features. This random selection of features when dividing nodes, together with the random sampling of data, introduces randomness
in a random forest. Since the forest is inherently more unpredictable than a single non-
random decision tree, the bias of the forest tends to rise slightly as a result. When we aver-
age, the variance of the model falls more than the bias increases, so we end up with a better
model overall.
Both the individual decision trees and the whole forest may have their model parame-
ters fine-tuned when creating a random forest. Standard decision tree model parameters,
such as tree depth and leaf count, are often used for the trees along with information gain,
Gini impurity, and the number of features. We can control the forest by adjusting param-
eters such as the total number of trees, the number of features utilized in each tree,
and so on.
Gradient boosting machines (GBMs) can be useful for each of the abovementioned tasks. GBMs optimize an arbitrary differentiable loss function and construct an additive model in a forward stage-wise fashion. GBMs may be built on top of any set of weak learners; Scikit-
Learn employs gradient boosted regression trees (GBRTs). These GBRTs can be applied
to any differentiable loss function. The versatility of this model lies in its ability to solve
both regression and classification issues accurately. Mathematically, GBRT additive mod-
els can be represented as follows:
F(x) = \sum_{m=1}^{M} \gamma_m h_m(x)

F_m(x) = F_{m-1}(x) + \arg\min_{h} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + h(x_i)\big)
Decision trees are often employed as the base models, with the loss being minimized at each stage (for example, squared error in the case of regression trees or negative log-likelihood in the case of classification trees). Supervised machine learning and classification have been the subject
of several articles and books. The most recent state-of-the-art ensemble models, such as
XGBoost, CatBoost, and LightGBM, are also highly recommended reading.
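A hedged side-by-side sketch of the two ensemble classifiers in scikit-learn follows; the synthetic dataset merely stands in for extracted text features.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for extracted text features.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                random_state=42)

for name, model in [("Random forest", rf), ("Gradient boosting", gb)]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))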
5.9 Evaluating Classification Models

While it is critical to train, tune, and develop models as part of the analytics lifecycle, it is
also crucial to evaluate and understand how effectively these models are performing. The
effectiveness of a classification model is often measured by its ability to accurately predict
the class labels of unseen data. Typically, this is evaluated in comparison to a test dataset,
which contains information that was not utilized to train the classifier. Several observa-
tions and their respective labels are included in this sample dataset.
Features are extracted in the same manner they were extracted during model training.
The features are then used to make predictions for each data point using the trained model.
To evaluate how well the model has predicted, its results are compared to the actual labels.
A model’s prediction performance may be measured in several ways. The following indi-
cators are our primary areas of interest:
• Accuracy
• Precision
• Recall
• F1 Score
In most cases, the confusion matrix is segmented into the following four quadrants:
• True positives (TP): These are the cases that were correctly classified as positive.
• False positives (FP): These are the cases that were mistakenly labelled as positive while actually being negative.
• True negatives (TN): These are the cases that were correctly classified as negative.
• False negatives (FN): These are the cases that ought to have been classified as positive but were instead classified as negative.
• Accuracy: It is the ratio of correctly predicted instances (both true positives and true
negatives) to the total number of instances in the dataset.
The following is the formula:
Accuracy = (TP+TN)/(TP+TN+FP+FN)
• Precision: The term "precision" refers to the proportion of cases predicted as positive that are, in fact, positive.
The following is the formula:
Precision = TP/(TP+FP)
• Recall: It is the ratio of true positives to the sum of true positives and false negatives.
The following is the formula:
Recall = TP/(TP+FN)
• F1 Score: The F1 Score is the harmonic mean of precision and recall. It provides a balance
between these two metrics, especially when there is an imbalance between the classes.
The following is the formula:
F1 Score= 2 * (Precision * Recall) / (Precision + Recall)
The use of confusion matrices may raise several issues, including the following:
• When applied to huge datasets, their interpretation may be challenging. They are sometimes sensitive to class imbalance, and they give no information about the confidence of the model's predictions.
• For a large number of classes, the confusion matrix may become complex.
Despite these issues, confusion matrices are an invaluable tool for analyzing the accuracy of a classification model. They provide a thorough picture of the performance of the model, and they may be used to detect the many kinds of mistakes that the model is producing.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
def plot_confusion_matrix(y_true, y_pred, labels):
    """
    Plots a confusion matrix based on the true and predicted
    labels.

    Args:
        y_true (array-like): The true labels.
        y_pred (array-like): The predicted labels.
        labels (list): List of class labels.

    Returns:
        None
    """
    confusion_matrix = pd.crosstab(pd.Series(y_true, name='Actual'),
                                   pd.Series(y_pred, name='Predicted'))
    confusion_matrix = confusion_matrix.reindex(index=labels,
                                                columns=labels, fill_value=0)
    sns.heatmap(confusion_matrix, annot=True, fmt="d", cmap="Blues")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.title("Confusion Matrix")
    plt.show()
# Example usage
true_labels = [0, 1, 0, 1, 1, 0, 2, 2, 2]
predicted_labels = [0, 0, 1, 1, 1, 2, 2, 2, 0]
class_labels = [0, 1, 2]
plot_confusion_matrix(true_labels, predicted_labels,
class_labels)
• True positive (TP): This is the total number of occasions when our model accurately
predicted the positive class label, i.e., where the real class label exactly matched the
predicted class label.
• False positive (FP): Our model incorrectly predicted a positive outcome for this many
occurrences from the negative class. Therefore, it is called a “false” positive.
• True negative (TN): This specifies the total number of negative class occurrences for which our model accurately predicted the class label, i.e., where the real class label exactly matched the predicted class label.
• False negative (FN): This is the number of times our model incorrectly predicted the negative class for an instance that actually belonged to the positive class. This is why it is called a "false" negative.
Accuracy works best as a measure when the classes are evenly distributed in the dataset. Our model's prediction performance may be calculated using code along the lines of the sketch below:
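The exact code is not reproduced in this excerpt, so the following is only a sketch of how the metrics could be computed with scikit-learn; the function name and the weighted averaging are assumptions, and the output shown further below comes from the book's own example labels.

from sklearn.metrics import (accuracy_score, classification_report, f1_score,
                             precision_score, recall_score)

def evaluate_classification(true_labels, predicted_labels):
    """Print common performance metrics for a set of predictions."""
    # Weighted averaging is assumed so the metrics also work for multi-class data.
    acc = accuracy_score(true_labels, predicted_labels)
    prec = precision_score(true_labels, predicted_labels, average="weighted")
    rec = recall_score(true_labels, predicted_labels, average="weighted")
    f1 = f1_score(true_labels, predicted_labels, average="weighted")
    print(f"Accuracy: {acc:.4f}")
    print(f"Precision: {prec:.4f}")
    print(f"Recall: {rec:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print("Classification Report:")
    print(classification_report(true_labels, predicted_labels))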
Accuracy, precision, recall, and F1 Score are only a few of the performance measures
that are computed by the function using the Scikit-Learn module. The program then uses
print() commands to show the results of the calculations. In addition, it generates a report
detailing the categorization with statistics for accuracy, recall, F1, and support. To get the
performance metrics for your unique classification problem, just swap out the true labels
and predicted labels with your own data.
Output
Accuracy: 0.6667
Precision: 0.6190
Recall: 0.6667
F1 Score: 0.6306
Classification Report:
Classifying text means assigning labels or categories to it based on its content. Spam filtering, sentiment analysis, and content classification are just a few examples of the many uses of this core NLP task.
The following are the stages involved in developing a text classifier:
• Data collection: The initial task is to collect a dataset containing the annotated text.
Text that has been manually labelled into two or more categories should be included in
this dataset.
• Feature extraction: The next stage is feature extraction from the text. When training a classifier, features are the properties of the text that are taken into account.
• Model selection: This step involves choosing a classification model. Text categoriza-
tion may be accomplished using a wide variety of models, including naive Bayes, sup-
port vector machines, and decision trees.
• Model training: The training of the classifier is the fourth stage. Putting the dataset
through the model and optimizing its settings to get the desired degree of accuracy are
what this entails.
• Model evaluation: Finally, the classifier must be assessed. The accuracy of the classifier is determined by putting it to the test on a held-out test dataset.
(Figure: the text classification workflow, comprising model training, model evaluation, and results.)
• Accuracy: Accuracy is the most often used measure of a text classifier's performance. It measures the proportion of test documents that are labelled correctly.
• Precision: Precision measures how many of the texts predicted to belong to a class actually belong to it.
• Recall: Recall measures how many of the texts that belong to a class are correctly labelled as such.
• F1 Score: Precision and recall are weighted equally to obtain the F1 Score.
It should be noted that the higher the quality of the data, the more accurate the classifier. Make use of a variety of features: the more informative characteristics you utilize, the better the accuracy of the classifier tends to be. Choose a model that fits the data well; not all models handle the same data equally well. Use a large dataset to train the classifier, since more information is helpful during training. Finally, check the classifier's performance on an unseen dataset; this will allow you to gauge the classifier's efficacy more precisely. Some difficulties in developing and assessing text classifiers are as follows:
• Data requirements: It takes a lot of labelled data for text categorization algorithms to learn. This may be hard to overcome, particularly when the data is insufficient or the majority of it is unlabelled.
• Complexity: It might be difficult to develop and train an algorithm to classify texts. For
certain customers, this may provide a problem.
• Bias: Because of their inherent subjectivity, text categorization algorithms may not
provide reliable results across the board.
It is not easy to construct and evaluate text classifiers with high accuracy, but the effort often results in a reliable model. If you follow the guidelines given above, you should be able to create a text classifier that performs well.
5.11 Applications
There are many practical uses of text classification. Here are a few examples: classification of news, anti-spam measures, classification of musical or cinematic works, and emotion detection in text. There is no limit to what can be done with text data, and with a little work you can use categorization to automate various tasks and processes that would normally take a lot of time. Among the numerous available algorithms for text categorization, the most popular ones are as follows:
• Decision trees: To categorize texts, a tree-like structure called a “decision tree” might
be used. They function by making a series of judgments about the text and then classi-
fying it based on those judgments.
Classifying texts is an important method with many applications. Examples of how text
categorization may be used are as follows:
• Spam filtering: Text classification may be used to identify spam emails. It’s common
practice to either mark spam emails as “spam” or “ham” (not spam).
• Sentiment analysis: The mood of a piece of writing may be analyzed with the help of
text categorization. The positive, negative, or neutral tone of a piece of text may be
identified with the help of sentiment analysis.
• Topic modeling: The subjects of a text may be determined by text categorization. Using
topic modeling, we may classify texts according to the subjects they cover.
• Accuracy: When trained on big datasets, text classification algorithms have the poten-
tial to achieve high levels of accuracy.
• Scalability: Text categorization methods are practical for real-world use because they
can be easily extended to enormous datasets.
• Interpretability: When it comes to troubleshooting and interpreting the findings, text
classification algorithms may be very straightforward to grasp.
• Data requirements: It takes a lot of labelled data for text categorization algorithms to
learn. This may be hard to come by, particularly for brand-new or specialized sectors.
• Complexity: It might be difficult to develop and train an algorithm to classify texts. For
certain customers, this may provide a problem.
• Bias: Text classification methods may not be dependable for all datasets due to their
inherent bias.
5.12 Summary
This chapter has covered the concepts of text classification, also known as text categorization. We have discussed the complete process of text classification as well as the various classification algorithms that can be used. Finally, we have looked at methods to evaluate and compare the efficiency of text classification algorithms. Each step has been explained with the help of examples, and Python code with a complete description has also been provided.
5.13 Exercises
Text Label
Germany is one of the most beautiful countries in the world Country
Football is one of my favorite games Sports
National Congress party is expected to win the current elections Politics
Terminator II was one of my favorite movies Movie
The cricket world cup venue has not been decided yet Sports
I cannot ensure who will win the next National Elections Politics
Now, consider some relevant text from some newspaper, and classify that text accordingly.
1. K-Means
2. AHC algorithm
3. Decision trees
4. DBSCAN
5. Naive Bayes
6. Logistic regression
7. Ensemble models
8. Support vector machines
9. Random forest
10. Gradient boosting machines
Q4: Which activation function is depicted by the following diagram? Also explain some
of the benefits and drawbacks of that activation function.
Q5: What is the difference between the support vector machine and the logistic regres-
sion classification methods?
Q7: In the following diagram showing the support vector machine, identify the equations
for the diagonal lines.
(Diagram: an SVM feature space with axes X1 and X2 showing the diagonal lines referred to in the question.)
• Beautiful
• Cat
• Dog
Calculate TF, IDF, and TF-IDF of each term for each sentence.
Q10: Explain how the text classification can be used in the following systems:
• Classification of news
• Anti-spam system
Q11: Consider the following confusion matrix, and explain what the empty cells represent.
                            Predicted labels
                            n' (Predicted)    p' (Predicted)
True labels   n (True)
              p (True)
Q12: Computers do not understand textual data directly. How, then, can we make computers understand text for further processing?
Recommended Reading
• Text Mining: Classification, Clustering, and Applications by Ashok Srivastava and
Mehran Sahami
Publisher: Chapman and Hall/CRC
Publication Year: 2009
Offering a broad picture of the discipline, Text Mining: Classification, Clustering, and Applications presents a comprehensive investigation of statistical approaches in text mining and analysis. It digs into approaches for automatically grouping and classifying text documents, illustrating their practical applications in adaptive information filtering, information distillation, and text search across multiple disciplines.
The book opens by breaking document classification into predetermined categories,
disclosing cutting-edge algorithms and their real-world applications. Moving beyond
this, it presents unique approaches for grouping documents into groups without spec-
ified frameworks. These approaches autonomously discover thematic structures
within document collections, providing useful insights into the underlying themes.
• Inductive Inference for Large Scale Text Classification: Kernel Approaches and
Techniques by Catarina Silva, Bernardete Ribeiro
Publisher: Springer Berlin, Heidelberg
Publication Year: 2009
Text classification has become a crucial task for analysts in many different sectors in
the modern world. Textual materials have proliferated rapidly in the digital age, includ-
ing books, scientific papers, emails, Web pages, and news items. Despite their wide-
spread use, handling digital texts requires special skills. This process is difficult because
of the enormous amount of data required to represent them and the subjective nature of
classification.
Like text classification, text clustering is another important task that is performed in the
context of textual analysis. In the clustering process, the text is organized in the form of
relevant groups and subgroups before further processing. One of the major challenges in
text clustering is to form meaningful clusters of the text without having any prior knowl-
edge. The chapter will explain the clustering process in detail along with examples and
implementation of each step in Python.
In text clustering, we group the text based on the similar properties of the textual data. The
purpose of text clustering is to enable effective information organization and retrieval by
automatically discovering patterns, themes, or subjects within a huge collection of unstruc-
tured textual data. An excessive amount of textual data has accumulated in recent years
due to the emergence of digital platforms including social media postings, online reviews,
news articles, and consumer feedback. Analyzing this much text manually is a tedious,
wasteful, and error-prone process. The procedure may be automated with the use of text
clustering algorithms, which can provide a structured representation of otherwise unorga-
nized textual data. Most text clustering techniques use an unsupervised approach, which means they can be applied without predefined labels or categories. Instead, they construct clusters based on an analysis of the text's innate patterns and structures. The premise is that texts with comparable meanings and purposes tend to use similar language.
Several well-known text clustering techniques exist, and they all have their advantages and
disadvantages.
Now we will discuss different applications of clustering in depth.
Now, we give a brief overview of different clustering methods and techniques. These
are some of the most typical methods.
This technique groups text into a fixed number of clusters by using the feature vectors.
Each document is assigned to the cluster centroid that is closest to it, and the centroids are
updated until convergence is reached. K-means clustering is a popular method for group-
ing similar texts together. It is an iterative method that attempts to divide a collection of
documents into K groups according to their level of similarity. Some characteristics of
K-means clustering are as follows: The technique begins by picking K initial cluster cen-
troids at random. The cluster centers are shown by these centroids. At this stage, we allo-
cate each document to the cluster whose centroid is closest to it using some kind of
similarity metric. Cosine similarity is one of the common measures used for this purpose.
It measures how similar two vectors (representing texts) are by calculating the cosine of
the angles between them. After the initial document allocation, the method performs an
update step in which it calculates the mean vector of all documents in each cluster to revise
the cluster centroids. The cluster’s new center of mass will be located along this average
vector. The process is repeated until a convergence threshold is reached. Convergence
criteria often include a minimum distance between successive centroids or a minimum
percentage change in cluster assignments. After the algorithm reaches a steady state, the
resultant clusters are analyzed to determine how cohesive and distinct they are. "Cohesion" refers to the degree to which documents within the same cluster are similar to one another, whereas "distinctness" refers to the degree to which clusters are unlike each other. The quality of the
clusters may be evaluated by using evaluation measures like the silhouette score or the
sum of squared errors (SSE). There are several factors to think about while deciding on the
number of clusters (K). Knowledge or experience in the relevant field may be used to make
the decision. Alternatively, the outcomes of clustering with varying values of K may be
evaluated using methods like the elbow method or silhouette analysis to determine the best
K value. When applied to textual data, K-means clustering has several restrictions. One
drawback is that it does not consider the natural sequential and structural information
found in the text and instead treats the documents as points in a high-dimensional vector
space. In addition, K-means may provide varied clustering outcomes depending on the
original centroids. K-means clustering is typically applied to text data after some preliminary processing has been completed, known as preprocessing. Tokenization, stop word removal, stemming, lemmatization, and vectorization methods like TF-IDF and word embeddings are all examples of such steps. These procedures help reformat the text data so that it can be used in a similarity calculation between documents. The K-means clustering algorithm is a straightforward method of grouping similar texts, and it can process huge datasets quickly and efficiently. To be successful, however, it requires careful attention to preprocessing, similarity measure selection, and the choice of the number of clusters.
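As noted above, the value of K can be chosen empirically. The following minimal sketch, in which the toy corpus and the candidate range of K are purely illustrative assumptions, evaluates several values of K on TF-IDF vectors and keeps the one with the highest silhouette score:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy corpus used only for illustration
docs = ["cats purr and sleep", "dogs bark loudly", "dogs chase cats",
        "stock prices fell today", "markets rallied on earnings",
        "investors sold their shares"]

X = TfidfVectorizer().fit_transform(docs)

best_k, best_score = None, -1.0
for k in range(2, 5):  # candidate values of K
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)  # higher means better-separated clusters
    print(f"K={k}: silhouette={score:.3f}")
    if score > best_score:
        best_k, best_score = k, score

print("Selected K:", best_k)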
The agglomerative (bottom-up) approach to hierarchical clustering proceeds as follows:
(a) Compute the similarity between all pairs of texts by using a suitable similarity metric, such as the cosine similarity metric.
(b) Start the clustering process by assigning every document to its own unique group.
(c) Using the similarity measure, merge the two clusters that are the most similar to one another. To account for the new information, adjust the distances between the clusters in the similarity matrix.
(d) Repeat step (c) until the required degree of similarity has been achieved or the desired number of clusters has been reached, whichever occurs first.
When using the divisive clustering method, first all the documents are grouped together into a single cluster. Next, the larger cluster is divided into smaller groups in an iterative manner until a stopping condition is met. The algorithm operates in the following manner:
(a) Using an acceptable similarity measure, calculate the degree to which each pair of texts is similar to one another.
(b) Begin the process by creating a single cluster and moving all the documents into it.
(c) Based on a suitable criterion, such as maximizing the dissimilarity between documents, pick a cluster and break it into two smaller clusters. To account for the new information, adjust the distances between the clusters in the similarity matrix.
(d) Keep repeating step (c) in a recursive manner until a termination condition is met.
The results of hierarchical clustering can be evaluated by inspecting the dendrogram and determining the degree to which clusters are cohesive within themselves and distinct from one another, or by relying on external criteria such as predetermined ground-truth labels. Regarding the choice of the number of clusters, hierarchical clustering does not need the number of clusters to be established before the clustering process. A suitable number of clusters can be obtained by cutting the dendrogram at a certain height, or one might use one of many alternative approaches, such as the silhouette score or the gap statistic. Hierarchical clustering may have a high computational cost as the size of the dataset grows. The so-called chaining effect, in which one set of clustering decisions may have ramifications for subsequent ones, is an additional possible disadvantage. In addition, the outputs of the clustering process might differ depending on whether a single-link, complete-link, or average-link strategy was used. Before applying hierarchical clustering to text data, preprocessing operations including tokenization, stop word removal, stemming or lemmatization, and vectorization are often conducted. The hierarchical structuring of clusters made possible by this approach allows for gaining insights into both global and local patterns present within the data. Visualizing the text data in a graphical form known as a dendrogram may help the analyst better understand the underlying patterns in the data.
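The following minimal sketch illustrates hierarchical clustering on TF-IDF vectors with SciPy; the toy documents and the choice of average linkage on cosine distance are assumptions made for illustration rather than a prescribed setup:

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

docs = ["cats purr and sleep", "dogs bark loudly", "dogs chase cats",
        "stock prices fell today", "markets rallied on earnings"]

# Dense TF-IDF vectors (SciPy's linkage expects dense observations)
X = TfidfVectorizer().fit_transform(docs).toarray()

# Build the cluster tree using average linkage on cosine distance
Z = linkage(X, method="average", metric="cosine")

# Compute the dendrogram layout (set no_plot=False with matplotlib to draw it)
tree = dendrogram(Z, no_plot=True)

# Cut the tree so that exactly two clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print("Cluster labels:", labels)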
LDA (latent Dirichlet allocation) is a probabilistic generative model. It assumes that there is a finite number of topics and that each topic is a distribution over words. Using the inferred topic distribution of each document, LDA groups texts accordingly.
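A small sketch of LDA-based grouping with scikit-learn is shown below; each document is assigned to its highest-probability topic, which then acts as its cluster label. The corpus, the number of topics, and the use of raw term counts are illustrative assumptions:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats purr and sleep", "dogs bark loudly", "dogs chase cats",
        "stock prices fell today", "markets rallied on earnings"]

# LDA works on raw term counts rather than TF-IDF weights
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # per-document topic distribution

# Treat the most probable topic of each document as its group
groups = np.argmax(doc_topic, axis=1)
print("Topic-based groups:", groups)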
There are various critical phases in text clustering. In the first stage of processing, stop words and punctuation are removed from the text data, and words are shortened to their roots (through stemming or lemmatization). After that, the text is converted into numerical feature vectors using an appropriate representation, such as the bag-of-words model or term frequency-inverse document frequency (TF-IDF). After the textual data has been converted,
the clustering technique is used to classify articles into collections with shared charac-
teristics. Internal metrics like silhouette score or external metrics like the Rand index or
F-measure may be used to assess the quality of the generated clusters. To enhance the clus-
tering outcomes, it is possible to apply iterative parameter refining or ensemble approaches.
There are several ways to apply the notion of text clustering. It is used by researchers
to find out what the public thinks or what online trends are. It is helpful for client
segmentation, the practice of categorizing customers based on things like remarks and
purchase trends. This information may be used to develop various business-related poli-
cies. It can help with text creation, data extraction, and content categorization. Despite its
benefits, text clustering has numerous drawbacks, including the need to cope with random
or sparse data and the inherently arbitrary nature of text analysis. Deep learning, subject-
matter knowledge, and pertinent data are a few of the techniques being looked at as pos-
sible solutions to improve the current grouping algorithms.
When dealing with vast amounts of unstructured text data, text clustering is a vital tool. Based on how closely related the documents are, the algorithm can automatically group them. This makes it easier to find information, discover new insights, and make decisions across disciplines.
In data clusters, the outcomes may be either static or dynamic, and this duality serves as
the difference between the two kinds of clustering tasks. Results from clustering data
items are fixed in static clustering but are continuously updated in dynamic clustering.
Most prior discussions of clustering techniques assume static clustering. Here, we lay out the differences between the two kinds of clustering and provide examples of each. In static clustering, once data points have been grouped, the grouping results are presumed to be permanent. The clustering techniques described so far are static in this sense: they take in many data points and output fixed groups of related data. However, new data points may be added and old ones removed over time, so the information must be rearranged accordingly, and static clustering does not capture this changing context. As an
alternative to static clustering, we might think of dynamic clustering. With dynamic clus-
tering, data is continuously restructured when new data points are added or existing data
points are removed. Data may be reorganized in two ways: either hard, in which a cluster-
ing technique is used to rearrange all the data items, or soft, in which existing clusters are
combined or split up to form new ones. Before deciding whether to rearrange, we need to
have a sense of how well things are currently arranged. Static clustering and dynamic clustering can be compared as follows. Static clustering operates on the premise that all data items are provided at once, whereas dynamic clustering accounts for the frequent insertion and removal of data. In static clustering, a set of data points is grouped into subgroups only once, whereas in dynamic clustering, data points are clustered repeatedly. The grouping of data points is thus either fixed or dynamic, depending on the method being used. Dynamic clustering considers the soft reorganization of existing clusters through actions like splitting and merging, which are ignored in static clustering. In dynamic clustering, one must choose between two types of maintenance: soft and hard. Intra-cluster and inter-cluster similarity are calculated as quality indicators of the clusters. The goal of data clustering is to enhance similarity within each cluster while minimizing it between clusters. The values of these two metrics before and after the addition or deletion of items are compared to decide whether to leave the clusters unchanged, perform a soft reorganization, or perform a hard reorganization.
Crisp and fuzzy clustering depends upon whether an item belongs to one cluster or more
than one. In a crisp clustering, each item is assigned to exactly one group, whereas in
a fuzzy clustering, at least some of the items share membership with several groups. In fuzzy clustering, each item is given membership values for the clusters rather than being assigned to a single cluster outright. We lay out the similarities and distinctions between the two types of clustering here. Table 6.1 presents the nine items and five categories as a matrix (each cluster represents a category). None of the nine items may be placed in more than one of the five categories.
Based on the exclusive clustering, set 1 consists of items 1 and 4, set 2 of item 2 only, set
3 consists of items 3 and 5, set 4 consists of items 6 and 8, and finally set 5 consists of
items 7 and 9. The crisp clustering has no overlapping clusters. Table 6.2 depicts a basic
use of fuzzy clustering. The framework of Table 6.2 is identical to that of Table 6.1. Five
out of the nine objects have several labels. The first item has three labels, item 4 has two
labels, and so on. Fuzzy clustering yields results if at least one item has been assigned to
more than one group. In Table 6.3, we compare the methods of crisp clustering and fuzzy
clustering that we have discussed above. In contrast to the fuzzy clustering method, the
crisp method strictly prohibits any overlap between groups. In contrast to the continuous
values between zero and one used for membership in the fuzzy clustering, the membership
values are represented as binary values in this method. Crisp clustering places each data
point in the most comparable cluster, whereas fuzzy clustering places it in many clusters
when the similarity is more than a specified threshold. Opinion mining typically uses crisp
clustering, in which each data point is assigned to one of three possible positive, neutral,
or negative clusters; topic-based text clustering typically uses fuzzy clustering, in which
each data point is assigned to one of several possible topic clusters.
The matrix of clustered items, i.e., the item-cluster matrix that emerges after applying fuzzy clustering, can be seen in Table 6.4. Both this item-cluster matrix (shown in Table 6.4) and the clusters themselves (in which some items belong to more than one cluster) are examples of the outcomes of fuzzy clustering. Each cluster is represented by a column in the
matrix frame, and each item is represented by a row. The value that is the intersection of a
column and a row represents the item’s membership in the cluster.
Because it assigns each data point to precisely one cluster, crisp clustering is often referred to as hard clustering. Each data point is given a membership value of 0 or 1 to indicate whether it is part of a certain cluster. The objective is to form separate groups whose members do not overlap.
On the other hand, fuzzy clustering permits data points to be partially members of
numerous groups. Instead of classifying each data point into one of two categories, fuzzy
clustering uses a membership degree to describe the probability that a given data item
belongs to a certain cluster. By allowing data points to be linked with different clusters at
the same time, fuzzy clustering accounts for the inherent ambiguity and overlap in data.
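The contrast can be made concrete with a small sketch: K-means produces a crisp assignment with a single label per document, while a soft assignment, approximated here with a Gaussian mixture over dimensionality-reduced TF-IDF vectors, yields a membership degree for every cluster. The mixture model and the SVD step are illustrative stand-ins for a dedicated fuzzy c-means implementation:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

docs = ["cats purr and sleep", "dogs bark loudly", "dogs chase cats",
        "stock prices fell today", "markets rallied on earnings",
        "investors sold their shares"]

# Reduce TF-IDF vectors to two dimensions for the mixture model
X = TfidfVectorizer().fit_transform(docs)
X2 = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Crisp assignment: exactly one cluster per document
crisp = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X2)

# Soft assignment: a membership degree for every cluster
gm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
soft = gm.fit(X2).predict_proba(X2)

for doc, c, p in zip(docs, crisp, soft):
    print(f"{doc!r}: crisp={c}, memberships={p.round(2)}")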
This type of clustering depends on whether a given cluster permits further nested clusters.
In a flat clustering, there is no such thing as a nested cluster, whereas in a hierarchical
clustering, there is. Results from flat clustering are shown as a list of clusters, whereas
those from hierarchical clustering are presented as a tree. Here, we will discuss the differ-
ences and similarities between the two distinct clustering methods. Both flat clustering and
hierarchical clustering are shown in Fig. 6.1. The left half of Fig. 6.1 displays the flat
clustering. In the flat clustering method, the final output is a list of clusters rather than a
hierarchical structure. When you need to divide a set of data into smaller groups based on
common features, it is better to use the flat clustering method. As can be seen on the right
side of Fig. 6.1, hierarchical clustering allows for at least one nested cluster inside a given
cluster. In Fig. 6.1, the three clusters are shown on the left side and one on the right side;
however, the right side has nested clusters. The output is a cluster tree with the root node
representing a whole set of items, the leaf nodes representing individual data points, and
the branch nodes representing subgroups. To facilitate the exploration of the data, a hier-
archical clustering structure is built. Table 6.5 shows the distinctions between the two
grouping strategies. As was previously said, hierarchical clustering permits any level of
nesting, but flat clustering does not. Both forms of clustering can make use of a variety of clustering methods, but certain algorithms are typical of each.
In the flat case, the K-means method is often utilized, whereas the AHC algorithm is
more common in the hierarchical case. Data items may be arranged into one or more clus-
ters in the flat clustering method, whereas in the hierarchical method, clusters can be
divided from the top-down or merged from the bottom-up. The outcomes of flat clustering
are shown in the form of a list of clusters, whereas those of hierarchical clustering are
displayed in the form of a tree. Let us now consider how to analyze the outcomes of hierarchical clustering. The outcomes of flat clustering are easily assessed by determining the intra-cluster similarity and the inter-cluster similarity. In the hierarchical case, the lower intra-cluster similarities of higher-level clusters and the larger inter-cluster similarities among nested ones make the results harder to assess. For this reason, a clustering method is often examined on a flat clustering task before it is applied to a hierarchical clustering problem. The clustering index has been proposed in the literature as a tool to assess such clustering outcomes.
In conclusion, the goal of flat clustering, sometimes called partition clustering, is to
divide a dataset into a specified number of clusters without creating any kind of hierarchi-
cal structure between them. Using similarity measurements or distance metrics, the algo-
rithms employed in flat clustering place each data point into a single cluster.
The goal of hierarchical clustering, on the other hand, is to produce a tree-like diagram
of clusters, known as a dendrogram. The number of clusters is not hardcoded in advance.
Both bottom-up (involving the merging of individual data points into clusters) and top-
down (involving the initial grouping of all data points into a single cluster) approaches
may be used to perform hierarchical clustering. The code below shows a simple implementation of an agglomerative clustering algorithm:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

# Textual data
data = ["This is the first document.",
        "This document is the second document.",
        "And this is the third one.",
        "Is this the first document?"]

# Convert the documents to dense TF-IDF vectors (AHC requires dense input)
X = TfidfVectorizer().fit_transform(data).toarray()

# Group the documents into two clusters
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Print the documents belonging to each cluster
for c in range(2):
    print(f"Cluster {c + 1}:")
    for doc, label in zip(data, labels):
        if label == c:
            print("--", doc)
Output:
Cluster 1:
–– This is the first document.
–– Is this the first document?
Cluster 2:
–– This document is the second document.
–– And this is the third one.
Here we have used a simple agglomerative method to perform text clustering. A list of documents was defined, the documents were converted into TF-IDF vectors, and then the agglomerative method was applied by specifying "2" as the number of clusters to be formed. The output may be different if a different number of clusters is specified.
The distinction between single-view clustering and multiple-view clustering lies in whether multiple clustering outcomes of the same data are handled. Different settings of a clustering algorithm's external parameters might cause it to yield different results. Differentiating hierarchical clustering from multiple-view clustering is based on the fact that the former produces a single tree of clusters, while the latter generates forests made
up of many trees. Here, we compare the two distinct clustering tasks that may be per-
formed. Clustering as seen from a single perspective is theoretically shown in Fig. 6.2. A
clustering method takes a set of things and divides them up into smaller groups. It does not
matter whether clusters overlap or not; the outcome is the set of clusters that are formed.
Until now, it was assumed that the results of grouping data items from a single perspective suffice for almost all domains of machine learning and data mining. However, research has shown that we need to make room for alternative clustering results produced from other perspectives as well. Multiple views of the clusters allow for a wide
range of clustering outcomes, as illustrated in Fig. 6.3. The fact that information is struc-
tured in a variety of perspectives leads to different views of the same dataset.
This sort of data clustering is worthy of consideration since it may be used even in
manual clustering. Even when using the same technique, the outcomes of clustering data
items might vary based on the settings of its external parameters. Changing the number of
clusters in the K-means method, for instance, may lead to a variety of distinct outputs.
Multiple clustering outcomes per data item are permitted in this sort of clustering. We
need to combine the findings of many cluster analyses of the same dataset. The preceding
section described many possible outcomes, sometimes known as organizations, from which a user might choose one. Similar groups are combined into a single entity by the
merging of several clustering outcomes. Integration of clustering findings from several
organizations is shown in Fig. 6.4, depicting hierarchical clustering and multiple-view
clustering. Both forms of clustering include many lists of clusters, making it difficult to
differentiate them at first glance. The results of hierarchical clustering provide a unified
structure for the data points. However, in the multiple-view clustering, various groups are
shown separately. In other words, the results of hierarchical clustering are shown as a
single tree of clusters, but the results of multiple-viewed clustering are presented as forests.
In conclusion, when a limited number of features or attributes are being used in the
clustering procedure, single-viewed clustering is used. It assumes that the provided collec-
tion of features adequately describes the data points.
In multiple-view clustering, sometimes called multi-view clustering, data from many
different features or attribute sets is combined to give a more complete picture of the data
to be clustered. Combining numerous views allows for a more complete depiction of the
underlying structure to be captured since each view offers a unique viewpoint on the data.
When forming clusters, multi-view clustering algorithms consider the connections
between data from different perspectives and thus generate different views. Each view
provides a different insight into the data. This helps in making decisions based on differ-
ent ideas.
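A rough sketch of the idea is to cluster the same documents under two different feature views, for example word-level TF-IDF versus character n-gram TF-IDF, and compare the resulting partitions; the two views chosen here are assumptions for illustration and do not constitute a full multi-view algorithm:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["cats purr and sleep", "dogs bark loudly", "dogs chase cats",
        "stock prices fell today", "markets rallied on earnings",
        "investors sold their shares"]

def cluster(X):
    # The same K-means step is reused for every view
    return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# View 1: word-level features
word_view = cluster(TfidfVectorizer(analyzer="word").fit_transform(docs))
# View 2: character n-gram features
char_view = cluster(TfidfVectorizer(analyzer="char_wb",
                                    ngram_range=(3, 4)).fit_transform(docs))

print("Word view :", word_view)
print("Char view :", char_view)
# A full multi-view method would reconcile or combine these partitions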
Various clustering methods provide unique viewpoints and methods for data analysis.
Factors such as data type, desired degree of granularity, presence of various feature sets,
and whether the data is dynamic or static all play a role in determining the best clustering
method to use. To choose the right algorithm and analyze the obtained results, it is crucial
to be familiar with the features and limitations of each clustering method.
As we have discussed, the process of arranging texts that have common themes into larger
units is known as text clustering. It is widely used in natural language processing (NLP)
and machine learning (ML). The purpose of text clustering, a text mining technique, is to arrange unlabelled texts into groups whose members are more similar to one another than to the members of other clusters. Customer reviews, news articles, and even scientific research papers might all fit neatly into such groups.
Since it lays the base for a wide variety of derived tasks that make use of clustered data,
it is a crucial technique in natural language processing and machine learning. The improve-
ments in understanding, organization, and use of text data are the result of these derived
operations. Multiple tasks can be performed once the data from the text documents has
been properly clustered into different categories. Common types of the tasks that can be
performed are as follows:
• Topic extraction/summarization: Topics may be extracted from a vast body of texts and
summarized with the help of text clustering. The most representative documents or
keywords within each cluster may be selected by document clustering to provide an
overview of the principal themes or subjects covered by the dataset. Information
retrieval, document indexing, and content analysis may all benefit from this kind of
automated topic extraction or text summarizing.
• Document classification: By giving an initial grouping of documents, clustering may
aid in categorization. To train supervised classification algorithms, it is necessary to
first cluster documents based on their similarities and then give class labels to each
cluster. Document classification models that make use of the pre-clustered data have a better chance of exploiting the data's inherent structure and organization to the advantage of classification accuracy and efficiency.
• Sentiment analysis: Sentiment analysis is one area where text clustering might be use-
ful. Sentiment patterns may be established by grouping several texts with similar senti-
ments, for example, those that can be seen in customer reviews or social media postings.
This may be useful for tracking overall mood, seeing patterns in certain emotions, and
understanding why people have different viewpoints. Brand management, market
research, and customer satisfaction analysis are all areas that may benefit from senti-
ment analysis that makes use of clustering.
• Recommender systems: Recommender systems may utilize text clustering to provide
product suggestions based on the similarities between user reviews and product descrip-
tions. Recommendations tailored specifically to the user’s tastes are made possible by
the process of clustering comparable products or user preferences. When object charac-
teristics are only stated in the form of text or when explicit user ratings are unavailable,
this method may improve the accuracy of recommendations.
Text clustering may be used for a wide variety of purposes beyond those listed above.
The tasks that are conducted are determined by the dataset at hand and the results that are
required. Text clustering may be used to enhance the efficiency of various text mining
activities in addition to the ones mentioned above that are derived from it. By aggregating
texts with similar contents, text clustering may enhance the precision of text categoriza-
tion, for instance. By aggregating documents that are likely to be relevant to a particular
query, text clustering may also be used to enhance the efficiency of text retrieval.
Text clustering’s adaptability and usefulness are highlighted by the resulting tasks. Text
data may be further examined, summarized, categorized, and used in a variety of other
activities by using the underlying structure and organization achieved by clustering. Text
clustering’s potential uses are broadened by the derived tasks, which may be implemented
in contexts like information retrieval, sentiment analysis, recommendation systems, and
anomaly detection. Text clustering is an effective text mining approach for discovering
meaningful relationships between texts before performing the subsequent tasks, which ultimately boosts the efficiency of other text mining operations.
Cluster naming is the act of assigning a meaningful name to each cluster based on its contents. We will go through some guidelines for giving clusters meaningful names; for example, a cluster should not be given a number or a primary-key value that has nothing to do with its actual contents. Cluster names must accurately reflect the information contained inside them. Here, we detail naming conventions that make it easier to navigate text clusters.
Different rules have been proposed in the literature for symbolically assigning names
to clusters. The fundamental rule of cluster naming is that the names should be symbolic
representations of the contents of the clusters. The second rule is that cluster names
shouldn’t be too lengthy; typically, a cluster name will include no more than four words.
The third rule is that there should be no duplicate names for different clusters. The names
of the clusters make it easier to find texts by browsing.
Different procedures have also been defined to name the clusters, and these procedures
help in searching for a particular cluster. The names of the clusters are alphabetized into a
list. Words in an index are given weights, and the words with the greatest weights are con-
sidered for potential cluster names. Whenever two or more clusters end up with the same name, we can rename them by using one of the strategies discussed in the next paragraph. This isn't the only method available; in fact, the naming of clusters is often treated as a text mining operation distinct from clustering itself. Having many clusters with the same name defeats the purpose of naming; thus, we should discuss some methods for preventing this problem. When such a conflict arises, one option is to substitute the second-most-important term of a cluster for the first when naming it.
As an alternative, we may merge the clusters that share the same name, or swap the conflicting term for one produced by a different weighting method. Even when redundant cluster names are avoided in this way, it is quite possible that clusters named with different terms still have the same meaning. By integrating text clustering and cluster naming,
taxonomy may be generated. Generating taxonomy from a corpus involves extracting top-
ics, relations, and documents connected with those subjects. By applying a clustering tech-
nique to texts and then giving those groups descriptive names, the taxonomy may be
generated. We can use the process of taxonomy generation for the purpose of creating
ontologies. The process can either be manual or semi-automated. Figure 6.5 shows the
three clusters and their names.
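A minimal sketch of such weight-based cluster naming is given below: the documents are clustered, the mean TF-IDF weight of every term is computed per cluster, and the highest-weighted terms are proposed as candidate names. The toy corpus and the two-word name length are illustrative assumptions:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["cats purr and sleep", "dogs bark loudly", "dogs chase cats",
        "stock prices fell today", "markets rallied on earnings",
        "investors sold their shares"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

terms = np.array(vectorizer.get_feature_names_out())
for c in sorted(set(labels)):
    # Average TF-IDF weight of every term over the cluster's documents
    weights = np.asarray(X[labels == c].mean(axis=0)).ravel()
    name = " ".join(terms[np.argsort(weights)[::-1][:2]])  # two highest-weighted terms
    print(f"Cluster {c}: proposed name -> {name!r}")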
Now we will discuss subtext clustering. The overview of the subtext clustering process
is shown in Fig. 6.6. Subtext, in its broadest sense, refers to any portion of the main text
like headings, paragraphs, and sentences. Paragraphs are used to organize the texts in the
corpus, and comparable texts are grouped together. Once the main text is clustered, we
synthesize the subtext into the newly generated text. Here, we will show how subtext clus-
tering works.
A text may be broken down into its constituent parts and shown as a tree structure.
Paragraphs in the corpus texts are delineated by the carriage return. Punctuation marks
serve to divide each paragraph into individual sentences. Paragraphs, phrases, and indi-
vidual words all have their own contexts. The complete text serves as the root node in a
hierarchical representation of the text, with paragraphs and phrases serving as the inter-
mediate nodes and individual words serving as the terminal nodes. Even though a sub-
text is embedded inside the provided complete text, it still qualifies as textual data and
may be the focus of text clustering. The specified set of texts is mined for their subtexts,
which are then represented as numerical vectors. A clustering algorithm then groups them into clusters according to their common features. A particular text may therefore be associated with many clusters, in which its various subtexts are located. It's important to keep in mind that numerical
vectors representing subtexts could be sparser than those representing entire texts. In
this part, we will discuss subtext clustering and provide examples of related tasks.
Summaries of texts, rather than the whole texts themselves, are encoded in the sum-
mary-based clustering method. So, in simple words, if you break up a large text into
smaller chunks by grouping its paragraphs, you’ll have what are known as subtexts.
Through text segmentation, an entire text may be broken down into smaller, more man-
ageable chunks termed topic-based subtexts. Subtexts are often represented by sparser
numerical vectors than entire texts. For efficient text clustering, the summary-based
clustering method is a kind of subtext clustering. Paragraphs in an unsorted text are
grouped together. In this entire process, the ability to pick out relevant subtexts from a
larger body of text is always critical.
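A small sketch of paragraph-level subtext clustering is shown below: a single long text is split into paragraphs, each paragraph is vectorized, and similar paragraphs are grouped together. The example text and the blank-line paragraph delimiter are assumptions made for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

full_text = (
    "Cats purr and sleep most of the day.\n\n"
    "Dogs bark loudly and chase the cats around the yard.\n\n"
    "Stock prices fell sharply after the earnings report.\n\n"
    "Investors sold their shares as the markets declined."
)

# Subtexts: paragraphs delimited by blank lines
paragraphs = [p.strip() for p in full_text.split("\n\n") if p.strip()]

X = TfidfVectorizer().fit_transform(paragraphs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for paragraph, label in zip(paragraphs, labels):
    print(f"Cluster {label}: {paragraph}")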
Now, let us discuss how to automatically generate sample texts for text classification.
The first step in classifying texts is to construct categories in the form of a list or a tree and
assign certain texts to each category to serve as examples. Text clustering and cluster nam-
ing are the two basic activities that may be automated with the help of an automated sys-
tem. Sample-labelled texts are those that fit into the established categories, which are
provided as a list or a tree of cluster names. Here, we explain text clustering, which is often
thought of as the process of automating preparatory activities.
The early activities of the text classification process, such as category pre-definition and sample text allocation, are difficult jobs. Predetermined topics or groups are created in a list or tree structure. Text from external sources is gathered and carefully labelled. The procedure then encodes the sample texts into numerical vectors, and efforts are made to improve the quality of the sample texts used to train the text classifiers. Let us discuss
how these pre-processing procedures for text classification can be automated. In this situ-
ation, unlabelled text from the outside world is collected. The collected chunks are divided
into smaller groups of those with comparable characteristics, and these subgroups are
given symbolic names. Both the cluster names and the contents included inside each clus-
ter are provided in the form of labels.
The cost of automating these preparatory steps, however, may be a decline in the quality of the sample texts produced in this manner. To automate the initial steps of
text classification, we must tune the system to make use of text clustering. Classification
methods, such as crisp versus fuzzy and flat versus hierarchical, can be selected based
on the requirements. The clustering process must be conducted in accordance with the
kind of categorization we choose. Cluster names serve as a symbolic means of assign-
ing IDs to their respective groups. Optionally, we may also redistribute cluster examples
to serve as training data for the binary classifications that are extracted from the multi-
classification system.
It’s worth noting that the following procedure generates instances at random and that
these examples may be quite unreliable. Therefore, we need to think about the type of
learning that makes use of both labelled and unlabelled instances. Semi-supervised learn-
ing is a type of learning that incorporates elements of both supervised and unsupervised
learning. As a semi-supervised learning approach, LVQ (learning vector quantization) can
be used. The number of clusters must be determined in advance when using the K-means
technique or Kohonen Networks to classify data.
Now, let’s explain another application of text clustering, i.e., the identification of redun-
dant projects. The work specified here involves assembling clusters of research project
ideas that are very similar in terms of their main theme. Text clustering is used to create
project clusters from the research project proposals that are submitted as texts. This pro-
cess was first developed by using a single-pass approach with a similarity criterion close
to 1.0. Here, we will explain how to identify duplicate projects by grouping them together.
Scope, substance, and objectives are the core of each successful research proposal. In the
first section, titled “Project Scope,” the researcher defines the boundaries of the proj-
ect they are proposing. The second section, titled “Project Contents,” details the many
ideas, plans, and studies that make up the project itself. The last section, “Project Goals,”
describes the initiative’s ultimate purpose. The proposal is supposed to be presented in the
form of an XML document, with each of the three parts labelled. Let’s talk about how text
clustering may be used to identify repetition in research proposal material. Each study pro-
posal in the offered system comprises three sections. A clustering technique, in this case,
the single-pass algorithm, is employed to categorize the collection into smaller groups.
Each cluster’s research project proposals are evaluated for any duplication or connection
with other proposals. Each cluster should have either a single proposal selected for further
development, or the proposals inside the cluster should be merged into a larger project
as all of them represent redundant projects. To better identify duplicate project ideas, we
need to adjust the text clustering algorithm parameters.
The goal is to produce an exceptionally large number of clusters, each containing a very
tiny number of objects. In this case, we use a similarity value that is very close to one (1)
for the similarity criterion and the single-pass approach. The three parts of a proposal are
given different weights, with the study topic and research aim receiving more weight than
the research scope. Moreover, in such a setting about half of the resulting clusters are singletons, consisting of only one item. As an alternative to text clustering, text association may be used to identify duplicate projects. Each text in the corpus is represented as a set of words, and the Apriori algorithm uses these text sets to derive association rules, which can be applied here as well. The texts are then marked as potential duplicates of a certain project. In this scenario, both the confidence and support thresholds are set close to 1.0.
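A rough sketch of this single-pass approach is given below: each proposal is compared against the first member of every existing cluster and joins the first cluster whose similarity exceeds the threshold, otherwise it starts a new cluster. The toy proposals, the threshold value, and the use of the first member as the cluster representative are illustrative assumptions:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

proposals = [
    "Deep learning methods for Urdu text classification",
    "Deep learning methods for Urdu text classification using transformers",
    "Crop yield prediction using satellite imagery",
    "Text clustering for news recommendation",
]

X = TfidfVectorizer().fit_transform(proposals)
# The chapter suggests a threshold close to 1.0 for strict duplicate detection;
# a softer value is used here so the toy titles form a duplicate pair.
threshold = 0.7
clusters = []  # each cluster is a list of proposal indices

for i in range(len(proposals)):
    for cluster in clusters:
        representative = cluster[0]  # the first member represents the cluster
        if cosine_similarity(X[i], X[representative])[0, 0] >= threshold:
            cluster.append(i)
            break
    else:
        clusters.append([i])  # no sufficiently similar cluster: start a new one

for number, members in enumerate(clusters, start=1):
    print(f"Cluster {number}:", [proposals[m] for m in members])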
So far, we have discussed different clustering types. Now, we will discuss specific cluster-
ing algorithms.
Basic text clustering may be accomplished with the help of simple clustering algorithms,
which can serve as the foundation for more complex methods. These algorithms are sim-
ple and can be used for any simple task. Here are a few examples of basic text clustering
algorithms:
• Random partitioning: This approach first creates a partition by randomly placing each
document into one of many clusters. After that, it re-clusters texts based on similarity
metrics until convergence is reached, incrementally improving the division along
the way.
• Agglomerative clustering: Starting with the assumption that each document belongs to
a distinct cluster, agglomerative clustering combines the most comparable clusters in
an iterative fashion until a termination condition is reached. It organizes data into a
hierarchical structure of clusters that may be subdivided into any number of subsets.
Simple clustering algorithms have no need for the user to have any background infor-
mation on the subject. They merely categorize files into sets determined by their similari-
ties. Both hierarchical and density-based clustering can serve as examples of relatively simple clustering techniques. They are easy to implement and can be used with a variety
of datasets. However, they can be less accurate than other clustering algorithms, especially
if the data is not well separated.
K-means is a popular algorithm for clustering data, and it is often used for text as well. It
is a partitioning-based clustering method with the aim of splitting the dataset into K dis-
tinct groups. Each group is called a cluster. Here are the steps taken by the K-means algorithm:
• Initialization: Choose K random or heuristic starting points for cluster centers. Assign
each document to the centroid of the cluster to which it is closest using some distance
measure, such as the Euclidean distance or cosine similarity.
• Update: Recalculate each cluster center as the average (or median) of the documents assigned to that cluster.
• Iteration: Iterate the assignment and updating processes until convergence is achieved
or some other stopping condition is reached.
• Convergence: The K-means method converges when there is little variation in cluster
allocations, leading to consistent cluster centers. Groups of documents that have similar
feature vectors are represented by the final clusters.
So, we can say that K-means is a well-known technique for clustering information into a fixed number (K) of groups. First, the method clusters documents at random. Then, it goes through
each cluster in turn, moving documents to new clusters if they have more in common with
documents there than with those in their present cluster. This procedure is repeated until
the algorithm converges, at which point no documents will be moved across clusters.
Following is the Python code for the K-means clustering algorithm applied to textual data.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Textual data
data = ["This is the first document.",
        "This document is the second document.",
        "And this is the third one.",
        "Is this the first document?"]

# Convert the documents to TF-IDF vectors and cluster them into two groups
X = TfidfVectorizer().fit_transform(data)
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# Print the documents belonging to each cluster
for c in range(2):
    print(f"Cluster {c + 1}:")
    for doc, label in zip(data, kmeans.labels_):
        if label == c:
            print("--", doc)
Output:
Cluster 1:
–– This is the first document.
–– Is this the first document?
Cluster 2:
–– This document is the second document.
–– And this is the third one.
First, we defined the text documents in the list named "data". Then we used the TfidfVectorizer to convert them into TF-IDF vectors. Finally, the K-means algorithm was used to obtain the actual clusters, which were then displayed using the cluster ID and the documents belonging to each cluster.
Unsupervised learning algorithms like competitive learning are useful for clustering-related tasks. Competitive learning grew out of work on neural networks such as self-organizing maps (SOM). When it comes to representing groups of data, competitive learning algorithms generate a network of neurons that compete with one another to best represent each cluster. The weight vectors of the competing neurons are compared to each document to determine which neuron represents it best. In the context of text clustering, competitive learning techniques include the following:
• Self-organizing maps (SOM): The topological connections between neurons are kept
intact in SOM’s grid-based organization. Neighboring neurons are additionally modi-
fied to reflect the similarity between documents when each document is mapped to the
neuron with the nearest weight vector.
• Growing neural gas (GNG): GNG is an enhancement of SOM that dynamically modifies the network architecture during training. It begins with a limited number of neurons and gradually adds more as needed to detect and record new clusters, so GNG works well with text data that is constantly changing.

By updating neurons according to how well they compete for each input, competitive learning algorithms provide a competitive mechanism for clustering. They are suited to handling high-dimensional data and capturing its underlying structure. Each text clustering technique has its own advantages; the properties of the text data, the outputs required, and the computing constraints all play a role in deciding which method to use. Text clustering algorithms continue to be developed and improved by researchers and practitioners to meet the problems posed by unstructured text data.
So, we can say that competitive learning is a clustering technique that places each document into the cluster whose representative it has the most in common with.
First, the method clusters documents at random. The cluster centroids are then recalcu-
lated based on the documents in each cluster, and the process is repeated until all clusters
have been processed. This procedure is repeated until the algorithm converges, at which
point the centroids of the clusters no longer shift.
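The competitive (winner-take-all) mechanism can be sketched in a few lines of NumPy: a small set of weight vectors competes for each document vector, and only the winning vector is pulled toward the document. The corpus, the learning rate, and the number of epochs are arbitrary illustrative choices; SOM and GNG add neighborhood updates and network growth on top of this basic loop:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cats purr and sleep", "dogs bark loudly", "dogs chase cats",
        "stock prices fell today", "markets rallied on earnings",
        "investors sold their shares"]

X = TfidfVectorizer().fit_transform(docs).toarray()
rng = np.random.default_rng(0)

n_clusters, learning_rate, epochs = 2, 0.3, 20
# Each row of W is the weight vector of one competing "neuron"
W = rng.random((n_clusters, X.shape[1]))

for _ in range(epochs):
    for x in X:
        winner = np.argmin(np.linalg.norm(W - x, axis=1))  # the closest neuron wins
        W[winner] += learning_rate * (x - W[winner])        # pull the winner toward x

# Final assignment: each document goes to its winning neuron
assignments = [int(np.argmin(np.linalg.norm(W - x, axis=1))) for x in X]
print("Cluster assignments:", assignments)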
The appropriate clustering method to use depends on the dataset at hand and the intended outcome. A basic clustering approach may be enough if the dataset is small and straightforward. K-means clustering and competitive learning are two examples of more advanced clustering algorithms that may be required if the data collection is very large or complex.
The text clustering system architecture presented in this chapter can be seen in Fig. 6.7.
The first step is to collect texts to be used as clustering targets. Next, the features collected
from the texts are encoded as numerical vectors. When put into practice, the AHC method clusters the numerical vectors that represent the texts into subgroups. In this part, we
provide details of the functional view modules that contribute to the text clustering sys-
tem’s overall implementation. To fulfil its function, the text encoding module converts
input texts into numerical vectors. Obtaining texts for clustering is the first stage in run-
ning the text clustering system. Features are extracted from the collected texts and orga-
nized into a database. To encode the features of interest, a numerical vector is created for
each text.
After texts are encoded as numerical vectors, the text clustering module calculates the
similarities between the vectors. When a cluster pair is sent into the similarity calculation
module, it returns a normalized number between 0 and 1. In this system, we implemented
text clustering by using cosine similarity as the similarity measure between samples. The
text clustering module includes the essential activity of calculating similarities, which is
not included in the system architecture shown in Fig. 6.7. The next iteration of this system
will include more similarity measures. Using a similarity computation module, text clus-
tering is conducted in the text clustering module. This text clustering system calculates the
clusters using the AHC algorithm. It starts from single-item clusters and repeatedly combines the clusters with the greatest resemblance to one another. Cluster similarities are calculated by taking the mean of all potential pairwise similarities. Items keep being merged into clusters until the number of clusters reaches a threshold determined
by an outside parameter. The collected texts are the clustering targets, and they are stored
in text files. The collected texts are organized into smaller groups with related topics. Texts
that have been clustered are shown as a list of filenames with boundary lines between each
cluster.
However, Python libraries provide ready-made building blocks for implementing such a system. The code given below defines a small helper function (named get_text_classes here for illustration) that takes different text documents together with a trained classifier and returns the set of classes found in them.
def get_text_classes(texts, classifier):
    """
    Arguments:
    - texts: A list of text samples to be classified.
    - classifier: A trained text classifier model.
    Output:
    - A set containing the unique classes found in the
      text samples.
    """
    # Function name and body are illustrative: predict a label for every
    # sample and collect the unique labels.
    classes = set(classifier.predict(texts))
    return classes
# Example usage
texts = ["This is an example text about sports.",
"I love reading books and literature.",
"The stock market is experiencing a downturn.",
"The latest fashion trends for summer.",
"The new smartphone model has been released."]
Let’s discuss another interesting implementation. Here we have used the K-means clus-
tering algorithm for labeling the textual data, and then we have used this labelled data for
classification task further.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# Textual data
data = ["This is the first document.",
"This document is the second document.",
"And this is the third one.",
"Is this the first document?"]
# Convert the documents to TF-IDF vectors and cluster them with K-means
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data)
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
kmeans.fit(X)
labels = kmeans.labels_
X_train = X[:3]
y_train = labels[:3]
X_test = X[3:]
y_test_actual = labels[3:]
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test_actual, y_test_pred))
Output:
Accuracy: 1.0
Note that K-means is first used to label the documents given in the list named "data". Then the first few of these documents are used for training, and the last document is used for testing. This shows that clustering can serve as a preprocessing step that makes subsequent classification possible.
Natural language processing (NLP) and machine learning rely heavily on text cluster-
ing because it facilitates the classification, analysis, and finding of patterns in massive
textual datasets. Data preparation, feature extraction, and the application of a clustering
method are only a few of the essential processes in putting text clustering into practice. An
in-depth explanation of how text clustering works may be found below.
As the first task is to standardize and prepare text data for clustering, preprocessing is
typically necessary. Examples of typical preprocessing procedures include:
• Tokenization: the process of separating the text into separate words or tokens.
• Stop word removal: Reducing the number of filler words (such as “the” and “is”) used.
• Lowercasing: for the sake of case insensitivity, the text is changed to lowercase.
• Stemming/lemmatization: reducing words to their root form (e.g., “running” to “run”).
• Removing punctuation: getting rid of all punctuation in the writing.
After being cleaned up, the textual material must be converted into a form that clustering algorithms can work with; this is the feature extraction step. Text clustering often makes use of the term frequency-inverse document frequency (TF-IDF) approach for feature extraction. Words that are particularly significant or discriminative are highlighted by TF-IDF because their frequency in a document is compared to their frequency in the full corpus.
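A minimal sketch combining these preprocessing steps with TF-IDF feature extraction is shown below; plain Python handles lowercasing, punctuation removal, and stop word filtering, while scikit-learn produces the TF-IDF vectors. The toy sentences are assumptions for illustration, and stemming or lemmatization, which would normally be added with a library such as NLTK or spaCy, is omitted:

import string
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

docs = ["The cats are sleeping!", "Dogs are barking loudly.",
        "The stock market is falling...", "Investors are selling their shares."]

def preprocess(text):
    # Lowercase, strip punctuation, and drop common stop words
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)

cleaned = [preprocess(d) for d in docs]
print(cleaned)

# TF-IDF turns the cleaned texts into weighted numerical feature vectors
X = TfidfVectorizer().fit_transform(cleaned)
print("Feature matrix shape:", X.shape)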
After standardizing the text, the next task is to select the clustering algorithm. The
problem’s specifics and the makeup of the text data will determine which clustering tech-
nique is most suited to solving it. Common text clustering methods include:
• K-means: The goal of this partitional clustering technique is to divide the data into a fixed number of distinct groups. It does this by minimizing the sum of squared distances between each data point and its cluster center.
• Hierarchical clustering: A set of algorithms for creating a structured hierarchy of clus-
ters. A dendrogram is produced by merging or splitting groups repeatedly depending on
their similarity.
• Density-based clustering (DBSCAN): A method that uses density to partition datasets
into dense and sparse sections.
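A small sketch of density-based text clustering with DBSCAN follows; it works on cosine distances over TF-IDF vectors, and documents that fall in no dense region receive the label -1 (noise). The eps and min_samples values are illustrative assumptions that would need tuning on real data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

docs = ["cats purr and sleep", "dogs bark loudly", "dogs chase cats",
        "stock prices fell today", "markets rallied on earnings",
        "a completely unrelated sentence about gardening"]

X = TfidfVectorizer().fit_transform(docs)

# eps is the maximum cosine distance between neighbors in a dense region;
# both eps and min_samples normally require tuning on real data
labels = DBSCAN(eps=0.8, min_samples=2, metric="cosine").fit_predict(X)
print("Cluster labels (-1 marks noise):", labels)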
Once the clustering algorithm is applied, the next task is to evaluate the results. Due to the lack of class labels in unsupervised contexts, assessing the quality of text clustering is difficult. However, the clustering outcomes may be evaluated using a variety of metrics, such as the silhouette coefficient, the Davies-Bouldin index, and (when reference labels are available) the adjusted Rand index; these measures are discussed later in this chapter.
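A brief sketch of such an evaluation is shown below; it computes two internal measures (silhouette and Davies-Bouldin) and one external measure (the adjusted Rand index against assumed reference labels). The toy corpus and the reference labels are illustrative assumptions:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             adjusted_rand_score)

docs = ["cats purr and sleep", "dogs bark loudly", "dogs chase cats",
        "stock prices fell today", "markets rallied on earnings",
        "investors sold their shares"]
reference = [0, 0, 0, 1, 1, 1]  # assumed ground-truth labels for illustration

X = TfidfVectorizer().fit_transform(docs)
predicted = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("Silhouette     :", round(silhouette_score(X, predicted), 3))
print("Davies-Bouldin :", round(davies_bouldin_score(X.toarray(), predicted), 3))
print("Adjusted Rand  :", round(adjusted_rand_score(reference, predicted), 3))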
After evaluation, the next task is related to interpretation and visualization. It is essen-
tial to understand and visualize the clusters after receiving the clustering results to acquire
insights and draw relevant conclusions. To facilitate visualization, one might make use of
dimensionality reduction methods like principal component analysis or t-SNE. Scatterplots,
word clouds, and topic modeling are just a few examples of visualization methods that
may be used to learn more about the clusters’ structure and features. It should be noted that
the process of text clustering is iterative, requiring possible adjustments to preprocessing
methods, feature extraction strategies, and clustering algorithms. Optimizing clustering
performance requires repeatedly tweaking the process and assessing the outcomes.
Data preparation, feature extraction, application of clustering methods, evaluation of
findings, and interpretation of clusters are all necessary stages in the implementation of
text clustering. The nature of the text data and the desired outcomes of the clustering meth-
ods should be based on the selection of appropriate methods and algorithms. Text cluster-
ing, when properly implemented and analyzed, may help in the efficiency of several
natural language processing tasks, including information retrieval, topic extraction, and
data exploration. In a text classification system, it is useful to be aware of how the clusters
correspond to the established categories. The importance of the class-to-cluster connection
in text clustering is discussed in detail below.
Classes in a text classification system stand for the labels or categories that have already
been decided upon for the text samples. Topics, emotions, genres, or anything else could
be used as the basis for such groups. Classes may be things like “Sports”, “Politics”,
“Entertainment”, etc., in a system for categorizing news articles.
Clusters are produced throughout the clustering process as divisions based on the simi-
larities between the text samples. The goal of document clustering is to collect related
documents into manageable groups based on their common attributes, themes, and pat-
terns. Class-to-cluster mapping is the process of relating the clusters produced by a clus-
tering algorithm to the previously established classes. Each class’s corresponding clusters
are shown, and the accuracy of the clustering procedure in relation to the labels is calcu-
lated. There are several advantages to analyzing class-to-cluster links in the process of text clustering. A few of them are discussed here:
• Validation: The efficiency of the clustering method may be verified by comparing the
clustered classes to the ground-truth labels. It gives a numeric evaluation of how well
the clustering outcomes fit the predicted class distribution.
• Error analysis: Identifying any faults or weaknesses in the clustering process may be
achieved by analyzing the discrepancies between classes and clusters. This opens the
door for further research into instances in which samples from the same class are found
in various clusters.
• Discovering new classes: Clusters that don’t fit neatly into a predetermined category
may point to hidden or underappreciated themes in the text data. When these groups are
investigated, new ideas or trends may surface.
• Enhancing interpretability: Having a semantic context based on the known class labels
is provided by the class/cluster mapping, which assists in the understanding of the clus-
ters. It simplifies the process of identifying and categorizing clusters, which in turn
makes it simpler to draw useful conclusions from the data.
However, there are some challenges as well. Due to the unsupervised nature of clustering and the subjective nature of text interpretation, it might be difficult to establish a flawless one-to-one mapping between classes and clusters. Among the difficulties and constraints are:
• Ambiguity: It is challenging to categorize text samples since each one may be related to
many different groups or topics.
• Overlapping classes: Samples from the same class may be dispersed among several
clusters due to classes having substantial overlap in terms of the underlying text data.
• Noise and Outliers: It may be difficult to effectively map text data due to the presence
of noise or outliers that do not fit neatly into any pre-existing class or cluster.
In conclusion, text clustering relies heavily on the class/cluster connection, which links
the initial classes with the final clusters. The interpretability of the clusters is improved,
new classes are discovered, and the clustering findings are validated. While it may be dif-
ficult to create a perfect mapping, doing so will allow for a more insightful review of the
clustering performance and the text data.
The helper function below builds such a class-to-cluster mapping from the cluster labels and class labels of a set of texts; it is then applied to a small example.

def check_class_cluster_mapping(texts, cluster_labels, class_labels):
    """
    Arguments:
    - texts: A list of text samples.
    - cluster_labels: The cluster assigned to each text sample.
    - class_labels: The known class of each text sample.
    Output:
    - A dictionary mapping each class to the set of clusters
      it appears in.
    """
    # texts is kept for interface consistency; the mapping itself
    # only needs the two label lists.
    mapping = {}
    for cluster, class_label in zip(cluster_labels, class_labels):
        mapping.setdefault(class_label, set()).add(cluster)
    return mapping

# Example usage
texts = ["This is an example text about sports.",
"I love reading books and literature.",
"The stock market is experiencing a downturn.",
"The latest fashion trends for summer.",
"The new smartphone model has been released."]
cluster_labels = [0, 1, 1, 2, 2]
class_labels = ["Sports", "Literature", "Finance", "Fashion",
"Technology"]
# Check the class-cluster mapping
mapping = check_class_cluster_mapping(texts, cluster_labels,
class_labels)
# Print the output
print("Class-Cluster Mapping:")
for class_label, clusters in mapping.items():
    print("Class:", class_label)
    print("Clusters:", clusters)
    print("---")
Output
Class-Cluster Mapping:
Class: Sports
Clusters: {0}
---
Class: Literature
Clusters: {1}
---
Class: Finance
Clusters: {1}
---
Class: Fashion
Clusters: {2}
---
Class: Technology
Clusters: {2}
---
Relationships between classes and clusters will be shown in the text clustering system's output. A category may also appear in several different clusters, depending on the type of clustering technique you are using. Remember that your own runs may produce somewhat different clusters, as well as slightly different class and cluster relations.
Now, we will show an example of a function to check the Class: AHCAlgorithm rela-
tionship in text clustering using Agglomerative Hierarchical Clustering (AHC).
"""
Function to check the Class: AHCAlgorithm mapping in
text clustering using Agglomerative Hierarchical
Clustering (AHC).
Arguments:
- texts: A list of text samples.
- class_labels: A list of class labels corresponding
to each text sample.
- num_clusters: The desired number of clusters.
Output:
- A dictionary mapping each class to the clusters it
appears in.
"""
# Step 1: Text preprocessing and feature extraction
# Using TF-IDF for feature extraction
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
# Step 2: Apply Agglomerative Hierarchical Clustering
(AHC)
clustering_model = AgglomerativeClustering(
n_clusters=num_clusters)
cluster_labels = clustering_model.fit_predict(X)
# Step 3: Create the Class: AHCAlgorithm mapping
class_ahc_mapping = {}
for i, class_label in enumerate(class_labels):
cluster = cluster_labels[i]
if class_label not in class_ahc_mapping:
# Initialize an empty set for each new class
class_ahc_mapping[class_label] = set()
# Add the cluster label to the set of clusters for
the class
class_ahc_mapping[class_label].add(cluster)
return class_ahc_mapping
Here we will show how you can use the function.
# Example usage
texts = ["This is an example text about sports.",
"I love reading books and literature.",
"The stock market is experiencing a downturn.",
"The latest fashion trends for summer.",
"The new smartphone model has been released."]
class_labels = ["Sports", "Literature", "Finance",
"Fashion", "Technology"]
num_clusters = 2
# Check the Class: AHCAlgorithm mapping
mapping = check_class_ahc_mapping(texts, class_labels,
num_clusters)
# Print the output
print("Class: AHCAlgorithm Mapping:")
for class_label, clusters in mapping.items():
print("Class:", class_label)
print("Clusters:", clusters)
print("---")
6.6 Clustering Evaluation
In this section, we will discuss various concepts related to cluster evaluation. Before getting into the evaluation process itself, we first address the guidelines, frameworks, and strategies for assessing clustering outcomes. Evaluating clustering results is difficult when the target labels of the samples are not available, and even with labelled samples, clusters can be matched with the target categories in numerous ways. Clustering results are therefore evaluated by how well they maximize the similarity between samples inside each cluster and minimize the similarity with samples outside it. Here, we will also discuss how evaluating clustering outcomes compares with evaluating categorization outcomes.
Evaluating text clustering performance is more challenging than evaluating text classification performance, because the outcomes depend on how the similarities among texts are calculated. Several methods have been suggested for assessing the quality of text clustering results, but none of them has become an industry standard. When assessing clustering outcomes, we normally do not restrict the number of clusters produced. Although there are no universally accepted evaluation metrics, there is a clear direction when implementing text clustering systems: maximize the cohesiveness, or intra-cluster similarity, among items within each cluster, and minimize the inter-cluster similarity among different clusters. In simple words, the goal is to prevent two types of errors: placing dissimilar items in the same cluster and placing similar items in different clusters.
Several measures for evaluating the effectiveness of a clustering system have been proposed in the literature. The output of a clustering algorithm can be measured from three views. In the external view, the clustering results are evaluated using labelled examples. In the internal view, the outcomes are evaluated according to the similarities of the individual objects (Fig. 6.8).
The relative view compares the outcomes of several methods against the outcomes of a single reference method. Let us quickly discuss how we go about assessing the outcomes of data clustering. It is expected that a set of properly labelled examples has been produced for use in testing, with the samples organized into as many subsets as there are categories. We then calculate the average of the similarities inside each cluster and between all possible pairs of clusters.
Fig. 6.8 Clustering evaluation
• Silhouette coefficient: Silhouette coefficient measures how well the clusters are sepa-
rated and how similar the data points are within each cluster.
• Davies-Bouldin index: It quantifies the average similarity between each cluster and its most similar cluster, based on the spread within clusters and the distance between clusters; lower values indicate better-separated clusters.
• Adjusted Rand index (ARI): ARI measures the quality of the clustering algorithm by
comparing it with the reference (true) clusters. The comparison determines the quality
of the cluster that is being evaluated.
• Fowlkes-Mallows index: The geometric mean of the clustering's precision and recall relative to the reference clustering is calculated.
For example, suppose points A and B form cluster C1 and point C forms cluster C2, with distance(A, B) = 1 and distance(A, C) = distance(B, C) = 7.
For Point A:
a (average distance to its own cluster) = distance to B = 1
b (average distance to the nearest other cluster, C2) = distance to C = 7
Silhouette Coefficient (A) = (b - a) / max(a, b) = (7 - 1) / max(1, 7) = 0.857
For Point B:
a (average distance to its own cluster) = distance to A = 1
b (average distance to the nearest other cluster, C2) = distance to C = 7
Silhouette Coefficient (B) = (b - a) / max(a, b) = (7 - 1) / max(1, 7) = 0.857
Following is the Python code to calculate silhouette coefficient:
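A minimal sketch is given below, reproducing the three-point example above with scikit-learn (our illustration; it assumes the pairwise distances stated earlier). Note that scikit-learn reports a silhouette value of 0 for a point that is alone in its cluster, so point C contributes 0 to the average.

import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Precomputed pairwise distances for points A, B (cluster 0) and C (cluster 1)
distances = np.array([[0.0, 1.0, 7.0],
                      [1.0, 0.0, 7.0],
                      [7.0, 7.0, 0.0]])
labels = np.array([0, 0, 1])

# Per-point silhouette values: A and B get about 0.857, C (singleton) gets 0
per_point = silhouette_samples(distances, labels, metric="precomputed")
print("Per-point silhouette:", per_point)

# Average silhouette over all points (about 0.57 here)
print("Mean silhouette:", silhouette_score(distances, labels, metric="precomputed"))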
In this part, we’ll talk about the fundamental metrics used to judge the quality of cluster-
ing algorithms. It is assumed that the data points are represented as numerical vectors.
Clustering evaluations may be conducted using either intra-cluster or inter-cluster similar-
ities. The goal of data clustering is to increase similarity within each cluster while decreas-
ing it across clusters. Here, we provide the method by which we compute similarities.
Preparing a text collection to serve as the clustering target is the first stage in assessing clustering results. When text labels are not present in the test set, we rely on internal and relative validation to create our evaluation metrics. When labels are available, the clustering results can be evaluated through external validation: the labels are concealed throughout the clustering process and revealed only during assessment. In external validation, the number of clusters is determined by the number of target categories; in internal validation, the number of clusters is arbitrary; and in relative validation, the number of clusters is determined by the pivot method.
Considering the similarity between clusters helps avoid both extremes: putting all data points into one cluster and putting each data point into its own cluster. Maximum intra-cluster similarity is achieved by populating each cluster with items that are very similar to each other, while inter-cluster similarity is kept low by maximizing the discrimination between items in different clusters. For the best evaluation of clustering output using labelled instances, the texts of each target label should coincide with the texts of their related cluster.
Now, we will discuss the concept of internal validation. This method analyzes the results of clustering based only on the similarities between texts: a similarity measure is defined, and the unlabelled texts are prepared for clustering. Without any further context, the assessment measure is established from intra-cluster and inter-cluster similarity. Figure 6.9 depicts the unprocessed texts and their corresponding representations, which are used in the clustering assessment. This kind of assessment presupposes that no further information, such as target labels of texts, is provided. The technique encodes the original texts into numerical vectors, and the first step is to determine the similarity measure that will be used to compare the texts through these representations. How the similarity measure is defined may have a significant impact on the evaluation of the clustering results.
Different methods can be used to calculate the intra-cluster similarity. For example, for
the cluster at the top left of Fig. 6.10, the intra-cluster similarity can be calculated using
the simple Euclidean measure as follows:
The Euclidean distance between two data points A(x1, y1) and B(x2, y2) is calculated as:
Distance(A, B) = sqrt((x1 - x2)^2 + (y1 - y2)^2)
Using the data points [4, 5], [5, 6], and [3, 4], which are in the same cluster, you can calculate the pairwise Euclidean distances as follows:
Distance between [4, 5] and [5, 6]:
Distance = sqrt((4 - 5)^2 + (5 - 6)^2) = sqrt(1 + 1) = sqrt(2) ≈ 1.414
Fig. 6.10 Example clusters: C1 = {[4, 5], [5, 6], [3, 4]}, C2 = {[1, 4], [4, 1], [4, 2]}, and C3 = {[3, 1], [2, 0], [0, 2]}
For cluster C2, the similarities within the cluster will be:
Distance between [1, 4] and [4, 1]:
Distance = sqrt((1 - 4)^2 + (4 - 1)^2) = sqrt(9 + 9) = sqrt(18) ≈ 4.24
For cluster C3, the similarities within the cluster will be:
Distance between [3, 1] and [2, 0]:
Distance = sqrt((3 - 2)^2 + (1 - 0)^2) = sqrt(1 + 1) = sqrt(2) ≈ 1.41
These distances represent the similarities (or dissimilarities) between the data points in
the same cluster, with smaller values indicating greater similarity.
Now, let’s measure the inter-cluster similarity between C1, C2, and C3 using the same
Euclidean measure.
Calculate the centroids for each cluster:
Centroid of C1: [(4+5+3)/3, (5+6+4)/3] = [4, 5]
Centroid of C2: [(1+4+4)/3, (4+1+2)/3] = [3, 2.33] (approximately)
Centroid of C3: [(3+2+0)/3, (1+0+2)/3] = [1.67, 1] (approximately)
Calculate the pairwise Euclidean distances between cluster centroids:
Distance between C1 and C2:
Distance = sqrt((4 - 3)^2 + (5 - 2.33)^2) = sqrt(1 + 7.13) ≈ 2.85
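These intra-cluster and inter-cluster computations can be reproduced with a short NumPy sketch (our illustration, using the cluster memberships assumed in Fig. 6.10):

import numpy as np
from itertools import combinations

clusters = {
    "C1": np.array([[4, 5], [5, 6], [3, 4]]),
    "C2": np.array([[1, 4], [4, 1], [4, 2]]),
    "C3": np.array([[3, 1], [2, 0], [0, 2]]),
}

# Intra-cluster: pairwise Euclidean distances within each cluster
for name, points in clusters.items():
    for a, b in combinations(range(len(points)), 2):
        d = np.linalg.norm(points[a] - points[b])
        print(f"{name}: distance {points[a]} - {points[b]} = {d:.2f}")

# Inter-cluster: Euclidean distances between cluster centroids
centroids = {name: points.mean(axis=0) for name, points in clusters.items()}
for (n1, c1), (n2, c2) in combinations(centroids.items(), 2):
    print(f"{n1}-{n2} centroid distance = {np.linalg.norm(c1 - c2):.2f}")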
Now, let us discuss relative validation, another technique for assessing clustering findings. A desired (reference) clustering outcome is chosen in advance, and the effectiveness of various clustering methods is measured by comparing their results against it. Clustering methods are rated not on how precisely they cluster data items in absolute terms but on how closely their results approximate the desired ones.
Figure 6.11 shows the preparation of the clustering results for comparison purposes.
Data objects are clustered using various techniques, and the resulting information is col-
lected. The effectiveness of the clustering algorithms is determined by comparing the
expected and actual outcomes. There is no such thing as perfectly good clustering find-
ings, and there is always a great deal of subjective bias involved in creating such results.
Let’s talk about how relative validation may be used as a framework for judging clustering
outcomes rather than a hard and fast method.
The evaluation method assumes that the number of clusters in the generated results and in the desired results is the same. We create a mapping table where each column corresponds to a cluster in the desired results and each row corresponds to a cluster in the generated ones. We then sum all the values in the table as well as the values on the diagonal, and the ratio of the diagonal sum to the total sum becomes our metric. In Fig. 6.12, the three clusters on the left represent the actual outcomes of grouping the data items, whereas the three clusters on the right represent the intended outcomes that have been predefined. For example, one item in each cluster on the left side of the figure is common with the respective cluster on the right side.
Fig. 6.12 Generated clusters (left) and desired clusters (right) used for relative validation
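A minimal sketch of this diagonal-ratio metric is given below, assuming the generated and desired clusterings are available as aligned integer label lists (the arrays are illustrative, not the exact ones of Fig. 6.12):

import numpy as np

def diagonal_ratio(generated_labels, desired_labels, num_clusters):
    """Ratio of the diagonal sum to the total sum of the mapping table."""
    table = np.zeros((num_clusters, num_clusters), dtype=int)
    for g, d in zip(generated_labels, desired_labels):
        table[g, d] += 1
    return np.trace(table) / table.sum()

# Hypothetical assignments for nine items grouped into three clusters
generated = [0, 0, 0, 1, 1, 1, 2, 2, 2]
desired   = [0, 1, 1, 1, 2, 2, 2, 0, 0]
print(diagonal_ratio(generated, desired, 3))

With one common item per cluster out of three items each, as described for Fig. 6.12, the ratio comes out to 3/9 ≈ 0.33.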
Assessment measures based on internal and external validity are used more commonly than relative validation.
Now, we will focus on the third paradigm for assessing quality, known as external vali-
dation. It’s a method for analyzing the outcomes of clustering by adding the data from
other sources. This method comprises preparing the labelled instances as the test collec-
tion, hiding the labels during clustering, and evaluating similarity based on label consis-
tency. Figure 6.13 shows the annotated texts to be used in an external evaluation of the
clustering findings. As it can be seen in Fig. 6.13, the set of labelled texts is almost like
what is provided for relative validation. The labels are supplied clearly in the external vali-
dation method, but in the relative validation method, they are not given at the beginning of
the text collection. In this assessment paradigm, the outcomes of the clustering are mea-
sured against labelled instances. Data objects are grouped without revealing their labels,
and clusters are produced automatically without regard to the labels. We calculate both the
similarity between clusters and the similarities within clusters using the labels.
With labels available, the similarity between two data items is a binary number: zero when their labels differ and one when they are identical. The final evaluation score is calculated by combining the intra-cluster and inter-cluster scores. The use of target labels or other external information about the data items is what makes external validation distinct from internal validation. To illustrate external validation, a simple example is shown in Fig. 6.14. The texts are clustered without their labels, and the target labels are then used in place of cosine similarity to calculate the similarity between clusters and within clusters. In this kind of assessment, the text labels themselves serve as the external information. This is just one method; other measures include Hubert's correlation, the Rand statistic, and the Jaccard coefficient.
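Some of these label-based measures are available off the shelf; a small sketch using scikit-learn (a recent version is assumed for rand_score, and the labels are illustrative):

from sklearn.metrics import rand_score, adjusted_rand_score

true_labels    = ["Sports", "Sports", "Finance", "Finance", "Fashion"]
cluster_labels = [0, 0, 1, 1, 1]

print("Rand index:", rand_score(true_labels, cluster_labels))
print("Adjusted Rand index:", adjusted_rand_score(true_labels, cluster_labels))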
In this section, we will discuss the method for calculating the clustering index, first suggested in 2007. The method uses the target labels of the texts and therefore belongs to the external evaluation methods. The target labels of the data items are used to calculate both intra-cluster and inter-cluster similarity, and the clustering index combines the two metrics into a single statistic, much as the F1 measure does. Here, we explain how to compute the clustering index from the clustering results. It is expected that labelled texts are employed, in accordance with external validation, and that similarity is calculated over the results of clustering. The similarity between two texts is obtained by comparing their respective target labels:
sim(d_i, d_j) = 1 if label(d_i) = label(d_j), and 0 otherwise
Instead of focusing on the intra-cluster similarity, we may also look at the inter-cluster
similarity. Intra-cluster similarity must be maximized, while inter-cluster similarity must
be minimized, for successful data clustering. The intra-cluster similarity is calculated for
the whole set of clustering outputs. The two metrics, intra-cluster similarity and inter-
cluster similarity, calculated in the preceding step are combined to form the final metric.
CI = (2 · intra_cluster_similarity · (1.0 - inter_cluster_similarity)) / (intra_cluster_similarity + (1.0 - inter_cluster_similarity))
The clustering index is inversely proportional to the inter-cluster similarity but proportional to the intra-cluster similarity. The quantity 1.0 minus the inter-cluster similarity is called the discriminability among clusters. The clustering index was established as a benchmark against which the efficacy of clustering can be measured. To evaluate a text clustering method with it, labelled samples must be prepared, because the method uses external information for evaluation, as mentioned earlier. In the formula, discriminability takes the place of inter-cluster similarity.
CI = (2 · intra_cluster_similarity · discriminability) / (intra_cluster_similarity + discriminability)
Similarity inside a cluster and discrimination across clusters are analogous to recall and precision in the F1 metric. We may also use the clustering index to fine-tune clustering results by calculating the two metrics based on cosine similarity instead of the target labels.
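A minimal sketch of the clustering index computed from target labels and crisp cluster assignments is given below. It uses the binary label similarity described above and averages over item pairs, which is one possible reading of the procedure; the helper names are ours, not from the original source.

from itertools import combinations

def label_similarity(label_a, label_b):
    # Binary similarity: 1 if the two texts share a target label, else 0
    return 1.0 if label_a == label_b else 0.0

def clustering_index(cluster_labels, target_labels):
    intra, inter = [], []
    for i, j in combinations(range(len(target_labels)), 2):
        sim = label_similarity(target_labels[i], target_labels[j])
        if cluster_labels[i] == cluster_labels[j]:
            intra.append(sim)
        else:
            inter.append(sim)
    intra_sim = sum(intra) / len(intra) if intra else 0.0
    inter_sim = sum(inter) / len(inter) if inter else 0.0
    discriminability = 1.0 - inter_sim
    denominator = intra_sim + discriminability
    return 2 * intra_sim * discriminability / denominator if denominator else 0.0

# Example: texts labelled A, A, B, B clustered perfectly into two clusters
print(clustering_index([0, 0, 1, 1], ["A", "A", "B", "B"]))  # prints 1.0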
Now, we will discuss how to use the clustering index to rank the quality of the crisp
clustering output. The clustering index was used for the basic clustering evaluation. Here,
we show how to calculate the clustering index and use it to analyze the outcomes of crisp
clustering.
A simplified example of crisp clustering results is shown in Fig. 6.15. The test collection begins with the target labels A, B, and C, and the figure shows the texts belonging to the three clusters C1, C2, and C3.
Let us now discuss binary clustering, in which the data points are separated into two clusters. We calculate the intra-cluster similarity for each of the two clusters (averaging them to obtain the overall intra-cluster similarity) and then calculate the inter-cluster similarity between the pair of clusters. Figure 6.16 shows an example of binary clustering.
Fig. 6.16 Binary clustering of the labelled texts: Txt-01-A, Txt-02-A, Txt-03-B, Txt-04-B in one cluster and Txt-05-A, Txt-06-A, Txt-07-C in the other
Let us now discuss the outcomes of multiple clustering, where more than two groups are identified. Compared with binary clustering, the intra-cluster similarity is calculated in the same way for each cluster, while the inter-cluster similarity is obtained by averaging the similarities over every pair of clusters. The maximum number of pairings that may be made from m clusters is m(m - 1)/2, and the inter-cluster similarity is calculated for each pair; clustering metrics, such as the clustering index, are then computed by averaging the inter-cluster similarities of all feasible pairings of clusters. Except for the need to generate all possible pairings of clusters when calculating the inter-cluster similarities, the procedure for assessing results from multiple clustering is identical to the binary case. Finally, labelled text sets are used to test text clustering tools, and the clustering index discussed here may be used to evaluate text clustering algorithms on such sets of tagged texts.
The quality of clustering solutions may be measured quantitatively using clustering
indices. They are useful for evaluating the clusters’ cohesiveness, spatial separation, and
overall efficiency. Clustering indices are becoming popular and include:
• Dunn index: The Dunn index is the ratio of the smallest distance between clusters to the largest cluster diameter. Larger values indicate more compact, better-separated clusters.
• Calinski-Harabasz index: Dispersion between clusters is measured in relation to dispersion within clusters. A good clustering increases the distance between clusters while decreasing the spread inside each cluster, which yields a higher index.
• Rand index: Similarity between two datasets, such as the clustering solution and the
ground-truth labels, may be evaluated using the Rand index. It calculates the percent-
age of sample pairings that share a cluster assignment in both the solution and the
ground truth.
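Several of these indices are implemented in scikit-learn; a quick sketch on the toy points used earlier in this section (cluster assignments assumed, recent scikit-learn assumed):

import numpy as np
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X = np.array([[4, 5], [5, 6], [3, 4],      # cluster 0
              [1, 4], [4, 1], [4, 2],      # cluster 1
              [3, 1], [2, 0], [0, 2]])     # cluster 2
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

print("Silhouette:", silhouette_score(X, labels))
print("Davies-Bouldin (lower is better):", davies_bouldin_score(X, labels))
print("Calinski-Harabasz (higher is better):", calinski_harabasz_score(X, labels))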
In machine learning and text clustering, one of the most important steps is parameter tun-
ing, often called hyperparameter optimization. Finding the appropriate settings for a mod-
el’s or algorithm’s parameters to maximize output and generalization is the goal of this
process. Here we will provide a comprehensive overview of parameter tuning.
The success of a model or algorithm is highly sensitive to the settings of its hyperparameters, which makes parameter tuning essential. Better accuracy, quicker convergence, less time spent training, and more reliable clustering outcomes are all possible with fine-tuned parameters. In this way, the algorithm can be tuned to capture the underlying patterns in the text data as well as possible.
Different hyperparameters influence the behavior and efficiency of text clustering algorithms. Frequent examples include the number of clusters (for K-means or AHC), the linkage criterion and distance metric (for AHC), the neighborhood radius and minimum number of points (for DBSCAN), and the settings of the feature extraction step, such as the TF-IDF vocabulary size or n-gram range. Common strategies for searching the hyperparameter space include the following:
• Grid search: By defining a range of values for each parameter, as in a grid search, the
parameter space may be thoroughly explored. The model’s performance under each
permutation is measured, and the optimal parameters are chosen accordingly.
• Random search: To evaluate the efficacy of a model, random search chooses parameter
values at random from a set of bounds. When the search space is vast, it provides a
faster alternative to grid search.
• Bayesian optimization: In Bayesian optimization, the goal function is modelled proba-
bilistically, and then promising parameter combinations are selected repeatedly. For
faster and more accurate optimum value discovery, it zeroes in on promising portions
of the parameter space.
• Genetic algorithms: To develop a population of possible parameter configurations,
genetic algorithms use natural selection and genetic processes (such as mutation and
crossover). To find optimal parameter values, it repeatedly develops and optimizes the
population.
• Evaluation metrics for parameter tuning: During the tuning process, it is necessary to compare the various parameter settings. Metrics often used for text clustering assessment include the silhouette coefficient, the Davies-Bouldin index, the Calinski-Harabasz index, and, when labelled data are available, the adjusted Rand index.
Text clustering methods perform at their best when their parameters are finely tuned. By carefully selecting and fine-tuning the hyperparameters, researchers and practitioners can optimize the clustering process, improve cluster quality, and boost the overall performance of text clustering models. A small illustration of grid search over the number of clusters is given below.
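The following sketch illustrates the grid search strategy described above: it tunes the number of clusters for K-means on TF-IDF features by maximizing the silhouette score (our illustration with scikit-learn, not the authors' code).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

texts = ["This is an example text about sports.",
         "I love reading books and literature.",
         "The stock market is experiencing a downturn.",
         "The latest fashion trends for summer.",
         "The new smartphone model has been released."]

X = TfidfVectorizer().fit_transform(texts)

best_k, best_score = None, -1.0
for k in range(2, 5):                      # candidate values of the hyperparameter
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)    # evaluate each setting
    if score > best_score:
        best_k, best_score = k, score

print("Best number of clusters:", best_k, "silhouette:", round(best_score, 3))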
6.7 Summary
Clustering textual data into meaningful groups or categories based on common features is
the goal of text clustering, which is an active area of research under machine learning and
natural language processing (NLP). The objective is to automatically recognize patterns,
topics, or subjects in massive amounts of unstructured text data to meticulously organize
and retrieve that information. Evaluation of the clustering results is essential for judging the accuracy, precision, and applicability of the outcomes. Methods including internal and external assessments, cluster validation strategies, and clustering indices are all part of this process.
6.8 Exercises
Q1: Consider the following table. It shows the text in the first column. Map each text to
the appropriate cluster, and mention the cluster name in the second column.
• Sports
• Politics
• Country
• Movie
1. K-Means
2. AHC algorithm
3. Decision trees
4. DBSCAN
5. Naïve Bayes
Q4: Consider the K-means clustering algorithm. If the initial value of K is set to “2”, how
many clusters will be formed? Also, provide reasons for your answer.
Q5: Explain the difference between the crisp clustering and the fuzzy clustering.
Q6: Consider the following table, and identify which item belongs to which cluster.
Consider that “Cluster 1”, “Cluster 2”, “Cluster 3”, “Cluster 4”, and “Cluster 5” are
the names of the clusters.
Text Cluster
Germany is one of the most beautiful countries in the world
Football is one of my favorite games
National Congress party is expected to win the current elections
Terminator 2 was one of my favorite movies
Cricket world cup venue has not been decided yet
I cannot ensure who will win the next national elections
Q7: What is the difference between the single-view and multiple-view clustering?
Q8: Consider the following data points.
Point x y
P1 1 3
P2 4 7
P3 2 7
P4 2 4
P5 6 8
P6 1 6
P7 8 6
P8 3 9
P9 2 9
P10 1 7
If P5 is considered to be the central point, then calculate the similarity of each point with P5 using any similarity measure.
Q9: What is the difference between the inter-cluster and intra-cluster similarity?
Q10: Consider the following code. Execute it, and check the output.
# Example usage
texts = ["This is an example text about sports.",
"I love reading books and literature.",
"The stock market is experiencing a downturn.",
"The latest fashion trends for summer.",
"The new smartphone model has been released."]
classifier = YourTrainedTextClassifierModel()
result = check_classes(texts, classifier)
print("Classes found in the text categorization system:")
for class_name in result:
    print(class_name)
Q11: Differentiate between the hierarchical and flat clustering. Which one is more effec-
tive in terms of applications?
Q12: Write down a scenario where the clustering can be used as a method to assign text
labels. After this, use this labelled text as training data, and assign the labels to some
test data. Note that you can take the data from any source.
Recommended Reading
• Intelligent Text Categorization and Clustering by Nadia Nedjah, Luiza Macedo
Mourelle, Janusz Kacprzyk, Felipe M. G. França, Alberto Ferreira De Souza
Publisher: Springer Berlin, Heidelberg
Publication Year: 2008
This book features a rich tapestry of research projects that have been carefully chosen
for intelligent text clustering and categorization. As we begin this investigation, a sneak
peek at the chapters that lie ahead reveals a varied panorama of cutting-edge approaches
and insights, each of which contributes to the development of this dynamic area. We
will delve into these novel methods in the chapters that follow, illuminating the future of intelligent text processing as we go.
• Feature Selection and Enhanced Krill Herd Algorithm for Text Document Clustering by
Laith Mohammad Qasim Abualigah
Publisher: Springer Cham
Publication Year: 2019
The text document (TD) clustering difficulty is addressed in this book using a novel
strategy that is elaborately built in two stages: (i) In the first step, a novel feature selec-
tion technique is introduced. It makes use of a particle swarm optimization algorithm
along with a cutting-edge weighting methodology. This strategy produces a refined
subset of highly informative features in a low-dimensional space when used in con-
junction with a thorough dimension reduction technique. The performance of the text
clustering (TC) algorithm is then improved by using this optimized subset, which also
speeds up processing. The K-means clustering algorithm is used to judge the efficacy of
these subsets. (ii) The creation of four distinct Krill Herd algorithms (KHAs), includ-
ing the basic KHA, modified KHA, hybrid KHA, and multi-objective hybrid KHA, is
shown in the second stage. These algorithms show incremental improvements over ear-
lier iterations. Seven benchmark text datasets with various characterizations and com-
plexities are used in the evaluation procedure to verify the efficacy of these methods.
7 Text Summarization and Topic Modeling
Text summarization and topic modeling are two further tasks that have become important, especially in the era of social media. Business organizations and analytical firms use text summarization and topic modeling to extract the themes of documents. In this chapter, we will explain both topics in detail, with every step illustrated by examples. Implementations in Python, with complete details of the source code, will also be provided.
Text summarization is the process of extracting the summary of the text automatically
using an algorithmic approach. The summary is normally a shorter text, but it contains the
core of the text. The summarization can be performed both manually and automatically. In
the manual approach, a single person summarizes the text. In the automated approach, we
use algorithms to get the summary. In both cases, we pick the important portion of the text
instead of the entire text. For example, consider the following paragraph:
Text mining is an important field for extracting the information from the text. Using text
mining, we get the text and perform some preprocessing tasks e.g. POS tagging, stop word
removal, Named Entity Recognition, etc. After this we perform different operations, for
example, converting the text to vectors in order to feed it as input to any machine learning
algorithm. Once the text is processed, we can get the information from the text e.g. the
information about the subject of the text, or any event mentioned in the text.
Now, after summarizing the above paragraph, the summary text may be as follows:
In text mining, we preprocess the text and then process it to get the information present
in the text.
As you can see, we have skipped the irrelevant details from the text such as the steps of
preprocessing and the type of information that can be extracted. This is what text summa-
rization is all about. It helps us extract the important information from the text and skip the
irrelevant details.
Both the manual and the automated approaches have their own advantages and disad-
vantages. For example, the manual approach has more accuracy as compared to the auto-
mated approach; however, the manual approach is more tedious and difficult as it is beyond
the human capacity to summarize a text of large size. The automated approaches on the
other hand are less accurate but can be applied to large datasets. The accuracy of the auto-
mated approaches depends on the training data. The accuracy of the automated approaches
is increasing day by day with the availability of more data for training.
Implementation of automated approaches, however, is not an easy task. The majority of
algorithms read the text and break it into sub-portions. Then each portion is classified as a
summary or non-summary. So we can say that text summarization is based on text classi-
fication. The classification performed in this process is binary in nature where there are
two classes—“Summary” and “Non-summary”. The portion of the text that is identified as
non-summary is the one that contains irrelevant details or details that are not important.
On the other hand, the portion that is identified as the summary contains the text that is
important and should be included in the summary of the text, so all such portions are com-
bined and output as the summary of the given text. So we can say that the performance of
the summarization algorithm depends upon its classification accuracy. Several advanced-
level tasks can also be performed on the basis of text summarization, for example, multiple
text summarization in which we are given multiple texts and the output of the process is a
single summary text. Similarly, text summarization can also be performed on the basis of
some query where the output summary text is biased toward the query; this means that the
output text is based on the contents of the query. The produced summary can then be used
to extract the relevant information from the text. This helps in improving the performance
of information retrieval tasks. So we can say that a text summarization process performs the following steps:
• Break the input text into portions, such as paragraphs or sentences.
• Classify each portion as summary or non-summary.
• Combine the portions classified as summary into the final summary text.
Manual text summarization is the process of writing the summary of the given text manu-
ally by a human being. This is equal to the rewriting of the text in its short form. On the
other hand, the automatic process of writing the summary of the text is performed by using
some algorithm. In both cases, the text produced as the summary is the shorter version of
the original text. Although the goal of both approaches is the same, both of these approaches
work in different ways. This is because a computer cannot perform the summarization
process in the same way as is performed by a human being. When text summarization is
performed by a human being, he/she reads the text, understands it, and regenerates the text
in its shorter form. The regenerated text may not contain the sentences present in the origi-
nal text; however, the regenerated text provides a brief description of the provided text.
In automatic text summarization, the input text is given to an algorithm. The algorithm
then divides the text into sub-portions, for example, paragraphs and sentences, and classi-
fies each portion as summary or non-summary. The non-summary portions are neglected,
whereas the text portions that are classified as summary are combined to output the final
summary of the given text.
In the manual approach, the generated summary text is the short version of the original
text and may not contain the sentences from the original text. The automated approach, on
the other hand, selects the sentences or paragraphs as a summary. The manual approach
involves understanding the text and its context, whereas in the automated approach, the
text is partitioned in order to produce a summary of the text. The manual approach is a
difficult and tedious job, whereas the automated approach is easy and is performed by the
computer. The accuracy of the manual approach is higher than that of the automated
approach. Also, the quality of the generated summary is higher in the case of the manual
approach as compared to the automatic one.
We can also use the hybrid approach by combining the manual and the automated
approach. In the hybrid approach, we can use the automated approach to generate the sum-
mary, and then the summary can be revised by the human being. Alternatively, we can use
different algorithms to generate different summaries, which can then be ranked by human
beings to select the best one.
Figure 7.1 shows the pictorial representation of both approaches.
In this section, we will discuss text summarization from a different perspective. The text
that needs to be summarized may comprise a single document or multiple documents.
Summarization of the text comprising a single document is called single-text summariza-
tion, whereas in the scenario where the input text comprises multiple documents, the sum-
marization process is called multiple-text summarization. In both cases, the output text is a single summary. For example, consider the following two paragraphs, which describe the climate of Asia and the weather preferred for cricket:
Asia is a vast and diverse continent with a wide array of climates and weather patterns. Southern regions such as Southeast Asia have tropical climates with high temperatures, heavy
rainfall, and frequent monsoons during certain months. Central Asia tends to have arid
and desert-like conditions with extreme temperature fluctuations between day and night.
East Asia, including countries like China, Japan, and Korea, experiences distinct seasons
with hot and humid summers, cold winters, and mild springs and autumns. South Asia,
which includes countries like India and Pakistan, experiences a combination of tropical
climates, desert climates, and mountainous regions with varying temperatures and pre-
cipitation levels.
Cricket is ideally played in weather conditions that are conducive to a smooth and
competitive match. The preferred weather for cricket includes mild temperatures with little
to no rainfall. In temperate regions, such as England, Australia, and South Africa, the
summer months often provide the most suitable weather for cricket, with warm days and
clear skies. In countries with a tropical climate, like India, Sri Lanka, or the West Indies,
cricket is mostly played during the dry season, as heavy monsoon rains can disrupt
matches during the wet season. Additionally, extreme heat can also pose challenges to
players' endurance and safety. Overall, a balance of warm but not scorching temperatures,
minimal rain, and clear skies make for the best conditions to enjoy and play the sport of
cricket.
Now, the summary of the first paragraph may be as follows:
Asia, a vast and diverse continent, boasts a wide array of climates and weather pat-
terns. Southern regions like Southeast Asia have tropical climates with high temperatures,
heavy rainfall, and monsoons. Central Asia is arid and desert-like with extreme tempera-
ture shifts. East Asia experiences distinct seasons with hot summers, cold winters, and
mild springs and autumns. South Asia combines tropical, desert, and mountainous cli-
mates, leading to diverse temperature and precipitation levels. Overall, Asia’s weather
showcases a rich tapestry of environmental conditions across its different parts.
Similarly, the summary of the second paragraph may be as follows:
Cricket is best played in weather conditions that promote a smooth and competitive
game. Ideal weather includes mild temperatures with minimal rainfall. In temperate
regions like England, Australia, and South Africa, summer offers the most suitable condi-
tions with warm days and clear skies. In tropical countries like India, Sri Lanka, and the
West Indies, cricket is mostly played during the dry season to avoid disruptions from heavy
monsoon rains. Extreme heat can also be challenging for players. Overall, a balance of
warm temperatures, limited rain, and clear skies create the perfect setting for enjoying
and playing cricket.
Now, we can merge both summaries as follows:
Asia, a vast and diverse continent, showcases a wide array of climates and weather
patterns. Southern regions like Southeast Asia experience tropical climates with high tem-
peratures, heavy rainfall, and monsoons. Central Asia is arid and desert-like with extreme
temperature shifts, while East Asia enjoys distinct seasons with hot summers, cold winters,
and mild springs and autumns. South Asia combines tropical, desert, and mountainous
climates, leading to diverse temperature and precipitation levels. Cricket, a sport best
played in conducive weather, thrives in ideal conditions with mild temperatures and
minimal rainfall. In temperate regions like England, Australia, and South Africa, the sum-
mer months provide suitable weather with warm days and clear skies. Conversely, cricket
is played during the dry season in tropical countries like India, Sri Lanka, and the West
Indies to avoid disruptions from monsoon rains. Extreme heat can also pose challenges to
players. Overall, a balance of warm temperatures, limited rain, and clear skies create the
perfect setting for enjoying and playing cricket across Asia's rich tapestry of environmen-
tal conditions.
As discussed earlier, heterogeneous multiple-text summarization is a difficult task, so
in most cases, such summarization is performed manually by human beings.
We can perform text summarization at different levels of abstraction, which means that the summary may contain as much or as little information as we require. In flat text summarization, intermediate summaries are not produced, whereas in hierarchical text summarization, we get intermediate summaries as well. For example, con-
sider an article that contains different headings. Now, it is possible for us to get a single
summary or intermediate summaries on the basis of each heading. In flat text summariza-
tion, we consider the entire article as one document and produce a summary. In hierarchi-
cal summarization, we consider each heading as a separate document, produce a summary
of each document, and finally merge each summary into the final version. These interme-
diate summaries are not allowed in flat text summarization type. So, in hierarchical text
summarization, we may have the original text, intermediate summaries, and the final sum-
mary. On the basis of producing the intermediate summaries, hierarchical text summariza-
tion can be performed in two ways. Firstly, we produce the intermediate summaries on the
basis of the different text headings and merge these summaries into the final summary.
Secondly, instead of producing the intermediate summaries of each heading, we produce
the text clusters on the basis of text similarity in the entire document, and then a single
summary is produced for each cluster. Finally, the summary of each cluster is merged into
the final summary text. This is shown in Fig. 7.2. The cluster-based approach seems more natural and tends to produce more accurate summaries, because a document may contain similar text spread across different portions of the document rather than under a single heading. However, this approach may be computationally less efficient, and the final summary depends on the accuracy of the clustering algorithm used.
With respect to the level of abstraction, we can say that there are only two levels of
abstraction in flat text summarization, i.e., the original and the summary text. On the other
hand, in hierarchical text summarization, we may have different levels of abstraction
depending on the intermediate summaries.
If we compare both approaches, flat text summarization is easier than the hierarchical one. Flat text summarization is typically performed on text documents of small size, whereas hierarchical text summarization can be performed on large texts whose content covers multiple perspectives. Both approaches can also be automated.
Fig. 7.2 Flat text summarization versus hierarchical text summarization with intermediate sub-summaries
In this section, we will discuss two other criteria to summarize the text. One is the
abstraction-based summarization, and the other is the query-based text summarization. In
the abstraction-based approach, we get the general summary on the basis of all the con-
tents present in the text, whereas in the query-based approach, the summary is obtained on
the basis of a provided query. This means that we give a query, and the information is
obtained according to the topic mentioned in the query. For example, if a text discusses the
politics of different countries, and we ask for a summary of the politics in Germany, the
summary that will be returned will contain the text related to politics in Germany only, and
all the other text will be skipped.
In an abstraction-based approach, we get a general view of the contents. This can also
be considered equal to the single-view summarization type that can either be performed
manually or automatically. For example, consider the following text:
Football and cricket are two popular sports. Each has its own unique features that
attract fans throughout the world. Football is also known as soccer in a few parts of the
world. In this game, the teams compete with each other to score goals. This is done by
moving the ball to the opposite team's net. This can be done by using any part of the body
except the hands and arms. It involves teamwork and strategic moves which make it so
exciting for the fans. Cricket, on the other hand, is a bat-and-ball sport. It can be said as
slower than football. The time duration is also large compared to the football match. In
cricket, two teams take turns to ball and bat. The ultimate goal is to score more runs as
compared to the opposite team while defending the wickets. The game requires individual
efforts along with teamwork. It has various formats, such as Test matches, One-Day
Internationals (ODIs), and Twenty20 (T20). Both sports foster a strong sense of camara-
derie and instill valuable life skills, making them beloved pastimes globally.
Now, if we summarize it from the point of view of cricket (i.e., using the query that
requires the summary with respect to cricket), the following may be the output:
Cricket is a bat-and-ball sport that is slower in pace and longer in duration compared
to football. Two teams take turns to bat and ball, alternatively aiming to score more runs
than the opposite team. The game requires both individual effort and teamwork having
various formats, including Test matches, ODIs, and T20 matches.
So far, we have seen different types of text summarization. Here is the Python code that
summarizes the text automatically:
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

# Sample paragraph
sample_paragraph = ("Text summarization is the process of extracting the summary "
                    "of the text automatically using some algorithmic approach. "
                    "The summary is normally a shorter text but it contains the core "
                    "of the text. The summarization can be performed both manually "
                    "and automatically. In the manual approach, a single person "
                    "summarizes the text. In the automated approach, we use some "
                    "algorithm to get the summary. In both cases, we pick the "
                    "important portion of the text instead of the entire text.")
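The scoring and selection steps can be completed, for example, with a simple frequency-based heuristic such as the following sketch. This is our completion (it continues the block above and is not necessarily the authors' exact code), and it produces the kind of extractive output shown below.

nltk.download('punkt')
nltk.download('stopwords')

# Split into sentences and compute word frequencies (ignoring stop words)
sentences = sent_tokenize(sample_paragraph)
stop_words = set(stopwords.words('english'))
words = [w.lower() for w in word_tokenize(sample_paragraph)
         if w.isalnum() and w.lower() not in stop_words]
freq = FreqDist(words)

# Score each sentence by the summed frequency of its words
scores = {s: sum(freq[w.lower()] for w in word_tokenize(s)) for s in sentences}

# Keep the two highest-scoring sentences, in their original order
top = sorted(sorted(scores, key=scores.get, reverse=True)[:2], key=sentences.index)
print(' '.join(top))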
Note that this is a basic approach; more advanced approaches can be used as well. Furthermore, having a larger corpus can also increase the accuracy.
The output of the code will be as follows:
Text summarization is the process of extracting the summary of the text automatically
using some algorithmic approach. The summary is normally a shorter text but it contains
the core of the text.
These are the simplest approaches, which follow some heuristics to summarize the text. As discussed earlier, text summarization can also be done through classification. The following is the pseudo code of an algorithm that performs summarization as a classification task:
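A minimal sketch of such a procedure, written here as Python-style pseudo code, is given below. The classifier and the paragraph-splitting helper are assumed to exist (both are hypothetical names, not from the original source).

def summarize_by_classification(document, classifier, split_into_paragraphs):
    """Classify each paragraph as Summary/Non-summary and keep the former."""
    summary_parts = []
    for paragraph in split_into_paragraphs(document):
        label = classifier.predict([paragraph])[0]   # "Summary" or "Non-summary"
        if label == "Summary":
            summary_parts.append(paragraph)
    return "\n".join(summary_parts)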
Fig. 7.3 Summarization as classification: the paragraphs of a text document are passed to a classifier, and the paragraphs classified as summary are combined into the output summary
It can be seen that the process uses classification as an intermediate step to perform the
summarization process. To mark each paragraph as summary or non-summary, we can use
different approaches, e.g., keywords, training datasets, predefined phrases, etc.
Figure 7.3 shows the process.
To use the classification as a summarization process, we need the training dataset com-
prising the paragraphs and the labels. Once these are available, we can train the machine
learning algorithm. For this purpose, the text is converted into vectors using any natural
language processing (NLP) approach and fed to the model. Once the model is trained, we
can use it to classify unknown text. It should be noted that the output may vary, as it depends on the training data and the classification algorithm used.
It should be noted that text summarization is different from topic modeling. Apparently,
both seem to be the same; however, there are certain differences. In classification-based
summarization, each individual paragraph is labelled as “Summary” or “Non-Summary”;
however, in topic modeling, the entire text is given a topic as a label. Furthermore, in text
summarization, the classification task performed is an instance of binary classification,
whereas in topic modeling it may be a multi-label (multi-class) problem, where a single text can be assigned more than one label.
Text classification performed in text summarization is a flat classification type, whereas
in the case of topic modeling, it may be the case of hierarchical text classification.
We can also use regression for text summarization. In this case, each paragraph is
assigned a relevance score that shows its subjectivity or the level of abstraction of the
paragraph. The paragraphs with a certain level of score can be selected as summary. The
scheme is more flexible as compared to the one that uses classification. We can select the
summary at different levels of abstraction depending on the relevance score, so we can obtain a brief summary as well as a more detailed one.
As discussed earlier, we need training data for performing classification in order to sum-
marize the text. This training data comprises sample paragraphs and their labels. This
means that initially, we need to have these paragraphs along with their labels so that we
could apply the machine learning algorithm. Here in this section, we will discuss how we
can collect sample paragraphs. It should be noted that the process of collecting these
sample paragraphs depends upon the domain, so a paragraph that is labelled as “Summary”
may not be labelled the same in the case of other domains.
Most of the time, this process is performed manually, i.e., we read the paragraphs one
by one and label them each. As it can be seen, performing this process manually is a
tedious job, so we can automate it by using text categorization at the paragraph level. We
cluster the text on the basis of the similarity, and then in each cluster, the paragraphs are
labelled with a topic name. This gives us information about the domain of the paragraph.
Once all the paragraphs are labelled, we can then mark each paragraph as a “Summary” or
“Non-summary”. Once the paragraphs are marked as “Summary” or “Non-summary”, we
can use them as the training data to train any machine learning classification algorithm.
Now, let us discuss some of the tasks that can be based on text summarization. The first
one is the summary-based classification, i.e., we perform classification on the basis of the
summary of the text. In this process, first, we extract the clusters from each document.
Now, the text in each cluster is partitioned into paragraphs, and then each paragraph is
labelled with a predefined category or class. Once the paragraphs are labelled, each para-
graph is marked as “Summary” or “Non-summary”. The summary paragraphs are then
taken and used as the training dataset for classification. The process seems to be complex;
however, it enhances the accuracy of the classification as the irrelevant text is removed.
The overall process involves the following steps:
• Data collection: gather a dataset of text documents along with their corresponding sum-
maries and predefined categories or labels.
• Preprocessing: We clean and preprocess the text to remove noise, punctuation, and stop words, and we perform tokenization.
• Feature extraction: We convert the text into numeric values that can be given as input to
the classification model. Different methods can be used, e.g., TF-IDF, word embed-
ding, one hot encoding, etc.
• Model training: We use different machine learning or deep learning algorithms to train
a classifier on the summary data. Common algorithms that can be used are support vec-
tor machines (SVM), random forest, deep learning models like recurrent neural net-
works (RNNs), or transformer-based models like BERT.
• Evaluation: Once the classification is performed, we assess the performance of the clas-
sifier using evaluation metrics. Different evaluation metrics can be used such as accu-
racy, precision, recall, and F1 Score.
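A compact end-to-end sketch of these steps, assuming scikit-learn and a small hypothetical labelled set of summary paragraphs, is given below.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical summary paragraphs with predefined category labels
summaries = ["Cricket is a bat-and-ball sport played in several formats.",
             "The stock market saw a sharp downturn this quarter.",
             "Football teams compete to score goals with teamwork.",
             "Interest rates were raised to curb inflation."]
categories = ["Sports", "Finance", "Sports", "Finance"]

X_train, X_test, y_train, y_test = train_test_split(
    summaries, categories, test_size=0.5, random_state=0, stratify=categories)

# TF-IDF features followed by a linear SVM classifier
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test), zero_division=0))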
Summary-based classification can be used for large text and can reduce the perfor-
mance overhead caused by computations. However, the loss of context information may
sometimes lead to misclassification. So, we can say that the quality of the intermediate
summaries plays a crucial role in the case of summary-based text classification. Similarly,
we can perform summary-based text clustering as well. In this process, we take the origi-
nal text and calculate the summaries of the paragraphs. Once the summaries are extracted,
we perform clustering on the summaries instead of the original text. This saves a lot of
computational power as the summaries do not contain irrelevant text.
Following are some of the benefits of the summary-based clustering approach:
• Reduced computational cost, since the summaries are much shorter than the original documents.
• Less noise, because the irrelevant portions of the text are excluded before clustering.
• Easier interpretation and navigation of the resulting clusters.
Overall, text clustering using summaries provides a more effective approach for organizing and navigating large corpora, leading to improved speed, quality, and user experience.
Figure 7.4 shows the clustering process based on summarization.
Similar to text clustering, we can perform text expansion as well. In this process, we
take the short text that may be the summary and expand it with the relevant text. This pro-
cess is the opposite of summarization. The process works as follows: we need a text corpus that contains candidate passages together with associated words or phrases describing their contents. The relevant words taken from the original short text are then used as a query to search for related text in the corpus.
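A minimal sketch of this retrieval step, assuming a small in-memory corpus and TF-IDF cosine similarity (the corpus, threshold, and variable names are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["Text mining extracts useful information from unstructured text.",
          "Preprocessing includes POS tagging, stop word removal and NER.",
          "Football is a team sport played with a round ball."]

short_text = "In text mining, we preprocess the text and then extract information."

vectorizer = TfidfVectorizer()
corpus_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform([short_text])

# Rank corpus documents by similarity to the short text and keep the best matches
scores = cosine_similarity(query_vector, corpus_vectors).ravel()
expansion = [corpus[i] for i in scores.argsort()[::-1] if scores[i] > 0.1]
print(short_text + " " + " ".join(expansion))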
In our implementations, we will find several mathematical and ML-based concepts useful. Although you may already be familiar with some of these ideas, we will briefly revisit them to ensure you have all the necessary background. Additionally, we will explore concepts from natural language processing within this section.
7.4.1 Documents
A document typically represents a piece of text, with headers and additional information.
On the other hand, a corpus consists of a collection of these documents. These documents
can vary from sentences to complete paragraphs containing information. When we men-
tion a tokenized corpus, we mean a compilation of documents where each document has
been divided into tokens, usually representing words.
Text normalization involves the process of organizing and standardizing data through
techniques such as removing symbols and characters, excluding unnecessary HTML tags,
eliminating common words, fixing spelling errors, word stemming, and lemmatization.
The process of extracting valuable properties or features from the original textual material
so they can be used by a statistical or machine learning method is referred to as feature
extraction. Since the initial text pieces are typically converted into numerical vectors, this
technique is also known as vectorization. The justification for this conversion is that while
traditional algorithms work well with numerical vectors, they struggle to handle raw text.
There are many methods for feature extraction, such as binary features based on the bag-
of-words approach, which determines whether a document contains a certain word or
phrase. Frequency-based bag of words, which convey the frequency of word occurrences
inside a document, is another way. The use of term frequency-inverse document frequency (TF-IDF)-weighted features, which consider both a term's frequency within a document and its rarity across documents and hence its relevance, is a widely used method.
A feature matrix is a common tool for converting a collection of documents into features.
Each row represents a document, and each column represents a particular feature, typi-
cally a word or group of words. Once these traits have been retrieved, we will utilize these
matrices to represent collections of documents or phrases.
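As a small illustration, a TF-IDF feature matrix can be produced as follows (the documents are illustrative, and a recent scikit-learn version is assumed):

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Text summarization extracts the core of a text.",
             "Topic modeling discovers the themes of documents."]

vectorizer = TfidfVectorizer()
feature_matrix = vectorizer.fit_transform(documents)   # one row per document

print(feature_matrix.shape)                 # (2, number_of_features)
print(vectorizer.get_feature_names_out())   # the columns (words)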
Singular value decomposition (SVD), a concept from linear algebra, has applications well beyond numerical computing, particularly in the area of information summarization. The approach decomposes a real or complex matrix into its basic elements. Let us use a matrix M with m rows and n columns to express it more explicitly. In the language of mathematics, SVD depicts M as the composition
M = U · S · V^T
where the columns of U and V are orthonormal (U and V are orthogonal) and S is a diagonal matrix whose entries are the singular values of M. The singular values found in S are very important in summarization algorithms. We will use singular value decomposition in the following code.
import numpy as np
from scipy.linalg import svd
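A minimal sketch of such a function, our reconstruction based on the description that follows, is:

def truncated_svd(matrix, k):
    """Compute the SVD of `matrix` and keep only the top k singular values."""
    U, s, Vt = svd(matrix, full_matrices=False)
    U_k = U[:, :k]                 # first k left singular vectors
    S_k = np.diag(s[:k])           # top k singular values on the diagonal
    Vt_k = Vt[:k, :]               # first k right singular vectors
    return U_k, S_k, Vt_k

# Example: low-rank approximation of a small term-document matrix
M = np.random.rand(5, 3)
U_k, S_k, Vt_k = truncated_svd(M, k=2)
approximation = U_k @ S_k @ Vt_k   # rank-2 approximation of M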
The function truncated_svd, which accepts a matrix and a value k as input, is defined in
this code. It uses the Scipy svd function to do the singular value decomposition and then
keeps the top k singular values to create the truncated matrices U, S, and Vt (Fig. 7.5).
The retention of only k singular values gives a low-rank approximation of the original matrix while preserving most of its structure. Figure 7.5 illustrates how the SVD technique decomposes the initial matrix M (an m × n matrix) into its components U (m × r), S (r × r), and Vt (r × n). In our calculations, the rows of matrix M represent terms, while the columns represent documents. This particular matrix, which usually appears after feature extraction, is also known as the term-document matrix. Before applying the SVD operation, this procedure often entails converting a document-term matrix into its transpose.
The primary stages carried out during text normalization encompass the following actions:
• Extraction of sentences
• Conversion of HTML escape sequences to their original forms
• Expansion of contractions
• Lemmatization of text
• Elimination of special characters
• Exclusion of stop words
In step 1, we take a text file, remove the newlines, parse the content, convert it to ASCII
format, and then separate the text into individual sentences. The function is presented in
the following Python code:
import re

def extract_sentences_from_document(document):
    """
    Extracts sentences from a given document.
    Args:
        document (str): Input document containing multiple sentences.
    Returns:
        list: A list of extracted sentences.
    """
    sentences = re.split(r'[.!?]', document)
    sentences = [s.strip() for s in sentences if s.strip()]
    return sentences
# Example usage
document = ("This is a sample document. It contains multiple sentences. "
            "Each sentence is separated by punctuation marks like periods, "
            "exclamation marks, and question marks.")
sentences = extract_sentences_from_document(document)
for i, sentence in enumerate(sentences):
    print(f"Sentence {i + 1}: {sentence}")
The second step deals with escaping or decoding specific HTML characters. The fol-
lowing function will be used:
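A minimal version based on Python's standard html module (our assumption of the intended implementation) is:

import html

def unescape_html(text):
    """Convert HTML escape sequences (e.g. &amp;, &lt;) back to characters."""
    return html.unescape(text)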
# Example usage
html_text = "Visit &lt;a href=&quot;https://example.com&quot;&gt;this link&lt;/a&gt;."
plain_text = unescape_html(html_text)
print(plain_text)
def normalize_corpus(corpus, lemmatize=True, tokenize=False):
    # Helper functions such as lemmatize_text, remove_special_characters,
    # remove_stopwords, and tokenize_text are assumed to be defined earlier
    normalized_corpus = []
    for text in corpus:
        if lemmatize:
            text = lemmatize_text(text)
        else:
            text = text.lower()
        text = remove_special_characters(text)
        text = remove_stopwords(text)
        if tokenize:
            text = tokenize_text(text)
        normalized_corpus.append(text)
    return normalized_corpus

# Example usage
corpus = ["Some <b>HTML</b> text with contractions isn't it?"]
normalized_corpus = normalize_corpus(corpus)
print(normalized_corpus)
Keyphrase extraction is a simple yet very efficient technique for extracting key concepts
from unstructured text. The process of extracting important terms or phrases from a body
of unstructured text is known as keyphrase extraction or terminology extraction. The out-
come is that these key phrases include the main ideas or themes of the text document(s).
This technique, which falls under the broad category of information retrieval and
extraction, has a variety of uses in a range of fields. It frequently serves as the initial stage of more complex tasks in natural language processing (NLP) and text analytics, and the extracted keyphrases often serve as features for those more complex systems.
7.5.1 Collocations
A collocation is a series of words or a set of words that frequently occur together, more
frequently than would be expected by chance or randomness. The grammatical functions
of the terms that make up these collocations, such as nouns, verbs, and others, define their
varied forms. There are many ways to extract collocations, but one of the best is to use a
strategy based on n-gram grouping or segmentation. With this method, we create n-grams
from a corpus, add up their frequencies, and then arrange them according to frequency to
find the n-gram collocations that occur most frequently. The core idea starts with a
group of documents, which could be sentences or paragraphs. These documents are segmented
into sentences, which are subsequently concatenated into one long string. Based
on the selected n-gram range, a sliding window of size “n” is applied across this string.
N-grams are produced across the string as a result of this operation. Each n-gram’s fre-
quency of occurrence is calculated, totaled, and then ordered according to that frequency.
This method reveals the collocations that are most frequently used based on frequency.
Here is the code:
import nltk
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('stopwords')
nltk.download('wordnet')
# Load the Alice in Wonderland text from the Gutenberg corpus
alice_sentences = sent_tokenize(gutenberg.raw('carroll-alice.txt'))

# Combine the sentences into a single string
alice_text = ' '.join(alice_sentences)

# Tokenize the text
tokens = word_tokenize(alice_text)

# Remove punctuation and convert to lowercase
translator = str.maketrans('', '', string.punctuation)
clean_tokens = [token.translate(translator).lower() for token in tokens]

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in clean_tokens if token not in stop_words]

# Lemmatize the tokens
lemmatizer = WordNetLemmatizer()
normalized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

# Print the first 10 normalized tokens as an example
print(normalized_tokens[:10])
def flatten_corpus(corpus):
    """
    Flattens a list of sentences into a single string.

    Args:
        corpus (list): List of sentences.

    Returns:
        str: Flattened string containing all the sentences.
    """
    return ' '.join(corpus)

# Example usage
alice_flattened = flatten_corpus(alice_sentences)

# Print the first 200 characters as an example
print(alice_flattened[:200])
Next, we define a function that computes n-grams of a given degree from a list of tokens:

def compute_ngrams(tokens, n):
    """
    Computes n-grams from a list of tokens.

    Args:
        tokens (list): List of tokens.
        n (int): Degree of the n-gram (e.g., 1 for unigram, 2 for bigram, etc.).

    Returns:
        list: List of n-grams.
    """
    ngrams = [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
    return ngrams
# Example usage
example_tokens = ["this", "is", "a", "sample", "sentence"]
unigrams = compute_ngrams(example_tokens, 1)
bigrams = compute_ngrams(example_tokens, 2)
print("Unigrams:", unigrams)
print("Bigrams:", bigrams)
The compute_ngrams function in this code accepts an input list of tokens along with the
value of n. To produce n-grams of the desired degree, it then loops through the collection
of tokens. The use case example shows how to compute unigrams and bigrams from a
token example list. Here is the output of this function:
Unigrams: [('this',), ('is',), ('a',), ('sample',), ('sentence',)]
Bigrams: [('this', 'is'), ('is', 'a'), ('a', 'sample'), ('sample', 'sentence')]
When given a list of tokens (example_tokens), the function compute_ngrams creates
both unigrams and bigrams based on the input. The output shows the bigrams and uni-
grams that were produced using the sample token list.
If we want to get the top n-grams, we will write the function as:
from collections import Counter

def get_top_ngrams(tokens, n, top=10):
    """
    Retrieves the most frequent n-grams from a list of tokens.

    Args:
        tokens (list): List of tokens.
        n (int): Degree of the n-gram (e.g., 1 for unigram, 2 for bigram, etc.).
        top (int): Number of top n-grams to retrieve.

    Returns:
        list: List of top n-grams with their frequencies.
    """
    ngrams = compute_ngrams(tokens, n)
    ngram_freq = Counter(ngrams)
    top_ngrams = ngram_freq.most_common(top)
    return top_ngrams
# Example usage
example_tokens = ["this", "is", "a", "sample", "sentence",
"this", "is", "another", "sentence"]
top_bigrams = get_top_ngrams(example_tokens, 2, top=2)
print("Top Bigrams:", top_bigrams)
The get_top_ngrams function in this code requires a list of tokens, the degree n of the
n-gram, and the desired number of top n-grams. It determines the frequency of each n-gram
using the Counter class from the collections module before retrieving the top n-grams.
The usage example shows how to extract the top bigrams from an example token list.
Similarly, we can compute the top trigrams and bigrams using our corpus.
Let us look at NLTK’s collocation finders now. These interesting tools enable us to
identify those word friends who frequently interact with one another. We also have choices
like raw frequencies and pointwise mutual information in our toolbox, so we’re not just
counting. The idea behind pointwise mutual information is as follows: we calculate the log
of the likelihood that the two events occur together relative to the likelihood that they
would occur independently. It can be represented mathematically as:

pmi(x, y) = log [ p(x, y) / (p(x) p(y)) ]
import nltk
from nltk.corpus import gutenberg
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
Collocations in the words from “Alice in Wonderland” are found using NLTK’s
BigramCollocationFinder in this code. The BigramAssocMeasures class offers a variety
of metrics, including raw frequency, pointwise mutual information, etc., to evaluate the
importance of collocations. Raw frequency is applied in this illustration. The top 10 col-
locations by raw frequency are determined using the nbest approach and printed.
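A minimal sketch of that idea is given below (it assumes the normalized_tokens list built earlier; the frequency-filter threshold is an arbitrary choice):

# Drop any empty tokens left over from punctuation stripping
tokens_for_collocations = [t for t in normalized_tokens if t]

finder = BigramCollocationFinder.from_words(tokens_for_collocations)
finder.apply_freq_filter(3)            # ignore very rare bigrams

bigram_measures = BigramAssocMeasures()
print(finder.nbest(bigram_measures.raw_freq, 10))   # top 10 by raw frequency
print(finder.nbest(bigram_measures.pmi, 10))        # top 10 by PMI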
import nltk
from nltk.corpus import gutenberg
from nltk.tokenize import word_tokenize
from nltk.chunk import RegexpParser
nltk.download('punkt')
nltk.download('gutenberg')
# Load text from the Gutenberg corpus
emma_words = gutenberg.words('austen-emma.txt')
emma_text = ' '.join(emma_words[:1000])  # Taking a small portion for demonstration
# Tokenize the text
tokens = word_tokenize(emma_text)
# Perform part-of-speech tagging
pos_tags = nltk.pos_tag(tokens)
In this demonstration, we import some text from Jane Austen’s Emma from the
Gutenberg corpus, tokenize it, and tag it with part-of-speech information. Then, using the
RegexpParser, we build a straightforward grammar for extracting noun phrases (NP). We
take out noun phrases and give each one an equal weight. The weighted noun phrases that
are produced are printed.
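A minimal sketch of such a noun phrase extractor is shown below (it assumes the pos_tags produced above; the chunk grammar is one simple choice, not the only possible one):

# A simple NP grammar: optional determiner, any adjectives, one or more nouns
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = RegexpParser(grammar)
tree = chunker.parse(pos_tags)

# Collect the noun phrases and give each one an equal weight
noun_phrases = [' '.join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees() if subtree.label() == 'NP']
weighted_phrases = [(phrase, 1.0) for phrase in noun_phrases]
print(weighted_phrases[:10])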
To find the basic patterns in a collection of documents or texts, topic modeling is an effec-
tive technique in natural language processing and machine learning. Even when the sub-
jects are not explicitly labelled, it enables us to automatically identify the major topics or
themes that are present in a corpus. Finding a group of topics that best captures the content
of a given text corpus is the main objective of topic modeling. Every topic in the corpus is
represented by a combination of words, and each document in the corpus is a representa-
tion of these themes. Techniques for topic modeling offer a way to comprehend how
themes are distributed throughout publications and how words are distributed within
each topic.
The latent Dirichlet allocation, or LDA algorithm, is one of the most widely used meth-
ods for topic modeling. According to LDA, each document is made up of a variety of
subjects and words that are mixed to form each topic. To identify the underlying subjects
and their word distributions, it attempts to reverse-engineer this process.
Here is a brief explanation of how topic modeling operates:
• Tokenization and preprocessing: The text corpus is tokenized into words, and then low-
ercasing, deleting stop words, stemming/lemmatization, and other typical preprocess-
ing techniques are used. A document-word matrix is created, in which each row
represents a document, each column represents a distinct word, and the cells hold the
frequency of each word in the corresponding document.
• Applying the topic modeling algorithm: Topics and associated word distributions that
best describe the data are determined by applying algorithms like LDA to the document-
word matrix.
• Interpreting topics: After getting the topics, you can interpret each topic by looking at
the terms that are most frequently used to describe it. This enables you to give the topics
labels that people can understand. After the topics are learned, documents are represented
as a mixture of those topics, which reveals the primary ideas of each document.
• Applications: Topic modeling is useful for a wide range of tasks, including information
retrieval, sentiment analysis, content recommendation, document categorization, and
spotting patterns in massive text collections.
LDA and other topic modeling algorithms are available as simple-to-use implementa-
tions in Python libraries like Gensim and Scikit-Learn. An example of applying Gensim’s
LDA implementation is given below:
import gensim
from gensim import corpora

# tokenized_documents: a list of token lists prepared during preprocessing

# Create a dictionary from the corpus
dictionary = corpora.Dictionary(tokenized_documents)

# Create a document-term matrix
doc_term_matrix = [dictionary.doc2bow(doc) for doc in tokenized_documents]

# Apply LDA
lda_model = gensim.models.LdaModel(corpus=doc_term_matrix, id2word=dictionary,
                                   num_topics=5, passes=15, random_state=42)

# Print topics
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")
To create topic models, a variety of frameworks and algorithms are available. We'll
discuss the following three techniques: latent semantic indexing (LSI), latent Dirichlet
allocation (LDA), and non-negative matrix factorization (NMF).
The first two techniques are old and popular. The third method, non-negative matrix
factorization, is a relatively new method that is quite successful and produces outstanding
results. In our practical applications, we'll use Scikit-Learn and Gensim. The following
toy corpus will be used to test the topic models:
The corpus above contains eight texts in all; four focus on various animals, while the
other four are about programming languages. This highlights the fact that there are
two distinct themes present in the corpus. We understood that by human intuition; the sec-
tions that follow will seek to derive the same knowledge using computational methods. We'll
apply the frameworks we build for topic modeling to actual Amazon product reviews to
develop themes there as we go.
Latent semantic indexing (LSI), an established technique that dates back to the 1970s, was
initially created as a statistical technique to identify the underlying relationships between
terms in a corpus. LSI has established itself in the fields of search and information retrieval
in addition to text summarization. With the help of the well-known singular value decom-
position (SVD) method, which we previously discussed in the “Important Concepts” sec-
tion, LSI works on the tenet that closely related terms frequently occur together in the
same context.
Here is the code:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
# Download NLTK resources (if not already downloaded)
nltk.download('punkt')
nltk.download('stopwords')
# Toy corpus of eight short documents (the full listing appears in the next code block)
toy_corpus = ["Lions are majestic animals found in the wild.",
              "Python is a popular programming language for data science.",
              ...]
This code loads the required NLTK resources and then tokenizes and normalizes the toy
corpus: it lowercases the text, removes punctuation and stop words, and stems each word.
For review, the normalized corpus is printed. As demonstrated in the code sample, ensure
that you have the NLTK library installed (pip install nltk) and that you have used the nltk.
download() function to obtain the necessary resources. We can also build a dictionary/
vocabulary from the corpus. The following code will be used to build a dictionary that
Gensim utilizes to map distinct terms to specific numeric values.
# Toy corpus
toy_corpus = ["Lions are majestic animals found in the wild.",
              "Python is a popular programming language for data science.",
              "Elephants are known for their intelligence and large size.",
              "Java is widely used for building robust applications.",
              "Tigers are powerful predators with distinctive stripes.",
              "C++ is often used in game development for its performance.",
              "Dolphins are highly intelligent marine mammals.",
              "Python and Ruby are scripting languages used for web development."]

# Initialize NLTK's Porter Stemmer and stopwords
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

# Perform text normalization
normalized_corpus = []
for document in toy_corpus:
    # Tokenize words
    words = word_tokenize(document.lower())
    # Remove punctuation and stopwords, and perform stemming
    normalized_words = [stemmer.stem(word) for word in words
                        if word.isalnum() and word not in stop_words]
    # Join words back to form normalized document
    normalized_document = ' '.join(normalized_words)
    normalized_corpus.append(normalized_document)

# Build a dictionary from the normalized corpus
from gensim import corpora   # needed for the dictionary (not imported above)
dictionary = corpora.Dictionary([doc.split() for doc in normalized_corpus])

# Display the dictionary
print(dictionary)
The created dictionary, which converts each distinct term in the normalized corpus to a
numerical value, will be output by this code. Make sure that in addition to the NLTK
library and the provided resources, the Gensim library is installed.
Here the output of the above code looks like this:
Dictionary(32 unique tokens: ['anim', 'found', 'lion', 'majest', 'wild']…)
The output shows a sample of the dictionary’s terms and notes that it has 32 distinct
tokens (words). Each word has its own distinct ID number.
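With the dictionary in place, an LSI model can be trained on the toy corpus. The following is a minimal sketch (it assumes the dictionary and normalized_corpus built above; two topics are chosen to match the two themes in the corpus):

from gensim import models

# Bag-of-words representation of the normalized toy corpus
bow_corpus = [dictionary.doc2bow(doc.split()) for doc in normalized_corpus]

# Train an LSI model with two latent topics
lsi_model = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)

for topic_id, topic in lsi_model.print_topics(num_topics=2):
    print(f"Topic {topic_id}: {topic}")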
A statistical approach called latent Dirichlet allocation (LDA) is used to identify the
underlying topics in a corpus of documents. According to LDA, each document consists
of a mixture of topics, and each topic is composed of a distribution over words. To put it
another way, LDA represents a topic as a probability distribution over words and a docu-
ment as a probability distribution over topics. For LDA to work, each document must
first have a distribution over topics. The topic distribution is a vector of probabilities, one
for each topic in the corpus, and these probabilities must add up to 1. LDA then produces
a distribution of words for each topic, referred to as the word distribution. The word
distribution is a vector of probabilities, one for each word in the vocabulary, and these
probabilities must also add up to 1. Once the topic and word distributions have been
established, LDA can be used to generate a document. To create the document, LDA first
selects a topic at random from the document's topic distribution and then selects a word at
random from that topic's word distribution. This procedure is repeated until the document
is finished.
LDA can also be used to infer a document's topics. To infer the topics, LDA determines
the likelihood of each topic given the document; the topic with the highest likelihood is
the document's most likely topic. This procedure is repeated for each document in the
corpus. LDA is a potent tool for identifying the underlying themes in a corpus of
documents. It has been applied to many different tasks, such as:
• Text mining: LDA can be used to extract the topics of a corpus of documents. This
information can be used to cluster the documents into collections of related documents
or to enhance the corpus's searchability.
• Recommender systems: LDA can be used to recommend texts based on user interests.
This is achieved by finding documents that are comparable to those the user has already read.
• Machine translation: LDA can be used to boost the precision of machine translation
systems. This is accomplished by determining the topics of the source and target documents
and then using this knowledge to translate the texts more precisely.
Implementation
The following steps will be taken to implement LDA:
1. Data loading
2. Cleaning of data
3. Exploratory analysis
4. Data preparation for LDA analysis
5. LDA model training
6. Analysis of the LDA model results
1. Data Loading
The following code will be used to load data for implementation using NIPS confer-
ence papers:
import nltk
from nltk.corpus import nps_chat
The nltk.corpus.nps_chat module used in this code is NLTK's NPS Chat corpus; it serves
here only as a readily available stand-in for the NeurIPS conference papers. This code can
be modified to suit your requirements, and the real dataset you want to use for LDA analysis
can be substituted for it. As demonstrated in the code sample, ensure that you have the NLTK
library installed (pip install nltk) and that you have used the nltk.download() function to
obtain the necessary resources.
2. Cleaning of Data
The following code snippet will be used to clean the data accordingly:
import nltk
from nltk.corpus import nps_chat
import pandas as pd
# Download NLTK resources (if not already downloaded)
nltk.download('nps_chat')
# Load NeurIPS papers dataset
neurips_papers = nps_chat.fileids()
# Create a DataFrame to store paper text
# (rows are collected in a list because DataFrame.append was removed in pandas 2.0)
rows = []

# Extract text data and populate DataFrame
for paper_id in neurips_papers:
    paper_text = ' '.join(nps_chat.words(paper_id))
    rows.append({'paper_id': paper_id, 'text': paper_text})
papers_df = pd.DataFrame(rows, columns=['paper_id', 'text'])
# Display the first few rows of the DataFrame
print(papers_df.head())
The paper ID and text for each NeurIPS paper are stored in a DataFrame constructed in
this code with the name papers_df. For a given paper ID, the nps_chat.words(paper_id)
function returns a list of words, which are subsequently combined to create the paper’s
content. “paper_id” and “text” are the two columns in the generated DataFrame. If your
dataset format differs from the example provided, you can modify this code to suit it.
The following code will be used to remove punctuation and lowercase:
import nltk
from nltk.corpus import nps_chat
import pandas as pd
import string

# Download NLTK resources (if not already downloaded)
nltk.download('nps_chat')

# Load NeurIPS papers dataset
neurips_papers = nps_chat.fileids()

# Remove punctuation, lowercase each paper's text, and build the DataFrame
rows = []
for paper_id in neurips_papers:
    paper_text = ' '.join(nps_chat.words(paper_id))
    paper_text = paper_text.translate(str.maketrans('', '', string.punctuation)).lower()
    rows.append({'paper_id': paper_id, 'text': paper_text})
papers_df = pd.DataFrame(rows, columns=['paper_id', 'text'])
print(papers_df.head())
The translate() function is used in this modified code to eliminate any punctuation from
the text. The text is subsequently changed to lowercase using the lower() method. The
remaining lines of code continue to create the papers_df DataFrame and add the cleaned
text data to it. The cleaned text data will be displayed in the first few rows of the DataFrame
in the output.
import nltk
from nltk.corpus import nps_chat
import pandas as pd
import string
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Download NLTK resources (if not already downloaded)
nltk.download('nps_chat')
# Load NeurIPS papers dataset
neurips_papers = nps_chat.fileids()
# Create a DataFrame to store paper text
rows = []

# Extract text data and populate DataFrame
for paper_id in neurips_papers:
    paper_text = ' '.join(nps_chat.words(paper_id))
    # Remove punctuation and convert to lowercase
    paper_text = paper_text.translate(str.maketrans('', '', string.punctuation))
    paper_text = paper_text.lower()
    rows.append({'paper_id': paper_id, 'text': paper_text})
papers_df = pd.DataFrame(rows, columns=['paper_id', 'text'])

# Combine all text data into a single string
all_text = ' '.join(papers_df['text'])

# Generate WordCloud
wordcloud = WordCloud(width=800, height=400,
                      background_color='white').generate(all_text)

# Display the WordCloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
The most frequent words in the combined text data from the NeurIPS papers are visual-
ized in this code using the WordCloud library. The WordCloud object is created using the
WordCloud() function, and the WordCloud visualization is shown using the imshow()
function from the matplotlib library. Please be aware that to successfully run this code, you
must have the WordCloud and matplotlib libraries installed (pip install wordcloud mat-
plotlib). The result will be a WordCloud graphic showing the most frequently occurring
words in the text data from the NeurIPS papers.
3. Data Preparation for LDA Analysis
In this step, we transform the textual data into the input required for training the
LDA model. To begin with, stop words are eliminated, and the text is tokenized. The
tokenized texts are then transformed into a corpus and a dictionary.
Here is the code:
import nltk
from nltk.corpus import nps_chat
import pandas as pd
import string
from gensim import corpora
from nltk.corpus import stopwords

# Download NLTK resources and stopwords (if not already downloaded)
nltk.download('nps_chat')
nltk.download('stopwords')

# Load NeurIPS papers dataset
neurips_papers = nps_chat.fileids()

# Create a DataFrame to store paper text
rows = []
for paper_id in neurips_papers:
    paper_text = ' '.join(nps_chat.words(paper_id))
    # Remove punctuation and convert to lowercase
    paper_text = paper_text.translate(str.maketrans('', '', string.punctuation))
    paper_text = paper_text.lower()
    rows.append({'paper_id': paper_id, 'text': paper_text})
papers_df = pd.DataFrame(rows, columns=['paper_id', 'text'])

# Tokenize the text and remove stopwords
stop_words = set(stopwords.words('english'))
tokenized_texts = []
for text in papers_df['text']:
    tokens = [word for word in text.split() if word not in stop_words]
    tokenized_texts.append(tokens)

# Create a dictionary from tokenized texts
dictionary = corpora.Dictionary(tokenized_texts)

# Create a corpus using the dictionary
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_texts]

# Display the first few entries in the dictionary and the corpus
print("Dictionary entries:", list(dictionary.items())[:5])
print("\nCorpus (first document):", corpus[0])
In this code, stop words are eliminated from each document, and the textual data is
tokenized. The tokenized data is utilized to build a dictionary and corpus using the Gensim
library. The corpus is a list of bags of words representing each document, while the dic-
tionary contains a mapping of words to distinct IDs. Please be aware that to correctly run
this code, you must have the Gensim library installed (pip install gensim). The output will
show the first few dictionary items as well as the illustration of the very first corpus
document.
4. Model Training
The following code will be used to train the model:
from gensim import corpora, models   # models is needed for LdaModel

dictionary = corpora.Dictionary(tokenized_texts)

# Create a corpus using the dictionary
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_texts]

# Train the LDA model
num_topics = 3  # Number of topics to identify
lda_model = models.LdaModel(corpus, num_topics=num_topics,
                            id2word=dictionary, passes=15)

# Print topics and their top words
for topic_id in range(num_topics):
    print(f"Topic {topic_id + 1}: {lda_model.show_topic(topic_id)}")
In non-negative matrix factorization (NMF), the original matrix X is approximated as the
product of two smaller matrices, W and H. The number k indicates how many components or
latent features we have decided to extract. In order to adhere to the fundamental idea of
NMF, the matrices W and H are made up of non-negative values.
In mathematics, the Frobenius norm (squared difference) between the original matrix X
and its approximation WH can be used to construct the NMF optimization problem:
minimize ‖X − WH‖²
To reduce the approximation error, the NMF algorithm iteratively updates the values of
W and H. These updates are produced using methods like gradient descent or multiplica-
tive updates. The advantage of NMF is its capacity to unearth hidden patterns, compo-
nents, or features in the data. For instance, in the context of image processing, NMF can
break an image down into non-negative components that reflect different visual features.
Similarly, in text mining, NMF can uncover latent topics in a document-term matrix.
Generally speaking, non-negative matrix factorization is a potent technique that goes
beyond simple numerical operations and provides perceptions into the structure of data
through its built-in non-negative constraints and factorization mechanism.
Here is an example of code that uses the NMF class from the Sklearn library to create
an NMF-based topic model on a toy corpus. For each subject, the code will produce fea-
ture names and their accompanying weights, much like LDA does:
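The listing below is a minimal sketch of such a model (it assumes the toy_corpus defined earlier in this chapter; num_topics and top_n are illustrative choices):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

num_topics, top_n = 2, 5

# Build the TF-IDF matrix from the toy corpus
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(toy_corpus)

# Factorize the TF-IDF matrix into num_topics non-negative components
nmf_model = NMF(n_components=num_topics, random_state=42)
nmf_model.fit(tfidf_matrix)

# Print the top terms and their weights for each topic
feature_names = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(nmf_model.components_):
    top_terms = [(feature_names[i], round(weights[i], 3))
                 for i in weights.argsort()[::-1][:top_n]]
    print(f"Topic {topic_id + 1}: {top_terms}")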
The tokenized words are transformed into a TF-IDF matrix in this code using the
TF-IDF vectorizer. Then, this matrix is used to train the NMF model to identify subjects.
The feature_names variable holds the vocabulary terms produced by the TF-IDF vectorizer,
which link each word to its corresponding topical weights.
Please make sure you have the relevant NLTK resources and stop words downloaded,
as well as the sklearn and nltk libraries installed.
Product reviews are an invaluable source of knowledge in the world of consumer-driven
markets because they capture the attitudes, viewpoints, and preferences of consumers.
Finding relevant subjects from these reviews has the potential to reveal important factors
that affect consumers’ purchasing decisions, thereby helping businesses improve their
products and customer experiences.
Imagine yourself as a data enthusiast who is eager to explore a wealth of product
reviews. Using Python and the hidden potential of natural language processing (NLP),
let’s set out on a quest to extract topics from these reviews.
By revealing connections between words and reviews and displaying cohesive clusters of
phrases that define the essence of features, these techniques unearth implicit topics.
Step 5: Interpretation
You can identify the latent themes hidden in the reviews by looking at the top words in
each topic. For instance, “product quality”, “customer service”, and other related topics
might be discussed. Let’s say you have a set of product reviews and you’ve used the
code provided before to vectorize the text using TF-IDF and apply latent Dirichlet allo-
cation (LDA) to extract the topics from the text.
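A minimal sketch of the kind of display_topics helper discussed in the next paragraph is given here (it assumes a scikit-learn LDA model and the vectorizer's feature names; the names used are illustrative):

def display_topics(model, feature_names, num_top_words=10):
    # Print the highest-weighted words of every topic in the fitted model
    for topic_id, weights in enumerate(model.components_):
        top_words = [feature_names[i]
                     for i in weights.argsort()[::-1][:num_top_words]]
        print(f"Topic {topic_id + 1}:")
        print(', '.join(top_words))

# Example call (assuming lda_model and vectorizer were fitted on the reviews):
# display_topics(lda_model, vectorizer.get_feature_names_out())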
The display_topics function would display the top words for each topic determined by
the LDA model once the code had run. Here is an illustration of how you could interpret
the result:
Output
Topic 1:
product, quality, good, excellent, value, highly, recommend, money, worth, purchase
Topic 2:
service, customer, great, experience, excellent, friendly, staff, helpful, professional,
satisfied
Topic 3:
delivery, fast, arrived, time, quick, packaging, on-time, condition, timely, arrived
…
Each topic includes a list of the most important terms, and these words reveal the fun-
damental ideas that underlie the reviews. Topic 1 might reflect conversations on the value
and quality of the products, Topic 2 might be about customer service encounters, and
Topic 3 might be about delivery and shipping times. To make sense of the identified
themes, keep in mind that topic interpretation is partly subjective and requires a human
touch. The quality and interpretability of the topics collected can also be affected by
changing the number of topics (num_topics) and improving preprocessing procedures.
The output will show you the precise top terms for each topic depending on the reviews
you’ve provided if you run the code on your actual dataset.
In this section, we will perform text modeling using Gensim and Scikit-Learn.
We will discuss a simple example of topic modeling using the Python library known as
Gensim. We will use the collection of news articles related to technology and will perform
topic modeling on these articles. We will perform the following tasks:
• Data preprocessing
• Building of the LDA model
• Evaluation of the model
• Interpretation of the results
• Text cleaning: We will remove all the unwanted characters, symbols, and special
characters.
• Tokenization: We will split the sentences in the text into individual words called tokens.
• Lowercasing: For consistency, we will convert all words to lowercase. This is necessary
because without this step, two occurrences of the same word, e.g., "Sample" and "sample",
would be treated as two different words, which is semantically incorrect.
• Stop word removal: We will remove common words (e.g., “the”, “is”, “and”) that do
not give much meaning.
• Lemmatization: We will reduce the words to their base or root form.
import pandas as pd
import gensim
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import nltk
nltk.download('wordnet')
nltk.download('punkt')

# One reasonable definition of the helper used below: lemmatize, then stem
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(token):
    return stemmer.stem(lemmatizer.lemmatize(token, pos='v'))

def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

# Apply preprocessing to the text data
# (data is assumed to be a DataFrame with a 'text' column of news articles)
processed_docs = data['text'].map(preprocess)

# Build the dictionary and document-term matrix
dictionary = corpora.Dictionary(processed_docs)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in processed_docs]

# Train the LDA model
num_topics = 3
lda_model = LdaModel(corpus=doc_term_matrix,
                     id2word=dictionary, num_topics=num_topics,
                     random_state=42, passes=10, alpha='auto',
                     per_word_topics=True)

# Evaluate topic coherence
coherence_model = CoherenceModel(model=lda_model, texts=processed_docs,
                                 dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model.get_coherence()
print(f"Coherence Score: {coherence_score:.4f}")
The coherence score indicates how well the topics are separated and interpretable.
Higher coherence scores are preferred.
The output will show the top words for each topic along with their probabilities.
Analyzing these words can help us understand the main themes discovered in the technol-
ogy news articles.
That’s it! This case study demonstrates how to perform topic modeling using Gensim
on a collection of technology news articles. By following these steps, you can adapt the
process to perform topic modeling on various other text datasets and gain valuable insights
from them.
Assuming we have executed the entire code from the data preprocessing step to the
topic modeling and evaluation steps, here’s the expected output:
Sample Output:
Topics and Their Word Distributions:
Topic 0: 0.074*"topic" + 0.036*"model" + 0.030*"document" + 0.027*"use" + 0.026*"lda"
+ 0.022*"word" + 0.020*"techniqu" + 0.019*"text" + 0.019*"python" + 0.018*"corpu"
Topic 1: 0.041*"use" + 0.034*"model" + 0.031*"topic" + 0.031*"data" + 0.024*"python"
+ 0.023*"gener" + 0.023*"librari" + 0.021*"vector" + 0.021*"space" + 0.018*"power"
Topic 2: 0.057*"model" + 0.051*"topic" + 0.027*"algorithm" + 0.027*"document" +
0.024*"lda" + 0.024*"use" + 0.023*"probabilist" + 0.022*"dirichlet" + 0.022*"latent"
+ 0.019*"distribut"
The output shows the top words for each of the three discovered topics along with their
probabilities. For example, Topic 0 contains words like “topic”, “model”, “document”,
“use”, etc., which represent the main theme of the topic.
Coherence Score: 0.4523
The coherence score represents the quality of the discovered topics. Higher coherence
scores indicate better separation and interpretability of topics.
Please note that the actual probabilities and words in the output may vary depending on
the input data and the random initialization of the LDA model. The coherence score will
also differ based on the dataset and the number of topics chosen.
The output allows us to interpret the discovered topics and understand the main themes
present in the technology news articles. You can analyze the top words for each topic and
the coherence score to assess the quality of the topic modeling results.
import string

def preprocess(text):
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    return text.split()

preprocessed_corpus = [preprocess(document) for document in toy_corpus]
Output:
Topic 1: Python, programming, language, data, science
Topic 2: animals, wild, found, majestic, lions
The LDA model found two subjects in this output: one on programming and data sci-
ence and the other about lions and wild animals. Each topic’s top words offer clues about
the underlying topics.
This methodically organized technique illustrates how to use Scikit-Learn’s LDA on a
practice corpus to do topic modeling.
• Abstraction-based: These are complex methods that employ NLG approaches in which
the computer generates content on its own using knowledge bases and writes summa-
ries that resemble those written by humans.
Text wrangling, also known as normalization, is the process of organizing, cleaning, and
standardizing textual input into a format that can be used by other NLP and intelligent
systems driven by machine learning and deep learning. Preprocessing methods that are
frequently used are text cleaning, tokenization, removal of special characters, case conver-
sion, spelling correction, elimination of stop words and other superfluous phrases, and
stemming. In essence, text wrangling is the preprocessing effort carried out to make
unprocessed text data ready for training. To put it simply, it involves cleaning your data so
that your program can read it.
This process involves shaping raw text into a form that algorithms can digest. Various
strategies are used in this procedure, ranging from the traditional bag-of-words method of
counting word occurrences to the complex dance of word embeddings that encode seman-
tic associations. By skillfully converting unstructured textual data into organized numeri-
cal features, text representation with feature engineering enables algorithms to understand
and analyze language effectively.
The main idea of the text is captured through feature engineering techniques, which
reduce its complexity while keeping its relevance, context, and meaning. Several tech-
niques, including bag of words, TF-IDF, n-grams, and word embeddings, are used in this
procedure. Bag of words counts words without taking into account their order, whereas
TF-IDF assigns weights based on the frequency and rarity of each word. Word embed-
dings translate words to dense vectors while maintaining semantic links, while N-grams
record word sequences and add context. Text representation with feature engineering
bridges the gap between the unstructured nature of the text and the structured requirements
of algorithms. By transforming words into numerical features, it empowers machines to
recognize patterns, identify sentiments, extract topics, and perform a plethora of natural
language processing tasks. Ultimately, it’s about enabling a meaningful dialogue between
humans and machines, where words become the currency through which insights and
knowledge are exchanged.
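As a small illustration of the contrast between a plain bag of words and TF-IDF weighting, the following sketch vectorizes two toy sentences with scikit-learn (the sentences are illustrative):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Bag of words: raw word counts, order ignored
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(counts.toarray())

# TF-IDF: weights that combine frequency with rarity across documents
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(weights.toarray().round(2))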
In the field of natural language processing, latent semantic analysis (LSA) functions as a
kind of master linguist detective. The distributional hypothesis, a key premise of LSA,
holds that words with comparable meanings tend to occur together in similar contexts.
Cosine similarity is a common measure of document similarity: a score close to 0 suggests
that two documents are completely unrelated, while a value close to 1 indicates that two
documents are essentially identical. With its mathematical prowess, LSA strips apart lay-
ers of words to expose the hidden connections that thread throughout our textual universe.
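The cosine similarity score mentioned above can be computed directly from two document vectors, as in this small sketch (the vectors are illustrative):

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = np.array([1.0, 2.0, 0.0, 1.0])
doc_b = np.array([1.0, 1.0, 0.0, 1.0])
print(round(cosine_similarity(doc_a, doc_b), 3))   # close to 1 => very similar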
Text summarization presents a number of complex problems that researchers and practi-
tioners must solve, including:
1. Content selection: One of the most difficult parts of writing a summary is deciding
which sentences or phrases to include. The summarization system must recognize the
most crucial and pertinent information while eliminating redundant or unimportant
data. What information is deemed to be “important” can vary depending on the
situation.
2. Effective content compression: Because summaries are often significantly shorter than
the original text, the summarizing system must efficiently compress the content.
Compression must be carefully balanced with the preservation of the text’s nuances
and primary themes. If handled improperly, information loss or misinterpretation
may result.
3. Upholding coherence and cohesion: A strong summary should be coherent and cohe-
sive to ensure that the material flows naturally. It might be difficult to maintain the
text’s logical flow and relationships between sentences, particularly when several tex-
tual components need to be condensed.
4. Handling ambiguity and polysemy: Words and phrases frequently have numerous
meanings depending on the situation, making language inherently ambiguous. To guar-
antee that the summary effectively conveys the intended meaning, a summarizing sys-
tem must appropriately disambiguate words and sentences.
5. Abstractive versus extractive summarization: Abstractive and extractive summariza-
tion are the two major methods. While extractive summarizing chooses and rearranges
sentences from the original text, abstractive summarization creates new sentences that
may not be present in the original text. Both strategies have specific difficulties, such
as producing coherent sentences in extractive summarizing or maintaining grammati-
cal accuracy in abstractive summarization.
7.10 Summary
Text summarization and topic modeling are two core tasks in the domain of text mining.
These tasks are used to get the documents’ themes. In this chapter, we have explained both
of these concepts in detail. Different types and techniques of text summarization have been
discussed along with the details of how each technique works. Efforts have been made in
this chapter to provide the working code in order to show the actual logic and processing
behind these techniques. For this purpose, the case studies have been presented. This will
help users and especially system developers to implement these tasks while developing
text mining systems. Overall, efforts have been made to provide the reader with sufficient
knowledge to develop and use the mentioned concepts at any level.
7.11 Exercises
Q1: Consider the following two documents. Write a summary of these documents by
using the “Flat Text Summarization” process.
Document-1: Manual text summarization is the process of considering the long text
and converting it into a smaller version that explains the basic theme of the text. The
process is performed manually by human beings. In this process, a human reads the
text, comprehends, and then rewrites it in concise words. The process is very subjec-
tive and depends upon the knowledge and experience of the summarizer. So, it
means that for the same text, two different people may write different summaries.
Document-2: Automatic text summarization is a natural language processing tech-
nique. In this process, a computer algorithm is involved instead of a human being.
The algorithm generates a concise and coherent summary of the longer text. The
overall intention is to capture the important information and the key ideas present in
the text. Automatic text summarization can be easily applied to the long texts.
Q2: Consider the documents given in question no. 1. Write their summaries using the
“Hierarchical Text Summarization” process.
Q3: Consider the following document. In the context of query-based summarization, sum-
marize this document from the “Sports” point of view and “Weather” point of view.
People in Europe like sports. A lot of games are played there. From football to tennis,
each game has a lot of fan base. Different leagues and championship matches are
arranged every year. Competitive leagues like the English Premier League and Spain’s
La Liga are very famous. Similarly, people anxiously wait for the Wimbledon tennis
championships, the Tour de France cycling race, and the prestigious Formula One Grand
Prix races. This continent is a hub for both traditional sports and athletic excellence.
As far as weather in Europe is concerned, it varies from country to country and region
to region. In northern areas of Europe, the weather is cold, whereas the southern parts
enjoy warmer climates. Western Europe has moderate weather with rain throughout the
year. Central Europe can have all four seasons distinctly, with cold winters and hot
summers. This means that you can find everything from snowy landscapes to sunny
beaches in Europe.
Q4: Write down at least one- to two-sentence summary of different paragraphs (you can
take some random text) using a heuristics-based approach. You can use the following
phrases for extraction of the summary:
• In essence…
• Finally, we can say that….
• Overall…
• In conclusion…
• So, the result will be…
Q7: How does latent Dirichlet allocation (LDA) work in topic modeling?
Q8: Suppose you are given a dataset containing a collection of news articles from various
sources. Your task is to perform topic modeling on this dataset using latent Dirichlet
allocation (LDA). Follow these steps:
1. Preprocess the text data by removing punctuation, converting to lowercase, and
removing stop words.
2. Tokenize the preprocessed text.
3. Using the tokenized text, create a document-term matrix.
4. Apply LDA with a chosen number of topics.
5. Interpret the results: identify the main topics generated by the model, and provide a
brief description for each topic.
Q9: If a topic model identifies words like “government”, “economy”, “budget”, and
“trade” as important in a topic, what might be the general theme of that topic?
Recommended Reading
• Advances in Automatic Text Summarization
Authors: Mani I., Maybury M.T.
Publisher: MIT Press
Publication Year: 1999
This book presents the key developments in the field of automatic text summarization.
It presents a coherent framework, along with highlighting the future directions. It con-
sists of six sections: classical approaches, text-based group approaches, exploitation of
discourse structure, knowledge-rich approaches, assessment methods, and emerging
challenges in synthesis. Overall the book provides a good source for automatic text
summarization.
• Automatic Summarization
Author: Inderjeet Mani
Publisher: John Benjamins Publishing Company
Publication Year: 2001
This book provides a systematic introduction to the domain, explaining key terminolo-
gies, methodologies, and automated techniques that use semantic and statistical knowl-
edge to generate extracts and reviews. It provides maximum details of research covering
artificial intelligence, natural language processing, and information retrieval. It also
includes comprehensive details of assessment approaches, along with emerging topics
such as multimedia and multi-document summarization.
Taxonomy generation refers to the automatic category predefinition of the text. It is the
process of generating topics or concepts and their relations from a given corpus. Dynamic
document organization, on the other hand, is the automated process of managing the text
on the basis of its contents. Both processes are important in the context of text mining.
This chapter will explain each process’s details, examples, and complete description of the
accompanying Python source code.
to the development of taxonomies. The choice of the name of the subgroup is relevant to
the words it contains. A text is viewed as a word set that is used for the extraction of asso-
ciation rules and the construction of taxonomies. Then constructed taxonomies are orga-
nized in the form of a graph tree where a node of the tree is used to indicate a word and the
edge between two nodes shows the relationship between those nodes or words. We can
show the taxonomies in the form of different graphical formations ranging from a simple
list to complex graphs. The simplest form of taxonomy organization is a list of categories
and concepts. It is the initial step of text categorization, in which the categories are
defined automatically. For predefining the categories using hierarchical text categorization,
a hierarchical structure of concepts and categories from abstract to specific levels is extremely
useful. The network of categories and the relations of the concepts is another type of tax-
onomy organization. The network of taxonomies above can be expanded with information
regarding the techniques, characteristics, and relationships of each concept.
The automatic definition of the classification frame is the primary objective of taxon-
omy generation. It is not possible to automatically categorize text using the list of name-
less clusters that were retrieved from the clustered text. As nameless clusters cannot be
used for automatic text categorization, we need to perform this task manually. To perform
the task manually, prior domain knowledge is required. The classification frame, which is
defined with the use of text categorization, is a collection of significant concepts derived
from the corpus through taxonomy generation. The important concepts, and the relations
between them, are produced as the output of taxonomy generation.
• Keyword extraction
• Word categorization
• Word clustering
• Topic routing
Keyword extraction is the process of taking key phrases out of the complete text that has
been provided. The task of keyword extraction within taxonomy generation involves iden-
tifying and extracting the most representative and informative keywords from a given text
or a set of documents. These keywords play a vital role in categorizing, organizing, and
labelling the content into a structured hierarchy or taxonomy. Think of them as the sign-
posts that guide you through the landscape of information. The objective is to pinpoint
words or phrases that encapsulate the core themes, concepts, and subjects covered within
the text. These keywords essentially act as the foundational building blocks for construct-
ing a taxonomy—a structured framework that classifies and organizes information into
meaningful categories. The overall keyword extraction process is given in Fig. 8.2. Full
text is given as input first, and then all words are extracted in the
form of an indexed list. Then all indexed words are passed through the binary classifica-
tion algorithm. The binary classification separates the words into keywords and not key-
words. These two categories created as output are called keyword extractions.
Let’s now use the generation of taxonomies to demonstrate the relevance of keyword
extraction. In the first step, individual text documents are converted to a list of words.
Every document has its own indexed list of words. Then the keyword extraction procedure is
applied, and category keywords are formed. For every word list, a list or set of keywords
is generated. In the last step, all keyword categories are combined into a big list of key-
words, which is called a taxonomy. This is the complete procedure of text to taxonomy
generation. This is shown in Fig. 8.3. If required, we can use further filter operations to
select appropriate keywords from the big list of keywords.
The following are some advantages and disadvantages of keyword extraction.
• Advantages:
• Disadvantages:
–– Some words in the given text can have multiple meanings, and the keyword extrac-
tion procedure may not distinguish the difference in these meanings and lead to
ambiguous interpretation of extracted keywords.
–– Keyword extraction techniques focus on the frequency of words; a word with high
frequency is considered the most important, yet frequently repeated words are often the
least informative, for example, stop words.
–– Efficient extraction of keywords depends on preprocessing of text like punctuation,
stop words removal, stemming, and lemmatization. If preprocessing is not efficient,
it will affect the quality of extracted keywords.
–– Many keyword extraction algorithms are unsupervised, so they may not capture a
specific concept that is related to a particular domain. In this scenario, we explicitly
need to provide guidance to the algorithm.
Now, we will see the implementation of keyword extraction using a real-time example
with the help of Python and the RAKE (rake-nltk) library.
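The listing below is a minimal sketch of such an extraction (it assumes the rake-nltk package, installed with pip install rake-nltk; the review texts are illustrative and chosen to mirror the output shown next):

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from rake_nltk import Rake

reviews = [
    "Excellent camera quality and battery life.",
    "I love this phone, the screen is amazing and performance is top notch.",
    "Disappointing sound quality.",
    "Sleek design, intuitive user interface and a modern gadget.",
    "Great device, but the charging time is a bit slow.",
]

rake = Rake()  # uses NLTK stop words and punctuation by default
for i, review in enumerate(reviews, start=1):
    rake.extract_keywords_from_text(review)
    print(f"Review {i} Keywords:", rake.get_ranked_phrases())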
Output:
Review 1 Keywords: ['camera quality', 'battery life', 'excellent']
Review 2 Keywords: ['top notch', 'performance', 'screen', 'amazing', 'love', 'phone']
Review 3 Keywords: ['disappointing', 'sound quality']
Review 4 Keywords: ['user interface', 'modern', 'intuitive', 'design', 'sleek', 'gadget']
Review 5 Keywords: ['charging time', 'great device', 'slow', 'bit']
In the code snippet given above, we used some reviews of a product and used the RAKE
Python library to extract the important keywords from each review. This library ranks the
phrases on the basis of frequency and importance in the review comments. If we analyze
the keywords extracted from the first review, we find that the phrases “camera quality”,
“battery life”, and “excellent” indicate that the customer is praising the camera quality and
battery life.
Word categorization is the process of categorizing each word into one or more specified
categories. While classifying words is known as word categorization, classifying a text
(such as an article) is known as text categorization. Since a word is categorized into one of
the categories of grammar, the classification of parts of speech is a prime example of word
categorization. Words can be divided into verbs, nouns, adjectives, and other categories.
Word categorization as a taxonomy generation task is different from part-of-speech
classification, because here a word is categorized on the basis of its meaning rather than
its grammatical role.
The word categorization task in taxonomy generation is similar to sorting a library of
words into distinct bins, where each bin represents a specific category or theme. It’s about
parsing the textual landscape to identify words that naturally lead toward certain subject
areas. These words act like signposts that point toward the essence of different topics. The
application of the word categorization task in taxonomy generation is like building a road-
map for navigating the vast landscape of information. It serves as the foundation for creat-
ing organized and structured content frameworks that enhance understanding and
accessibility.
Some of its applications are as follows:
In the example given above, labelled words are converted into numerical vectors. The
reason for converting the text into vectors is that machine learning algorithms take input
in the form of numeric values. These algorithms cannot work with text strings.
In order to create numerical vectors, the words in the corpus are divided into one or
more predetermined categories. So, text categorization and word categorization are the
same, but words are in the form of classified targets. Table 8.3 shows the differences
between topic-based word categorization and keyword extraction. Even though both tasks
fall under the category of classification, there are differences between them.
In word categorization, the category list, tree, or topics are predefined, while in key-
word extraction, only two categories, keyword and not-keyword, are predefined. In most cases,
keyword extraction belongs to binary classification, whereas word categorization belongs
to multi-classification. In both word categorization and keyword extraction, words are
used as entities.
In word categorization, semantics are used for the classification criteria, while in key-
word extraction, importance of words is used for classification criteria. Now, we will dis-
cuss the relationship of word classification with taxonomy generation. Allocation of
sample words and predefined categories are important tasks of word categorization.
Figure 8.3 shows the relationship between word classification and taxonomy generation.
In this figure, words are classified into one or more predefined sets or categories of words.
Words that are classified into one or more categories are gathered in the form of topics.
Then, filtering is applied to select the important words. The important words are used for
the extraction of text. Between the first and second layers, we can add one or more masks
for automatic preliminary tasks instead of performing them manually.
• Advantages:
• Disadvantages:
–– As many words can have more than one meaning, determining the correct meaning
depends on the context, which is difficult when the context is limited or the intended sense is rare.
–– Some words need to have extensive context because available context may not be
sufficient to efficiently categorize the word.
–– If provided training data is limited, then this may lead to inaccurate categorization
of data.
–– If a word is categorized incorrectly, then this inaccuracy can propagate to subtasks
and affect the accuracy of application.
Now, we will see the implementation of word categorization using a real-time example
with the help of Python.
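The following is a minimal sketch of that idea (the sample queries, categories, and test query are illustrative; the classifier is a simple TF-IDF plus naive Bayes pipeline from scikit-learn):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A few labelled customer queries used as training data
queries = [
    "I want my money back for this order",
    "Please refund my payment",
    "When will my parcel arrive",
    "Track my shipment status",
    "The app crashes when I log in",
    "I cannot reset my password",
]
categories = [
    "Refund Requests", "Refund Requests",
    "Delivery Queries", "Delivery Queries",
    "Technical Support", "Technical Support",
]

# Train the model and categorize a new query
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(queries, categories)

test_query = "How do I get a refund for a damaged item?"
print("Predicted Category:", model.predict([test_query])[0])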
Output:
Predicted Category: Refund Requests
In the code snippet given above, we used some sample queries and categories. The
model is trained with the help of different library functions. Later we checked categoriza-
tion with the help of a test query. On the basis of words used in the query, the query is
categorized correctly.
Word clustering is the method of dividing a set of words into smaller groups. It is illus-
trated in Table 8.4. Normally, we consider the lexical cluster where spelling and grammar
play a very important role in the decision of clusters, but in this section, we will discuss
semantic clustering where the words are clustered on the basis of their meaning. We
already discussed word categorization and its relevance with taxonomy generation. Now,
let’s compare it with clustering. For the categorization of words, we must have some
labelled clusters. Word clustering, which divides sets of words into subsets of semantically
related words, is a key technique used in many NLP applications, including information
retrieval and filtering as well as word sense or structural classification.
The two basic types of similarity that have been utilized in the literature can be sum-
marized as follows:
(Table 8.4 illustrates word clustering: a set of words divided into smaller groups of semantically related words.)
Figure 8.4 shows the interaction of word and text clustering with each other. It can be
seen that we can perform text clustering on the basis of word clustering, so, first, we need
to perform word clustering because text clustering is derived from word clustering.
We will now talk about how word clustering is used to generate taxonomies. The words
are extracted from the corpus and converted to numerical vectors and then clustered on the
basis of their semantic meaning. Using cluster medoids, the representative words are
extracted from each cluster. Representative words extracted from a cluster are presented in
the form of a list. All the extracted representative words are then combined to get the
taxonomy.
The following is the implementation of word clustering using Python. Different vehi-
cles and fruits are used as input text, and on the basis of the semantic meaning of each
word, they are clustered into two different clusters. The provided text is processed through
K-means, an unsupervised machine learning algorithm that clusters the text correctly.
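A minimal sketch of that pipeline is given below (the short documents are illustrative and match the clusters shown in the output; note that K-means may number the clusters differently between runs):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "apple orange banana", "apple fruit", "orange fruit", "banana fruit",
    "car bike", "car vehicle", "bike vehicle",
]

# Convert the texts to TF-IDF vectors and cluster them into two groups
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

for cluster_id in range(2):
    print(f"Cluster {cluster_id + 1}:")
    for doc, label in zip(documents, labels):
        if label == cluster_id:
            print(" -", doc)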
Output:
Cluster 1:
- apple orange banana
- apple fruit
- orange fruit
- banana fruit
Cluster 2:
- car bike
- car vehicle
- bike vehicle
Topic routing is the method of extracting the relevant information from the topic at hand,
as shown in Fig. 8.7. The topic is input, and a list of the extracted text is generated; this is
the reverse process of text categorization. This task is performed to extract a special type
of information by considering the relation between a text and a topic. Imagine construct-
ing a large library with books thoughtfully arranged on shelves, with each shelf signifying
a certain theme or topic. Consider the readers who come to this library to find books that
interest them. Topic routing in taxonomy creation is like a skilled librarian who knows just
which shelf to guide guests to, guaranteeing they find the books they're looking for without
any confusion. Topic routing plays the part of a human guide while
building taxonomies, those complex webs of categorized data. It’s crucial to ensure that all
types of content, including articles, research papers, and other written works, are assigned
to the appropriate taxonomy node. In the same way that our librarian directs guests to the
appropriate bookshelves, topic routing directs content to the appropriate category.
This procedure depends on how well a taxonomy is organized. A logical roadmap for
the content to follow is created by each branch and subbranch, each of which symbolizes
a different topic. Topic routing makes sure that you are shown content and subtopics that
are connected to the topic you are researching within a taxonomy, improving your research
experience. It’s like having a friend who is familiar with the library’s layout and can guide
you to where you need to go, saving you time and energy. The art of creating taxonomies
relies primarily on topic routing, in essence. It makes sure that the enormous amount of
material is organized and arranged so that everyone may navigate it with efficiency and
ease, from inexperienced explorers to seasoned scholars. Topic routing improves the
usability and utility of taxonomies, just as a librarian’s astute direction improves the library
experience.
In topic spotting, a full text is provided in the form of input, and relevant topics are
assigned using a fuzzy classification mechanism. Topic routing can be achieved by revers-
ing the process of topic spotting. In topic spotting, the texts are given as input, and the
topic is output, while in topic routing, topic is input, and some texts are output. Topics are
chosen as a list or tree throughout the routing process, and each topic is given as input. The
values of topic and text matching are calculated and graded. A collection of text that is
relevant to a topic is selected as output during the matching process. The classifier and
training samples are assigned to the appropriate subjects according to the established topics. Each topic's classifier is trained using training samples that have been labelled as either positive or negative. Topics are provided as input to the classifier, along with texts from the corpus that belong to the positive class. The classifier generates its output in the form of texts that are relevant or irrelevant, as shown in Fig. 8.5.
Fig. 8.5 A topic classifier marks each text as relevant or irrelevant to the selected topics
8.3 Taxonomy Generation Schemes
In this section, we will discuss a few schemes that can be used for the taxonomy genera-
tion from a given corpus. A corpus of data can be organized and categorized in a variety of
ways using taxonomy generation techniques, which can be tailored to different purposes
and situations. These schemes are index-based, clustering-based, association-based, and link analysis-based schemes.
Figure 8.9 shows the complete procedure of using topic routing for taxonomy
generation.
We can see the process of taxonomy generation through topic routing with the help of a
diagram as shown in Fig. 8.10. From a corpus, a list of words is indexed, which is used for
taxonomy generation. The selected words become topics that are provided to the topic
routing process. The topics that are associated with the selected word list are marked. The
criteria to select relevant words for a topic are very important for the implementation of
taxonomy generation. In index-based taxonomy generation schemes, the corpus is pro-
vided as input. Using an indexing system, the complete corpus is indexed into a list of
words as the first stage to generate a taxonomy. The text available in the corpus is com-
bined through concatenation and tokenized into a list of tokens. Then stemming is applied
to get the root form of each token. Then stop words are removed from the stemmed tokens.
In the next step, some tokens or words are selected as taxonomies on the basis of fre-
quency. For efficient selection purposes, grammatical information and TF-IDF are used to
decide the frequency of tokens. Machine learning classification techniques are used to
select words as taxonomies on the basis of semantic relations. Words that are semantically
connected are considered for taxonomies, and connections between words are considered
on the basis of collocation. The consistency of words within a text and the total number of times both terms co-occur are taken into account when determining whether two words are related.
Calculating the overall word frequencies within the corpus is a basis for word selection,
with the assumption that all stop words have been removed. Additionally, the mean TF-IDF
weight can be used to improve the word selection parameter. Other elements, such as
material posted and grammatical clues, may also be taken into consideration as potential
selection criteria. It can be compared to a classification task in choosing the terms that
serve as taxonomies.
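As an illustration of this selection step, the following sketch assumes NLTK's Porter stemmer and scikit-learn's TF-IDF; the small corpus, the stop word list, and the cut-off of five words are all hypothetical. It selects the highest-weighted stemmed words as candidate taxonomy terms:

import re
import numpy as np
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

corpus = [
    "Taxonomies organize documents into hierarchical topics.",
    "Clustering groups documents that share similar topics.",
    "Indexing converts documents into lists of stemmed words.",
]

stemmer = PorterStemmer()

def index_text(text):
    # Tokenize, remove stop words, and stem each remaining token
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]

# Mean TF-IDF weight of each stemmed word across the corpus
vectorizer = TfidfVectorizer(analyzer=index_text)
tfidf = vectorizer.fit_transform(corpus)
mean_weights = np.asarray(tfidf.mean(axis=0)).ravel()
terms = vectorizer.get_feature_names_out()

# Keep the top-weighted words as candidate taxonomy terms
top = np.argsort(mean_weights)[::-1][:5]
print([terms[i] for i in top])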
Instead of existing separately, taxonomies are thought of as things that preserve a
semantic relationship throughout their networks. They are linked together based on their
collocations. One can determine the relationship between two terms by dividing the total
number of texts in which both words appear by the total number of texts in which either
word appears. The collocation rate is the name given to this ratio. We can also evaluate
additional elements, such as the closeness of the terms inside a text and the frequency of
their co-occurrences in the given text, in order to identify these relationships.
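The collocation rate described above can be computed directly from document-level word occurrences; a small sketch with a hypothetical corpus of token sets might look like this:

def collocation_rate(word_a, word_b, texts):
    # texts: a list of documents, each given as a set of words
    both = sum(1 for t in texts if word_a in t and word_b in t)
    either = sum(1 for t in texts if word_a in t or word_b in t)
    return both / either if either else 0.0

docs = [
    {"neural", "network", "learning"},
    {"network", "protocol", "routing"},
    {"neural", "network", "training"},
]
print(collocation_rate("neural", "network", docs))  # 2 / 3 = 0.67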
Considering that we have access to the relationships between words in terms of their
meanings, let’s get things started. The operations that deal with word meanings have been
described by the authors, and their underlying mathematical principles have even been
developed. It’s important to keep in mind that there may not always be a match between
similarity in meaning and similarity in terms of their literal definitions. We rely on colloca-
tion, or the way words frequently appear together in texts, to determine how similar two
meanings are. To manage the various intricacies of semantic links between words, more
complex mathematical procedures still need to be developed and defined.
Now, we will discuss the process of taxonomy generation using a cluster of texts. The
corpus is provided as input, and the corpus’ text is grouped into smaller text groups known
as text clusters. Each cluster is given a name, and each named cluster becomes the taxon-
omy. Now, the whole process of corpus to taxonomy generation is discussed in detail. The
corpus that is the collection of texts is used as input. The outcome of taxonomy generation
is the named clusters that are produced by text clustering and cluster naming.
In this part, we will discuss the taxonomy generation process in detail by using text
clustering. The corpus’ texts are divided up into smaller groupings of texts. Then, we
encode the text into numerical vectors. For the clustering of similar text, K-means AHC
algorithms can be used. Figure 8.6 shows the scheme.
Fig. 8.6 Clustering-based taxonomy generation: a group of texts is clustered into text clusters, which are then named
Cluster naming is the next step after text clustering in the process of taxonomy generation using a clustering scheme. For every cluster, the TF-IDF of each word is calculated on the basis of frequency, and words with higher weights are used for the name of the cluster.
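For example, a cluster can be named by averaging the TF-IDF weights of its member texts and picking the top-weighted words; the sketch below assumes scikit-learn and two hypothetical clusters:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

clusters = {
    0: ["stock markets and trading", "banking and stock prices"],
    1: ["football matches and goals", "cricket and hockey matches"],
}

# Fit TF-IDF on all texts so the clusters share one vocabulary
all_texts = [t for texts in clusters.values() for t in texts]
vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(all_texts)
terms = vectorizer.get_feature_names_out()

for cid, texts in clusters.items():
    # Average TF-IDF weights over the cluster's texts and take the top words
    weights = np.asarray(vectorizer.transform(texts).mean(axis=0)).ravel()
    top = [terms[i] for i in np.argsort(weights)[::-1][:2]]
    print(f"Cluster {cid} name: {' / '.join(top)}")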
The clusters we create using this process are not always independent. Some clusters
may be completely independent of others, and some may have semantic relations. After
this analysis of clusters, a graph is built where cluster names are vertices and their seman-
tic relations are edges. So, in the clustering-based scheme, the taxonomies are created
using three steps, cluster generation, cluster naming, and analyzing the clusters. Even
though we expect this strategy to provide positive results, it does so at a high computational cost. The clustering process scales quadratically with the number of data items, and the examination of relationships between clusters is also labor-intensive.
However, a single-pass approach can be used to cluster data items more effectively, which
can reduce some of these complications.
In this scheme, taxonomy is generated on the basis of the relationship of words. The com-
plete process of this scheme is described with the help of Fig. 8.12. The corpus is provided
to the system. The texts within the corpus are indexed into a set or list of terms, much like in
the other techniques discussed above. The relationship between terms is then derived from
each set of words in the following phase. Individual texts from the corpus are indexed into a list of terms; then, using filtering based on TF-IDF, a subset is extracted from the word lists. In the last step, the text retrieval process is carried out on the basis of association rules, and the filtered results are considered the generated taxonomies. The condition words that express the relationships between texts are extracted and filtered as the topics of the corpus. The next step is to associate each association rule with a piece of text. The text, together with the rule's causal component, is transformed into numerical vectors. The similarity between the causal part and the text is calculated, and the text is connected to the causal portion of the association rule if the similarity exceeds a predetermined threshold.
We add the idea of association rule filtering to our current procedure to improve tax-
onomy generation. Associative rules are ones that have causal components that are subsets
of the largest possible causal portion that may be found in another rule. When one associa-
tion rule’s causal component overlaps with the causal component of another rule, redun-
dancy results. In order to strengthen the causal components’ support in each rule, we also
engage in word trimming, choosing which words to leave out. The results of this associa-
tion rule filtering are directly applied to the task of creating a taxonomy.
The topics of the corpus are the conditional terms of association rules that have been
reduced and filtered. Texts must be connected to each association rule. The cosine similar-
ity is used to determine how similar a word list in the causal component is to a text. The
text and the causal component are both represented as numerical vectors. The text is con-
nected to the causal portion of the connection rule if the similarity exceeds the threshold.
The end outcome of employing this technique to generate taxonomies from a corpus is the
automatic completion of the initial tasks for text categorization. Association rules can be
grouped together into smaller groups of related ones. Then, we define the similarity metric
as the intersection of the causal elements of the association rules. The association rules are
grouped according to the similarity measure using the AHC algorithm. As an alternative
approach, we can encode the association rules in the form of numerical vectors and select
representatives with the help of the K-means algorithm or the k-medoid algorithm. The
clustered association rules are used to create the hierarchical structure of the taxonomies.
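To make the linking step concrete, the following sketch (with a hypothetical rule and texts, using scikit-learn's cosine_similarity) attaches a text to an association rule whenever the similarity between the text and the rule's causal word list exceeds a chosen threshold:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Causal (condition) part of a hypothetical association rule
rule_causal_words = "network protocol routing"
texts = [
    "routing packets through the network protocol stack",
    "recipes for baking bread and cakes",
]
threshold = 0.2

# Encode the causal word list and the texts in one TF-IDF space
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([rule_causal_words] + texts)
similarities = cosine_similarity(vectors[0], vectors[1:]).ravel()

for text, sim in zip(texts, similarities):
    linked = "linked" if sim > threshold else "not linked"
    print(f"{sim:.2f} {linked}: {text}")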
In the link analysis-based scheme, we treat the texts and the corpus as a network. In this network, each text can play the role of a hub. Creating links between texts is the starting point of this scheme. The texts are transformed into numerical vectors, and their pairwise similarity is then determined. When two texts are similar enough, a link is generated between them; if the threshold is set close to zero, dense links between texts are created. In dense networks, nearly every text in the corpus is chosen as a hub, whereas in sparse networks, only a small number of texts are chosen. The links created between texts or hubs are bidirectional. The association of two texts is calculated on the basis of the cosine similarity of both texts; if both are connected to the same text, then there is an association relation between them. If a text is not linked to any other text, it is removed and will not be part of the network.
Taxonomies will be generated from the texts that will be part of the network. As the
connections will be weighted, the degree of each node is to be counted as the number of
connections. If a node has a high degree, it means this text is linked to many other texts.
Texts which have a degree equal to or greater than the decided threshold value are selected
as hubs or main texts. The texts that are used as hubs or hubs themselves are indexed into
a list of words. The terms in the list with the highest weights are chosen to form the gener-
ated taxonomies.
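A minimal sketch of this link analysis-based selection of hub texts, assuming pairwise cosine similarities over TF-IDF vectors and hypothetical similarity and degree thresholds, could be:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "stock markets and banking news",
    "banking regulations and stock trading",
    "football matches and tournament results",
    "cricket and football fans around the world",
]
sim_threshold = 0.1   # link two texts if their similarity exceeds this
degree_threshold = 1  # texts with at least this many links become hubs

vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
similarity = cosine_similarity(vectors)
np.fill_diagonal(similarity, 0.0)

# Build the adjacency matrix of the text network and count node degrees
adjacency = similarity > sim_threshold
degrees = adjacency.sum(axis=1)

hubs = [texts[i] for i in range(len(texts)) if degrees[i] >= degree_threshold]
print("Hub texts:", hubs)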
Consider that we are exploring the universe of textual networks and concentrating on
the relationships that emerge between texts. These connections, which are frequently por-
trayed as undirected links, highlight how comparable two texts are when compared on the
basis of the information they contain. Examining directed links between texts, where one
text serves as the origin and the other as the destination, offers another viewpoint. We refer
to words that appear in both texts as shared words. In this case, the text in which the shared terms make up a smaller percentage is designated as the source text. Additionally, in the
world of directed text networks, texts with a greater number of outbound linkages come to
the fore and take on the function of hub texts.
8.4 Taxonomy Governance
The governance of taxonomy includes four operations, i.e., maintenance of an existing taxonomy, growth of taxonomy, integration of taxonomies, and ontology.
Governance is not just about classification. It’s about building bridges between categories,
fostering connections, and recognizing overlaps. It’s about curating the ecosystem to allow
different fields to collaborate and cross-pollinate, ensuring that the diversity of thought
enriches the whole. The function of the gatekeeper in taxonomy governance also involves
making decisions about what to include and what to leave out. It involves using judgment
to keep the taxonomy focused and current while also accommodating a wider range of
viewpoints.
Taxonomy governance creates a sense of order, a light that illuminates the route through
the deep jungle of data, in a world where information overload can cause confusion. It
represents a dedication to maintaining the health of the garden of knowledge and a symbol
of the organized wisdom that has stood the test of time regardless of eras and advance-
ments in technology. We will explain four operations in detail.
Taxonomy maintenance keeps the complicated web of knowledge orderly and alive. It is a
commitment to continuously maintain a taxonomy, which is essentially a systematic
framework for organizing knowledge. Taxonomy maintenance entails regular check-ups,
modifications, and updates to ensure that the taxonomy continues to appropriately reflect
the always-evolving information world, just like a good craftsman diligently maintains
their tools. Imagine it as a fine-tuning process where old branches are pruned, new ones
are nourished, and the entire taxonomy is groomed to improve user navigation.
The operation of taxonomy governance focuses on the maintenance of any existing
taxonomy by adding a few texts or removing a few old texts and updating any text. A tax-
onomy is maintained by adding new text in the corpus or updating any text in the corpus
or deleting texts from the corpus. There are various operations that may be done with a
taxonomy generation, such as dividing a topic into multiple topics, combining multiple
topics into one, or adding any new topics. Figure 8.7 shows the division of a taxonomy into
multiple taxonomies. In the figure, Taxonomy i has grown bigger, so we need to split it into a few smaller ones. In the diagram, we split it into Taxonomy i and Taxonomy i+1.
Merging taxonomies is another operation of taxonomy maintenance. If we have two taxonomies that are similar to each other, it is better to merge them into one, which reduces the number of taxonomies. If we want to merge two taxonomies, we make the selection on the basis of their similarity and the number of texts in both taxonomies, as shown in Fig. 8.8. In order to add new topic(s), we must first construct a new taxonomy. If a taxonomy has only a few texts and, due to its limited content, is isolated from the other taxonomies, it should be deleted.
We shall now talk about taxonomy growth, which is the progressive growth of taxonomies.
In this functionality, the texts are added and removed continuously from the corpus, which
is also shown in Fig. 8.9.
We will now talk about the procedure for adding new taxonomies to unfamiliar texts.
There are chances that we have text for new topics not for existing topics. So, to manage
it, the whole group of text is divided into familiar and unfamiliar groups. The unknown
group’s text is indexed into a list of words, and a new taxonomy is built using this list, as
shown in Fig. 8.10.
Think of a bookcase that hasn’t had any new titles added to it in years. The collection
loses its attractiveness as the books get stale over time. Similarly, a taxonomy that isn’t
updated to reflect new ideas, innovations, and trends runs the risk of becoming outdated.
This collection is given life by taxonomic growth, ensuring that it is a priceless resource
for both the present and the future.
As shown in Fig. 8.10, taxonomy 1 is large in size, so we are splitting this big taxonomy
into two smaller taxonomies. We will use a divisive hierarchical clustering algorithm to
divide the big cluster into two smaller clusters as shown in Fig. 8.17.
Now, we will see the procedure of downsizing the taxonomy using the deletion function
continuously. In taxonomy downsizing, the text from big taxonomies is continuously
deleted. It is the reverse process of taxonomy expansion described above. As shown in
Fig. 8.11, Taxonomy 1-1 and Taxonomy 1-2 are combined into a single, larger taxonomy.
Normally more text is added to the corpus than deleted, so taxonomy growth is more fre-
quent than taxonomy downsizing.
The process of combining various taxonomies into a unified taxonomy is known as tax-
onomy integration. To enable users to easily navigate and make sense of a massive sea of
information, the process involves merging many taxonomies, which are frequently shaped
by various perspectives or settings. Taxonomy integration serves as a central coordinator
in the information-rich digital age, ensuring that data from different departments, data-
bases, and sources interact naturally. Similar to putting together a jigsaw puzzle, each
piece contains its own distinct information, and integration is the magic that connects them
to show the entire picture. Imagine using a digital library where each area has a distinct
layout for you to browse. There is complete confusion as science books mix with cooking
books and history books with English literature. As the curator, taxonomy integration
organizes and labels the sections to make it simple for you to locate what you’re looking for.
As we see above, there are different schemes to construct a taxonomy. So, there is a
need to merge all taxonomies constructed through different schemes into one organization.
As we need to integrate the taxonomies, there should be similarities between taxonomies for them all to be organized into one. Figure 8.12 shows the procedure of taxonomy merging
through corpus integration. All corpus documents are merged into one big corpus; then,
any taxonomy generation scheme is used to generate a taxonomy from the big corpus.
Before integrating taxonomies, we must take into account how similar they are. We choose
two taxonomies from various organizations and combine them into one during the integra-
tion phase. If the picked taxonomies are from the same organization, then the merging of such taxonomies is called intra-taxonomy merging. Figure 8.13 shows the process of taxonomy merging. In the diagram, corpus A is made up of taxonomies for business, society,
and IoT, whereas corpus B is made up of taxonomies for sports, the Internet, and com-
merce. In the given corpus, there are two similar taxonomies called IoT and the Internet.
Similarly, two other similar taxonomies are commerce and business. During the merging process, commerce and business are merged into business. Similarly, IoT and the Internet are merged into the Internet. The remaining four taxonomies are treated as independent taxonomies.
Data becomes an inconsistent cacophony without taxonomy integration. It becomes
integrated into a musical symphony where one component complements the others, creat-
ing a symmetrical, significant, and effective framework that benefits users, companies,
researchers, and the entire global community.
8.4.4 Ontology
Ontology is the science of "what is", of the kinds and structures of objects, represented in the form of a tree or graph. Each node in the ontology graph represents a concept, and the
relationship between two concepts is represented by the edge between nodes. Because it
must be created semi-automatically or manually, ontology organization is more compli-
cated than taxonomy generation. Let’s consider the ontology of a computer department as
shown in Fig. 8.14. The computer department is the root node, and it has three concepts,
graduate courses, undergraduate courses, and people. The graduate courses include topics like neural networks and machine learning, while Java and data structures are two examples of undergraduate courses. The people concept is split into the three subgroups of staff, faculty, and students, and the faculty is split into three further subgroups.
On the World Wide Web, knowledge about a particular area is represented and described
using a powerful language of ontology, Web Ontology Language or OWL. OWL is suited
for building rich and thorough ontologies that are simple to understand by both humans
and machines since it is designed to allow for the encoding of complicated relationships
and concepts. It is a crucial component of the semantic Web framework, allowing for the
sharing and fusion of data across many platforms and applications.
To define classes, properties, individuals, and relationships within a domain, OWL offers a
structured method. It makes use of formal syntax and semantics that facilitate automated
reasoning and inference, allowing logical deduction from the data encoded in the ontology.
The following is an example to illustrate the use of Web Ontology Language (OWL).
Let’s consider a simple ontology for representing information about animals. We’ll
define a few classes, properties, and relationships using OWL.
Classes:
Animal
Mammal
Bird
Carnivore
Properties:
hasHabitat
hasDiet
Individuals:
Lion
Eagle
Using OWL, we can define the classes, properties, and relationships as follows:
# Define classes
Class: Animal
Class: Mammal
Class: Bird
Class: Carnivore
# Define properties
ObjectProperty: hasHabitat
ObjectProperty: hasDiet
Individual: Lion
Types: Carnivore, Mammal
Facts: hasHabitat Savanna, hasDiet Meat
Individual: Eagle
Types: Bird
Facts: hasHabitat Sky, hasDiet Prey
For the purposes of this example, we have defined the classes "Animal", "Mammal", "Bird", and "Carnivore". Additionally, we have defined the properties "hasHabitat" and "hasDiet". Then, we created the individuals "Lion" and "Eagle" and assigned them to the proper classes. Finally, we have outlined the connections between the individuals and their attributes, such as each animal's habitat and diet.
OWL enables complicated reasoning and inference by allowing us to specify relation-
ships, restrictions, and axioms. To build structured and meaningful representations of
information on the Web, it is utilized in a variety of fields, such as knowledge representa-
tion, semantic search, data integration, and others.
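The same animal ontology could also be built programmatically; the sketch below uses the rdflib library (one possible toolkit, with a hypothetical namespace) to emit an OWL/RDF description in Turtle syntax:

from rdflib import Graph, Namespace, RDF, RDFS, OWL

EX = Namespace("http://example.org/animals#")
g = Graph()
g.bind("ex", EX)

# Classes
for cls in ("Animal", "Mammal", "Bird", "Carnivore"):
    g.add((EX[cls], RDF.type, OWL.Class))
g.add((EX.Mammal, RDFS.subClassOf, EX.Animal))
g.add((EX.Bird, RDFS.subClassOf, EX.Animal))

# Object properties
g.add((EX.hasHabitat, RDF.type, OWL.ObjectProperty))
g.add((EX.hasDiet, RDF.type, OWL.ObjectProperty))

# Individuals and their facts
g.add((EX.Lion, RDF.type, EX.Carnivore))
g.add((EX.Lion, RDF.type, EX.Mammal))
g.add((EX.Lion, EX.hasHabitat, EX.Savanna))
g.add((EX.Lion, EX.hasDiet, EX.Meat))
g.add((EX.Eagle, RDF.type, EX.Bird))
g.add((EX.Eagle, EX.hasHabitat, EX.Sky))
g.add((EX.Eagle, EX.hasDiet, EX.Prey))

print(g.serialize(format="turtle"))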
Imagine a world in which information isn’t just a collection of disconnected data points
but rather a network of connected knowledge. This is structured knowledge sharing. A sys-
tematic foundation for knowledge exchange among various applications, databases, and
domains is created by OWL, which enables us to establish relationships among entities,
concepts, and properties.
Advantages of OWL are as follows:
• Rich semantic Web: OWL serves as the foundation for the Semantic Web, an idealized
version of the Internet in which computers can comprehend and interpret data much
like people do. It enables smarter searches, suggestions, and personalized experiences
for users, empowering Web content to be more meaningful.
• Precision in modeling: With OWL, categorization is not your only option. Complex
relationships such as hierarchy, equivalence, part-whole, and more can be defined. This
accuracy helps produce realistic models that accurately reflect the complexity of real-
world systems, such as biological systems and corporate processes.
• Cross-domain interoperability: OWL makes it possible for many communities to create
shared vocabularies and ontologies, which makes it easier to communicate between
various domains. For instance, a seamless connection between a medical ontology and
a healthcare information system can guarantee reliable data interchange.
• Data integration: Businesses work with data from many sources and formats in today’s
data-rich environment. OWL provides a common vocabulary and structure for compre-
hending the semantics, which helps in the integration and reconciliation of various
datasets.
• Decision support: OWL’s capacity to record complex connections and rules might be
useful to businesses and researchers. Building intelligent systems for better insights
and suggestions, OWL contributes by modeling decision-making procedures and
domain-specific rules.
OWL essentially changes the Web and digital environments into more significant, con-
nected, and intelligent environments. By bridging the gap between human cognition and
machine processing, it enables us to engage with a deeper understanding of the environ-
ment and go beyond the surface-level data items that are presented to us.
The composite task of text categorization and clustering is called dynamic document organization (DDO). The function of dynamic document organization is to manage texts automatically on the basis of their contents. A DDO system organizes your documents with skill and adaptability, like a digital master. Imagine having a smart assistant who can organize your digital assets according to your changing preferences and needs.
Its fundamental functionality relies on sophisticated algorithms and metadata to com-
prehend the essence of your documents. It’s about capturing the complexities of your work
and hobbies, not about strict files and fixed categories. Consider yourself a researcher who
explores a variety of subjects. The documents associated with your ongoing research
become the center of attention while using dynamic document organization. The focus
effectively shifts to other sets of papers as your interests change, resulting in a fluid and
natural experience. We can run the dynamic document organization through two operation
modes called a creation mode and a maintenance mode. Now, we will discuss both of these
modes. In creation mode, the DDO systems collect the documents and perform initial
categorization and structuring. The system collects information about the documents, e.g.,
metadata such as tags, keywords, creation dates, etc. Documents are then grouped into
categories on the basis of their attributes.
Once the documents are initially organized, the maintenance mode begins. This mode
ensures the system’s continued accuracy, relevance, and efficiency. In this stage, the sys-
tem tracks the updates and revisions in order to maintain integrity. Both of these modes are
important parts of dynamic document organization systems. So, we have to maintain a
balance between organizing new content and ensuring the ongoing accuracy of existing
content for a successful document management strategy.
Now, we will discuss online clustering, which is an excellent way of managing the text
automatically. Online clustering is a special type of clustering. Here, items are provided in
the form of a stream, and then clustering continues infinitely. Offline clustering is effec-
tive when whole data is available in advance, but in the real world, whole data is not
always available. The items we want to cluster may be received in the form of a continu-
ous stream. To develop an online clustering system, we need to modify existing offline clustering algorithms or create new ones, because the data arrives as a continuous stream instead of the whole dataset being available at hand. Now, we will discuss both offline and online clustering algorithms and compare them with respect to their functionality. In the offline version, all the data that needs to be clustered is already available; however, in the case of online execution, the data is not available up front and is received continually with the passage of time. Traditional clustering algorithms can be used for performing offline clustering, so, for online clustering, those algorithms need to be modified. Figure 8.15 is an illustration of online clustering, where the data items are provided in a continuous manner. In online clustering, we do not wait for all data items to arrive before clustering; the number of clusters is decided in advance, before the data starts arriving, and each arriving item must be assigned to a cluster. The main difference between online clustering and offline clustering is that the clusters are updated step by step and interactively.
In case the data continually arrives in the form of a stream, it may become impossible
to cater to all of the data. Furthermore, clustering results may also depend on the order in
which the data arrives. In online clustering, we need to adjust results more frequently
because data is received continuously. This very frequent update puts a heavy load on the
system and results in poor-quality clustering. We need to manage the online streams, for
example, by storing the data in the queues and providing it to the algorithm later on.
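One simple way to buffer such a stream, as suggested above, is a fixed-size queue that hands items to the clustering algorithm in small batches; a sketch using Python's collections.deque with a placeholder update function is shown below:

from collections import deque

BATCH_SIZE = 5
buffer = deque()

def update_clusters(batch):
    # Placeholder for the actual (online) clustering update
    print(f"updating clusters with {len(batch)} items: {batch}")

def receive(item):
    # Store each arriving item and flush the buffer as a batch
    buffer.append(item)
    if len(buffer) >= BATCH_SIZE:
        batch = [buffer.popleft() for _ in range(BATCH_SIZE)]
        update_clusters(batch)

for i in range(12):  # simulate a stream of twelve items
    receive(i)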
Now, we will discuss the online clustering algorithm. We will discuss the theory and work
of the online clustering algorithms. We will provide details of how the offline K-means
clustering algorithm can be converted to an online clustering algorithm. We will also dis-
cuss K-NN and how it can be transformed into an online clustering algorithm. We will also
discuss fuzzy clustering algorithms.
When items are provided as a stream during the clustering process, it is referred to as
“online clustering,” and the process may continue indefinitely. In reality, the objects we
aim to cluster are never presented one at a time; instead, the data pieces come in almost
continuously. For clustering data items that are provided as a continuous stream, we must
build online clustering strategies by altering current clustering algorithms and construct-
ing new ones. Here, we contrast offline versus online clustering and discuss their short-
comings (Fig. 8.15).
When all of the data items are shown at once, the clustering is said to be offline. The
essential premise of offline clustering is that after clustering current data items, no new
data item will be available. Traditional clustering techniques must be converted into their
online clustering equivalents in order to process continuously arriving data, since those techniques were originally designed for offline clustering.
Let’s now explore the complexities of bringing offline clustering into the online sphere.
Investigating the complete data collection is simply impracticable in light of the constant
data arrival, particularly in the context of big data. To keep up with this always-changing
stream, data clustering needs to advance as an art. The results of the online clustering are
inextricably linked to the order in which the data items make their grand appearance,
which is the twist. Fine-tuning becomes the name of the game; we must achieve a precari-
ous balance. If you tune the results too frequently, the system will get overworked; if you
tune the results infrequently, the quality of your clustering findings may suffer. How then
do we manage this delicate balance? Imagine us keeping a careful eye on the data stream,
like watchful guardians, waiting for the right opportunity to intervene and make the
changes that will improve the outcomes.
Given that the foundations of present clustering algorithms are offline techniques, we
are at a point where adaptation is required. The current attempt requires converting these
algorithms into their online equivalents. Imagine that clusters are methodically updated
with each new arrival rather than receiving a major redesign all at once when all data ele-
ments have been seen. The fundamental method switches from batch to interactive mode.
However, there’s still more. We can provide “virtual examples,” which are simply made-
up data items based on our in-depth knowledge of the application area, to establish the
scene and initialize these clusters with their unique properties. Consider a situation where
data needs to be taken out of the mix. Do not be alarmed; decremental clustering can be
incorporated to handle these situations with elegance. Let’s now step it up a level. The
online clustering we’re developing takes on the difficulty of dynamically variable data
values rather than only dealing with constant data points.
In the offline version of the K-means clustering algorithm, all the data is available at hand,
and we initialize the cluster mean vectors. The algorithm arranges the data on the basis of
similarity to the cluster’s mean vector. In the case of the online K-means clustering algo-
rithm, the mean vectors are updated whenever new data items arrive. The process contin-
ues until the convergence of the mean vectors.
The online K-means clustering algorithm can be performed in two ways. Firstly, the
mean vectors are updated each time a new data item arrives, whereas in other cases, sev-
eral data items are arranged in a block, and the mean vectors are updated in a batch. The
major issue of the online K-means clustering algorithm is its performance. The perfor-
mance depends upon the arrival of data and its order.
As far as the fuzzy K-means is concerned, the mean values can be updated whenever
the membership value of the newly arrived data item is computed. The fuzzy K-means
may also suffer from the same problems as discussed above.
The following Python code shows implementation of online clustering with the help of
a modified K-means algorithm:
import numpy as np

class OnlineKMeans:
    def __init__(self, num_clusters, dimensions):
        self.num_clusters = num_clusters
        # Centroids start at random positions and are refined as items arrive
        self.centroids = np.random.rand(num_clusters, dimensions)
        self.cluster_counts = np.zeros(num_clusters)

    def update(self, point):
        # Assign the arriving point to the nearest centroid
        distances = np.linalg.norm(self.centroids - point, axis=1)
        nearest = np.argmin(distances)
        # Move the winning centroid toward the point (running mean update)
        self.cluster_counts[nearest] += 1
        learning_rate = 1.0 / self.cluster_counts[nearest]
        self.centroids[nearest] += learning_rate * (point - self.centroids[nearest])

    def fit(self, data_stream, num_updates):
        # Process the stream one item at a time, as in online clustering
        for point in data_stream[:num_updates]:
            self.update(point)

    def get_clusters(self):
        return self.centroids

def generate_data_stream(num_updates, dimensions):
    # Stand-in for the stream source assumed by the listing:
    # here it simply yields random points in the unit square
    return np.random.rand(num_updates, dimensions)

num_clusters = 3
dimensions = 2
num_updates = 100
data_stream = generate_data_stream(num_updates, dimensions)
online_kmeans = OnlineKMeans(num_clusters, dimensions)
online_kmeans.fit(data_stream, num_updates)
clusters = online_kmeans.get_clusters()
print("Final Clusters:")
print(clusters)
Output:
Final Clusters:
[[0.68229178 0.78273444]
[0.21131366 0.45612506]
[0.70857747 0.22993655]]
The K-nearest neighbors (KNN) algorithm, which falls under the category of supervised
learning, is revamped into an online clustering algorithm, showcasing its versatility in dif-
ferent contexts. It’s important to observe that the boundary between supervised and unsu-
pervised learning algorithms is not rigid, as these algorithms can be reshaped and
repurposed to suit various scenarios and tasks. It’s essential to notice the absence of
labelled training examples in the early phases of converting KNN to its online counterpart.
The initial step is to determine the number of clusters and then insert virtual training examples into each cluster. These virtual examples are either created arbitrarily or purposefully, using knowledge of the particular application domain. As fresh examples are introduced into the system, the cluster to which a new example belongs is decided by the collective vote of the clusters of its nearest virtual examples. The other KNN variations can likewise be converted into unsupervised versions; for example, the radius nearest neighbors variant can be adapted by creating virtual examples in each cluster at random or based on prior knowledge of the chosen application area.
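A rough sketch of this idea, with hypothetical virtual examples seeded per cluster and a majority vote over the K nearest of them, might look like this:

import numpy as np

rng = np.random.default_rng(0)
num_clusters, per_cluster, dims, k = 3, 5, 2, 3

# Virtual examples: random points seeded around each cluster's assumed region
virtual_points = np.vstack([
    rng.normal(loc=c * 3.0, scale=0.5, size=(per_cluster, dims))
    for c in range(num_clusters)
])
virtual_labels = np.repeat(np.arange(num_clusters), per_cluster)

def assign(item):
    # Vote among the k nearest virtual examples
    distances = np.linalg.norm(virtual_points - item, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = np.bincount(virtual_labels[nearest], minlength=num_clusters)
    return int(np.argmax(votes))

for item in rng.normal(loc=3.0, scale=0.5, size=(3, dims)):  # streamed items
    print(item.round(2), "->", assign(item))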
Each item may be grouped into more than one cluster when using fuzzy clustering. Rather than a single hard assignment, the membership values of each data point in all the clusters are computed. When an item arrives for fuzzy clustering, its membership values are computed, and it is then assigned to multiple clusters depending on those membership values.
This approach departs from the conventional single cluster assignment by allowing each
item to belong to numerous clusters. Although “overlapping clustering” and “fuzzy clus-
tering” are sometimes used synonymously, they differ in terms of their fundamental ideas.
The idea that items can belong to more than one cluster at once, resulting in an intriguing overlap between groups, is at the heart of overlapping clustering. Fuzzy clustering, on the
other hand, has a slightly different viewpoint and places more emphasis on computing
membership values across clusters as continuous values. The regions of overlap are fre-
quently shown utilizing blended colors when it comes to visually expressing these
diverse groups.
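As an illustration, the membership values of a newly arrived item can be computed from its distances to the cluster centers; the sketch below uses the standard fuzzy C-means membership formula with a hypothetical set of centers and fuzzifier m:

import numpy as np

def fuzzy_memberships(item, centers, m=2.0):
    # u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)); closer centers get higher membership
    d = np.linalg.norm(centers - item, axis=1)
    d = np.maximum(d, 1e-12)  # avoid division by zero
    ratios = (d[:, None] / d[None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratios.sum(axis=1)

centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
item = np.array([1.0, 1.0])
print(fuzzy_memberships(item, centers).round(3))  # memberships sum to 1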
In dynamic organization, we will discuss different modes of the system including the ini-
tial maintenance mode and creation mode, along with the other tasks performed during the
process.
Here, we will discuss the execution of a simple dynamic document organizing system.
When the system starts execution, it does not have any text. We can say that at this stage,
the system is in initial maintenance mode. We provide the system with the initial text,
which is then converted into a single cluster. After providing more text, it starts clustering
this text. Now, the system moves to the creation mode (Fig. 8.17).
The level of organization may deteriorate as the maintenance mode progresses. A pivot is required at this point in order to move from maintenance mode to creation mode. This change is prompted directly by the flood of later texts: a lopsided distribution of texts across clusters can trigger it, and the transition might also be defined by a clustering index based on both intra- and inter-cluster similarities.
Once the decision to transition has been taken, the choice emerges of whether to make the change a subtle push or a significant shift. The decision between smooth and hard transitions is relevant here; the former entails adjustments such as division and merging.
The maintenance mode acts as an important concept in the world of the dynamic docu-
ment organization (DDO) system. Consider this stage as the conductor of a symphony,
putting the freshly supplied texts together in a musical composition. The maintenance
mode enters as the texts stack up like fascinating jigsaw pieces and begin to perform their
functionality using the technique of text clustering. In this stage, the DDO system func-
tions as a nimble librarian, grouping texts into collections with related themes. The system
appears to be building a highly structured library, with each cluster standing in for a dif-
ferent shelf of information. However, this process is dynamic, and the system is constantly
learning to better interpret the words.
The technology smoothly incorporates new texts into old clusters, much like a good
curator does when curating an art show. In a manner similar to a curator updating an exhi-
bition with fresh artwork, this guarantees that the arrangement stays consistent and cur-
rent. In essence, the DDO system’s maintenance mode fosters a region of permanent
harmony in the middle of textual disorder. It’s a mode that captures the system’s capacity
to develop, learn, and build a knowledge base that is always expanding. Texts that are
added later are organized into their own clusters based on how far apart or similar they are
to one another. More texts are anticipated to enter the system in a stream. They are com-
pared to and separated from the prototype vectors of clusters. They are grouped together
into clusters with the greatest similarity and the smallest distance. The clusters that are
created in the creation mode are thought to be hyperspheres.
Table 8.5 compares non-decomposition and decomposition approaches specifically for
dynamic document organization (DDO).
This mode activates when we have a large number of texts ready to join the group but the
existing clusters aren’t quite right. It’s similar to moving furniture around in a room to
make a place for a new piece; we’re changing the arrangement to make room for new
viewpoints. In order to provide those arriving texts with a comfortable place to live, we’ll
be creating brand-new clusters when the creation mode activates. Imagine creating new
neighborhoods in a developing city; each cluster becomes a thriving community of texts
that are related. We can avoid the mess of trying to fit the new texts into clusters where they
don’t quite fit by doing it this way. Instead, we’re creating a space that reflects their par-
ticular themes and energies.
Of course, this method involves more than merely throwing texts together at random.
We’re using factors like the number of fresh texts poised for publication and their disper-
sion across several clusters. It’s similar to planning a party in that we want a well-balanced
assortment of people from various backgrounds to keep the gathering exciting and ener-
getic. Therefore, in the creation mode, we’re creating clusters that each tell unique stories
while ensuring that everyone has a voice. Consider the creation mode as a makeover for
your library of books. Now is the time to get your hands dirty, organize those volumes, and
give your library a brand-new appearance. Like when you first stack all of your books
together, your system is initially in maintenance mode. It is comparable to shifting from
pile-up mode to full-on organizing mode. When you have a sufficient number of books
stacked up, this switch is pressed, signaling that it is time to start creating.
From the initial maintenance phase to the dynamic creation mode, we’re changing
gears. Consider it as a transition from the preliminary collecting stage to a more thorough
arranging stage. This is where the actual activity starts, whether you’re beginning from
scratch or simply adding more information. Let’s now delve into the specifics of the cre-
ation mode. You combine those sorted texts into a single large group. From there, they fit
into subgroups with comparable content like jigsaw pieces. Imagine for a moment that you
were organizing your library into mystery, romance, or sci-fi sections. The creative mode
can be implemented in a variety of ways. It’s comparable to selecting various cake-baking
recipes. You will obtain an organization that has a little different flavor depending on the
clustering algorithm you use.
Consider the time you used to label your orderly shelves. That’s sort of what’s going on
here. To help you find what you’re looking for fast, name your clusters as you would tag
your sections. It’s interesting to note that this naming aspect hasn’t been adequately
assessed in the world of literature. There’s room for more if you’re feeling ambitious.
Consider creating many short summaries for each cluster, acting as mini-trailers for your
content. Why stop there, then? With virtual text generation, you might further spruce up
your collection by adding a spice of creativity. The creation mode offers a lot, from chang-
ing modes to sculpting content. As a result of the creation mode, named clusters of equiva-
lent works are formed. By creating categories in advance, a list of cluster names results.
Clustered texts serve as model texts for learning classifiers that are operating in mainte-
nance mode. As a result of the creation mode, the text categorization’s early chores are
automated. The quality of samples created using the creation mode is lower than the qual-
ity of samples created manually.
Two types of text organization exist: first, the “hard organization” option, which is
similar to completing a problem all at once, and the second method, “soft organization”,
which is more like adding or removing jigsaw pieces as you go.
This section focuses on additional work required to enhance the DDO system. The authors’
implementation of the DDO system includes text categorization, text clustering, and clus-
ter naming. We require the system to perform text segmentation, text summarization, and
text taxonomy generation that were discussed in earlier chapters. Text summarization is
required for the system to process texts more effectively, and taxonomy production is
required to switch from the maintenance mode to the creation mode. In this part, we pro-
vide a brief description of the extra jobs and explain how the DDO system will use them.
Let’s look at the creation of taxonomies, which was discussed earlier, as extra work for
putting the DDO system into place. The process of creating topics and the connections
between them and texts from a corpus is referred to as taxonomy generation. Four schemes for creating taxonomies were discussed earlier; the current implementation uses a clustering-based technique, although a different generation scheme could also be utilized.
Consider text summary as an additional responsibility for putting the DDO system into
practice. Encoding summaries into numerical vectors requires substantially less effort than
encoding whole texts. The multiple-text summarization produces cluster scripts, which are
cluster summaries. Summaries of well-organized materials should be displayed to users
instead of entire texts as their previews. Therefore, adding the text summarization module
is anticipated to enhance this system.
Let’s think about integrating the text segmentation discussed before into the DDO sys-
tem. The length of the texts that make up the system is extremely diverse; it is possible that
the system has a very long text that covers several different topics. When a lengthy text is
imported into the system, text segmentation is used to divide it into topic-based subtexts.
The technology creates subtexts from the lengthy text and treats them as distinct entities.
To do this, we extend the heuristic approaches into more advanced ones. Virtual texts, which are different from the real ones, are texts that have been intentionally created by concatenating subtexts from various texts. When text segments are added to the system, we divide the text into paragraphs. A comprehensive text can then be created by putting together subtexts or paragraphs that are pertinent to the subject, and pronouns in individual sentences can be replaced with their matching nouns.
This section addresses the problems of encoding the text into numerical vectors. Let’s
think about the procedure for choosing one of the feature candidates to be the feature. We
need to perform indexing on the corpus in order to extract a large number of features from
it. Most effectively, only a few hundred features are chosen from among them. Text repre-
sentations in numerical vectors often have a dimension of three hundred. Efficiency is improved if we convert texts into smaller string vectors rather than large numerical vectors. The sparse numerical vectors also create a problem: it has been observed that in sparse vectors, zero (0) values dominate the other values, accounting for almost 90% of the entries, and two sparse vectors typically have no similarity to each other. This indicates that the sparse distribution is to blame for the poor discrimination between numerical vectors. The authors found a solution by encoding texts into tables rather than using numerical vectors.
Poor transparency is another problem with numerical vectors used to represent texts.
Text contents cannot be inferred from representations or numerical vectors in any way.
Numerical vectors that simply contain numerical values lose symbolic qualities that
directly reflect the content. It is presented as an ordered list of numerical values without
any explanation of what those values mean.
The move from the creation mode to the maintenance mode gives the user the option to decompose the classification into as many binary classifiers as there are clusters. The advantage is more dependability in keeping the text organized, but the cost is high overhead. We go over the topic of decomposition in this transition in this section.
Texts can be divided into positive (+) or negative (-) categories. Each cluster receives a classifier, and texts associated with that cluster are given the positive class label. From the negative class, which contains the texts that do not belong to the cluster, only a sample is chosen, roughly as many texts as in the positive class. There are numerous possible subsets of texts for the negative class, whereas the set of texts with the positive class is fixed. The effectiveness of classifying new texts depends on the training data that has been provided.
If binary classification is decided upon, we need to provide as many binary classifiers as there are clusters. In the abovementioned method, we prepare the texts by labelling them with the positive and negative classes, which are named symbolically as categories. The classifiers that match the clusters are trained using their own example texts before waiting for new texts to be added. Each incoming text is then classified into the positive or negative class of each classifier, where each classifier corresponds to a cluster. If we decide to execute the classifiers in this way, we need an additional process for selecting among the already classified candidates.
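In code, this per-cluster binary decomposition resembles a one-vs-rest setup; the sketch below (hypothetical texts and clusters, scikit-learn TF-IDF plus logistic regression) trains one positive/negative classifier per cluster:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "stock markets and banking news",      # cluster 0
    "interest rates and stock trading",    # cluster 0
    "football matches and goals",          # cluster 1
    "cricket and hockey tournaments",      # cluster 1
]
cluster_labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# One binary classifier per cluster: its own texts are positive, the rest negative
classifiers = {}
for cluster in set(cluster_labels):
    y = [1 if label == cluster else 0 for label in cluster_labels]
    classifiers[cluster] = LogisticRegression().fit(X, y)

new_text = vectorizer.transform(["latest football scores"])
for cluster, clf in classifiers.items():
    prob = clf.predict_proba(new_text)[0, 1]
    print(f"cluster {cluster}: positive probability {prob:.2f}")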
Operating the maintenance mode of the DDO system presents the dilemma of choosing
between the crisp classification and the fuzzy classification. We don’t need to perform the
task of binary classification if the crisp categorization is established as the system’s operat-
ing policy. If fuzzy categorization is chosen, decomposition is mandatory. While the DDO
system is running, there can be a switch between the crisp and fuzzy organization.
Operating the system involves choosing one of the two categories or both of them.
This section focuses on variations that come from the DDO system. It was created with the
intention of continuously managing text organization. From the set of tasks performed in
the process of dynamic document organization, we can derive other tasks, e.g., taxonomy
generation, pattern detection, virtual text generation, etc. The tasks that result from this are
used to improve the system’s performance and features. These tasks that are derived from
the process of DDO are called DDO system variants.
8.10 Summary
In this chapter, we have discussed two topics, i.e., taxonomy generation and dynamic
document organization. Taxonomy generation refers to the process of generating hierar-
chical structures from the contents of the documents. Different taxonomy generation
schemes have been discussed including index-based, clustering-based, association-based,
and link-analysis-based schemes. These concepts have been discussed with the help of
examples. After this, a detailed discussion is provided on the dynamic document organiza-
tion. Different tasks associated with dynamic document organization have been discussed
in detail. Finally, different challenges of dynamic document organization are mentioned.
8.11 Exercises
Q1: Consider the following text and the training data. Using this data, try to identify the
keywords from the given text.
Taxonomy 1
Hierarchical-based 1
Clustering 0
Classification 1
Attributes 1
Link 1
Here the value "1" marks a word as "Important", and "0" marks it as "Unimportant".
Q2: What is the difference between text classification and taxonomy generation?
Q3: Consider the following text corpus and the documents: Classification, is, the, pro-
cess, of, assigning, class, labels, to, textual, documents, on, basis, training, data.
Q4: Consider the following classes, properties, and individuals. Define them using the
Web Ontology Language (OWL).
Q5: Suppose you are working on developing a dynamic document organization for orga-
nizing news articles. The intention is to develop a system that automatically orga-
nizes news articles on the basis of their contents. Mention in detail which tasks you
will perform in the following steps:
• Data collection
• Preprocessing
• Topic modeling
• Dynamic updating
• Evaluation
Q6: What is the difference between online clustering and the normal clustering process?
Q7: Consider the following documents, and apply K-NN clustering to cluster similar
documents. You can use the value of K=2.
D0: Fruits are delicious foods that come in many flavors, colors, and shapes. They
contain vitamins, minerals, and fiber that are important for maintaining health.
From the sweetness to the tropical delight of bananas and pineapples, fruits
provide a refreshing and satisfying taste. Fruits provide the nutrients that are
essential for the body.
D1: Vegetables are plants that exist in nature. We eat vegetables to remain healthy
and strong. Vegetables come in different shapes and colors. They have different
types. Some common examples include green spinach, orange carrots, and red
tomatoes. Vegetables provide us with vitamins that help our bodies grow and
work properly. Eating vegetables can make us feel energetic and happy.
D2: Sports are exciting and active games that are played by people for fun, exercise,
and competition. Different types of sports are played in different regions of the
world, e.g., football, cricket, hockey, tennis, etc. All of these games have a large
fan base. Different tournaments of these games are arranged throughout the
world every year. Hundreds of thousands of people enjoy these games.
D3: Football is one of the most common sports that is played throughout the world.
It is played on a big grassy field. The two teams try to compete with each other.
The team that scores the maximum number of goals wins. It is a popular team
sport enjoyed by people around the world. It’s played on a big grassy field, and
the goal is to score points by getting a ball into the opposing team’s net. Players
use their feet to kick the ball, and teamwork is crucial to pass, defend, and score
goals. Football matches can be intense and exciting, with fans cheering for their
favorite teams. Whether it’s a friendly game in the park or a professional match
in a stadium, football brings people together to celebrate skill, strategy, and the
joy of playing as a team.
Eatables→Vegetables
Eatables→Fruits
Q10: Write down the pseudocode of the online K-means clustering algorithm.
Recommended Reading
• Blueprints for Text Analytics Using Python: Machine Learning-Based Solutions for
Common Real World (NLP) Applications by Jens Albrecht, Sidharth Ramachandran,
and Christian Winkler
Publisher: O’Reilly Media
Publication Year: 2021
This practical guide offers data scientists and developers a set of proven methods for
effectively dealing with typical challenges of text analysis and natural language pro-
cessing. Written by Jens Albrecht, Sidharth Ramachandran, and Christian Winkler, the
book features case studies and detailed examples of Python code to facilitate a quick
and smooth entry into the topic.
Visualization Approaches
In the context of human-centric text mining, the interaction of the user with text mining
systems has critical importance. This consequently emphasizes the need to improve data
visualization techniques. Various techniques have been developed so far to visualize the
data. In this chapter, we will discuss various text visualization techniques in detail, along
with sample data and how these techniques can be used to represent it.
• User interactivity: Users are actively involved in information extraction, analysis, and
interpretation. This interaction allows the users to provide specific domain knowledge
to refine their queries and adjust parameters in order to extract the relevant and most
accurate information.
• Visualization: Human-centric text mining often includes visualizations in order to bet-
ter present the results and provide information in a more accurate and understandable
format. Visualizations can help users identify the patterns, trends, and relationships that
might be difficult to understand in the case of the raw text.
• Iterative process: Human-centric text mining is an iterative process. This means that
users provide their queries, get the results, refine the queries, and get more accurate
results.
• Domain expertise: The user’s domain expertise and knowledge are important in getting
accurate information in human-centric text mining systems. So, domain knowledge and
human expertise are important in providing the system with the knowledge about the
information that needs to be displayed to the user.
• Customization: Human-centric text mining recognizes that different people have differ-
ent information needs. Therefore, visualization tools and methods are often customiz-
able in order to meet different user needs.
• Feedback loop: The continuous feedback provided by the user can help text mining
systems improve the accuracy and the analysis process. User feedback can also be used
to adapt and refine the text mining process.
• Contextual information: Textual data often contains noise and ambiguity that may
require human judgment to interpret the results accurately.
Overall, we can say that the accuracy of the results depends not only on the information processing of the text mining system but also on how humans interact with it. Furthermore, accurately interpreting and analyzing the information requires the domain knowledge and expertise of humans. It is therefore important to provide users with a variety of tools to interact with the data in text mining systems. These tools help users effectively explore, analyze, and interpret the textual data, which ultimately enhances the overall knowledge discovery process.
The following are some key types of tools that a text mining system should offer to
improve user interaction:
• Search and query tools: Text mining systems should include different search tools that
allow users to interact with the system by providing keywords, phrases, or different
textual queries to retrieve relevant information from the system. So, advanced search
operators and filtering options should be provided in order to refine search results.
• Visualization tools: These tools are used to transform textual data into graphical repre-
sentations. This makes the patterns and other extracted information more understand-
able. Examples of these tools are word clouds, bar charts, line graphs, and network
diagrams that can help the users to enhance the understandability of the extracted
information.
• Clustering and categorization tools: These tools are used to group similar documents or
textual data together on the basis of common features, contents, or characteristics.
These tools can help users understand the data more effectively.
• Sentiment analysis tools: These tools help understand the textual data that contains
opinions, emotions, or sentiments. These tools can assess the overall emotional context
of a document. These tools help understand public sentiment about a certain topic.
• Named entity recognition (NER): These tools can be used to identify and classify the
different entities, e.g., names of people, places, organizations, dates, etc. present within
the text. This can help users in identifying key actors of a system.
Simple graphical controls, e.g., pick lists, drop-down boxes, radio buttons, etc., are commonly used in almost all types of software applications, but they have become inadequate for handling large textual data. These controls cannot properly represent the patterns and the information present in the textual data, so designers of text mining systems emphasize the use of more robust tools that present the information in a more understandable way. Even for a normal-sized dataset, such simple controls are insufficient to properly present the extracted information.
In order to address this issue, efforts have been made to develop more advanced visual-
ization tools. The tools have been developed on the basis of different types of information
and the ways the information can be conveniently presented to the user. Also, the behav-
iors of the user and the information needs of different domains, e.g., computer science,
civil engineering, behavioral science, etc., are kept in mind while developing these tools.
Such tools are developed by keeping the following goals in mind:
• Focused browsing: These tools focus on the domain-specific data, which helps users
identify the relevant information more conveniently.
• In-depth exploration: The depth and complexity of these tools help users interact with
the system more appropriately and explore complex patterns and information.
• Iterative exploration: These tools help users refine their queries at runtime in order to
get more precise and refined information. These tools provide a dynamic interaction
where the next query is provided on the basis of the previous information retrieved.
These tools are normally based on the information needs of different domains including
data visualization, information retrieval, machine learning, and cognitive psychology.
Tasks, e.g., cluster analysis, topic modeling, network analysis, interactive dashboards, and
natural language processing, are normally performed by using these tools. These tasks
ultimately help users get in-depth insight from the data and perform decision-making on
the basis of the information retrieved from the textual systems.
A simple text browsing interface can be seen in Fig. 9.1. There are different issues with such conventional and simple interfaces: they normally have limited query-building functionality and a limited capacity for information visualization, even for smaller documents. Furthermore, the character-oriented nature of the interface reduces its ability to provide visual aids that render the information in a more effective way.
Visualization approaches in text mining systems, on the other hand, normally focus more on the graphical display of the textual information in the form of diagrams rather than list boxes and drop-down boxes. Simple conventional visualization approaches try their best to provide an effective rendering of the information; however, advanced visualization tools provide more sophisticated ways to present complex information and trends in textual data.
Conventional interfaces are also weak when it comes to run-time interaction for refining queries. Text visualization tools, on the other hand, can provide filters to view the textual data at any level of abstraction by specifying a threshold value. A simple example is a circular control that can be expanded and contracted to cover the area from which you want to display a list of cancer patients; an accompanying bar chart can be updated on the basis of the area covered by the circle.
Such tools enable users to interact effectively with the system and focus on a specific point of interest. We can also add refinement controls to these tools. Such controls let the user highlight the information from different perspectives; for example, by using a filter, the user can search only for cancer patients under the age of 18 or only for unmarried female cancer patients.
The ability to apply filters is especially useful in the case of large data, where it is impossible to render all the information effectively within the limited space of the output device. Filters enable us to focus only on the specific information and skip the rest.
The following are some of the advantages of the advanced-level text visualization
approaches (visualization tools) over the conventional basic-level tools (character-based
information browsing controls):
• Visualization tools have the capability to show a large volume of information as com-
pared to the conventional character-based controls.
• Visualization tools can show the relative grouping of the clusters, their similarity and
dissimilarity, the distance between them, or other information related to the grouping of
the items.
• Visualization tools provide the capacity to interact with the highlighted features within
the contextual information in which the feature is relevant.
• Visualization tools provide the ability to view the information from different levels of
abstraction. For example, in the case of geographical applications, you can view the
information from the macro level to the micro level.
• Visualization tools can be used to conveniently search for the right information from
large data volumes.
It should be noted that adding more complex features sometimes makes interaction cumbersome because of their unfriendly interaction with the user (i.e., users may need to provide more information in order to interact with these features). This may ultimately affect the analysis process, because one can easily be confused about the exact information that needs to be provided in order to get the output from the system.
This also emphasizes the selection of appropriate visualization tools for the display of
the information. It may depend upon the required information along with the appropriate
way to represent that information. For example, the comparison can be visualized more
appropriately by using a bar chart as compared to the circle control shown in Fig. 9.2. Text
visualization tools have evolved from simple character-based tools to advanced visualiza-
tion tools that support dynamic interaction.
[Fig. 9.2: A circular filter control drawn over a map region (e.g., around London); an accompanying chart is updated on the basis of the area covered by the circle]
A text mining system comprises four components, also called layers.
The last layer is the “Presentation” layer, also called the visualization layer. This is the
layer of the text mining system that the user interacts with. This layer can be used to per-
form two tasks. The first task is the input of the data. All the user input is provided through
this interface. This is the interface where the user provides the required queries to retrieve
the relevant information. The system takes this information and then processes the pro-
vided query to retrieve the relevant information from the corpus.
The second task is to present the output to the user. All the output information is dis-
played on this layer. This is the layer where visualization tools are present to render infor-
mation and present the results. Once the output is provided, the user refines the query on
this basis to further update the results according to the requirements.
The presentation layer provides the user with capabilities ranging from simple browsing to advanced visualization aids. It can be seen that the visualization layer comes after the
“Core Mining Operations” module. In earlier systems, the visualization layer was tightly
coupled to the mining operations layer, which made it very difficult to update the text min-
ing systems and introduce new analysis techniques. However, in modern text mining sys-
tems, this is not the case. Now, the visualization layer is loosely coupled with the core
mining operations layer.
This is also important from the visualization layer's own point of view. The reason is that text mining tools are improving day by day, and they need to improve their visualization capabilities in order to display advanced-level analysis results. So, as discussed earlier, visualization tools have kept on improving with the passage of time, and
the process continues. We need to update the visualization layer with more advanced tools
for the enhanced needs of information rendering. This is only possible if the visualization
layer is decoupled from the lower layers so that we may plug and play the advanced visu-
alization tools whenever available and needed.
This decoupling between the visualization layer and the lower-level layers is possible
through various standard information interchange protocols and formats. For example,
XML is one of the common information interchange formats that can be used to send and
receive information from the lower layers. As the format remains the same, we can easily
replace one control with the other without making many changes in the system. So, we can
say that visualization tools have become an important part of the text mining systems that
play a critical role in the display of information according to the needs of the user.
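As an illustration, the following minimal sketch shows how a loosely coupled visualization layer could consume results delivered as XML without knowing anything about the mining algorithms that produced them. The element and attribute names are assumptions made for this example, not a standard format prescribed by any particular system.

import xml.etree.ElementTree as ET

# Assumed (illustrative) result format returned by the core mining layer
results_xml = """
<results query="USA">
  <concept name="Agriculture" documents="120"/>
  <concept name="Metals" documents="85"/>
  <concept name="Science" documents="42"/>
</results>
"""

root = ET.fromstring(results_xml)
# The visualization layer only needs to understand this format, not the mining
# algorithms behind it, so either side can be replaced independently.
for concept in root.findall("concept"):
    print(concept.get("name"), concept.get("documents"))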
Just like text mining systems, visualization tools have also evolved with the passage of
time and the complexity of the information needs of businesses and organizations. Before
going into the depth of what visualization techniques are available, first let us explore
more about the browsing at the presentation layer. Browsing is a broad term that specifies
different tasks that can be performed at the presentation layer. These tasks may include the
end user search through text, queries, query refinement, and text display after interaction
with the middle layer (also called the logic layer). Along with the abovementioned tasks, the presentation layer also provides some functionality to directly interact with portions of the text data. We can also use this layer to fine-tune the data processing algorithms, e.g., by providing parameters.
Normally, these browsing tools interact with the logic layer using some specialized
query language that can be used to direct different tasks at the middle layer, which ulti-
mately affects the results that are retrieved from the document corpus. So, we can say that
different text processing operations are started by the presentation layer. Once the opera-
tions are completed, the output data is displayed in different formats, for example, tables,
lists, trees, etc. The tasks performed for extraction and processing of data are guided and
constrained by the query provided by the user at the browsing interface. A simple example
can be to search all the tasks performed by a certain user (subject).
Such basic-level controls allow only simple browsing, not browsing at a detailed level.
For example, you can provide a query to display the documents that contain data about the
USA or the UK. You may get all such documents as output, and they may be sorted; how-
ever, you cannot browse them content-wise. For example, a document may contain the
data about the USA or the UK but in what context? This facility may not be present in
simple browsing interfaces. With advanced visualization tools, you can perform interac-
tive query execution and see in depth which aspects of the USA or the UK are contained
in the retrieved document.
Although efforts are made to enhance the capacity of the browsing tools at the presenta-
tion layer in order to customize them for the specific needs of the user, their capacity is still
limited to meet the visual needs. One of the features of these browsing tools is that differ-
ent tools can be used for the same purpose and the best can be selected. For example, the
pie chart and the bar chart can be used to represent the same data; however, the one that
may be used will depend on the user’s selection.
Here are some simple browsing controls that were conventionally used in simple brows-
ing interfaces:
• Labels: Simple textual control to display some static text, e.g., a guidance message
about what type of data to enter in a field.
• Text Box: An input box that can be used to take a single line input, e.g., a threshold
value after which the genetic algorithm should stop its execution or the number of clus-
ters that the K-means algorithm should generate after its execution on a certain textual
document.
• DropBox: A simple drop-down control that contains a list of different values from
which the user can select one. For example, the list of countries from which the user can
select his own or a list of months from which the user can select his month of birth.
• CheckBox: The check box control can be used to select different options at the same
time, for example, the concepts that the retrieved documents should contain. Similarly,
the check boxes can be used to select the cities where a certain product should be
available.
• Radio Buttons: The radio buttons can be used to select one option from a group of
options, for example, the student grade from the groups “A”, “B”, “C”, and “D”.
• Buttons: The buttons are clickable controls that can be used to submit a certain query.
The click of a button shows that a certain input has been completed by the user and that
the system should perform a task as a result. For example, we may submit personal data
and then click the submit button to process it further. Similarly, a button can be used to
start processing once the data related to a certain query has been completely entered.
• ListBox: Just like the drop-down box, the list box can be used to display a list of items.
The difference is that the drop-down box displays the complete list only after clicking
it, whereas a list box displays a certain number of items all at once.
• GridBox: The grid box can display the data in the form of a table, for example, the
features of all the documents that contain a certain concept.
As can be seen, the controls discussed above provide basic functionality and a limited
interaction to refine the queries. Visualization tools have become more important for text
mining systems.
Figure 9.3 shows a simple interface containing some conventional browsing controls.
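For illustration, the following is a minimal sketch of such a conventional browsing interface built with Python's tkinter. The labels, field names, and example values are assumptions made only to demonstrate the controls listed above; they are not taken from any particular system.

import tkinter as tk
from tkinter import ttk

root = tk.Tk()
root.title("Simple Browsing Interface")

tk.Label(root, text="Enter a search term:").pack()             # Label
query_box = tk.Entry(root)                                      # Text box
query_box.pack()

country = ttk.Combobox(root, values=["USA", "UK", "France"])    # Drop-down box
country.pack()

sort_results = tk.BooleanVar()
tk.Checkbutton(root, text="Sort results", variable=sort_results).pack()  # Check box

grade = tk.StringVar(value="A")
for g in ["A", "B", "C", "D"]:                                  # Radio buttons
    tk.Radiobutton(root, text=g, variable=grade, value=g).pack()

results = tk.Listbox(root)                                      # List box
results.pack()

def run_query():
    # Placeholder: a real system would pass the query to the logic layer
    results.insert(tk.END, f"Results for '{query_box.get()}' in {country.get()}")

tk.Button(root, text="Search", command=run_query).pack()        # Button
root.mainloop()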
These were some details about the browsing functionality of the presentation layer.
Now, we will discuss visualization tools in detail.
There is already a lot of literature available about the use of generic and domain-specific visualization tools, and several visual techniques are available for the visualization of unstructured data like text. So, it is difficult to single out the best technique to visualize the data; however, based on the requirements, we can select the one that best shows the
analysis results. As far as the textual data is concerned, different visualization techniques
are available, e.g., concept graphs, histograms, circle graphs, self-organizing maps, etc. In
this section, we will discuss some common visualization tools that can be used in text min-
ing systems to display the analysis results.
“Concept” refers to the subject of a particular query, e.g., “USA” or “UK”. The term “graph” refers to a pictorial display of the contents. So, concept graphs are visual techniques that can be used to display the analysis results. Various types of concept graphs are available for use. In this book, we will discuss simple concept set graphs, directed acyclic graphs (DAGs), and concept association graphs.
As the name implies, the simple concept graph is the simplest visualization technique that
can be used to show unstructured data like the text in documents. The simplest concept
graphs have two main benefits. Firstly, they provide the facility to organize the textual data, and secondly, they allow the user to interact with the text. So, the user can
refine the queries and regenerate the results. All this can be done by just clicking the
required node in the concept graph, and the node will be expanded. The underlying details
will be displayed. Both of these features provide easy user interaction along with effective
data analysis.
It should be noted that to increase the level of abstraction, we can link different graphs
with each other to provide more in-depth analytical details. The high-level graphs can
provide the context information, whereas the low-level graphs can provide the specific
information within that context. Simple set concept graphs have been successfully used in
many text mining systems.
Now, we will discuss the structural details of the simple concept set graph. It comprises
nodes and leaves. It is just like a tree structure that shows the contents in a hierarchy. The
root nodes represent the high-level concepts, whereas the leaf nodes represent the sublevel
concepts. The sublevel concepts can be considered as subsets of the higher-level concepts.
The nodes are called the vertices, and the relation between the nodes is represented by an
edge. The graph is traversed from top to bottom, i.e., from root node to leaf node. A par-
ticular path within the graph shows the subdivision of the concept into subconcepts, i.e.,
we move from more generic concepts to a specific concept. Figure 9.4 shows a sample
concept set graph.
The root node represents the concept “USA”, and the three intermediate nodes repre-
sent different aspects of this concept, i.e., “Agriculture”, “Metals”, and “Science”. The
metal concept can be explored further by clicking on this node. It can be seen from Fig. 9.4
that it comprises further subconcepts, i.e., “Lead”, “Zinc”, “Gold”, and “Silver”. Here, the
concept “Silver” shows that the document corpus contains the documents that discuss
“Silver” in their contents. The name given to each node represents the concept. Here, a
single, two, or more word combination can be used, but it should be kept in mind that the
concepts should be more meaningful. The traversal can be in both directions, i.e., upward
or downward.
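For illustration, a minimal sketch that renders the concept set graph of Fig. 9.4 with the networkx and matplotlib libraries might look as follows; the two-level layout coordinates are chosen by hand for readability and are not part of any prescribed method.

import networkx as nx
import matplotlib.pyplot as plt

# Concept set graph of Fig. 9.4: root concept, its aspects, and the subconcepts
tree = nx.DiGraph()
tree.add_edges_from([
    ("USA", "Agriculture"), ("USA", "Metals"), ("USA", "Science"),
    ("Metals", "Lead"), ("Metals", "Zinc"),
    ("Metals", "Gold"), ("Metals", "Silver"),
])

# Hand-placed positions: root at the top, subconcepts below
pos = {"USA": (2, 2), "Agriculture": (0, 1), "Metals": (2, 1), "Science": (4, 1),
       "Lead": (0.5, 0), "Zinc": (1.5, 0), "Gold": (2.5, 0), "Silver": (3.5, 0)}
nx.draw(tree, pos, with_labels=True, node_color="lightblue",
        node_size=2000, font_size=8, arrows=True)
plt.show()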
The interesting aspect of the concept graph is that not only can it be used to traverse the
relevant concepts but irrelevant concepts as well. For example, if we are interested in seeing
the products with the maximum sale and a certain product appears with a minimum sale per-
centage, the product may be irrelevant to us. While the irrelevant concepts may be skipped,
sometimes these concepts may lead to further analysis. For example, by seeing a product with
minimum sales, we may perform further analysis to check the reasons behind the low sales.
Modern visual techniques enable us to expand and contract a certain node. By expand-
ing a node, the sublevel concept graph will be displayed, and by contracting the node, the
graph will become invisible. This feature lets users focus on certain aspects of the concept.
Another form of the concept set graph is the “Treeview” control that can be seen in the
navigation panel of Microsoft Windows Explorer. The “Treeview” control displays the
root node as the root drive, the intermediate nodes display the directories, and the leaf
nodes display the files.
Now, mapping it to a text mining system, the nodes may represent the concepts along
with the necessary information. For example, the root node may represent the concept
“USA” along with the number of documents retrieved from the corpus. The sub-nodes
may represent the subconcepts along with the number of documents containing each con-
cept. We can also modify the controls to enhance the visual aid; for example, the aspects
that contain the minimum number of documents may be highlighted in red.
The visual tree control provides the intermediate display showing both the context
information and the specific concept. For example, we may see the number of documents
containing the “Silver” concept in the context of the “Metal” concept. Again the sub-nodes
here represent the subset of the higher-level concepts. We can link each node with some
other visual control that can provide some more information other than that shown in the
concept set graph. For example, by clicking the sub-node “Silver”, another control (e.g., a
pictorial display of geographical locations having “Silver” resources) may be highlighted.
This linking of different visual display techniques helps the user perform a more accurate and effective analysis of the data. For example, we may provide a different measure to select the subconcepts by refining the previous query.
Efforts are always made to provide the most important information along with the name
of the concept node.
For example, we can provide the “Concept Support” value along with the concept
name. This helps the user to get the support of the given concept within the corpus. We can
also provide the same information with sub-nodes and leaf nodes. Alternatively, we can
use a different technique to show the hierarchical relation. This technique comprises a directed graph where the quality measures are shown on the edges. Such a graph is called a directed acyclic graph (DAG). Now, we will discuss this graph in detail.
DAG is another visualization technique to show the concept sets in a simple and easy
way. The graph comprises the nodes called “Vertices” and the arrows called “Edges”.
Starting from the initial node to the final, a path represents a hierarchy of the concepts and
the subconcepts. DAG can be used to represent more complex relationships that are nor-
mally difficult to model in the case of the visualization technique discussed above. For
example, it is difficult to visually show a single concept that is a subconcept of more than
one concept. However, in real life, we can face such types of scenarios. In real life, a child
concept can have more than one parent concept, which means that a concept can be a sub-
set of more than one high-level concept. For example, the concept of “Amphibious vehi-
cle” can be a subconcept of both “Vehicle” and “Boat”.
So, the DAG tool can be used to show the complex associations of real-life scenarios.
[Fig. 9.5: A directed acyclic graph with vertices A through H]
However, it should be noted that DAG graphs can become complex with the increase in
the associations as it can be difficult to comprehend the associations when there are mul-
tiple paths from and to a single node.
According to Fig. 9.5:
Concepts “B” and “C” are subconcepts of concept “A”.
Concepts “D” and “E” are subconcepts of concept “B”.
Concept “F” is a subconcept of both concepts “D” and “E”.
Concept “G” is a subconcept of concept “C”.
Concept “H” is a subconcept of concepts “F” and “G”.
The fact that the concept “F” is a subconcept of “D” and “E”, which ultimately are
subconcepts of “B”, makes it difficult to comprehend the concept “F”. However, the selec-
tion of the concept “F” in any analysis may depend upon the value of the selection measure
used to collect the documents or concepts. So, for example, for a certain measure, we can
consider “F” as a subconcept of “D”, and for another measure, we can consider it as a
subconcept of “E”.
The following is the Python code for generating the DAG graph shown in Fig. 9.5:
import networkx as nx
import matplotlib.pyplot as plt
textual_data = {
"A": ["B", "C"],
"B": ["D", "E"],
"C": ["G"],
"D": ["F"],
"E": ["F"],
"F": ["H"],
"G": ["H"],
"H": []
}
G = nx.DiGraph()
# Add the edges (and hence the nodes) of the concept hierarchy
for parent, children in textual_data.items():
    for child in children:
        G.add_edge(parent, child)
# Draw the graph
nx.draw(G, nx.spring_layout(G, seed=1), with_labels=True,
        node_color="lightblue", node_size=1500, arrows=True)
plt.show()
The output of the code will show the same graph as in Fig. 9.5.
In the above code, first, we have imported the required Python libraries. Then the dictionary “textual_data” is used to represent the concepts that will be visualized. After this, the “DiGraph” constructor is used to create the graph. The edges (and hence the nodes) are then added to the graph, and finally, by using the “draw” function, we have drawn the graph.
It should be noted that apart from the concepts and the resulting documents, DAG can
also be used to show the activity network. This can be helpful in analyzing different activi-
ties, for example, the critical paths, the shortest paths, etc. In this case, a path in the DAG
can represent an execution flow. This is important because we can direct the system to
change the execution paths or flows in case there is some issue in a certain flow. For
example, in the case of network traffic, we can direct the traffic to a different path in case
there is some issue (e.g., cable cut) at a certain location.
The concept association graphs are another visualization tool that can be used to represent
the association between the concepts. You can select the category, and it will show the
associations between two concepts. In a simple concept association graph, an edge connects two concepts, one at each end. Normally, the concepts are related to a single category,
but we can model the multiple category concepts within a single concept graph. First, we
will explain the single-category concept association graphs, and after that, we will discuss
the multiple-category association graphs. As mentioned earlier, the single-category con-
cept association graphs link the concepts from the same category.
For example, consider the following textual data that shows the association between
different countries and the percentage of the documents that discuss these countries
together:
USA, UK → 65%
Germany, France → 20%
UK, France → 10%
France, Japan → 10%
Now, the concept association graph for this data can be shown as given in Fig. 9.6.
In this figure, the vertices show the countries, and the edges show the percentage of the
documents that discuss those countries together in a certain document corpus. The graph
provides an interactive way to perform analysis-related tasks. For example, if we specify
the threshold value as 20%, only the following two associations will appear:
USA, UK → 65%
Germany, France → 20%
The association graph shown in Fig. 9.6 is an undirected graph, i.e., an association
between two countries can be evaluated from either side. This does not show the hierarchi-
cal nature of the concepts; however, the graph can be directed as well. In the case of a
directed graph, there will be a directed edge (an arrow). The concept at the tail of the arrow
represents the parent concept, whereas the concept on the side of the arrowhead represents
the subconcepts (this should not be confused with the generalization/specialization rela-
tionship in software engineering). Again, a parent concept can have more than one subcon-
cept. If a parent concept has two child concepts as subsets, there will be two arrows from the parent concept. Each arrow will be captioned with the association support value. We can make this graph interactive as well by specifying a threshold value, in which case the associations whose association support falls below the threshold will disappear. Just like the single-category association graphs, there can be multiple-
category association graphs as well. In multiple-category association graphs, the vertex
can contain two different types of concepts. Such association graphs provide more infor-
mation as compared to the single-category association graphs; however, by increasing the
categories, the graph may become more complex.
[Fig. 9.6: Concept association graph of the countries; each edge is labeled with the percentage of documents that discuss the two connected countries together]
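As an illustration, the following minimal sketch builds the association graph from the example data above and applies a threshold so that only associations with sufficient support are drawn. The layout and styling choices are assumptions made for this example.

import networkx as nx
import matplotlib.pyplot as plt

# Pairs of countries and the percentage of documents discussing them together
associations = {("USA", "UK"): 65, ("Germany", "France"): 20,
                ("UK", "France"): 10, ("France", "Japan"): 10}
threshold = 20  # keep only associations at or above this support

G = nx.Graph()
for (a, b), support in associations.items():
    if support >= threshold:
        G.add_edge(a, b, weight=support)

pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_color="lightgreen", node_size=1800)
nx.draw_networkx_edge_labels(G, pos,
                             edge_labels=nx.get_edge_attributes(G, "weight"))
plt.show()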
There are several measures that can be used to associate the concepts, e.g., document
similarity using the cosine, Euclidean distance, Manhattan distance, and arithmetic means.
Common measures are support and confidence. The measure should be selected such that it remains meaningful if the association is read from either side.
As far as the threshold value is concerned, selecting a low value returns a greater number of associations, which makes the graph more complex, whereas selecting a high threshold value may result in a sparse graph with few associations. Although the selection of the threshold value depends upon the particular requirements of the analysis, a balanced value may provide a good mix of contextual information as well as specific information about the concept.
Now, if we talk about the operations that can be performed on the concept graphs,
mainly there can be four types of operations:
• Browsing operations
• Searching operations
• Linking operations
• Presentation operations
The browsing operations are related to the selection of the documents from the corpus
on the basis of the query. All the documents that fulfill the criteria specified in the query
are returned as a result. For example, we can specify a set of concepts here, and the docu-
ments that contain those concepts are returned. Another example may be to return all the
documents in which the term frequency of a certain term is greater than a specified thresh-
old value. Once the documents are retrieved as a result of a query and the graph is formed
on the visual interface, we may need to search different concepts in order to refine the
query, for example, the associations related to a certain category where the sub-association
set has cardinality greater than two.
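As an illustration, a minimal sketch of such a browsing query is shown below; the document collection, the term, and the threshold are assumptions made for this example.

# Return the identifiers of documents in which the frequency of a given
# term is greater than a specified threshold value.
def browse(documents, term, threshold):
    selected = []
    for doc_id, text in documents.items():
        if text.lower().split().count(term.lower()) > threshold:
            selected.append(doc_id)
    return selected

docs = {"D1": "silver and gold and silver", "D2": "lead zinc gold"}
print(browse(docs, "silver", 1))  # prints ['D1']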
Linking operations on the other hand link more than one graph. Once the graphs are
linked with other graphs, we select a certain concept in the first graph, and the corre-
sponding concepts in the second graph are also highlighted. For example, by selecting the
concept “USA, UK”, all the agricultural concepts that are common in both countries are
highlighted.
Presentation operations are related to the rendering of the graphs on a visual interface.
As the name shows, these operations are related to the structure and the display of the
graph. For example, these functions can help highlight the concepts or associations that
exceed a certain maximum limit. Similarly, these operations can help sort the concepts,
zoom in or out, or filter certain concepts from the given concept graph.
Although the graphs present a convenient way to represent the textual data, they have
their drawbacks. The following are some of the limitations of the concept graphs:
• Although graphs are one of the simplest ways to represent unstructured data, with a
large number of dimensions, the graphs can easily become too complex to comprehend
and get the insight. This is also true in the case of a large number of concepts and a large
document corpus.
• The graphs are less efficient when analyzing complex relations, for example, when the
concepts are related to each other with respect to a number of associations in different
contexts.
• From an execution point of view, the graphs are difficult to handle, as managing a large number of nodes and edges requires a lot of memory.
• Updating the graphs (e.g., adding or removing vertices and edges) is difficult, as it requires careful handling of the data.
• It may be difficult to represent and analyze certain types of data in a graph.
Overall, we can say that although the graphs are a convenient way to represent the
associations and the relationships among the data, the selection should depend upon the
specific analysis requirements.
9.4 Histograms
A histogram is just like a bar chart that is used to display the frequency distribution. The
vertical axis represents the count, i.e., frequency, and the horizontal axis represents the
data ranges or distributions. Each bar line represents the frequency of a certain
distribution.
A histogram has the following parts:
• Title: The title provides the details about the contents of the histogram; however, some-
times, when the diagram is embedded in text and details are already provided, the title
can be skipped as well.
• Horizontal axis: The horizontal axis shows the ranges or the values for which the fre-
quency needs to be plotted. In the case of text mining, these can be concepts.
• Vertical axis: The vertical axis shows the frequency.
• Bars: Bars represent the frequency value for each concept or range.
• Legend: The legend provides additional information about how the data was collected
and what scales are used.
The following example explains how we can draw the histogram for checking the fre-
quency of different distributions. Consider the data values shown in Table 9.1 that are
taken from the output of some process.
As mentioned above, the histograms represent the distributions and their frequencies.
We need to find the distributions and the frequency of each distribution in the data. If we
consider the ranges from “0 to 9”, “10 to 19”, and so on, the frequency of each distribution
will be as shown in Table 9.2.
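As an illustration, the following minimal sketch shows how such frequencies can be computed by grouping values into ranges of width 10. The sample values are assumed, since Table 9.1 is not reproduced here.

from collections import Counter

# Assumed sample values standing in for the process output of Table 9.1
values = [3, 7, 12, 15, 18, 21, 22, 25, 31, 34, 38, 41, 45, 47, 52]

# Group each value into its range of width 10 and count the frequencies
bins = Counter((v // 10) * 10 for v in values)
for start in sorted(bins):
    print(f"{start} to {start + 9}: {bins[start]}")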
Although most of the time histograms are used for statistical or mathematical compari-
son, they have their use in text mining systems as well. For example, we can use a histo-
gram to compare the frequency of different concepts in a document corpus as shown in
Table 9.3.
Now, after this, we will draw the histogram as shown in Fig. 9.7.
Now, in the context of text mining, Fig. 9.8 shows a sample histogram. The histogram
shows the concepts and the frequency of each concept in a corpus.
As discussed earlier, histograms are somewhat similar to bar charts. An important ques-
tion here is when we should use a histogram. Histograms can be used to represent the data
where there is a large set of measures in a table. Also, histograms are an appropriate tool
to use when you want to see where the majority of the values fall on a certain measure. The
following are a few scenarios where it is more convenient to use the histograms.
Histograms are more appropriate to use when there are large sets of data and you want
to summarize it. The example shown in Fig. 9.8 is a simple example where only seven
concepts are shown. When there is a large number of concepts, histograms are the best
way to visually comprehend the distribution of each concept relative to other concepts.
Histograms can also be used to specify limits, i.e., a lower limit and an upper limit, as shown in Fig. 9.9. In the context of text mining, we can use these limits to check which concepts fall within a certain range. The concepts below the lower limit and above the upper limit may be irrelevant. For example, in the case of the concept list given in Fig. 9.8, we can specify the lower limit as 30 and the upper limit as 80. Now,
by looking at the histogram, we can see that the concepts “Department” and “Money” fall outside these limits and are therefore irrelevant to us.
Histograms are the best tools to visualize the summaries of concepts in a convenient
way. As an example, it can be seen without reading the entire list that the concept
“Department” has the minimum frequency and the concept “State” has the highest fre-
quency in the corpus. So, we can say that the histograms are the best tools when we want
to show visual comparisons. An important point here is that in a histogram, there are dis-
tributions and frequencies. It does not show the progress over time.
The ability to specify the limits makes histograms a handy tool for decision-making.
For example, we can skip certain concepts that fall below a specific limit.
The following is the Python code to show the frequency of each word in a given
paragraph.
import re
from collections import Counter
import matplotlib.pyplot as plt

def plot_word_histogram(text):
    # Preprocess the text: remove punctuation and convert to lowercase
    cleaned_text = re.sub(r'[^\w\s]', '', text.lower())
    # Count the frequency of each word
    word_counts = Counter(cleaned_text.split())
    words = list(word_counts.keys())
    frequencies = list(word_counts.values())
    # Create a histogram
    plt.figure(figsize=(10, 6))
    plt.bar(words, frequencies)
    plt.xlabel('Words')
    plt.ylabel('Frequency')
    plt.title('Word Frequency Histogram')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
# Example text
text = ("A histogram is just like a bar chart that is used to display the "
        "frequency distribution. The vertical axis represents the count, i.e., "
        "frequency, and the horizontal axis represents the data ranges or "
        "distributions. Each bar line represents the frequency of a certain "
        "distribution. Although most of the time the histograms are used for "
        "statistical or mathematical comparison, they have their use in text "
        "mining systems as well.")
plot_word_histogram(text)
The particular type of histogram to use depends on the nature of the data and the visu-
alization needs of the system.
Line graphs are another visualization tool that can be used to display the results of the analysis. At first glance, they seem to have limited capability; however, depending on the results of the analysis, they can display information that is easy to comprehend. A majority of text mining systems still use such visualization tools. They can display what can also be displayed through a histogram, with the added advantage that we can visualize results that are updated with the passage of time. One of the main benefits of these tools is that they are easily available in the form of different libraries that are free to use.
Furthermore, such tools are computationally less expensive and can be easily managed by
a system. The available tools provide libraries in the form of APIs that can be used to inte-
grate these tools into our system.
All the above-discussed benefits make line graphs an easy tool to use in simple text
mining systems or especially when such systems are at the early stage. One of the major
uses of line graphs is that they are used to compare the results of queries. The horizontal
axis contains the concepts that need to be compared, and the vertical axis shows the metric
that is used to compare the concepts.
Figure 9.10 shows the sample line graph drawn for the data given in Table 9.4.
The line graph of the students' data given in Table 9.4 is shown in Fig. 9.11.
It should be noted that Fig. 9.11 shows the data only in one dimension, i.e., “Marks”.
The difference between the line graph and the histogram can be realized here as well. The
histogram can model the data only in one dimension, whereas the line graph can be used
to model multiple dimensions at the same time. For example, consider the scenario where
we need to show the frequency of different words in three different documents as shown
in Table 9.5.
We have shown a similar example earlier, where the frequency of each word was shown for the overall corpus. Here, we show the frequencies against each document. This means that line graphs can provide more detail as compared to histograms.
Figure 9.12 shows this data in the form of a line graph.
[Fig. 9.11: Line graph of the marks (0–100) of the students John, Smith, Eliza, Marn, Maria, and Robert]
[Fig. 9.12: Line graph of the frequencies of the words “The”, “North”, “State”, “Military”, “Money”, and “Season” in Documents 1–3]
Each line represents a dimension. From Fig. 9.12, it can be seen that Document 3 contains the maximum frequency for each word. The closer the lines are at a certain word (or concept), the more similar the results for that concept are across the documents. It
should be noted that the line graphs can be used to show the progress of different activities
with the passage of time; e.g., the horizontal axis can represent the progress over time.
However, in the case of text mining, the horizontal axis comprises the concepts for which
we need to model different measures.
Line graphs are easy to comprehend; however, it can be seen that with an increase in dimensions, the plot becomes cluttered, and it is difficult to visualize the frequency of each word in a certain document.
Here, we will discuss some common APIs from the Matplotlib Python library that can
be used to draw line graphs for different purposes.
Matplotlib is a common Python library that can be used to visualize the data in data
mining systems. The library provides a number of functions for drawing different types of
graphs to enhance the visualization of the system. Here are some of these functions:
• plot(x, y, label): This function can be used to draw a basic line plot. “x” and “y” here
represent the data points that will be drawn, whereas the “label” shows the label of the
diagram.
• xlabel(text): It specifies the label of the x-axis. The parameter “text” is the label text.
• ylabel(text): It specifies the label of the y-axis.
• title(text): This function is used to set the title of the graph.
• legend(): This function specifies the legend of the graph. The legend helps identify the
different lines in cases where there are multiple dimensions.
• xticks() and yticks(): These functions are used to customize the tick positions and the x- and y-axis labels.
• xlim() and ylim(): These functions are used to set the range of values on the x and y axes. They can be helpful for focusing on specific areas of the graph and for making the graph interactive.
• grid(): This function can be used to add the grid to the plot so that the results can be
easily read and interpreted.
• tight_layout(): This function can be used to adjust the space between subplots and vari-
ous plot elements so that the plot looks neat.
• show(): Finally, this function displays the plot after setting the above parameters.
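The following sketch uses several of these functions to plot, for a few example documents, the frequencies of their ten most frequent words as one line per document; the example documents at the end are placeholders added only to make the code self-contained.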
import re
from collections import Counter
import matplotlib.pyplot as plt

def get_word_frequencies(text):
    # Tokenize and convert to lowercase
    words = re.findall(r'\w+', text.lower())
    return Counter(words)

def plot_word_frequencies(documents, labels):
    # Draw one line per document for its ten most frequent words
    for doc, label in zip(documents, labels):
        top_words, frequencies = zip(*get_word_frequencies(doc).most_common(10))
        plt.plot(top_words, frequencies, marker='o', label=label)
    plt.xlabel('Words')
    plt.ylabel('Frequency')
    plt.title('Top 10 Word Frequencies in Documents')
    plt.legend()
    plt.xticks(rotation=45)
    plt.grid(True)  # Add a grid to the plot
    plt.tight_layout()  # Adjust spacing between plot elements
    plt.show()

# Placeholder documents and labels for illustration
documents = ["the north state military money season in the north",
             "state military season and money in the state"]
labels = ["Document 1", "Document 2"]
plot_word_frequencies(documents, labels)
Circle graphs are another type of visualization tool that can be used to display the data. These graphs display the data in two-dimensional space. They are used where there is a large amount of data and there are relations between the data items, for example, the concepts and the related subconcepts. Such graphs are mostly used to display patterns; however, we can also use them to display categories.
The most common type of circle graph is the pie chart. It is a circle graph that repre-
sents the data in the form of slices of a circle. Normally, it is used to show the summaries.
Each slice or segment represents the portion of the corresponding value in the entire
value set.
Consider the concepts given in Table 9.3. The corresponding circle graph is shown in
Fig. 9.13.
The graph in Fig. 9.13 shows the distribution of the concepts. It can be seen from the
graph that the concept “State” has the maximum proportion in the entire concept space.
The legend at the bottom gives us guidance about each proportion, but the actual value is
obtained from the graph. The slice that covers the maximum area of the graph has the
maximum proportion.
Now, we will discuss the construction of the graph by considering some sample concepts. The following are the steps to construct a circle graph for comparison purposes:
• Step-1: Categorize your data (find all the concepts and their count).
• Step-2: Count the total of all the concepts.
• Step-3: Divide the count of each concept by the total.
• Step-4: Convert to percentage.
• Step-5: Calculate the degree out of 360.
As an example, we will take five concepts in order to keep things simple. The following
is a list of the concepts and their count in a certain document:
The → 10
File → 5
Department→ 5
State→ 10
System→ 10
The total of all the counts is 40. So, now we will divide the count of each individual
concept by the total.
The: 10/40
File: 5/40
Department: 5/40
State: 10/40
System: 10/40
These fractions correspond to 25%, 12.5%, 12.5%, 25%, and 25% of the whole, respectively. Multiplying each fraction by 360 gives the angle of the corresponding slice:
The: (10/40)*360 = 90
File: (5/40)*360 = 45
Department: (5/40)*360 = 45
State: (10/40)*360 = 90
System: (10/40)*360 = 90
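As an illustration, the following minimal sketch recomputes the proportions and slice angles of the worked example in Python.

# Concept counts from the worked example above
counts = {"The": 10, "File": 5, "Department": 5, "State": 10, "System": 10}
total = sum(counts.values())

for concept, count in counts.items():
    fraction = count / total
    # Proportion of the whole and the corresponding slice angle out of 360 degrees
    print(f"{concept}: {fraction:.1%} of the whole, {fraction * 360:.0f} degrees")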
Now, you can start drawing the circle graph. For this purpose, draw a circle of appropriate size, and then draw a radius at the point from which you want to start the slices; for example, you can start from the top. Draw a top radius, and measure the first angle, i.e., 90 degrees, then draw a radius at this angle. Considering it as the base, draw the next angle of 45 degrees. Continue the process until all the slices are drawn. Figure 9.14 shows the diagram.
In order to properly use the pie chart, you must have some whole value that is divided
into sub-portions. Normally, the intention of the circle graph is to compare the proportion
of each value (concept) as compared to the whole value instead of comparing the concepts
with each other. If this is not the case, we should avoid the use of a circle graph. The word
“whole” in the above description represents two values: first, when the whole represents
the total count, for example, the number of admissions divided by the gender type; and
second, when the “whole” represents the total admissions divided by the age bracket or
city in order to decide where to launch the marketing campaign.
The data for the chart can be arranged as shown in Table 9.1. The first column represents
the concept or the variable, and the second column shows the frequency, proportion, or
amount of the concept. Normally, we do not need to show the total in the graph except when
its need is especially mentioned in the requirements. The tools and the available controls can compute the totals themselves at the time of the creation of the chart. As can be seen from
Fig. 9.13, we have included the annotations in the figure. This is a good practice as the circle
charts are drawn over a small area, so other than the explicitly comprehendible proportions,
e.g., “25%”, “50%”, “75%”, etc., it is difficult to comprehend the proportion of each concept
or category from the graph. Furthermore, it is recommended to use the proportions for draw-
ing the circle graph instead of the actual values as this may not provide much information.
So, on the basis of the above factors, it is always good practice to annotate the graph with the
appropriate values for easy comprehension and extraction of the knowledge.
It is also better to show the slices in order of their proportion. The order can be from
ascending to descending or vice versa. In the resulting graph, the slices will then be from
largest to smallest or from smallest to largest. However, if there is an order in the concepts
or categories, then it is better to maintain that order. An important point here is to select
where to start the drawing of the slices. Although we can start from anywhere, normally,
the tools either start from the top or the right. One of the drawbacks of circle graphs is that the visualization power is compromised when there are many values, which results in a graph with many slices. In this case, it can be difficult to view the small slices and their proportions, and we may not have enough distinctive colors to separate the slices from each other.
Although there is no proper recommendation, if there are more than five concepts or
categories, it is recommended to use some other visualization tool. As an alternative, it is
suggested to combine the categories with small values in the “Other” or “Miscellaneous”
category. However, this cannot be done without the loss of some information. Correctly reading circle graphs requires a correct representation of the slices. Adding extra effects, for example, 3D effects, can easily distract the user's attention and does not help much in comparing the slices. So, it is recommended to avoid any type of unnecessary depth. There is another type of distortion that is commonly overlooked in circle graphs. Normally, one of the slices is pulled out to add a fancy effect. This may, on one hand, emphasize the largest slice; on the other hand, the added extra space may affect the
part-whole comparison. One of the common mistakes in using the circle chart to show the
results of the part-whole analysis is to use data that does not represent the part-whole
comparison. Another common misuse is to use the circle graphs in cases when the values
are just the summary of the actual values, for example, the average transaction amount for
different types of transactions. Since some relevant information is missing, for example,
how many times each transaction has occurred, the values may not represent the whole
sum. So, it is better to use a bar chart in all these scenarios.
Circle graphs are also not recommended in the case where you want to compare the
concepts with each other instead of finding the proportion of each concept with respect to
the whole. In these cases, the circle graphs may mislead and may not be appropriate. The
reason behind this is that the sizes of the slices may appear to be equal, especially for the
values that are close to each other or the slices that appear at the end. In all such cases, it
is important to use some other chart type.
As we saw in the case of the histogram, it can be used to compare the ranges only in
one dimension, i.e., their frequencies. The same is the case with the circle graphs. When
there is more than one dimension, it is not recommended to use circle graphs. As an alter-
nate, you can use two or more graphs; however, this may reduce the visualization capabil-
ity and may affect the analysis process as you may need to view two or more graphs at the
same time.
In the context of visualizing text data, a special type of circle graph can be used to relate the concepts rather than merely compare them. These circle graphs gained popularity after their use in NetMap (a popular data mining visualization tool). In these graphs, categories or con-
cepts are mapped on the circumference of the circle. Two different concepts can be mapped
to each other by connecting them through a line within the circle. Figure 9.15 shows the
circle graph.
[Fig. 9.15: Circle graph connecting the concepts “USA”, “North Korea”, and “South Korea” within the context “Korea”]
As you can see, different concepts are connected through a line that passes within the
circle. The thickness of the line shows the scale of the relation. Furthermore, we can use
colors to display the nature of the relationship. Normally, the circle graphs are used to
model the association rules that appear in the answer sets of the queries.
While modeling the concepts, a concept appears at one point on the circumference of the circle, whereas the associated concept appears at another point on the circumference. Several different types of circle graphs are available to show the different natures of the association. For example, gradient colors (from yellow to red) may be used to show the direction of the association, while a single-color line can be used to show a bidirectional association. As mentioned earlier, the thickness of the connecting line can also be used to show a score value of the association. Similarly, the font size and the color values can be used to show a certain type of association.
The circle graphs can also be made interactive by linking them with other visual tools
or highlighting the extra information on mouse click or move event. A simple example
may be to highlight all the associated concepts, once the mouse is clicked on a certain
concept or moved over it. Similarly, we can show the information on mouse click events.
It should be noted that the circle graph has a major drawback: it can get confusing or cluttered once the number of concepts increases. Having too many concepts mapped to the circumference requires a very large circle, and placing such a tool in a small area is not user-friendly. Now, we will close our discussion on circle graphs by
providing the Python code of how to draw a circle graph:
from collections import Counter
import matplotlib.pyplot as plt

text = "Now is the time for all good men to come to the aid of the nation"
words = text.lower().split()
word_counter = Counter(words)
most_common_words = word_counter.most_common(5)
labels, frequencies = zip(*most_common_words)
plt.figure(figsize=(3, 3))
plt.pie(frequencies, labels=labels, autopct='%1.1f%%',
startangle=140)
plt.axis('equal')
plt.title('Most Common Words and Their Frequencies')
plt.show()
In the above code, first we have imported some common libraries. After this, the text from which we will plot the circle graph is defined. Note that we have taken the five most common words; you can provide as many as needed. Then the figure size is defined, and finally, the figure is drawn using the “show” function.
Figure 9.16 shows the output.
9.7 Category Connecting Maps
Circle graphs serve as the foundation for category connecting maps, which are another useful visualization tool used in text mining systems. The simple concept connecting graph shown in Fig. 9.15 only shows the concepts and their associations. Category connecting maps, however, take the visualization capability a step further and model a third dimension, namely the category of the concepts. The circle graph shown in Fig. 9.15 shows the associations within one context only, i.e., "Korea". However, in real-world scenarios, we may need to associate concepts from different categories. For example, we may show the association between different countries and the vaccine types in the context of COVID-19.
Normally, a category connecting map comprises four components. The first and main component is the "Concept" or "Category" that we need to model. Each concept or category is viewed as a node, and it can range from a simple label to a more detailed representation. The second component is called the "Connection", which shows the relationship or association between the categories or concepts. These connections can be directed or undirected depending on the requirements. The third component is the "Circular Layout", a circular arrangement used to position the nodes and connections. The circular layout allows the nodes to be properly arranged and positioned. Sometimes a proper subcircle is shown for a concept, but most of the time a simple concept name is written at a specific position. Finally, the fourth component is called visual encoding. These encodings are the visual cues that are used to make the information explicit so that visual comprehension can be enhanced. Such encodings include line thickness, colors, and arrowheads.
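As a rough sketch of these four components (again our own example using networkx, with hypothetical categories, concepts, and scores), the following code colors the nodes by category, arranges them in a circular layout, and encodes the association score as line thickness:
import networkx as nx
import matplotlib.pyplot as plt
# Hypothetical concepts from two categories, e.g., countries and vaccine types
categories = {"UK": "Country", "USA": "Country",
              "Vaccine A": "Vaccine", "Vaccine B": "Vaccine"}
associations = [("UK", "Vaccine A", 4), ("UK", "Vaccine B", 2),
                ("USA", "Vaccine A", 3), ("USA", "Vaccine B", 5)]
G = nx.Graph()
for a, b, score in associations:
    G.add_edge(a, b, weight=score)
pos = nx.circular_layout(G)                              # the circular layout component
node_colors = ["lightblue" if categories[n] == "Country" else "lightgreen"
               for n in G.nodes()]                        # visual encoding: color per category
edge_widths = [G[u][v]["weight"] for u, v in G.edges()]   # visual encoding: line thickness
nx.draw_networkx_nodes(G, pos, node_color=node_colors)
nx.draw_networkx_labels(G, pos)
nx.draw_networkx_edges(G, pos, width=edge_widths)
plt.axis("off")
plt.show()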
Note that if the category connecting maps are generated from pre-processed data, it becomes easy for the text mining system to render the display. However, the associations can also be generated at runtime, which may require additional processing and thus reduce performance.
Furthermore, another issue with category connecting maps is that they can easily become too complex to visualize. To help the user visualize the information, all the concepts within the same category are rendered using the same text color and font type, which are clearly different from those of the concepts in other categories. The high-level categories are mentioned outside the circle with some special formatting, for example, by underlining.
It is important to note that text mining applications may have support to show multiple
circle graphs at the same time. This may increase the capacity of the text mining system to
compare the results of different queries from different perspectives. For example, we may
get the result of the same query using different parameter values at the same time and can
get the information at different levels of abstraction. Another example may be to get the
associations of the same categories within different contexts.
We can also split the same circle graph into different subgraphs in order to increase the visualization power. As mentioned earlier, with a greater number of concepts and categories, the graphs become complex and difficult to comprehend. So we can split the
graphs into different subgraphs to enhance the visual capacity. For example, by clicking
on a certain concept, all the associations of this concept with all other concepts can be
shown in a different subgraph to separately see and comprehend this concept.
9.8 Self-Organizing Maps (SOMs)
Self-organizing maps (SOMs), also known as Kohonen maps or Kohonen networks, use the power of artificial intelligence, and especially neural networks, to identify patterns and find relationships in data. This tool was developed by Teuvo Kohonen in the 1980s. We can say that SOMs are a subset of artificial neural networks that use unsupervised machine learning for data visualization tasks. SOMs are used in a large number of
applications, e.g., biology, finance, text mining, etc. One of the important features of the
SOMs that distinguish them from the other visualization tools that we have discussed so
far is that they can convert high-dimensional data to low-dimensional space before visual-
izing the results. This is just like any dimensionality reduction algorithm.
As far as the structure of SOMs is concerned, a SOM comprises a grid of nodes. These nodes are connected with each other. Each node is associated with a weight vector that has the same dimensionality as the input data. The grid of nodes works as a canvas, and all the data is presented on this canvas. As discussed earlier, SOM uses unsupervised learning, so we do not need to provide labelled data; the SOM captures all such information from the data itself. This feature makes SOMs especially useful for exploratory data analysis and for discovering hidden patterns.
The overall structure of SOM is defined by a two-dimensional space called a grid. The
grid is arranged in rows and columns. Each node is associated with a weight vector that
initially represents a point in the input space. With the passage of time, these weights are updated to reflect the distribution of the input data. The grid structure represents the underlying associations in the data: nodes that are close to each other on the grid represent input data points that are similar to each other.
The SOM is an ideal visualization tool for presenting complex data distributions. As
mentioned earlier, the dimensional space is reduced, still preserving the relations. The
architecture’s topological preservation property makes SOMs particularly useful for visu-
alizing complex data distributions. High-dimensional data can be challenging to interpret
directly, but by mapping it onto a lower-dimensional grid, SOMs provide a more intuitive
representation. Similar data points are positioned close to each other on the grid, which
aids in identifying clusters, trends, and outliers within the data. As far as training of the SOM is concerned, it is an important process that must be completed before the SOM can be used. In the training process, we dynamically adjust the weights with respect to the input data: the input data is presented, and the weights are adjusted in order to find the winning node. For each input, the SOM calculates the Euclidean distance between the input vector and the weight vectors of all nodes on the grid. The intention is to find the node with the closest weight vector, also called the best-matching unit (BMU).
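To make this step concrete, the following minimal NumPy sketch (our own toy example, not taken from any particular SOM library) finds the BMU for a single input vector:
import numpy as np
# A 10 x 10 grid of nodes, each with a 3-dimensional weight vector (random for illustration)
weights = np.random.rand(10, 10, 3)
x = np.array([0.2, 0.7, 0.1])        # one input vector
# Euclidean distance between the input and every node's weight vector
distances = np.linalg.norm(weights - x, axis=2)
# The best-matching unit (BMU) is the node with the smallest distance
bmu = np.unravel_index(np.argmin(distances), distances.shape)
print("BMU grid position:", bmu)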
Just like other machine learning algorithms, the training process in the case of SOMs is
performed in iterations. These iterations are called epochs. In each iteration, we update the
BMU and the weights of its neighboring nodes. The learning rate that controls the magni-
tude of weight adjustments decreases with the passage of time, which ensures conver-
gence. We also use a neighborhood function that defines the extent to which the neighboring
nodes are affected by the update in BMU. The value of the neighborhood is initially high
and decreases with the passage of time to ensure the map's refinement. The learning rate and the neighborhood function are important parts of the SOM training process. The learning rate specifies the size of the weight adjustments during training, which enables the map to adapt to the data distribution. With the passage of time, the learning rate decreases, allowing the SOM to converge more accurately toward the underlying data structure. These changes ensure that the SOM identifies both the global and the local patterns existing within the data.
The neighborhood function is used to specify how the updates occur from the BMU to
its neighbors. Initially, the large value of the neighborhood is used to allow widespread
exploration of the input data. With the passage of time, the neighborhood value decreases.
This leads to a more fine-grained refinement of the SOM’s representation. Both the learn-
ing rate and neighborhood functions ensure that the map transforms from an exploration
phase to a convergence phase, where the final representation becomes more stable. One of
the important properties of the SOM is called topology preservation. This property ensures
that the relationships that exist between data points in the input space are also present in
the SOM’s map space (grid). This means that the points that are close to each other in the
original data (or have some relationship with each other) are also presented by the nodes
that are close to each other on the SOM grid. This ability of SOM to preserve the underly-
ing relationships in data is important for understanding the underlying structure of the data
and is also a reason for the success of SOMs as a visualization tool.
Since the SOM reduces high-dimensional data to a low-dimensional space, for example, two or three dimensions, the analysis becomes an easier task. It is always difficult to manage a high-dimensional space. In fact, the problem leads to a phenomenon called the
curse of dimensionality. All the visualization tools discussed so far have this problem, i.e.,
with an increase in data size, the presentation becomes so complex and messy that the
visualization capability of these tools is seriously affected. Consider a category connecting
map, for example, with a large number of association lines between the concepts. It is
almost impossible to identify which line connects which concepts. This is not the problem
with SOM. With reduced data space (still preserving the original relationships), SOM
helps us explore complex datasets in a simple and convenient way. Patterns, trends, and
clusters that are difficult to identify and analyze in the original data space can easily be
identified on the SOM, providing valuable insights that can be used in further analysis. So,
we can get a quicker understanding of the complex data, which alternatively helps in
decision-making.
During the dimensionality reduction with SOM, we train the SOM with the training
data, and once the SOM is trained, we can use it as a transformation function by giving the
unknown data. The SOM then maps the unknown data onto its map space. The resulting
coordinates on the maps represent the original data with the help of a smaller number of
dimensions. This not only helps in visualization but also supports other tasks such as clas-
sification, regression, etc. As far as the applications of SOM are concerned, clustering is
an important application of SOM. In the context of SOMs, clusters comprise the nodes
that have similar weight vectors. With the passage of time, as the SOM learns from the
training data, it organizes the nodes into clusters that actually reflect underlying patterns
in the data. This means that you can use the SOM to explore the internal data patterns
without having any prior knowledge of the categories or classes in the data.
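As a brief sketch of this idea, and assuming the third-party minisom package is available (the data and grid size below are arbitrary placeholders), a trained SOM can map each data point to the grid coordinates of its BMU, which can then serve as a two-dimensional representation or a cluster label:
import numpy as np
from minisom import MiniSom   # third-party package, e.g., installed with "pip install minisom"
data = np.random.rand(200, 10)              # 200 samples with 10 dimensions
som = MiniSom(8, 8, input_len=10, sigma=1.0, learning_rate=0.5, random_seed=42)
som.train_random(data, num_iteration=1000)  # unsupervised training
# Map each high-dimensional sample to the grid coordinates of its BMU
coords = np.array([som.winner(x) for x in data])
print(coords[:5])   # 2-D coordinates that can be plotted or used as cluster labels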
As far as the applications of SOM are concerned, it has a wide range of applications.
For example, in biology, SOMs can be used to analyze gene expression patterns that help
in identifying the functional relationships between the genes. In finance, SOMs can be
used to perform market analysis by uncovering hidden patterns in financial time series
data, which can be used to make investment decisions. SOMs can also be used in image
processing where these can be used to perform image compression and feature extraction.
Similarly, SOMs can be used in speech recognition to recognize spoken words.
The following is a simple Python implementation of a SOM, including a basic training loop:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
data = np.random.rand(100, 2)
input_dim = data.shape[1]
# Grid size of the SOM
output_dim = (10, 10)
learning_rate = 0.1
epochs = 50
# Initialize the weight vector of every node on the grid
weights = np.random.rand(output_dim[0], output_dim[1], input_dim)
# Grid coordinates of every node (used by the neighborhood function)
grid_x, grid_y = np.meshgrid(np.arange(output_dim[0]), np.arange(output_dim[1]), indexing='ij')
for epoch in range(epochs):
    # Learning rate and neighborhood radius decay as training progresses
    lr = learning_rate * (1 - epoch / epochs)
    radius = max(output_dim) / 2 * (1 - epoch / epochs) + 1e-3
    for x in data:
        # Best-matching unit (BMU): the node whose weights are closest to the input
        bmu = np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=2)), output_dim)
        # Gaussian neighborhood centered on the BMU
        influence = np.exp(-((grid_x - bmu[0]) ** 2 + (grid_y - bmu[1]) ** 2) / (2 * radius ** 2))
        # Move each node's weights toward the input, scaled by its influence
        weights += lr * influence[..., np.newaxis] * (x - weights)
plt.imshow(weights.mean(axis=2), cmap='viridis', interpolation='none')
plt.title("Trained SOM")
plt.colorbar()
plt.show()
The code visualizes the trained SOM in two-dimensional space. Here, each cell in the grid represents a node's weight vector, and the color intensity indicates the weight value of that node averaged across the input dimensions.
Figure 9.17 shows the output of the code.
Now, we will discuss one of the important applications of SOM, i.e., Web self-
organizing maps (WebSOM). WebSOM is a type of SOM that is used to analyze and
visualize large datasets on the Web. It is a hybrid approach that combines Web technolo-
gies with the SOM. It lets the user interact with the SOM using a Web interface. WebSOMs
are used to deal with complex and high-dimensional data in order to perform efficient data
exploration and understanding.
WebSOMs provide an interactive Web-based interface. We can use this interface to
interact with the algorithm and get the dynamic analysis results. Various browsing func-
tions such as information zooming, panning, labeling, etc. are also available to facilitate
visualization and analysis. Although traditional SOM faces scalability issues when dealing
with large datasets, WebSOM can address this challenge by using advanced Web technolo-
gies and optimizations. Users can interact with the WebSOM in real time by providing and
adjusting the parameters. Users can select a subset of data and view the results of their
queries immediately. So, we can perform a quick analysis and get immediate insights.
WebSOMs are designed to provide effective visualizations to explore the complex patterns
from the data. Similar to traditional SOM, the structure layout of WebSOM represents the
patterns and trends that exist in the original data. Just like SOM, WebSOM also helps users
perform tasks related to further analysis like clustering, outlier detection, and dimension-
ality reduction through the Web interface.
WebSOMs can be integrated with various data sources. This allows users to load the data from databases, APIs, or local files directly into the interface, so they only need to provide the data without having to preprocess or format it themselves.
Now, we will discuss some of the applications of WebSOM. WebSOMs can be used to
perform exploratory data analysis. These are especially valuable for exploring complex
datasets with high dimensionality. WebSOMs can help reveal hidden patterns and struc-
tures in the data that might not be apparent through traditional analysis methods. WebSOMs
can also be used in the domain of bioinformatics. In genomics and proteomics research,
we can use WebSOMs to identify the gene clusters or proteins having similar expression
profiles. With the help of WebSOMs, we can also perform analysis of market data. This allows organizations and analysts to identify current trends and relationships in the business data.
Companies can use WebSOMs to analyze customer behavior. By analyzing customer
data, business organizations can find meaningful groups for targeted marketing efforts.
WebSOMs can also be used in image processing. We can perform image clustering and
categorization, where visual similarities between images can be shown by the map layout.
One of the important features of WebSOMs is that they can support collaborative data
exploration. This means that multiple users can access the same map at the same time and
can perform the analysis in collaboration.
As far as the implementation of WebSOMs is concerned, it involves combining the
principles of SOMs with Web development technologies such as HTML, CSS, and
JavaScript, and frameworks like D3.js or WebGL for visualization (front-end). The back
end can use server-side technologies to handle data processing and interactions with
databases.
Now, let us consider the advantages and disadvantages of WebSOM. As discussed above, its main advantages include an interactive Web-based interface, improved scalability through modern Web technologies, real-time parameter adjustment with immediate feedback, integration with various data sources, and support for collaborative data exploration. Apart from these advantages, there are challenges associated with WebSOM as well, most notably the added implementation effort of combining SOM algorithms with front-end and back-end Web technologies. So, we can say that using WebSOM has its own advantages and disadvantages. The decision to use it must be based on the specific needs of the text mining system.
There are many other types of SOM. Table 9.6 shows the name and a brief description
of each type.
Hyperbolic trees are one of the most common techniques for visualizing hierarchies and were developed at the Xerox Palo Alto Research Center (PARC). They represent a significant advancement in the field of data visualization. The basic intention behind developing this technique was to effectively present complex hierarchies and support their navigation while avoiding the drawbacks of traditional tree layouts, which become ineffective and difficult to navigate as the number of nodes increases. For this purpose, the technique uses the properties of hyperbolic geometry, which provides a space-efficient, interactive, and visually appealing way of visualizing the hierarchical relationships present in the data.
Hyperbolic geometry is a non-Euclidean geometry whose properties differ from those of flat (Euclidean) space. In hyperbolic space, the distance between two points increases exponentially as we move away from a central point, so there is exponentially more room to place nodes than in Euclidean space. When such a layout is projected onto a circular display, the region in focus receives most of the visual space while distant nodes are compressed toward the boundary, which ultimately enhances the visual capability of the tree.
Unlike traditional tree layouts, hyperbolic trees use the visual space more effectively.
The nodes that are close to the center take more space, and the nodes that are away from
the center take less space. This enhances the visibility of the nodes.
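The following sketch is our own simplified approximation of this effect (it is not a true Poincaré-disk projection; the radius formula merely mimics how hyperbolic layouts squeeze deeper nodes toward the boundary circle), applied to a small toy hierarchy:
import math
import matplotlib.pyplot as plt

# A toy hierarchy chosen purely for illustration
tree = {"Asia": {"China": {}, "India": {}, "Pakistan": {}},
        "Europe": {"France": {}, "Germany": {}, "UK": {}}}

def draw(children, depth=1, a0=0.0, a1=2 * math.pi, parent_xy=(0.0, 0.0)):
    # Each level sits on a circle whose radius approaches 1 as depth grows,
    # so deeper nodes are compressed toward the boundary of the display circle.
    step = (a1 - a0) / max(len(children), 1)
    for i, (name, grandchildren) in enumerate(children.items()):
        angle = a0 + (i + 0.5) * step
        r = 1 - 2.0 ** (-depth)
        x, y = r * math.cos(angle), r * math.sin(angle)
        plt.plot([parent_xy[0], x], [parent_xy[1], y], color="gray")
        plt.plot(x, y, "o", color="steelblue")
        plt.annotate(name, (x, y), fontsize=8)
        draw(grandchildren, depth + 1, a0 + i * step, a0 + (i + 1) * step, (x, y))

plt.figure(figsize=(5, 5))
plt.plot(0, 0, "o", color="steelblue")
plt.annotate("World", (0, 0), fontsize=8)
draw(tree)
plt.gca().set_aspect("equal")
plt.axis("off")
plt.show()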
As far as the navigation of the hyperbolic trees is concerned, the trees allow the users
to zoom in or out to see the information at different levels of abstraction. The navigation
makes it easy for the users to explore different levels of the hierarchy. Another important
feature of the hyperbolic trees is that they provide a balance between the “focus” and the
“context”. Users can focus on a specific node while the parent and the child nodes also
remain visible. This allows users to view the specific information within the context.
Furthermore, the circular layout maximizes the use of space and provides an aesthetically
pleasing arrangement. Figure 9.18 shows the circular layout of the hyperbolic trees.
Just like other visualization tools, we can add visual effects to the hyperbolic trees to
enhance the visualization capacity. For example, we can change the color of the nodes at
a certain level of the hierarchy. Many applications can also add 3D visual effects to enhance
the visualization capability further.
Now, we will discuss some advantages and disadvantages of the hyperbolic trees. The
following are some of the advantages of hyperbolic trees:
• Hyperbolic trees make efficient use of the space. As we move outward, the nodes
become smaller, which allows more nodes to be accommodated without sacrificing
readability.
• Users can easily visualize the hierarchies by moving toward and away from the center
of the circle. This navigation makes analysis easy.
• As the hyperbolic trees reduce the size of the nodes that are at a distance, this reduces
the overlaps and maintains visual clarity. This ultimately enhances the visualization of
the hierarchical structure as compared to the conventional tree structure layouts.
• Hyperbolic trees allow the user to view specific information and context information at
the same time. This allows the user to see the information of interest within the avail-
able context, which ultimately facilitates information extraction.
• Since hyperbolic trees make use of hyperbolic geometry, they can conveniently display
large and complex data hierarchies.
The following are some of the disadvantages of hyperbolic trees:
• Interacting with hyperbolic trees may require the user to have prior knowledge of the navigation mechanics in order to effectively explore the hierarchy.
9.10 Summary
The chapter starts by discussing the importance of text visualization in the context of text
mining systems. It highlights the importance of visualizing textual information to extract
meaningful insights, patterns, and relationships. After this, different types of advanced
visualization tools and techniques are discussed. Starting with concept graphs, the details
are provided about how the concept graphs are drawn and how they work. Three important
types of concept graphs are also discussed. Then, histograms are discussed in detail, espe-
cially from the point of view of their use in text mining. Line graphs and circle graphs are
discussed next. Finally, details about the category connecting maps and self-organizing
maps are provided. Efforts are made to explain the concepts in detail and in simple words
for easy understanding. Examples are also provided. We have also provided the different
Python codes for these tools so that the user can understand the logic of creating these
tools within their text mining systems.
9.11 Exercises
Q1: List at least five conventional data visualization tools and the type of data each tool
may represent.
Q2: List at least four functions that can be performed on concept graphs.
Q3: Consider the following data, and draw the histogram. Also, mention which concepts will be displayed if minimum and maximum limits are set to 40 and 80, respectively.
Concept Frequency
Text 76
Visualization 98
Histogram 50
Applied 69
System 74
Display 36
Character 20
Q4: Consider the following data, and draw the line graph to show the score of each concept in each document.
Q5: Consider the following associations, and draw the category connecting map. Note
that the concepts “Text” and “Mining” have strong associations, and this should be
reflected in the category connecting map.
• {UK, USA}
• {Text, Mining}
• {Visual, Display}
• {North Korea, South Korea}
• {Analysis, Extraction}
Q6: Consider the following hierarchies, and draw the hyperbolic tree:
• Asia
–– China
–– India
–– Pakistan
• Africa
–– Algeria
–– Libya
• North America
• South America
• Antarctica
• Europe
–– France
–– Germany
–– UK
Q7: Write down the five steps to construct a circle graph (same as shown in Fig. 9.13).
Explain each step by considering the following data:
UK → 15
USA → 15
Germany → 18
France → 12
China → 10
Note that "UK → 15" shows that the frequency of the concept "UK" is 15.
Q8: Consider the following circle graph, which shows the associations among the concepts USA, UK, India, Japan, France, and Germany (the association scores shown in the figure are 32, 30, 19, 15, and 10). Which associations will be displayed when the threshold is set to 20% at minimum and 30% at maximum?
Q9: Why are self-organizing maps (SOMs) difficult to implement as compared to other
visualization tools?
Q10: In the context of the clarity of visualization, how do hyperbolic trees manage to effi-
ciently use the visual space as compared to other hierarchical structures?
Recommended Reading
• Text Mining with R: A Tidy Approach by Julia Silge and David Robinson
Publisher: O'Reilly Media
Publication Year: 2017
The book’s approach demonstrates how treating text as data frames allows you to
manipulate, condense, and present text attributes. In addition, the integration of natural
language processing (NLP) into efficient workflows has been demonstrated. Providing
sample code for practice and data research helps extract tangible insights from sources
such as literature, news, and social media.
10 Text Mining Through Deep Learning
Deep learning has gained great importance for processing textual data, especially in text clustering and classification. In this chapter, we will explain how deep learning can be used in the context of text mining, with examples and a complete description of the accompanying Python source code.
In the last few years, deep learning has gained a lot of importance in almost all domains of life. It is a subfield of machine learning in which learning takes place through successive layers of representations. In this section, we will first discuss deep learning and some of its related concepts. After this, we will discuss how we can apply deep learning to text mining.
A typical deep learning model comprises an artificial neural network with n layers. The term "deep" in deep learning refers to the number of layers the input passes through in order to train the model; it does not represent any added intelligence by itself, although a larger number of layers can help the model achieve higher accuracy. A typical deep learning model has three types of layers, i.e., the input layer, hidden layers, and the output layer. The input is provided at the input layer, after which it passes through the hidden layers, and the output is produced at the output layer. Modern deep learning models involve tens and even hundreds of layers. The weights are adjusted to reduce the error as much as possible; this adjustment is what we call the learning of the model. The adjustment of the weights happens automatically through different mechanisms. One such mechanism is called backpropagation. In backpropagation, we calculate the difference between the expected output and the actual output and then propagate this error backward through the network to adjust the weights. The process continues until further changes no longer decrease the error.
In terms of the structure, only two layers are exposed to the external world, i.e., the
input layer and the output layer. The hidden layers form the internal part of the deep learn-
ing model. These layers comprise artificial neurons. The final layer, called the output layer,
shows the output. In the case of classification, the output may be the identification of an
object either as a dog or a cat. In the context of text mining, the output may either be
“Spam” or “Not spam” in case of email spam detection.
The sequence of steps that an artificial neuron in a deep learning model follows during training is as follows:
1. Inputs (X1, X2, …, Xn): As the name implies, these are the values that are given to a
neuron. The values may be from the user data or from the previous layer. Each input is
assigned a weight that shows the importance of the input. The greater the weight, the
higher the importance.
2. Weights (W1, W2, …, Wn): Each input is multiplied by a weight. During the training
process, the weights keep on changing. This is called training of the network. The
weights are adjusted to enhance the performance on a specific task. Weights determine
how much influence each input has on the neuron’s output.
3. Summation Function (Σ): Each input is multiplied by its corresponding weight, and
then all the results are summed up. This is called weighted sum, and its equation is
given below:
weighted_sum = (X1 × W1) + (X2 × W2) + … + (Xn × Wn)
4. Activation Function (f): The weighted sum is then passed through an activation func-
tion. The activation function determines whether the neuron will fire or not. Different
activation functions can be used. The most common activation functions include the
Sigmoid, Hyperbolic Tangent (Tanh), Rectified Linear Unit (ReLU), and others. Here
is the definition of some of the activation functions:
(a) Sigmoid: The output of the sigmoid function lies between 0 and 1. Normally, this function is used for binary classification. If the output value is below 0.5, the neuron is considered not fired, and for values of 0.5 and above, the neuron is fired.
(b) Tanh: Produces an output between −1 and 1, which gives the outputs a symmetric range around zero.
(c) ReLU: If the input is positive, the input itself is produced as the output; otherwise, the output is zero. It is commonly used because of its simplicity and effectiveness.
5. Output (Y): The output of the activation function becomes the output of the artifi-
cial neuron:
Y = f(weighted_sum)
So, we can say that an artificial neuron in an artificial neural network takes inputs, applies weights to these inputs, computes a weighted sum, passes this sum through an activation function, and produces an output. These neurons are organized into layers, and the layers together form the neural network.
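To make these steps concrete, here is a minimal sketch of a single artificial neuron in plain Python with NumPy (the input values and weights are arbitrary numbers chosen only for illustration):
import numpy as np

def sigmoid(z):
    # Squashes the weighted sum into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Inputs X1..Xn and their weights W1..Wn (arbitrary illustrative values)
x = np.array([0.5, 0.3, 0.9])
w = np.array([0.4, -0.2, 0.7])
bias = 0.1

weighted_sum = np.dot(x, w) + bias   # summation function
y = sigmoid(weighted_sum)            # activation function
print(weighted_sum, y)               # the neuron is considered fired if y >= 0.5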
It should be noted that there is no hard and fast rule for the exact number of layers as
well as for the number of neurons in a specific layer. We can say that selecting the number
of neurons in a layer is one of the common problems of a deep neural network. There are
different heuristics that can be followed. One heuristic is to select the number of neurons on the basis of the patterns that you want to identify: the higher the number of neurons, the more patterns the network can capture. However, there may be variations in the patterns that are captured, even for the same data.
Another method may be to select the number of neurons on the basis of the decision
classes. In this method, first, we draw the decision boundaries between the decision classes
on the basis of the data. We then represent the boundaries in the form of lines as shown in
Fig. 10.2.
Fig. 10.2 Decision boundary and the lines drawn on this boundary
The number of neurons in the first hidden layer is equal to the number of separate lines drawn on the boundary. In the case of the diagram shown in Fig. 10.2, there will be four neurons in the first hidden layer. As there should be only one output, we will have to merge
the outputs of these neurons. Here, the decision depends on the designer. For example, we
can have a hidden layer with two neurons. The first neuron will merge the output of the
first two neurons, and the second neuron will merge the output of the remaining two neu-
rons. There may be many other heuristics as well.
However, adding more neurons has its own issues. It increases the complexity of the
network. Adding more neurons means a lot of processing, which requires a large number
of resources for the implementation of the deep learning models. Also, adding more neu-
rons introduces problems like overfitting, a scenario where a model performs well for the
training data but may not be accurate for the test data. To overcome the issue, the selection
of the number of neurons must be done carefully, so that the model can produce efficient
and effective results as per the requirements. As discussed above, a deep learning model
comprises an artificial neural network; its building blocks and applications are discussed in the rest of this chapter.
As far as the applications of deep learning are concerned in text mining, deep learning
has been widely used in text mining to perform different tasks. Various natural language
processing tasks can be performed using deep learning algorithms. Here, we will discuss
some applications of deep learning models in text mining:
• Sentiment analysis: Deep learning models can be used to perform sentiment analysis.
In this regard, the recurrent neural networks (RNNs) and transformers have been widely
used. These models can be used to determine the sentiment or emotional tone of a text,
such as whether a review is positive, negative, or neutral.
• Named entity recognition (NER): NER is used to identify and classify the entities, e.g.,
names of people, organizations, locations, etc., in text. Different deep learning models,
e.g., long short-term memory with conditional random fields (LSTM-CRF) or
transformer-based models, can be effectively used in this regard.
10.2 Deep Learning Models for Processing Text
In this section, we will discuss different models that are used for text processing. The processing may include classification, translation, summarization, etc. We will discuss convolutional neural networks and recurrent neural networks, including long short-term memory (LSTM) networks. We will also discuss transformers.
The feed-forward neural network comprises multiple layers in which the information flows only in the forward direction; there is no backward loop in the network. The information enters the network at the input layer, travels through the hidden layers, and exits from the output layer. There is no feedback from the output layer to the input layer.
The feed-forward neural network uses a function of the following form to approximate the mapping from inputs to outputs:
y = f(x)
Here, "x" is the input, and "y" is the category the input is assigned to. Once the
output is achieved, the new input is taken. This is the basic working of the feed-forward
neural network. Figure 10.3 shows a simple feed-forward neural network.
Now, let’s discuss the training of the feed-forward neural network. We have already
discussed the input, output, and the hidden layers. Let’s discuss how the weights are
adjusted during the training. The process is called backpropagation. We have already
mentioned its steps in Sect. 10.1. Here we will discuss both forward propagation and
backward propagation in detail. However, before discussing these concepts, let’s discuss
the loss function and the chain rule. Both of these are used in training of the neural
network.
In a feed-forward neural network, commonly known as a multi-layer perceptron (MLP),
functions play a critical role in processing information as it travels through the network.
These functions are applied at each layer of the network and define how the input data is
turned into an output prediction. Here are the major functions used in a feed-forward neu-
ral network:
• Input layer: The input layer simply feeds the input data through without any changes.
It acts as the initial point of input for the data.
• Hidden layers: The hidden layers, which can be one or more layers sandwiched
between the input and output layers, apply an activation function to the weighted sum
of the inputs from the preceding layer. The common activation functions used in hidden
layers include:
–– Sigmoid: The sigmoid function (also known as the logistic function) maps the input to a range between 0 and 1. It was widely used in earlier neural network designs.
–– Hyperbolic Tangent (tanh): The tanh function maps the input to a range between −1 and 1, which can help offset some of the concerns associated with the sigmoid function's vanishing gradients.
–– Rectified Linear Unit (ReLU): The ReLU function is the most commonly applied activation function in current neural networks. It outputs the input if it is positive and 0 otherwise. This function helps alleviate the vanishing gradient problem and speeds up training.
–– Leaky ReLU: Leaky ReLU is a version of the ReLU function that enables a modest,
nonzero gradient for negative inputs, addressing the “dying ReLU” problem.
–– Exponential Linear Unit (ELU): ELU is another form of the ReLU function that
has been created to reduce the vanishing gradient problem even more.
–– Parametric ReLU (PReLU): PReLU is an extension of Leaky ReLU where the
slope of the negative portion is learnt during training.
• Output layer: The output layer often utilizes an activation function that is particular to
the issue being solved. Common activation functions for the output layer include:
–– Linear: Used for regression situations, when the network returns a continuous result
–– Sigmoid: Used for binary classification tasks, where the network produces a prob-
ability value between 0 and 1
–– Softmax: Used for multi-class classification issues, where the network produces
probabilities for each class, and the class with the highest probability is chosen as
the prediction
Each layer in the feed-forward neural network applies these functions to its inputs and
transfers the transformed results to the next layer. The weights and biases in the network
are learned using a process called backpropagation and gradient descent, allowing the
network to adapt and generate accurate predictions based on the training data. In essence,
functions in a feed-forward neural network are responsible for adding non-linearity into
the model, enabling it to learn complicated correlations within the input, making them an
essential part of deep learning models.
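As a small illustration of these functions working together (a toy sketch with arbitrary layer sizes, not tied to any particular dataset), the following code performs one forward pass through a network with a ReLU hidden layer and a softmax output layer:
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(4)                           # input vector with 4 features
W1, b1 = rng.random((5, 4)), np.zeros(5)    # hidden layer: 5 neurons
W2, b2 = rng.random((3, 5)), np.zeros(3)    # output layer: 3 classes

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())                 # subtract the max for numerical stability
    return e / e.sum()

hidden = relu(W1 @ x + b1)                  # hidden layer: weighted sum + ReLU
probs = softmax(W2 @ hidden + b2)           # output layer: class probabilities
print(probs, probs.sum())                   # the probabilities sum to 1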
Convolutional neural networks (CNNs) are also called ConvNets. CNNs are also made of neurons having weighted inputs; however, in the case of CNNs, the input typically comprises images or image segments. The network learns in the same way as a conventional neural network.
In the case of convolutional neural networks, “convolution” refers to a mathematical
operation that can be used to perform processing on the input data and extract features
from it. Convolution is a fundamental building block in CNN to learn and extract features
from the data. CNNs are a type of deep learning architecture primarily used for tasks
related to computer vision, such as image recognition and object detection.
CNNs have a smaller number of neurons in their layers. In contrast to regular neural
networks, the layers of CNNs have neurons arranged in a few dimensions: channels, width,
height, and number of filters in the simplest 2D case. A convolutional neural network comprises convolutional layers, pooling layers, and fully connected layers, which are used for feature extraction, dimensionality reduction, and classification, respectively. Together, these layers form a complete CNN. In a convolutional neural network (CNN), the pivotal component is the
convolutional layer. Unlike earlier layers we’ve studied, where each neuron was linked to
every pixel in the input picture, the neurons in the first convolutional layer are only con-
nected to certain pixels inside their receptive fields (as shown in Fig. 10.4). Similarly, in
the second convolutional layer, each neuron is only linked to neurons positioned inside a
tiny rectangular area of the first layer.
This common architecture allows the network to focus on simple, fundamental proper-
ties in the initial hidden layer, which are subsequently integrated into larger more compli-
cated features in succeeding hidden layers. This hierarchical structure matches the way
real-world pictures are formed, which is one of the reasons CNNs excel in image identifi-
cation tasks.
Fig. 10.5 Connections between the input layer and the first convolutional layer, with 3 × 3 receptive fields (fh = 3, fw = 3) and zero padding
To be more exact, take a neuron positioned at row “i” and column “j” in a particular
layer. It’s coupled to the outputs of neurons in the preceding layer placed within rows “i”
to “i + fh – 1” and columns “j” to “j + fw – 1”, where “fh” and “fw” denote the height and
width of the receptive field (as shown in Fig. 10.5). To guarantee that a layer retains the
same height and breadth as the preceding layer, it’s typical to add zeros around the input,
as indicated in the diagram. This approach is referred to as zero padding.
Furthermore, it is possible to connect a large input layer to a much smaller layer by spacing out the receptive fields, as shown in Fig. 10.6. This strategy considerably
decreases the computational complexity of the model. The shift from one receptive area to
the next is dubbed the “stride”. In the figure, a 5 × 7 input layer (with zero padding) is
connected to a 3 × 4 layer, utilizing 3 × 3 receptive fields and a stride of 2 (notice that the
stride can vary in various directions). A neuron placed at row “i” and column “j” in the
higher layer is linked to the outputs of neurons in the preceding layer lying within rows "i × sh" to "i × sh + fh – 1" and columns "j × sw" to "j × sw + fw – 1", where "sh" and "sw" denote the vertical and horizontal strides.
Although CNNs were originally designed for images, they have also been applied to several text mining tasks, including the following:
• Text classification: CNNs can be helpful for text classification, which includes assign-
ing categories or labels to text data. These categories can be numerous, ranging from
sentiment analysis (determining if a text displays positive or negative attitude) to topic
classification (classifying news items into distinct subjects).
• Sentiment analysis: When it comes to sentiment analysis, CNNs may be applied to
assess the emotions represented in text. By utilizing varying filter sizes, CNNs may
discover distinct patterns or combinations of words that signal positive or negative
sentiment, aiding in automated sentiment categorization.
• Document categorization: For tasks such as classifying whole documents into prede-
termined groups or subjects, CNNs may be utilized. They may filter through the mate-
rial to detect significant keywords or sequences of words, aiding in document
classification.
• Named entity recognition (NER): In NER tasks, CNNs may be used to recognize
named entities inside text, such as names of persons, organizations, or locations. They
function by recognizing specific patterns or structures that typically signal the existence
of such entities.
• Text generation: While recurrent neural networks (RNNs) and transformer models are
generally linked with text generation, CNNs can nonetheless play a role. They may be
integrated into the text generation process to aid in finding and creating relevant
material.
• Text summarization: CNNs can aid in summarizing long text documents by detecting
and extracting important lines or phrases. This makes the process of writing succinct
summaries more efficient and precise.
However, it’s vital to understand that CNNs aren’t a one-size-fits-all answer for any
NLP activity. For jobs involving sophisticated language structures, substantial relation-
ships between words, or in-depth grasp of semantics and syntax, alternative designs like
RNNs and transformer models tend to be more successful. Additionally, picking the cor-
rect word embeddings or contextual embeddings (such as Word2Vec, GloVe, or embed-
dings from transformer models) plays a crucial part in constructing efficient text mining
models. The choice of architecture and embeddings should coincide with the unique
requirements and peculiarities of the text mining task at hand.
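To make this concrete, here is a minimal, hypothetical Keras sketch of a CNN for text classification; the tiny random dataset, vocabulary size, and sequence length are placeholders rather than real data:
import numpy as np
from tensorflow.keras import layers, models

# Toy data: 4 already-tokenized sentences padded to length 6, over a vocabulary of 50 ids
x = np.random.randint(1, 50, size=(4, 6))
y = np.array([1, 0, 1, 0])                  # 1 = positive, 0 = negative

model = models.Sequential([
    layers.Embedding(input_dim=50, output_dim=16),
    layers.Conv1D(filters=32, kernel_size=3, activation="relu"),  # n-gram-like filters
    layers.GlobalMaxPooling1D(),             # keep the strongest response of each filter
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=5, verbose=0)
print(model.predict(x))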
An MLP consists of one input layer, one or more layers of threshold logic units (TLUs) termed hidden layers, and one final layer of TLUs called the output layer. The layers close to the input layer are typically termed the lower layers, while the ones adjacent to the outputs are usually called the upper layers. Every layer, except the output layer, includes a bias neuron and is fully connected to the following layer, as shown in Fig. 10.7.
When you’ve got one of those artificial neural networks (ANNs) that’s filled with lots
of these hidden layers, they call it a “deep neural network.” It’s like the heavyweight
champ of neural networks. There’s a whole discipline called deep learning, which is all
about researching these DNNs and other models that perform some real deep thinking,
crunching a ton of computations. For a long period of time, researchers were impeded
by a curious challenge: teaching those multi-layer perceptrons (MLPs) the ropes of
learning, and to be honest, they weren’t making much progress. It seemed like attempt-
ing to teach an old dog new tricks. But then, in 1986, three authors—David Rumelhart,
Geoffrey Hinton, and Ronald Williams—put a game-changing piece of work on the
table. It was like a eureka moment in science. They introduced what’s now known as the
backpropagation training algorithm, and it’s a heavyweight champ that’s still shaking
the ring today.
In simple terms, this backpropagation stuff is like the old gradient descent approach,
but with a modern twist. Instead of painstakingly crunching the data to find out how to
adjust every single aspect, it’s got this smart trick under its sleeve. In just two rounds of
action across the network (one forward, one backward), our algorithm magically com-
putes how to fine-tune each link weight and bias term to make those mistakes vanish into
∫ ∫ ∫ Output
∑ ∑ ∑ layer
∫ ∫ ∫ ∫ Hidden
1 ∑ ∑ ∑ ∑ layer
Input
1
layer
x1 X2
thin air. It’s like having a smart mentor for your neural network, instructing it exactly on
where to improve. Once it’s armed with these gradients (they’re like golden hints on how
to navigate), it takes a confident step in the proper direction, following the script of gradi-
ent descent. Repeat this dance until the network strikes the bullseye and discovers the
solution it’s been searching for. It’s kind of like a chef adding a pinch of salt at a time,
tasting, and modifying until the meal is perfectly right.
Let’s take a deeper look at this algorithm.
It takes it slow and steady, handling just a piece of data at a time, like a batch of 32
samples. It doesn’t stop there; it goes through the complete training set numerous times,
and each time it does, we term it an “epoch”. It’s kind of like completing laps on a track;
you keep going around until you become better. Each of these mini-batches gets sent on a
journey. It starts at the network’s front entrance, the input layer, which passes it on to the
first concealed layer. Now, here’s where the actual action happens. The algorithm crunches
the numbers, determining what all the neurons in this layer think about the mini-batch (and
it does this for every instance in the mini-batch). It doesn’t stop there; it passes these
results along to the next layer, which makes its own computations and gives it off to the
next, and so on. It’s like a relay race, where everyone passes the baton until it reaches the
last runner, which is the output layer. We call this entire trek the “forward pass”. It’s like
forecasting the future, but we keep track of all the stages along the way, just in case.
Next up, it’s judgment time. The algorithm takes a long, hard look at what the network
spits forth. It’s time to figure out precisely how incorrect it was, using a loss function that
compares what the network stated with what it should’ve said. This lets us measure how
far off the target we are. Here’s when the magic happens: the algorithm starts determining
who’s to blame. It’s like being a detective in a difficult case, and it works out just how
much each connection contributed to the mess-up. This phase involves a mathematical
technique called the chain rule, which is kind of like the secret sauce of calculus. It’s rapid
and accurate, like using a laser to solve a puzzle. Now, it’s time to trace back the stages.
The algorithm looks at how much of this mess comes from each link in the layer below.
It’s like tracing back your steps in a maze until you locate the entrance again. This compo-
nent likewise utilizes the chain rule but in reverse, working its way backward until it
reaches the starting point, the input layer. We call this the “backward pass”, and it’s why
the whole algorithm got its name.
Finally, the algorithm rolls up its sleeves and starts to work. It executes a gradient
descent step to change all the connection weights in the network. It utilizes the error gra-
dients it just computed as its guidance, like a chef adding exactly the right amount of spice
to a meal to make it perfect. This algorithm is a big deal, so let’s break it down one more
time: for each training example, it starts by making a guess (the forward pass), and then it
looks back through each layer to see who’s to blame for any mistakes (the reverse pass);
finally, it tweaks the connections to fix those mistakes. It’s like a dance where every step
matters, and in the end, we get a better-tuned neural network.
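To ground these ideas, the following tiny example (toy numbers of our own choosing, for a single linear neuron with a squared-error loss) shows one forward pass, one application of the chain rule, and one gradient-descent weight update:
# One gradient-descent update for a single weight, worked with the chain rule
x, y_true = 2.0, 1.0          # input and target
w, b, lr = 0.5, 0.0, 0.1      # weight, bias, learning rate

y_pred = w * x + b                     # forward pass (linear neuron)
loss = (y_pred - y_true) ** 2          # squared-error loss
# Backward pass: dloss/dw = dloss/dy_pred * dy_pred/dw = 2 * (y_pred - y_true) * x
grad_w = 2 * (y_pred - y_true) * x
w = w - lr * grad_w                    # gradient-descent step
print(loss, grad_w, w)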
Fig. 10.8 A multi-layer perceptron for multi-class classification, with a ReLU hidden layer and a softmax output layer
For a multi-label case such as classifying emails, one output neuron may estimate the probability that a message is spam, while another focuses on the probability of being urgent. Each output neuron has its own shot at guessing. You
may wind up with nonurgent ham, urgent ham, nonurgent spam, or even (unlikely) urgent
spam—but that’s probably an algorithm malfunction!
Now, what if you’re dealing with a more complicated scenario? Say you’re categorizing
items into one of several categories (e.g., sorting photographs of digits into numbers 0
through 9). Well, you’ll need one output neuron for each potential class, and here’s where
the softmax activation function comes into play as shown in Fig. 10.8. It guarantees that
all the estimated probabilities are neatly restricted between 0 and 1, and they all add up to
1, which is critical when classes don’t overlap. This is what we term multi-class categori-
zation, when each instance belongs to just one class out of several options. So, there you
have it—MLPs are like the Swiss Army knives of the machine learning world, able to
tackle all sorts of categorization issues, from basic “yes or no” judgments to juggling
many labels and categories. It’s like having a versatile tool in your toolbox for each catego-
rization assignment that comes your way.
Now, when it comes to the hard lifting that neural networks need, Keras doesn’t accom-
plish it alone. It relies on something called a compute back end, which is like the engine
under the hood of your automobile. As of today, you’ve got three good options to select
from: TensorFlow, Microsoft Cognitive Toolkit (CNTK), and Theano. Just to keep things
very clear, we’ll name this original Keras implementation “multibackend Keras” because
it plays nice with different back ends.
Let’s shift our attention from the conventional feedforward neural networks we’ve been
investigating so far. Imagine a different form of neural network, one that’s a little like a
feed-forward network but with a twist—it contains connections that loop back on them-
selves. Picture the simplest form of this, which is a recurrent neural network (RNN), as if
it’s a neuron that accepts inputs, creates an output, and then loops that output back to itself.
It’s like a feedback loop in a conversation.
Now, think about it in terms of time. At each time step, this recurrent neuron gets inputs
(let’s call them x(t)), and it also gets its own output from the previous time step, which
we’ll name y(t – 1). When you’re just starting off, at the very first time step, there’s no
prior output, so we normally set it to zero. If you want to envision it, you can imagine this
small network spread out along a timeline, with the same neuron replicated at each time
step, kind of like flipping through a flipbook. The fancy phrase for it is “unrolling the net-
work through time.” It’s like witnessing a movie frame by frame, except in this instance,
it’s a neural network developing step by step as shown in Fig. 10.9.
A portion of a neural network that hangs onto information over time is termed a "memory
cell”, or just “cell” for short. Now, here’s the thing: a single recurrent neuron or even a
whole layer of them are like the rookie version of a cell. They can retain just brief patterns,
generally around ten steps’ worth, although this might vary depending on what they’re
learning.
But wait, it gets more fascinating! Later on, we’ll go into fancier cell types that can
recall longer patterns, probably about 10 times longer, again depending on the objective.
These cells are like the seasoned pros of the memory field.
So, how does it work? Well, a cell’s state at a specific time step, let’s call it h(t) (the “h”
is merely for “hidden”), is basically a mix of certain inputs at that point and its own state
from the previous time step. It’s like mixing up components in a recipe: h(t) = f(h(t – 1),
x(t)). The output at the time step, y(t), is also created from the prior state and what’s hap-
pening right now. In basic cells, the output is just the state itself, but in the more compli-
cated ones, things may become a bit trickier. It’s like a recipe where the product could be
different from what you put in.
Fig. 10.10 Seq-to-seq (top left), seq-to-vector (top right), vector-to-seq (bottom left), and encoder-decoder (bottom right) networks
The model parameters are then updated using the gradients computed during BPTT (backpropagation through time). It's worth highlighting that these gradients run
across all the outputs that matter for the cost function, not just the final one. For example,
in the figure, the cost function cares about the last three outputs of the network—Y(2), Y(3),
and Y(4)—and therefore, the gradients trip across these three outputs, ensuring they all
contribute to the learning process.
However, it’s vital to observe that the gradients don’t flow via Y(0) and Y(1). Additionally,
because we employ the same parameters W and b at each time step, the backpropagation
mechanism automatically performs the proper task by summing across all time steps.
The following Python code shows a simple RNN for sentiment classification using Keras:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
# Sample text data and labels
texts = ["I love this movie!", "This is a terrible film.", "Awesome acting!", "I couldn't stand it."]
# 1 for positive sentiment, 0 for negative sentiment
labels = [1, 0, 1, 0]
# Tokenize the text data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
# Pad sequences to have the same length
max_sequence_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_sequence_length)
# Create a simple RNN model
model = Sequential()
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=32, input_length=max_sequence_length))
# Simple RNN layer
model.add(SimpleRNN(32))
# Output layer for binary classification
model.add(Dense(1, activation='sigmoid'))
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Convert labels to numpy array
labels = np.array(labels)
# Train the model
model.fit(sequences, labels, epochs=10, batch_size=2)
# Evaluate the model
loss, accuracy = model.evaluate(sequences, labels)
print(f"Loss: {loss}, Accuracy: {accuracy}")
After running the code, you should see training progress displayed in the console.
Here’s what you might see as output during the training process:
Epoch 1/10
2/2 [=========================] - 1s 3ms/step - loss: 0.6883 - accuracy: 0.5000
Epoch 2/10
2/2 [=========================] - 0s 3ms/step - loss: 0.6162 - accuracy: 0.7500
…
Epoch 10/10
2/2 [=========================] - 0s 3ms/step - loss: 0.3505 - accuracy: 1.0000
This output displays the progress of training across ten epochs. Each epoch corresponds
to one trip through the training data. The model’s loss (binary cross-entropy) and accuracy
on the training data are presented for each epoch. In this simplistic example, the model
should increase its accuracy as it continues to train.
After training, the evaluation step prints the final loss and accuracy values.
These statistics represent the model’s performance on the training data once training is
complete. The loss should decrease, and the accuracy should approach 1.0 (100%) if the
model is learning well on this small sample. In reality, you would normally evaluate the
model on a separate test dataset to check its generalization ability.
While the use of short-term memory in recurrent neural networks (RNNs) could look as if
it can answer practically any deep learning difficulty, it’s crucial to note that RNNs are not
without their limits.
The difficulty with RNNs derives from their primary characteristic: the repeating pro-
cessing of the same input over time. When similar information travels through the same set
of cells several times, it can progressively lose its impact and finally fade away, especially
if the cell weights are set to modest numbers. This process is known as the “vanishing
gradient problem”, where error-correction signals, vital for learning, decline as they pass
through the layers of a neural network.
Due to the vanishing gradient problem, you have challenges when attempting to stack
too many layers of RNNs or update them properly. This constraint imposes constraints on
the depth and complexity of RNN-based models in deep learning tasks. Recurrent neural
networks (RNNs) face obstacles that are noticeably more complicated. In the context of
backpropagation, gradients act as corrections that regulate the incorrect adjustments per-
formed by networks during prediction. The layers before the final prediction are respon-
sible for propagating these gradients back to the input layers, which eventually enable
correct weight updates. However, when a layer receives a minute gradient update, its
learning process might essentially halt.
In the case of RNNs, a particularly confusing situation develops when internally back-
propagated signals tend to fade away after multiple recursive steps. As a result, RNNs are
more proficient at updating and learning from current sequences, whereas they tend to
forget previous signals. This constraint renders RNNs rather myopic, reducing their effi-
cacy in jobs that necessitate a more extended memory span.
It’s vital to remember that in an RNN layer, backpropagation occurs both outward,
impacting adjacent layers, and inward, altering the memory within each RNN cell.
Regrettably, regardless of the initial intensity of the signal, it tends to decrease and finally
fades with time. This basic short memory and the vanishing gradient problem provide
major hurdles for RNNs when it comes to learning longer sequences. Applications such as image captioning or machine translation rely on remembering all components of a sequence. Consequently, many practical applications demand other methodologies, and classic RNNs have been replaced by more sophisticated recurrent cells.
Long short-term memory networks (LSTM) are arranged around what we call “gates”.
These gates are essentially internal mechanisms within the LSTM cell that perform math-
ematical operations such as summation, multiplication, and activation functions to pre-
cisely control the flow of information. This control over information flow allows a gate to
make judgments about what to keep, what to highlight, and what to discard, all dependent
on the input it gets from a sequence, whether for short-term or long-term memory. This
sophisticated flow control device bears parallels to the way an electric circuit functions.
The picture presents a visual illustration of the internal structure of an LSTM, shedding
light on how these gates are positioned and operate inside the network.
Understanding the numerous components and functions of LSTM networks could
appear difficult at first, but breaking it down into the following sequence of phases can
clarify the process:
1. To begin, the short-term memory, which either carries over from a prior state or begins
with random values, interacts with the freshly entered segment of the sequence. This
interaction leads to the production of an initial derivation.
2. The short-term memory, which now comprises a blend of the previous signal and the
freshly entered signal, attempts to reach the long-term memory. However, before doing
so, it must go via the forget gate, which plays a vital role in selecting what information
to delete. This requires a technical bifurcation where the signal is duplicated.
3. The forget gate functions by making decisions about which short-term information
should be preserved and which should be omitted. It does this by adopting a sigmoid
activation function, which efficiently filters out signals considered non-essential while
increasing those judged critical for preservation.
4. The information that successfully overcomes the forget gate travels to the long-term
memory channel, taking along the data from prior states.
5. The values stored in the long-term memory merge with the output from the forget gate
by multiplication.
6. For the component of short-term memory that didn’t pass through the forget gate, it
takes a duplicate path. One component proceeds to the output gate, while the other
encounters the input gate.
7. At the input gate, the short-term memory data undergoes independent modifications
including a sigmoid function and a tanh function. The outcomes of these two modifica-
tions are then multiplied together and added to the long-term memory. The influence on
long-term memory depends on the sigmoid function, which decides whether the signal
is worth remembering or should be ignored.
8. Following the addition with the outputs from the input gate, the long-term memory
stays intact. Composed of chosen inputs from the short-term memory, the long-term
memory stores information for prolonged durations within the sequence and is immune
to transitory gaps.
9. The long-term memory acts as a direct source of knowledge for the subsequent state.
Additionally, it is transferred to the output gate, where it converges with the short-term
memory. This last gate normalizes the data from the long-term memory using the tanh
activation and filters the short-term memory using the sigmoid function. These findings
are multiplied together before being passed to the next stage in the sequence.
In the world of LSTMs, a mix of sigmoid and tanh activations plays a vital role inside
their gates. The essential notion to bear in mind here is how these two functions operate.
Firstly, the tanh function carries out normalization of its input, restricting it inside the
range of −1 to 1. This implies it maintains the input values within a reasonable and con-
trollable range.
On the other hand, the sigmoid function serves a different purpose. It operates to com-
press the input values, effectively decreasing them to lie inside the 0 to 1 range.
Consequently, it has the potential to “switch off” weaker signals by moving them closer to
zero, thereby eliminating their effects.
In simpler terms, the sigmoid activation function serves a dual purpose—it aids in
remembering by amplifying the signal when it’s important and assists in forgetting by
dampening or suppressing the signal when it’s not as essential or beneficial. This duality
is what makes the sigmoid function a critical component in the operation of LSTMs.
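To make the gate arithmetic concrete, the following is a minimal NumPy sketch of a single LSTM cell step, written directly from the generic description above rather than from any particular library; the function name, matrix shapes, and toy sizes are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b stack the parameters of the forget,
    input, candidate, and output gates (4 * hidden rows each)."""
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0 * hidden:1 * hidden])   # forget gate: what to drop from long-term memory
    i = sigmoid(z[1 * hidden:2 * hidden])   # input gate: what new information to admit
    g = np.tanh(z[2 * hidden:3 * hidden])   # candidate values, normalized to [-1, 1]
    o = sigmoid(z[3 * hidden:4 * hidden])   # output gate: what to expose as short-term memory
    c_t = f * c_prev + i * g                # update of the long-term memory (cell state)
    h_t = o * np.tanh(c_t)                  # new short-term memory (hidden state)
    return h_t, c_t

# Toy walk through a short random sequence (sizes are illustrative assumptions)
rng = np.random.default_rng(0)
input_dim, hidden = 8, 4
W = rng.normal(size=(4 * hidden, input_dim))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for t in range(5):
    h, c = lstm_step(rng.normal(size=input_dim), h, c, W, U, b)
print(h.round(3))

The sigmoid outputs act as the "switches" described above, while the tanh keeps the candidate and output values in a controllable range.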
[Figure: internal structure of a GRU cell, showing its update and reset gates built from sigmoid activations and elementwise operations.]
In a gated recurrent unit (GRU), the working memory gets refreshed by an update gate that takes into consideration the current
information presented to the network. The new information is then merged with the old
working memory through a gate known as a reset gate. This reset gate chooses the working
memory information to effectively keep a recollection of the data to be passed on to the
next phase in the sequence.
Unlike LSTMs, GRUs employ a reset gate to reject irrelevant information and an update
gate to maintain valuable signals. GRUs offer a cohesive memory system without separa-
tion between long- and short-term memory components.
You have the freedom to effortlessly integrate both gated recurrent units (GRUs) and
LSTM layers into your neural networks without requiring major code adjustments. To
include these layers, just import them using “keras.layers.GRU” or select for “keras.lay-
ers.CuDNNGRU” if you want the GPU-accelerated version, which utilizes the NVIDIA
CuDNN library.
When working with GRUs, your interaction with them closely mimics that of LSTM
layers. You may define the number of GRUs necessary in a layer by supplying the “units”
parameter. This smooth interchangeability between LSTM and GRU layers offers several
benefits and considerations:
• Mitigating the disappearing gradient: GRUs handle signals much like LSTMs, which
makes them useful in resolving the disappearing gradient problem, a hurdle that can
hamper the training of deep networks.
• Unified memory approach: In contrast to LSTMs, GRUs do not discriminate between
long-term and short-term memory. They rely on a single working memory known as a
cell state, which undergoes recurrent processing within the GRU cell.
• Simplicity in architecture: GRUs tend to be less complicated than LSTMs, which might
be helpful in terms of model simplicity and decreased processing needs.
• Memory span: However, owing to their simpler structure, GRUs may have issues in
keeping information over lengthy durations, making LSTMs more ideal for jobs involv-
ing longer sequences.
• Efficient training: GRUs often train more quickly than LSTMs, largely because they have fewer parameters to update.
• Handling limited data: When training data is minimal, GRUs generally outperform LSTMs. Their smaller capacity to over-remember information makes them less prone to overfitting the data they encounter.
In the end, your decision between GRUs and LSTMs should correspond to the unique
needs of your task, the length of the sequences you’re dealing with, the available training
data, and the trade-offs you are ready to make in terms of model complexity and memory
capacity. Each architecture boasts its own distinct strengths and shortcomings, and your
pick should be driven by the precise objectives of your machine learning or deep learning
project.
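As a hedged illustration of this interchangeability, the sketch below builds the same small Keras classifier with either recurrent layer; the helper function, layer sizes, sequence length, and vocabulary size are illustrative choices rather than values prescribed in the text.

from keras.models import Sequential
from keras.layers import Embedding, GRU, LSTM, Dense

def build_model(cell="gru", vocab_size=10000, seq_len=100):
    """Identical model apart from the recurrent layer; swapping GRU for LSTM
    (or back) changes only the single line below."""
    recurrent = GRU(32) if cell == "gru" else LSTM(32)   # "units" supplied positionally
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=32, input_length=seq_len))
    model.add(recurrent)
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

gru_model = build_model("gru")
lstm_model = build_model("lstm")

Everything outside the recurrent layer, including compilation and training, stays the same for both choices.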
LSTMs have been applied across a wide range of text mining tasks, for example:
• Named entity recognition (NER): LSTMs can identify and classify named entities in text, such as persons, organizations, locations, dates, and more. LSTM's aptitude in recognizing patterns and contextual signals boosts the accuracy of NER systems.
• Machine translation: LSTM serves as a crucial component in machine translation
models like sequence-to-sequence (seq2seq) models. These algorithms accurately
translate text from one language to another by comprehending subtle links between
words in various languages.
• Text generation: LSTM's recurrent design makes it appropriate for text-generation tasks. By training on extensive text corpora, LSTM-based models can generate coherent and contextually relevant text, finding application in chatbots, content creation, and creative writing.
• Text summarization: LSTM has proved its worth in text summarization, where it autonomously pulls key information from long documents to create short summaries. This is notably effective in news aggregation, research paper abstracts, and content curation.
• Question answering: LSTM-powered question-answering systems can interpret and
create replies to natural language inquiries based on a given corpus of text. This tech-
nology is employed in chatbots, virtual assistants, and customer care.
import numpy as np
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
# Load the IMDb dataset (you can adjust the number of words)
max_words = 10000
(X_train, y_train), (X_test, y_test) = imdb.load_data(
num_words=max_words)
# Preprocess the data
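# The remainder of this example is not reproduced in the text; the lines below are a
# sketch consistent with the description that follows (the sequence length, layer
# sizes, batch size, and number of epochs are illustrative assumptions).
max_sequence_length = 200
X_train = pad_sequences(X_train, maxlen=max_sequence_length)
X_test = pad_sequences(X_test, maxlen=max_sequence_length)
# Embedding layer, LSTM layer, and dense output layer for binary classification
model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=32, input_length=max_sequence_length))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Train on the training data, then evaluate on the held-out test data
model.fit(X_train, y_train, epochs=3, batch_size=64)
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss}, Test Accuracy: {accuracy}")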
In this code, we import the IMDb movie reviews dataset using Keras and pre-process
it, including padding the sequences to a given length. We develop a basic LSTM-based
model with an embedding layer, an LSTM layer, and a dense layer for binary classifica-
tion. The model is compiled with binary cross-entropy loss and the Adam optimizer. We
train the model on the training data.
Finally, we evaluate the model on the test data and output the test loss and accuracy.
Please note that this is a simple example, and you may change it based on your individual
text mining goal and dataset. You may need to fine-tune hyperparameters, pre-process
your text input differently, and explore more complicated architectures for more challeng-
ing tasks.
10.2.6 Transformers
Transformers, in the context of technology and language processing, are neither shape-
shifting robots nor characters from a sci-fi movie. Instead, they are a pioneering class of
deep learning models that have encouraged a revolution in the way humans comprehend
and interact with language. Imagine a transformer as a highly talented and flexible lin-
guist, capable of interpreting language in a way that mirrors human cognition. These
models have the unique capacity to analyze text by evaluating the links between words,
the intricacies of context, and the relevance of each word within a sequence. At its founda-
tion, transformers utilize a unique technique known as “self-attention”. Think of this pro-
cess as a sophisticated attention system that allows the model to focus on specific bits of
text, giving varying levels of relevance to distinct words based on the surrounding context.
It’s like having an experienced reader who can pay greater attention to crucial elements
while analyzing a piece of text.
The uses of transformers in text mining are as diverse as the sea of words they analyze.
They’ve made their way into sentiment analysis, where comprehending not only the words
but the context in which they’re used is essential for correct sentiment identification. In
text classification, transformers excel in grasping the complicated links between docu-
ments and categories, making them a go-to choice for jobs like news categorization and
spam detection.
What makes transformers even more interesting is their capacity to handle extended
sequences of text. Unlike their predecessors, which battled with sustaining context over
lengthy stretches, transformers thrive in this aspect. They can evaluate lengthy articles,
research papers, and even entire novels while preserving a strong comprehension of the
narrative’s flow. Yet, it’s not all plain sailing in the context of transformers. With tremen-
dous power comes great responsibility, and these models are no exception. They’re known
for their thirst for computing resources; training huge transformers may be resource-
intensive and pricey. The huge model sizes might also cause issues in deployment and
real-time applications.
[Figure: a transformer encoder block — a self-attention layer followed by position-wise feed-forward layers.]
• Encoder: The function of the encoder is to traverse the input time steps systematically
and turn the entire sequence into a constant-length vector known as a context vector.
• Decoder: Conversely, the decoder’s role requires moving through the output time steps
while extracting information from the context vector.
1. The initial phase involves transmitting the word embeddings of the input sequence to
the initial encoder.
2. Subsequently, these embeddings undergo change and are sent to the succeeding encoder
in the sequence.
3. The resultant output from the ultimate encoder within the encoder stack is distributed
to all the decoders within the decoder stack.
• Self-attention: Imagine reading a sentence, and at each word, you pay different
amounts of attention to the other words in the phrase. You could devote more attention
to the words immediately connected to the one you’re reading and less to those further
away. This selective concentration is comparable to how self-attention works in
transformers.
1. For each word in a sequence, the self-attention mechanism computes a “score” for
every other word in the same sequence.
2. These scores determine the amount of significance or relevance of each word to the
word being processed.
3. The scores are then used to weigh the values of all the words, and the weighted values
are merged to form a new representation for the word under examination.
This procedure happens for every word in the sequence, allowing the model to capture
dependencies and context efficiently. Self-attention enables the model to grasp not just the
words themselves but also how they relate to each other within the sequence, making it a
useful tool for tasks like language processing and translation.
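To make the scoring-and-weighting procedure concrete, here is a minimal NumPy sketch of (scaled) dot-product self-attention for a toy sequence; the projection matrices, dimensions, and random values are illustrative assumptions, not the parameters of any particular model.

import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X has shape (seq_len, d_model); returns one new vector per word."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # step 1: a score for every pair of words
    weights = softmax(scores, axis=-1)        # step 2: relevance of each word to the current one
    return weights @ V                        # step 3: weighted combination of the values

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16                      # e.g. a five-word sentence
X = rng.normal(size=(seq_len, d_model))       # toy word embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 16)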
• Multi-head attention: This mechanism gives the model numerous "sets of eyes" or "heads" that can focus on different portions of the data concurrently.
1. The input sequence is turned into several representations by applying distinct sets of
learnable parameters (these are the “heads”).
2. Each head then executes self-attention separately on these representations, yielding
distinct sets of weighted values.
3. These sets of weighted values are then concatenated or mixed in a way that conveys
different sorts of relationships and context within the sequence.
The idea behind multi-head attention is that it allows the model to pay attention to dif-
ferent sections of the input sequence with distinct “perspectives”. Each head can focus on
various features, such as syntax, semantics, or word connections, and then the model can
learn to blend these multiple views to produce more educated predictions or
representations.
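Building on the previous sketch, the hedged example below runs several independent attention "heads" over subspaces of the model dimension and concatenates their outputs; the random weights and sizes are again purely illustrative.

import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads=4):
    """Split the model dimension into `heads` subspaces, attend in each,
    and concatenate the results; weights are random here for illustration."""
    rng = np.random.default_rng(0)
    seq_len, d_model = X.shape
    d_head = d_model // heads
    outputs = []
    for _ in range(heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
        outputs.append(weights @ V)            # one "perspective" per head
    return np.concatenate(outputs, axis=-1)    # combined back to (seq_len, d_model)

X = np.random.default_rng(1).normal(size=(5, 16))
print(multi_head_attention(X, heads=4).shape)  # (5, 16)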
One of the most influential transformer-based models is BERT (Bidirectional Encoder Representations from Transformers), which is trained in two stages:
–– Pre-training: BERT begins on a journey across a large sea of text data, where it
learns to forecast missing words in sentences by examining the surrounding words
from both sides. This pre-training approach equips BERT with an amazing depth of
linguistic knowledge. It grasps subtleties, context, and the links between words in a
way that prior models could only dream of.
–– Fine-tuning: Following pre-training, BERT undertakes fine-tuning for specific lin-
guistic tasks. Whether it’s categorizing text emotion, answering queries, or any other
language-related difficulty, BERT adjusts its considerable linguistic skills to thrive
in these jobs. It’s as though BERT is not simply skilled in language but can also
modify its talents to face diverse linguistic obstacles.
But what makes BERT truly exceptional is its adaptability. It’s not restricted to a par-
ticular activity; instead, it’s a multi-tool for language interpretation. In sentiment analysis,
BERT doesn't only recognize positive or negative sentiments; it comprehends the subtle
emotions and nuances that lead to such sentiments.
The BERT model performed pre-training on a huge text dataset, integrating informa-
tion from Wikipedia (equal to a stunning 2500 million words) and the Book Corpus (com-
prising 800 million words). This model is then customizable by fine-tuning with question
and response datasets. Figure 10.14 shows the pictorial representation of the BERT model.
In the field of language models, we meet a major hurdle—the lack of training data. The
discipline of natural language processing (NLP) encompasses different tasks, each sup-
ported by task-specific datasets, yet these datasets generally consist of simply a few thou-
sand or a few hundred thousand human-labelled training instances.
To overcome this difficulty, BERT adopts a novel technique. It trains a language model
using a big unlabelled text corpus, integrating unsupervised or semi-supervised learning.
Additionally, it allows this large model to be fine-tuned for specific NLP tasks in a supervised way, effectively exploiting the huge reservoir of knowledge gained during pre-training. What
makes BERT different is its capacity to process text bidirectionally, something not previ-
ously feasible by language models. Unlike its predecessors, which could only read text
sequentially, either from left to right or right to left, BERT reads in both ways simultane-
ously. This extraordinary feature, made possible by the emergence of transformers, is fit-
tingly referred to as bidirectionality.
BERT comes in two variants, each featuring a pre-trained model that’s been tuned on a
huge dataset. BERT’s path included building upon various existing NLP algorithms and
designs, including semi-supervised training, OpenAI Transformers, ELMo Embeddings,
ULMFit, and the formidable transformer architecture.
At its heart, BERT primarily consists of an encoder stack within the transformer architecture. Here's a breakdown of the two variants:
• BERT Base: 12 encoder layers, a hidden size of 768, 12 attention heads, and roughly 110 million parameters.
• BERT Large: 24 encoder layers, a hidden size of 1024, 16 attention heads, and roughly 340 million parameters.
These two versions of BERT offer a variety of features to cater to diverse NLP jobs and
applications.
The model’s input is provided in a comprehensible manner, both as an individual sen-
tence and as a pair of phrases (like a query and its answer), all inside a single sequence of
tokens. BERT makes use of WordPiece embeddings from a vocabulary of 30,000 tokens.
Figure 10.15 shows BERT with Input and Output representation.
In this configuration, the very first token in each input sequence is constantly labelled
as [CLS], which serves as the classification token. The resultant hidden state correspond-
ing to this token serves a significant function as the overall representation of the entire
sequence, notably for classification tasks.
When sentence pairs are merged into a continuous sequence, they are differentiated in
two separate ways. Firstly, a [SEP] token is employed as a separator. Secondly, a learned
embedding is applied to each token, indicating whether it corresponds to sentence A or
sentence B.
[Figure: token embeddings — the input sentence "I LIKE STRAWBERRIES" is split into WordPiece tokens from a 30,000-token vocabulary, each mapped to a 768-dimensional embedding vector.]
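As a hedged illustration of this input format, the sketch below uses the Hugging Face transformers library (not used elsewhere in this chapter) to encode a sentence pair; the model name and returned fields come from that library, and the exact token ids depend on the vocabulary it downloads.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Encoding a sentence pair; [CLS] and [SEP] are inserted automatically
encoded = tokenizer("I like strawberries", "They are in season.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'i', 'like', ..., '[SEP]', ..., '[SEP]']
print(encoded["token_type_ids"])   # 0 for tokens of sentence A, 1 for sentence B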
Several characteristics enable transformers to handle long input sequences:
1. Attention mechanism: At the core of transformers lies the attention mechanism, spe-
cifically self-attention. This approach allows the model to focus on different regions of
the input sequence, independent of its length. Unlike earlier models that struggled with
lengthier sequences owing to computational restrictions, transformers can effectively
weigh the relevance of words and their relationships throughout vast material.
2. Parallel processing: Transformers make use of parallel processing. Traditional
approaches handled text sequentially, which was a serious bottleneck when dealing
with long texts. In contrast, transformers may evaluate many sections of a sequence
concurrently, considerably speeding up the calculation and making them more efficient
for lengthy inputs.
3. Subword tokenization: To handle extended sequences, transformers apply subword
tokenization. Instead of considering each word as a single token, words are broken into
smaller subword units. This not only decreases the vocabulary size but also helps the
model to absorb lengthier texts more effectively. It guarantees that unusual or lengthy
words don’t impair performance.
4. Chunking: In reality, really lengthy sequences might still pose issues for transformers.
In such instances, the input can be separated into smaller pieces or segments, each
handled individually. The model then mixes the representations of these segments to
make sense of the overall sequence. This chunking strategy (sketched in code after this list) effectively expands the model's capacity to handle very long texts.
5. Hierarchical transformers: Another way to handle large sequences employs hierar-
chical transformers. Instead of processing text as one continuous sequence, it’s divided
down into hierarchies. The model initially examines chunks of text and then aggregates
their representations to comprehend the complete material. This hierarchical structure
is particularly effective for activities like document categorization.
6. Memory enhancement: Some types of transformers feature memory enhancement
technologies. These processes allow the model to retain and retrieve information from
prior sections of the sequence, eliminating the need to re-analyze the full text for each
query or prediction. This simulates humanlike memory and assists in processing long
discussions or papers more effectively.
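As a small illustration of the chunking idea (point 4 in the list above), the sketch below splits a long token sequence into overlapping segments; the chunk size and overlap are arbitrary choices for demonstration.

def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a long token list into overlapping chunks that a model with a
    fixed maximum input length can process one at a time."""
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, max(1, len(tokens) - overlap), step)]

tokens = [f"tok{i}" for i in range(1200)]     # stand-in for a long tokenized document
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])  # e.g. 3 chunks of up to 512 tokens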
Transformers now power many everyday applications:
1. Search engines: When you input a query into a search engine, it's like asking a question.
Transformers assist search engines to grasp the context of your question, making the
results more relevant. They may also examine the content of Web sites to identify their
quality and relevancy, guaranteeing that you obtain the best answers to your inquiries.
2. Chatbots and virtual assistants: Have you ever interacted with a virtual assistant or a
chatbot? Transformers play a significant part here, making these exchanges feel more
like talks with humans. They interpret your questions, deliver appropriate replies, and
even sense your emotions through the text, giving a more customized experience.
10.3 Deep Learning in Sentiment Analysis
Deep learning provides a varied array of architectures to model sentiment analysis prob-
lems and has overtaken other machine learning methods as the foremost strategy for com-
pleting sentiment analysis tasks. Recent breakthroughs in deep learning architectures
show a move away from recurrent and convolutional neural networks and the rising accep-
tance of transformer language models. Utilizing pre-trained transformer language models
to transmit information to downstream tasks has been a milestone in NLP.
In an age dominated by digital communication, understanding the sentiment behind
text data has become important. Sentiment analysis, often known as opinion mining, is the
act of understanding emotions, views, and attitudes indicated in textual information. It
gives companies and individuals important insights into public attitude, be it toward prod-
ucts, services, or societal concerns. Traditional techniques for sentiment analysis generally
struggled to capture the complexities and nuances of human discourse. This task opened
the groundwork for deep learning to revolutionize sentiment analysis.
Deep learning, a subset of machine learning, has emerged as the driving force behind
the growth of sentiment analysis. Unlike rule-based or shallow machine learning
approaches, deep learning models may independently learn sophisticated patterns and rep-
resentations from enormous quantities of text input. This part presents the notion of deep
learning and its significant importance in the context of sentiment analysis. At its core,
deep learning mimics the workings of the human brain by mimicking artificial neural net-
works. These networks consist of layers of interconnected nodes, each layer analyzing and
modifying input to find hidden patterns.
Neural network topologies serve as the cornerstone of deep learning in sentiment anal-
ysis. They have the unique capacity to capture complex correlations among data, making
them well suited for the complexities of sentiment analysis.
[Figure 10.16: a simple feed-forward neural network — an input layer (x1, x2, x3), a hidden layer (h1, h2, h3), and an output unit hw,b(x).]
We can think of deep learning as the supercharged sibling of artificial neural networks,
taking learning challenges to a whole new level. It’s like giving neural networks a dose
of superpowers, allowing them to stretch their learning muscles in ways that were not feasible earlier with just a handful of layers and a sprinkling of data. Neural networks are inspired by the complex web of connections in the human brain. These
networks are an assembly of many information-processing units, each masquerading as
a neuron, building layers that communicate smoothly. Just like a synced orchestra, they
work together harmoniously.
These networks may learn to tackle various tasks, from categorization to more sophis-
ticated feats, by playing with the connections between these artificial neurons. It’s compa-
rable to witnessing a machine duplicate the learning dance of a real brain, and the results
may be nothing short of astonishing. Figure 10.16 shows a simple feed-forward neural
network.
Many deep learning models in the field of natural language processing (NLP) require word
embedding results as input features. Word embedding is a technique used for language
modeling and feature learning. It essentially transforms words from a vocabulary into vec-
tors of continuous real numbers. For example, it converts a word like “hat” into a numeri-
cal vector like (…, 0.15, …, 0.23, …, 0.41, …).
This technique typically involves a mathematical transformation from a high-
dimensional sparse vector space (such as a one-hot encoding vector space, where each dimension corresponds to a single word in the vocabulary) to a much lower-dimensional, dense vector space of continuous values.
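As a tiny illustration of this mapping, the NumPy sketch below contrasts a one-hot vector with a dense embedding for the word "hat"; the toy vocabulary, dimensionality, and (random) embedding values are purely illustrative assumptions.

import numpy as np

vocab = {"the": 0, "hat": 1, "cat": 2}               # toy vocabulary
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 4))  # learned in a real model; random here

# Sparse one-hot representation of "hat": as long as the vocabulary
one_hot = np.zeros(len(vocab))
one_hot[vocab["hat"]] = 1.0
# Dense embedding of "hat": a short vector of continuous real numbers
dense = embedding_matrix[vocab["hat"]]
print(one_hot)          # [0. 1. 0.]
print(dense.round(2))   # e.g. four continuous values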
Researchers have mostly looked into sentiment analysis at three distinct degrees of detail:
document level, sentence level, and aspect level.
Document-level sentiment classification includes categorizing an opinionated docu-
ment, such as a product review, as reflecting an overall positive or negative attitude. It
accepts the entire document as the fundamental unit of information and assumes that the
document provides ideas about a single thing, such as a certain phone.
Sentence-level sentiment classification, on the other hand, classifies individual sen-
tences within a document. However, it’s vital to recognize that not every line is inherently
opinionated. Traditionally, academics frequently begin by identifying whether a statement
is opinionated or not, a procedure known as subjectivity categorization. Subsequently,
the detected opinionated statements are classed as reflecting favorable or negative views.
Alternatively, sentence-level sentiment classification can be framed as a
three-class classification problem, wherein sentences are classified as neutral, positive, or
negative.
In contrast to document-level and sentence-level sentiment analysis, aspect-level sen-
timent analysis, or aspect-based sentiment analysis, gives a more detailed perspective.
Its purpose is to extract and summarize people’s opinions stated regarding certain enti-
ties and their associated attributes or qualities. These attributes are typically referred to as aspects.
There are various neural network architectures customized for sentiment analysis tasks.
Imagine RNNs as literary detectives, analyzing the sequential relationships inside text.
These networks can absorb not just individual words but also the overall context, making
them effective in sentiment classification for extended paragraphs or documents. In the
field of sentiment analysis, CNNs also fulfil the function of pattern recognition specialists.
They scan text for significant aspects, much like an artist methodically brings out the com-
plexity in a picture. These features are then utilized for sentiment classification.
Transformers are the newest stars in the NLP universe. Picture them as verbal prodigies
capable of recognizing context, subtlety, and tone. They’ve revolutionized sentiment anal-
ysis by excelling in capturing nuanced emotional responses. The introduction of transfer
learning and pre-trained language models has caused a paradigm change in sentiment
analysis. Transfer learning enables models to apply information obtained from one area to
another. In sentiment analysis, this means that models may harness insights from huge
language datasets to increase their accuracy in interpreting emotions. Pre-trained language
models like BERT and GPT-3 are analogous to professors who’ve thoroughly studied
language. These models are equipped with a profound knowledge of language subtleties,
allowing them to succeed in sentiment analysis tasks without requiring considerable
training.
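As a hedged illustration of this transfer-learning workflow, the Hugging Face transformers library exposes pre-trained sentiment models behind a one-line pipeline; the default checkpoint it downloads and the scores it returns will vary, so the example texts and any values are indicative only.

from transformers import pipeline

# Downloads a pre-trained, fine-tuned sentiment model the first time it runs
classifier = pipeline("sentiment-analysis")
reviews = [
    "The plot was predictable, but the performances were outstanding.",
    "A dull, lifeless film that I could not finish.",
]
for review, result in zip(reviews, classifier(reviews)):
    print(result["label"], round(result["score"], 3), "-", review)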
10.4 ChatGPT
GPT models, founded on the transformer architecture originally introduced by Google, serve as the bedrock
for ChatGPT. However, the magic occurs when ChatGPT undergoes painstaking fine-
tuning for conversational purposes. This refining process employs a combination of super-
vised and reinforcement learning approaches to guarantee that ChatGPT excels in its
ability to engage in meaningful discussions.
By the start of 2023, ChatGPT had attained unparalleled levels of growth, swiftly
becoming one of the fastest-growing consumer software apps in history. It had accumu-
lated an incredible user base reaching 100 million people. However, among the joy and
success, questions began to appear. Some commentators worry about ChatGPT’s ability to
eclipse human intellect and its propensity to encourage plagiarism or promote disinforma-
tion. These concerns, while true, underline the enormous influence that ChatGPT has had
on the technical environment and the essential need for ethical considerations as AI con-
tinues to grow.
It is an exceptional phenomenon, signifying a dramatic shift in the continual march of
technological innovation. It stands as a significant member of the large language model
(LLM) family, lying under the umbrella of generative AI within the area of artificial intel-
ligence (AI). This distinction is notable since it holds the power to produce fresh material,
as opposed to just analyzing pre-existing data.
What’s particularly exciting is that ChatGPT is built for engagement by anybody, utiliz-
ing basic, ordinary language. The consequence is a discussion that flows in a wonderfully
humanlike manner. Within the pages of this chapter, you’ll acquire insight into where and
how to use ChatGPT, along with persuasive reasons why it’s worth your attention.
The GPT framework: ChatGPT is a notable member of the generative pre-trained trans-
former (GPT) family, a lineage of AI models recognized for their proficiency in natural
language processing (NLP). Understanding the GPT architecture is key to appreciating
ChatGPT’s potential.
• Pre-training and fine-tuning: At the core of ChatGPT’s growth lies the dynamic
combo of pre-training and fine-tuning. These two steps of model training provide
ChatGPT with the language knowledge and conversational talents that make it stand out.
• Key components: These components include tokenization, attention techniques, and
context management, all working in harmony to support smooth dialogues.
• Contextual understanding: Central to ChatGPT’s effectiveness is its ability to pre-
serve context during a conversation.
• Multi-turn conversations: ChatGPT isn’t confined to solitary answers. It thrives in
multi-turn dialogues, as context from earlier exchanges feeds its answers, providing a
cohesive and interesting discourse.
At the basis of ChatGPT’s strength is its language modeling skill. Imagine it as a lin-
guist, well-versed in the subtleties of human language. When you submit a message to
ChatGPT, it doesn’t just analyze the words; it deciphers the underlying patterns, syntax,
and semantics, much like a professional musician reads sheet music.
• Tokenization: Tokens, the building pieces of language, play a key function. ChatGPT
divides your phrases into these tokens, analogous to dissecting a jigsaw puzzle into
separate parts. These tokens are then processed, allowing ChatGPT to comprehend and
reply intelligently.
• Attention mechanism: To keep the discussion moving, ChatGPT utilizes a neat con-
cept known as the attention mechanism. It’s like a conductor arranging a symphony,
ensuring that each phrase receives the correct concentration and attention. This
approach enables ChatGPT to retain context and coherence throughout the discourse.
• Contextual understanding: What sets ChatGPT apart is its ability for contextual com-
prehension. It doesn’t just react based on a single sentence; it examines the entire dia-
logue. It’s analogous to an attentive buddy who recalls what you said previously,
ensuring that each response connects effortlessly with what’s been said.
10.5 Summary
The chapter provides an in-depth exploration of the role of deep learning in text mining
and text-related applications. It begins by highlighting the significant impact of deep
learning techniques on the field of text mining, emphasizing their transformative effects on
text data processing.
The chapter then delves into various deep learning models tailored for text processing,
including feed-forward neural networks, convolutional neural networks (CNNs), multi-
layer perceptrons (MLPs), recurrent neural networks (RNNs), long short-term memory
networks (LSTMs), and transformers. Each model is discussed in terms of its architecture
and application in text-related tasks, showcasing the versatility and adaptability of deep
learning approaches to handle different aspects of text data.
A specific focus is placed on deep learning’s role in sentiment analysis, where the abil-
ity to discern sentiment from text is crucial. The chapter demonstrates how deep learning
techniques have significantly improved sentiment analysis by providing more accurate and
nuanced results.
Furthermore, the chapter introduces ChatGPT, a conversational AI model, as a practical
application of deep learning in text-based interactions. ChatGPT’s use of deep learning
underscores its effectiveness in natural language understanding and generation, making it
relevant to chatbot development and humanlike text-based conversations.
In summary, this chapter serves as a comprehensive guide to understanding the impact
and versatility of deep learning in text mining and related fields. It covers a wide range of
deep learning models, their applications, and their implications for sentiment analysis and
conversational AI, highlighting the transformative power of these techniques in handling
and extracting valuable insights from textual data.
10.6 Exercises
Q1: Provide a few heuristics about how we can select the number of layers in a neural
network along with the number of neurons in each layer.
Q2: Explain how the softmax activation function works. Provide its comparison with the
Sigmoid function.
Q3: Explain how the input of a simple multi-layer neural network is different from the
recurrent neural network.
Q4: Suppose you want to process the sequential text; why would the RNN or the CNN
be a preferred choice over conventional neural networks?
Q5: Which activation function does the following figure represent? Discuss the advan-
tages and disadvantages of this function as well.
f(z) = 1 / (1 + e^(-z))
[Plot of f(z) against z for z from -8 to 8: the curve rises from 0 toward 1 and passes through 0.5 at z = 0.]
Q6: Provide the Python code to implement a simple recurrent neural network.
Q8: What are the major features of transformer models over other seq2seq models?
Q9: Provide the stepwise details of how the LSTM works once a sequence is provided
as input.
Recommended Reading
• Machine Learning for Algorithmic Trading: Predictive Models to Extract Signals from
Market and Alternative Data for Systematic Trading Strategies with Python by
Stefan Jansen
Publisher: Packt Publishing
Publication Year: 2020
This edition shows how to work with market data, fundamental and alternative data,
such as tick data, daily and minute bars, SEC filings, earnings call transcripts, financial
news, or satellite images to create trading signals. It demonstrates how to model finan-
cial characteristics or alpha factors that allow a machine learning model to predict
returns from the USA and global stock price data and ETFs. It also shows how to evalu-
ate the signal content of new features using Alphalens and SHAP values and includes a
new appendix containing over a hundred alpha factor examples.
• Web Data Mining with Python: Discover and Extract Information from the Web Using
Python by Dr. Ranjana Rajnish and Dr. Meenakshi Srivastava
Publisher: BPB Publications
Publication Year: 2023
This book begins by covering the basic concepts of Web mining and its classification.
It then explores the basics of Web data mining, its uses, and components, followed by
topics such as legal aspects related to data mining, data extraction and pre-processing,
dynamic Web site scraping, and CAPTCHA testing. It also introduces you to the con-
cept of opinion mining and Web architecture mining. In addition, it covers Web graph
mining, Web information mining, Web and hyperlink searching, Hyperlink-Triggered
Topic Searching, and partitioning algorithms used in Web mining. Finally, the book
will teach you different mining techniques to discover interesting usage patterns from
Web data.
• Mastering Text Mining with R: Extract and Recognize Your Text Data by Ashish Kumar
and Avinash Paul
Publisher: Packt Publishing
Publication Year: 2016
Text mining (or text data mining or text analysis) is the process of extracting useful,
high-quality information from text by modeling patterns and trends. R provides an
extensive text mining ecosystem through its numerous frameworks and packages.
• Mastering Social Media Mining with R by Sharan Kumar Ravindran and Vikram Garg
Publisher: Packt Publishing
Publication Year: 2016
As the number of users on the Internet increases, the content created has increased
dramatically, creating a need for information about the untapped gold mine that is
social media data. For computational statistics, R stands out from other languages by
providing readily available data extraction and transformation packages, making it
easier to perform your ETL tasks. In addition, its data visualization packages help users
better understand underlying data distributions, while its suite of “standard” statistical
packages simplifies data analysis.
11 Lexical Analysis and Parsing Using Deep Learning
Lexical analysis and parsing tasks capture the inherent properties of words and their rela-
tionships at a lower level. These tasks typically involve basic techniques, for example,
word segmentation, part-of-speech tagging, and parsing. One common feature of these
tasks is that their outputs are structured in nature. To address structured prediction tasks
like these, two main categories of methods are commonly used: graph-based methods and
transition-based methods. Graph-based methods directly differentiate output structures on
the basis of their inherent characteristics. Transition-based methods, on the other hand,
transform the process of constructing outputs into sequences of state transition actions, so
they can distinguish between different sequences of these actions. Neural network models
have been proved to be successful in handling structured prediction problems within both
the graph-based and transition-based frameworks. In the context of this chapter, we will
provide a comprehensive review of the application of deep learning techniques in lexical
analysis and parsing. In addition, we will also compare these modern neural network
approaches and traditional statistical methods for a better understanding of their strengths
and limitations.
11.1 Introduction to Lexical Analysis and Parsing Using Deep Learning
Words in natural language have various characteristics, for example, syntactic word categories (often referred to as parts of speech, POS), morphological attributes, and more. The
process of gathering this information is known as lexical analysis.
In languages like Chinese, Japanese, and Korean, where word boundaries are not sepa-
rated by whitespace, the task of lexical analysis becomes more complex. This is especially
difficult in the case of word segmentation, which divides a sequence of characters into indi-
vidual words.
Even in English, where whitespace typically serves as a reliable indicator of word
boundaries, there are instances where it doesn’t provide an exact solution. For example,
certain scenarios may require the treatment of combinations like “New York” as a single
entity. This specific challenge falls within the domain of named entity recognition (NER).
Additionally, in English, punctuation marks commonly appear adjacent to words.
Therefore, a critical challenge of lexical analysis is to determine whether we should
include punctuation marks as an integral part of a word or to treat them separately.
So, we can say that the scope of lexical analysis includes a large number of tasks, from
understanding the inherent attributes of individual words to addressing segmentation com-
plexities in languages like Chinese, Japanese, and Korean, as well as managing named
entity recognition and punctuation considerations in languages such as English.
In the context of languages like English, tokenization is often considered as a matter of
convention rather than a complex research challenge. Once we have obtained certain prop-
erties of words, our interest naturally shifts toward exploring the relationships that exist
among them. Parsing normally includes tasks like identification and labelling of words (or
sequences of words) that are connected to each other either compositionally or recursively.
In the context of parsing, two widely utilized methods are constituency (phrase-structure) parsing and dependency parsing.
Both of these methods are used to identify the complex relationships between the
words. Note that all of the abovementioned language processing tasks can be classified as
structured prediction problems, a term that is used within the domain of supervised
machine learning.
In conventional approaches, these tasks involve a large number of manually created
features derived by the humans themselves. These features are then provided as input to a
linear classifier model to predict a score for each class, with the results combined while
adhering to specific structural constraints. With the advent of deep learning approaches, a significant shift has occurred. Now, using paradigms like end-to-end
learning, we can eliminate the need for costly feature engineering. Deep learning models
can be used to uncover the implicit features that may be challenging for humans to design
manually.
In the context of natural language processing, deep learning has become a cornerstone, substantially enhancing the performance and efficiency of these language understanding tasks. Nevertheless, due to the pervasive ambiguity of natural languages, these tasks remain very challenging. In fact, some ambiguities may even escape the notice of human observers, which clearly shows how complicated the task of natural language processing is. Before proceeding to the details, let's first consider the example of lexical anal-
ysis in compiler design. Fig. 11.1 shows the input and output of a lexical analyzer in the
context of compiler.
[Fig. 11.1: the lexical analyzer reads characters from the input, may push extra characters back, and supplies tokens to the syntax analyzer on request.]
The initial phase in the compilation process is lexical analysis, which takes modified
source code generated by language preprocessors, often structured as sentences. During
this phase, the lexical analyzer dissects these syntactical structures into a sequence of
tokens while simultaneously removing any unnecessary whitespace or comments from the
source code. If the lexical analyzer encounters an invalid token, it produces an error. This
component closely interacts with the syntax analyzer, as it reads character streams from
the source code, verifies the legality of tokens, and provides the required data to the syntax
analyzer upon request.
Now, let us discuss lexemes. A lexeme is a sequence of alphanumeric characters that forms a token. To be recognized as a valid token, a lexeme must adhere to rules established by the grammar, which are often described by patterns. These patterns, in turn,
are defined using regular expressions.
In programming languages, tokens consist of a variety of elements, including key-
words, constants, identifiers, strings, numbers, operators, and punctuation symbols.
For instance, in the C programming language, consider the line where a variable is
declared.
int x = 25;
This line contains the following tokens: int (keyword), x (identifier), = (operator), 25 (constant), and ; (punctuation symbol).
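To connect lexemes, patterns, and tokens, here is a small, hedged Python sketch of a lexer for declarations of this kind; the token names and regular expressions are illustrative choices, not a complete C lexer.

import re

# Each token type is defined by a pattern (a regular expression)
TOKEN_SPEC = [
    ("KEYWORD",    r"\bint\b|\bfloat\b|\bchar\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("CONSTANT",   r"\d+"),
    ("OPERATOR",   r"="),
    ("SYMBOL",     r";"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(code):
    for match in MASTER.finditer(code):
        if match.lastgroup != "SKIP":            # drop whitespace
            yield match.lastgroup, match.group()

print(list(tokenize("int x = 25;")))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'x'), ('OPERATOR', '='), ('CONSTANT', '25'), ('SYMBOL', ';')]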
Now, let us explain some of the terms that are used in language theory:
• Alphabets: Any limited set of symbols, such as {0,1}, can be considered a set of binary
characters. Likewise, a set comprising {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} represents
hexadecimal characters. Furthermore, a set encompassing {a-z, A-Z} corresponds to
the characters used in the English language.
• Strings: A string is defined as a finite sequence of alphabetic characters. The length of
a string is determined by the total count of its alphabetic characters. For instance, in the
string “TextualData”, the length is 11, denoted as |TextualData| = 11. When a string
contains no alphabetic characters, signifying a length of zero, it is referred to as an
empty string, symbolized by the Greek letter ε (epsilon).
• Language: A language is defined as a finite collection of strings derived from a finite set
of alphabetic characters. Computer languages are regarded as such sets, on which mathematical set operations can be performed.
Regular expressions are a powerful tool for expressing finite languages by defining pat-
terns of finite sequences of symbols. These patterns adhere to a grammar known as regular
grammar, and the languages defined by such grammars are referred to as regular languages.
Regular expressions play a crucial role in specifying patterns, with each pattern corre-
sponding to a set of strings. Consequently, regular expressions effectively label sets of
strings, making them an important tool for describing programming language tokens.
Regular languages, which are defined by regular expressions, are not only easy to understand but also easy to implement. Furthermore, there exist algebraic
laws governing regular expressions that enable manipulation of these expressions into
various other equivalent forms.
Some of the operations that can be performed on languages are union, concatenation, and Kleene closure (repetition).
Now, let us discuss finite automata. A finite automaton, often referred to as a
finite state machine, operates by taking a sequence of symbols as input and transitions
between different states based on those symbols. It serves as a recognizer for regular
expressions. When a regular expression string is provided as input to a finite automaton,
the automaton sets its state for each literal within the input. If the input string is success-
fully processed, and the automaton reaches a designated final state, it is considered
accepted. In other words, the input string is recognized as a valid token of the language
under consideration.
The mathematical model of finite automata consists of various components as dis-
cussed below:
• States (Q): Each finite automaton consists of a finite set of states. These states are used
to represent different conditions or configurations of a machine.
• Alphabet (Σ): In finite automaton, the alphabet consists of a finite set of symbols. The
input strings are formed using these symbols. We can call these symbols as the building
blocks of a language.
• Transition Function (δ): The transition function is used to specify how the automaton
changes its states after reading the symbols from the input. The current state and the
input symbols are both used for transition to a new state.
• Start State (q0): Each automaton begins processing from the initial state. This state is
called the start state.
• Accept States (F): Also called the final states, these are states where the strings are
considered to be accepted. There may be a single or more than one final state. Once the
automaton reaches any of these states after taking inputs, the string is considered to be
accepted, otherwise rejected.
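The following hedged sketch encodes these five components in plain Python for a toy automaton that accepts binary strings ending in 1; the state names and language are invented purely for illustration.

# Q: states, Σ: alphabet, δ: transition function, q0: start state, F: accept states
states = {"q0", "q1"}
alphabet = {"0", "1"}
delta = {("q0", "0"): "q0", ("q0", "1"): "q1",
         ("q1", "0"): "q0", ("q1", "1"): "q1"}
start, accept = "q0", {"q1"}

def accepts(string):
    state = start
    for symbol in string:
        if symbol not in alphabet:
            return False                 # invalid symbol: reject
        state = delta[(state, symbol)]   # follow the transition function
    return state in accept               # accept only if we end in a final state

print(accepts("10101"))   # True  (ends in 1)
print(accepts("1010"))    # False (ends in 0)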
A typical lexical analysis process proceeds through the following stages:
• Input preprocessing: In this phase, we prepare the input text for performing lexical
analysis. Different tasks are performed at this stage including elimination of comments,
whitespace, and extraneous characters.
• Tokenization: In this step, we break the input text into a sequence of tokens. This is
done by matching the characters in the input text against predefined patterns or regular
expressions that define various token types.
• Token classification: In this stage, the lexer identifies each token and categorizes it on
the basis of its type. For example, in a programming language, tokens like keywords,
identifiers, operators, and punctuation symbols are classified as distinct token types.
• Token validation: During this phase, the lexer checks the validity of each token. This
verification is done according to the rules of the programming language. For example,
it may check if the variable name is a valid identifier or if an operator follows the cor-
rect syntax.
• Output generation: This is the last stage. In this stage, the lexer produces the output of
the lexical analysis process. Normally, the output is in the form of a token list. This list
of tokens can then be forwarded to the next stages of compilation or interpretation for
further processing.
Now, we will provide generic details of the lexical analysis and parsing tasks.
Word segmentation is the first task of lexical analysis. We have already discussed word segmentation above; here, let's look at the common challenges of this task:
• Ambiguity: Words in many languages can have multiple meanings depending on their
context. Accurately determining the boundaries between words can be challenging
when dealing with homographs or homophones.
• Lack of spaces: In languages like Chinese and Japanese, there are no spaces between
words in the written text. Identifying word boundaries solely based on characters is
difficult.
• Compound words: Some languages, such as German, create compound words by com-
bining multiple smaller words. Determining where one compound word ends and
another begins can be challenging.
• Agglutinative languages: Languages like Turkish and Korean are agglutinative, where
affixes are added to words to convey meaning. This can result in long, complex strings
of characters that need to be segmented correctly.
• Abbreviations and acronyms: Recognizing and segmenting abbreviations, acronyms, or
initialisms can be tricky, as they may appear as single words or be broken down into
their constituent parts.
• Code-switching: In multilingual environments, people often switch between languages
within a single sentence or text. Segmenting words accurately becomes challenging in
such situations.
• Named entities: Identifying and correctly segmenting named entities (e.g., names of
people, places, organizations) is crucial for many applications but can be challenging
due to their unique structures.
• Noisy text: Text from sources like social media or optical character recognition (OCR)
may contain errors, misspellings, or non-standard language usage, making segmenta-
tion more challenging.
Normally, phrase structures follow a certain alignment, and this alignment or order is derived from context-free grammars (CFGs). In such a derivation, any phrase containing more than a single word is composed of a series of non-overlapping "child" phrases or individual words, arranged so that together they make up the parent phrase.
We can use a dependency tree for such alignments. A dependency tree is one of the
structures that are commonly used in NLP. A dependency parse is a directed tree with
words serving as vertices. The edges, also referred to as arcs, represent syntactic relation-
ships between pairs of words and can contain the labels that represent the specific type of
relation. An important feature is that a single word functions as the root
of the tree. Each word, other than the root, maintains a single incoming edge originating
from its syntactic head.
For instance, Fig. 11.2 shows both the constituent and dependency trees for the sen-
tence “Economic news had little effect on financial markets.” We can classify the
dependency parsing into two types: projective parsing, where there are no instances of arcs
crossing each other in the trees, and non-projective parsing, characterized by the presence
of such crossing arcs. Projective structures are mainly used in English and Chinese.
One significant advantage of using dependency structures over more detailed constitu-
ent structures lies in their enhanced clarity. Consider Fig. 11.2 as an example; within the
constituent structure, discerning that “the news” serves as the subject of “had” can be chal-
lenging, whereas the dependency structure clearly highlights this relationship between the
two words. Additionally, dependency structures prove to be more accessible for annotators
who possess strong domain knowledge but lack an in-depth understanding of linguistics.
Syntactic parsing provides valuable structural insights that find practical applications in
various domains.
Let us discuss the dependency parse tree in more detail, along with its components.
A dependency parse tree, often referred to as a dependency tree, is a data structure used
in natural language processing (NLP) and computational linguistics. It represents the
grammatical structure and relationships between words in a sentence or a text.
It is a linguistic structure that graphically illustrates the syntactic relationships between
words in a sentence. In this tree, each word is represented as a node, and the relationship
between words is represented by directed edges called arcs. These arcs connect the nodes.
The arcs represent the grammatical dependencies between words, where one word acts as the head and the other as its dependent.
The following are the key elements of a dependency tree:
• Nodes (vertices): Each word in the sentence represents a node in the tree.
• Edges (arcs): The directed edges from one node to another are used to represent the
relationships. The edges are labelled as well. These labels can show the type of depen-
dency, e.g., subject, object, modifier, etc.
• Root node: Each dependency tree has a single node called the root node. This node does
not have any incoming arcs. It represents the main governing element of the sentence.
Dependency parse trees play an important role in various NLP tasks including syntactic
analysis, semantic role labelling, and information extraction. The reason behind this is that
dependency trees provide a structured representation of the grammatical and semantic
relationships between words in a text.
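A short, hedged sketch using spaCy (which is also used later in this chapter) prints the head and dependency label of every word in the example sentence from Fig. 11.2; the exact labels produced depend on the loaded model.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Economic news had little effect on financial markets.")
for token in doc:
    # Each word points to its syntactic head; the root points to itself
    print(f"{token.text:10s} --{token.dep_:>6s}--> {token.head.text}")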
Structured prediction includes three common tasks, i.e., sequence segmentation, sequence
labelling, and parsing.
In sequence segmentation, we divide a continuous sequence of text or speech into
meaningful and discrete units, e.g., words, phrases, sentences, or any other linguistically
relevant segments, depending on the specific task and language.
Sequence segmentation is one of the fundamental preprocessing tasks of NLP. It helps
machines understand and analyze text or speech by breaking it down into small and man-
ageable components. Common types of sequence segmentation in NLP include word segmentation (tokenization) and sentence segmentation.
The choice of segmentation type and method depends on the specific NLP task and the
characteristics of the language being processed. Accurate sequence segmentation is criti-
cal for enabling subsequent NLP tasks to operate effectively and extract meaningful infor-
mation from text or speech data. The following code shows the word segmentation task:
import nltk
nltk.download('punkt')
text = "This is a sample sentence. It contains multiple words and punctuation marks."
words = nltk.word_tokenize(text)
sentences = nltk.sent_tokenize(text)
print("Words:", words)
print("Sentences:", sentences)
In this code:
We import the nltk library, which provides various NLP tools and resources. After this,
we download the “punkt” dataset, which contains pre-trained models and data for tokeni-
zation. Note that you only need to download this data once. Then, we have defined a
sample text. We then used nltk.word_tokenize() to tokenize the text into words. After this,
we used nltk.sent_tokenize() to tokenize the text into sentences. Finally, we print the
tokenized words and sentences. The output will be:
Words: [‘This’, ‘is’, ‘a’, ‘sample’, ‘sentence’, ‘.’, ‘It’, ‘contains’, ‘multiple’, ‘words’,
‘and’, ‘punctuation’, ‘marks’, ‘.’]
Sentences: [‘This is a sample sentence.’, ‘It contains multiple words and punctuation
marks.’]
The words are extracted and separated into individual tokens. The sentences are seg-
mented into separate sentences based on periods (full stops) and other sentence-ending
punctuation marks.
The next task is sequence labelling. Sequence labelling, often referred to as tagging,
involves the task of assigning an appropriate label or tag to each element within an input
sequence. To put it formally, if we have an input sequence represented as x = x1,…, xn, the
corresponding output tag sequence would be y = y1,…, yn, where each input xi is associated
with a single output tag yi.
One of the most classic and well-known examples of sequence labelling problems is
part-of-speech (POS) tagging. In this case, xi represents a word within a sentence, and yi
represents its respective POS tag.
In addition to POS tagging, numerous natural language processing (NLP) tasks can be
considered as sequence labelling problems. One such task is named entity recognition,
which involves identifying and categorizing named entities in text into predefined catego-
ries like individuals’ names, geographical locations, and organizations. In this scenario,
the input remains a sentence, and the output comprises the sentence with tags denoting
entity type. Typically, three entity types are considered: PER (persons), LOC (locations),
and ORG (organizations). For a given input sentence, the goal is to mark the boundaries of
these entities appropriately.
Consider the following sentence:
John works in NASA. He does exercise daily in stadium.
The following code identifies the two named entities from the given sentence:
import spacy
nlp = spacy.load("en_core_web_sm")
sentence = "John works in NASA. He does exercise daily in stadium."
doc = nlp(sentence)
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")
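With spaCy's small English model, the output would typically look something like the following (the exact labels can vary with the model version):
Entity: John, Label: PERSON
Entity: NASA, Label: ORG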
In the context of machine learning algorithms, after completing the mapping process on
our training examples, we can proceed to train a tagging model using these instances.
When the new sentence is presented, the model allows us to make predictions for a
sequence of tags. Subsequently, it becomes a straightforward task to identify the entities
within the tagged sequence.
The challenges associated with sequence segmentation problems can be reduced by
considering them as sequence tagging problems through the careful design of appropriate
tag sets. To show this, consider Chinese word segmentation, where each character in a
sentence can be annotated with either a “B” tag (indicating the beginning of a word) or an
“I” tag (signifying the character is inside a word).
The primary objective behind this transformation from a sequence segmentation prob-
lem to a sequence labelling problem is to simplify both modeling and decoding processes,
making the latter considerably more manageable.
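To make the tag-set idea concrete, the short sketch below recovers a word segmentation from a B/I-tagged character sequence. The example sentence and its tags are illustrative only, not the output of a real segmenter.

# Recover words from a B/I-tagged character sequence (toy example)
chars = list("我喜欢自然语言处理")
tags = ["B", "B", "I", "B", "I", "B", "I", "B", "I"]

words, current = [], ""
for ch, tag in zip(chars, tags):
    if tag == "B" and current:      # a new word begins; flush the previous one
        words.append(current)
        current = ch
    else:                           # an "I" tag (or the very first "B") extends the current word
        current += ch
words.append(current)
print(words)   # ['我', '喜欢', '自然', '语言', '处理']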
As far as the task of parsing is concerned, parsing is the term used to describe a wide
range of algorithms that convert sentences into syntactic structures.
There are two prevalent representations for syntactic parsing: phrase-structure parsing
(also known as constituency parsing) and dependency parsing.
Constituency parsing typically relies on a grammar to generate syntactic structures. In
essence, a grammar comprises a collection of rules, each corresponding to a possible deri-
vation step under specific conditions. Context-free grammars (CFGs) are commonly
employed in constituency parsing. The parsing process involves selecting the highest-
scoring derivation based on the grammar.
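As a concrete illustration of grammar-driven constituency parsing, the sketch below uses NLTK's chart parser with a toy context-free grammar written for this example; it is not the grammar of any real treebank.

import nltk

# A toy CFG for a handful of words
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | N
VP -> V NP
Det -> 'the'
N -> 'news' | 'markets'
V -> 'moved'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the news moved the markets".split()):
    print(tree)   # prints the bracketed constituency tree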
Presently, two dominant algorithms for dependency parsing are graph-based and
transition-based methods. Graph-based dependency parsing can be defined as the task of
identifying the maximum spanning tree (MST) within a directed graph composed of ver-
tices (words) and edges (dependency arcs linking two words). Conversely, a transition-
based dependency parsing algorithm can be formalized as a transition system that includes
a set of states and a set of transition actions. This transition system commences in an initial
state and iteratively progresses through transitions until it reaches a terminal state.
A common challenge shared by both graph-based and transition-based dependency
parsing is the calculation of scores assigned to dependency arcs or transition actions.
Conventional lexical analyzer techniques are normally based on regular expressions and
finite automata. These techniques have been widely used in programming language com-
pilers and text-processing tasks for many years. Following are some of the advantages of
these techniques:
• Optimization: NFAs and DFAs can be optimized using various techniques, for example, by reducing the number of states or transitions. These optimizations can significantly improve the performance of lexical
analyzers.
• Well-established: Conventional lexical analysis techniques have been used in compilers
and text processing tools for decades. They are well understood and extensively docu-
mented and have a proven track record of reliability and efficiency.
• Parallelism: Lexical analyzers based on DFAs and NFAs can be parallelized, which is
advantageous for exploiting multi-core processors and accelerating tokenization in
modern computing environments.
Conventional lexical analyzer techniques also have notable disadvantages, chiefly the need for manually written and maintained rules, limited context sensitivity, and the absence of any semantic understanding of the input.
While these disadvantages exist, conventional lexical analyzer techniques remain valu-
able in many applications where speed, simplicity, and predictability are essential.
However, for more complex languages or tasks that require semantic understanding and
context-aware analysis, advanced techniques like deep learning-based lexical analyzers
may be preferred.
11.2 Conventional Lexical Analysis Case Study
Let us develop a case study for creating a lexical analyzer using the conventional tech-
nique. Developing a lexical analyzer, also known as a lexer or scanner, is an essential
component of a compiler or interpreter. In this case study, we will develop a lexical ana-
lyzer for a simple programming language. We’ll walk through the process step by step.
Step 1: Define the Language
Define the grammar and syntax rules of the simple programming language. For this
case study, we will stick with the following components:
• Keywords: (if|else|while|int|return)
• Identifiers: [a-zA-Z_][a-zA-Z0-9_]*
• Integer Literals: [0-9]+
• Operators: [+\-*/=<>!]+
• Delimiters: [;(){}]
Step 2: Define the Token Patterns
Translate the components above into regular expression patterns, ordered so that keywords are matched before general identifiers.
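A possible set of patterns is sketched below; the exact pattern list and the token-type names are assumptions derived from the components defined in Step 1. The tokenizer in Step 3 expects a patterns list of (regular expression, token type) pairs in matching order.

import re

# Token patterns tried in order; keywords precede identifiers so that
# reserved words are not tokenized as identifiers.
patterns = [
    (r'\b(if|else|while|int|return)\b', 'KEYWORD'),
    (r'[a-zA-Z_][a-zA-Z0-9_]*', 'IDENTIFIER'),
    (r'[0-9]+', 'INTEGER'),
    (r'[+\-*/=<>!]+', 'OPERATOR'),
    (r'[;(){}]', 'DELIMITER'),
]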
Step 3: Tokenization
Now, let’s write a Python script to tokenize the input code. We will use the re module
to match regular expressions and extract tokens. The following is a Python code for
this step:
def tokenize(input_code):
    tokens = []
    input_code = input_code.strip()          # drop leading/trailing whitespace
    while input_code:
        matched = False
        for pattern, token_type in patterns:
            match = re.match(pattern, input_code)
            if match:
                matched = True
                value = match.group(0)
                tokens.append((token_type, value))
                # consume the matched text and any whitespace that follows it
                input_code = input_code[len(value):].strip()
                break
        if not matched:
            raise Exception(f"Lexer Error: Unrecognized character: {input_code[0]}")
    return tokens
Step 4: Testing
Test your lexer with various input programs to ensure it correctly identifies and
tokenizes the code. Also, handle error cases gracefully and provide meaningful error
messages.
input_code = """
int main() {
if (x > 0) return 1;
else return 0;
}
"""
try:
    tokens = tokenize(input_code)
    for token in tokens:
        print(token)
except Exception as e:
    print(e)
Step 5: Integration
In this step, the lexer is integrated with the rest of your compiler or interpreter. The
lexer produces a stream of tokens. This stream is then used by the parser to build an abstract
syntax tree (AST).
Step 6: Handling Advanced Features
If the language requires advanced features like string literals, comments, or complex
syntax rules, you will have to extend the lexer. This can be done by adding new regular
expressions and token definitions.
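For instance, string literals and single-line comments could be supported by prepending two more entries to the hypothetical patterns list from Step 2; the regular expressions below are deliberately simplified (no escape sequences or block comments).

patterns = [
    (r'//[^\n]*', 'COMMENT'),        # single-line comment
    (r'"[^"\n]*"', 'STRING'),        # simple string literal without escapes
] + patterns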
In this case study, we built a lexer for a simple programming language. For real-world languages, the lexer implementation is considerably more complex, but the principles remain the same.

11.3 Structured Prediction Methods

In this section, we will discuss two prominent approaches used for structured prediction: graph-
based and transition-based methods. These methods can normally be used as the founda-
tion for many advanced deep-learning algorithms designed for structured prediction tasks.
The prediction we aim to derive from the classifier can be expressed as a conditional
probability. We can employ the naive Bayes algorithm to break down this probability:
$$P(y \mid X) = \frac{P(y)\, P(X \mid y)}{P(X)}$$

Under the naive independence assumption,

$$P(x_i \mid y, x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = P(x_i \mid y)$$

we get:

$$P(y \mid X) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)}$$
The logistic regression classifier is built upon the logistic function, defined as follows:

$$f(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{1 + e^{x}}$$
In logistic regression, to learn the decision boundary separating the two classes, the classifier learns a set of weights (theta values), and the conditional probability is expressed as follows:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\!\left(\theta_{0,y} + \sum_{i=1}^{n} \theta_{i,y}\, x_i\right)$$

$$\text{where } Z(x) = \sum_{y'} \exp\!\left(\theta_{0,y'} + \sum_{i=1}^{n} \theta_{i,y'}\, x_i\right)$$
$$\hat{y} = \arg\max_{y} P(y \mid x)$$
When using the logistic regression classifier, it becomes apparent that we are maximiz-
ing the conditional probability. By applying Bayes’ rule, we can derive the generative
classifier utilized in the naive Bayes classifier.
$$\hat{y} = \arg\max_{y} P(y \mid x) = \arg\max_{y} \frac{P(x \mid y)\, P(y)}{P(x)} = \arg\max_{y} P(x \mid y)\, P(y)$$
This outcome aligns with the generative classifier derived previously in the naive Bayes
algorithm.
Furthermore, it’s worth noting that P(x | y) * P(y) is equivalent to P(x, y), the joint dis-
tribution of x and y. This observation reinforces the earlier definition of generative classi-
fiers. By modeling the joint probability distribution between classes, the generative model
can generate input points X based on the label Y and the joint probability distribution.
Conversely, the discriminative model, by learning the conditional probability distribution,
has acquired the ability to identify the decision boundary that separates data points.
Consequently, when given an input point, it can employ the conditional probability distri-
bution to determine its class.
Now, let’s explore how these definitions relate to conditional random fields (CRFs).
CRFs belong to the discriminative model category, and their fundamental concept involves
applying logistic regression to sequential inputs. If you have familiarity with hidden
Markov models (HMMs), you’ll notice some similarities between CRFs and HMMs, par-
ticularly in their usage with sequential inputs. HMMs use a transition matrix and input
vectors to learn the emission matrix and are conceptually similar to naive Bayes. HMMs
are categorized as generative models.
Now, let us start our discussion on conditional random fields. As we demonstrated in
the preceding section, we formulate the conditional distribution as follows:
$$\hat{y} = \arg\max_{y} P(y \mid x)$$
In the context of CRFs, our input data is sequential, requiring us to consider the preced-
ing context when predicting a data point. To encapsulate this behavior, we employ feature
functions that take several inputs: the collection of input vectors X, the position i of the data point being predicted, the label of the previous data point, and the label of the current data point.
The role of the feature function is to convey a specific attribute or characteristic of the
sequence represented by the data point. For example, when employing CRFs for parts-of-
speech tagging, a feature function may be defined as follows:
f(X, i, L{i − 1}, L{i}) = 1 when L{i − 1} represents a noun and L{i} represents a verb;
otherwise, it equals 0.
Similarly, f(X, i, L{i − 1}, L{i}) = 1 when L{i − 1} signifies a verb and L{i} signifies
an adverb; otherwise, it equals 0.
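A minimal sketch of such indicator feature functions in Python is shown below; the tag names NOUN, VERB, and ADV are placeholders for whatever tag set is used.

def f_noun_then_verb(X, i, prev_label, label):
    # Fires when the previous word is tagged as a noun and the current word as a verb
    return 1 if prev_label == "NOUN" and label == "VERB" else 0

def f_verb_then_adverb(X, i, prev_label, label):
    # Fires when the previous word is tagged as a verb and the current word as an adverb
    return 1 if prev_label == "VERB" and label == "ADV" else 0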
Each feature function relies on the labels of both the previous word and the current word, producing either a 0 or a 1. To construct the conditional random field, the subsequent step involves assigning a set of weights (represented by lambda values) to each feature function, which the algorithm will then learn:

$$P(y \mid X, \lambda) = \frac{1}{Z(X)} \exp\!\left(\sum_{i=1}^{n} \sum_{j} \lambda_j\, f_j(X, i, y_{i-1}, y_i)\right)$$

$$\text{where } Z(X) = \sum_{y'} \exp\!\left(\sum_{i=1}^{n} \sum_{j} \lambda_j\, f_j(X, i, y'_{i-1}, y'_i)\right)$$
When applying maximum likelihood to the negative log function, we will seek the
argmin (as minimizing the negative yields the maximum likelihood estimate). To find this
minimum, we need to calculate the partial derivative with respect to lambda. This results in:
$$\frac{\partial L(X, y, \lambda)}{\partial \lambda_j} = \frac{1}{m}\left[\sum_{k=1}^{m} F_j\!\left(y^{(k)}, x^{(k)}\right) - \sum_{k=1}^{m} \sum_{y} p\!\left(y \mid x^{(k)}, \lambda\right) F_j\!\left(y, x^{(k)}\right)\right]$$

$$\text{where } F_j\!\left(y, x^{(k)}\right) = \sum_{i=1}^{n} f_j\!\left(X, i, y_{i-1}, y_i\right)$$
The partial derivatives are used in the gradient descent process, in which the parameter values are updated in small steps until they converge. The gradient descent update equation for the CRF is as follows, where $\alpha$ is the learning rate:

$$\lambda_j = \lambda_j + \alpha\left[\sum_{k=1}^{m} F_j\!\left(y^{(k)}, x^{(k)}\right) - \sum_{k=1}^{m} \sum_{y} p\!\left(y \mid x^{(k)}, \lambda\right) F_j\!\left(y, x^{(k)}\right)\right]$$
In summary, the utilization of conditional random fields involves several steps. First,
we define the required feature functions. Then, we initialize the weights, typically with
random values. Subsequently, we apply gradient descent iteratively until the parameter
values (in this case, lambda) reach convergence. It’s worth noting that CRFs share simi-
larities with logistic regression, as both rely on conditional probability distributions, but
CRFs expand upon the algorithm by incorporating feature functions tailored for sequen-
tial inputs.
From the preceding sections, it becomes evident how conditional random fields (CRFs)
differ from hidden Markov models (HMMs). Despite both being utilized for modeling
sequential data, these are two different algorithms.
Hidden Markov models adopt a generative approach and generate output by modeling
the joint probability distribution. Conversely, conditional random fields take a discrimina-
tive approach and model the conditional probability distribution. CRFs do not rely on the
independence assumption, which implies that labels are independent of each other, thereby
avoiding label bias.
One way to view this distinction is to consider hidden Markov models as a highly spe-
cific instance of conditional random fields, where constant transition probabilities are
employed. It’s important to note that HMMs are rooted in the principles of naive Bayes,
which can be derived from logistic regression, which in turn serves as the basis for CRFs.
In the context of applications, conditional random fields (CRFs) find extensive use in natu-
ral language processing, offering numerous applications in this field. One application is
parts-of-speech tagging, where the parts of speech in a sentence are determined, a task
heavily reliant on the context of previous words. By employing feature functions that capi-
talize on this contextual information, CRFs can be used to learn how to discern the corre-
sponding parts of speech for each word in a sentence. Another similar application is named
entity recognition, involving the extraction of proper nouns from sentences. CRFs prove
valuable in predicting sequences in which multiple variables rely on one another. Additional
applications span diverse domains, including image-based parts recognition and gene
prediction.
The following code shows the application of CRF for POS tagging:
import nltk
import sklearn_crfsuite
from sklearn_crfsuite import metrics
from sklearn.model_selection import train_test_split
nltk.download('treebank')
from nltk.corpus import treebank
sentences = treebank.tagged_sents()
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for word, label in sent]
We import the necessary libraries and download a sample dataset from NLTK (Treebank
tagged sentences). We define a feature extraction function (word2features) and functions
to convert sentences into features and labels (sent2features and sent2labels). We prepare
the data, splitting it into training and testing sets. We create a CRF model using sklearn-
crfsuite and train it on the training data. We make predictions on the test data and evaluate
the model’s performance using classification metrics.
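The word2features function and the training steps described above are not reproduced in the listing, so the following is a hedged sketch of what they might look like; the particular feature template, split ratio, and training options are illustrative assumptions.

# Illustrative feature template for each word
def word2features(sent, i):
    word = sent[i][0]
    return {
        'word.lower()': word.lower(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'suffix3': word[-3:],
    }

X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
crf.fit(X_train, y_train)

y_pred = crf.predict(X_test)
print(metrics.flat_f1_score(y_test, y_pred, average='weighted'))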
After discussing CRFs, let us discuss graph-based dependency parsing.
Let’s consider a directed graph consisting of vertices denoted as V and edges as E. The
score of an edge from vertex u to vertex v can be represented as s(u, v). A directed span-
ning tree is a subset of edges, denoted as E′, which is a subset of E. In this tree, each vertex,
except the root vertex (which has no incoming arc), has precisely one incoming arc from E′. Furthermore,
this tree does not contain any cycles. We denote the set of all possible directed spanning
trees for E as T(E). The total score of a spanning tree, represented as E′, is calculated as
the sum of the scores of edges present in E′.
The maximum spanning tree (MST) is defined as follows:
$$\max_{E' \in T(E)} \sum_{(u,v) \in E'} s(u, v)$$
The decoding problem for (unlabelled) dependency parsing can be simplified by map-
ping the problem to the maximum spanning tree problem. In this mapping, words within a
sentence are considered as vertices, and the dependency arcs represent the edges. Typically,
vertex u is referred to as a head (or parent) and vertex v as a modifier (or child). Expanding
this approach to labelled dependency parsing is a natural extension. If there are multiple
edges connecting u to v, each associated with a specific label, the same algorithm remains
applicable.
Here, we introduce a fundamental graph-based approach known as the first-order
model. The first-order graph-based model operates on a strong independence assumption,
where the arcs in a tree are considered independent of each other. This means that the score
of an arc is not affected by the presence of other arcs. This approach is normally called the
arc-factorization method.
The main challenge here is to determine the score, denoted as s(u, v), for each potential
arc when an input sentence is provided. Traditionally, discriminative models have been
employed for this purpose, representing an arc using a feature vector generated by a fea-
ture function, denoted as f(u, v). Consequently, the score of the arc can be computed as the
dot product between a feature weight vector w and f, expressed as:
$$s(u, v) = w \cdot f(u, v)$$

The feature function f(u, v) typically encodes the following information (a toy numeric example of this scoring follows the list):
• Surface form, lemma, part-of-speech (POS), and any shape, spelling, or morphological
features for each word
• The words involved, which encompass both the head and the modifier, as well as con-
text words on both sides of the head and modifier, along with words located between
the head and modifier
• Characteristics of the arc, including its length (the number of words between the head
and modifier), direction, and, if applicable, the syntactic relation type when parsing is
labelled
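As a toy numeric illustration of the arc score s(u, v) = w · f(u, v), the sketch below uses made-up feature and weight values; real parsers use feature vectors with millions of dimensions or dense neural representations.

import numpy as np

f_uv = np.array([1.0, 0.0, 1.0, 0.5])   # hypothetical features of the arc (u, v)
w = np.array([0.8, -0.2, 1.1, 0.3])     # hypothetical feature weights
score = np.dot(w, f_uv)
print(score)   # 2.05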
A transition system operates over a set of states (S), which may be infinite. This system includes a starting state (s0 ∈ S), a set of terminal
states (St ∈ S), and a collection of transition actions (T). The transition process begins at
the starting state (s0) and continues iteratively until a terminal state is reached.
For instance, Fig. 11.3 illustrates a straightforward finite state automata, where the
starting state is denoted as s0 and the terminal states encompass s6, s7, s8, s14, s15, s16,
s17, and s18. The primary objective of a transition-based structured prediction model is to
differentiate sequences of transition actions that lead to these terminal states. This distinc-
tion allows the model to assign higher scores to sequences that correspond to the correct
output state.
The arc-standard transition system is commonly employed for projective dependency parsing. In this system, each state corresponds to a triple (σ, β, A), where σ is a stack of partially processed words, β is a buffer of words still to be read, and A is the set of dependency arcs constructed so far; a toy simulation of these transitions is sketched below.
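The following toy simulation sketches arc-standard transitions on a fragment of the chapter's example sentence. The action sequence is a hand-written oracle for illustration, not the output of a trained classifier, and arc labels are omitted.

sentence = ["ROOT", "Economic", "news", "had", "little", "effect"]

def step(state, action):
    stack, buffer, arcs = state
    if action == "SHIFT":
        return stack + [buffer[0]], buffer[1:], arcs
    if action == "LEFT-ARC":    # head = top of stack, dependent = second item
        return stack[:-2] + [stack[-1]], buffer, arcs + [(stack[-1], stack[-2])]
    if action == "RIGHT-ARC":   # head = second item, dependent = top of stack
        return stack[:-1], buffer, arcs + [(stack[-2], stack[-1])]

state = (["ROOT"], sentence[1:], [])
for a in ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "LEFT-ARC",
          "SHIFT", "SHIFT", "LEFT-ARC", "RIGHT-ARC", "RIGHT-ARC"]:
    state = step(state, a)
print(state[2])   # the (head, dependent) arcs built by the oracle sequence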
In a greedy parser, the decision regarding what action to take in a state (s ∈ S) is deter-
mined by a classifier. Training this classifier involves considering gold-standard trees from
the training section of a treebank. This allows us to derive canonical gold-standard
sequences, often referred to as oracle sequences, consisting of transition state and
action pairs.
The term “feature engineering” is commonly used to describe the process of designing
features for various linguistic structured prediction tasks. It reflects the linguistic expertise
and domain knowledge required in this endeavor.
In the field of natural language processing (NLP), researchers often opt for a strategy
of including as many features as possible during the learning process. This approach
allows the parameter estimation method to determine which features contribute to the
model’s performance and which ones should be disregarded. This tendency stems from the
diverse and complex nature of linguistic phenomena and the continuous increase in com-
putational resources available to researchers. It is generally agreed upon that incorporating
more features is beneficial in NLP models, particularly in frameworks like log-linear mod-
els that can efficiently integrate a multitude of features.
To mitigate error propagation in greedy transition-based algorithms, beam search
decoding with global normalization is typically employed. Additionally, large margin
training with early updates is utilized for learning from inexact search results. These tech-
niques enhance the robustness and accuracy of structured prediction models in NLP tasks.
The transition-based framework, in addition to dependency parsing, finds application in
various structured prediction tasks within natural language processing (NLP). It involves
establishing a correspondence between structured outputs and sequences of state transi-
tions. Take sequence labelling, for example. The output is constructed by incrementally
assigning labels to each input element from left to right. In this context, a state is repre-
sented as a pair (σ, β), where σ represents a partially labelled sequence and β represents a
queue of unlabelled words. The initial state is ([], input), and terminal states are defined as
(output, []) . Each action advances the state by assigning a specific label to the front ele-
ment of β.
In the context of sequence segmentation, such as word segmentation, a transition sys-
tem operates by incrementally processing input characters from left to right. The system’s
state is represented as (σ, β), where σ denotes a partially segmented word sequence and β
constitutes a queue of incoming characters. When initially entering the system, the state is
characterized by an empty σ, while β contains the entire input sentence.
After reaching any terminal state, σ contains a fully segmented sequence, and β remains
empty. During each transition action, the system advances its current state by managing
the next incoming character. This action can involve either “separating” (sep) the character
to begin a new word or “appending” (app) it to the end of the last word within the partially
segmented sequence.
Deep learning was first applied to sequence labelling problems in 2008, marking one of
the earliest instances of deep learning’s successful use in addressing natural language
processing tasks.
This initial work not only involved embedding words into a d-dimensional vector but
also incorporated additional features. The process included feeding words and their cor-
responding features within a window into a multiple-layer perceptron (MLP) for tag
prediction.
The training criterion was based on word-level log-likelihood, treating each word in a
sentence as an independent entity. As mentioned earlier, there is often a correlation
between a word’s tag in a sentence and the tags of its neighboring words. In an updated
approach, tag transition scores were introduced into the sentence-level log-likelihood
model. In essence, this model resembles conditional random field (CRF) models, with the
distinction being the use of a nonlinear neural network instead of a linear model.
CRF models, due to their Markov assumption, can only leverage local features and
struggle to capture long-term dependencies between tags. This limitation is sometimes
critical in various natural language processing tasks. Recurrent neural networks (RNNs)
offer a solution by theoretically being able to model sequences of arbitrary length without
relying on the Markov assumption.
In more detail, RNNs are defined recursively, where a function takes a previous state
vector and an input vector, producing a new state vector. This conceptually makes RNNs
similar to deep feed-forward networks, albeit with shared parameters across different lay-
ers. However, RNNs face challenges such as the vanishing and exploding gradient problems. The exploding gradient problem has a straightforward solution: clipping gradients whose norm exceeds a specified threshold, as sketched below. The vanishing gradient problem is more complex, but gating
mechanisms like long short-term memory (LSTM) and gated recurrent unit (GRU) offer
effective solutions.
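A minimal PyTorch sketch of gradient clipping is shown below; the model, the dummy loss, and the threshold of 5.0 are arbitrary illustrative choices.

import torch
import torch.nn as nn

model = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 7, 10)     # a toy batch: 4 sequences of length 7
output, _ = model(x)
loss = output.sum()           # a dummy loss, just to produce gradients

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)   # rescale large gradients
optimizer.step()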
An extension of RNNs is bidirectional RNN (BiRNN), which accounts for the influ-
ence of both previous and succeeding words when predicting a tag in sequence labelling
tasks. It employs two RNNs—forward and backward RNNs—to represent the word
sequences before and after the current word. Deep RNNs, created by stacking RNN layers,
exhibit considerable power in various tasks, including semantic role labelling (SRL) with
a sequence labelling approach.
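A minimal sketch of a BiRNN tagging layer is given below, assuming toy dimensions, random input embeddings, and a hypothetical tag set of five tags.

import torch
import torch.nn as nn

embeddings = torch.randn(1, 6, 50)                 # 1 sentence, 6 words, embedding size 50
birnn = nn.LSTM(50, 32, batch_first=True, bidirectional=True)
tag_scorer = nn.Linear(2 * 32, 5)                  # forward + backward hidden states

states, _ = birnn(embeddings)                      # contextual states: (1, 6, 64)
tag_scores = tag_scorer(states)                    # one score vector per word: (1, 6, 5)
print(tag_scores.shape)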
Although RNNs have been successful in sequence labelling problems, they don’t
explicitly model tag dependencies like CRFs. Therefore, a transition score matrix between
tags can be introduced to create a sentence-level log-likelihood model, often referred to as
an RNN-CRF model. Various RNN types, such as LSTM, BiLSTM, GRU, and BiGRU, can be
used in this context. Neural CRFs can also be extended to handle sequence segmentation
problems, such as neural semi-CRFs. These models employ segmental recurrent neural
networks (SRNNs) to represent segments and may include segment-level representations
using segment embedding to encode entire segments explicitly.
Let us discuss the use of convolutional neural networks for dependency parsing. As
discussed earlier, convolutional neural networks (CNNs) are a class of deep learning mod-
els that have been primarily associated with image-processing tasks. However, they have
also found applications in various natural language processing (NLP) tasks, including
neural graph-based dependency parsing.
Here’s how CNNs can be used in neural graph-based dependency parsing:
• Word embeddings: Dependency parsing often starts with representing words in a sen-
tence as word embeddings (e.g., Word2Vec, GloVe). These embeddings capture seman-
tic and syntactic information about words. Each word in a sentence is associated with
an embedding vector.
• Convolutional layers: CNNs are used to capture local patterns and features in the input
data. In dependency parsing, CNNs can be applied to the word embeddings to extract
features related to word context.
• Convolutional filters: CNNs use convolutional filters (also known as kernels) to slide
over the input data (word embeddings). These filters learn to detect specific patterns or
features. In dependency parsing, these patterns might relate to word dependencies or
syntactic relationships.
• Feature maps: Convolutional operations result in feature maps. Each feature map rep-
resents the presence of a specific feature or pattern in the input data. Multiple filters can
generate multiple feature maps, each capturing different aspects of the input.
• Pooling: After applying convolutional filters, pooling layers (e.g., max-pooling or
average-pooling) can be used to reduce the dimensionality of the feature maps and
retain the most salient information. Pooling helps in capturing the most important fea-
tures while discarding less relevant information.
• Graph representation: The feature maps generated by the CNN can be considered as
feature vectors for individual words or tokens in the sentence. These feature vectors can
be used to build a graph representation of the sentence, where each word is a node and
the feature vectors are associated with these nodes.
• Dependency parsing: Once you have the graph representation of the sentence with fea-
ture vectors, you can apply a neural graph-based dependency parsing model. This
model can take advantage of the features learned by the CNN to make predictions about
syntactic dependencies between words in the sentence.
• Training: CNNs used in this context are typically pre-trained on large corpora and fine-
tuned on a specific dependency parsing task using annotated data. The parsing model is
trained to predict dependency relations between words.
The following are the benefits of using CNNs in neural graph-based dependency
parsing:
• Local context: CNNs can be effectively used to capture the local context and dependen-
cies between neighboring words. This is essential for understanding syntactic
relationships.
• Semantic features: CNNs can be effectively used to capture semantic features. These
features can be used for dependency parsing, such as word proximity and similarity.
• Reduced dimensionality: Pooling layers can help reduce the dimensionality of feature
representations. This makes them computationally efficient and less prone to overfitting.
• End-to-end learning: By incorporating CNNs into the parsing pipeline, the model can
learn to extract relevant features directly from the input data. This reduces the need for
manual feature engineering.
Normally, CNNs are used as part of the large neural networks for dependency parsing.
The specifics can vary on the basis of the nature of the application. CNNs can be used in
combination with RNNs or transformers to capture both local and global context in a
sentence.
Developing a complete CNN-based application for graph-based dependency parsing in
Python is a very complex task. The reason is that it requires a deep understanding of both
neural network architectures and natural language processing. However, here, we have
provided a simple example using a Python library called PyTorch to show the basic struc-
ture of how a CNN can be integrated into a dependency parsing pipeline.
Please note that this is a simple example and doesn’t cover all aspects of a real-world
dependency parser but shows the concept of incorporating a CNN. In practice, dependency
parsers often use more complex neural network architectures and large datasets.
import torch
import torch.nn as nn
import torch.optim as optim

class CNNFeatureExtractor(nn.Module):
    def __init__(self, input_dim, embedding_dim, num_filters, filter_sizes):
        super(CNNFeatureExtractor, self).__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.convs = nn.ModuleList([
            nn.Conv2d(in_channels=1, out_channels=num_filters,
                      kernel_size=(fs, embedding_dim))
            for fs in filter_sizes])

    def forward(self, x):
        # x: (batch, seq_len) token indices -> (batch, 1, seq_len, embedding_dim)
        embedded = self.embedding(x).unsqueeze(1)
        conved = [torch.relu(conv(embedded)).squeeze(3) for conv in self.convs]
        pooled = [nn.functional.max_pool1d(conv, conv.shape[2]).squeeze(2)
                  for conv in conved]
        cat = torch.cat(pooled, dim=1)   # one fixed-size feature vector per sentence
        return cat

class DependencyParser(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(DependencyParser, self).__init__()
        self.fc = nn.Linear(input_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        return self.out(torch.relu(self.fc(x)))

# Toy usage with made-up dimensions and random token indices
cnn_extractor = CNNFeatureExtractor(input_dim=1000, embedding_dim=50,
                                    num_filters=16, filter_sizes=[2, 3, 4])
parser = DependencyParser(input_dim=16 * 3, hidden_dim=64, output_dim=10)
input_data = torch.randint(0, 1000, (1, 20))   # one sentence of 20 token ids
features = cnn_extractor(input_data)
predictions = parser(features)
print("Predictions:", predictions)
This example shows how to create a simple CNN-based feature extractor followed by a
basic feed-forward neural network for dependency parsing. In practice, you would replace
the sample input data and models with your dataset and more sophisticated neural network
architectures. Additionally, you would need labelled data and a training loop to train the
model effectively.
Now, let us discuss the use of recurrent neural networks (RNNs) in depen-
dency parsing. Recurrent neural networks (RNNs) are a class of neural network architec-
tures commonly used in natural language processing (NLP) tasks, including neural
graph-based dependency parsing. RNNs are particularly well suited for handling sequences
of data, making them a natural choice for parsing tasks where the order of words in a sen-
tence matters.
RNNs can be used in neural graph-based dependency parsing in much the same way as CNNs: the recurrent layers produce a contextual feature vector for each word, and these vectors are then scored to predict dependency arcs and labels.
The following are the benefits of using RNNs in neural graph-based dependency
parsing:
• Sequential dependencies: RNNs are very effective in capturing the sequential depen-
dencies between the words. These dependencies are crucial for understanding syntactic
relationships in a sentence.
• Variable-length input: RNNs can handle variable-length input sequences. This makes
them suitable for sentences of different lengths.
• Bidirectionality: BiRNNs can capture dependencies in both directions. This allows
them to model complex syntactic structures more effectively.
• Contextual information: RNNs maintain hidden states that encode contextual informa-
tion about words in a sentence. These hidden states can be used for making informed
parsing decisions.
• End-to-end learning: By incorporating RNNs into the parsing pipeline, the model can
learn to extract relevant features and dependencies directly from the raw input data.
This reduces the need for manual feature engineering.
While RNNs are very effective for capturing sequential dependencies, they have their
own limitations, such as difficulties in modeling long-range dependencies. As a result,
more recent approaches in neural dependency parsing often combine RNNs with other
architectures, such as convolutional neural networks (CNNs) or transformers, to capture
both local and global context effectively.
Following is a sample Python code to demonstrate the use of RNN for dependency
parsing:
import torch
import torch.nn as nn

class DependencyParser(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(DependencyParser, self).__init__()
        self.fc = nn.Linear(input_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        return self.out(torch.relu(self.fc(x)))

# A bidirectional LSTM stands in for the RNN feature extractor (toy dimensions)
rnn_extractor = nn.LSTM(input_size=50, hidden_size=64,
                        batch_first=True, bidirectional=True)
parser = DependencyParser(input_dim=128, hidden_dim=64, output_dim=10)

input_data = torch.randn(1, 20, 50)       # one sentence: 20 word embeddings of size 50
features, _ = rnn_extractor(input_data)   # per-word contextual features: (1, 20, 128)
predictions = parser(features)            # arc/label scores for each word
print("Predictions:", predictions)
Dependency parsing yields syntactic trees as its output, a structural format akin to
sequences.
Graph-based dependency parsers assess components within dependency graphs, such
as labels and sibling labels. In contrast, transition-based dependency parsers employ shift-
reduce actions for incremental output construction. Seminal research employs statistical
models like SVM to make localized decisions during the parsing process, as seen in exam-
ples like MaltParser.
The primary objective of a greedy local parser is to determine the next parsing action
based on the current configuration. MaltParser functions by extracting features from the
top nodes of σ and the leading words in β. Features include attributes like form and POS
for s0, s1, q0, and q1, all utilized as binary discrete features. Additionally, the attributes of
dependents of s0, s1, and other σ nodes, such as their forms, POS, and dependency arc
labels, can be incorporated as extra features. These features are collected for a given parser
configuration and provided as input to an SVM classifier, which yields shift-reduce actions
from a set of valid actions. An alternative to MaltParser’s structure is illustrated in Fig. 11.4.
Similar to MaltParser, features are extracted from the top of σ and the front of β in a
parser configuration, serving as input for predicting the subsequent shift-reduce action.
In another model, instead of using discrete indicator features, embeddings represent
words, POS, and arc labels. In Fig. 11.4, a neural network with three layers is employed to
forecast the next action based on input features. The input layer concatenates word, POS,
and arc label embeddings from the context. Subsequently, the hidden layer applies a linear
transformation followed by a cube activation function:

$$h = (Wx + b)^3$$
The reason for opting for a cube function as the nonlinear activation function, as
opposed to the more conventional sigmoid and tanh functions, lies in its capability to
achieve arbitrary combinations of three elements in the input layer. Traditionally, in statis-
tical parsing models, these combinations were manually defined. Empirically, this
approach has demonstrated superior performance compared to alternative activation func-
tions. In the final step, the hidden layer serves as input to a standard softmax layer for
action selection.
Implementing a complete greedy shift-reduce dependency parser is a complex job, so a simplified example is provided here.
import torch
import torch.nn as nn
import torch.optim as optim

class Parser(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(Parser, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# Toy configuration: feature vectors of size 100 and three shift-reduce actions
parser = Parser(input_dim=100, hidden_dim=64, output_dim=3)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(parser.parameters(), lr=0.001)

inputs = torch.randn(8, 100)               # a batch of parser-configuration features
outputs = parser(inputs)                   # scores over shift-reduce actions
test_output = parser(torch.randn(1, 100))  # one test configuration
parsed_action = torch.argmax(test_output).item()
When presented with an input sentence, a greedy local sequence labeller operates incre-
mentally, assigning labels to individual input words by making local decisions. These
assignments are treated as classification tasks. Strictly speaking, this form of sequence
labeller can be categorized as either graph-based or transition-based. This is because each
label assignment serves to disambiguate either the ambiguities in the graph structure or
those in the transition actions.
In this context, we classify greedy local sequence labelling as transition-based, and this
classification is based on a specific reason. Graph-based sequence labelling models typi-
cally resolve entire sequences of labels as a single graph. This is achieved by making
Markov assumptions on output labels, enabling exact inference using the Viterbi algo-
rithm. Such constraints mean that features can only be extracted from local label sequences,
including second-order and third-order transition features. In contrast, transition-based
sequence labelling models do not impose Markov properties on the outputs. As a result,
they often extract highly nonlocal features, leading to their reliance on greedy search or
beam search algorithms for inference.
All the examples provided below show the application of greedy algorithms, with some
of these cases incorporating complex, global features. A substantial amount of research
has been dedicated to the use of neural models for CCG supertagging, a task that presents
greater complexity compared to traditional POS tagging. CCG, which stands for combina-
tory categorial grammar, is a linguistic framework characterized by a relatively limited
number of lexicalized elements. It relies heavily on lexical categories, especially super-
tags, to convey a significant portion of its syntactic information in parsing.
In contrast to more basic syntactic labels like part-of-speech (POS), supertags contain
a rich amount of syntactic information and also represent predicate-argument structures. Supertags present a considerable challenge, as treebanks often contain more than a thousand distinct supertags.
Traditional statistical models for CCG supertagging utilize conditional random fields
(CRFs) and extract features within a context window for each label.
These models heavily depend on POS information, requiring POS tagging as a prelimi-
nary step before supertagging. This introduces the possibility of errors in POS tagging,
consequently affecting the quality of supertagging. Researchers have also explored the use
of simple neural models for CCG supertagging, as shown in Fig. 11.5. In this model, a
three-layer neural network is used to assign supertags to each word within an input sen-
tence. The initial layer functions as an embedding layer, where each word is mapped to its
corresponding embedding representation. Binary-valued discrete features, such as the
two-letter suffix of the word and a binary indicator of capitalization, are concatenated with
the embedding vector.
The second layer serves as a hidden layer for integrating the features. For a given word,
a context window comprising context words is used to extract features. Input embeddings
from each word in the context window are combined and fed into the hidden layer, which
employs a tanh activation function for nonlinear feature integration.
The top layer serves as a softmax classification layer, assigning probabilities to all
potential output labels. Surprisingly, this straightforward model has demonstrated impres-
sive performance. It has achieved higher parsing accuracy for both in-domain and cross-domain data when compared to the CRF-based baseline tagger. As a greedy model, it also achieves significantly faster processing times than a neural CRF alternative while maintaining comparable accuracy.
The success of this model results from the effectiveness of neural network models in
automatically extracting features, which eliminates the need for a preliminary POS tag-
ging step.
Furthermore, pretraining word embeddings on extensive raw data helps address the issue
of feature sparsity present in baseline discrete models, ultimately leading to improved
cross-domain tagging performance.
Similarly, Figs. 11.6 and 11.7 show recurrent neural network sequence labellers with independent labels and with chained labels, respectively.
The following code shows a greedy sequence labelling implementation for the POS tagging task. It is deliberately simplified and does not reflect the complexity of real taggers.
sentence = "The quick brown fox jumps over the lazy dog"

def greedy_sequence_labeling(words):
    predicted_tags = []
    for word in words:
        if word.lower() in ["the", "a", "an"]:
            predicted_tags.append("DT")   # Determiner
        elif word.endswith("ing"):
            predicted_tags.append("VBG")  # Verb, gerund or present participle
        elif word.isnumeric():
            predicted_tags.append("CD")   # Cardinal number
        else:
            predicted_tags.append("NN")   # Noun (default)
    return predicted_tags

words = sentence.split()
predicted_tags = greedy_sequence_labeling(words)
for word, tag in zip(words, predicted_tags):
    print(word, tag)
In this code, we start with a sample input sentence. We
define a greedy_sequence_labeling function, which takes a list of words as input and pre-
dicts the POS tags for each word. In this example, we use a simple rule-based approach for
demonstration purposes. In a real-world scenario, you would replace this logic with your
machine learning model or more sophisticated rules. We tokenize the input sentence into
individual words. We call the greedy_sequence_labeling function to predict the POS tags
for each word in the sentence. Finally, we print the predicted POS tags for each word in
the sentence. Remember that this is a simplified example, and real-world sequence label-
ing tasks often involve more complex models and feature engineering. You can replace the
greedy_sequence_labeling function with your own model for more accurate predictions
based on your specific task.
Greedy local neural models have demonstrated their superiority compared to their statisti-
cal counterparts by utilizing word embeddings to mitigate sparsity and employing deep
neural networks to acquire nonlocal features. These models make use of syntactic and
semantic information spanning the entire sentence for structured prediction, allowing
them to model nonlocal dependencies among labels. However, it’s important to note that
the training of such models is local, potentially leading to label bias, since the globally optimal sequence of actions does not always consist of locally optimal actions.
To address this, globally optimized models, which have been the prevailing approach in
statistical NLP, have also been applied to neural models. These models typically employ
beam search, maintaining an agenda that keeps track of the B highest-scoring action sequences at each step (a code sketch of the procedure is given at the end of this discussion). Beam search for transition-based structured prediction proceeds as follows. Initially, the agenda contains only the start state of the state transition system.
At each step, all items in the agenda are expanded by applying all possible transition
actions, resulting in a set of new states. The B highest-scoring states from this set are
selected and serve as agenda items for the next step. This process continues until terminal
states are reached, and the highest-scoring state in the agenda is considered the output.
Unlike greedy local models, globally optimized models rank agenda items based on
their global scores, which encompass the total scores of all transition actions within the
sequence.
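The code sketch below outlines this beam search procedure. The helper functions for the transition system and the scoring model (is_terminal, valid_actions, apply_action, action_score) are placeholders that a concrete implementation would supply; they are not part of any specific library.

def beam_search(initial_state, beam_size, is_terminal,
                valid_actions, apply_action, action_score):
    # Each agenda item is a (cumulative_score, state) pair.
    agenda = [(0.0, initial_state)]
    while not all(is_terminal(state) for _, state in agenda):
        candidates = []
        for score, state in agenda:
            if is_terminal(state):
                candidates.append((score, state))   # carry finished states forward
                continue
            for action in valid_actions(state):
                new_state = apply_action(state, action)
                candidates.append((score + action_score(state, action), new_state))
        # Keep the B highest-scoring states as the next agenda.
        agenda = sorted(candidates, key=lambda item: item[0], reverse=True)[:beam_size]
    # The highest-scoring terminal state is returned as the output.
    return max(agenda, key=lambda item: item[0])[1]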
There are two primary training approaches for globally optimized models: one aims to
maximize the likelihood of gold-standard action sequences, while the other seeks to maxi-
mize the score margin between gold-standard and non-gold-standard action sequences.
Occasionally, other training objectives are employed.
The primary aim of the large margin objective is to increase the score margin between
correct and incorrect output structures. This objective has been employed in discrete struc-
tured prediction techniques like the structured perceptron and MIRA. Ideally, this training
objective ensures that the correct structure is scored significantly higher than all the incor-
rect ones. However, in structured prediction tasks, the number of possible incorrect struc-
tures can grow exponentially, making the exact objective computationally infeasible in
many cases.
To address this issue, the perceptron algorithm approximates the large margin objec-
tive. It focuses on the most violated margin and has a theoretical guarantee of convergence
during training. Specifically, when given a positive example (the gold-standard structure)
and a negative example (the one with the highest violation), the perceptron algorithm
adjusts the model parameters by adding the feature vector of the positive example to the
model and subtracting the feature vector of the negative example from the model param-
eter vector. This process is repeated for all training examples, ultimately leading the model
to score gold-standard structures higher than incorrect ones.
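A minimal numeric sketch of this perceptron-style update is given below; the feature vectors are made-up values purely for illustration.

import numpy as np

w = np.zeros(4)
f_gold = np.array([1.0, 0.0, 2.0, 1.0])      # features of the gold-standard structure
f_pred = np.array([0.0, 1.0, 2.0, 0.0])      # features of the highest-scoring incorrect structure

if np.dot(w, f_pred) >= np.dot(w, f_gold):   # the desired margin is violated
    w = w + f_gold - f_pred                  # add gold features, subtract predicted features
print(w)   # [ 1. -1.  0.  1.]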
The perceptron algorithm identifies a negative example for each gold-standard training
instance, one that incurs the largest violation of the ideal score margin. This typically
involves searching for the highest scored incorrect output or one that ranks the highest,
considering both its current model score and its deviation from the gold standard. In the
latter case, the structural deviation represents the cost of the incorrect output, where outputs
with structures similar to the gold standard have lower costs. This training objective, which
considers both model scores and costs, allows the model to distinguish not only between
gold-standard and incorrect structures but also among different incorrect structures based
on their similarity to the correct structure.
In the context of neural networks, this training objective translates to maximizing the
score difference between a given positive example and its corresponding negative exam-
ple. This is typically achieved by calculating the derivatives of the score difference with
respect to all model parameters and then updating the model parameters using gradient-
based methods like AdaGrad.
The maximum likelihood objectives for neural structured prediction draw inspiration
from log-linear models. Specifically, when we have the score of an output, denoted as
score(y), a log-linear model computes its probability as follows:
$$p(y) = \frac{\exp(\mathrm{score}(y))}{\sum_{y' \in Y} \exp(\mathrm{score}(y'))}$$
In this context, Y represents the set of all possible outputs. When we are dealing with a
structured output like a sequence, this log-linear model effectively transforms into a con-
ditional random field (CRF) under specific conditions.
Another line of research explores a related objective by making assumptions about
structured score calculations in transition-based models as shown in Fig. 11.8. Here, the
score for a state sk is computed as follows:
$$\mathrm{score}(S_k) = \sum_{i=1}^{k} f(S_{i-1}, a_i)$$
The definitions of f and a are the same as in the previous section. Given this score calculation, the probability of the state S_k is

$$p(S_k) = \frac{\exp(\mathrm{score}(S_k))}{\sum_{S'_k \in S} \exp(\mathrm{score}(S'_k))}$$
Another training objective that has been experimented with is maximizing the F1
Score, which is used for transition-based CCG parsing. Beam search can be employed to
discover the state with the highest score, with each state’s score being determined using
the calculation method illustrated in Fig. 4.9. For a given state Sk, the score is computed as
follows:
$$\mathrm{score}(S_k) = \sum_{i=1}^{k} g(S_{i-1}, a_i)$$
In this context, the function g represents a network model, and a represents a transition
action. The key distinction between the network function g and the network function in all
previously mentioned methods is that g employs a softmax layer to normalize the output
actions, whereas f does not use nonlinear activation functions to score different actions
given a state.
The training objective is depicted by the following formula:
$$E_{S_k \in A}\!\left[F_1(S_k)\right] = \sum_{S_k \in A} p(S_k)\, F_1(S_k)$$
In the formula, A represents the beam of candidate actions after parsing is completed,
and F1(sk) represents the F1 Score of the state sk as evaluated using standard metrics when
compared to the gold-standard structure.
p(sk) can also be calculated as follows:
$$p(S_k) = \frac{\exp(\mathrm{score}(S_k))}{\sum_{S'_k \in A} \exp(\mathrm{score}(S'_k))}$$
11.6 Deep Learning-Based Lexical Analysis Case Study

In this case study, we build a lexical analyzer using deep learning. The code below tokenizes the source code at the character level and pads the resulting sequences to a fixed length; source_code and max_sequence_length are assumed to have been defined during data preparation.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(source_code)
source_sequences = tokenizer.texts_to_sequences(source_code)
source_sequences = pad_sequences(source_sequences, maxlen=max_sequence_length)
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = tf.keras.Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim,
              input_length=max_sequence_length),
    LSTM(units=64, return_sequences=True),
    Dense(units=num_classes, activation='softmax')
])
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.fit(source_sequences_train, one_hot_labels_train,
validation_data=(source_sequences_val, one_hot_labels_val),
epochs=num_epochs, batch_size=batch_size)
Step 5: Evaluation
Test set evaluation: Evaluate the trained model on the test set to measure its perfor-
mance. Metrics like accuracy, precision, recall, and F1 Score can be useful.
Error analysis: Analyze the model’s predictions to understand its strengths and weak-
nesses. Identify common sources of errors, and refine the model accordingly.
Following is a sample Python code for this step:
test_loss, test_accuracy = model.evaluate(source_sequences_test, one_hot_labels_test)
print(f'Test Loss: {test_loss}, Test Accuracy: {test_accuracy}')
Step 6: Deployment
Incorporate into workflow: Integrate the lexer into your compiler or interpreter work-
flow, so it can preprocess the source code before further processing.
Model serialization: Serialize the trained model to a file format that can be loaded and
used by your application.
Following is the Python code for this step:
model.save('lexical_analyzer_model.h5')
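Loading the serialized model back into an application is then a single call; the file name below matches the save step above.

import tensorflow as tf

loaded_model = tf.keras.models.load_model('lexical_analyzer_model.h5')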
Deep learning-based lexical analysis techniques offer several advantages over conven-
tional techniques:
• Flexibility and adaptability: Deep learning models can be adapted to a large number of
languages and input formats without much effort. These models can themselves learn
444 11 Lexical Analysis and Parsing Using Deep Learning
complex patterns and structures from data. This makes them suitable for a large number
of applications.
• Context awareness: Deep learning models can capture contextual information. This
allows them to handle context-sensitive tokenization and lexical analysis tasks more
effectively. They can recognize patterns that depend on surrounding tokens. This makes
them suitable for tasks like part-of-speech tagging or named entity recognition.
• Handling ambiguity: Deep learning models can effectively handle ambiguity in the
input. Instead of depending on fixed rules, they use probabilities for different interpre-
tations or tokenizations. This makes them valuable in languages with ambiguous
constructs.
• Semantic understanding: Deep learning-based lexers can provide information about the
meaning and structure of the input. They can perform semantic analysis tasks, such as
sentiment analysis, topic modeling, and parsing. Normally, these tasks are considered
to be challenging for conventional lexers.
• Reduced maintenance overhead: Deep learning models learn from data. This means
that they don’t need manual rule specifications. This eliminates the need for extensive
rule creation and updates.
• Adaptation to new data: Deep learning models can easily adapt to new data and lan-
guages. When provided with additional training data, they can improve their accuracy
and coverage. This makes them suitable for rapidly changing environments.
• Automatic feature extraction: Deep learning models automatically extract relevant fea-
tures from the input. This reduces the need for feature engineering. This makes them
suitable for tasks with a large number of features.
• Continuous learning: Deep learning models can be continuously updated and fine-
tuned as new data becomes available, ensuring that they remain accurate over time.
While deep learning-based lexical analysis techniques offer these advantages, these
approaches have their own challenges as discussed as follows:
• Data requirements: Deep learning models require large amounts of labelled training
data to perform well. Acquiring and annotating such data can be expensive and time-
consuming, especially for specialized domains or languages with limited resources.
• Computational resources: Training deep learning models demands significant com-
putational resources, including powerful GPUs or TPUs and large memory capac-
ity. This can be costly and may limit the accessibility of these techniques to smaller
organizations.
• Complexity: Deep learning models are complex, with many parameters to tune.
Configuring the architecture, hyperparameters, and training procedures can be chal-
lenging and often requires expertise.
• Interpretability: Deep learning models are often considered “black boxes” because it
can be difficult to interpret why they make specific predictions. This lack of transpar-
ency can be problematic in applications where understanding the reasoning behind
decisions is crucial.
• Data imbalance: Imbalanced datasets can lead to biased models. Deep learning models
may struggle to learn from underrepresented classes, requiring additional techniques
like oversampling or cost-sensitive learning.
• Limited explainability: As mentioned earlier, the lack of explainability can be a signifi-
cant drawback, especially in applications where interpretability and accountability are
essential, such as legal or medical domains.
• Ethical considerations: Deep learning models can inadvertently learn biases present in
training data, leading to unfair or unethical predictions. Addressing bias and fairness
issues is an ongoing challenge.
11.8 Summary
Lexical analysis and parsing tasks aim to delve into the intrinsic characteristics of words
and their interconnectedness at a deeper level. These tasks typically involve foundational
techniques such as word segmentation, part-of-speech tagging, and parsing. One common
aspect of these tasks is that they produce structured outputs. To tackle structured predic-
tion tasks like these, two primary categories of methods are often employed: graph-based
methods and transition-based methods. Graph-based methods directly distinguish between
output structures based on their inherent features. Conversely, transition-based methods
transform the process of constructing outputs into sequences of state transition actions,
effectively distinguishing between various sequences of these actions. Neural network
models have demonstrated their effectiveness in addressing structured prediction prob-
lems within both the graph-based and transition-based paradigms. In this chapter, we have
presented an extensive overview of the utilization of deep learning techniques in the con-
text of lexical analysis and parsing. Furthermore, we have shown the comparisons between
these contemporary neural network approaches and traditional statistical methods to pro-
vide a more comprehensive understanding of their respective strengths and limitations.
11.9 Exercises
Q1: Discuss the role of the lexical analyzer in the context of compiler design.
import nltk
nltk.download('punkt')
text = "This is a sample sentence. It contains multiple words and
punctuation marks."
words = nltk.word_tokenize(text)
sentences = nltk.sent_tokenize(text)
print("Words:", words)
print("Sentences:", sentences)
Q7: Discuss the difference between neural graph-based methods and neural transition-
based methods.
Q8: Discuss some of the advantages of deep learning-based lexical analysis approaches
over the conventional methods.
Q9: Discuss graph-based structured prediction methods along with their function.
Q10: Discuss some of the challenges faced by deep learning-based lexical analysis
approaches.
Recommended Reading
• Lexical Analysis: Norms and Exploitations by Patrick Hanks
Publisher: The MIT Press
Publication Year: 2013
In Lexical Analysis, Patrick Hanks presents a large-scale empirical investigation of
word use and meaning in language. The book addresses the need for a corpus- and
lexicon-based theoretical approach that helps people understand how words combine
12 Machine Translation Using Deep Learning
Machine translation (MT) is a core concept in text mining that consists of computers trans-
lating human languages automatically. This chapter will introduce the concepts of MT
using deep learning models and techniques with examples and a complete description of
the accompanying Python source code.
Suppose that you are a translator and you are asked to translate from German to English.
You are given the term Sitzpinkler. Its literal translation is someone who urinates while
seated, yet its intended meaning conveys “wimp”. The suggestion here is that a man who
opts for a seated posture while urinating is considered less masculine. However, there’s
more complexity at play. This word gained prominence through a comedy show that intro-
duced several similar terms. For instance, Warmduscher refers to someone who prefers
warm showers, or Frauenversteher designates someone who comprehends women. Indeed,
a trend emerged, inventing such terms for humorous insults, albeit not of a truly harsh
nature. They are typically used in a playful, mocking manner.
These terms also reflect the prevailing cultural zeitgeist, particularly with evolving
expectations surrounding masculinity. Sitting down to urinate isn’t genuinely unmanly,
though it is a practice commonly associated with women, potentially causing a man who
wishes to embody a traditional “real” man to feel a loss of identity in doing so. As you can
discern, there are numerous layers at play here. So, what does a translator do in such a
scenario? Most likely, they would use “wimp” and proceed. This example underscores the
inherent challenge of translation. A word’s meaning in a language is intricately linked to
its historical usage within a particular culture. “Four score and seven years” isn’t merely
another way to express 87 years, and “I have a dream” conveys more than just outlining a
future vision. Words convey not only explicit meanings but also a subtle subtext of impli-
cations that often lacks an equivalent in another language or culture.
There exist numerous methods to translate a sentence. The primary competing objec-
tives are adequacy and fluency. Adequacy entails preserving the original text’s meaning,
while fluency necessitates generating output that reads as smoothly as any well-crafted
text in the target language. Oftentimes, these two objectives are at odds with each other. A
strict adherence to the original meaning may result in awkward translations. Different
genres of text make differing trade-offs in this regard. In the context of literary translation,
emphasis is placed on style, ensuring that the text flows well, even if this entails altering
some of the meaning to maintain the overall essence of the text. Consider, for instance, the
translation of song lyrics, where the crucial aspect is that the translated song conveys the
same emotions and sounds right.
However, when translating documents like operational manuals or legal texts, concerns
about fluency take a back seat. It is perfectly acceptable to produce stilted and cumber-
some phrases if it is the only way to convey the same information accurately.
Take, for example, a phrase that might appear in a newspaper article: “the same popula-
tion as Nebraska.” Suppose you want to translate this into Chinese. Few individuals in
China would have any notion of Nebraska’s population, so you might opt to substitute
Nebraska with the name of a Chinese city or province that the reader is familiar with. This
aligns with the author’s intent—to provide a concrete example that resonates with
the reader.
A subtler example is a foreign phrase that directly translates to “the American newspa-
per the New York Times.” For most American readers, this would appear peculiar, as it is
widely known that the New York Times is an American newspaper. Therefore, the original
phrase likely did not intend to emphasize its American nature; it was merely offering clari-
fication to readers who might not be acquainted with the paper. Conversely, consider the
reverse scenario—a literal translation from German might be Der Spiegel reported, leav-
ing most American readers uncertain about the source’s credibility. In such cases, a profes-
sional translator might opt for “the popular German news weekly Der Spiegel reported.”
A fundamental goal of translation is to remain inconspicuous. At no point should a
reader think, “This is translated exceptionally well/poorly,” or, worse, “What did this say
in the original?” Readers should not detect any telltale signs of translation and should be
under the illusion that the text was originally composed in their own language.
Figure 12.1 shows a sample translation from English to French.
12.2 Ambiguity
If we had to pick one word that captures the challenge of using computers for natural lan-
guage processing, it would be “ambiguity”. Natural language is inherently ambiguous in many
ways, from word meanings to sentence structures, which makes it hard to process automatically. People deal with
this ambiguity by looking at the bigger picture and using their knowledge, but they still
misunderstand each other sometimes. Sometimes, people intentionally use unclear lan-
guage to avoid committing to one meaning, and in those cases, translations need to keep
that ambiguity intact.
A common example of ambiguity is that certain words can have markedly different
meanings. Let’s examine the following example sentences:

1. The dog’s loud bark woke the neighbours.
2. The workers had to bark the tree before cutting it down.

In the first sentence, “bark” refers to the sound a dog makes. In this context, “bark” is a
noun representing the vocalization of a dog. In the second sentence, “bark” is used as a
verb, and it means to remove the outer covering of a tree, typically the tough, protective
layer, using a tool like an axe or a knife. In this context, “bark” refers to a physical action
involving trees and is unrelated to the sound a dog makes.
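As a small illustration (not part of the chapter’s own code), NLTK’s implementation of the Lesk algorithm can be used to guess which WordNet sense of an ambiguous word like “bark” fits a given sentence context. Lesk is a heuristic, so the chosen sense may not always match intuition:

import nltk
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')

sent1 = "The dog's loud bark woke the neighbours."
sent2 = "The workers had to bark the tree before cutting it down."

# Lesk picks the WordNet sense whose gloss overlaps most with the context
print(lesk(word_tokenize(sent1), 'bark', 'n'))  # a noun sense of "bark"
print(lesk(word_tokenize(sent2), 'bark', 'v'))  # a verb sense of "bark"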
Another big problem in computer language understanding is that, sometimes, the meaning
of a phrase is not a simple combination of the meanings of its individual words. This means
we can’t easily break down the problem of translating language into smaller, separate parts.
A good example of this is when people use phrases that don’t mean what the words
would mean individually. Like when someone says, “It’s raining cats and dogs.” If you try
to translate this word by word into another language, it won’t make sense. For example, in
German, a good translation might be es regnet Bindfäden, which literally means “it rains
strings of yarn” (to show that it’s raining very hard). Sometimes, you might be able to
figure out where these unusual phrases come from or what they mean, but most of the time,
people just remember them as they are and don’t think too much about what each
word means.
Sometimes, words in a sentence can be confusing because they can have different mean-
ings. For example, think about “eating steak with ketchup” and “eating steak with a knife.”
In the first one, “with ketchup” tells us about the steak, and in the second one, it tells us
about how we eat. This can be confusing, but when we translate, it’s not always a big
problem because the new language might also have the same kind of tricky structure. So,
we don’t always have to figure it out.
Languages often have different ways of arranging sentences, and this can be important
when translating these sentences. One key difference between languages is how they show
the relationships between words. English mostly uses the order of words in a sentence, like
subject-verb-object. But in a language like German, either the subject or the object can
appear at the start of a sentence, because word endings (case markings) still make it clear
who is doing what.
Consider the following short German sentence, with possible translations for each
word below it:

das behaupten sie wenigstens
that/the – claim – she/they – at least

In this translation:
• The first word, das, could mean “that” or “the”, but since it isn’t followed by a noun,
the translation “that” is more likely.
• The third word, sie, could mean “she” or “they”.
• The verb behaupten means “claim”, but it is also morphologically inflected for the
plural. The only possible plural subject in the sentence is sie, in the interpretation
of “they”.
As a result, the closest English translation would be “They claim that.” However, this
requires reordering from the original German word order (object-verb-subject) to the
English word order (subject-verb-object). Google Translate offers the translation “At least,
that’s what they say,” which maintains the emphasis on “that” by placing it early in the
English sentence. This is a common choice among human translators as well.
Translation can be quite challenging when meanings differ between languages or involve
connecting multiple distant concepts or words that are not explicitly mentioned. Let’s
consider the issue of pronominal anaphora, where pronouns refer to other things previ-
ously mentioned, which may not always be right before the pronoun. For example:

I watched the movie last night, and it was wonderful.

Here’s a simple case where “it” refers to “movie”. When translating this into languages
like German or French, we need to find a matching pronoun for “it”. However, these lan-
guages have gendered nouns, unlike English, where many things are neutral. Nouns in
these languages can be masculine, feminine, or neutral, often with no obvious reason (e.g.,
“moon” is male in German but female in French, “sun” is female in German but male in
French). In our example, “movie” translates to “Film” in German, which is masculine. So,
the pronoun “it” becomes the masculine “er”, not the feminine “sie” or the neutral “es”.
This translation process involves several layers of understanding: connecting the
English pronoun “it” to the English noun “movie”, choosing to translate “movie” as
“Film”, knowing that “Film” is masculine, and then using this information to translate “it”
as “er”. This means we need to track a lot of information, and it highlights the challenge
of resolving co-references (identifying which things in a text refer to the same thing).
Now, let’s consider a more complex example that involves co-reference resolution:
“Whenever I visit my uncle and his daughters, I can’t decide who is my favourite cousin.”
In this case, we need even more complex inference to figure out that the cousin is
female because she’s the daughter of the uncle. This requires not only resolving co-
references (connecting “cousin” and “daughters”) but also knowing about family relation-
ships and understanding the grammatical gender of German nouns.
Finally, let’s explore issues related to discourse relationships.
Consider the following two examples:

Since you were not at home, I left the package with your neighbour.
I have not seen her since she moved to Berlin.

In the first sentence, the word “since” means the same as “because”. It shows that one
thing happened as a result of another thing. In the second sentence, “since” means that
something happened after another thing, indicating a sequence of events. Understanding
how sentences are connected in a text, known as “discourse structure”, is a complex task
in language processing. It becomes especially tricky when words like “since” can have
different meanings based on the context. Detecting the correct meaning requires figuring
out how the sentences are related to each other, which can be challenging for language
processing systems.
Furthermore, connections between sentences in a text may not always be shown by
clear words like “since”, “but”, or “for example”. Instead, these connections can be shown
through how sentences are structured grammatically.
For example:
“That being said, I understand the point.”
In this example, the structure of the first part of the sentence suggests that something is
being conceded or admitted. When we convert sentences like these into different lan-
guages, we may have to include additional words or phrases to make this connection clear,
which can complicate the translation process.
12.3 Practical Issues
Machine translation has become a technology in common use. Almost anyone with the ability
to read this book can construct a machine translation system that rivals the current state of
the art. Data resources are easily available, established evaluation benchmarks are easily
accessible, and newly developed techniques are
often accessible through open-source toolkits.
Most translated content, like books and publications, can’t be freely used due to copyright
restrictions. But there’s still a large number of parallel texts that anyone can access. Some
international and government organizations share their content online, and that’s a help-
ful source.
One of the earliest collections used for machine translation is the Hansard corpus. It
contains records of Canada’s parliamentary meetings in both French and English. Similarly,
the European Union has published a lot of material in its 24 official languages. They’ve put
together a parallel corpus (Europarl) using records from their parliaments. This corpus is
widely used to train machine translation systems. Because parliament discusses a wide
range of topics, the Europarl corpus is good for creating a news translation system, for
example.
The Web site OPUS collects parallel texts from different places, like open-source
software guides, government documents, and religious books. For example, the Bible is
available in many languages as a parallel text. But because it’s big and sometimes uses
old-fashioned language, it might not be very useful for modern text. There’s a project called
ParaCrawl that crawls parallel texts from the Internet, but the data quality varies because it
collects large amounts of text without careful filtering. ParaCrawl gives each pair of
sentences a score to show how good it is.
For the most widely spoken languages like French, Spanish, German, Russian, and
Chinese, there’s a lot of data to use. However, for most other languages, there isn’t much
data available. This becomes a big problem when we’re dealing with less common lan-
guages, known as “low-resource languages.” In these cases, the shortage of training data
is a significant hurdle. Even for many popular Asian languages, there’s a severe shortage
of parallel texts to use for translation.
In the context of natural language processing, machine translation stands out as a rela-
tively well-defined task. Unlike some other problems in the field, machine translation is
characterized by a friendly competitive spirit rather than ideological battles.
One key reason for this amicable environment is the requirement for concrete demon-
stration of translation quality. Mere claims of superior machine translation are insufficient;
participants must prove their system’s effectiveness by taking part in open shared evalua-
tion campaigns. Currently, two prominent annual campaigns are organized by academic
institutions.
The Workshop on Machine Translation (WMT) evaluation campaign is part of the
Conference for Machine Translation. It happens at the same time as one of the big meet-
ings of the Association for Computational Linguistics. It began as a competition for a few
languages using the Europarl corpus. But now, it includes many languages, even ones with
limited resources, like Russian and Chinese. Besides the main task of translating news for
WMT, they also have other tasks, like translating medical texts or closely related lan-
guages. They also look at how well the translations work using different measures.
The IWSLT evaluation campaign is mainly about combining speech recognition and
machine translation. It includes tasks for translating spoken content, like TED talks, both
as transcriptions and as end-to-end speech translation systems. Additionally, the US
National Institute of Standards and Technology (NIST) organizes shared tasks, typically
linked to ongoing research programs funded by agencies like the Defense Advanced
Research Projects Agency (DARPA) or the Intelligence Advanced Research Projects
Activity (IARPA). NIST’s earlier shared tasks in Chinese and Arabic machine translation
had a significant impact, and recent efforts have shifted toward addressing low-resource
languages.
Lastly, there is an evaluation campaign organized by the Chinese Workshop on Machine
Translation, which focuses on Chinese and Japanese translation tasks. These shared evalu-
ation campaigns serve as critical platforms for researchers to assess and advance the state
of the art in machine translation, fostering collaboration and healthy competition in
the field.
Google Translate has played a pivotal role in bringing machine translation to a wider audi-
ence. It seamlessly provides translation services directly to users when and where they
need it. For instance, when you’re searching for information on the Internet and come
across a Web page in a foreign language, such as a Finnish page addressing a computer
issue or a French page explaining how to purchase Parisian metro tickets, you can simply
click a button (“translate this page”) to have its content rendered in English or any other
language you’re more comfortable with.
Google Translate essentially opens up the Web to users across all languages. Its value
is even more pronounced when translating English content into other languages, as English
remains a dominant language on the Internet, hosting a wealth of valuable information that
may not be available in other languages, especially in fields like advanced science.
This cross-lingual information access from the Internet shapes user expectations
regarding the technology. Users are aware that machine translation is responsible for the
translation process, and they understand that any mistranslations or language imperfec-
tions are a result of the technology’s limitations, not the fault of the content’s original
publisher. Additionally, machine translation for information access has been a driving
force behind research funding in the USA, with programs like DARPA LORELEI present-
ing challenges related to humanitarian crises in foreign countries where aid workers need
access to lifesaving information in unfamiliar languages.
Numerous commercial use cases benefit from machine translation for information
access. Patent lawyers can track claims made in Chinese patents, news reporters can stay
informed about developments in foreign countries, and hedge fund managers can access
information in any language that impacts company profitability. Even lower-quality
machine translation can be valuable, as it provides a general understanding of a docu-
ment’s content, helping users decide its relevance. Only relevant documents need further
examination by language specialists.
However, there is a significant challenge associated with information access through
machine translation. If any part of the original document’s meaning is distorted during
translation, it falls upon the user to identify these issues. Detecting distortions may be pos-
sible through indicators like poor language quality and semantic implausibility.
Nonetheless, erroneous translations have the potential to mislead users, making misinfor-
mation a notable concern, particularly with neural machine translation systems that may
prioritize fluency over accuracy to the extent that the output bears no relation to the input.
As a result, the development of confidence scores that indicate the reliability of a transla-
tion becomes crucial, especially when important decisions are based solely on machine-
translated content.
The translation industry is vast, but machine translation has not reached a level of quality
where customers are willing to pay a significant amount for it. High-quality translation
typically necessitates professional translators who are native speakers of the target lan-
guage and ideally experts in the subject matter. The majority of the translation industry
comprises numerous language service providers that often outsource work to freelance
translators.
Machine translation still cannot match the quality that professional human translators
deliver.
However, it’s a helpful tool that translators can use to get more work done. Back in the
1990s, translators started changing how they work. They used something called “transla-
tion memory tools,” which are like searchable collections of past translations. When they
had a sentence to translate, these tools would look through their database of old transla-
tions to find a similar sentence and show how it was translated before. Professional transla-
tors who regularly work with the same client can translate repetitive stuff faster. This
includes things like annual reports, legal contracts, and product descriptions that have a lot
of repeated text.
Professional translators have been gradually using machine translation in their work,
but it’s not fully integrated yet. For some translation tasks, like marketing messages that
need cultural information or literary and poetic works, machine translation isn’t very help-
ful. However, for many other regular translation jobs, machine translation can be useful.
One way machines and humans collaborate in translation is by giving professional
translators the output of a machine translation system and asking them to correct it. This
is called “post-editing machine translation.”
People in the translation industry have had debates about using machine translation.
One concern is how the profits are shared. If translators have to work twice as fast with
machine translation but get paid less, it can cause problems, especially if they’ve had bad
experiences with poor machine translation. Making machine translation work well often
means adjusting it for a specific field or style, which can be tough if language service pro-
viders don’t have the right tools, data, knowledge, or computer power. Another issue is that
post-editing machine translation isn’t as fun as translating from scratch. Fixing a machine’s
mistakes, which can be repetitive, isn’t as creative as creating original text inspired by a
foreign document.
There’s also a worry that translators might have to produce translations really quickly
with less focus on polished language in the future. People are trying to make machine
translation smarter and more interactive. One way is “adaptive machine translation,”
where systems learn from translators as they work, using what they do to get better.
“Interactive machine translation” means the machine translation system suggests things to
the translator and changes those suggestions based on what the translator picks. It’s more
than just giving a single machine translation of a source sentence.
But making tools that work well for translators is tough. Machine translation can give
lots of extra information, like different translations, how confident it is, and keeping track
of terms. But giving translators too much info can be distracting. The ideal tool would
quickly give translators clear answers to their questions during translation, but it shouldn’t
get in the way.
Finding out what questions or problems a translator has isn’t easy. Some research uses
things like tracking key presses and eye movements to see what translators are doing. But
even with all that data, it’s still hard to tell what a translator is dealing with when they’re
just looking at the screen without doing anything obvious.
12.4.3 Communication
Another big way machine translation is used is for communication. This means helping
two people who speak different languages talk to each other. Making this happen smoothly
is tricky. Machine translation might need to work with other technologies like speech pro-
cessing to fit into a conversation naturally. For communication, machine translation has to
be super-fast. It might need to start translating before the speaker finishes a sentence so
there are no awkward pauses. One big project in this field is Microsoft’s work with Skype.
They’re trying to let people have conversations on Skype even if they speak different lan-
guages. For example, you might talk in English, and your friend speaks in Spanish, but the
conversation gets translated back and forth.
Computers are already processing the speech, which means they can do more things
with it.
When we look closely at this issue, there are three main steps (a minimal code sketch follows the list):
• First, turning the spoken words into written text (speech recognition)
• Second, translating that text (machine translation)
• Third, turning the translated text back into spoken words (speech synthesis)
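As a minimal sketch of this pipeline, the three stages can be chained as plain function calls. Here recognize_speech, translate_text, and synthesize_speech are hypothetical placeholder functions standing in for real speech recognition, machine translation, and speech synthesis components:

def speech_to_speech_translation(audio, recognize_speech, translate_text, synthesize_speech):
    # Step 1: speech recognition turns spoken words into written text
    source_text = recognize_speech(audio)
    # Step 2: machine translation maps source-language text to target-language text
    target_text = translate_text(source_text)
    # Step 3: speech synthesis turns the translated text back into spoken words
    return synthesize_speech(target_text)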
Ideally, the speech synthesis part would also include things like the way the speaker
emphasizes words or their emotions. It might even sound like the original speaker. But in
many real situations, people skip the speech synthesis part. It’s often easier to read a
sometimes-not-perfect translation on a screen than to listen to it.
When people talk, they often use a smaller set of words compared to when they write.
But there’s a problem when it comes to translating spoken language. The kind of language
used in everyday talk is different from what’s often available for translation. People talk in
a more casual way, using things like “I” and “you” a lot, asking questions more often,
making mistakes, and using slang. In fact, if you saw an exact written record of your
everyday speech, you might be surprised by how ungrammatical and messy it looks.
It’s important to remember that communication doesn’t always mean speaking.
Machine translation is also used in chat forums where people type questions and answers.
These forums can be for fun or for customer support. But many of the same language
issues apply here. Chat text has its own quirks, like emoticons, slang abbreviations, and
lots of spelling mistakes.
When it comes to how good machine translation needs to be, it’s not as high for every-
day communication as it is for things that get published. If the machine makes mistakes,
the people talking will usually notice and try to clear things up. Sometimes, they might
even get upset. We really see the need for translation when we travel to a foreign country.
Think of a travel translator, which was made famous by The Hitchhiker’s Guide to the
Galaxy. In that story, there was a translator called a “Babel fish” that you put in your ear,
and it could translate spoken words for you.
Nowadays, travel translation tools are more commonly used. They’re often in the form
of a handheld device or a phone app. The technology they use is similar to what we talked
about for speech and chat applications earlier. If the device translates speech, it’s handy to
also show the original spoken words on the screen. This helps the speaker check if the
translation was correct. Because the technology isn’t perfect, and there can be noise and
limited computer power on the device (sometimes, it uses cloud computing, which can
cause delays), the most reliable travel translators work with text as their main focus.
Speech recognition is just an extra feature.
Another good thing a travel translator can do is image translation. Suppose you are at a
restaurant and you get a menu with words you can’t understand, maybe even written in
strange symbols. All you have to do is use the camera on your travel translator app, and it
translates the text in the picture into your language. The first versions of these phone apps
were simple. They just used a dictionary for translation, but they had fun extras, like mak-
ing the translation look like the original text’s font.
In 2018, people started trying to translate spoken speeches or university lectures. But it
was tough because they had to deal with all the challenges of combining speech recogni-
tion and machine translation. They also needed better sound conditions and more formal
speaking. In these early attempts, they didn’t just give the machine translation the speech
transcript. They tried to make it work together more closely. For example, they gave lists
of possible translations or word patterns that show different ways of saying things. This
helped the machine translation understand the spoken words better. But not much progress
has come from this work. It turns out that just using the best version of the transcript is
often the simplest and works just as well.
When it comes to integrating different systems, there are some challenges. One of them
is dealing with punctuation in written text. You need to figure out where to put it or remove
it when using machine translation. In written text, numbers are often shown as digits like
“15”, but when you’re dealing with speech recognition, it gives you the actual spoken
words like “fifteen”.
Machine translation and natural language processing have reached a level of maturity that
enables their application in various practical contexts. These applications span a wide
spectrum, from well-established ones like text search, exemplified by Google, to emerging
fields such as personal assistants like Amazon’s Echo. Additionally, there are visionary
applications like customer support dialog systems, complex question answering, and per-
suasive argumentation, which represent the future of this technology.
One of the primary challenges in the field is translating these broad visions into practi-
cal applications that can be consistently performed, benchmarked, and tracked for prog-
ress. To gauge machine performance, it’s often necessary to compare it against human
capabilities. Machine translation, in particular, offers a relatively well-defined task with
measurable progress, despite occasional disagreements among professional translators
regarding specific sentence translations. In contrast, tasks like document summarization,
constructing coherent arguments, or open-ended conversations are less well defined and
present greater challenges.
Machine translation, however, frequently serves as a component of more extensive
natural language processing applications. For instance, in cross-lingual information
retrieval, the goal is to perform Web searches not only in English but also in other lan-
guages to find relevant content. Achieving this involves query translation, Web page trans-
lation, or both. Intelligence Advanced Research Projects Activity (IARPA) in the USA has
initiated a cross-lingual information retrieval project, which further complicates the task
due to limited data availability for languages like Swahili, Tagalog, and Somali.
Extending the complexity, cross-lingual information extraction requires not only find-
ing relevant information but also distilling core facts following a structured schema. For
example, a query might seek a list of recent mergers and acquisitions from a collection of
multilingual news articles. In this case, the system is expected to not only return relevant
stories but also generate a formatted table with details like company names, event dates,
financial transactions, and more.
Each of these applications places specific demands on machine translation systems. In
query translation, where input sentences may consist of just a few words, contextual dis-
ambiguation is challenging. However, user search history may offer valuable context.
Additionally, applications may require high recall, such as retrieving all relevant docu-
ments. In such cases, the system’s preferred translation for a word in a foreign document
may not match the query term exactly, but an alternative translation into the query term
might be suitable. Confidence scores indicating translation reliability can play a crucial
role in such scenarios. The evolving landscape of machine translation continues to adapt
to meet the diverse needs of these practical applications.
12.5 Machine Translation Approaches
There are several different approaches to machine translation, each with its own advan-
tages and disadvantages. Here are some of the main types of machine translation
approaches:
• Rule-based machine translation (RBMT) relies on linguistic rules and grammatical structures to translate text from
one language to another.
• It involves creating a set of language-specific rules and a bilingual dictionary.
• This approach can produce high-quality translations for languages with well-
defined rules and limited vocabulary, but it struggles with languages that have com-
plex syntax and idiomatic expressions.
• Statistical machine translation (SMT) is based on statistical models that learn from large bilingual corpora.
• It uses algorithms like the IBM models or phrase-based models to determine the
likelihood of a translation given the source text.
• SMT can handle a wide range of languages and can produce reasonable transla-
tions but may struggle with idiomatic expressions and context.
• Neural machine translation (NMT) is a deep learning-based approach that uses neural networks, typically recur-
rent neural networks (RNNs) or transformers.
Deep learning plays a central and transformative role in the field of machine translation
(MT). It has revolutionized the way translations are generated and has significantly
improved translation quality. Here are the key roles that deep learning plays in machine
translation:
Neural machine translation (NMT): Deep learning, particularly neural networks, has
given rise to neural machine translation, which has become the dominant approach in
modern machine translation. NMT models have completely changed the way translations
are generated.
The core role of deep learning in NMT includes:
Deep learning for machine translation involves several key components that work together
to build effective neural machine translation (NMT) systems. Table 12.1 provides an over-
view of the key components in a deep learning-based NMT system.
First, let us discuss the conventional machine translation models. Various approaches
exist in this context, e.g., statistical machine translation, rule-based machine translation,
phrase-based machine translation, etc.
A statistical machine translation (SMT) model operates by utilizing statistical models
and algorithms to automate the process of translating text from one language to another. It
begins with the collection of a substantial parallel corpus, which comprises text in the
source language aligned with corresponding translations in the target language. This paral-
lel corpus serves as the foundation for training the SMT model. After collecting the data,
preprocessing steps are applied to clean and tokenize the text, breaking sentences into
words or subword units for translation. Alignment models come into play to align the
source and target language sentences within the parallel corpus, learning the probabilities
of how words or phrases in one language correspond to those in the other.
Following alignment, the model employs translation models, which estimate the likeli-
hood of translating specific source words or phrases into their corresponding target coun-
terparts, incorporating the alignment probabilities into their calculations. Language
models, which gauge the fluency and grammaticality of translations in the target language,
estimate the probabilities of word sequences. These models ensure that translations not
only capture meaning accurately but also sound natural. During the decoding phase, the
SMT model utilizes algorithms like beam search to choose the most probable translation
for each source word or phrase, taking into account the learned translation and lan-
guage models.
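The decoding objective just described is commonly summarized by the noisy-channel formulation of SMT, a standard textbook formula rather than anything specific to this chapter’s code:

\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)

where f is the source (foreign) sentence, P(f | e) is the translation model, and P(e) is the language model.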
Throughout this process, a phrase table is maintained, containing translation probabili-
ties for different phrases or words in both languages, which aids in making translation
decisions during decoding. Parameter estimation is a crucial step where the SMT model
fine-tunes its translation and alignment models based on the training data, optimizing them
for accuracy and fluency. After decoding, post-processing steps may further enhance the
quality of translations, such as reordering words or correcting grammar.
To evaluate the quality of translations, various metrics like BLEU and METEOR are
used, comparing the generated translations against human references. While SMT models
have been an essential approach in machine translation, they have been largely superseded
by neural machine translation (NMT) models, which employ deep learning techniques to
capture context and produce higher-quality translations. Nonetheless, the principles of
statistical modeling have played a significant role in the development of NMT and other
machine translation methods.
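As a small, hedged illustration of such metrics, NLTK provides a sentence-level BLEU implementation that compares a candidate translation against one or more references. The sentences below are made up purely for demonstration:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sits", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

# Smoothing avoids zero scores when higher-order n-grams do not match
smooth = SmoothingFunction().method1
print(sentence_bleu(reference, candidate, smoothing_function=smooth))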
The following is a Python code to demonstrate how the SMT model works:
import nltk
from nltk.translate import IBMModel1, AlignedSent
from nltk.corpus import comtrans

nltk.download('comtrans')

# Build a small bitext of (source words, target words) pairs from the
# aligned English-French corpus shipped with NLTK
bitext = [(s.words, s.mots) for s in comtrans.aligned_sents()[:100]]

training_data = []
for pair in bitext:
    source_sentence = pair[0]
    target_sentence = pair[1]
    training_data.append(AlignedSent(source_sentence, target_sentence))

# Train IBM Model 1 for 5 EM iterations; the model stores the learned
# word alignment on each AlignedSent in the training data
ibm1 = IBMModel1(training_data, 5)
print(training_data[0].words)
print(training_data[0].mots)
print(training_data[0].alignment)
import nltk
nltk.download('punkt')

# Toy parallel corpus, tokenized with NLTK
english_sentences = [nltk.word_tokenize("the house is small")]
french_sentences = [nltk.word_tokenize("la maison est petite")]

def extract_phrases(en_sents, fr_sents):
    # Collect position-aligned word pairs as a crude phrase table
    phrase_pairs = set()
    for en, fr in zip(en_sents, fr_sents):
        phrase_pairs.update(zip(en, fr))
    return list(phrase_pairs)

phrase_pairs = extract_phrases(english_sentences, french_sentences)
In this code, first, we tokenize English and French sentences using NLTK’s word_
tokenize function.
We define a function (extract_phrases) to extract phrases from parallel sentences and
store them in a set. We extract phrases from the provided parallel corpus. We define a
simple translation model as a Python dictionary (translation_model) for demonstration
purposes.
We define a function (translate_sentence) to perform translation using the phrase table
(dictionary). Finally, we translate a sample English sentence using the translation model.
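A minimal sketch of the last two steps described above might look as follows; the dictionary entries are toy values chosen only for illustration:

# Toy phrase table: English word -> French translation
translation_model = {
    "the": "le",
    "house": "maison",
    "is": "est",
    "small": "petite",
}

def translate_sentence(sentence, model):
    # Look each token up in the phrase table; keep unknown words unchanged
    tokens = sentence.lower().split()
    return " ".join(model.get(token, token) for token in tokens)

print(translate_sentence("The house is small", translation_model))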
Now, let’s discuss the case study to show the neural machine translation. A complete
case study of neural machine translation (NMT) involves several steps, including data
preparation, model building, training, and evaluation. In this example, we’ll perform
English-to-French translation using the torchtext library in Python and a small parallel
corpus. Note that for a real-world application, you would need a more extensive dataset
and more complex models.
import torch
from torchtext.data import Field, BucketIterator, TabularDataset

# SRC, TRG, and train_data are assumed to have been created in the
# (omitted) data-preparation step using Field and TabularDataset

# Build vocabulary
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)
import random
import torch
import torch.nn as nn

class Seq2seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # Standard seq2seq loop (reconstructed here in its usual form):
        # encode the source once, then decode one target token at a time.
        # The decoder is assumed to expose output_dim (target vocabulary size).
        outputs = torch.zeros(trg.shape[0], trg.shape[1],
                              self.decoder.output_dim).to(self.device)
        hidden, cell = self.encoder(src)
        input = trg[0, :]
        for t in range(1, trg.shape[0]):
            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[t] = output
            top1 = output.argmax(1)
            # Teacher forcing: sometimes feed the gold token, sometimes the prediction
            input = trg[t] if random.random() < teacher_forcing_ratio else top1
        return outputs
Step 3: Training
Now, we’ll train the NMT model using the prepared data and the defined model. The
Python code for this step is given below.
import math
import random
import torch.optim as optim

# The full training loop (optimizer setup, forward/backward passes, and the
# train/validate helper functions) is omitted in this excerpt; after each
# epoch the losses are reported as follows:
print(f"Epoch: {epoch+1:02}")
print(f"\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}")
print(f"\t Val. Loss: {valid_loss:.3f} | Val. PPL: {math.exp(valid_loss):7.3f}")
Step 4: Evaluation
Finally, we evaluate the trained model on the test set as follows:
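A minimal evaluation sketch, assuming the model, the test iterator (test_iterator), and the loss criterion from the earlier (partly omitted) steps, could look like this:

import math
import torch

def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for batch in iterator:
            src, trg = batch.src, batch.trg
            # Turn off teacher forcing during evaluation
            output = model(src, trg, 0)
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)
            epoch_loss += criterion(output, trg).item()
    return epoch_loss / len(iterator)

test_loss = evaluate(model, test_iterator, criterion)
print(f"Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f}")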
This is a simplified example of building and training an NMT model for English-to-
French translation. In practice, you would need to handle various aspects such as batching,
model checkpointing, and more extensive hyperparameter tuning. Additionally, you may
want to use pre-trained embeddings and more complex architectures for better translation
quality.
Attention mechanisms and learned continuous representations allow NMT models to handle word reordering more naturally and effectively. In NMT, the use of positional
embeddings also contributes to addressing reordering challenges. Positional embeddings
provide information about the position of words within the source and target sentences.
This additional information helps NMT models learn and generate correct word orderings,
especially in languages where word order plays a crucial role in conveying meaning.
While NMT models have shown promise in addressing reordering challenges, it’s
important to note that the specific approach to reordering may vary depending on the lan-
guage pair and the complexity of word order differences. Reordering remains a critical
aspect of machine translation, and ongoing research continues to explore improved meth-
ods for handling reordering in various translation scenarios. However, creating a complete
Python code example for a reordering model in machine translation is a complex task, as
it involves various components, including data preprocessing, alignment, and integration
with translation models. Below, we will provide a simplified Python code snippet that
demonstrates the concept of reordering in a basic manner using a phrase-based approach.
Please note that this example is highly simplified for illustration purposes and does not
cover the complexities of real-world machine translation systems.
import random

# Source sentence (English) and target sentence (German), split into phrases
source_phrases = ["the", "house", "is", "small"]
target_phrases = ["das", "Haus", "ist", "klein"]

# Simulate reordering by randomly shuffling the target phrases; a real
# reordering model would instead use alignment information or a neural network
random.shuffle(target_phrases)

print("Source sentence:  ", " ".join(source_phrases))
print("Reordered target: ", " ".join(target_phrases))
In this simplified example, we have a source sentence in English and its corresponding
target sentence in German. Both sentences are split into words or phrases. In real machine
translation, this would involve more sophisticated tokenization and pre-processing.
We simulate the reordering process by randomly shuffling the target phrases. In prac-
tice, a reordering model would use alignment information, linguistic features, or neural
networks to make decisions about reordering.
Finally, we display the source sentence and the reordered target sentence.
Keep in mind that this code represents a basic illustration of reordering and does not
reflect the complexity of actual reordering models used in machine translation systems.
Real reordering models consider various factors to determine the optimal word or phrase
order in the target language sentence.
In the context of machine translation, a language model (LM) plays a critical role as a
fundamental component of the translation process. Essentially, a language model is a com-
putational model that learns the patterns, structures, and grammatical rules of a specific
language. Its primary function is to estimate the likelihood of word sequences or sentences
in that language. This probability estimation is essential for generating coherent and con-
textually accurate translations.
One of the key functions of a language model in machine translation is to provide prob-
ability estimates for sequences of words. This means that it assigns probabilities to differ-
ent word combinations and sentence structures in the source language. By doing so, the
language model helps the translation system identify the most probable translations,
ensuring that the generated target-language sentences are linguistically sound and contex-
tually appropriate.
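Formally, this is usually expressed with the chain rule of probability, where the probability of a sentence is decomposed into a product of conditional word probabilities:

P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})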
Context is a crucial aspect of language comprehension and generation, and language
models are adept at considering the surrounding context when predicting the next word in
a sentence. They do not view words in isolation but take into account the words that pre-
cede them. This contextual awareness is particularly valuable in machine translation,
where it aids in producing translations that are not only accurate but also contextually
relevant. It ensures that the translation reflects the intended meaning in the source text.
Furthermore, language models significantly contribute to the fluency and naturalness of
translated text. They help ensure that the generated translations read smoothly and adhere
to the grammatical rules of the target language. This is vital for producing high-quality
translations that are not only accurate but also pleasant to read and comprehend for human
readers.
In addressing the challenge of linguistic ambiguity, language models prove invaluable.
Many words and phrases in natural language have multiple meanings or interpretations.
Language models consider the context in which a word or phrase appears, enabling them
to disambiguate and select the most appropriate translation. This capability is crucial for
producing accurate and contextually meaningful translations, especially when dealing
with polysemous words.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained model and tokenizer (as described below)
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Input text
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate text
output = model.generate(input_ids, max_length=50, num_return_sequences=1,
                        no_repeat_ngram_size=2, top_k=50)

# Decode and print the generated text
print(tokenizer.decode(output[0], skip_special_tokens=True))
In this code, we import the necessary modules from the transformers library, including
the GPT-2 model and tokenizer. We specify the model name, in this case, “gpt2”, but you
can choose a different variant depending on your requirements. We load the pre-trained
model and tokenizer using GPT2LMHeadModel.from_pretrained and GPT2Tokenizer.
from_pretrained. We define the input text you want to use for text generation. We tokenize
the input text using the tokenizer.
We generate text using the model’s generate method, specifying parameters like max_
length, num_return_sequences, no_repeat_ngram_size, and top_k to control the genera-
tion process. Finally, we decode and print the generated text.
There is another term very common these days called large language models. Large
language models represent a transformative development in the field of natural language
processing (NLP). These models are designed to understand and generate humanlike text
by leveraging vast amounts of data and powerful neural network architectures. They are
characterized by their massive scale, often consisting of hundreds of millions to billions of
parameters, enabling them to capture complex linguistic patterns and semantic nuances in
text. Large language models have gained significant attention due to their ability to per-
form a wide range of NLP tasks, including language translation, text generation, question
answering, and more.
The underlying architecture of large language models is typically based on deep learn-
ing, particularly the transformer architecture. Transformers use self-attention mechanisms
that allow the model to weigh the importance of different words in a sentence, effectively
capturing long-range dependencies and contextual information. This architecture facili-
tates parallel processing, making it computationally efficient, and is a key factor in the
success of large language models.
Training large language models is a data-intensive process. They are pre-trained on
enormous corpora of text from the Internet, encompassing diverse sources, genres, and
languages. During pre-training, the model learns to predict the next word in a sentence,
effectively learning the statistical patterns and structures of language. This process requires
immense computational resources and takes advantage of distributed computing clusters
to handle the sheer volume of data and model parameters. Once pre-trained, these models
can be fine-tuned for specific NLP tasks, including machine translation. Fine-tuning
involves training the model on a smaller dataset that is relevant to the target task. For
machine translation, this dataset would consist of parallel sentences in multiple languages.
Fine-tuning allows the model to adapt its pre-trained knowledge to the intricacies of trans-
lation and generate high-quality translations.
Large language models excel in capturing context and semantics, which makes them
exceptionally useful in machine translation. When translating a sentence, the model con-
siders not only individual words but also the relationships between them and the overall
sentence structure. It uses this contextual information to generate coherent and contextu-
ally appropriate translations. Additionally, large language models often outperform tradi-
tional machine translation systems in low-resource language pairs, where training data is
limited.
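As a small, hedged example of using such a pre-trained model for translation, the Hugging Face transformers library exposes a high-level pipeline API; the model name t5-small is just one publicly available choice:

from transformers import pipeline

# T5 was pre-trained with translation among its tasks
translator = pipeline("translation_en_to_fr", model="t5-small")
result = translator("The house is wonderful.", max_length=40)
print(result[0]["translation_text"])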
Despite their impressive capabilities, large language models have raised concerns
related to computational resources, environmental impact, and ethical considerations.
Training and running such models demand substantial energy and hardware resources,
contributing to carbon emissions. Furthermore, biases present in the training data can lead
to biased or inappropriate outputs, highlighting the importance of ethical considerations in
deploying these models.
• Embedding layer: This layer converts words or tokens into high-dimensional vectors,
often referred to as word embeddings. These vectors capture the semantic meaning of
words and their relationships in the model’s training data.
• Transformer architecture: The transformer architecture is the backbone of most large
language models. It consists of multiple layers of attention mechanisms, feed-forward
neural networks, and layer normalization. Transformers enable the model to capture
long-range dependencies and contextual information efficiently.
• Attention mechanism: Attention mechanisms allow the model to weigh the importance
of different words in a sentence when making predictions. Self-attention mechanisms,
such as the one used in transformers, enable the model to consider contextual informa-
tion from both preceding and following words (the standard formulas for attention and
positional encoding are given after this list).
• Positional encoding: Since transformers do not have built-in positional information,
positional encoding is added to the input embeddings to provide information about the
position of words in a sequence. This helps the model understand the order of words.
• Multi-head attention: Multi-head attention mechanisms allow the model to focus on
different parts of the input sequence simultaneously. This enhances its ability to capture
different types of relationships and dependencies within the text.
• Feed-forward neural networks: Each transformer layer typically contains feed-forward
neural networks. These networks perform transformations on the input data to capture
complex patterns and relationships.
• Layer normalization: Layer normalization is applied after each sublayer within a trans-
former layer. It helps stabilize and speed up the training process by normalizing the
outputs of each sublayer.
• Pre-trained weights: Large language models are pre-trained on massive text corpora,
allowing them to learn the statistical properties of language. These pre-trained weights
are fine-tuned for specific tasks, such as text generation or classification.
• Vocabulary: Models have a fixed vocabulary size, which includes a set of words and
subword tokens. Tokenizers are used to segment text into these vocabulary elements.
• Output layer: Depending on the specific task, a large language model may have differ-
ent output layers. For text generation tasks, it typically includes a softmax layer that
generates probabilities over the vocabulary for the next word.
• Fine-tuning mechanism: Large language models can be fine-tuned on specific down-
stream tasks, such as translation, summarization, or question answering. Fine-tuning
adjusts the pre-trained weights to perform well on these tasks.
• Loss function: During training, a loss function is used to measure the model’s predic-
tion error compared to the ground truth. Common loss functions include cross-entropy
for classification tasks and mean squared error for regression tasks.
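For reference, two of these components are commonly written as follows; these are standard formulas from the transformer literature, reproduced here only for convenience. Scaled dot-product attention:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V

and the sinusoidal positional encoding:

PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)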
[Figure: key components of large language models (LLMs): embedding layer, transformer architecture, attention mechanism, positional encoding, multi-head attention, feed-forward neural networks, pre-trained weights, vocabulary, and output layer.]
Let’s consider a scenario where we possess an input sequence labelled as X = [x1, x2,
x3, …] and, simultaneously, an output sequence labelled as Y = [y1, y2, y3, y4, …].
The inputs and outputs are sequences drawn from distinct input and output symbol
vocabularies. It’s worth noting that the input and output symbol sets may differ, and they often do.
[Figure: encoder-decoder architecture: the encoder maps the input sequence x1, …, xn to intermediate representations zi; the decoder consumes zi together with the previous output yi-1 to produce the output sequence y1, …, ym.]
Here are some scenarios where “sequence-to-sequence” (s2s) models are applied:
• In language translation, the input sequence is in one language, and the output is in
another.
• For speech translation, the inputs consist of audio samples, and the outputs are textual
transcriptions.
• In the context of video description, the inputs comprise video frames, while the outputs
are captions.
In the context of sequence learning, the most prevalent neural network architecture is
the RNN, or recurrent neural network. Consequently, it’s quite common to see the use of
RNNs in the most typical sequence-to-sequence (s2s) neural networks. A standard s2s
RNN maintains the traditional encoder and decoder components, but in this context, both
the encoder and decoder are constructed using RNNs. Typically, these RNNs are built
upon architectures such as long short-term memory (LSTM) or gated recurrent unit
(GRU), and they often consist of multiple layers.
In the depicted figure, the RNN encoder comprises two layers. One is an embedding layer, which converts each input symbol into a fixed-size dense vector; this step is known as input embedding. The encoder’s final output is a sequence vector denoted as Se.
As for the decoder shown in the figure, it also consists of two layers. The lighter-green
layer represents the RNN, while the darker-green layer corresponds to the output network.
In the decoder’s operation, it takes the vector Se produced by the encoder and the previous
decoder output embedding as inputs, subsequently generating one output symbol at a time.
Beyond the basic components, several optimizations have given rise to additional vital
components within the Seq2seq model.
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

# Example hyperparameters (illustrative values; adjust to your data and vocabulary).
input_vocab_size = 10000   # size of the source-language vocabulary
output_vocab_size = 10000  # size of the target-language vocabulary
embedding_dim = 256        # dimensionality of the token embeddings
hidden_units = 512         # number of LSTM units

# Encoder: embeds the source tokens and keeps only the final LSTM states.
encoder_input = Input(shape=(None,))
encoder_embedding = Embedding(input_vocab_size, embedding_dim)(encoder_input)
encoder_lstm = LSTM(hidden_units, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c]

# Decoder: initialized with the encoder states, predicts the target sequence.
decoder_input = Input(shape=(None,))
decoder_embedding = Embedding(output_vocab_size, embedding_dim)(decoder_input)
decoder_lstm = LSTM(hidden_units, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = Dense(output_vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Training model: maps (source sequence, shifted target sequence) to target tokens.
model = Model([encoder_input, decoder_input], decoder_outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# Fragment of the inference (decoding) loop: sampled_token_index, h, and c come
# from the preceding, elided decoding steps. The loop feeds the last sampled token
# and the current decoder states back into the decoder until an end-of-sequence
# token or a length limit is reached.
stop_condition = False
output_seq = []
target_seq = np.array([[sampled_token_index]])  # id of the most recently sampled token
decoder_states = [h, c]                          # current decoder hidden and cell states
print(output_seq)
In the encoder-decoder architecture, the decoder begins from a critical starting point called the initial state or hidden state. This initial state serves as an intermediate
representation of the input sequence’s context and is normally derived from the final encoding generated by the encoder. It acts as a foundation for the decoder’s subsequent generation of the output sequence.
[Figure: the encoder as a chain of LSTM cells; the per-step outputs are discarded, and only the final hidden state ht and cell state ct are passed on to the decoder.]
The basic mode of operation for the decoder is autoregressive generation. This means
that the decoder generates the output sequence one element at a time, often following a
left-to-right direction. At each time step, the decoder produces a single output element.
During this process, it also considers the information encoded in the previously generated
elements. This autoregressive approach is important for capturing complex dependencies
and correlations that exist within the output sequence. This ensures that the generated
sequence is coherent and contextually accurate.
To facilitate the autoregressive generation process, the decoder maintains a set of hid-
den states. These hidden states maintain the internal representations of the decoder’s
knowledge, and they evolve as the decoder processes each input element. They encapsu-
late information about the context of the input sequence as well as the previously gener-
ated elements in the output sequence. These hidden states are critical for making decisions
regarding the generation of subsequent output elements. At each time step, the decoder
generates an output element, which is typically a symbol or word. To do this, it uses a
probability distribution over the entire vocabulary (or a set of possible output symbols).
This distribution is computed by using the current hidden state, and it assigns probabilities
to each symbol in the vocabulary. The decoder then takes samples from this distribution to
determine the next output element. This probabilistic approach ensures that the decoder’s
output is not deterministic, thus allowing for diversity in generated sequences.
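The loop below is a schematic sketch of this autoregressive, sampling-based generation. Here decoder_step is a hypothetical stand-in that returns random probabilities in place of a real decoder’s output, and the vocabulary size, start/end token ids, and length limit are arbitrary example values.
import numpy as np

VOCAB_SIZE, START_ID, END_ID, MAX_LEN = 1000, 1, 2, 20
rng = np.random.default_rng(0)

def decoder_step(prev_token, state):
    # A real decoder conditions on prev_token and updates state; here we only
    # produce a placeholder probability distribution over the vocabulary.
    logits = rng.normal(size=VOCAB_SIZE)
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the vocabulary
    return probs, state

state = np.zeros(128)          # initial state, normally taken from the encoder
token, output_seq = START_ID, []
for _ in range(MAX_LEN):
    probs, state = decoder_step(token, state)
    token = int(rng.choice(VOCAB_SIZE, p=probs))   # sample the next token
    if token == END_ID:                            # stop at end-of-sequence
        break
    output_seq.append(token)
print(output_seq)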
During the training phase, the decoder is provided with the training data sequences as
a reference. It generates its own sequence and compares it to the reference sequence using
some loss function, for example, cross-entropy loss. The goal is to minimize this loss,
which essentially measures the dissimilarity between the generated and target sequences.
[Figure: the decoder as a chain of LSTM cells initialized with the encoder’s final states ht and ct; the decoder’s own internal states are discarded, and only its step-by-step outputs are kept.]
This optimization process is often performed through backpropagation. It helps the
decoder learn to generate accurate and contextually relevant output sequences. In many
advanced encoder-decoder models, an attention mechanism is used within the decoder.
This mechanism allows the decoder to focus on specific parts of the input sequence when
generating each output element. By considering the relevant input information, the decoder
can significantly improve its performance, especially when dealing with long input
sequences or complex translation tasks.
In summary, the decoder plays a critical role in the encoder-decoder model, as it is
responsible for producing accurate output sequences. Through autoregressive generation,
hidden states, and probability distributions, it navigates the sequence generation process
with the goal of minimizing loss during training. This ensures that the generated sequences
align with the desired output.
Figure 12.5 shows the structure of a decoder network.
12.9 Translation of Highly Repetitive Content
Translating highly repetitive content is the specific challenge of translating text that contains a substantial amount of repeated or duplicated material. This challenge arises in various domains and applications, and it presents both opportunities and difficulties for machine translation systems.
Highly repetitive content consists of text passages, sentences, or phrases that are dupli-
cated numerous times within a document or across documents. Such repetition arises in many settings, for example legal documents with standard clauses, technical manuals
with recurring instructions, or financial reports with repetitive data entries. Translating
such content with machine translation can offer significant efficiency gains. Since the
same or similar content repeats, the system can generate translations once and then reuse
them throughout the document or across similar documents. This minimizes the need for
redundant translation efforts and speeds up the overall translation process. Traditional
machine translation systems, particularly rule-based or statistical ones, may struggle with
highly repetitive content. These systems typically generate translations independently for
each source sentence, which can lead to redundancy in the output. Additionally, they may
not identify repeated content effectively, resulting in multiple translations of the same or
similar phrases.
Neural machine translation (NMT) models have shown advantages in handling highly
repetitive content. NMT models, such as sequence-to-sequence models with attention
mechanisms, can capture context and dependencies within a document. When they
encounter repeated content, they tend to generate consistent translations due to their abil-
ity to consider the broader context.
NMT models can also leverage their memory of previously generated translations.
When they encounter a repeated phrase or sentence, they can recognize it as such and
reuse the translation they produced earlier. This feature significantly improves translation
quality and coherence in documents with repetitive elements. For specialized domains or
applications where repetitive content is common (e.g., legal, medical, or technical transla-
tions), NMT models can be customized and fine-tuned to better handle such content. This
involves training the model on domain-specific parallel data to improve its performance in
these scenarios. While NMT systems excel at handling repetitive content, human post-
editing may still be required, especially when precise and context-specific translations are
crucial. Human translators can review the output, ensure consistency, and make any neces-
sary adjustments.
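To make the idea of reusing translations concrete, here is a small, hypothetical sketch in which identical sentences are translated only once and then served from a cache; translate_sentence is a placeholder for a call into any real MT system.
from functools import lru_cache

@lru_cache(maxsize=None)
def translate_sentence(sentence: str) -> str:
    # Placeholder: in practice this would call an NMT model or a translation API.
    return f"<translation of: {sentence}>"

document = [
    "This clause limits liability.",
    "Press the power button.",
    "This clause limits liability.",   # repeated sentence, translated only once
]
translated = [translate_sentence(s) for s in document]
print(translated)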
In summary, the translation of highly repetitive content is a pertinent challenge in machine translation, and it showcases both the capabilities and advantages of neural
machine translation models. These models are well equipped to handle repetitive elements
efficiently while maintaining translation quality and consistency. However, domain-
specific customization and human oversight may still be valuable in certain contexts to
ensure the highest level of accuracy and coherence in translated content.
Several approaches, from translation-memory-style reuse of earlier translations to the context-aware neural models described above, are used to handle such repetitive content.
12.10 Translation of User-Generated Content
Translating user-generated content is a complex task within the context of machine trans-
lation, primarily due to the varied and informal nature of such content. This category
encompasses a wide range of text found on social media platforms, product review sites,
comments sections, and online forums. One of the foremost challenges lies in the informal
language and slang that users employ in these contexts. Machine translation systems, typi-
cally trained on more formal and structured text, often struggle to accurately capture the
nuances of informal language, requiring a more context-aware approach to translation.
Abbreviations and acronyms are prevalent in user-generated content, assuming a level
of familiarity among users. For machine translation models to provide meaningful transla-
tions, they must possess the ability to decipher these shortened forms and offer translations
that align with the context in which they are used. Emojis and emoticons further compli-
cate the translation process, as these visual elements convey emotions, tone, and context
that may not have direct linguistic equivalents in other languages.
User-generated content often carries with it the nuances of user intent and sentiment.
Users frequently express a wide range of emotions, including positivity, negativity, humor,
sarcasm, or irony. Detecting and faithfully translating these emotions and nuances is a
critical aspect of providing contextually accurate translations.
The multilingual nature of user-generated content poses additional challenges. Online
platforms attract users from diverse linguistic backgrounds who interact in various lan-
guages. Machine translation systems must seamlessly handle translations between a mul-
titude of source and target languages, ensuring that the essence of the content is preserved.
Cultural references, jokes, and idiomatic expressions specific to particular communities or
cultures frequently appear in user-generated content. Translating these references accu-
rately demands a level of cultural awareness and contextual understanding that traditional
machine translation models may struggle to achieve.
Moreover, user-generated content can contain spelling mistakes, grammatical errors, or
unconventional sentence structures. Machine translation models need to be robust enough
to provide comprehensible translations while accommodating these errors and preserving
the original intent.
Privacy and data sensitivity concerns often arise when translating user-generated con-
tent. Some content may contain personal or sensitive information, necessitating the careful
handling of such data in compliance with privacy regulations. Addressing these challenges
requires ongoing research and development efforts to improve machine translation sys-
tems, especially in the context of user-generated content. This includes the creation of
models that can effectively handle informal language and accurately capture sentiment
and context. Also, the integration of sentiment analysis and natural language understand-
ing techniques can enhance the ability to grasp user intent and emotions, thus leading to
more context-aware translations. Customizing and optimizing the models for specific plat-
forms or user communities also plays a crucial role in achieving higher translation quality.
Thus, we can say that translation of user-generated content remains an evolving and
dynamic area of focus within machine translation.
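As a simple, hypothetical illustration of making such content easier to translate, informal abbreviations can be expanded before the text is handed to an MT system. The abbreviation table below is invented for the example and is far from exhaustive.
import re

# Hypothetical pre-processing step: expand common abbreviations and slang before
# sending user-generated text to a translation model. The table is illustrative.
ABBREVIATIONS = {
    "idk": "I don't know",
    "brb": "be right back",
    "thx": "thanks",
    "u": "you",
}

def normalize(text: str) -> str:
    def expand(match):
        word = match.group(0)
        return ABBREVIATIONS.get(word.lower(), word)
    return re.sub(r"\b\w+\b", expand, text)

print(normalize("thx, idk if u saw my msg"))
# -> "thanks, I don't know if you saw my msg"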
12.11 Online Customer Service
Online customer service has become an important part of modern business, enabling companies to connect with customers throughout the world. However, language barriers can
impede effective communication. In this case study, we will explore how a company can
use deep learning-based machine translation to enhance its online customer service and
improve customer experiences across language boundaries.
In this case study, the client is a global e-commerce platform with customers and sellers
from different linguistic backgrounds. The client faces the challenge of providing efficient customer support to users who communicate in multiple languages; language barriers hinder the timely resolution of issues and hurt customer satisfaction.
Following are some of the challenges of this case study:
• Multilingual support: The company needs to handle customer queries and resolve issues in many languages, including English, Spanish, Mandarin, French, and more.
• Real-time communication: Many customer service interactions require real-time
responses, which makes manual translation impractical.
• Contextual understanding: Accurate translations that preserve the context and intent of
customer messages are essential to avoid misunderstandings.
• Scalability: The solution should be scalable to efficiently handle a high volume of cus-
tomer inquiries across various languages.
• Data privacy: Ensuring the privacy and security of customer data during the translation
process is a critical concern.
To address these challenges, the company adopts the following approach:
• Data collection: The company collects a large dataset of customer service interactions
in multiple languages. This dataset serves as the foundation to train the machine trans-
lation model.
• Model selection: The company adopts a state-of-the-art neural machine translation
(NMT) model. This model has demonstrated superior performance as compared to
other models in the same domain.
• Training and fine-tuning: The NMT model is trained using the collected dataset. Fine-
tuning processes include domain adaptation to address customer service queries and
responses.
• Integration with customer service platforms: The NMT model is integrated seamlessly into the company’s customer service platforms, allowing real-time translation of customer inquiries and responses.
• Quality control: The system uses quality control measures to identify and fix translation
errors. Human agents monitor translations and provide feedback to continually improve
the model.
This approach yields the following outcomes:
• Multilingual support: Customer service agents can now efficiently handle inquiries in
multiple languages. This eliminates the language barriers and expands the company’s
global reach.
• Real-time communication: Real-time translation capabilities enable faster response
times that lead to improved customer satisfaction and issue resolution.
• Contextual understanding: The NMT model, with its ability to capture context, ensures
that translated responses accurately convey the intended meaning, which ultimately
reduces misunderstandings.
• Scalability: The solution proves highly scalable and accommodates a growing volume
of customer interactions across languages without compromising quality.
• Data privacy: The company maintains strict data privacy standards, ensuring that cus-
tomer information remains secure throughout the translation process.
We can conclude that by leveraging deep learning-based machine translation, the com-
pany can enhance its online customer service, reduce language barriers, and deliver excep-
tional support to users worldwide. The solution exemplifies how advanced NMT models
can drive improved customer experiences and foster global business expansion in an
increasingly interconnected world.
Implementing a complete deep learning-based machine translation system, as described
in the case study, would require a substantial amount of code and resources. However, we
have provided you with a simplified Python code snippet that demonstrates how to use the
popular transformers library by Hugging Face to perform machine translation using a pre-
trained model.
import torch
from transformers import MarianTokenizer, MarianMTModel
To run this code, first install the transformers library if you haven’t already. The snippet then loads a pre-trained Marian NMT model for English-to-French translation, defines an input text to translate, tokenizes it with the model’s tokenizer, uses the model to generate the translation, and finally decodes and prints the translated text.
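A fuller, runnable version of this snippet might look as follows, assuming the publicly available English-to-French checkpoint Helsinki-NLP/opus-mt-en-fr and an example customer-service sentence.
import torch
from transformers import MarianTokenizer, MarianMTModel

# Assumed checkpoint: the publicly available English-to-French Marian model.
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Input text to translate.
text = "How can I help you with your order today?"

# Tokenize, generate, and decode.
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs)
translation = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(translation)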
Deep learning-based machine translation has made significant progress, but it still faces
several challenges. One significant challenge is the resource-intensive nature of training
deep learning models. To tackle this, researchers and organizations can leverage cloud
computing services that provide access to powerful GPUs or TPUs, making it more cost-
effective and accessible. Transfer learning is another approach where pre-trained models,
such as those for language understanding, can be fine-tuned for specific translation tasks,
reducing the amount of training data required. The quality and quantity of training data
pose substantial challenges. Data augmentation techniques like back-translation, parallel
data synthesis, or data filtering can help augment training data, making it more diverse and
representative. Crowdsourcing translation efforts can also be valuable, especially for low-
resource language pairs, allowing the crowd to create translations or validate data quality.
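As a sketch of back-translation, target-language monolingual sentences can be machine-translated back into the source language to create synthetic parallel pairs. The example below assumes the publicly available French-to-English Marian checkpoint Helsinki-NLP/opus-mt-fr-en; in practice the synthetic pairs would then be mixed into the training data.
from transformers import MarianTokenizer, MarianMTModel

# Back-translation sketch: turn monolingual French sentences into synthetic
# (English source, French target) training pairs using a reverse-direction model.
model_name = "Helsinki-NLP/opus-mt-fr-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

french_monolingual = ["Le colis est arrivé en retard.", "Merci pour votre aide."]
inputs = tokenizer(french_monolingual, return_tensors="pt", padding=True)
outputs = model.generate(**inputs)
synthetic_english = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Each (synthetic English, original French) pair can augment the en->fr training set.
synthetic_pairs = list(zip(synthetic_english, french_monolingual))
print(synthetic_pairs)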
Out-of-distribution data remains a concern, especially when translation models encoun-
ter content or domains not well represented in the training data. Domain adaptation,
achieved through fine-tuning models on domain-specific data, can mitigate this issue.
Additionally, generating artificial data that simulates out-of-distribution scenarios can
help models handle such situations more effectively. Handling rare words or out-of-
vocabulary terms is an ongoing challenge. Subword tokenization models like byte-pair
encoding (BPE) or SentencePiece can help in segmenting words into subword units, aid-
ing in the handling of rare or complex terms. Creating custom dictionaries for domain-
specific terminology can also improve translation quality.
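To see what subword segmentation looks like in practice, the short sketch below reuses the SentencePiece-based Marian tokenizer from the earlier example; the exact subword splits will depend on that model’s learned vocabulary.
from transformers import MarianTokenizer

# Sketch of subword segmentation: rare or unseen words are split into smaller
# subword units rather than being collapsed into a single unknown token.
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
for word in ["translation", "electroencephalography"]:
    print(word, "->", tokenizer.tokenize(word))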
Interpretable models are essential for gaining insights into model decisions, but deep
learning models often lack interpretability. Visualizing attention mechanisms within
models can shed light on why certain translations are generated. Another approach
involves combining automated translation with rule-based or human post-editing, ensur-
ing both quality and interpretability. Maintaining context and coherence in translations
is a complex challenge. Transformer-based models, which consider the entire context of the input when generating each output token, are better positioned to preserve coherence, especially across longer passages.
12.12 Summary
The chapter begins by introducing deep learning techniques in the context of machine
translation (MT). Deep learning has revolutionized MT by enabling more effective and
context-aware translations. The chapter explores translation models, a fundamental com-
ponent of deep learning for MT. These models are essential for mapping source language
sentences to target language sentences effectively. Another critical component discussed is
reordering models. These models help address the challenges posed by differences in word
order and structure between languages, a common issue in MT. Another important topic
covered is the translation of user-generated content. This section explores how deep learn-
ing techniques can adapt to the informal and diverse language found in user-generated
content. The chapter concludes by presenting a case study related to online customer ser-
vice, showcasing practical applications of deep learning-based MT in real-world scenarios.
12.13 Exercise
Q2: How does the ambiguity in one language affect the translation process in another
language?
Q3: Discuss different translation models that can be used to translate from one language
to another.
Q4: How do reordering models work? What are some major challenges faced by these models?
Q5: Provide some of the challenges that a machine translation model may face during the
translation process.
Q7: Write a stepwise explanation of the encoder-decoder model for machine translation.
Q8: Consider the following sentences written in French. Apply a translator to convert them into English:
• Salut.
• Comment vas-tu?
• Pourriez-vous s’il vous plaît m’aider à soulever ce poids?
• C’est trop lourd.
• Merci.
Q9: Consider the following paragraph and translate it into French using a translator:
The solar system is a dynamic and complex system, and it has been the subject of
extensive scientific study and exploration. Space missions, telescopes, and other
observational tools have provided valuable insights into the nature and history of
our solar system. It’s a fascinating area of study that continues to reveal new discov-
eries about the universe we inhabit.
Recommended Reading
• Learning Deep Learning: Theory and Practice of Neural Networks, Computer Vision,
Natural Language Processing, and Transformers Using TensorFlow by Magnus Ekman
Publisher: Addison-Wesley Professional
Publication Year: 2021
After introducing the building blocks of deep neural networks, such as artificial neu-
rons and fully connected, convolutional, and recurrent layers, Magnus Ekman shows
how to use them to build advanced architectures, including the transformer, and
describes how to use these concepts to build modern computer vision and natural lan-
guage processing (NLP) networks, including Mask R-CNN, GPT, and BERT. It explains
how a natural language translator works and the system that generates image descrip-
tions in natural language.
• Quick Start Guide to Large Language Models: Strategies and Best Practices for Using
ChatGPT and Other LLMs by Sinan Ozdemir
Publisher: Addison-Wesley Professional
Publication Year: 2023
The development of large language models (LLMs) has revolutionized the field of natu-
ral language processing in recent years. Models like BERT, T5, and ChatGPT have
demonstrated unprecedented performance on a wide range of NLP tasks, from text
classification to machine translation. Despite their impressive performance, the use of
LLMs remains a challenge for many professionals. The large size of these models,
combined with a lack of understanding of their inner workings, has made it difficult for
professionals to use and optimize them effectively to meet their specific needs.