Intent Detection Report (1)

Uploaded by

Fardinul Hoque

B.Sc. Engineering Thesis Defense


Intent Detection of Users Query for AI Chatbots and Virtual Assistants

Rezaul Karim (ASH1801035M)


Fardinul Hoque (ASH1901006)

Under The Supervision of


Ratnadip Kuri
Assistant Professor, Dept. of CSTE, NSTU

Department of Computer Science and Telecommunication Engineering


Noakhali Science and Technology University
14 March, 2024
Intent Detection of Users Query for AI Chatbots and Virtual Assistants
Table of Contents
Acknowledgement
Abstract
Introduction
1.1 Statement of the Problem
1.2 Query Intent Detection
1.3 Objectives
1.4 Multinomial Naive Bayes
1.5 Linear Support Vector Machine (SVM) Classifier
1.6 Random Forest Classifier Algorithm
1.7 Gradient Boosting Algorithm
1.8 Bi-LSTM (Bidirectional Long Short-Term Memory) Algorithm
Literature Review
Methodology
3.1 Dataset Collection and Preparation
3.2 Justification for Dataset Collection Methodology
3.3 Data Preprocessing
3.4 Labeling and Encoding
3.5 Word Frequency
3.6 Train-Test Split
3.7 Text Vectorization - TF-IDF
Result and Analysis
4.1 Result Parameters
4.2 Linear SVC
4.3 Multinomial Naive Bayes
4.4 Random Forest
4.5 Gradient Boosting
4.6 Bi-LSTM
4.7 Analysis
Conclusion & Future work
5.1 Future work
5.2 Conclusion
References
Acknowledgement
We begin by extending our heartfelt gratitude to the guiding force that shapes our journey, the
unwavering support of Allah. Our sincere appreciation is also directed towards our mentor,
Ratnadip Kuri, Assistant Professor, Department of Computer Science and Telecommunication
Engineering (CSTE), Noakhali Science and Technology University. His continuous guidance and
wisdom have been instrumental in shaping the trajectory of our research endeavors. We express our
deepest thanks to Ratnadip Kuri for his invaluable insights, meticulous feedback, and patient
encouragement throughout the course of this study. His dedication to fostering an environment of
intellectual curiosity has been a beacon that illuminated our path. We are especially grateful for his
constructive critique, which significantly enhanced the quality of our work. Our academic journey
has been enriched by the camaraderie of our peers, and we extend our thanks to them for their
collaborative spirit and shared experiences. Special appreciation goes to our friends who stood by
us with words of encouragement, constructive criticism, and shared moments of joy and challenge.
Their love and belief in our potential have been the driving force behind our success.
Rezaul Karim (ASH1801035M)
Fardinul Hoque (ASH1901006)
Abstract
Intent detection, also known as query intent detection or intent classification, is a natural language
processing (NLP) task aimed at determining the underlying intention or purpose behind a given text
input, usually in the form of a query or sentence. The goal is to categorize the input into predefined
classes or categories that represent different intents or actions. Our approach addresses challenges
in traditional datasets by following the WANLI approach, a collaborative effort with GPT-3, and
shows strong performance and generalization on diverse queries. Unlike previous large-scale
crowd-sourced datasets, our approach goes beyond recurrent patterns. Not only are the results
significant, but our comprehensive approach to dataset generation is as well. Among the algorithms
evaluated, LinearSVC and Multinomial Naive Bayes demonstrate commendable accuracy; Random
Forest exhibits robustness; Gradient Boosting showcases diverse strengths; and Bi-LSTM emerges
as the standout performer, underscoring the depth and variety of our algorithmic exploration.

Keywords: User Intent Detection, Chatbots, Virtual Assistants, Natural Language Processing
(NLP), Machine Learning, Tokenization, Bi-directional LSTM.
Chapter 1

Introduction

It is impossible to overestimate the crucial role query intent detection plays in improving user
interactions with conversational agents, chatbots, and virtual assistants in the constantly changing
field of natural language processing (NLP). In order to uncover new dimensions and improve upon
preexisting paradigms, this thesis sets out on an ambitious and inventive journey into the depths of
query intent detection. The pursuit of accuracy and efficacy in interpreting user intent becomes the
lodestar directing our intellectual journey as we face the complexities of human communication.
The foundation of this study is the development of a dataset that meets the particular requirements
of query intent identification. In the absence of a common dataset, our method makes use of prompt
engineering strategies developed in partnership with ChatGPT. The outcome is an extensive and
varied set of more than 500,000 carefully classified queries that span ten categories of user intent:
informational, navigational, transactional, troubleshooting and support, appointment and
reservation, educational, entertainment, health and wellness, personal, and product or service
inquiries. The dataset's depth and diversity, drawn from 935 industries, provide a comprehensive
picture of the many difficulties that chatbots encounter in real-world scenarios.
Our approach to creating datasets is motivated by the latest developments in Natural Language
Inference (NLI) problems. Our approach is in line with a paradigm that goes beyond the constraints
of traditional crowdsourced datasets because it places a strong emphasis on Worker-AI
collaboration. By using a collaborative pipeline that combines the generative power of GPT-3 with
the evaluative expertise of human annotators, we actively contribute to the generation of
challenging examples, which promotes model robustness and generalization, while also addressing
the shortcomings of in-domain performance.
As we explore the complex terrain of query intent detection, the research develops into a thorough
examination of several machine learning systems. These algorithms are examined for performance,
accuracy, and overall efficacy. They each have unique capabilities and considerations. The
investigation becomes a symphony of algorithmic complexities, ranging from the dependability of
the Linear Support Vector Classifier (LinearSVC) to the impressive performance of Multinomial
Naive Bayes, the resilience of Random Forest Classifier, the varied advantages of Gradient
Boosting, and the remarkable accuracy of Bi-directional Long Short-Term Memory (Bi-LSTM).
Beyond the algorithmic analysis, our thesis emerges as a story that moves through the phases of
dataset generation, algorithm training, and thorough assessment. The complex nature of query intent
detection is highlighted by the dependable performances of LinearSVC and Multinomial Naive
Bayes, the resilience displayed by Random Forest, the subtle strengths displayed by Gradient
Boosting, and the unmatched accuracy of Bi-LSTM. Our contribution goes beyond algorithmic
investigation; it encompasses a comprehensive strategy that combines advanced methods with a
dataset that has been carefully selected.
The path from data collection—which is distinguished by the innovative WANLI dataset
construction methodology—to outcome attainment bears witness to our steadfast dedication to
expanding the boundaries of NLP research. This thesis offers a scientific investigation of algorithms
as well as a lighthouse pointing the way for further study, creativity, and the ongoing development
of technology involved in user-centered, contextually aware, and emotionally intelligent
conversations. The relevance of our thesis, as we set out on this intellectual journey, is not just in
the remarkable performance of the algorithms, but also in the spirit of innovation that drove us to
completely reimagine the field of query intent detection. This thesis adds significantly to the
continuous story of technology advancement and is far more than just an academic project.

1.1 Statement of the Problem


The communication process between humans is complex by nature, involving multiple levels of
ambiguity, hidden indications and situational complexities. Traditional chatbots, which are mostly
built on rule-based or keyword-driven approaches, face considerable challenges when it comes to
correctly determining the true intent behind user inputs. This constraint results in responses that may
be incorrect in context or even unpleasant. The main challenge is developing intelligent systems
that go beyond simple word interpretation and actually understand the different objectives that
users have, whether informational, transactional, navigational, or otherwise.
Furthermore, the current body of work in the subject frequently fails to adequately address the need
for novel approaches to dataset development that go beyond the limitations of conventional
datasets. This thesis synthesizes previous research, suggests new methods, and presents a carefully
selected dataset in an effort to address these important challenges. The main issues include
improving the accuracy and efficiency of query intent detection, getting past the constraints of the
available datasets, and expanding the state of the art in natural language processing.

1.2 Query Intent Detection


One major challenge in natural language processing is interpreting user intent from textual
questions, particularly in the context of chatbots and virtual assistants. The development of robust
intent detection models is hampered by the absence of a standard dataset that can be customized for
different types of inquiries. The limited scope and recurrent nature of conventional crowdsourced
datasets hamper the generalization capabilities of machine learning algorithms. To create systems
that can have more emotionally intelligent, contextually aware, and user-centered discussions, it
becomes vital to close this gap.

1.3 Objectives
The overall objective of this thesis is to advance the field of query intent detection through a
multidimensional journey. The main goals include a thorough review of previous studies, the
production of creative datasets, and a thorough examination of machine learning methods. The aims
are designed to provide useful insights toward boosting the user experience using chatbots and
virtual assistants, ranging from improving model generalization to evaluating real-world
applicability. The work goes beyond algorithmic investigation and paves the way for further
developments in natural language processing.
Review and Synthesize: Conduct a thorough review of the body of work already done on intent
identification and natural language processing.
Dataset Collection: Carefully select a large dataset that is suited to research requirements. The
dataset includes more than 500,000 inquiries of ten different kinds, which represent the variety of
user interactions that occur in real-world situations.
New Dataset generation approach: To overcome the drawbacks of conventional crowdsourced
datasets and guarantee strong model generalization, implement a novel dataset generation approach
that is motivated by current developments in Natural Language Inference (NLI) tasks.
Algorithmic Analysis: Examine various machine learning techniques for query intent
identification, such as Bi-LSTM, Random Forest Classifier, Linear Support Vector Classifier
(LinearSVC), Gradient Boosting, and Multinomial Naive Bayes.
Performance Evaluation: Analyze each algorithm's performance using the following metrics:
accuracy, F1-score, confusion matrices, and multiclass ROC curves. These metrics reveal the
algorithms' advantages and disadvantages as well as how well suited they are for the job.
Model Robustness and Generalization: In order to create efficient chatbot applications, examine
how well machine learning models can generalize to a variety of query types.
Future Work Exploration: Examine potential paths for further study and development, such as
handling multimodal data, explainability, real-time query intent recognition, domain-specific query
fine-tuning, improved dataset refining, and sophisticated model architectures.
User Feedback Integration: Provide methods for integrating user feedback into the training of the
model to guarantee ongoing enhancement and adjustment to changing language trends and user
preferences.
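The evaluation metrics named in the objectives above can be illustrated with a small sketch. It is only a minimal illustration, not the evaluation code used in this thesis: the three intent labels and the predictions are made-up examples, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true intents, columns are predicted intents."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix

def accuracy(matrix):
    """Share of queries on the diagonal (correctly classified)."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        fp = sum(row[k] for row in matrix) - tp   # predicted k, truly another class
        fn = sum(matrix[k]) - tp                  # truly k, predicted another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels and predictions, for illustration only.
labels = ["informational", "navigational", "transactional"]
y_true = ["informational", "informational", "navigational",
          "navigational", "transactional", "transactional"]
y_pred = ["informational", "navigational", "navigational",
          "navigational", "transactional", "informational"]

matrix = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(matrix)
f1 = macro_f1(matrix)
```

Macro averaging gives every intent class the same weight, which matters when some intents are much rarer than others.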

1.4 Multinomial Naive Bayes


Multinomial Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem, with
an assumption of independence between features. It is particularly suited for text classification
tasks, where the data can be represented as word frequency vectors.
Multinomial Naive Bayes assumes that the occurrence of a particular term in a document is
independent of the occurrences of other terms [6]. While this might not be strictly true in natural
language, the algorithm often performs well in practice. It works well with datasets where features
are discrete and represent the frequency of terms (words) in a document. This makes it suitable for
text data. It is computationally efficient and simple to implement, making it a good choice for large
datasets with high dimensionality.
In the context of chatbots and virtual assistants, user queries are often represented as text.
Multinomial Naive Bayes is well-suited for processing textual data, making it an appropriate choice
for understanding user intent. The dataset structure includes user queries and their corresponding
intent labels. By utilizing the Multinomial Naive Bayes algorithm, we can model the frequency of
words in these queries, capturing essential information about the language used to express different
intents. While the independence assumption may not perfectly hold in natural language, the
algorithm's ability to handle features independently is beneficial for understanding user intent in
diverse contexts. The simplicity and efficiency of Multinomial Naive Bayes make it suitable for
quick prototyping and experimentation. In the development of chatbots, where rapid iterations are
common, this efficiency can be advantageous. The probabilistic nature of Naive Bayes allows for
easy interpretation of results. Understanding the likelihood of a query belonging to a specific intent
category provides transparency in the model's decision-making process. While Multinomial Naive
Bayes is effective, it may not capture complex relationships in language and context as effectively
as more advanced algorithms. However, its simplicity and efficiency often make it a good starting
point for intent detection tasks, and it can serve as a baseline model for comparison with more
sophisticated approaches.
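The word-frequency model described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the four queries, the two intent labels, whitespace tokenization, and the function names are all hypothetical, and add-one (Laplace) smoothing is assumed.

```python
import math
from collections import Counter, defaultdict

def train_mnb(queries, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per intent."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(queries, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_like = {w: math.log((word_counts[label][w] + alpha) / denom)
                    for w in vocab}
        model[label] = (log_prior, log_like)
    return model

def predict_mnb(model, text):
    """Return the intent with the highest log posterior score."""
    tokens = text.lower().split()
    scores = {}
    for label, (log_prior, log_like) in model.items():
        # Words never seen in training are skipped.
        scores[label] = log_prior + sum(log_like[w] for w in tokens
                                        if w in log_like)
    return max(scores, key=scores.get)

# Made-up training queries, for illustration only.
queries = [
    "book a table for two tonight",
    "reserve a room for friday",
    "my printer will not turn on",
    "the app crashes when i log in",
]
labels = ["reservation", "reservation", "support", "support"]

model = train_mnb(queries, labels)
prediction = predict_mnb(model, "reserve a table for friday")
```

Summing log probabilities word by word is exactly where the independence assumption enters: each word contributes to the score without regard to its neighbours.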

1.5 Linear Support Vector Machine (SVM) Classifier


Support Vector Machine is a supervised machine learning algorithm used for classification and
regression. In the case of linearly separable data, Linear SVM is a variant that aims to find the
optimal hyperplane that best separates the classes.
Linear SVM seeks to find a hyperplane that maximizes the margin between different classes. The
margin is the distance between the hyperplane and the nearest data points from each class. SVMs
are particularly effective in high-dimensional spaces, making them suitable for text classification
tasks where the feature space is often large (e.g., bag-of-words representation). SVMs are less prone
to overfitting, especially in high-dimensional spaces. This is beneficial when dealing with text data,
as it helps prevent the model from capturing noise.[8]
In the context of intent detection in chatbots, the feature space is often high-dimensional, especially
when using techniques like TF-IDF or word embeddings. Linear SVMs can handle these high-
dimensional spaces efficiently. Linear SVM aims to find the optimal hyperplane that separates
different classes. This is crucial for intent detection, as it helps identify a clear decision boundary
between various user intents. Chatbot datasets are often sparse (many zero values in the feature
matrix, especially in text data). Linear SVM is effective in handling sparse datasets, making it
suitable for natural language processing tasks. A Linear SVM learns a linear decision boundary; for
non-linear relationships in the data, the kernel trick can be applied. However, for problems that are
close to linearly separable in a high-dimensional feature space, as intent detection with TF-IDF
features often is, the linear kernel might be sufficient. Linear SVM
provides a good balance between model complexity and performance. It is more computationally
efficient compared to non-linear SVMs, making it a practical choice for intent detection tasks in
chatbots.
While Linear SVM is effective for many scenarios, it may not capture complex non-linear
relationships as well as some other algorithms. For highly non-linear datasets, experimenting with
non-linear SVM kernels or other advanced models might be necessary. However, Linear SVM
serves as a robust and interpretable model, especially for tasks like intent detection in chatbots.
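The margin-maximizing idea can be sketched as stochastic subgradient descent on the regularized hinge loss. This is an illustrative sketch only: it is not the solver behind LinearSVC, it handles two classes rather than ten, and the vocabulary, word counts, and hyperparameter values are assumptions chosen for the example.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Stochastic subgradient descent on lam/2 * ||w||^2 + hinge loss.

    Labels in y must be +1 or -1.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin or misclassified: the hinge term adds -yi * xi.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with enough margin: only the regularizer shrinks w.
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict_svm(w, b, x):
    """Classify by the side of the hyperplane the point falls on."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical word counts over the vocabulary ["book", "reserve", "error", "crash"].
X = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0],
     [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 1]]
y = [1, 1, 1, -1, -1, -1]  # +1 = reservation intent, -1 = support intent

w, b = train_linear_svm(X, y)
```

Only samples with a margin below 1 pull on the weights, which is why the learned boundary depends on the points nearest to it.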

1.6 Random Forest Classifier Algorithm


Random Forest is an ensemble learning method used for classification, regression, and other tasks.
It builds multiple decision trees during training and merges them together to get a more accurate
and stable prediction.
Random Forest consists of an ensemble (collection) of decision trees. Each tree is constructed
independently based on a random subset of the training data and features. The algorithm uses a
technique called bagging, where each tree is trained on a bootstrapped sample (randomly sampled
with replacement) from the original dataset. This promotes diversity among the trees. For each split
in a decision tree, only a random subset of features is considered. This feature randomization helps
to decorrelate the trees and reduces overfitting. In classification tasks, each tree "votes" for a class,
and the class with the majority of votes becomes the final prediction. In regression tasks, the
average of predictions is taken. Random Forest is less prone to overfitting compared to individual
decision trees. The ensemble nature helps in generalizing well to new, unseen data.[3]
Chatbot datasets, especially when using text data, can have a high-dimensional feature space.
Random Forest can effectively handle such high-dimensional data and identify relevant features for
intent detection. Chatbot datasets might contain noisy or irrelevant features. Random Forest is
fairly robust to noisy data because averaging over many trees, each split on a random subset of
features, dilutes the influence of less informative features. Random Forest is capable of capturing
non-linear relationships in the data. This is
beneficial when dealing with the varied and nuanced language used in user queries, allowing the
model to discern complex patterns. The ensemble approach, along with bootstrapping and feature
randomization, reduces the risk of overfitting. This is crucial in chatbot intent detection, as the
model needs to generalize well to diverse user inputs. While Random Forest is an ensemble of
decision trees, it still provides a degree of interpretability. Users can understand the importance of
different features in determining intent, aiding in model transparency.
Random Forest is a versatile algorithm suitable for various datasets, including those with high
dimensionality and non-linearity. However, it might not capture very subtle relationships present in
the data, which more complex models might address. In the context of chatbot intent detection,
Random Forest strikes a balance between performance, interpretability, and resistance to overfitting.
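Bagging, feature randomization, and majority voting can be sketched with one-split trees (decision stumps) standing in for full decision trees. This is a simplified illustration, not the configuration used in this thesis: the vocabulary, word counts, number of trees, and function names are assumptions.

```python
import random
from collections import Counter

def fit_stump(X, y, feature_ids):
    """Pick the best absent/present split among the candidate features."""
    fallback = Counter(y).most_common(1)[0][0]
    best = None
    for f in feature_ids:
        left = [lab for row, lab in zip(X, y) if row[f] == 0]
        right = [lab for row, lab in zip(X, y) if row[f] != 0]
        left_label = Counter(left).most_common(1)[0][0] if left else fallback
        right_label = Counter(right).most_common(1)[0][0] if right else fallback
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if best is None or errors < best[0]:
            best = (errors, f, left_label, right_label)
    return best[1:]

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    k = max(1, int(d ** 0.5))  # features considered per split
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample with replacement
        feats = rng.sample(range(d), k)             # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Each stump votes; the majority class wins."""
    votes = Counter(left if row[f] == 0 else right for f, left, right in forest)
    return votes.most_common(1)[0][0]

# Hypothetical word counts over the vocabulary ["book", "reserve", "error", "crash"].
X = [[2, 1, 0, 0], [1, 1, 0, 0], [1, 2, 0, 0],
     [0, 0, 1, 1], [0, 0, 2, 1], [0, 0, 1, 2]]
y = ["reservation"] * 3 + ["support"] * 3

forest = fit_forest(X, y)
```

Because every stump sees a different bootstrap sample and feature subset, individual mistakes tend to be outvoted by the rest of the ensemble.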

1.7 Gradient Boosting Algorithm


Gradient Boosting is an ensemble learning technique that combines the predictions of multiple
weak learners, typically decision trees, to create a strong predictive model. It builds trees
sequentially, with each tree correcting the errors of the previous ones.
Gradient Boosting builds trees sequentially, where each tree focuses on correcting the errors made
by the ensemble of previous trees. The algorithm performs gradient descent in function space: at
each iteration, a new tree is fitted to the negative gradient of the loss function (the pseudo-residuals)
with respect to the current predictions. The base learners, often decision trees, are referred to as
weak learners. These are typically shallow trees to avoid overfitting. The final prediction is the
cumulative sum of predictions from all trees, with each tree's contribution scaled by a learning rate.
Gradient Boosting controls the complexity of the model and limits overfitting through regularization
such as shrinkage, subsampling, and limits on tree depth.[4]
Chatbot datasets often involve non-linear relationships between user queries and intent labels.
Gradient Boosting is adept at capturing non-linear patterns, providing flexibility in understanding
the nuances of user intent. In chatbot intent detection, where understanding the context and subtle
nuances is crucial, the sequential correction of errors by each tree allows the model to iteratively
improve its predictions, enhancing overall accuracy. Chatbot datasets may contain noise or outliers.
Gradient Boosting can be made more robust to noisy data through regularization (a small learning
rate, subsampling) and robust loss functions, although it can overfit mislabeled instances if left
unregularized. Gradient Boosting provides insights into feature importance, helping
understand which aspects of user queries contribute most to predicting intent. This interpretability is
valuable in the context of chatbots. Chatbot datasets often involve categorical features, such as
different types of queries. Some Gradient Boosting implementations handle categorical features
natively, which reduces the need for extensive pre-processing. Gradient Boosting is effective in
optimizing accuracy. In chatbot
applications, accurately identifying user intent is crucial for providing relevant and helpful
responses.
While Gradient Boosting is a powerful algorithm, it might be computationally more expensive
compared to other methods, and hyperparameter tuning is crucial for optimal performance.
However, its ability to handle non-linearity, sequential learning, and feature importance analysis
makes it a suitable choice for intent detection in chatbot applications.
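The sequential error correction can be sketched with regression stumps fitted to residuals. This is a minimal sketch under simplifying assumptions, not the model trained in this thesis: it uses squared-error loss (for which the negative gradient is simply the residual), a single made-up numeric feature, binary 0/1 targets, and hypothetical function names.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on a single numeric feature."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def fit_gradient_boosting(xs, ys, n_rounds=20, learning_rate=0.3):
    base = sum(ys) / len(ys)  # initial constant prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 1/2 * (y - p)^2 with respect to p is y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        preds = [p + learning_rate * (left_value if x <= threshold else right_value)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict_score(model, x):
    """Base prediction plus the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lv if x <= t else rv) for t, lv, rv in stumps)

def mse(model, xs, ys):
    return sum((y - predict_score(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up single feature and binary targets (1 = target intent), for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

model = fit_gradient_boosting(xs, ys)
initial_mse = mse((sum(ys) / len(ys), 0.3, []), xs, ys)
final_mse = mse(model, xs, ys)
```

The learning rate shrinks each stump's contribution, so more rounds are needed but each round is less likely to overfit.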

1.8 Bi-LSTM (Bidirectional Long Short-Term Memory) Algorithm


Bi-LSTM is a type of recurrent neural network (RNN) that incorporates bidirectionality and long
short-term memory units. It is particularly effective in handling sequential data and is widely used
for natural language processing tasks.
Bi-LSTM is designed to process sequential data, making it suitable for tasks where the order of
input elements (such as words in a sentence) is essential. LSTM networks, including Bi-LSTM,
have memory cells that can store and retrieve information over long sequences, mitigating the
vanishing gradient problem in traditional RNNs. Bi-LSTM processes sequences in both forward and
backward directions. This bidirectional approach allows the model to capture contextual
information from both preceding and succeeding elements. By considering context from both
directions, Bi-LSTM can better understand the dependencies and relationships between words in a
sentence, which is crucial for interpreting user queries and intent. Bi-LSTM can handle variable-
length input sequences, making it suitable for natural language processing tasks where the length of
sentences or queries may vary.[5]
In chatbot datasets, user queries often exhibit sequential dependencies where the order of words
influences the overall meaning. Bi-LSTM excels at capturing such dependencies. Intent detection
requires a nuanced understanding of the context within a query. Bi-LSTM, through bidirectional
processing, can capture context effectively, enhancing the model's ability to discern user intent.
Chatbot queries can vary in length, and Bi-LSTM's ability to handle variable-length sequences
makes it suitable for capturing intent in queries of different structures and lengths. Bi-LSTM can
generate rich semantic representations of user queries. This is crucial for distinguishing between
intents that might share common keywords but differ in their overall context. Intent detection often
requires understanding long-term dependencies in queries. Bi-LSTM's memory cells enable the
model to capture such dependencies, contributing to more accurate predictions. Bi-LSTM is widely
used in natural language processing tasks due to its effectiveness in handling sequential data.
Chatbot intent detection inherently involves processing natural language, making Bi-LSTM a
suitable choice.
While Bi-LSTM is powerful, it may require more computational resources compared to traditional
machine learning algorithms. Additionally, tuning hyperparameters, such as the number of LSTM
units and layers, is crucial for optimal performance. Nevertheless, the ability of Bi-LSTM to capture
sequential dependencies aligns well with the challenges posed by the structure of chatbot datasets,
making it a valuable choice for intent detection.
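The bidirectional pass can be sketched as two LSTM cells, one reading the query left to right and one right to left, whose final hidden states are concatenated. This is a forward-pass illustration only, with random untrained weights: the embedding values, hidden size, and function names are assumptions, and a real model would learn the weights and add a classification layer on top.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_lstm(input_size, hidden_size, rng):
    """Random weights for the input (i), forget (f), output (o), and candidate (g) gates."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {gate: (matrix(hidden_size, input_size + hidden_size), [0.0] * hidden_size)
            for gate in "ifog"}

def lstm_step(params, x, h, c):
    z = x + h  # list concatenation: current input followed by previous hidden state
    def gate(name, activation):
        weights, bias = params[name]
        return [activation(sum(w * v for w, v in zip(row, z)) + b)
                for row, b in zip(weights, bias)]
    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    g = gate("g", math.tanh)
    # The memory cell keeps part of the old state and adds part of the candidate.
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new

def run_lstm(params, sequence, hidden_size):
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
    return h

def bilstm_encode(forward, backward, sequence, hidden_size):
    """Concatenate the final states of the forward and backward passes."""
    return (run_lstm(forward, sequence, hidden_size)
            + run_lstm(backward, sequence[::-1], hidden_size))

rng = random.Random(0)
hidden_size = 4
forward = init_lstm(3, hidden_size, rng)
backward = init_lstm(3, hidden_size, rng)

# Made-up 3-dimensional word embeddings for a four-word and a two-word query.
long_query = [[0.1, 0.3, -0.2], [0.0, -0.4, 0.5], [0.7, 0.1, 0.1], [-0.3, 0.2, 0.6]]
short_query = [[0.1, 0.3, -0.2], [0.7, 0.1, 0.1]]

encoding = bilstm_encode(forward, backward, long_query, hidden_size)
short_encoding = bilstm_encode(forward, backward, short_query, hidden_size)
```

Both queries map to a vector of the same size even though their lengths differ, which is what lets a fixed classifier layer sit on top of variable-length input.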
Chapter 2

Literature Review

Thanks to a variety of approaches, the field of user intent categorization has advanced significantly,
with each approach offering a unique perspective and tackling a different set of difficulties. K-
means clustering, in particular, has proven to be an effective method with remarkable accuracy
(94%) on datasets of different sizes. This method finds eight kinds of user intent, outperforming
binary tree classification, with information seeking being the most prevalent intent. Although it has
the potential to be used for real-time search engine applications, its limitations are highlighted by
concerns about user representativeness and dependence on transaction logs as the data source[6].
Initiatives such as "Open Intent Discovery through Unsupervised Semantic Clustering and
Dependency Parsing"[11] have also investigated unsupervised learning techniques for user intent
discovery. Similar unsupervised learning techniques are explored in this work, which clarifies the
complexities of user intent extraction from dialogues. Moreover, an analysis has been conducted
comparing transformer-based models for intent detection with K-means clustering[12], highlighting
the ongoing effectiveness of K-means clustering when it comes to user intent categorization. This
work adds to the current discussion on intent detection techniques and offers insightful information
to professionals working in the subject.
New methods for intent recognition and slot tagging have been made possible by the development
of neural networks, as the "Multi-stage Bi-LSTM for Career Chatbot"[7] demonstrates. This novel
design leveraged a Bi-LSTM model in a multi-stage process where intent and slot information
mutually inform each other to reach state-of-the-art outcomes (F1-score >77%). This research
presents a viable path for increasing intent recognition in particular domains, addressing issues like
noisy user queries and non-native speakers. "Joint Slot Filling and Intent Classification with Deep
Learning"[13] examines joint learning techniques that use deep learning for both intent detection
and slot filling at the same time. Comparably, "A Neural Multi-stage Architecture for Intent
Detection and Slot Filling"[14] examines the effectiveness of such methods by utilizing LSTMs in a
multi-stage neural network design.
In the context of search queries, convolutional neural networks (CNNs) have been used to
determine user intent[2]. This method reduces the requirement for manual feature engineering by
using CNNs to learn semantic representations while treating queries as vectors. The study indicates
that although CNN features are excellent at capturing semantic similarity, more research should
focus on combining them with other techniques and on recurrent neural networks, like Bi-LSTM, to
achieve higher accuracy. In "Self-Attention Networks for Intent Detection"[16], the integration of
self-attention mechanisms with CNNs has been investigated, offering possible advantages over
CNNs operating independently. The present study underscores the dynamic character of intent
detection techniques, which integrate several neural network topologies to achieve maximum
efficacy.
The use of BERT in building a knowledge base chatbot is described in [9], which offers a
thorough structure for responding to information requests and determining which ones are outside
of its purview. The work addresses issues in knowledge base chatbot development by successfully
handling in-scope (IS) queries and detecting out-of-scope (OOS) ones. Suggestions for future
development highlight the possibility for sophisticated methods of natural language generation and
the incorporation of user feedback. The use of neural networks in open-domain chatbots is
investigated in "A Neural Conversational Model for Open Domain Dialog"[17], which highlights the
potential for knowledge-based strategies. Exploring further the subtleties of building chatbots for
task-oriented domains, "Building Effective Dialog Systems for Task-Oriented Domains with
Multi-Domain Dialog State Trackers"[18] emphasizes the crucial significance of knowledge integration.
In conclusion, a wide range of approaches are covered in the literature on user intent categorization,
all of which contribute to the changing field of natural language processing. These methods, which
range from advanced neural architectures to conventional clustering techniques, together influence
the direction of intent detection, providing insightful information and opening the door for new
developments.
Chapter 3

Methodology

3.1 Dataset Collection and Preparation


In the absence of a standardized dataset for query intent detection, a comprehensive approach was
taken to curate a substantial dataset tailored to our research needs. Leveraging prompt engineering
techniques with ChatGPT, an extensive collection of over 500,000 queries was generated. These
queries spanned across ten distinct types, encompassing informational, navigational, transactional,
troubleshooting and support, appointment and reservation, educational, entertainment, health and
wellness, personal, and product or services queries.
To ensure diversity and relevance, these queries were strategically prompted from ChatGPT across
935 sectors. These sectors ranged from 'Online Real Estate Listings' to 'Lighting Fixtures,' 'Hotel
Reservations,' 'Adventure Tour Bookings,' 'Legal Consultations,' and many more. By encompassing
a wide array of sectors, our dataset reflects the rich tapestry of user interactions that a chatbot might
encounter in real-world scenarios.
This meticulously curated dataset serves as the foundation for our research, empowering the
development and training of a robust intent detection model. The incorporation of numerous query
types and sectors enhances the model's ability to generalize across diverse user intents, ensuring its
effectiveness in practical chatbot applications.

3.2 Justification for Dataset Collection Methodology


Our dataset creation method builds on recent Natural Language Inference (NLI) advancements for
robust intent detection research (similar to [10]). We address limitations of traditional crowdsourced
data by leveraging a collaborative pipeline with GPT-3 and human annotators. This tackles
repetitive patterns hindering model generalization. Inspired by the authors' emphasis on challenging
examples, we utilize prompt engineering informed by language model capabilities. The carefully
curated dataset (500,000+ queries across 10 types and 935 sectors) reflects real-world
diversity. Similar to WANLI's success, combining machine-generated examples with human
evaluation strengthens our dataset's reliability and versatility. This approach ensures our model
excels in diverse and challenging queries, crucial for effective chatbots. In short, our innovative
dataset creation method addresses limitations of existing approaches, promoting model
generalization and effectiveness in various user interactions.
"Aligning AI-Generated Text with Human Preferences" [20] explores methods for fine-tuning LLMs
to produce text aligned with human preferences, ensuring the generated content resonates with the
target audience. "Human Evaluation of AI-Generated Text Quality" [21] investigates
how humans perceive the quality of AI-generated text, providing valuable insights for improving
the naturalness and coherence of machine-written content. While traditional human-labeled datasets
remain crucial for NLP research, their limitations in terms of cost, noise, and potential biases can be
mitigated by incorporating AI-generated text. This approach, as demonstrated in works like
"SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets" [22], leverages
LLMs to provide seeds for human annotators, which leads to two benefits. First, the curated dataset
is likely to be less susceptible to biases and inconsistencies than one that relies solely on
web-scraped data. Second, the collaboration between humans and AI introduces a broader range of
writing styles and content, enriching the dataset for various NLP tasks. Incorporating AI-generated
text fosters a collaborative environment, promoting cost-effective and efficient dataset creation
while ensuring the quality and diversity necessary for robust NLP research.
3.3 Data Preprocessing

Ten Distinct Datasets: Different datasets were collected, each focusing on a specific category of
user intent. Examples include informational queries, navigational queries, and transactional queries.
This ensures a diverse representation of user interactions.
Comprehensive DataFrame: The collected datasets were combined into a single comprehensive
DataFrame. This consolidation likely involved merging or concatenating the individual datasets,
creating a unified dataset for analysis and model training.
Text Cleaning Techniques: Text cleaning is crucial for improving the quality of textual data and
enhancing the performance of machine learning models. The following techniques were applied:
Removal of Special Characters: Non-alphanumeric characters, such as punctuation or symbols,
were likely removed to focus on the meaningful words in the text.
Stop Words Removal: Common words (stop words) that don't contribute significantly to the
meaning of the text (e.g., "and," "the," "is") were removed to reduce noise.
Tokenization: The process of breaking down a text into individual words or tokens. This step
facilitates further analysis by treating each word as a separate entity.
Lemmatization: It involves reducing words to their base or root form, considering variations like
plurals or different verb tenses. This helps in standardizing the vocabulary.
Quality Improvement: The overall goal of these text cleaning techniques is to enhance the quality
of the textual data. By removing noise and standardizing the representation of words, the
subsequent analysis and modeling stages can be more effective.
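The cleaning steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in this work: the function name `clean_query` and the small `STOP_WORDS` set are assumptions made for the example, and lemmatization is left out because it needs a dictionary-backed library (for instance NLTK's `WordNetLemmatizer`).

```python
import re

# Toy stop-word list for illustration only; a real pipeline would use a fuller list.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "for", "on"}


def clean_query(text):
    """Lowercase, strip special characters, tokenize, and drop stop words."""
    text = text.lower()
    # Replace every character that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Whitespace tokenization: each remaining word becomes one token.
    tokens = text.split()
    return [token for token in tokens if token not in STOP_WORDS]


print(clean_query("Book a table for two at the nearest Italian restaurant!"))
```

The order matters slightly: lowercasing first lets a single lowercase stop-word list match tokens regardless of how the user capitalized them.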

3.4 Labeling and Encoding


Labeling
Categorical Labels: Each query in the dataset was assigned a categorical label based on its type or
intent. In this case, examples include "informational," "navigational," and potentially other
categories representing different user intents.
Encoding
Label Encoding: Label encoding is a method of converting categorical labels into numerical values.
Each unique categorical label is mapped to a corresponding integer. For instance:

Query Type                      Label
Informational                   0
Navigational                    1
Transactional                   2
Product or Services             3
Troubleshooting and Support     4
Appointment and Reservation     5
Educational                     6
Entertainment                   7
Health and Wellness             8
Personal                        9

Machine learning algorithms typically work with numerical inputs. Label encoding allows the
algorithm to interpret and learn from the categorical labels by representing them as numerical
values. This is crucial for tasks such as classification, where the algorithm needs to predict the
category or intent of a given query.
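As a sketch of the mapping in the table above, an explicit dictionary keeps the integer codes fixed. The names `QUERY_TYPES`, `encode`, and `decode` are illustrative assumptions. An explicit mapping is used here because automatic encoders such as scikit-learn's `LabelEncoder` assign integers in sorted label order, which would not reproduce this particular table.

```python
# Query types in the order of the label table (index = integer label).
QUERY_TYPES = [
    "Informational",
    "Navigational",
    "Transactional",
    "Product or Services",
    "Troubleshooting and Support",
    "Appointment and Reservation",
    "Educational",
    "Entertainment",
    "Health and Wellness",
    "Personal",
]

LABEL_TO_ID = {name: idx for idx, name in enumerate(QUERY_TYPES)}
ID_TO_LABEL = {idx: name for name, idx in LABEL_TO_ID.items()}


def encode(labels):
    """Map query type names to their integer labels."""
    return [LABEL_TO_ID[label] for label in labels]


def decode(ids):
    """Map integer labels back to query type names."""
    return [ID_TO_LABEL[i] for i in ids]


print(encode(["Informational", "Personal", "Educational"]))
```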
3.5 Word Frequency
In our thesis report, we conducted a comprehensive analysis of word frequency after thorough text
pre-processing for ten distinct classes representing various query intents. Each class was
meticulously examined to unveil the most frequently occurring words, providing valuable insights
into the underlying themes and user intentions. For the "Appointments and Reservations" class,
prevalent words such as "schedule," "book," and "consultation" underscore the emphasis on
scheduling and booking activities. In contrast, the "Educational" class prominently features words
like "explain," "types," and "concept," emphasizing a focus on educational content, concepts, and
diverse topics. The "Entertainment" class showcases words like "TV," "recommend," and "suggest,"
indicative of user queries related to entertainment preferences, recommendations, and suggestions.
"Health and Wellness" emphasizes words like "help," "therapy," and "cancer," suggesting a focus on
queries related to health assistance, therapies, and concerns. For the "Informational" class, key
words like "information," "conceptual," and "explain" highlight a quest for informative content,
conceptual understanding, and explanatory details. "Navigational" queries, on the other hand,
revolve around words like "center," "directions," and "nearest," indicating a user's need for
navigational assistance, locations, and directions. In the "Personal" class, frequent words like
"check," "time," and "flight" point to queries associated with personal matters, time management,
and travel arrangements. "Product or Services" class predominantly features words such as "best,"
"recommend," and "smart," reflecting user interest in product recommendations and information on
various services. The "Transactional" class is characterized by words like "book," "private," and
"subscription," suggesting queries related to transactions, bookings, and subscription services.
Lastly, the "Troubleshooting and Support" class includes words like "child," "insurance," and
"troubleshoot," indicative of user queries seeking assistance, troubleshooting guidance, and support.
This detailed analysis of word frequency provides a nuanced understanding of user intent within
each query class, offering valuable information for optimizing and enhancing query intent detection
models.
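A per-class word frequency count of this kind can be sketched with `collections.Counter`. The function name `top_words` and the four toy queries are assumptions for illustration; they are not drawn from the dataset.

```python
from collections import Counter


def top_words(token_lists, labels, n=3):
    """Return the n most frequent tokens for each class label."""
    counts = {}
    for tokens, label in zip(token_lists, labels):
        # One Counter per class accumulates token counts across its queries.
        counts.setdefault(label, Counter()).update(tokens)
    return {
        label: [word for word, _ in counter.most_common(n)]
        for label, counter in counts.items()
    }


# Toy pre-processed queries (already tokenized and cleaned).
token_lists = [
    ["book", "table"],
    ["book", "consultation"],
    ["explain", "concept"],
    ["explain", "types"],
]
labels = ["Appointment and Reservation", "Appointment and Reservation",
          "Educational", "Educational"]

print(top_words(token_lists, labels, n=1))
```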
3.6 Train-Test Split
Purpose
Model Training: The training set is used to teach the machine learning model patterns, relationships,
and trends within the data. During this phase, the model adjusts its parameters to minimize the
difference between its predictions and the actual labels in the training set.
Model Evaluation: The test set is reserved for evaluating the model's performance on new, unseen
data. This helps to estimate how well the model will generalize to real-world scenarios and ensures
that it is not merely memorizing the training data (overfitting).
Splitting Process
Randomization: The dataset is typically randomly shuffled before the split to ensure that both the
training and test sets are representative of the overall data distribution.
Split Ratio: The dataset is divided into two portions based on a predefined ratio, such as 80% for
training and 20% for testing. The exact split ratio can vary based on the size of the dataset and the
specific requirements of the task.
Training Set
Teaching the Model: The training set is used to train the model by providing input data along with
corresponding labels. The model learns to identify patterns and relationships, adjusting its internal
parameters through optimization algorithms like gradient descent.
Test Set
Model Evaluation: The test set remains unseen by the model during training and is used to assess
how well the model generalizes to new instances. This set helps to estimate the model's
performance on real-world, unseen data.
Overfitting Prevention
Detecting Overfitting: The use of a separate test set helps identify if the model has overfit the
training data by performing well on it but poorly on new data.
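Hyperparameter Tuning: Different configurations of the model are evaluated to find the optimal set
of hyperparameters. This comparison should be made on a validation split or through cross-validation
on the training data, so that the test set remains an unbiased estimate of final performance.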
Cross-Validation:
K-Fold Cross-Validation: In addition to a simple train-test split, more advanced techniques like k-
fold cross-validation can be employed to further ensure robust model evaluation. This involves
dividing the dataset into k subsets and performing k iterations, using different subsets as the test set
in each iteration.
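The shuffled split and the k-fold procedure described above can be sketched with the standard library alone. The function names, the fixed seed, and the 80/20 ratio are assumptions for illustration; in practice a library routine would normally be used.

```python
import random


def train_test_split_indices(n, test_ratio=0.2, seed=42):
    """Shuffle the indices 0..n-1 and return (train_indices, test_indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_ratio))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(n, k=5, seed=42):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


train_idx, test_idx = train_test_split_indices(10)
print(len(train_idx), len(test_idx))
```

Each index lands in the test portion of exactly one fold, so every sample is used for evaluation once and for training k - 1 times.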

3.7 Text Vectorization - TF-IDF


Term Frequency (TF)
Definition: Term Frequency represents how often a term (word) appears in a document. It is
calculated as the ratio of the number of occurrences of a term to the total number of terms in the
document.
Example: If the word "informational" appears 5 times in a document of 100 words, the term
frequency for "informational" in that document is 5/100 = 0.05.
Inverse Document Frequency (IDF)
Definition: Inverse Document Frequency measures the importance of a term in the entire dataset by
calculating the logarithm of the ratio of the total number of documents to the number of documents
containing the term.
Example: If there are 1,000 documents in the dataset, and the word "informational" appears in 100
of them, the inverse document frequency for "informational" is log10(1000/100) = 1, using a
base-10 logarithm.
TF-IDF Calculation
Formula: TF-IDF is calculated by multiplying the Term Frequency (TF) by the Inverse Document
Frequency (IDF). The result is a numerical weight that reflects the importance of a term in a
specific document relative to its importance in the entire dataset.
Example: If the TF for "informational" in a document is 0.05, and the IDF for "informational" is 1,
then the TF-IDF weight for "informational" in that document is 0.05 * 1 = 0.05.
Vectorization
Creation of Feature Vectors: Each document in the dataset is represented as a feature vector, where
each element corresponds to the TF-IDF weight of a specific term. This process results in a high-
dimensional sparse matrix where rows represent documents, columns represent terms, and the
matrix values are the TF-IDF weights.
Purpose
Numerical Representation: TF-IDF vectorization transforms the raw text data into a numerical
format that machine learning models can understand and process.
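The worked example in this section can be reproduced directly. The sketch below uses the plain textbook definitions with a base-10 logarithm, and the function names are illustrative. Library implementations may differ: scikit-learn's `TfidfVectorizer`, for example, uses a smoothed natural-log IDF and normalizes each row by default, so its weights will not match these numbers.

```python
import math


def term_frequency(term, tokens):
    """Occurrences of the term divided by the total number of tokens."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(n_docs, n_docs_with_term):
    """Base-10 log of (total documents / documents containing the term)."""
    return math.log10(n_docs / n_docs_with_term)


# A 100-word document in which "informational" appears 5 times.
document = ["informational"] * 5 + ["other"] * 95

tf = term_frequency("informational", document)
idf = inverse_document_frequency(1000, 100)
print(tf, idf, tf * idf)
```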
Chapter 4

Result and Analysis


4.1 Result Parameters
In the realm of classification models, evaluating their performance is crucial to understanding how
well they generalize to unseen data. Several metrics provide valuable insights into different aspects
of model performance. In this context, we delve into key evaluation metrics that aid in
comprehensively assessing classification models. These metrics include F1-Score, Macro Average,
Weighted Average, AUC Curve, and the Confusion Matrix. Each of these metrics plays a distinct
role in elucidating the strengths and limitations of a classification model.
TP (True Positives): The number of instances correctly predicted as positive (belonging to the
positive class).
FP (False Positives): The number of instances incorrectly predicted as positive when they actually
belong to the negative class.
TN (True Negatives): The number of instances correctly predicted as negative (belonging to the
negative class).
FN (False Negatives): The number of instances incorrectly predicted as negative when they actually
belong to the positive class.

Precision:
Definition: Precision is the ratio of true positive predictions to the total positive predictions made by
the model. It assesses the accuracy of positive predictions.
Precision
\[ P = \frac{TP}{TP + FP} \]
P = Precision
TP = True Positives
FP = False Positives
High precision means that when the model predicts a positive instance, it is likely to be correct.
Precision is particularly important when the cost of false positives is high.
Recall (Sensitivity or True Positive Rate)
Recall is the ratio of true positive predictions to the total actual positive instances in the dataset. It
assesses the ability of the model to capture all positive instances.
Recall
\[ R = \frac{TP}{TP + FN} \]
R = Recall
TP = True Positives
FN = False Negatives
High recall indicates that the model is effectively identifying most of the positive instances. Recall
is crucial when the cost of false negatives is high.

F1-Score
F1-Score is a metric that combines precision and recall into a single value. It is particularly useful in
binary classification tasks and provides a balance between precision and recall.
F1-Score
\[ F_1 = 2 \cdot \frac{P \cdot R}{P + R} \]
F1 = F1-Score
P = Precision
R = Recall
F1-Score ranges from 0 to 1, where 1 indicates perfect precision and recall. It is especially valuable
when there is an uneven class distribution.
Support
Support is the count of instances (or samples) for each class in the dataset. It provides context to the
evaluation metrics by showing how many instances belong to each class. Support is not a metric
that is optimized but rather gives a sense of the distribution of classes. It helps to understand the
imbalances in the dataset.
Accuracy
Accuracy is a measure of overall correctness in a classification model. It calculates the ratio of
correctly predicted instances to the total number of instances.
Accuracy
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
TP = True Positives
TN = True Negatives
FP = False Positives
FN = False Negatives
High accuracy indicates that the model is making a high percentage of correct predictions across all
classes. However, it might not be the best metric for imbalanced datasets.
Macro Average
Macro Average is a method of computing the average performance across multiple classes without
considering class imbalances. It calculates the metric independently for each class and then takes
the average.
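Macro Avg
\[ \text{Macro Avg} = \frac{1}{n} \sum_{i=1}^{n} M_i \]
M_i = Value of the metric (precision, recall, or F1-score) for class i.
n = Number of classes (categories) in the classification task.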


Macro Average treats each class equally, providing an unweighted average across all classes. It is
suitable when each class is considered equally important.
Weighted Average
Weighted Average is similar to Macro Average but considers class imbalances. It calculates the
metric for each class and then takes a weighted average based on the support (number of instances)
of each class.
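Weighted Avg
\[ \text{Weighted Avg} = \frac{\sum_{i=1}^{n} s_i M_i}{\sum_{i=1}^{n} s_i} \]
M_i = Value of the metric for class i.
s_i = Support (number of instances) of class i.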
Interpretation: Weighted Average is useful when there is an imbalance in class distribution. It gives
more weight to classes with larger support.
Confusion Matrix
A confusion matrix is a table used in classification to assess the performance of a machine learning
model. It provides a comprehensive breakdown of the model's predictions, comparing them to the
true labels. The matrix consists of four components:
True Positives (TP): Instances where the model correctly predicted the positive class.
True Negatives (TN): Instances where the model correctly predicted the negative class.
False Positives (FP): Instances where the model predicted the positive class, but the true class is
negative.
False Negatives (FN): Instances where the model predicted the negative class, but the true class is
positive.
The confusion matrix is especially useful in understanding the types and frequencies of errors made
by a classifier. It serves as the foundation for deriving various performance metrics such as accuracy,
precision, recall, and the F1-Score.
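To make the relationship between the confusion matrix and the metrics concrete, the sketch below derives per-class precision, recall, F1-score, and support from a multiclass confusion matrix, followed by accuracy and the macro and weighted F1 averages. The function name `summarize` and the 2x2 toy matrix are assumptions for illustration and are not results from this work. Rows are taken to be true classes and columns predicted classes.

```python
def summarize(cm):
    """cm[i][j] = number of instances of true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    rows = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, true class differs
        fn = sum(cm[c]) - tp                        # true c, predicted as another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        rows.append({"precision": precision, "recall": recall,
                     "f1": f1, "support": tp + fn})
    accuracy = sum(cm[c][c] for c in range(n)) / total
    macro_f1 = sum(r["f1"] for r in rows) / n
    weighted_f1 = sum(r["f1"] * r["support"] for r in rows) / total
    return rows, accuracy, macro_f1, weighted_f1


# Toy confusion matrix for two classes.
toy_cm = [[8, 2],
          [1, 9]]
rows, accuracy, macro_f1, weighted_f1 = summarize(toy_cm)
print(accuracy, macro_f1, weighted_f1)
```

In the multiclass case each class is treated in turn as the positive class, with all other classes pooled as negative, which is how the four binary components generalize to ten query types.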
AUC Curve (Area Under the ROC Curve)
The Receiver Operating Characteristic (ROC) curve is a graphical representation of a classifier's
performance across different discrimination thresholds. The curve plots the True Positive Rate
(Sensitivity) against the False Positive Rate (1 - Specificity) at various threshold values. The Area
Under the Curve (AUC) quantifies the overall performance of the classifier.
True Positive Rate (Sensitivity): The proportion of actual positive instances correctly identified by
the classifier.
False Positive Rate (1 - Specificity): The proportion of actual negative instances incorrectly
identified as positive by the classifier.
A model with a higher AUC score generally has better discrimination ability, meaning it can
distinguish between positive and negative instances more effectively. A perfect classifier would have
an AUC score of 1, while a random classifier would have an AUC of 0.5.
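The AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way for a single class against the rest. The function name `auc_score` and the toy scores are assumptions for illustration, and the pairwise loop is quadratic, so it is meant for small examples only.

```python
def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    # A pair counts 1 if the positive scores higher, 0.5 on a tie, 0 otherwise.
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# One-vs-rest view: 1 marks the class of interest, 0 marks every other class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(y_true, scores))
```

For a ten-class problem this is repeated once per class, which gives the per-class AUC values reported alongside a multiclass ROC plot.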

4.2 Linear SVC


With and Without Parameter Tuning
Linear SVC       Precision   Recall   F1-score   Support
Accuracy                              0.95       101985
Macro Avg        0.95        0.94     0.95       101985
Weighted Avg     0.95        0.95     0.95       101985

The evaluation of the LinearSVC algorithm is presented in two scenarios: "Without Parameter
Tuning" and "With Parameter Tuning." In both cases, the algorithm exhibits high accuracy and
balanced performance metrics across various categories. In the absence of parameter tuning, the
LinearSVC achieves an overall accuracy of 0.95, with a support of 101,985 instances. The macro
and weighted averages for precision, recall, and f1-score all reach 0.95, with the single exception
of macro-averaged recall at 0.94, reflecting the algorithm's robustness and effectiveness across
diverse query types.
Upon incorporating parameter tuning, the best parameters for the LinearSVC are identified as {'C':
1, 'penalty': 'l2'}. These parameters represent the regularization strength (C) and the penalty term
(penalty) used in the LinearSVC algorithm. In this context, a regularization strength of 1 and an 'l2'
penalty were identified as the optimal choices based on the tuning process. Parameter tuning aims to
optimize the model's performance by adjusting these hyperparameters, and in the case of
LinearSVC, the specified values were found to yield the best results for the given dataset and query
intent classification task. These values coincide with the defaults of scikit-learn's LinearSVC, which
would explain why the performance metrics are identical to the untuned scenario: accuracy remains
0.95 on the same 101,985 instances, and the macro and weighted averages for precision, recall, and
f1-score mirror the untuned results. Overall, these findings indicate that the LinearSVC algorithm
delivers high-quality predictions for query intent detection both with and without parameter tuning.
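The tuning process itself amounts to an exhaustive search over candidate hyperparameter combinations. The sketch below shows that loop in plain Python under stated assumptions: `grid_search` is an illustrative name, the candidate grid is hypothetical, and `toy_score` is a synthetic stand-in whose peak is placed at C = 1 with an 'l2' penalty purely to exercise the loop. It does not reproduce any result from this work. In practice the scoring function would train the classifier with the given parameters and return a cross-validated score on the training data.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one with its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Synthetic score for illustration only (not a model evaluation)."""
    penalty_cost = 0.5 if params["penalty"] == "l1" else 0.0
    return -(params["C"] - 1) ** 2 - penalty_cost


# Hypothetical candidate grid; the grid actually searched is not stated in the text.
param_grid = {"C": [0.1, 1, 10], "penalty": ["l1", "l2"]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```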
Confusion Matrix (Linear SVC)
The confusion matrix for the LinearSVC algorithm provides a comprehensive breakdown of the
model's performance across ten query types. Notably, for Query Type 1, the model demonstrates
high accuracy, correctly predicting 13,970 instances with minimal false positives and negatives.
Similarly, Query Type 2 and Query Type 3 exhibit strong performance, though with slight confusion
among other query types. However, Query Type 6 shows a lower count of true positives, indicating
challenges in prediction. Overall, the LinearSVC algorithm achieves an impressive accuracy of 95%,
showcasing its effectiveness in classifying diverse query types. The detailed analysis of the
confusion matrix allows for a nuanced understanding of the algorithm's strengths and areas for
improvement, providing valuable insights for further refinement and optimization.

Area Under the ROC curve (Linear SVC)


The Multiclass ROC Curve for the LinearSVC algorithm provides a visual representation of its
performance across the ten query types. Each point on the curve corresponds to a specific threshold,
and the curve's shape indicates the trade-off between true positive rate and false positive rate. The
Area Under the Curve (AUC) values quantify the classifier's ability to distinguish between different
query types, with values close to 1.00 signifying excellent performance. In this case, the AUC values
are consistently high, ranging from 0.99 to 1.00 for all query types. Particularly noteworthy are the
perfect AUC scores for Class 1, Class 2, Class 3, Class 5, Class 6, Class 7, Class 8, and Class 9,
indicating the algorithm's exceptional discriminatory ability for these specific query types. Overall,
the Multiclass ROC Curve with elevated AUC values underscores the robust discriminative power of
the LinearSVC algorithm across diverse query categories.

4.3 Multinomial Naive Bayes


Without Parameter Tuning
Multinomial Naive Bayes   Precision   Recall   F1-score   Support
Accuracy                                       0.88       101985
Macro Avg                 0.90        0.82     0.84       101985
Weighted Avg              0.88        0.88     0.88       101985

With Parameter Tuning
Multinomial Naive Bayes   Precision   Recall   F1-score   Support
Accuracy                                       0.89       101985
Macro Avg                 0.90        0.87     0.88       101985
Weighted Avg              0.89        0.89     0.89       101985
The performance evaluation of the MultinomialNB algorithm is conducted in two scenarios:
"Without Parameter Tuning" and "With Parameter Tuning." In both cases, MultinomialNB
demonstrates commendable accuracy across various categories.
Without parameter tuning, the algorithm achieves an accuracy of 0.88, with a support of 101,985
instances. The macro averages for precision, recall, and f1-score are 0.90, 0.82, and 0.84,
respectively, while the weighted averages are all 0.88. The lower macro-averaged recall suggests
weaker performance on some of the less frequent query types.
Upon incorporating parameter tuning, the best parameter identified for MultinomialNB is {'alpha':
0.1}. The alpha parameter in MultinomialNB represents the additive smoothing parameter, which
helps handle the issue of zero probabilities for certain features. In this context, an alpha value of 0.1
was identified as the optimal choice based on the tuning process. Parameter tuning aims to optimize
the model's performance by adjusting hyperparameters, and in the case of MultinomialNB, the
specified alpha value was found to yield the best results for the given dataset and query intent
classification task. With this parameter setting, the algorithm exhibits improved performance,
reaching an accuracy of 0.89 with the same support of 101,985 instances. The macro averages for
precision, recall, and f1-score become 0.90, 0.87, and 0.88, respectively (recall and f1-score
improve while precision is unchanged), and the weighted averages rise to 0.89. This enhancement
shows that parameter tuning yields a modest gain for this algorithm. Overall, MultinomialNB proves
to be a reliable and adaptable choice for query intent detection, providing accurate predictions both
with and without parameter tuning.
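The role of alpha can be illustrated with the standard additive (Lidstone) smoothing formula, P(w | c) = (count(w, c) + alpha) / (total count in c + alpha * |V|). The sketch below is a minimal illustration with raw token counts. The function name `smoothed_word_probs` and the toy vocabulary are assumptions, and every token is assumed to belong to the vocabulary. A classifier trained on TF-IDF features would apply the same formula to fractional weights rather than integer counts.

```python
from collections import Counter


def smoothed_word_probs(class_tokens, vocabulary, alpha=0.1):
    """Estimate P(word | class) with additive smoothing over a fixed vocabulary."""
    counts = Counter(class_tokens)
    # Adding alpha once per vocabulary word keeps the probabilities summing to 1.
    denominator = len(class_tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}


vocabulary = ["book", "table", "refund"]
probs = smoothed_word_probs(["book", "book", "table"], vocabulary, alpha=0.1)
print(probs)
```

Without smoothing, "refund" would get probability zero for this class, and any query containing it would be assigned zero likelihood regardless of its other words. A smaller alpha stays closer to the observed counts while still avoiding zeros.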
Confusion Matrix (MultinomialNB)
The confusion matrix for the MultinomialNB algorithm provides a comprehensive overview of its
performance across different query types. The matrix reveals the number of instances for each
predicted and true query type. Notably, the algorithm demonstrates robust capabilities in correctly
classifying instances, as evidenced by the diagonal elements representing true positives. For
instance, in Query Type 1, the algorithm correctly predicts 135,949 instances, indicating a high
accuracy for this specific query type.
However, the confusion matrix also exposes areas where the algorithm faces challenges. Instances
of confusion, reflected in off-diagonal elements, suggest misclassifications or uncertainties. For
instance, in Query Type 0, there are 316 instances misclassified as Query Type 1, and 363 instances
misclassified as Query Type 2. Analyzing these patterns of confusion provides valuable insights into
the algorithm's strengths and areas for potential improvement.
Area Under the ROC curve (MultinomialNB)
The Multiclass ROC Curve for the MultinomialNB algorithm provides a detailed evaluation of its
performance across different query types. The Area Under the Curve (AUC) values associated with
each query type's curve offer insights into the algorithm's ability to discriminate between classes. A
higher AUC indicates better discriminatory power, and the results reveal notable strengths in certain
classes.
For instance, Class 1 exhibits a perfect AUC of 1.00, indicating that the algorithm achieves optimal
true positive rates while minimizing false positive rates for this specific query type. Similarly,
Classes 6 and 7 also demonstrate perfect AUC scores, signifying the algorithm's exceptional ability
to distinguish between instances of these query types.
While most classes exhibit high AUC values, such as Classes 0, 2, 3, 4, 5, 8, and 9 with AUC scores
ranging from 0.97 to 0.99, it's essential to consider the algorithm's performance in the context of
specific classes. AUC scores approaching 1.00 indicate robust performance, but deviations from
perfection may suggest areas for further exploration and optimization.


4.4 Random Forest


With and Without Parameter Tuning
Random Forest    Precision   Recall   F1-score   Support
Accuracy                              0.97       101985
Macro Avg        0.97        0.95     0.96       101985
Weighted Avg     0.97        0.97     0.97       101985

The Random Forest algorithm, a powerful ensemble method, was employed for query intent
detection with and without parameter tuning, and the results are highly promising. In both
scenarios, the algorithm exhibited exceptional performance, achieving an accuracy and f1-score of
97% across the dataset.
In the absence of parameter tuning, the Random Forest model demonstrated robustness, with macro
and weighted averages for precision, recall, and f1-score consistently reaching or exceeding 95%.
This signifies the algorithm's ability to effectively identify true positives while minimizing both
false positives and false negatives across diverse query types.
Following parameter tuning, the best parameter configuration was found to be {'max_depth': None,
'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 150}. These parameters play a
crucial role in defining the structure and behavior of the random forest model. The "max_depth"
parameter controls the maximum depth of the trees in the forest, "min_samples_leaf" sets the
minimum number of samples required to be at a leaf node, "min_samples_split" specifies the
minimum number of samples required to split an internal node, and "n_estimators" determines the
number of trees in the forest. The tuned model achieves the same overall accuracy and f1-score as
the untuned model, so tuning did not yield a measurable improvement on this dataset. Apart from
"n_estimators", the selected values coincide with the defaults of scikit-learn's
RandomForestClassifier, which would explain the matching results.
The Random Forest algorithm, whether with or without parameter tuning, emerges as a robust
choice for query intent detection. Its consistent high accuracy, precision, recall, and f1-score across
diverse query types underscore its effectiveness in handling the complexities of intent classification
tasks. The detailed evaluation and optimization of the algorithm contribute valuable insights to our
thesis, highlighting its competence in real-world applications.
Confusion Matrix (Random Forest)
The confusion matrix for the Random Forest Classifier provides a detailed overview of the model's
performance across different query types. Each row corresponds to the true class, while each
column represents the predicted class.

The diagonal elements indicate the true positives for each query type, and off-diagonal elements
represent misclassifications. From the matrix, it is evident that the Random Forest Classifier excels
in correctly identifying query types, as evidenced by the high values along the diagonal.
For instance, in Query Type 1, the model achieved 14,009 true positives, with only a small number
of misclassifications across other categories. Similarly, for Query Type 6, the classifier
demonstrated strong performance, correctly predicting 690 instances while misclassifying only a
minimal number.
However, some challenges are observed in certain query types, such as Query Types 0, 2, and 8,
where misclassifications are slightly more prominent. These discrepancies could be attributed to the
inherent complexity and subtle differences between queries in these categories.
The confusion matrix provides valuable insights into the strengths and areas for improvement of the
Random Forest Classifier in handling diverse query types. Despite some misclassifications, the
model showcases robust performance, reinforcing its effectiveness for query intent detection in real-
world applications.
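The way the confusion matrix is read above (rows are true classes, columns are predicted classes, the diagonal holds true positives) can be sketched in a few lines. The 3x3 matrix `toy_cm` is a hypothetical example, not the thesis's matrix, and `per_class_metrics` is an illustrative helper name.

```python
def per_class_metrics(cm):
    """Derive per-class counts and scores from a square confusion matrix.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j.
    """
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # rest of row k
        fp = sum(cm[r][k] for r in range(n)) - tp  # rest of column k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        results.append(
            {"tp": tp, "fp": fp, "fn": fn,
             "precision": precision, "recall": recall}
        )
    return results


def accuracy(cm):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total


# Hypothetical 3-class matrix used only for illustration.
toy_cm = [
    [50, 3, 2],
    [4, 40, 6],
    [1, 2, 42],
]

metrics = per_class_metrics(toy_cm)
for k, m in enumerate(metrics):
    print(k, m["tp"], m["fp"], m["fn"],
          round(m["precision"], 3), round(m["recall"], 3))
print("accuracy:", accuracy(toy_cm))
```

Comparing classes by recall (diagonal value divided by the row total) rather than by the raw diagonal count avoids favoring classes that simply have more test instances.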
Area Under the ROC curve (Random Forest)
The Multiclass ROC Curve for the Random Forest Classifier showcases the Area Under the Curve
(AUC) values for each query type, providing a comprehensive assessment of the model's
discriminatory power across different classes. Each curve represents the classifier's ability to
distinguish between a specific query type and the rest.
For nine of the ten query types (Classes 0, 1, 2, 3, 4, 5, 7, 8, and 9), the reported AUC is 1.00.
This indicates that the scores produced by the Random Forest Classifier separate each of these
query types from the rest almost perfectly across decision thresholds.

While the AUC for Class 6 is slightly lower at 0.99, it still reflects a high discriminatory capability,
showcasing the model's effectiveness in identifying this specific query type. The overall pattern of
near-perfect AUC values across query types highlights the robustness of the Random Forest
Classifier in handling diverse intents within the dataset.
This result reinforces the Random Forest Classifier as a powerful algorithm for query intent
detection, emphasizing its ability to provide accurate predictions across a broad range of query
types. The high AUC values affirm the model's reliability in real-world scenarios, supporting its
potential for deployment in applications requiring precise intent classification.
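The one-vs-rest AUC described above can be computed directly from its ranking interpretation: the probability that a randomly chosen instance of the class receives a higher score than a randomly chosen instance of any other class. The sketch below is a minimal illustration under that definition; the labels and probabilities are hypothetical, and the function names are not taken from the thesis.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, proba, n_classes):
    """One AUC per class: class k against all other classes."""
    aucs = []
    for k in range(n_classes):
        labels = [1 if y == k else 0 for y in y_true]
        scores = [row[k] for row in proba]
        aucs.append(binary_auc(labels, scores))
    return aucs


# Hypothetical true labels and predicted class probabilities.
y_true = [0, 0, 1, 1, 2, 2]
proba = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],  # true class 0, but the largest score is class 1
    [0.3, 0.6, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
]

aucs = one_vs_rest_auc(y_true, proba, n_classes=3)
print(aucs)
```

In this toy data the second sample would be misclassified by taking the largest score, yet the AUC for class 0 is still 1.0. A per-class AUC of 1.00 therefore does not by itself imply error-free predictions, which is why AUC is best read alongside the confusion matrix.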

4.5 Gradient Boosting


Gradient Boosting
                Precision   Recall   F1-score   Support
Accuracy                             0.75       101985
Macro Avg       0.81        0.76     0.78       101985
Weighted Avg    0.79        0.75     0.76       101985

The performance metrics for the Gradient Boosting algorithm reveal insights into its effectiveness
for query intent classification. The achieved accuracy, precision, recall, and F1-score values are
critical indicators of the model's capability to correctly classify various query types within the
dataset.
With an accuracy of 0.75, the Gradient Boosting algorithm demonstrates a moderate level of
overall correctness in predicting query intents. The macro-averaged F1-score, the unweighted mean
of the per-class F1-scores (each a harmonic mean of precision and recall), stands at 0.78. The
macro-averaged precision and recall are 0.81 and 0.76, respectively; the lower recall indicates
that, averaged over query types, the model misses true instances more often than it raises false
alarms.
In the context of weighted averages, the Gradient Boosting algorithm achieves precision, recall, and
F1-score values of 0.79, 0.75, and 0.76, respectively. These weighted averages provide a
comprehensive evaluation, considering the varying support for each query type within the dataset.
While the achieved metrics indicate a reasonable level of performance, they are clearly lower
than those of the other algorithms evaluated in this chapter, and there is room for improvement.
Fine-tuning hyperparameters or exploring other ensemble methods could be considered to enhance
the Gradient Boosting model's performance for query intent detection. Overall, this analysis
identifies the strengths and areas of improvement of the Gradient Boosting algorithm in the
context of our thesis on query intent classification.
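The difference between macro and weighted averages can be made concrete with a small worked example. The per-class recalls and supports below are hypothetical and chosen to exaggerate class imbalance; only the macro precision and recall passed to `f1` at the end (0.81 and 0.76) come from the table above.

```python
def macro_average(values):
    """Unweighted mean: every class counts equally."""
    return sum(values) / len(values)


def weighted_average(values, supports):
    """Mean weighted by the number of true instances of each class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical two-class problem with strong imbalance.
recalls = [0.8, 0.5]
supports = [900, 100]

macro_recall = macro_average(recalls)
weighted_recall = weighted_average(recalls, supports)

# Correct predictions per class are recall * support, so the weighted
# recall equals overall accuracy in single-label classification.
correct = sum(r * s for r, s in zip(recalls, supports))
overall_accuracy = correct / sum(supports)

print(macro_recall, weighted_recall, overall_accuracy)

# Harmonic mean of the reported macro precision and recall. Macro F1 is
# defined as the mean of per-class F1-scores, so this is only a rough check.
print(round(f1(0.81, 0.76), 3))
```

The identity between weighted recall and accuracy matches the table, where both are 0.75. The macro figures differ from the weighted ones because they give small classes the same weight as large ones.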
Confusion Matrix (Gradient Boosting)
The confusion matrix for the Gradient Boosting algorithm reveals distinct performance patterns
across different query types. The algorithm classifies "Educational" queries accurately, with 644
correct predictions and few misclassifications. In contrast, it struggles to separate
"Navigational" from "Informational" queries: the rows for these two classes contain large
off-diagonal values, which suggests confusion between the two categories.
The matrix shows 11105 correct predictions for "Navigational" queries and 11492 for
"Informational" queries. Because the two classes need not contain the same number of test
instances, these raw counts should be compared through per-class recall (correct predictions
divided by the row total) rather than directly. It is also important to consider the specific
characteristics and importance of each query type in the context of the application. Additionally,
the "Troubleshooting" class exhibits notable misclassifications across various categories,
indicating room for improvement.
In summary, while the Gradient Boosting algorithm performs well on certain classes, such as
"Educational," it has difficulty distinguishing between closely related query types such as
"Navigational" and "Informational." Further optimization efforts could enhance its performance,
particularly in classes where misclassifications are more pronounced.
Area Under the ROC curve (Gradient Boosting)
The multiclass ROC Curve for the Gradient Boosting algorithm provides an insightful perspective
on the model's discriminatory ability across different query types. Each line on the curve
corresponds to a specific query type, and the Area Under the Curve (AUC) quantifies the model's
performance in distinguishing between classes.
Analyzing the AUC values for each class reveals the discriminatory power of the Gradient Boosting
algorithm. Notably, Class 1 has the highest AUC of 0.97, indicating strong performance in
accurately identifying instances of this query type. Similarly, Class 5, Class 7, Class 8, and Class 9
also demonstrate high AUC values, emphasizing the model's effectiveness in distinguishing these
query types.

However, the AUC values vary across query types. For instance, Class 0 exhibits a lower AUC of
0.90, which still indicates good discriminatory ability for this query type. Examining the AUC of
each class separately gives a more detailed assessment of the model's performance in multiclass
classification than a single aggregate figure.
The multiclass ROC Curve for Gradient Boosting showcases the algorithm's ability to discriminate
between various query types, with high AUC values indicating strong performance for specific
classes. These findings contribute valuable insights to our thesis on query intent classification,
demonstrating the Gradient Boosting algorithm's discriminative capabilities and highlighting areas
for potential optimization.

4.6 Bi-LSTM

Bi-LSTM
                Precision   Recall   F1-score   Support
Accuracy                             0.98       101985
Macro Avg       0.97        0.97     0.97       101985
Weighted Avg    0.98        0.98     0.98       101985

In our evaluation, the Bi-LSTM algorithm classified user queries across ten distinct classes with
an accuracy of 0.98, the highest of the models studied. The weighted-average precision, recall,
and F1-score are all 0.98, and the corresponding macro averages are all 0.97. The small gap
between the macro and weighted figures indicates that performance is balanced across query
classes rather than driven only by the most frequent ones.
The Bi-LSTM algorithm's exceptional performance suggests its suitability for handling complex
and context-dependent queries, a crucial aspect in query intent detection. The high F1-score,
precision, and recall values indicate the algorithm's proficiency in minimizing false positives and
false negatives, vital for delivering accurate and reliable predictions in real-world scenarios. These
results position Bi-LSTM as a promising choice for query intent detection applications,
emphasizing its potential to enhance user experience and information retrieval systems by precisely
categorizing diverse user queries with a high level of accuracy.

4.7 Analysis
Our thorough investigation of several machine learning methods for query intent recognition has
produced insightful information about how well they function in a variety of contexts. Every
algorithm has contributed differently to the diverse array of outcomes, each with its own set of
advantages and disadvantages. Linear Support Vector Classifier (LinearSVC), Multinomial Naive
Bayes, Random Forest Classifier, Gradient Boosting, and Bi-directional Long Short-Term Memory
(Bi-LSTM) are all included in the comprehensive analysis.
Linear Support Vector Classifier (LinearSVC): LinearSVC achieved an accuracy and F1-score of
0.95, showing high accuracy and reliable discrimination across diverse query types. Whether it is
the right choice depends on the application's requirements, the available computational
resources, and the relative importance of precision and recall.
Multinomial Naive Bayes: Multinomial Naive Bayes achieved an accuracy and F1-score of 0.88,
improving slightly to 0.89 after parameter tuning. It handles various query types effectively,
but its suitability depends on the specific use case and the precision required.
Random Forest Classifier: Random Forest achieved an accuracy and F1-score of 0.97 both with and
without parameter tuning, with consistent results and high AUC values in the multiclass ROC
curve. It is a reliable choice in scenarios where robustness and discriminative capability are
crucial.
Gradient Boosting: Gradient Boosting achieved an accuracy of 0.75 (macro F1-score 0.78, weighted
F1-score 0.76), the lowest of the five models. Its performance varied considerably across query
types, so it should be chosen only where the specific requirements of the application favor it.
Bi-directional Long Short-Term Memory (Bi-LSTM): Bi-LSTM was the best performer, with an accuracy
and weighted F1-score of 0.98. This makes it a compelling choice, especially when precision and
accuracy are paramount.
Which is better?
The choice of the best method depends on the particular requirements of the application, the
available computing power, and how much emphasis is placed on accuracy, precision, and recall. On
our dataset, Bi-LSTM achieved the highest accuracy (0.98), closely followed by Random Forest
(0.97) and LinearSVC (0.95), while Multinomial Naive Bayes (0.89 after tuning) and Gradient
Boosting (0.75) trailed. The choice must still be made with the particular use case in mind,
taking into account factors such as interpretability, computational efficiency, and the balance
between recall and precision required in practice, so that the selected algorithm suits the
intended user interactions.
Chapter 5

Conclusion & Future work

5.1 Future work


Our thesis opens several avenues for future exploration and enhancement. The dynamic nature of
natural language processing and the evolving landscape of user queries present numerous
opportunities for further research and development. Potential directions for future work include:
Enhanced Dataset Refinement: Continuously refining and expanding our dataset will be crucial
for staying ahead of emerging query patterns and ensuring robust model generalization.
Incorporating additional sectors, diverse language variations, and real-time user interactions can
contribute to a more comprehensive dataset.
Advanced Model Architectures: Exploring and experimenting with state-of-the-art model
architectures beyond those investigated in this thesis can lead to further performance improvements.
Techniques such as transformer-based models, attention mechanisms, and ensemble learning could
be investigated to enhance the accuracy and efficiency of query intent detection.
Fine-tuning for Domain-Specific Queries: Tailoring models to specific domains or industries,
especially for sectors with nuanced language usage, can enhance the precision of query intent
detection. Fine-tuning algorithms on domain-specific datasets may prove beneficial for applications
in specialized fields.
Explainability and Interpretability: Incorporating techniques to make models more interpretable
and explainable is essential, especially in applications where transparency is crucial. This could
involve exploring methods such as attention visualization, SHAP (SHapley Additive exPlanations),
or LIME (Local Interpretable Model-agnostic Explanations).
Real-time Query Intent Detection: Adapting models for real-time applications, such as live chat
support or voice-activated assistants, requires a focus on minimizing latency. Optimizing algorithms
for speed without compromising accuracy will be a critical consideration for future work.
Handling Multimodal Data: Integrating models that can handle both textual and visual
information in queries opens avenues for understanding user intent in a richer context. Exploring
multimodal approaches, including image and voice recognition, can lead to more comprehensive
query intent detection systems.
User Feedback Integration: Developing mechanisms to incorporate user feedback into the model
training process allows for continuous improvement. This iterative learning approach ensures that
the model adapts to evolving user preferences and language trends.

5.2 Conclusion
This thesis collected a dataset of over 500,000 queries through prompt engineering with ChatGPT
and used it to train and evaluate five classifiers for query intent detection. Because no
standard dataset for query intent detection was available, we created our own, categorizing
queries across ten distinct types and associating them with 935 diverse sectors.
Our use of prompt engineering was inspired by the worker-and-AI collaboration paradigm for
dataset creation introduced with Worker-and-AI NLI (WANLI) [10]. That work addresses the
repetitive patterns found in traditional large-scale crowdsourced datasets by combining the
generative capabilities of GPT-3 with the evaluative strength of human annotators, and its
authors report that training on WANLI improves performance on out-of-domain test sets compared
with training on MultiNLI.
Moving through the stages of dataset creation, algorithm training, and extensive evaluation, we
navigated the intricacies of machine learning models. LinearSVC, Multinomial Naive Bayes,
Random Forest Classifier, Gradient Boosting, and Bi-LSTM each played a distinctive role,
contributing to a rich tapestry of results. LinearSVC and Multinomial Naive Bayes exhibited
commendable accuracy, Random Forest demonstrated exceptional robustness, Gradient Boosting
showcased diverse strengths, and Bi-LSTM emerged as a standout performer with unparalleled
accuracy.
Our work goes beyond a comparison of algorithms: it combines established classification
techniques with a purpose-built dataset for query intent detection, covering the full path from
data collection to evaluation. As we conclude, the significance of our thesis lies not only in
the strong performance of the best algorithms but also in the construction of a dedicated
dataset for query intent detection.
References:
[1] A. Kathuria, B. J. Jansen, C. Hafernik, and A. Spink, "Classifying the user intent of web
queries using k-means clustering," Internet Research, vol. 20, no. 5, pp. 563-581, 2010.
[2] H. B. Hashemi, A. Asiaee, and R. Kraft, "Query Intent Detection using Convolutional Neural
Networks," Intelligent Systems Program, University of Pittsburgh; Yahoo! Inc., Sunnyvale, CA.
[3] "Understand Random Forest Algorithms," Analytics Vidhya, [Online]. Available:
https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/
[4] "Gradient Boosting Algorithm," Analytics Vidhya, [Online]. Available:
https://www.analyticsvidhya.com/blog/2021/09/gradient-boosting-algorithm-a-complete-guide-for-beginners/
[5] "Bidirectional Long Short-Term Memory Network," ScienceDirect, [Online]. Available:
https://www.sciencedirect.com/topics/computer-science/bidirectional-long-short-term-memory-network
[6] "Naive Bayes Classifier Explained," Analytics Vidhya, [Online]. Available:
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
[7] A. Nigam, P. Sahare, and K. Pandya, "Intent Detection and Slots Prompt in a Closed-Domain
Chatbot," in Proceedings of the IEEE International Conference on Natural Language Processing and
Machine Learning. New Delhi, India: kydots.ai.
[8] "Guide on Support Vector Machine (SVM) Algorithm," Analytics Vidhya, [Online]. Available:
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
[9] L. P. Manik, D. S. Rini, Z. Akbar, H. F. Mustika, A. Indrawati, A. D. Fefirenta, T.
Djarwaningsih, "Out-of-Scope Intent Detection on A Knowledge-Based Chatbot," Research Center
for Informatics, Indonesian Institute of Sciences, Indonesia.
[10] A. Liu, S. Swayamdipta, N. A. Smith, Y. Choi, "WANLI: Worker and AI Collaboration for
Natural Language Inference Dataset Creation," Paul G. Allen School of Computer Science &
Engineering, University of Washington; Allen Institute for Artificial Intelligence.
[12] Moura, A., Lima, P., Mendonça, F., Mostafa, S. S., & Morgado-Dias, F. (2021). On the Use of
Transformer-Based Models for Intent Detection Using Clustering Algorithms. Sensors, 21(13),
4428. https://doi.org/10.3390/s21134428
[13] Zhang, C., Li, Y., Du, N., Fan, W., & Yu, P. S. (2019, June). Joint slot filling and intent
detection via capsule neural networks. In Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics (Vol. 1, pp. 5259-5267). Association for Computational
Linguistics. https://aclanthology.org/P19-1519
[14] Dao, M. H., Truong, T. H., & Nguyen, D. Q. (2021). Intent detection and slot filling for
Vietnamese [Abstract]. In Proceedings of the 2021 Conference on Empirical Methods in Natural
Language Processing (EMNLP) (pp. 1276-1287). Association for Computational Linguistics.
https://arxiv.org/abs/2104.02021
[15] Zhang, H., Song, W., Liu, L., Du, C., & Zhao, X. (2016). Query classification using
convolutional neural networks. 2016 IEEE International Conference on Data Mining (ICDM) (pp.
1041-1046). Institute of Electrical and Electronics Engineers (IEEE).
[16] Yolchuyeva, S., Németh, G., & Gyires-Tóth, B. (2020). Self-attention networks for intent
detection. In 2020 43rd International Conference on Telecommunications and Signal Processing
(TSP) (pp. 470-474). Institute of Electrical and Electronics Engineers (IEEE).
https://ieeexplore.ieee.org/document/10052676
[17] Vinyals, O., & Le, Q. V. (2015, June 23). A neural conversational model [ArXiv preprint
arXiv:1506.05869]. https://arxiv.org/abs/1506.05869
[18] Zhu, Q., Zhang, Z., Zhu, X., & Huang, M. (2023). Building multi-domain dialog state trackers
from single-domain dialogs. In Proceedings of the 2023 Conference on Empirical Methods in
Natural Language Processing (EMNLP) (pp. 946-957). Association for Computational Linguistics.
https://aclanthology.org/2023.emnlp-main.946
[19] Towards Data Science. (2022, January 15). Micro, Macro, Weighted Averages of F1 Score:
Clearly Explained. Available:
https://towardsdatascience.com/micro-macro-weighted-averages-of-f1-score-clearly-explained-b603420b292f
[20] Yao, Z., Schloss, B. J., & Selvaraj, S. P. (2023, December). Aligning AI-Generated Text with
Human Preferences. [ArXiv preprint arXiv:2312.15997]. https://arxiv.org/abs/2312.15997
[21] Chen, S., Gao, S., & He, J. (2023, May). Evaluating Factual Consistency of Summaries with
Large Language Models. [ArXiv preprint arXiv:2305.14069]. https://arxiv.org/abs/2305.14069
[22] Yuan, A., Ippolito, D., Nikolaev, V., Callison-Burch, C., Coenen, A., & Gehrmann, S. (2021).
SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets. [ArXiv preprint
arXiv:2111.06467]. https://arxiv.org/abs/2111.06467
