Britto
A PROJECT REPORT
Submitted by
A.BRITTO RAJ
BACHELOR OF
ENGINEERING
in
ODDANCHATRAM-624619
ANNA UNIVERSITY::CHENNAI–600025
MAY 2024
CHRISTIAN COLLEGE OF ENGINEERING AND TECHNOLOGY
ODDANCHATRAM-624619
REGISTER NO:
Visualization: Create visualizations (e.g., histograms, word clouds, and scatter plots) to explore the distribution of the data and understand patterns and
relationships within the dataset.
2. Data Cleaning:
Handling Missing Values: Identify and handle missing values through methods such as imputation, removal, or filling with placeholder values, depending on
the context.
Removing Duplicates: Check for and remove duplicate records to ensure data quality.
Outlier Detection: Detect and handle outliers that may skew the model training, using statistical methods or visualization techniques.
Standardizing Formats: Standardize data formats (e.g., date formats, text casing) to ensure consistency across the dataset.
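The cleaning steps above can be sketched on a small list of question/answer pairs. This is a minimal illustration using only the standard library, not the notebook's own code; the function name clean_pairs, the empty-string placeholder, and the sample rows are assumptions made for the example.

```python
def clean_pairs(pairs, placeholder=""):
    """Fill missing answers, standardize text, and drop exact duplicates."""
    seen = set()
    cleaned = []
    for question, answer in pairs:
        # Handling missing values: drop rows without a question,
        # fill a missing answer with a placeholder.
        if question is None:
            continue
        if answer is None:
            answer = placeholder
        # Standardizing formats: collapse whitespace and lowercase.
        question = " ".join(question.split()).lower()
        answer = " ".join(answer.split()).lower()
        # Removing duplicates: keep only the first occurrence of each pair.
        key = (question, answer)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(key)
    return cleaned


# Hypothetical sample rows, chosen to exercise each rule.
raw = [
    ("Hi,  how are you?", "I'm fine."),
    ("hi, how are you?", "i'm fine."),   # duplicate after standardizing
    ("Thanks", None),                    # missing answer
    (None, "orphan answer"),             # missing question
]
cleaned = clean_pairs(raw)
print(cleaned)
```

With a pandas DataFrame the same steps would typically map to fillna, str.lower, and drop_duplicates.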
3. Data Transformation:
Tokenization: Split text data into tokens (words or subwords) using tokenizers from NLP libraries like spaCy or Hugging Face
Transformers.
Normalization: Normalize text data by converting it to lowercase, removing punctuation, special characters, and stop words.
Lemmatization/Stemming: Apply lemmatization or stemming to reduce words to their base or root form, aiding in reducing dimensionality and improving
model performance.
Vectorization: Convert text data into numerical format using techniques like TF-IDF, word embeddings (Word2Vec, GloVe), or transformer-based
embeddings (BERT, GPT).
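To make the vectorization step concrete, the sketch below computes textbook TF-IDF weights (term frequency times log(N / document frequency)) by hand. It is an illustration under stated assumptions, not the notebook's pipeline: the helper names tokenize and tfidf and the three sample sentences are made up here, and scikit-learn's TfidfVectorizer uses a smoothed, L2-normalized variant, so its numbers will differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical mini-corpus.
docs = ["how are you", "how is school", "school is fun"]
vectors = tfidf(docs)
print(vectors[0])
```

A term that occurs in every document gets weight 0, while a term unique to one document gets the highest weight, which is the property that makes TF-IDF useful for telling questions apart.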
4. Feature Engineering:
Contextual Features: Extract contextual features such as dialogue context, speaker information, and conversation history to enhance model
understanding.
Sentiment Analysis: Incorporate sentiment analysis to capture the emotional tone of the conversations.
Custom Features: Create custom features relevant to the chatbot’s domain, such as named entity recognition (NER) tags or topic modeling
outputs.
Feature Selection: Select and prioritize features that are most relevant to the task, using techniques like correlation analysis, mutual information, or feature
importance from tree-based models.
5. Model Selection and Training:
Model Architecture: Choose an appropriate model architecture for the chatbot, such as a transformer-based model (e.g., BERT, GPT), which performs strongly
on NLP tasks.
Hyperparameter Tuning: Perform hyperparameter tuning using techniques like grid search or random search to optimize model performance.
Training Process: Train the model on the preprocessed data, using appropriate loss functions, optimizers, and learning rate schedules.
Data Augmentation: Apply data augmentation techniques if necessary to artificially expand the training dataset and improve model robustness.
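Grid search, mentioned above, simply tries every combination of candidate values and keeps the best-scoring one. The sketch below shows the idea with the standard library; the scoring function toy_validation_score is a hypothetical stand-in for training a model and measuring it on the validation set, and the parameter names and values are assumptions for illustration only.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    # Hypothetical stand-in for "train, then score on validation data".
    # It peaks at learning_rate=0.01 and hidden_units=256.
    return (-abs(params["learning_rate"] - 0.01)
            - abs(params["hidden_units"] - 256) / 1000)


grid = {"learning_rate": [0.1, 0.01, 0.001], "hidden_units": [128, 256]}
best_params, best_score = grid_search(grid, toy_validation_score)
print(best_params)
```

Random search follows the same loop but samples a fixed number of combinations instead of enumerating all of them, which scales better when the grid is large.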
6. Model Evaluation:
Validation Split: Split the dataset into training, validation, and test sets to evaluate model performance.
Evaluation Metrics: Use relevant metrics such as accuracy, precision, recall, F1-score, and confusion matrix to assess model performance.
Cross-Validation: Implement cross-validation to ensure the model generalizes well to unseen data.
Error Analysis: Conduct error analysis to identify common failure modes and areas for improvement.
Iterative Improvement: Based on evaluation results, iterate on the model by refining preprocessing steps, adjusting features, or exploring alternative model
architectures.
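The evaluation metrics listed above can be computed directly from predictions. The following is a small sketch for one class treated as "positive"; the intent labels greet and bye and the prediction lists are hypothetical, and a real project would more likely call a library routine for this.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for five test utterances.
y_true = ["greet", "greet", "bye", "bye", "greet"]
y_pred = ["greet", "bye", "bye", "greet", "greet"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="greet")
print(precision, recall, f1, accuracy(y_true, y_pred))
```

Note that these classification metrics fit intent-style labels; for free-text responses, exact-match accuracy is usually very low and says little about response quality.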
Rule-Based Systems: Early systems like ELIZA and PARRY set the stage for automated dialogue.
Statistical Models: Hidden Markov models (HMMs) and conditional random fields (CRFs) applied machine learning to sequence-labeling tasks.
Sequence-to-Sequence Models: RNN-based encoder-decoder models, such as GNMT for translation, produced more coherent generated text.
Transformer Models: Transformers like BERT and GPT revolutionized NLP with attention mechanisms.
Pre-trained Language Models: BERT is strong at context comprehension, while GPT-style models are strong at response generation.
Multimodal Chatbots: Emerging chatbots integrate text, voice, and visual inputs for richer interactions.
Evaluation and Benchmarks: GLUE and SuperGLUE set standards for assessing model performance.
https://colab.research.google.com/drive/1B9vUz86zGJkw5auKFSUskzS0U5-BmGJE#scrollTo=wrcO4Fj0vveQ&printMode=true
5/30/24, 1:16 PM NLU chat bot phase 5.ipynb - Colab
FLOW CHART
FUTURE ENHANCEMENT
Consider adding multimodal capabilities to your chatbot for a richer user experience. This includes:
Image Recognition: Let the chatbot interpret images for tasks like product information or recommendations.
Voice Input: Enable users to interact with voice commands, enhancing accessibility and natural conversation.
Emotion Detection: Incorporate sentiment analysis to tailor responses based on user emotions.
Interactive Visual Output: Provide visual aids like graphs or charts for better information conveyance.
These enhancements broaden user engagement and cater to diverse needs, making your chatbot more
versatile and user-friendly.
CONCLUSION:
In conclusion, our NLU chatbot project signifies a significant advancement in conversational AI. Leveraging Kaggle's dataset and advanced ML techniques,
we've created a robust chatbot proficient in understanding and responding to natural language inputs. Overcoming challenges like data preprocessing and
model training, our chatbot showcases impressive performance.
This project isn't just a standalone chatbot but a precursor to future AI-driven conversational systems. With potential enhancements like multimodal
capabilities and personalization, our chatbot paves the way for more interactive human-computer interactions.
Reflecting on our journey, we acknowledge the collaborative effort and dedication involved. Moving forward, we're excited about deploying our chatbot
across various domains, revolutionizing user experiences. Overall, our project underscores AI and ML's transformative potential, offering innovative
solutions for human-centric interactions.
Mounted at /content/drive
import pandas as pd

columns = ['question', 'answer']
df = pd.read_csv('/content/drive/MyDrive/NLU chatbot/dialogs.txt', sep='\t', names=columns)
# Now df will have columns named 'question' and 'answer'
df.head()
   question                             answer
0  hi, how are you doing?               i'm fine. how about yourself?
1  i'm fine. how about yourself?        i'm pretty good. thanks for asking.
2  i'm pretty good. thanks for asking.  no problem. so how have you been?
[rows 3 and 4 are not recoverable from the export]
import matplotlib.pyplot as plt
from wordcloud import WordCloud

print("\nData types of columns:")
print(df.dtypes)
print("\nShape of the dataset:")
print(df.shape)
print("\nMissing values:")
print(df.isnull().sum())
Shape of the dataset:
(3725, 2)

Missing values:
question    0
answer      3
dtype: int64
Text Analysis:
Analyze the length distribution of questions and answers. Check for any unusual characters or patterns in the text. Explore the most common words or
phrases in questions and answers (word frequency analysis).
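The three checks described above (length distribution, unusual characters, word frequency) can be sketched with the standard library. The sample questions are taken from the first rows of the dataset shown earlier; the set of "expected" punctuation characters is an assumption made for this example.

```python
import re
import statistics
from collections import Counter

questions = [
    "hi, how are you doing?",
    "i'm fine. how about yourself?",
    "i'm pretty good. thanks for asking.",
]

# Length distribution: character count per question plus summary statistics.
lengths = [len(q) for q in questions]
length_summary = {
    "min": min(lengths),
    "max": max(lengths),
    "mean": statistics.mean(lengths),
}

# Unusual characters: anything that is not alphanumeric, whitespace,
# or ordinary punctuation (the allowed set here is an assumption).
allowed_punctuation = ".,?!'"
unusual_chars = sorted({
    ch for q in questions for ch in q
    if not (ch.isalnum() or ch.isspace() or ch in allowed_punctuation)
})

# Word frequency analysis.
word_counts = Counter(
    word for q in questions for word in re.findall(r"[a-z']+", q.lower())
)

print(length_summary)
print(unusual_chars)
print(word_counts.most_common(5))
```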
df['question_length'] = df['question'].apply(len)
Visualization:
Create visualizations to better understand the data distribution (e.g., histograms, word clouds). Plot the distribution of question and answer lengths.
Visualize word frequency using bar plots or word clouds.
# Replace NaN values with an empty string, then convert all entries to strings
df['answer'] = df['answer'].fillna('').astype(str)
Topic Modeling:
Use techniques like Latent Dirichlet Allocation (LDA) to identify underlying topics in the questions and answers. Cluster similar questions and answers based
on topic distributions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
X = vectorizer.fit_transform(df['question'])
lda = LatentDirichletAllocation(n_components=7, random_state=50)
lda.fit(X)

# Display the top 10 words for each topic
feature_names = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx}:", ", ".join(feature_names[i] for i in topic.argsort()[:-11:-1]))
Topic 0: yes, good, need, ll, course, don, thank, hope, talk, just
Topic 1: like, lot, make, people, don, got, great, fun, doing, sounds
Topic 2: didn, day, sure, just, tell, come, maybe, nice, bad, eat
Topic 3: going, mean, does, tv, heard, party, long, happened, idea, told
Topic 4: think, right, time, ll, money, okay, today, school, look, love
Topic 5: want, know, don, really, ve, movie, weather, yeah, buy, best
Topic 6: did, let, say, new, wrong, ll, enjoy, smell, school, phone
Language Complexity:
Measure the complexity of language used in questions and answers (e.g., average word length, vocabulary richness). Explore readability scores or linguistic
features.
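Two of the complexity measures named above, average word length and vocabulary richness, are simple to compute. The sketch below uses the type-token ratio (unique words divided by total words) as the richness measure; that choice, the function name complexity, and the sample texts are assumptions for illustration.

```python
import re


def complexity(texts):
    """Return average word length and type-token ratio for a list of texts."""
    words = [w for t in texts for w in re.findall(r"[a-z']+", t.lower())]
    if not words:
        return {"avg_word_length": 0.0, "type_token_ratio": 0.0}
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Unique words / total words: closer to 1 means a richer vocabulary.
        "type_token_ratio": len(set(words)) / len(words),
    }


# Hypothetical sample texts.
stats = complexity(["go go go", "stop"])
print(stats)
```

The type-token ratio falls as a corpus grows, so it is best compared between samples of similar size, for example questions versus answers from the same dataset.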
Data Preprocessing
new_dialogue_data = [
    ["Hi", "Hello"],
    ["How are you?", "I'm good, thanks for asking. How about you?"],
    ["I'm doing well too.", "That's great to hear. What have you been up to lately?"],
    ["Not much, just working and spending time with family.", "That sounds nice. Have you watched any good movies recently?"],
    ["Yeah, I saw a really good one last weekend.", "It was a thriller, right? I heard good things about it."],
    ["Yes, it was.", "Do you want to watch it together sometime?"],
    ["Sure, that sounds like a plan.", "Awesome! Let's plan it for this weekend."],
    ["Sounds good to me.", "Alright then, it's a plan. What time works for you?"],
    ["How about Saturday evening?", "Perfect! Saturday evening it is. I'll book the tickets."],
    ["Great! Looking forward to it.", "Me too. It'll be fun."]
]
new_df = pd.DataFrame(new_dialogue_data, columns=columns)

      question                        answer                                      question_length  answer_length
3730  Yes, it was.                    Do you want to watch it together sometime?  NaN              NaN
3731  Sure, that sounds like a plan.  Awesome! Let's plan it for this weekend.    NaN              NaN
3732  Sounds good to me.              Alright then, it's a plan. What...          NaN              NaN

Preprocessing
Lowercase
Tokenization
Stop words removal
Lemmatization
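The four steps listed above can be chained into one function. This is a rough standard-library sketch, not the notebook's NLTK code: the stop-word set is a small illustrative subset, and naive_lemma is a crude plural-stripping rule standing in for a real lemmatizer such as the one NLTK provides.

```python
import re

# Illustrative subset only; a real stop-word list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "it",
              "for", "this", "that", "and", "of"}


def naive_lemma(token):
    """Crude stand-in for lemmatization: strip a plural 's' from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    # 1. Lowercase, 2. tokenize into alphabetic words.
    tokens = re.findall(r"[a-z]+", text.lower())
    # 3. Remove stop words, 4. reduce each remaining token to a base form.
    return [naive_lemma(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("I'll book the tickets for this weekend.")
print(tokens)
```

Removing stop words before lemmatizing, as done here, keeps the suffix rule from mangling short function words such as "this".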
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
      question                        answer                                      question_length  answer_length
3730  yes, it was.                    do you want to watch it together sometime?  NaN              NaN
3731  sure, that sounds like a plan.  awesome! let's plan it for this weekend.    NaN              NaN
3732  sounds good to me.              alright then, it's a plan. what...          NaN              NaN

Classical ML ChatBot
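from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# TF-IDF features feed a random forest that treats each answer as a class label
Pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', RandomForestClassifier())
])

# The pipeline must be fitted on the training split before this call (fitting cell not shown)
predicted_text = Pipe.predict(X_test)

# Creating a DataFrame to compare the first 10 results
comparison_df = pd.DataFrame({'Real Question ': X_test[:10], 'Real Generated Text': y_test[:10], 'Predicted Text': predicted_text[:10]})
comparison_df.head(10)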
[Output table: first 10 test rows (indices 3253, 3190, 2194, 3303, 3214, 642, 184, 3515, 3185, 2206) with columns Real Question, Real Generated Text, Predicted Text. The cell text is interleaved in the export and cannot be reliably realigned.]
import seaborn as sns

# Visualize Feature Importance
if isinstance(Pipe.named_steps['classifier'], RandomForestClassifier):
    feature_importances = Pipe.named_steps['classifier'].feature_importances_
    feature_names = Pipe.named_steps['tfidf'].get_feature_names_out()

    # Assumed step (missing from the export): keep the 10 most important tokens
    top_idx = feature_importances.argsort()[-10:][::-1]
    top_feature_importances = feature_importances[top_idx]
    top_feature_names = feature_names[top_idx]

    plt.figure(figsize=(6, 4))
    sns.barplot(x=top_feature_importances, y=top_feature_names)
    plt.xlabel('Token Importance')
    plt.ylabel('Token Name')
    plt.title('Top 10 Token Importance')
    plt.show()
def chat():
    # Read user input until the user types 'quit'
    while True:
        question = input("You: ")
        if question.lower() == 'quit':
            print("Chatbot: Goodbye!")
            break
        response = get_response(question)
        print("Chatbot:", response)

# Start the chat
chat()
Encoder - Decoder Model with Attention and LSTMs Chatbot from scratch
import re
import unicodedata

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')

def unicode_to_ascii(s):
    # Decompose accented characters and drop the combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
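The rest of the preprocessing cell is not included in the export, but the output that follows suggests its effect: text is lowercased, punctuation becomes spaces, and each sentence is wrapped in <sos> and <eos> markers. The sketch below reproduces that behaviour under those assumptions; the function name preprocess_sentence and the regular expression are guesses for illustration, not the notebook's code.

```python
import re
import unicodedata


def unicode_to_ascii(s):
    """Decompose accented characters and drop the combining marks."""
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


def preprocess_sentence(text):
    """Lowercase, strip accents and punctuation, add start/end markers."""
    text = unicode_to_ascii(text.lower().strip())
    # Replace every run of non-alphanumeric characters with one space.
    text = re.sub(r"[^a-z0-9]+", " ", text).strip()
    return f"<sos> {text} <eos>"


print(unicode_to_ascii("caf\u00e9"))
print(preprocess_sentence("Awesome! Let's plan it for this weekend."))
```

The start and end markers let the decoder learn where a response begins and when to stop generating.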
      question                                 answer                                             question_length  answer_length
3730  <sos> yes it was <eos>                   <sos> do you want to watch it together sometim...  NaN              NaN
3731  <sos> sure that sounds like a plan <eos> <sos> awesome let s plan it for this weekend ...   NaN              NaN
3732  <sos> sounds good to me <eos>            <sos> alright then it s a plan what time wor...    NaN              NaN
3733  <sos> how about saturday evening <eos>   <sos> perfect saturday evening it is i ll bo...    NaN              NaN
3734  <sos> great looking forward to it <eos>  <sos> me too it ll be fun <eos>                    NaN              NaN

3735 rows × 4 columns
questions = preprocessed_df['question'].values.tolist()
answers = preprocessed_df['answer'].values.tolist()
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad sequences for equal length
max_len_question = max(len(seq) for seq in question_seqs)
max_len_answer = max(len(seq) for seq in answer_seqs)
max_len = max(max_len_question, max_len_answer)
print(max_len)

# Pad sequences separately for questions and answers
question_seqs = pad_sequences(question_seqs, maxlen=max_len, padding='post')
answer_seqs = pad_sequences(answer_seqs, maxlen=max_len, padding='post')
24
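Post-padding, as used above, appends zeros until every sequence reaches the same length. The following plain-Python sketch mimics that behaviour for illustration; pad_post is a hypothetical helper, and the token ids are made up. It also mirrors the Keras default of truncating over-long sequences from the front.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence on the right with `value` up to `maxlen`."""
    padded = []
    for seq in sequences:
        # Over-long sequences keep their last `maxlen` items,
        # matching Keras' default truncating='pre'.
        seq = list(seq)[-maxlen:]
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


# Hypothetical token-id sequences of different lengths.
batch = pad_post([[5, 8, 13], [7]], maxlen=3)
print(batch)
```

Padding on the right keeps the <sos> marker in the first position of every answer, which the decoder input relies on.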
# Passing a bare string makes the tokenizer iterate over its characters,
# so '<', 's', 'o', 's', '>' are each converted separately; wrap the text
# in a list to encode it as a single sequence.
tokenizer.texts_to_sequences("<sos>")
[[], [9], [490], [9], []]
tokenizer.word_index["<sos>"]
Model Architecture
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Attention, Concatenate, Dropout
# Print model summary
model.summary()

Model: "model"
Layer (type)             Output Shape        Param #   Connected to
==========================================================================
input_1 (InputLayer)     [(None, 24)]        0         []
input_2 (InputLayer)     [(None, 23)]        0         []
embedding (Embedding)    (None, 24, 256)     618496    ['input_1[0][0]']
embedding_1 (Embedding)  (None, 23, 256)     618496    ['input_2[0][0]']
lstm (LSTM)              [(None, 24, 256),   525312    ['embedding[0][0]']
                          (None, 256),
                          (None, 256)]

# Train the model
model.fit([question_seqs, answer_seqs[:, :-1]], answer_seqs[:, 1:], batch_size=64, epochs=32, validation_split=0.2)
Epoch 1/32
47/47 [==============================] - 63s 1s/step - loss: 3.0299 - val_loss: 2.2781
Epoch 2/32
47/47 [==============================] - 49s 1s/step - loss: 1.9036 - val_loss: 2.0941
Epoch 3/32
47/47 [==============================] - 49s 1s/step - loss: 1.7891 - val_loss: 2.0648
Epoch 4/32
47/47 [==============================] - 55s 1s/step - loss: 1.7495 - val_loss: 2.0476
Epoch 5/32
47/47 [==============================] - 47s 1s/step - loss: 1.7074 - val_loss: 2.0242
Epoch 6/32
47/47 [==============================] - 48s 1s/step - loss: 1.6609 - val_loss: 2.0007
Epoch 7/32
47/47 [==============================] - 49s 1s/step - loss: 1.6212 - val_loss: 1.9783
Epoch 8/32
47/47 [==============================] - 47s 1s/step - loss: 1.5819 - val_loss: 1.9536
Epoch 9/32
47/47 [==============================] - 48s 1s/step - loss: 1.5394 - val_loss: 1.9287
Epoch 10/32
47/47 [==============================] - 50s 1s/step - loss: 1.4969 - val_loss: 1.9095
Epoch 11/32
47/47 [==============================] - 47s 1s/step - loss: 1.4552 - val_loss: 1.8925
Epoch 12/32
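The training call passes answer_seqs[:, :-1] as decoder input and answer_seqs[:, 1:] as the target, which is why the second input layer has length 23 while sequences are padded to 24. The sketch below shows this one-step shift (teacher forcing) on a single sequence; the token ids are hypothetical.

```python
# Hypothetical ids: 1 = <sos>, 2 = <eos>, 0 = padding.
answer_seq = [1, 7, 12, 2, 0, 0]

# Decoder input drops the last position; the target drops the first.
decoder_input = answer_seq[:-1]
decoder_target = answer_seq[1:]

# At each step the decoder sees the true previous token
# and is trained to predict the next one.
steps = list(zip(decoder_input, decoder_target))
for seen, expected in steps:
    print(f"input {seen} -> target {expected}")
```

Because padded positions also appear as targets, the reported loss includes predictions of padding unless those positions are masked.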
def generate_response(input_text):
    # Tokenize the input text
    input_sequence = tokenizer.texts_to_sequences([input_text])
    # Pad the input sequence
    input_sequence = pad_sequences(input_sequence, maxlen=max_len, padding='post')
    # ... (the decoding steps are not included in the export)
    return output_text.strip()
# Test the function with a sample input
input_text = "how do you do"
response = generate_response(input_text)
# Skip the first five characters of the generated text
print("Response:", response[5:])
1/1 [==============================] - 2s 2s/step
1/1 [==============================] - 0s 175ms/step
1/1 [==============================] - 0s 175ms/step
1/1 [==============================] - 0s 172ms/step
1/1 [==============================] - 0s 171ms/step
Response: i m not sure

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = 'google/flan-t5-small'
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
Test the model with zero-shot inference. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does
pull out some important information from the text, which indicates the model can be fine-tuned to the task at hand.
index = 0
question = df['question'][index]
answer = df['answer'][index]