Britto 1 15 2 15 - Merged
A PROJECT REPORT
Submitted by
A.BRITTO RAJ
BACHELOR OF ENGINEERING
in
ODDANCHATRAM-624619
ANNA UNIVERSITY::CHENNAI–600025
MAY 2024
CHRISTIAN COLLEGE OF ENGINEERING AND TECHNOLOGY
ODDANCHATRAM-624619
REGISTER NO:
Visualization: Create visualizations (e.g., histograms, word clouds, and scatter plots) to explore the distribution of the data and understand patterns and
relationships within the dataset.
2. Data Cleaning:
Handling Missing Values: Identify and handle missing values through methods such as imputation, removal, or filling with placeholder values, depending on
the context.
Removing Duplicates: Check for and remove duplicate records to ensure data quality.
Outlier Detection: Detect and handle outliers that may skew the model training, using statistical methods or visualization techniques.
Standardizing Formats: Standardize data formats (e.g., date formats, text casing) to ensure consistency across the dataset.
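The cleaning steps above can be sketched in plain Python. This is only a minimal illustration, assuming the dialogue pairs are held as (question, answer) tuples with None marking a missing value; the function name `clean_pairs` and the empty-string placeholder are assumptions made for the example, not code from the notebook.

```python
def clean_pairs(pairs, placeholder=""):
    """Fill missing answers, standardize casing/whitespace, drop duplicates.

    `pairs` is a list of (question, answer) tuples; None marks a missing value.
    """
    seen = set()
    cleaned = []
    for question, answer in pairs:
        if question is None:
            continue  # a pair without a question cannot be used, so drop it
        if answer is None:
            answer = placeholder  # fill a missing answer with a placeholder
        # Standardize format: lowercase and collapse runs of whitespace
        question = " ".join(question.lower().split())
        answer = " ".join(answer.lower().split())
        if (question, answer) in seen:
            continue  # skip exact duplicates (after standardization)
        seen.add((question, answer))
        cleaned.append((question, answer))
    return cleaned


raw = [
    ("Hi,  how are you?", "I'm fine."),
    ("hi, how are you?", "i'm fine."),   # duplicate once standardized
    ("What's up?", None),                # missing answer
    (None, "orphan answer"),             # missing question
]
print(clean_pairs(raw))
```

Whether a missing answer should be filled or the whole pair removed depends on the context; the sketch fills it only to mirror the placeholder strategy listed above.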
3. Data Transformation:
Tokenization: Split text data into tokens (words or subwords) using tokenizers from NLP libraries like SpaCy or Hugging Face Transformers.
Normalization: Normalize text data by converting it to lowercase, removing punctuation, special characters, and stop words.
Lemmatization/Stemming: Apply lemmatization or stemming to reduce words to their base or root form, aiding in reducing dimensionality and improving
model performance.
Vectorization: Convert text data into numerical format using techniques like TF-IDF, word embeddings (Word2Vec, GloVe), or transformer-based
embeddings (BERT, GPT).
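To make the normalization and vectorization steps concrete, here is a small from-scratch sketch using only the standard library. The stop-word list, the function names `normalize` and `tf_idf`, and the unsmoothed TF-IDF formula are assumptions chosen for illustration; libraries such as scikit-learn apply smoothing and normalization, so their values will differ.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and"}  # tiny illustrative list


def normalize(text):
    """Lowercase, replace non-alphanumeric characters with spaces, drop stop words."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {token: weight} dict per document.

    Uses tf = count / document length and idf = log(N / document frequency).
    """
    tokenized = [normalize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            tok: (c / len(toks)) * math.log(n_docs / doc_freq[tok])
            for tok, c in counts.items()
        })
    return vectors


docs = ["How are you doing?", "How is the weather today?"]
vectors = tf_idf(docs)
print(normalize(docs[0]))
print(vectors[0])
```

A token that appears in every document (here "how") gets weight zero, which is the intended effect of the inverse document frequency term.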
4. Feature Engineering:
Contextual Features: Extract contextual features such as dialogue context, speaker information, and conversation history to enhance model understanding.
Sentiment Analysis: Incorporate sentiment analysis to capture the emotional tone of the conversations.
Custom Features: Create custom features relevant to the chatbot’s domain, such as named entity recognition (NER) tags or topic modeling outputs.
Feature Selection: Select and prioritize features that are most relevant to the task, using techniques like correlation analysis, mutual information, or feature
importance from tree-based models.
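As a rough illustration of the contextual features described above, the sketch below attaches a short conversation history and a speaker tag to each utterance. The function name `add_context`, the window size, and the assumption that two speakers alternate strictly are all hypothetical choices for the example.

```python
def add_context(turns, window=2):
    """Pair each utterance with up to `window` preceding utterances."""
    examples = []
    for i, utterance in enumerate(turns):
        history = turns[max(0, i - window):i]  # the most recent `window` turns
        examples.append({
            "history": history,
            "utterance": utterance,
            "turn_index": i,  # position in the conversation
            "speaker": "A" if i % 2 == 0 else "B",  # assumes strictly alternating speakers
        })
    return examples


turns = ["hi", "hello", "how are you?", "fine"]
for example in add_context(turns):
    print(example)
```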
5. Model Selection and Training:
Model Architecture: Choose appropriate model architectures for the chatbot, such as transformer-based models (e.g., BERT, GPT), for their superior
performance in NLP tasks.
Hyperparameter Tuning: Perform hyperparameter tuning using techniques like grid search or random search to optimize model performance.
Training Process: Train the model on the preprocessed data, using appropriate loss functions, optimizers, and learning rate schedules.
Data Augmentation: Apply data augmentation techniques if necessary to artificially expand the training dataset and improve model robustness.
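The grid search mentioned under hyperparameter tuning can be sketched as an exhaustive loop over parameter combinations. In this illustration `grid_search` and `toy_score` are hypothetical names; `toy_score` is only a stand-in for "train the model with these settings and return its validation score", and the parameter names are assumptions.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in `param_grid` and return the best one.

    `score_fn` takes a dict of hyperparameters and returns a validation
    score where higher is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical stand-in for training a model and scoring it on validation data.
def toy_score(params):
    return -abs(params["learning_rate"] - 0.01) - 0.001 * abs(params["units"] - 256)


grid = {"learning_rate": [0.1, 0.01, 0.001], "units": [128, 256]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

Random search follows the same pattern but samples a fixed number of combinations instead of enumerating all of them, which is usually cheaper when the grid is large.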
6. Model Evaluation:
Validation Split: Split the dataset into training, validation, and test sets to evaluate model performance.
Evaluation Metrics: Use relevant metrics such as accuracy, precision, recall, F1-score, and confusion matrix to assess model performance.
Cross-Validation: Implement cross-validation to ensure the model generalizes well to unseen data.
Error Analysis: Conduct error analysis to identify common failure modes and areas for improvement.
Iterative Improvement: Based on evaluation results, iterate on the model by refining preprocessing steps, adjusting features, or exploring alternative model
architectures.
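The validation split and the precision, recall, and F1 metrics listed above can be illustrated with a short standard-library sketch. The split fractions, the seed, and the label names are assumptions for the example; these classification metrics apply when the chatbot task is framed as predicting a label (for instance an intent), which is an assumption here rather than something the notebook states.

```python
import random


def split_dataset(items, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle and split into train / validation / test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


def precision_recall_f1(y_true, y_pred, positive):
    """Binary precision, recall and F1 for the class `positive`."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


train, val, test = split_dataset(range(100))
y_true = ["greet", "greet", "bye", "bye", "greet"]
y_pred = ["greet", "bye", "bye", "greet", "greet"]
print(len(train), len(val), len(test))
print(precision_recall_f1(y_true, y_pred, positive="greet"))
```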
Rule-Based Systems: Early systems like ELIZA and PARRY set the stage for automated dialogue.
Statistical Models: Techniques like HMM and CRF improved tasks with machine learning.
Sequence-to-Sequence Models: Models like GNMT brought coherence to responses using RNNs.
Transformer Models: Transformers like BERT and GPT revolutionized NLP with attention mechanisms.
Pre-trained Language Models: BERT excels at context comprehension, while GPT excels at response generation.
Multimodal Chatbots: Emerging chatbots integrate text, voice, and visual inputs for richer interactions.
Evaluation and Benchmarks: GLUE and SuperGLUE set standards for assessing model performance.
https://colab.research.google.com/drive/1B9vUz86zGJkw5auKFSUskzS0U5-BmGJE#scrollTo=wrcO4Fj0vveQ&printMode=true 2/15
5/30/24, 1:16 PM NLU chat bot phase 5.ipynb - Colab
FLOW CHART
FUTURE ENHANCEMENT
Consider adding multimodal capabilities to your chatbot for a richer user experience. This includes:
Image Recognition: Let the chatbot interpret images for tasks like product information or recommendations.
Voice Input: Enable users to interact with voice commands, enhancing accessibility and natural conversation.
Emotion Detection: Incorporate sentiment analysis to tailor responses based on user emotions.
Interactive Visual Output: Provide visual aids like graphs or charts for better information conveyance.
These enhancements broaden user engagement and cater to diverse needs, making your chatbot more versatile and user-friendly.
CONCLUSION:
In conclusion, our NLU chatbot project signifies a significant advancement in conversational AI. Leveraging Kaggle's dataset and advanced ML techniques,
we've created a robust chatbot proficient in understanding and responding to natural language inputs. Overcoming challenges like data preprocessing and
model training, our chatbot showcases impressive performance.
This project isn't just a standalone chatbot but a precursor to future AI-driven conversational systems. With potential enhancements like multimodal
capabilities and personalization, our chatbot paves the way for more interactive human-computer interactions.
Reflecting on our journey, we acknowledge the collaborative effort and dedication involved. Moving forward, we're excited about deploying our chatbot
across various domains, revolutionizing user experiences. Overall, our project underscores AI and ML's transformative potential, offering innovative
solutions for human-centric interactions.
Mounted at /content/drive
   question                             answer
0  hi, how are you doing?               i'm fine. how about yourself?
1  i'm fine. how about yourself?        i'm pretty good. thanks for asking.
2  i'm pretty good. thanks for asking.  no problem. so how have you been?
3  no problem. so how have you been?    i've been great. what about you?
4  i've been great. what about you?     i've ...
Missing values:
question 0
answer 3
dtype: int64
Text Analysis:
Analyze the length distribution of questions and answers. Check for any unusual characters or patterns in the text. Explore the most common words or
phrases in questions and answers (word frequency analysis).
df['question_length'] = df['question'].apply(len)
# Convert all entries in the 'answer' column to strings before applying len()
df['answer_length'] = df['answer'].astype(str).apply(len)
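The word-frequency analysis mentioned above could look like the following sketch. The helper name `top_words` and the simple regular-expression tokenizer are assumptions; in the notebook's setting it would presumably be called on a column such as `df['question']`.

```python
import re
from collections import Counter


def top_words(texts, n=5):
    """Return the `n` most common lowercase word tokens across `texts`."""
    counts = Counter()
    for text in texts:
        # str() guards against non-string entries such as NaN
        counts.update(re.findall(r"[a-z']+", str(text).lower()))
    return counts.most_common(n)


sample = ["hi, how are you doing?", "i'm fine. how about yourself?"]
print(top_words(sample, n=3))
```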
Visualization:
Create visualizations to better understand the data distribution (e.g., histograms, word clouds). Plot the distribution of question and answer lengths.
Visualize word frequency using bar plots or word clouds.
# Replace NaN values with an empty string, then convert all entries to strings
df['answer'] = df['answer'].fillna('').astype(str)
Topic Modeling:
Use techniques like Latent Dirichlet Allocation (LDA) to identify underlying topics in the questions and answers. Cluster similar questions and answers based
on topic distributions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
X = vectorizer.fit_transform(df['question'])
lda = LatentDirichletAllocation(n_components=7, random_state=50)
lda.fit(X)
# Display the top 10 words for each topic
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx}:", ", ".join([vectorizer.get_feature_names_out()[i] for i in topic.argsort()[:-11:-1]]))
Topic 0: yes, good, need, ll, course, don, thank, hope, talk, just
Topic 1: like, lot, make, people, don, got, great, fun, doing, sounds
Topic 2: didn, day, sure, just, tell, come, maybe, nice, bad, eat
Topic 3: going, mean, does, tv, heard, party, long, happened, idea, told
Topic 4: think, right, time, ll, money, okay, today, school, look, love
Topic 5: want, know, don, really, ve, movie, weather, yeah, buy, best
Topic 6: did, let, say, new, wrong, ll, enjoy, smell, school, phone
Language Complexity:
Measure the complexity of language used in questions and answers (e.g., average word length, vocabulary richness). Explore readability scores or linguistic
features.
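Two of the complexity measures named above, average word length and vocabulary richness, can be computed with a few lines of standard-library Python. The function name `complexity_stats` is hypothetical, and vocabulary richness is taken here to mean the type-token ratio (unique words divided by total words), which is one common definition among several.

```python
import re


def complexity_stats(texts):
    """Average word length and type-token ratio (unique words / total words)."""
    words = [w for t in texts for w in re.findall(r"[a-z']+", str(t).lower())]
    if not words:
        return {"avg_word_length": 0.0, "type_token_ratio": 0.0}
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
    }


stats = complexity_stats(["how are you", "how is it going"])
print(stats)
```

The type-token ratio falls as a corpus grows, so it is best compared between text samples of similar size.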
Data Preprocessing
columns = ['question', 'answer']  # assumed: same column names as the original dataset
new_dialogue_data = [
    ["Hi", "Hello"],
    ["How are you?", "I'm good, thanks for asking. How about you?"],
    ["I'm doing well too.", "That's great to hear. What have you been up to lately?"],
    ["Not much, just working and spending time with family.", "That sounds nice. Have you watched any good movies recently?"],
    ["Yeah, I saw a really good one last weekend.", "It was a thriller, right? I heard good things about it."],
    ["Yes, it was.", "Do you want to watch it together sometime?"],
    ["Sure, that sounds like a plan.", "Awesome! Let's plan it for this weekend."],
    ["Sounds good to me.", "Alright then, it's a plan. What time works for you?"],
    ["How about Saturday evening?", "Perfect! Saturday evening it is. I'll book the tickets."],
    ["Great! Looking forward to it.", "Me too. It'll be fun."],
]
new_df = pd.DataFrame(new_dialogue_data, columns=columns)
      question                       answer                               question_length  answer_length
0     hi, how are you doing?         i'm fine. how about yourself?        22.0             29.0
1     i'm fine. how about yourself?  i'm pretty good. thanks for asking.  29.0             35.0
...
3732  Sounds good to me.             Alright then, it's a plan. What ...  NaN              NaN
Preprocessing
Lowercase
Tokenization
return text
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
      question                       answer                               question_length  answer_length
0     hi, how are you doing?         i'm fine. how about yourself?        22.0             29.0
1     i'm fine. how about yourself?  i'm pretty good. thanks for asking.  29.0             35.0
...
3732  sounds good to me.             alright then, it's a plan. what ...  NaN              NaN
Classical ML ChatBot
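from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
# TF-IDF features feed a random forest that predicts the answer text as a class label
Pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', RandomForestClassifier())
])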
Evaluation
predicted_text = Pipe.predict(X_test)
# Create a DataFrame to compare the first 10 results
comparison_df = pd.DataFrame({
    'Real Question': X_test[:10],
    'Real Generated Text': y_test[:10],
    'Predicted Text': predicted_text[:10],
})
comparison_df.head(10)
Real Question Real Generated Text Predicted Text
3253 how are you doing that? i started shopping at the dollar store. i don't know. i think i'm average.
3190 the pants are fine, but the pocket has a huge ... you shouldn't carry your keys and pens in your... i'm pretty good. thanks for asking.
2194 uh-oh. that means that she's fat and ugly. she's cute. men singers don't have to look good.
3303 that's great. we won't have neighbors on both ... no pets are allowed. if they don't like it, they can move.
642 i really wanted you to come, but i understand. yeah, maybe next time. this friday? sorry, i already have plans.
3214 that's a good deal. and a one-pound tub of soft butter was the sam... yes, even though some of the potatoes had eyes.
184 she's one of the prettiest girls at the school. what does she look like? maybe we should learn some good jokes.
3515 no, that's not the problem. maybe it will go away in a little while. did you need something?
3185 why not? i didn't want to pay for the holes. you might hit someone in the head.
plt.figure(figsize=(6, 4))
sns.barplot(x=top_feature_importances, y=top_feature_names)
plt.xlabel('Token Importance')
plt.ylabel('Token Name')
plt.title('Top 10 Token Importances')
plt.show()
# Simple chat loop; type 'quit' to exit
while True:
    question = input("You: ")
    if question.lower() == 'quit':
        print("Chatbot: Goodbye!")
        break
    response = get_response(question)
    print("Chatbot:", response)
2 frames
/usr/local/lib/python3.10/dist-packages/ipykernel/kernelbase.py in _input_request(self, prompt, ident, parent, password)
893 except KeyboardInterrupt:
894 # re-raise KeyboardInterrupt, to truncate traceback
--> 895 raise KeyboardInterrupt("Interrupted by user") from None
896 except Exception as e:
897 self.log.warning("Invalid Message:", exc_info=True)
Encoder - Decoder Model with Attention and LSTMs Chatbot from scratch
import re
import unicodedata
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
def unicode_to_ascii(s):
    # Decompose accented characters (NFD) and drop the combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
0 <sos> hi how are you doing <eos> <sos> i m fine how about yourself <eos> 22.0 29.0
1 <sos> i m fine how about yourself <eos> <sos> i m pretty good thanks for asking <eos> 29.0 35.0
2 <sos> i m pretty good thanks for asking <eos> <sos> no problem so how have you been <eos> 35.0 33.0
3 <sos> no problem so how have you been <eos> <sos> i ve been great what about you <eos> 33.0 32.0
4 <sos> i ve been great what about you <eos> <sos> i ve been good i m in school right now ... 32.0 40.0
3730 <sos> yes it was <eos> <sos> do you want to watch it together sometim... NaN NaN
3731 <sos> sure that sounds like a plan <eos> <sos> awesome let s plan it for this weekend ... NaN NaN
3732 <sos> sounds good to me <eos> <sos> alright then it s a plan what time wor... NaN NaN
3733 <sos> how about saturday evening <eos> <sos> perfect saturday evening it is i ll bo... NaN NaN
3734 <sos> great looking forward to it <eos> <sos> me too it ll be fun <eos> NaN NaN
3735 rows × 4 columns
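The function that produced the table above is not shown in this printout. The sketch below is one way to obtain the same format, assuming the steps are lowercasing, accent stripping, replacing punctuation with spaces, and wrapping in `<sos>`/`<eos>` markers; the name `preprocess_sentence` and the exact regular expression are assumptions, not the notebook's code.

```python
import re
import unicodedata


def unicode_to_ascii(s):
    # Decompose accented characters (NFD) and drop the combining marks
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")


def preprocess_sentence(text):
    """Lowercase, strip accents and punctuation, wrap in <sos>/<eos> markers."""
    text = unicode_to_ascii(text.lower().strip())
    # Replace every run of non-alphanumeric characters with a single space
    text = re.sub(r"[^a-z0-9]+", " ", text).strip()
    return f"<sos> {text} <eos>"


print(preprocess_sentence("hi, how are you doing?"))
print(preprocess_sentence("i'm fine. how about yourself?"))
```

The start and end markers give the decoder an explicit signal for where a response begins and where generation should stop.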
questions = preprocessed_df['question'].values.tolist()
answers = preprocessed_df['answer'].values.tolist()
# Pad sequences to equal length
max_len_question = max(len(seq) for seq in question_seqs)
max_len_answer = max(len(seq) for seq in answer_seqs)
max_len = max(max_len_question, max_len_answer)
print(max_len)
# Pad sequences separately for questions and answers
question_seqs = pad_sequences(question_seqs, maxlen=max_len, padding='post')
answer_seqs = pad_sequences(answer_seqs, maxlen=max_len, padding='post')
24
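To show what post-padding does to the token sequences, here is a pure-Python sketch of the idea. `pad_post` is a hypothetical helper written for illustration and is not the Keras `pad_sequences` function; it assumes no sequence is longer than `maxlen`, so truncation is not handled.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen`.

    Sequences are assumed to be no longer than `maxlen`, which holds when
    `maxlen` is the longest length in the data.
    """
    return [list(seq) + [value] * (maxlen - len(seq)) for seq in sequences]


seqs = [[5, 8, 2], [7], [3, 4]]
max_len = max(len(s) for s in seqs)
print(pad_post(seqs, max_len))
```

Padding at the end keeps the first real token at position 0 in every sequence, and index 0 is reserved for the padding value.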
tokenizer.texts_to_sequences(["<sos>"])  # pass a list of texts; a bare string is iterated character by character
tokenizer.word_index["<sos>"]
Model Architecture
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Attention, Concatenate, Dropout
Model: "model"
(None, 256),
(None, 256)]
==================================================================================================
Total params: 3527024 (13.45 MB)
Trainable params: 3527024 (13.45 MB)
Non-trainable params: 0 (0.00 Byte)
Epoch 1/32
47/47 [==============================] - 63s 1s/step - loss: 3.0299 - val_loss: 2.2781
Epoch 2/32
47/47 [==============================] - 49s 1s/step - loss: 1.9036 - val_loss: 2.0941
Epoch 3/32
47/47 [==============================] - 49s 1s/step - loss: 1.7891 - val_loss: 2.0648
Epoch 4/32
47/47 [==============================] - 55s 1s/step - loss: 1.7495 - val_loss: 2.0476
Epoch 5/32
47/47 [==============================] - 47s 1s/step - loss: 1.7074 - val_loss: 2.0242
Epoch 6/32
47/47 [==============================] - 48s 1s/step - loss: 1.6609 - val_loss: 2.0007
Epoch 7/32
47/47 [==============================] - 49s 1s/step - loss: 1.6212 - val_loss: 1.9783
Epoch 8/32
47/47 [==============================] - 47s 1s/step - loss: 1.5819 - val_loss: 1.9536
Epoch 9/32
47/47 [==============================] - 48s 1s/step - loss: 1.5394 - val_loss: 1.9287
Epoch 10/32
47/47 [==============================] - 50s 1s/step - loss: 1.4969 - val_loss: 1.9095
Epoch 11/32
47/47 [==============================] - 47s 1s/step - loss: 1.4552 - val_loss: 1.8925
Epoch 12/32
def generate_response(input_text):
    # Tokenize the input text
    input_sequence = tokenizer.texts_to_sequences([input_text])
    # Pad the input sequence
    input_sequence = pad_sequences(input_sequence, maxlen=max_len, padding='post')
    # ... (the decoding steps that build output_text are not shown in this printout)
    return output_text.strip()
1/1 [==============================] - 2s 2s/step
1/1 [==============================] - 0s 175ms/step
1/1 [==============================] - 0s 175ms/step
1/1 [==============================] - 0s 172ms/step
1/1 [==============================] - 0s 171ms/step
Response: i m not sure
model_name='google/flan-t5-small'
Test the model with zero-shot inference. The model struggles to summarize the dialogue compared to the baseline summary, but it does
pull out some important information from the text, which indicates that the model can be fine-tuned to the task at hand.
index = 0
question = df['question'][index]
answer = df['answer'][index]