
NLU CHATBOT

A PROJECT REPORT

Submitted by

A. BRITTO RAJ

In partial fulfillment for the award of the degree of

BACHELOR OF ENGINEERING

in

COMPUTER SCIENCE AND ENGINEERING

CHRISTIAN COLLEGE OF ENGINEERING AND TECHNOLOGY,
ODDANCHATRAM-624619

ANNA UNIVERSITY::CHENNAI-600025

MAY 2024
CHRISTIAN COLLEGE OF ENGINEERING AND TECHNOLOGY

ODDANCHATRAM-624619

NM1022 – EXPERIENCE BASED PROJECT LEARNING

RECORD NOTE BOOK

REGISTER NO:

Certified that it is a bonafide record of practical lab work done by                 during the year                .

STAFF IN-CHARGE                                    HEAD OF THE DEPARTMENT

Submitted for the practical exam held on                .

Internal Examiner                                  External Examiner

Initial Inspection: Load the datasets into a pandas DataFrame and perform an initial inspection to understand their structure and content (e.g., number of records, columns, and data types).

Visualization: Create visualizations (e.g., histograms, word clouds, and scatter plots) to explore the distribution of the data and understand patterns and relationships within the dataset.

2. Data Cleaning:
Handling Missing Values: Identify and handle missing values through methods such as imputation, removal, or filling with placeholder values, depending on the context.

Removing Duplicates: Check for and remove duplicate records to ensure data quality.

Outlier Detection: Detect and handle outliers that may skew model training, using statistical methods or visualization techniques.

Standardizing Formats: Standardize data formats (e.g., date formats, text casing) to ensure consistency across the dataset.
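
A minimal pandas sketch of these cleaning steps, assuming the question/answer DataFrame used later in this notebook:

import pandas as pd

def clean_dialog_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Handle missing values: fill missing answers with an empty placeholder
    df['answer'] = df['answer'].fillna('')
    # Remove exact duplicate question/answer pairs
    df = df.drop_duplicates(subset=['question', 'answer'])
    # Standardize formats: cast to string, lowercase, strip whitespace
    for col in ('question', 'answer'):
        df[col] = df[col].astype(str).str.lower().str.strip()
    # Drop length outliers beyond the 99th percentile of question length
    lengths = df['question'].str.len()
    return df[lengths <= lengths.quantile(0.99)]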

3. Data Transformation:
Tokenization: Split text data into tokens (words or subwords) using tokenizers from NLP libraries like SpaCy or Hugging Face Transformers.

Normalization: Normalize text data by converting it to lowercase and removing punctuation, special characters, and stop words.

Lemmatization/Stemming: Apply lemmatization or stemming to reduce words to their base or root form, helping reduce dimensionality and improve model performance.

Vectorization: Convert text data into numerical format using techniques like TF-IDF, word embeddings (Word2Vec, GloVe), or transformer-based embeddings (BERT, GPT).
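
An illustrative sketch combining these steps with NLTK and scikit-learn (both used later in this notebook); the sample texts are hypothetical:

import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')

texts = ["Hi, how are you doing?", "I'm fine. How about yourself?"]

# Tokenize and normalize: lowercase, keep alphanumeric tokens only
normalized = [' '.join(w for w in word_tokenize(t.lower()) if w.isalnum())
              for t in texts]

# Vectorize the normalized text with TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(normalized)
print(X.shape)  # (2 documents, vocabulary size)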

4. Feature Engineering:
Contextual Features: Extract contextual features such as dialogue context, speaker information, and conversation history to enhance model understanding.

Sentiment Analysis: Incorporate sentiment analysis to capture the emotional tone of the conversations.

Custom Features: Create custom features relevant to the chatbot's domain, such as named entity recognition (NER) tags or topic modeling outputs.

Feature Selection: Select and prioritize features that are most relevant to the task, using techniques like correlation analysis, mutual information, or feature importance from tree-based models.
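
As a concrete illustration, sentiment and simple surface features could be added with TextBlob (imported later in this notebook); the sample data here is made up:

import pandas as pd
from textblob import TextBlob

df = pd.DataFrame({'question': ["i love this weather", "this is terrible"]})

# Sentiment polarity in [-1, 1] captures the emotional tone
df['sentiment'] = df['question'].apply(lambda t: TextBlob(t).sentiment.polarity)

# Simple surface features: character and word counts
df['n_chars'] = df['question'].str.len()
df['n_words'] = df['question'].str.split().str.len()
print(df)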
5. Model Selection and Training:
Model Architecture: Choose appropriate model architectures for the chatbot, such as transformer-based models (e.g., BERT, GPT) for their superior performance in NLP tasks.

Hyperparameter Tuning: Perform hyperparameter tuning using techniques like grid search or random search to optimize model performance.

Training Process: Train the model on the preprocessed data, using appropriate loss functions, optimizers, and learning rate schedules.

Data Augmentation: Apply data augmentation techniques if necessary to artificially expand the training dataset and improve model robustness.
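
A hedged grid-search sketch over the TF-IDF + RandomForest pipeline built later in this report; the parameter values are illustrative, not tuned settings:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('classifier', RandomForestClassifier())])

# Illustrative search space; values are assumptions
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'classifier__n_estimators': [100, 300],
    'classifier__max_depth': [None, 20],
}

search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
# search.fit(X_train, y_train)   # uses the train split defined later
# print(search.best_params_)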

6. Model Evaluation:
Validation Split: Split the dataset into training, validation, and test sets to evaluate model performance.

Evaluation Metrics: Use relevant metrics such as accuracy, precision, recall, F1-score, and the confusion matrix to assess model performance.

Cross-Validation: Implement cross-validation to ensure the model generalizes well to unseen data.

Error Analysis: Conduct error analysis to identify common failure modes and areas for improvement.

Iterative Improvement: Based on evaluation results, iterate on the model by refining preprocessing steps, adjusting features, or exploring alternative model architectures.
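
For illustration, a small self-contained sketch of these metric calls on hypothetical intent labels (the labels and values here are made up):

from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted intent labels, only to show the calls
y_true = ['greet', 'greet', 'ask_time', 'ask_time', 'bye']
y_pred = ['greet', 'ask_time', 'ask_time', 'ask_time', 'bye']

# Precision, recall, F1-score, and accuracy in one report
print(classification_report(y_true, y_pred))

# Confusion matrix with a fixed label order
print(confusion_matrix(y_true, y_pred, labels=['greet', 'ask_time', 'bye']))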

NLU chatbot development has evolved through:

Rule-Based Systems: Early systems like ELIZA and PARRY set the stage for automated dialogue.

Statistical Models: Techniques like HMM and CRF improved tasks with machine learning.

Sequence-to-Sequence Models: Models like GNMT brought coherence to responses using RNNs.

Transformer Models: Transformers like BERT and GPT revolutionized NLP with attention mechanisms.

Pre-trained Language Models: BERT and GPT excel in context comprehension and response generation.

Conversational AI Platforms: Tools like Dialogflow streamline chatbot development.

Multimodal Chatbots: Emerging chatbots integrate text, voice, and visual inputs for richer interactions.

Evaluation and Benchmarks: GLUE and SuperGLUE set standards for assessing model performance.

FLOW CHART

FUTURE ENHANCEMENT
Consider adding multimodal capabilities to your chatbot for a richer user experience. This includes:

Image Recognition: Let the chatbot interpret images for tasks like product information or recommendations.

Voice Input: Enable users to interact with voice commands, enhancing accessibility and natural conversation.

Emotion Detection: Incorporate sentiment analysis to tailor responses based on user emotions.

Interactive Visual Output: Provide visual aids like graphs or charts for better information conveyance.

Personalization: Offer personalized responses based on user history and preferences.

These enhancements broaden user engagement and cater to diverse needs, making your chatbot more
versatile and user-friendly.

CONCLUSION:

In conclusion, our NLU chatbot project marks a significant advancement in conversational AI. Leveraging Kaggle's dataset and advanced ML techniques, we've created a robust chatbot proficient in understanding and responding to natural language inputs. Overcoming challenges like data preprocessing and model training, our chatbot showcases impressive performance.

This project isn't just a standalone chatbot but a precursor to future AI-driven conversational systems. With potential enhancements like multimodal capabilities and personalization, our chatbot paves the way for more interactive human-computer interactions.

Reflecting on our journey, we acknowledge the collaborative effort and dedication involved. Moving forward, we're excited about deploying our chatbot across various domains, revolutionizing user experiences. Overall, our project underscores AI and ML's transformative potential, offering innovative solutions for human-centric interactions.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import pandas as pd
import numpy as np
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
import evaluate

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

columns = ['question', 'answer']
df = pd.read_csv('/content/drive/MyDrive/NLU chatbot/dialogs.txt', sep='\t', names=columns)
# Now df will have columns named 'question' and 'answer'
df.head()

                               question                                     answer
0                hi, how are you doing?              i'm fine. how about yourself?
1         i'm fine. how about yourself?        i'm pretty good. thanks for asking.
2   i'm pretty good. thanks for asking.          no problem. so how have you been?
3     no problem. so how have you been?           i've been great. what about you?
4      i've been great. what about you?   i've been good. i'm in school right now.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

print("\nData types of columns:")
print(df.dtypes)
print("\nShape of the dataset:")
print(df.shape)
print("\nMissing values:")
print(df.isnull().sum())

Data types of columns:
question    object
answer      object
dtype: object

Shape of the dataset:
(3725, 2)

Missing values:
question    0
answer      3
dtype: int64

Text Analysis:

Analyze the length distribution of questions and answers. Check for any unusual characters or patterns in the text. Explore the most common words or
phrases in questions and answers (word frequency analysis).
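
A quick word-frequency sketch using collections.Counter (imported later in this notebook), assuming the df loaded above:

from collections import Counter

word_counts = Counter()
for text in df['question'].astype(str):
    word_counts.update(text.lower().split())

print(word_counts.most_common(10))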

df['question_length'] = df['question'].apply(len)

# Convert all entries in the 'answer' column to strings before applying len()
df['answer_length'] = df['answer'].astype(str).apply(len)

Visualization:

Create visualizations to better understand the data distribution (e.g., histograms, word clouds). Plot the distribution of question and answer lengths.
Visualize word frequency using bar plots or word clouds.

plt.figure(figsize=(8, 6))
plt.hist(df['question_length'], bins=30, alpha=0.7, color='red', label='Question Length')
plt.hist(df['answer_length'], bins=30, alpha=0.7, color='blue', label='Answer Length')
plt.title('Distribution of Question and Answer Lengths')
plt.xlabel('Length')
plt.ylabel('Frequency')
plt.legend()
plt.show()

question_text = ' '.join(df['question'])
wordcloud = WordCloud(width=600, height=200, background_color='black').generate(question_text)
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud for Questions')
plt.axis('off')
plt.show()


# Replace NaN values with an empty string, then convert all entries to strings
df['answer'] = df['answer'].fillna('').astype(str)

# Join all the answers into a single string
answer_text = ' '.join(df['answer'])

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(answer_text)

# Plot the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud for Answers')
plt.axis('off')  # Hide axes
plt.show()

Topic Modeling:

Use techniques like Latent Dirichlet Allocation (LDA) to identify underlying topics in the questions and answers. Cluster similar questions and answers based
on topic distributions.

from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from textblob import TextBlob

vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
X = vectorizer.fit_transform(df['question'])
lda = LatentDirichletAllocation(n_components=7, random_state=50)
lda.fit(X)

# Display the top words for each topic
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx}:", ", ".join([vectorizer.get_feature_names_out()[i] for i in topic.argsort()[:-11:-1]]))

Topic 0: yes, good, need, ll, course, don, thank, hope, talk, just
Topic 1: like, lot, make, people, don, got, great, fun, doing, sounds
Topic 2: didn, day, sure, just, tell, come, maybe, nice, bad, eat
Topic 3: going, mean, does, tv, heard, party, long, happened, idea, told
Topic 4: think, right, time, ll, money, okay, today, school, look, love
Topic 5: want, know, don, really, ve, movie, weather, yeah, buy, best
Topic 6: did, let, say, new, wrong, ll, enjoy, smell, school, phone

Language Complexity:

Measure the complexity of language used in questions and answers (e.g., average word length, vocabulary richness). Explore readability scores or linguistic
features.

# Ensure there are no NaN values and convert to strings
df['answer'] = df['answer'].fillna('').astype(str)

# Function to calculate average word length, handling empty strings
def avg_word_length(text):
    words = text.split()
    if len(words) == 0:
        return 0
    return sum(len(word) for word in words) / len(words)

# Analyze language complexity
avg_question_word_length = df['question'].apply(lambda x: avg_word_length(x)).mean()
avg_answer_word_length = df['answer'].apply(lambda x: avg_word_length(x)).mean()

print("\nAverage word length in questions:", avg_question_word_length)
print("Average word length in answers:", avg_answer_word_length)

Average word length in questions: 4.166629246606172
Average word length in answers: 4.171673641983572
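
Readability scores themselves are not computed in this notebook; a hedged sketch using the textstat package (an extra dependency, pip install textstat, not used elsewhere here) could look like:

import textstat  # assumption: extra dependency, not part of the original run

sample = "i'm pretty good. thanks for asking."
print("Flesch reading ease:", textstat.flesch_reading_ease(sample))

# Vocabulary richness: unique words over total words in the question column
all_words = ' '.join(df['question'].astype(str)).lower().split()
print("Type-token ratio:", len(set(all_words)) / len(all_words))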

Data Preprocessing

Data Augmentation (Upsampling)

new_dialogue_data = [
    ["Hi", "Hello"],
    ["How are you?", "I'm good, thanks for asking. How about you?"],
    ["I'm doing well too.", "That's great to hear. What have you been up to lately?"],
    ["Not much, just working and spending time with family.", "That sounds nice. Have you watched any good movies recently?"],
    ["Yeah, I saw a really good one last weekend.", "It was a thriller, right? I heard good things about it."],
    ["Yes, it was.", "Do you want to watch it together sometime?"],
    ["Sure, that sounds like a plan.", "Awesome! Let's plan it for this weekend."],
    ["Sounds good to me.", "Alright then, it's a plan. What time works for you?"],
    ["How about Saturday evening?", "Perfect! Saturday evening it is. I'll book the tickets."],
    ["Great! Looking forward to it.", "Me too. It'll be fun."]
]
new_df = pd.DataFrame(new_dialogue_data, columns=columns)

# Concatenate the new DataFrame with the existing DataFrame
df = pd.concat([df, new_df], ignore_index=True)

# Print the updated DataFrame
df

                                  question                                       answer  question_length  answer_length
0                   hi, how are you doing?                i'm fine. how about yourself?             22.0           29.0
1            i'm fine. how about yourself?          i'm pretty good. thanks for asking.             29.0           35.0
2      i'm pretty good. thanks for asking.            no problem. so how have you been?             35.0           33.0
3        no problem. so how have you been?             i've been great. what about you?             33.0           32.0
4         i've been great. what about you?     i've been good. i'm in school right now.             32.0           40.0
...                                    ...                                          ...              ...            ...
3730                          Yes, it was.   Do you want to watch it together sometime?              NaN            NaN
3731        Sure, that sounds like a plan.     Awesome! Let's plan it for this weekend.              NaN            NaN
3732                    Sounds good to me.  Alright then, it's a plan. What time wor...              NaN            NaN

Preprocessing
Lowercase

Tokenization

Stop words removal

Lemmatization

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')

# Function for preprocessing text
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Tokenize the text
    tokens = word_tokenize(text)
    # Keep only alphanumeric tokens
    tokens = [word for word in tokens if word.isalnum()]
    # Re-join the tokens into a single string
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text

# Apply preprocessing to question and answer columns
preprocessed_df = df.copy()
preprocessed_df['question'] = preprocessed_df['question'].apply(preprocess_text)
preprocessed_df['answer'] = preprocessed_df['answer'].apply(preprocess_text)

# Print the preprocessed DataFrame
preprocessed_df

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

                                  question                                       answer  question_length  answer_length
0                   hi, how are you doing?                i'm fine. how about yourself?             22.0           29.0
1            i'm fine. how about yourself?          i'm pretty good. thanks for asking.             29.0           35.0
2      i'm pretty good. thanks for asking.            no problem. so how have you been?             35.0           33.0
3        no problem. so how have you been?             i've been great. what about you?             33.0           32.0
4         i've been great. what about you?     i've been good. i'm in school right now.             32.0           40.0
...                                    ...                                          ...              ...            ...
3730                          yes, it was.   do you want to watch it together sometime?              NaN            NaN
3731        sure, that sounds like a plan.     awesome! let's plan it for this weekend.              NaN            NaN
3732                    sounds good to me.  alright then, it's a plan. what time wor...              NaN            NaN

Classical ML ChatBot

from sklearn.model_selection import train_test_split

# Splitting the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    preprocessed_df['question'], preprocessed_df['answer'],
    test_size=0.2, random_state=42)

Model Pipeline and Training

Pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', RandomForestClassifier())
])

# Fit the pipeline on the training data (needed before calling predict)
Pipe.fit(X_train, y_train)

Pipe.predict(['where are you going'])[0]

'we went to a nice restaurant.'


Model Evaluation

predicted_text = Pipe.predict(X_test)

# Creating a DataFrame to compare the first 10 results
comparison_df = pd.DataFrame({'Real Question': X_test[:10],
                              'Real Generated Text': y_test[:10],
                              'Predicted Text': predicted_text[:10]})
comparison_df.head(10)

(Output: a table comparing, for the first ten test examples, the real question, the reference answer, and the chatbot's predicted answer.)

import seaborn as sns

# Visualize Feature Importance
if isinstance(Pipe.named_steps['classifier'], RandomForestClassifier):
    feature_importances = Pipe.named_steps['classifier'].feature_importances_
    feature_names = Pipe.named_steps['tfidf'].get_feature_names_out()

    # Sort feature importances in descending order and keep the top 10
    sorted_indices = feature_importances.argsort()[::-1]
    top_feature_importances = feature_importances[sorted_indices][:10]
    top_feature_names = np.array(feature_names)[sorted_indices][:10]

    plt.figure(figsize=(6, 4))
    sns.barplot(x=top_feature_importances, y=top_feature_names)
    plt.xlabel('Token Importance')
    plt.ylabel('Token Name')
    plt.title('Top 10 Token Importances')
    plt.show()

Chat with your ChatBot :D


def get_response(question):
    response = Pipe.predict([question])[0]
    return response

# Function to interact with the user
def chat():
    while True:
        question = input("You: ")
        if question.lower() == 'quit':
            print("Chatbot: Goodbye!")
            break
        response = get_response(question)
        print("Chatbot:", response)

# Start the chat
chat()

You: hello
Chatbot: okay. i'll return your pen when i'm done.
You: i am really excited for you
Chatbot: are you really?
You: yes
Chatbot: see those stains?
You: what are you doing?
Chatbot: i'm going to change the light bulb. it burnt out.
You: is that a lot of work?
Chatbot: babies cry all the time.
You: yes
Chatbot: see those stains?
You: no
Chatbot: how do you know?
You: no
Chatbot: how do you know?
You: come
Chatbot: you can see the stars so much more clearly after it rains.
You: Goodbye!
Chatbot: when's that?
You: dead
Chatbot: when's that?
You: die
Chatbot: yes, he did. his cat died, too.

KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-22-22eefa91ce41> in <cell line: 16>()
     14
     15 # Start the chat
---> 16 chat()

2 frames
/usr/local/lib/python3.10/dist-packages/ipykernel/kernelbase.py in _input_request(self, prompt, ident, parent, password)
    893         except KeyboardInterrupt:
    894             # re-raise KeyboardInterrupt, to truncate traceback
--> 895             raise KeyboardInterrupt("Interrupted by user") from None
    896         except Exception as e:
    897             self.log.warning("Invalid Message:", exc_info=True)

KeyboardInterrupt: Interrupted by user

Encoder-Decoder Model with Attention and LSTMs: Chatbot from Scratch

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Attention, Concatenate

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
import unicodedata

nltk.download('punkt')

def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

# Function for preprocessing text
def preprocess_text(text):
    # Convert text to lowercase and strip accents
    text = unicode_to_ascii(text.lower().strip())
    # Replace non-word characters with spaces
    text = re.sub(r'(\W)', ' ', text)
    # Remove tokens that contain digits
    text = re.sub(r'\S*\d\S*\s*', '', text)
    # Wrap the text in start/end-of-sequence markers
    text = "<sos> " + text + " <eos>"
    return text

# Apply preprocessing to question and answer columns
preprocessed_df = df.copy()
preprocessed_df['question'] = preprocessed_df['question'].apply(preprocess_text)
preprocessed_df['answer'] = preprocessed_df['answer'].apply(preprocess_text)

# Print the preprocessed DataFrame
preprocessed_df

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

                                            question                                             answer  question_length  answer_length
0                   <sos> hi how are you doing <eos>            <sos> i m fine how about yourself <eos>             22.0           29.0
1            <sos> i m fine how about yourself <eos>      <sos> i m pretty good thanks for asking <eos>             29.0           35.0
2      <sos> i m pretty good thanks for asking <eos>        <sos> no problem so how have you been <eos>             35.0           33.0
3        <sos> no problem so how have you been <eos>         <sos> i ve been great what about you <eos>             33.0           32.0
4         <sos> i ve been great what about you <eos>   <sos> i ve been good i m in school right now ...             32.0           40.0
...                                              ...                                                ...              ...            ...
3730                          <sos> yes it was <eos>  <sos> do you want to watch it together sometim...              NaN            NaN
3731        <sos> sure that sounds like a plan <eos>   <sos> awesome let s plan it for this weekend ...              NaN            NaN
3732                 <sos> sounds good to me <eos>      <sos> alright then it s a plan what time wor...              NaN            NaN
3733          <sos> how about saturday evening <eos>   <sos> perfect saturday evening it is i ll bo...               NaN            NaN
3734         <sos> great looking forward to it <eos>                    <sos> me too it ll be fun <eos>              NaN            NaN

3735 rows × 4 columns
questions = preprocessed_df['question'].values.tolist()
answers = preprocessed_df['answer'].values.tolist()

# Tokenizing the data
tokenizer = Tokenizer(filters='')
tokenizer.fit_on_texts(np.concatenate((questions, answers), axis=0))
vocab_size = len(tokenizer.word_index) + 1

# Convert text to sequences
question_seqs = tokenizer.texts_to_sequences(questions)
answer_seqs = tokenizer.texts_to_sequences(answers)

# Pad sequences to equal length
max_len_question = max(len(seq) for seq in question_seqs)
max_len_answer = max(len(seq) for seq in answer_seqs)
max_len = max(max_len_question, max_len_answer)
print(max_len)

# Pad sequences separately for questions and answers
question_seqs = pad_sequences(question_seqs, maxlen=max_len, padding='post')
answer_seqs = pad_sequences(answer_seqs, maxlen=max_len, padding='post')

24

tokenizer.texts_to_sequences("<sos

https://colab.research.google.com/drive/1B9vUz86zGJkw5auKFSUskzS0U5-
>") [[], [9], [490], [9], 12/15
BmGJE#scrollTo=wrcO4Fj0vveQ&printMode=true
[]]
5/30/24, 1:16 NLU chat bot phase 5.ipynb -
PM tokenizer.word_index["<sos>"] Colab

Model Architecture

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Attention, Concatenate, Dropout

# Define the model architecture
latent_dim = 256  # Dimensionality of the encoding space

# Encoder
encoder_inputs = Input(shape=(max_len,))
encoder_embedding = Embedding(vocab_size, latent_dim)
encoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.1, recurrent_dropout=0.1)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding(encoder_inputs))
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = Input(shape=(max_len - 1,))
decoder_embedding = Embedding(vocab_size, latent_dim)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.1, recurrent_dropout=0.1)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding(decoder_inputs), initial_state=encoder_states)

# Attention mechanism
attention_layer = Attention()
attention_output = attention_layer([decoder_outputs, encoder_outputs])

# Concatenate attention output and decoder LSTM output
decoder_concat_input = Concatenate(axis=-1)([decoder_outputs, attention_output])

# Add dropout layer for regularization
decoder_concat_input = Dropout(0.1)(decoder_concat_input)

# Output layer
decoder_dense = Dense(vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_concat_input)

Model Training

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Print model summary
model.summary()

# Train the model
model.fit([question_seqs, answer_seqs[:, :-1]], answer_seqs[:, 1:],
          batch_size=64, epochs=32, validation_split=0.2)

Model: "model"
 Layer (type)                Output Shape                 Param #   Connected to
==================================================================================================
 input_1 (InputLayer)        [(None, 24)]                 0         []
 input_2 (InputLayer)        [(None, 23)]                 0         []
 embedding (Embedding)       (None, 24, 256)              618496    ['input_1[0][0]']
 embedding_1 (Embedding)     (None, 23, 256)              618496    ['input_2[0][0]']
 lstm (LSTM)                 [(None, 24, 256),            525312    ['embedding[0][0]']
                              (None, 256),
                              (None, 256)]
 lstm_1 (LSTM)               [(None, 23, 256),            525312    ['embedding_1[0][0]',
                              (None, 256),                           'lstm[0][1]',
                              (None, 256)]                           'lstm[0][2]']
 attention (Attention)       (None, 23, 256)              0         ['lstm_1[0][0]', 'lstm[0][0]']
 concatenate (Concatenate)   (None, 23, 512)              0         ['lstm_1[0][0]', 'attention[0][0]']
 dropout (Dropout)           (None, 23, 512)              0         ['concatenate[0][0]']
 dense (Dense)               (None, 23, 2416)             1239408   ['dropout[0][0]']
==================================================================================================
Total params: 3527024 (13.45 MB)
Trainable params: 3527024 (13.45 MB)
Non-trainable params: 0 (0.00 Byte)

Epoch 1/32
47/47 [==============================] - 63s 1s/step - loss: 3.0299 - val_loss: 2.2781
Epoch 2/32
47/47 [==============================] - 49s 1s/step - loss: 1.9036 - val_loss: 2.0941
Epoch 3/32
47/47 [==============================] - 49s 1s/step - loss: 1.7891 - val_loss: 2.0648
Epoch 4/32
47/47 [==============================] - 55s 1s/step - loss: 1.7495 - val_loss: 2.0476
Epoch 5/32
47/47 [==============================] - 47s 1s/step - loss: 1.7074 - val_loss: 2.0242
Epoch 6/32
47/47 [==============================] - 48s 1s/step - loss: 1.6609 - val_loss: 2.0007
Epoch 7/32
47/47 [==============================] - 49s 1s/step - loss: 1.6212 - val_loss: 1.9783
Epoch 8/32
47/47 [==============================] - 47s 1s/step - loss: 1.5819 - val_loss: 1.9536
Epoch 9/32
47/47 [==============================] - 48s 1s/step - loss: 1.5394 - val_loss: 1.9287
Epoch 10/32
47/47 [==============================] - 50s 1s/step - loss: 1.4969 - val_loss: 1.9095
Epoch 11/32
47/47 [==============================] - 47s 1s/step - loss: 1.4552 - val_loss: 1.8925
Epoch 12/32

def generate_response(input_text):
    # Tokenize the input text
    input_sequence = tokenizer.texts_to_sequences([input_text])
    # Pad the input sequence
    input_sequence = pad_sequences(input_sequence, maxlen=max_len, padding='post')

    # Initialize the decoder input sequence with the start token
    decoder_input_sequence = np.zeros((1, max_len - 1))
    decoder_input_sequence[0, 0] = tokenizer.word_index['<sos>']

    # Generate the response one token at a time using the trained model
    for i in range(max_len - 2):
        predictions = model.predict([input_sequence, decoder_input_sequence])
        predicted_id = np.argmax(predictions[0, i, :])
        if predicted_id == tokenizer.word_index['<eos>']:
            break
        decoder_input_sequence[0, i + 1] = predicted_id

    # Convert the output sequence back to text
    output_text = ''
    for token_index in decoder_input_sequence[0]:
        token_index = int(token_index)
        if token_index == tokenizer.word_index['<eos>'] or token_index == 0:
            break
        output_text += tokenizer.index_word[token_index] + ' '

    return output_text.strip()
# Test the function with input "how do you do"
input_text = "how do you do"
response = generate_response(input_text)
print("Response:", response[5:])  # [5:] strips the leading "<sos> " marker
1/1 [==============================] - 2s 2s/step
1/1 [==============================] - 0s 175ms/step
1/1 [==============================] - 0s 175ms/step
1/1 [==============================] - 0s 172ms/step
1/1 [==============================] - 0s 171ms/step
Response: i m not sure

model_name = 'google/flan-t5-small'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will b
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secre
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
config.json: 100% 1.40k/1.40k [00:00<00:00, 21.9kB/s]
model.safetensors: 100% 308M/308M [00:03<00:00, 79.4MB/s]
generation_config.json: 100% 147/147 [00:00<00:00, 3.48kB/s]
tokenizer_config.json: 100% 2.54k/2.54k [00:00<00:00, 53.9kB/s]
spiece.model: 100% 792k/792k [00:00<00:00, 19.5MB/s]
tokenizer.json: 100% 2.42M/2.42M [00:00<00:00, 5.05MB/s]
special_tokens_map.json: 100% 2.20k/2.20k [00:00<00:00, 95.9kB/s]
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return (f"trainable model parameters: {trainable_model_params}\n"
            f"all model parameters: {all_model_params}\n"
            f"percentage of trainable model parameters: "
            f"{100 * trainable_model_params / all_model_params:.2f}%")

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 76961152
all model parameters: 76961152
percentage of trainable model parameters: 100.00%

Test the Model with Zero Shot Inferencing

Test the model with zero-shot inferencing. You can see that the model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text, which indicates the model can be fine-tuned to the task at hand.

index = 0

question = df['question'][index]
answer = df['answer'][index]
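
The notebook export is cut off before the zero-shot generation cell itself; a minimal sketch of that step with the original_model and tokenizer loaded above (the prompt wording and generation settings here are assumptions, not values from the original run):

# Build a prompt from the dialogue question and generate a zero-shot reply
prompt = f"Respond to the following message.\n\n{question}\n\nResponse:"

inputs = tokenizer(prompt, return_tensors='pt')
output_ids = original_model.generate(
    inputs['input_ids'],
    generation_config=GenerationConfig(max_new_tokens=50),
)
output = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(f"INPUT PROMPT:\n{prompt}")
print(f"BASELINE ANSWER:\n{answer}")
print(f"MODEL GENERATION - ZERO SHOT:\n{output}")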

