Britto

The document is a project report on the development of an NLU chatbot, submitted by A. Britto Raj for a Bachelor of Engineering degree in Computer Science and Engineering. It outlines the project phases including data inspection, cleaning, transformation, feature engineering, model selection, and evaluation, while also discussing the evolution of chatbot technologies. The report concludes with future enhancement suggestions and emphasizes the potential of AI in improving conversational systems.

Uploaded by

BEAST BRITTO


NLU CHATBOT

A PROJECT REPORT

Submitted by

A.BRITTO RAJ

In partial fulfillment for the award

of the degree of

BACHELOR OF
ENGINEERING

COMPUTER SCIENCE AND

ENGINEERING

CHRISTIAN COLLEGE OF ENGINEERING AND

TECHNOLOGY,

ODDANCHATRAM-624619

ANNA UNIVERSITY::CHENNAI–600025

MAY 2024

CHRISTIAN COLLEGE OF ENGINEERING AND TECHNOLOGY

ODDANCHATRAM-624619

NM1022–EXPERIENCE BASED PROJECT LEARNING

RECORD NOTE BOOK

Certified that it is a bonafide record of practical lab

work done by during the year

STAFF – INCHARGE HEAD OF THE DEPARTMENT

Submitted for the practical exam held on .

Internal Examiner External Examiner

5/30/24, 1:16 PM NLU chat bot phase 5.ipynb - Colab
Initial Inspection: Load the datasets into a pandas DataFrame and perform an initial inspection to understand their structure and content (e.g., number of
records, columns, and data types).

Visualization: Create visualizations (e.g., histograms, word clouds, and scatter plots) to explore the distribution of the data and understand patterns and
relationships within the dataset.

2. Data Cleaning:
Handling Missing Values: Identify and handle missing values through methods such as imputation, removal, or filling with placeholder values, depending on
the context.

Removing Duplicates: Check for and remove duplicate records to ensure data quality.

Outlier Detection: Detect and handle outliers that may skew the model training, using statistical methods or visualization techniques.
Standardizing Formats: Standardize data formats (e.g., date formats, text casing) to ensure consistency across the dataset.
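
As a rough illustration of the cleaning steps above, the following sketch works on plain Python tuples rather than on the project's DataFrame. The function name clean_pairs, the placeholder value, and the sample rows are assumptions made for this example and do not come from the notebook.

```python
def clean_pairs(pairs, placeholder=""):
    """Fill missing answers, standardize casing/whitespace, and drop duplicates."""
    cleaned = []
    seen = set()
    for question, answer in pairs:
        # Skip rows that have no usable question.
        if question is None or not question.strip():
            continue
        # Fill a missing answer with a placeholder instead of dropping the row.
        if answer is None:
            answer = placeholder
        # Standardize the format: lowercase and collapse repeated whitespace.
        q = " ".join(question.lower().split())
        a = " ".join(answer.lower().split())
        # Remove exact duplicates while keeping the first occurrence.
        if (q, a) in seen:
            continue
        seen.add((q, a))
        cleaned.append((q, a))
    return cleaned


# Hypothetical sample rows, not taken from the dataset.
raw = [
    ("Hi,  how are you?", "I'm fine."),
    ("hi, how are you?", "i'm fine."),  # duplicate once standardized
    ("What's up?", None),               # missing answer
    (None, "orphan answer"),            # missing question
]
cleaned = clean_pairs(raw)
print(cleaned)
```

Whether a missing value should be filled or the row dropped depends on the column: here a missing answer is filled, while a missing question makes the row unusable.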

3. Data Transformation:
Tokenization: Split text data into tokens (words or subwords) using tokenizers from NLP libraries like SpaCy or Hugging Face
Transformers.

Normalization: Normalize text data by converting it to lowercase, removing punctuation, special characters, and stop words.
Lemmatization/Stemming: Apply lemmatization or stemming to reduce words to their base or root form, aiding in reducing dimensionality and improving
model performance.
Vectorization: Convert text data into numerical format using techniques like TF-IDF, word embeddings (Word2Vec, GloVe), or transformer-based
embeddings (BERT, GPT).
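
To make the tokenization and vectorization steps concrete, here is a minimal from-scratch sketch of TF-IDF weighting. It uses the plain formula tf * log(N / df); library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization, so their numbers will differ. The helper names tokenize and tf_idf and the three sample sentences are assumptions for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["How are you doing?", "How is the weather today?", "I am doing great!"]
vectors = tf_idf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, "->", {term: round(weight, 3) for term, weight in vec.items()})
```

Terms that occur in several documents (such as "how") receive a lower weight than terms that occur in only one (such as "weather"), which is the point of the IDF factor.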

4. Feature Engineering:
Contextual Features: Extract contextual features such as dialogue context, speaker information, and conversation history to enhance model
understanding.

Sentiment Analysis: Incorporate sentiment analysis to capture the emotional tone of the conversations.

Custom Features: Create custom features relevant to the chatbot’s domain, such as named entity recognition (NER) tags or topic modeling
outputs.

Feature Selection: Select and prioritize features that are most relevant to the task, using techniques like correlation analysis, mutual information, or feature
importance from tree-based models.
5. Model Selection and Training:
Model Architecture: Choose appropriate model architectures for the chatbot, such as transformer-based models (e.g., BERT, GPT) for their superior
performance in NLP tasks.

Hyperparameter Tuning: Perform hyperparameter tuning using techniques like grid search or random search to optimize model performance.

Training Process: Train the model on the preprocessed data, using appropriate loss functions, optimizers, and learning rate schedules.

Data Augmentation: Apply data augmentation techniques if necessary to artificially expand the training dataset and improve model robustness.
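
The grid search mentioned under hyperparameter tuning can be sketched in a few lines. The scoring function below is a hypothetical stand-in for "train a model with these settings and return its validation score"; the parameter names and values are likewise assumptions for illustration, not settings used in this project.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    """Hypothetical score that peaks at learning_rate=0.01 and batch_size=64."""
    return (-(params["learning_rate"] - 0.01) ** 2
            - 0.001 * abs(params["batch_size"] - 64))


grid = {"learning_rate": [0.1, 0.01, 0.001], "batch_size": [32, 64, 128]}
best_params, best_score = grid_search(grid, toy_validation_score)
print(best_params, best_score)
```

Grid search costs one training run per combination (nine here), which is why random search is often preferred when the grid is large.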

6. Model Evaluation:
Validation Split: Split the dataset into training, validation, and test sets to evaluate model performance.

Evaluation Metrics: Use relevant metrics such as accuracy, precision, recall, F1-score, and confusion matrix to assess model performance.

Cross-Validation: Implement cross-validation to ensure the model generalizes well to unseen data.

Error Analysis: Conduct error analysis to identify common failure modes and areas for improvement.

Iterative Improvement: Based on evaluation results, iterate on the model by refining preprocessing steps, adjusting features, or exploring alternative model
architectures.
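
The evaluation metrics listed above can be computed directly from the confusion-matrix counts. The sketch below does this for a binary label; the function name binary_metrics and the label vectors are made up for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion counts, accuracy, precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision, "recall": recall, "f1": f1,
    }


# Hypothetical labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Note that these classification metrics fit intent or label prediction; for free-text response generation they only apply if each reply is treated as a class, which is a strong simplification.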

NLU chatbot development has evolved through:

Rule-Based Systems: Early systems like ELIZA and PARRY set the stage for automated dialogue.

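Statistical Models: Techniques such as hidden Markov models (HMMs) and conditional random fields (CRFs) improved sequence-labeling tasks by learning from data instead of hand-written rules.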

Sequence-to-Sequence Models: Models like GNMT brought coherence to responses using RNNs.

Transformer Models: Transformers like BERT and GPT revolutionized NLP with attention mechanisms.

Pre-trained Language Models: BERT and GPT excel in context comprehension and response generation.

Conversational AI Platforms: Tools like Dialogflow streamline chatbot development.

Multimodal Chatbots: Emerging chatbots integrate text, voice, and visual inputs for richer interactions.

Evaluation and Benchmarks: GLUE and SuperGLUE set standards for assessing model performance.

FLOW CHART

FUTURE ENHANCEMENT
Consider adding multimodal capabilities to your chatbot for a richer user experience. This includes:

Image Recognition: Let the chatbot interpret images for tasks like product information or recommendations.

Voice Input: Enable users to interact with voice commands, enhancing accessibility and natural conversation.

Emotion Detection: Incorporate sentiment analysis to tailor responses based on user emotions.

Interactive Visual Output: Provide visual aids like graphs or charts for better information conveyance.

Personalization: Offer personalized responses based on user history and preferences.

These enhancements broaden user engagement and cater to diverse needs, making your chatbot more
versatile and user-friendly.

CONCLUSION:

In conclusion, our NLU chatbot project signifies a significant advancement in conversational AI. Leveraging Kaggle's dataset and advanced ML techniques,
we've created a robust chatbot proficient in understanding and responding to natural language inputs. Overcoming challenges like data preprocessing and
model training, our chatbot showcases impressive performance.

This project isn't just a standalone chatbot but a precursor to future AI-driven conversational systems. With potential enhancements like multimodal
capabilities and personalization, our chatbot paves the way for more interactive human-computer interactions.

Reflecting on our journey, we acknowledge the collaborative effort and dedication involved. Moving forward, we're excited about deploying our chatbot
across various domains, revolutionizing user experiences. Overall, our project underscores AI and ML's transformative potential, offering innovative
solutions for human-centric interactions.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import pandas as pd
import numpy as np
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import Pipeline
import evaluate

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

columns = ['question', 'answer']
df = pd.read_csv('/content/drive/MyDrive/NLU chatbot/dialogs.txt', sep='\t', names=columns)
# Now df will have columns named 'question' and 'answer'
df.head()

  | question | answer
0 | hi, how are you doing? | i'm fine. how about yourself?
1 | i'm fine. how about yourself? | i'm pretty good. thanks for asking.
2 | i'm pretty good. thanks for asking. | no problem. so how have you been?
3 | no problem. so how have you been? | i've been great. what about you?
4 | i've been great. what about you? | i've been good. i'm in school right now.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

print("\nData types of columns:")
print(df.dtypes)
print("\nShape of the dataset:")
print(df.shape)
print("\nMissing values:")
print(df.isnull().sum())

Data types of columns:
question    object
answer      object
dtype: object

Shape of the dataset:
(3725, 2)

Missing values:
question    0
answer      3
dtype: int64

Text Analysis:

Analyze the length distribution of questions and answers. Check for any unusual characters or patterns in the text. Explore the most common words or
phrases in questions and answers (word frequency analysis).
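
A word frequency count and a check for unusual characters can be sketched with the standard library alone. The helper names and the set of "allowed" punctuation are assumptions for this example; the three sample questions are the first rows shown in the dataset preview, and the extra string is invented to trigger the check.

```python
import re
from collections import Counter


def word_frequencies(texts, top_n=5):
    """Count lowercase word tokens across all texts and return the most common."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(top_n)


def unusual_characters(texts):
    """Collect characters that are not ASCII letters, digits, or common punctuation."""
    allowed = set(" .,?!'-")
    return sorted({
        ch for text in texts for ch in text
        if not (ch.isascii() and (ch.isalnum() or ch in allowed))
    })


sample_questions = [
    "hi, how are you doing?",
    "i'm fine. how about yourself?",
    "no problem. so how have you been?",
]
top_words = word_frequencies(sample_questions, top_n=3)
odd_chars = unusual_characters(sample_questions + ["caf\u00e9 @ 5pm"])
print(top_words)
print(odd_chars)
```

On the full dataset, the same counts could be taken over df['question'] and df['answer'] separately to compare the two columns.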

df['question_length'] = df['question'].apply(len)

# Convert all entries in the 'answer' column to strings before applying len()
df['answer_length'] = df['answer'].astype(str).apply(len)

Visualization:

Create visualizations to better understand the data distribution (e.g., histograms, word clouds). Plot the distribution of question and answer lengths.
Visualize word frequency using bar plots or word clouds.

plt.figure(figsize=(8, 6))
plt.hist(df['question_length'], bins=30, alpha=0.7, color='red', label='Question Length')
plt.hist(df['answer_length'], bins=30, alpha=0.7, color='blue', label='Answer Length')
plt.title('Distribution of Question and Answer Lengths')
plt.xlabel('Length')
plt.ylabel('Frequency')
plt.legend()
plt.show()

question_text = ' '.join(df['question'])
wordcloud = WordCloud(width=600, height=200, background_color='black').generate(question_text)
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('generating Word Cloud for Questions')
plt.axis('off')
plt.show()


# Replace NaN values with an empty string, then convert all entries to strings
df['answer'] = df['answer'].fillna('').astype(str)

# Join all the answers into a single string
answer_text = ' '.join(df['answer'])

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(answer_text)

# Plot the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('generating Word Cloud for Answers')
plt.axis('off')  # Hide axes
plt.show()

Topic Modeling:

Use techniques like Latent Dirichlet Allocation (LDA) to identify underlying topics in the questions and answers. Cluster similar questions and answers based
on topic distributions.

from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from textblob import TextBlob

vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
X = vectorizer.fit_transform(df['question'])
lda = LatentDirichletAllocation(n_components=7, random_state=50)
lda.fit(X)

# Display the top words for each topic
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx}:", ", ".join([vectorizer.get_feature_names_out()[i] for i in topic.argsort()[:-11:-1]]))

Topic 0: yes, good, need, ll, course, don, thank, hope, talk, just
Topic 1: like, lot, make, people, don, got, great, fun, doing, sounds
Topic 2: didn, day, sure, just, tell, come, maybe, nice, bad, eat
Topic 3: going, mean, does, tv, heard, party, long, happened, idea, told
Topic 4: think, right, time, ll, money, okay, today, school, look, love
Topic 5: want, know, don, really, ve, movie, weather, yeah, buy, best
Topic 6: did, let, say, new, wrong, ll, enjoy, smell, school, phone

Language Complexity:

Measure the complexity of language used in questions and answers (e.g., average word length, vocabulary richness). Explore readability scores or linguistic
features.

# Ensure there are no NaN values and convert to strings
df['answer'] = df['answer'].fillna('').astype(str)

# Function to calculate average word length, handling empty strings
def avg_word_length(text):
    words = text.split()
    if len(words) == 0:
        return 0
    return sum(len(word) for word in words) / len(words)

# Analyze language complexity
avg_question_word_length = df['question'].apply(lambda x: avg_word_length(x)).mean()
avg_answer_word_length = df['answer'].apply(lambda x: avg_word_length(x)).mean()

print("\nAverage word length in questions:", avg_question_word_length)
print("Average word length in answers:", avg_answer_word_length)

Average word length in questions: 4.166629246606172
Average word length in answers: 4.171673641983572
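
The section also mentions vocabulary richness, which the cells above do not compute. One common measure is the type-token ratio (distinct words divided by total words). The sketch below is an illustration only; the function name type_token_ratio is an assumption, and the two sample answers are the first rows of the dataset preview.

```python
def type_token_ratio(texts):
    """Vocabulary richness: number of distinct words divided by total words."""
    tokens = [word for text in texts for word in text.lower().split()]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sample_answers = [
    "i'm fine. how about yourself?",
    "i'm pretty good. thanks for asking.",
]
ttr = type_token_ratio(sample_answers)
print(round(ttr, 3))
```

The ratio falls as a corpus grows, so it is only meaningful when comparing text samples of similar size, for example an equal number of questions and answers.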

Data Preprocessing

Data Augmentation (UPSampling)

new_dialogue_data = [
    ["Hi", "Hello"],
    ["How are you?", "I'm good, thanks for asking. How about you?"],
    ["I'm doing well too.", "That's great to hear. What have you been up to lately?"],
    ["Not much, just working and spending time with family.", "That sounds nice. Have you watched any good movies recently?"],
    ["Yeah, I saw a really good one last weekend.", "It was a thriller, right? I heard good things about it."],
    ["Yes, it was.", "Do you want to watch it together sometime?"],
    ["Sure, that sounds like a plan.", "Awesome! Let's plan it for this weekend."],
    ["Sounds good to me.", "Alright then, it's a plan. What time works for you?"],
    ["How about Saturday evening?", "Perfect! Saturday evening it is. I'll book the tickets."],
    ["Great! Looking forward to it.", "Me too. It'll be fun."]
]
new_df = pd.DataFrame(new_dialogue_data, columns=columns)

# Concatenate the new DataFrame with the existing DataFrame
df = pd.concat([df, new_df], ignore_index=True)

# Print the updated DataFrame
df
     | question | answer | question_length | answer_length
0    | hi, how are you doing? | i'm fine. how about yourself? | 22.0 | 29.0
1    | i'm fine. how about yourself? | i'm pretty good. thanks for asking. | 29.0 | 35.0
2    | i'm pretty good. thanks for asking. | no problem. so how have you been? | 35.0 | 33.0
3    | no problem. so how have you been? | i've been great. what about you? | 33.0 | 32.0
4    | i've been great. what about you? | i've been good. i'm in school right now. | 32.0 | 40.0
...  | ... | ... | ... | ...
3730 | Yes, it was. | Do you want to watch it together sometime? | NaN | NaN
3731 | Sure, that sounds like a plan. | Awesome! Let's plan it for this weekend. | NaN | NaN
3732 | Sounds good to me. | Alright then, it's a plan. What... | NaN | NaN

Preprocessing

Lowercase

Tokenization

Stop words removal

Lemmatization
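
The first three steps in this list can be sketched as follows. This is an illustration, not the notebook's own function: the name preprocess and the tiny stop-word set are assumptions, and lemmatization is left out because it needs a lexical resource such as NLTK's WordNet data.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch);
# a full list is available from nltk.corpus.stopwords.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "so", "for"}


def preprocess(text, remove_stop_words=True):
    """Lowercase, tokenize, and optionally remove stop words."""
    text = text.lower()                       # 1. lowercase
    tokens = re.findall(r"[a-z0-9]+", text)   # 2. tokenize, dropping punctuation
    if remove_stop_words:                     # 3. stop-word removal
        tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4. lemmatization would go here and is omitted in this sketch
    return " ".join(tokens)


cleaned_text = preprocess("No problem. So how have you been?")
print(cleaned_text)
```

For a generative chatbot, aggressive stop-word removal can hurt, because the model has to learn to produce those words in its replies; this is one reason to keep the step optional.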

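import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')

# Function for preprocessing text
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Tokenize and keep only alphanumeric tokens
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.isalnum()]
    preprocessed_text = ' '.join(tokens)
    # The lowercased text is returned, so the token filtering above does not
    # affect the output (punctuation is still present in the table below)
    return text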

# Apply preprocessing to question and answer columns
preprocessed_df = df.copy()
preprocessed_df['question'] = preprocessed_df['question'].apply(preprocess_text)
preprocessed_df['answer'] = preprocessed_df['answer'].apply(preprocess_text)

# Print the preprocessed DataFrame
preprocessed_df

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

     | question | answer | question_length | answer_length
0    | hi, how are you doing? | i'm fine. how about yourself? | 22.0 | 29.0
1    | i'm fine. how about yourself? | i'm pretty good. thanks for asking. | 29.0 | 35.0
2    | i'm pretty good. thanks for asking. | no problem. so how have you been? | 35.0 | 33.0
3    | no problem. so how have you been? | i've been great. what about you? | 33.0 | 32.0
4    | i've been great. what about you? | i've been good. i'm in school right now. | 32.0 | 40.0
...  | ... | ... | ... | ...
3730 | yes, it was. | do you want to watch it together sometime? | NaN | NaN
3731 | sure, that sounds like a plan. | awesome! let's plan it for this weekend. | NaN | NaN
3732 | sounds good to me. | alright then, it's a plan. what... | NaN | NaN

Classical ML ChatBot

from sklearn.model_selection import train_test_split

# Splitting the dataset into train and test sets

X_train, X_test, y_train, y_test = train_test_split(preprocessed_df['question'], preprocessed_df['answer'], test_size=0.2,
random_state=42,

Model Pipeline and Training

Pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', RandomForestClassifier())
])

# Fit the pipeline on the training split before predicting
Pipe.fit(X_train, y_train)

Pipe.predict(['where are you going'])[0]

'we went to a nice restaurant.'

Model Evaluation

predicted_text = Pipe.predict(X_test)

# Creating a DataFrame to compare the first 10 results
comparison_df = pd.DataFrame({'Real Question ': X_test[:10], 'Real Generated Text': y_test[:10], 'Predicted Text': predicted_text[:10]})
comparison_df.head(10)

     | Real Question | Real Generated Text | Predicted Text
3253 | how are you doing that? | i started shopping at the dollar store. | i don't know. i think i'm average.
3190 | the pants are fine, but the pocket has a huge ... | you shouldn't carry your keys and pens in your... | i'm pretty good. thanks for asking.
2194 | uh-oh. that means that she's fat and ugly. | she's cute. | men singers don't have to look good.
3303 | no pets are allowed. | that's great. we won't have neighbors on both ... | if they don't like it, they can move.
3214 | i really wanted you to come, but i understand. | yeah, maybe next time. | this friday? sorry, i already have plans.
642  | that's a good deal. | and a one-pound tub of soft butter was the sam... | yes, even though some of the potatoes had eyes.
184  | she's one of the prettiest girls at the school. | what does she look like? | maybe we should learn some good jokes.
3515 | no, that's not the problem. | maybe it will go away in a little while. | did you need something?
3185 | why not? | i didn't want to pay for the holes. | you might hit someone in the head.
2206 | eight o'clock. | that sounds great. | i have the dvd.

import seaborn as sns

# Visualize Feature Importance
if isinstance(Pipe.named_steps['classifier'], RandomForestClassifier):
    feature_importances = Pipe.named_steps['classifier'].feature_importances_
    feature_names = Pipe.named_steps['tfidf'].get_feature_names_out()

    # Sort feature importances
    sorted_indices = feature_importances.argsort()[::-1]
    top_feature_importances = feature_importances[sorted_indices][:10]
    top_feature_names = np.array(feature_names)[sorted_indices][:10]

    plt.figure(figsize=(6, 4))
    sns.barplot(x=top_feature_importances, y=top_feature_names)
    plt.xlabel('Token Importance')
    plt.ylabel('Token Name')
    plt.title('Top 10 Token Importance')
    plt.show()

Chat with your ChatBot :D

def get_response(question):
    response = Pipe.predict([question])[0]
    return response

# Function to interact with the user
def chat():
    while True:
        question = input("You: ")
        if question.lower() == 'quit':
            print("Chatbot: Goodbye!")
            break
        response = get_response(question)
        print("Chatbot:", response)

# Start the chat
chat()

You: hello
Chatbot: okay. i'll return your pen when i'm done.
You: i am really excited for you
Chatbot: are you really?
You: yes
Chatbot: see those stains?
You: what are you doing?
Chatbot: i'm going to change the light bulb. it burnt out.
You: is that a lot of work?
Chatbot: babies cry all the time.
You: yes
Chatbot: see those stains?
You: no
Chatbot: how do you know?
You: no
Chatbot: how do you know?
You: come
Chatbot: you can see the stars so much more clearly after it rains.
You: Goodbye!
Chatbot: when's that?
You: dead
Chatbot: when's that?
You: die
Chatbot: yes, he did. his cat died, too.
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-22-22eefa91ce41> in <cell line: 16>()
     14
     15 # Start the chat
---> 16 chat()

2 frames
/usr/local/lib/python3.10/dist-packages/ipykernel/kernelbase.py in _input_request(self, prompt, ident, parent, password)
    893         except KeyboardInterrupt:
    894             # re-raise KeyboardInterrupt, to truncate traceback
--> 895             raise KeyboardInterrupt("Interrupted by user") from None
    896         except Exception as e:
    897             self.log.warning("Invalid Message:", exc_info=True)

KeyboardInterrupt: Interrupted by user

Encoder - Decoder Model with Attention and LSTMs Chatbot from scratch

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Attention, Concatenate

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
import unicodedata

nltk.download('punkt')

def unicode_to_ascii(s):
    # Decompose accented characters and drop the combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

# Function for preprocessing text
def preprocess_text(text):
    # Convert text to lowercase
    text = unicode_to_ascii(text.lower().strip())
    # Replace every non-word character with a space
    text = re.sub(r"(\W)", " ", text)
    # Drop tokens that contain digits
    text = re.sub(r'\S*\d\S*\s*', '', text)
    # Wrap with start and end markers; text that does not end in punctuation
    # gets "<eos>" attached directly to its last word
    text = "<sos> " + text + "<eos>"
    return text

# Apply preprocessing to question and answer columns
preprocessed_df = df.copy()
preprocessed_df['question'] = preprocessed_df['question'].apply(preprocess_text)
preprocessed_df['answer'] = preprocessed_df['answer'].apply(preprocess_text)

# Print the preprocessed DataFrame
preprocessed_df

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

     | question | answer | question_length | answer_length
0    | <sos> hi how are you doing <eos> | <sos> i m fine how about yourself <eos> | 22.0 | 29.0
1    | <sos> i m fine how about yourself <eos> | <sos> i m pretty good thanks for asking <eos> | 29.0 | 35.0
2    | <sos> i m pretty good thanks for asking <eos> | <sos> no problem so how have you been <eos> | 35.0 | 33.0
3    | <sos> no problem so how have you been <eos> | <sos> i ve been great what about you <eos> | 33.0 | 32.0
4    | <sos> i ve been great what about you <eos> | <sos> i ve been good i m in school right now ... | 32.0 | 40.0
...  | ... | ... | ... | ...
3730 | <sos> yes it was <eos> | <sos> do you want to watch it together sometim... | NaN | NaN
3731 | <sos> sure that sounds like a plan <eos> | <sos> awesome let s plan it for this weekend ... | NaN | NaN
3732 | <sos> sounds good to me <eos> | <sos> alright then it s a plan what time wor... | NaN | NaN
3733 | <sos> how about saturday evening <eos> | <sos> perfect saturday evening it is i ll bo... | NaN | NaN
3734 | <sos> great looking forward to it <eos> | <sos> me too it ll be fun <eos> | NaN | NaN

3735 rows × 4 columns
questions = preprocessed_df['question'].values.tolist()
answers = preprocessed_df['answer'].values.tolist()

# Tokenizing the data
tokenizer = Tokenizer(filters='')
tokenizer.fit_on_texts(np.concatenate((questions, answers), axis=0))
vocab_size = len(tokenizer.word_index) + 1

# Convert text to sequences
question_seqs = tokenizer.texts_to_sequences(questions)
answer_seqs = tokenizer.texts_to_sequences(answers)

# Pad sequences for equal length
max_len_question = max(len(seq) for seq in question_seqs)
max_len_answer = max(len(seq) for seq in answer_seqs)
max_len = max(max_len_question, max_len_answer)
print(max(max_len_question, max_len_answer))

# Pad sequences separately for questions and answers
question_seqs = pad_sequences(question_seqs, maxlen=max_len, padding='post')
answer_seqs = pad_sequences(answer_seqs, maxlen=max_len, padding='post')

tokenizer.texts_to_sequences("<sos>")

[[], [9], [490], [9], []]

tokenizer.word_index["<sos>"]

Model Architecture

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Attention, Concatenate, Dropout

# Define the model architecture
latent_dim = 256  # Dimensionality of the encoding space

# Encoder
encoder_inputs = Input(shape=(max_len,))
encoder_embedding = Embedding(vocab_size, latent_dim, input_shape=(max_len,))
encoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.1, recurrent_dropout=0.1)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding(encoder_inputs))
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = Input(shape=(max_len-1,))
decoder_embedding = Embedding(vocab_size, latent_dim, input_shape=(max_len-1,))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.1, recurrent_dropout=0.1)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding(decoder_inputs), initial_state=encoder_states)

# Attention mechanism
attention_layer = Attention()
attention_output = attention_layer([decoder_outputs, encoder_outputs])

# Concatenate attention output and decoder LSTM output
decoder_concat_input = Concatenate(axis=-1)([decoder_outputs, attention_output])

# Add dropout layer for regularization
decoder_concat_input = Dropout(0.1)(decoder_concat_input)

# Output layer
decoder_dense = Dense(vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_concat_input)

Model Training

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Print model summary
model.summary()
Model: "model"
Layer (type) | Output Shape | Param # | Connected to
=====================================================================
input_1 (InputLayer) | [(None, 24)] | 0 | []
input_2 (InputLayer) | [(None, 23)] | 0 | []
embedding (Embedding) | (None, 24, 256) | 618496 | ['input_1[0][0]']
embedding_1 (Embedding) | (None, 23, 256) | 618496 | ['input_2[0][0]']
lstm (LSTM) | [(None, 24, 256), (None, 256), (None, 256)] | 525312 | ['embedding[0][0]']
lstm_1 (LSTM) | [(None, 23, 256), (None, 256), (None, 256)] | 525312 | ['embedding_1[0][0]', 'lstm[0][1]', 'lstm[0][2]']
attention (Attention) | (None, 23, 256) | 0 | ['lstm_1[0][0]', 'lstm[0][0]']
concatenate (Concatenate) | (None, 23, 512) | 0 | ['lstm_1[0][0]', 'attention[0][0]']
dropout (Dropout) | (None, 23, 512) | 0 | ['concatenate[0][0]']
dense (Dense) | (None, 23, 2416) | 1239408 | ['dropout[0][0]']
=====================================================================
Total params: 3527024 (13.45 MB)
Trainable params: 3527024 (13.45 MB)
Non-trainable params: 0 (0.00 Byte)

# Train the model
model.fit([question_seqs, answer_seqs[:, :-1]], answer_seqs[:, 1:], batch_size=64, epochs=32, validation_split=0.2)

Epoch 1/32
47/47 [==============================] - 63s 1s/step - loss: 3.0299 - val_loss: 2.2781
Epoch 2/32
47/47 [==============================] - 49s 1s/step - loss: 1.9036 - val_loss: 2.0941
Epoch 3/32
47/47 [==============================] - 49s 1s/step - loss: 1.7891 - val_loss: 2.0648
Epoch 4/32
47/47 [==============================] - 55s 1s/step - loss: 1.7495 - val_loss: 2.0476
Epoch 5/32
47/47 [==============================] - 47s 1s/step - loss: 1.7074 - val_loss: 2.0242
Epoch 6/32
47/47 [==============================] - 48s 1s/step - loss: 1.6609 - val_loss: 2.0007
Epoch 7/32
47/47 [==============================] - 49s 1s/step - loss: 1.6212 - val_loss: 1.9783
Epoch 8/32
47/47 [==============================] - 47s 1s/step - loss: 1.5819 - val_loss: 1.9536
Epoch 9/32
47/47 [==============================] - 48s 1s/step - loss: 1.5394 - val_loss: 1.9287
Epoch 10/32
47/47 [==============================] - 50s 1s/step - loss: 1.4969 - val_loss: 1.9095
Epoch 11/32
47/47 [==============================] - 47s 1s/step - loss: 1.4552 - val_loss: 1.8925
Epoch 12/32

def generate_response(input_text):
    # Tokenize the input text
    input_sequence = tokenizer.texts_to_sequences([input_text])
    # Pad the input sequence
    input_sequence = pad_sequences(input_sequence, maxlen=max_len, padding='post')

    # Initialize the decoder input sequence with start token
    decoder_input_sequence = np.zeros((1, max_len - 1))
    decoder_input_sequence[0, 0] = tokenizer.word_index['<sos>']

    # Generate response using the trained model, one token per step.
    # Stop at max_len - 2 so that writing position i + 1 stays inside the array.
    for i in range(max_len - 2):
        predictions = model.predict([input_sequence, decoder_input_sequence])
        predicted_id = np.argmax(predictions[0, i, :])
        if predicted_id == tokenizer.word_index['<eos>']:
            break
        decoder_input_sequence[0, i + 1] = predicted_id

    # Convert output sequence to text
    output_text = ''
    for token_index in decoder_input_sequence[0]:
        if token_index == tokenizer.word_index['<eos>'] or token_index == 0:
            break
        output_text += tokenizer.index_word[int(token_index)] + ' '

    return output_text.strip()

# Test the function with input "how do you do"
input_text = "how do you do"
response = generate_response(input_text)
print("Response:", response[5:])
1/1 [==============================] - 2s 2s/step
1/1 [==============================] - 0s 175ms/step
1/1 [==============================] - 0s 175ms/step
1/1 [==============================] - 0s 172ms/step
1/1 [==============================] - 0s 171ms/step
Response: i m not sure

model_name='google/flan-t5-small'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will b
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secre
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
config.json: 100% 1.40k/1.40k [00:00<00:00, 21.9kB/s]
model.safetensors: 100% 308M/308M [00:03<00:00, 79.4MB/s]
generation_config.json: 100% 147/147 [00:00<00:00, 3.48kB/s]
tokenizer_config.json: 100% 2.54k/2.54k [00:00<00:00, 53.9kB/s]
spiece.model: 100% 792k/792k [00:00<00:00, 19.5MB/s]
tokenizer.json: 100% 2.42M/2.42M [00:00<00:00, 5.05MB/s]
special_tokens_map.json: 100% 2.20k/2.20k [00:00<00:00, 95.9kB/s]
def print_number_of_trainable_model_parameters(model):
    # Count every parameter, and separately those that will receive gradients
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 76961152
all model parameters: 76961152
percentage of trainable model parameters: 100.00%
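
The 100.00% figure means that every weight in the freshly loaded model has `requires_grad` set to true. The following minimal sketch shows how the same ratio shifts once some tensors are frozen. It uses a hypothetical `FakeParam` class as a stand-in for a PyTorch parameter so that it runs without PyTorch; the sizes are made up for illustration.

```python
class FakeParam:
    """Hypothetical stand-in for a torch parameter (illustration only)."""

    def __init__(self, size, requires_grad=True):
        self.size = size
        self.requires_grad = requires_grad

    def numel(self):
        # Number of scalar values held by this parameter
        return self.size


def trainable_fraction(params):
    """Return (trainable, total, percentage) over a list of parameters."""
    total = sum(p.numel() for p in params)
    trainable = sum(p.numel() for p in params if p.requires_grad)
    return trainable, total, 100 * trainable / total


params = [FakeParam(750), FakeParam(250)]
before = trainable_fraction(params)
print(before)

params[0].requires_grad = False  # freeze the larger tensor
after = trainable_fraction(params)
print(after)
```

Freezing part of a model in this way is one common reason the reported percentage drops below 100 during fine-tuning.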

Test the Model with Zero Shot Inferencing

Test the model with zero-shot inference. The model struggles to summarize the dialogue compared to the baseline summary, but it does pull out some important information from the text, which indicates that the model can be fine-tuned for the task at hand.

index = 0

question = df['question'][index]
answer = df['answer'][index]
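
The notebook is cut off before the inference cell itself. The following is a minimal sketch of what a zero-shot call could look like with the `original_model` and `tokenizer` loaded earlier. The prompt wording, the helper names `build_prompt` and `zero_shot_answer`, and the `max_new_tokens=50` limit are illustrative assumptions and are not taken from the notebook.

```python
def build_prompt(question):
    """Wrap a question in a plain instruction prompt (wording is an assumption)."""
    return f"Answer the following question.\n\nQuestion: {question}\n\nAnswer:"


def zero_shot_answer(model, tokenizer, question, max_new_tokens=50):
    """Generate an answer with no task-specific fine-tuning.

    model and tokenizer are expected to be Hugging Face objects such as an
    AutoModelForSeq2SeqLM / AutoTokenizer pair.
    """
    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    output_ids = model.generate(inputs["input_ids"], max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Building the prompt needs no model, so it can be inspected on its own.
prompt = build_prompt("hi, how are you doing?")
print(prompt)
```

With the notebook's objects this would be called as `zero_shot_answer(original_model, tokenizer, question)`, and the result compared by eye against `answer` as the human baseline.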



