TensorFlow Chatbot
CS 20SI:
TensorFlow for Deep Learning Research
Lecture 13
3/1/2017
1
Announcements
Assignment 3 out tonight, due March 17
3
Agenda
Seq2seq
Implementation keys
Chatbot craze
4
Sequence to Sequence
● The current model class of choice for most dialogue and
machine translation systems
● Introduced by Cho et al. in 2014 for Statistical Machine
Translation (the predecessor of NMT)
● The paper “Learning Phrase Representations using RNN
Encoder-Decoder for Statistical Machine Translation” has
been cited about 900 times, roughly one citing paper per day.
● Originally called the “RNN Encoder-Decoder”
5
Sequence to Sequence
Consists of two recurrent neural networks (RNNs):
● Encoder maps a variable-length source sequence (input) to a
fixed-length vector
● Decoder maps the vector representation back to a variable-length
target sequence (output)
● Two RNNs are trained jointly to maximize the conditional probability
of the target sequence given a source sequence
6
Vanilla Encoder and Decoder
Graph from “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation” (Cho et al.)
7
Encoder and Decoder in TensorFlow
● Each box in the picture represents a cell of the RNN, most commonly
a GRU cell or an LSTM cell.
● Encoder and decoder often have different weights, but sometimes
they can share weights.
9
Graph by Indico.io blog
Bucketing
● Avoid too much padding that leads to extraneous computation
● Group sequences of similar lengths into the same buckets
● Create a separate subgraph for each bucket
● In theory (with TF v1.0), you can use:
tf.contrib.training.bucket_by_sequence_length(max_length, examples,
    batch_size, bucket_boundaries, capacity=2 * batch_size, dynamic_pad=True)
● In practice, use the bucketing algorithm from TensorFlow’s translate model, since we’re on v0.12 (a sketch follows below)
12
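As a concrete illustration, here is a minimal sketch of translate-model-style bucketing in plain Python. The bucket list, PAD_ID, and the helper name put_in_bucket are illustrative rather than the assignment’s actual code; the real translate model additionally reverses the encoder inputs and prepends a GO token to the decoder inputs, which this sketch leaves out.

# Sketch of the bucket-assignment step, in the style of TensorFlow's translate model.
# `buckets` lists (max_source_len, max_target_len) pairs; PAD_ID is an assumed padding id.
PAD_ID = 0
buckets = [(8, 10), (12, 14), (16, 19)]

def put_in_bucket(source_ids, target_ids):
    """Return (bucket_id, padded_source, padded_target), or None if the pair is too long."""
    for bucket_id, (src_size, tgt_size) in enumerate(buckets):
        if len(source_ids) <= src_size and len(target_ids) <= tgt_size:
            padded_src = source_ids + [PAD_ID] * (src_size - len(source_ids))
            padded_tgt = target_ids + [PAD_ID] * (tgt_size - len(target_ids))
            return bucket_id, padded_src, padded_tgt
    return None  # longer than every bucket: drop or truncate the pair

# A 5-token question and a 7-token answer land in bucket 0 and get padded to (8, 10).
print(put_in_bucket([4, 8, 15, 16, 23], [42, 7, 7, 7, 7, 7, 2]))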
Sampled Softmax
● Avoid the growing complexity of computing the normalization
constant
● Approximate the negative term of the gradient by importance sampling with a small number of samples.
● At each step, update only the vectors associated with the
correct word w and with the sampled words in V’
● Once training is over, use the full target vocabulary to compute
the output probability of each target word
On Using Very Large Target Vocabulary for Neural Machine Translation (Jean et al., 2015)
13
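A hedged sketch of how the sampled loss is typically wired up in TensorFlow with tf.nn.sampled_softmax_loss. The sizes and names (vocab_size, num_samples, proj_w, proj_b) are illustrative, and both the argument order of sampled_softmax_loss and the signature expected of softmax_loss_function changed between TF versions, so check the API of the version you use.

import tensorflow as tf

vocab_size = 24000    # illustrative decoder vocabulary size
hidden_size = 512     # illustrative cell size
num_samples = 512     # number of sampled classes per step, much smaller than vocab_size

# Projection from cell outputs to vocabulary logits; also reused as output_projection.
w = tf.get_variable("proj_w", [hidden_size, vocab_size])
b = tf.get_variable("proj_b", [vocab_size])
output_projection = (w, b)

def sampled_loss(labels, inputs):
    # `inputs` are the pre-projection cell outputs; sampled_softmax_loss applies the
    # projection internally and normalizes over only num_samples sampled classes.
    labels = tf.reshape(labels, [-1, 1])
    return tf.nn.sampled_softmax_loss(
        weights=tf.transpose(w),   # expects shape [vocab_size, hidden_size]
        biases=b,
        labels=labels,
        inputs=inputs,
        num_sampled=num_samples,
        num_classes=vocab_size)

# Pass sampled_loss as softmax_loss_function when building the seq2seq loss (slide 20).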
Sampled Softmax
● Generally an underestimate of the full softmax loss.
● At inference time, compute the full softmax over the entire target vocabulary to get the output probability of each word.
15
Seq2seq in TensorFlow
outputs, states = basic_rnn_seq2seq(encoder_inputs, decoder_inputs, cell)
16
Seq2seq in TensorFlow
outputs, states = embedding_rnn_seq2seq(encoder_inputs,
decoder_inputs,
cell,
num_encoder_symbols,
num_decoder_symbols,
embedding_size,
output_projection=None,
feed_previous=False)
To embed the inputs and outputs, you need to specify the number of input and output tokens (num_encoder_symbols, num_decoder_symbols).
feed_previous: if True, the decoder is fed its own previous prediction as the next input (instead of the ground-truth token), even when the model makes mistakes.
output_projection: a (weight, bias) pair that projects cell outputs to vocabulary logits; needed when using sampled softmax (see the sketch below).
18
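A sketch (not the assignment’s exact code) of how these two arguments are typically used together: build the training graph with feed_previous=False, build a second variable-sharing graph with feed_previous=True for chatting, and project the outputs yourself when output_projection is set. Module paths assume TF v0.12 as used in the lecture (tf.nn.seq2seq, tf.nn.rnn_cell); in v1.x they moved to tf.contrib.legacy_seq2seq and tf.contrib.rnn.

import tensorflow as tf

# Illustrative sizes, not the assignment's hyperparameters.
enc_vocab, dec_vocab, embed_size, hidden_size = 20000, 20000, 256, 256
bucket_src, bucket_tgt = 8, 10

cell = tf.nn.rnn_cell.GRUCell(hidden_size)
encoder_inputs = [tf.placeholder(tf.int32, [None], name="enc%d" % i) for i in range(bucket_src)]
decoder_inputs = [tf.placeholder(tf.int32, [None], name="dec%d" % i) for i in range(bucket_tgt)]

w = tf.get_variable("proj_w", [hidden_size, dec_vocab])   # projection used with sampled softmax
b = tf.get_variable("proj_b", [dec_vocab])
output_projection = (w, b)

with tf.variable_scope("seq2seq") as scope:
    # Training graph: the decoder sees the ground-truth previous token.
    train_outputs, _ = tf.nn.seq2seq.embedding_rnn_seq2seq(
        encoder_inputs, decoder_inputs, cell,
        num_encoder_symbols=enc_vocab, num_decoder_symbols=dec_vocab,
        embedding_size=embed_size, output_projection=output_projection,
        feed_previous=False)
    scope.reuse_variables()
    # Chatting graph: the decoder feeds its own previous prediction back in.
    chat_outputs, _ = tf.nn.seq2seq.embedding_rnn_seq2seq(
        encoder_inputs, decoder_inputs, cell,
        num_encoder_symbols=enc_vocab, num_decoder_symbols=dec_vocab,
        embedding_size=embed_size, output_projection=output_projection,
        feed_previous=True)

# With output_projection set, outputs are raw cell outputs rather than logits:
# project them before picking the next word at chat time.
chat_logits = [tf.matmul(o, w) + b for o in chat_outputs]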
Seq2seq in TensorFlow
outputs, states = embedding_attention_seq2seq(encoder_inputs,
decoder_inputs,
cell,
num_encoder_symbols,
num_decoder_symbols,
num_heads=1,
output_projection=None,
feed_previous=False,
initial_state_attention=False)
19
Wrapper for seq2seq with buckets
outputs, losses = model_with_buckets(encoder_inputs,
decoder_inputs,
targets,
weights,
buckets,
seq2seq,
softmax_loss_function=None,
per_example_loss=False)
20
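A sketch of how the pieces fit together, again with illustrative sizes and TF v0.12 module paths: model_with_buckets takes a seq2seq constructor (wrapped in a small function) plus the sampled-softmax loss from slide 13, and returns one list of outputs and one loss per bucket.

import tensorflow as tf

buckets = [(8, 10), (12, 14), (16, 19)]
vocab_size, hidden_size = 20000, 256
cell = tf.nn.rnn_cell.GRUCell(hidden_size)

# One placeholder per time step, sized for the largest bucket.
max_src, max_tgt = buckets[-1]
encoder_inputs = [tf.placeholder(tf.int32, [None], name="enc%d" % i) for i in range(max_src)]
decoder_inputs = [tf.placeholder(tf.int32, [None], name="dec%d" % i) for i in range(max_tgt + 1)]
targets = decoder_inputs[1:]   # targets are the decoder inputs shifted by one
target_weights = [tf.placeholder(tf.float32, [None], name="w%d" % i) for i in range(max_tgt)]

def seq2seq_f(enc, dec):
    # Any of the seq2seq constructors fits here; the attention version (slide 19) is shown.
    return tf.nn.seq2seq.embedding_attention_seq2seq(
        enc, dec, cell,
        num_encoder_symbols=vocab_size, num_decoder_symbols=vocab_size,
        embedding_size=hidden_size, feed_previous=False)

outputs, losses = tf.nn.seq2seq.model_with_buckets(
    encoder_inputs, decoder_inputs, targets, target_weights,
    buckets, seq2seq_f,
    softmax_loss_function=None)   # plug in the sampled-softmax loss function here

# outputs[i] and losses[i] belong to bucket i: at each training step, pick a bucket,
# build a batch padded to that bucket's size, and run that bucket's loss/optimizer.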
Our TensorFlow chatbot
21
Cornell Movie-Dialogs Corpus
● 220,579 conversational exchanges between 10,292 pairs of movie characters
● 9,035 characters from 617 movies
● 304,713 total utterances
● Very well-formatted (almost perfect)
22
Input Length Distribution
23
Bucketing
9 buckets
[(6, 8), (8, 10), (10, 12), (13, 15), (16, 19), (19, 22), (23, 26), (29, 32), (39, 44)] # bucket boundaries
[19530, 17449, 17585, 23444, 22884, 16435, 17085, 18291, 18931] # number of samples in each bucket
5 buckets
[(8, 10), (12, 14), (16, 19), (23, 26), (39, 43)] # bucket boundaries
[37049, 33519, 30223, 33513, 37371] # number of samples in each bucket
3 buckets - recommended
[(8, 10), (12, 14), (16, 19)] # bucket boundaries
[37899, 34480, 31045] # number of samples in each bucket
24
Vocabulary tradeoff
● Get every token that appears at least a certain number of times (e.g., twice); a minimal counting sketch follows below
● Alternative approach: use a fixed-size vocabulary
Smaller vocabulary:
● Has smaller loss/perplexity, but loss/perplexity isn’t everything
● Gives <unk> answers to questions that require personal information
● Makes the bot’s answers less varied and responsive
● Doesn’t train much faster than a big vocabulary when using sampled softmax
25
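A minimal sketch of the frequency-threshold approach in plain Python; the special-token list and the helper name build_vocab are assumptions rather than the assignment’s actual code.

from collections import Counter

# Hypothetical special tokens; the assignment defines its own set.
SPECIAL_TOKENS = ['<pad>', '<unk>', '<s>', '</s>']

def build_vocab(tokenized_lines, min_count=2):
    """Keep every token seen at least `min_count` times; everything else becomes <unk>."""
    counts = Counter(token for line in tokenized_lines for token in line)
    vocab = SPECIAL_TOKENS + sorted(t for t, c in counts.items() if c >= min_count)
    return {token: idx for idx, token in enumerate(vocab)}

token2id = build_vocab([['hi', 'there'], ['hi', '.'], ['bye', '.']])
# 'hi' and '.' survive (seen twice); 'there' and 'bye' will be encoded as <unk>.
print(token2id)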
Model
● Seq2seq
● Attentional decoder
● Reverse encoder inputs
● Bucketing
● Sampled softmax
● Based on Google’s vanilla translate model,
originally used to translate from English to French
26
The sassiest bot
(lines starting with “>” are the human; the other lines are the bot)
> hi
you have you go .
> where?
in mind .
> you're weird?
you ' re right . . .
> what do you want?
who is you me .
> i don't understand
you know you look .
> i look what?
you nothing .
> you're funny
you ' re not .
27
The sassiest bot
> hi
hi .
> how're you?
just go .
> go where?
<unk> .
> omg
<unk> !
> you're doing that again!
i ' m not .
> stop telling me to go somewhere!
go !
> it's almost like you want to piss me off.
i ' m not .
> you're stupid
i ' m not .
28
It sort of works
> you are a monster!
i am . . .
> i'm sorry
you don ' t need to boy .
> i'm a girl!
you done ?
==========
> Did you change your hair?
no .
> no?
no i won t .
> who?
paul owen .
> where?
in a harem .
29
Sort of doesn’t
> what?
don ' t bother .
> you're being sassy
i need .
> what do you need?
i need ya help ya .
> okay
what do you want ?
30
Sort of doesn’t
> you're just talking gibberish
and what are you talking about ?
> about life
and what about it ?
> life is beautiful
is good .
> where do you live?
up .
> as in heaven?
in his apartment .
> who's he?
who ' s here ?
> oh well
and you are in trouble .
> i'm not!
please !
> whatever
i don ' t know what to say .
31
See sample conversations on GitHub
(in assignments)
32
Sanity check?
33
Problems?
● The bot is very dramatic (thanks to Hollywood screenwriters)
● Topics of conversations aren’t realistic
● Responses are deterministic: the same encoder input always gets the same response
● Inconsistent personality
● Uses only the immediately preceding utterance as the encoder input
● Doesn’t keep track of information about users
35
Train on multiple datasets
● Twitter chat log (courtesy of Marsan Ma)
● More movie subtitles (less clean)
● All publicly available Reddit comments (1TB of data!)
● Your own conversations (chat logs, text messages, emails)
36
Example of Twitter chat log
37
Chatbot with personalities
● At the decoding phase, inject consistent information about the bot
For example: name, age, hometown, current location, job
● Use the decoder inputs from one person only
For example: your own Sheldon Cooper bot!
38
Train on the incoming inputs
● Save the conversation with users and train on those conversations
● Create a feedback loop so users can correct the bot’s responses
39
Remember what users say
● The bot can extract information the user gives it
> hi
hi . what ' s your name ?
> my name is chip
nice to meet you .
> what's my name?
let ' s talk about something else .
40
Use characters instead of tokens
● Character-level language modeling seems to work quite well
● Smaller vocabulary -- no unknown tokens!
● But the sequences will be much longer (approximately 4x longer)
41
Improve input pipeline
● Right now, 50% of the running time is spent on generating batches! (One possible fix is sketched below.)
42
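Not the assignment’s solution, but one common fix sketched under assumptions: build batches in a background thread and keep a small queue full, so batch construction overlaps with session.run. Here get_batch and run_step stand in for your own batch generator and training step.

import queue       # Python 3; on Python 2 the module is called Queue
import threading

def batch_producer(get_batch, out_queue, num_batches):
    """Fill out_queue with batches while the main thread runs training steps."""
    for _ in range(num_batches):
        out_queue.put(get_batch())
    out_queue.put(None)   # sentinel: no more batches

def train(get_batch, run_step, num_batches, prefetch=8):
    batches = queue.Queue(maxsize=prefetch)
    producer = threading.Thread(target=batch_producer, args=(get_batch, batches, num_batches))
    producer.daemon = True
    producer.start()
    while True:
        batch = batches.get()
        if batch is None:
            break
        run_step(batch)   # e.g. sess.run(train_op, feed_dict=...)
    producer.join()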
See assignment 3 handout
43
Next class
More discussion on chatbots
Feedback: huyenn@stanford.edu
Thanks!
44