Natural Language Processing with RNNs
In this notebook we will use recurrent neural networks for two tasks:

Sentiment Analysis
Character Generation
RNNs are complex and come in many different forms, so in this tutorial we will focus on how
they work and the kinds of problems they are best suited for.
Sequence Data
In the previous tutorials we focused on data that we could represent as one static data point,
where the notion of time or step was irrelevant. Take, for example, our image data: it was simply a
tensor of shape (width, height, channels). That data doesn't change and doesn't depend on any
notion of time.
In this tutorial we will look at sequences of text and learn how we can encode them in a
meaningful way. Unlike images, sequence data such as long chains of text, weather patterns,
videos and really anything where the notion of a step or time is relevant needs to be processed
and handled in a special way.
But what do I mean by sequences, and why is text data a sequence? Well, that's a good question.
Since textual data contains many words that follow in a very specific and meaningful order, we
need to be able to keep track of each word and when it occurs in the data. Simply encoding, say,
an entire paragraph of text into one data point wouldn't give us a very meaningful picture of the
data and would be very difficult to do anything with. This is why we treat text as a sequence and
process one word at a time. We will keep track of where each of these words appears and use
that information to try to understand the meaning of pieces of text.
Encoding Text
As we know machine learning models and neural networks don't take raw text data as an input.
This means we must somehow encode our textual data to numeric values that our models can
understand. There are many different ways of doing this and we will look at a few examples
below.
Before we get into the different encoding/preprocessing methods let's understand the
information we can get from textual data by looking at the following two movie reviews.
I thought the movie was going to be bad, but it was actually amazing!
I thought the movie was going to be amazing, but it was actually bad!
Although these two sentences are very similar, we know that they have very different meanings.
This is because of the ordering of the words, a very important property of textual data.
Now keep that in mind while we consider some different ways of encoding our textual data.
Bag of Words
The first and simplest way to encode our data is to use something called bag of words. This is a
pretty easy technique where each word in a sentence is encoded with an integer and thrown into
a collection that does not maintain the order of the words but does keep track of the frequency.
Have a look at the Python function below that encodes a string of text into a bag of words.
vocab = {}  # maps word to integer representing it
word_encoding = 1

def bag_of_words(text):
    global word_encoding
    words = text.lower().split(" ")  # create a list of all of the words in the text
    bag = {}  # stores all of the encodings and their frequency
    for word in words:
        if word in vocab:
            encoding = vocab[word]  # this word has been seen before, reuse its encoding
        else:
            vocab[word] = word_encoding  # assign the next free integer to this new word
            encoding = word_encoding
            word_encoding += 1
        if encoding in bag:
            bag[encoding] += 1
        else:
            bag[encoding] = 1
    return bag
text = "this is a test to see if this test will work is is test a a"
bag = bag_of_words(text)
print(bag)
print(vocab)
This isn't really the way we would do this in practice, but I hope it gives you an idea of how bag
of words works. Notice that we've lost the order in which words appear. In fact, let's look at how
this encoding works for the two sentences we showed above.
positive_review = "I thought the movie was going to be bad but it was actually amazing"
negative_review = "I thought the movie was going to be amazing but it was actually bad"
pos_bag = bag_of_words(positive_review)
neg_bag = bag_of_words(negative_review)
print("Positive:", pos_bag)
print("Negative:", neg_bag)
We can see that even though these sentences have very different meanings, they are encoded in
exactly the same way. Obviously, this isn't going to fly. Let's look at some other methods.
Integer Encoding
The next technique we will look at is called integer encoding. This involves representing each
word or character in a sentence as a unique integer and maintaining the order of these words.
This should hopefully fix the problem we saw before, where we lost the order of words.
vocab = {}
word_encoding = 1

def one_hot_encoding(text):  # despite the name, this performs integer encoding: each word gets its own integer
    global word_encoding
    words = text.lower().split(" ")
    encoding = []
    for word in words:
        if word in vocab:
            encoding.append(vocab[word])
        else:
            vocab[word] = word_encoding
            encoding.append(word_encoding)
            word_encoding += 1
    return encoding

text = "this is a test to see if this test will work is is test a a"
encoding = one_hot_encoding(text)
print(encoding)
print(vocab)
And now let's have a look at how this encoding handles our movie reviews.
positive_review = "I thought the movie was going to be bad but it was actually amazing"
negative_review = "I thought the movie was going to be amazing but it was actually bad"
pos_encode = one_hot_encoding(positive_review)
neg_encode = one_hot_encoding(negative_review)
print("Positive:", pos_encode)
print("Negative:", neg_encode)
Much better! Now we are keeping track of the order of words and we can tell where each occurs.
But this still has a few issues. Ideally when we encode words, we would like similar words to
have similar labels and different words to have very different labels. For example, the words
happy and joyful should probably have very similar labels so we can determine that they are
similar, while words like horrible and amazing should have very different labels. The method we
looked at above won't do something like this for us. This could mean that the model will have a
very difficult time determining whether two words are similar, which could result in some pretty
drastic performance impacts.
Word Embeddings
Luckily there is a third method that is far superior: word embeddings. This method keeps the
order of words intact and also encodes similar words with very similar labels. It attempts to
encode not only the frequency and order of words but also the meaning of those words in the
sentence. It encodes each word as a dense vector that represents its context in the sentence.
Unlike the previous techniques, word embeddings are learned by looking at many different
training examples. You can add what's called an embedding layer to the beginning of your model,
and while your model trains, the embedding layer will learn the correct embeddings for words.
You can also use pretrained embedding layers.
This is the technique we will use for our examples, and its implementation will be shown later
on.
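As a quick preview (the sizes here are illustrative, not the values used later), an embedding layer in Keras is essentially a lookup table that maps integer word ids to dense vectors:

import tensorflow as tf

# a toy embedding layer: maps each integer word id (0..999) to a dense 8-dimensional vector
embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=8)

# a "sentence" of 5 word ids; the output has shape (1, 5, 8): one 8-d vector per word
word_ids = tf.constant([[4, 20, 7, 4, 9]])
vectors = embedding(word_ids)
print(vectors.shape)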
A recurrent neural network (RNN) is simply a network that contains a loop: it processes its input
one step at a time while keeping an internal memory of what it has already seen. This is why we
are treating our text data as a sequence, so that we can pass one word at a time to the RNN.
(Diagram of an unrolled recurrent layer. Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
Let's define what these variables stand for before we get into the explanation:
h_t: the output at time t
x_t: the input at time t
What this diagram illustrates is that a recurrent layer processes the input one word (one step) at
a time, in combination with the output from the previous iteration. So, as we progress further into
the input sequence, we build up a more complex understanding of the text as a whole.
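To make that concrete, here is a minimal NumPy sketch (not the actual Keras implementation; the tanh activation and the weight names are just illustrative) of what a simple recurrent layer computes at each step:

import numpy as np

# the new output h_t depends on the current input x_t and the previous output h_prev
def simple_rnn_step(x_t, h_prev, W_x, W_h, b):
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

# process a toy sequence of 4 inputs (each a vector of size 3) with an output of size 5
rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(3, 5)), rng.normal(size=(5, 5)), np.zeros(5)
h = np.zeros(5)  # initial output h_0
for x in rng.normal(size=(4, 3)):  # one step per element of the sequence
    h = simple_rnn_step(x, h, W_x, W_h, b)
print(h)  # the final output summarizes the whole sequence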
What we've just looked at is called a simple RNN layer. It can be effective at processing shorter
sequences of text for simple problems, but it has many downfalls. One of them is the fact that,
as text sequences get longer, it gets increasingly difficult for the network to understand the text
properly.
LSTM
The layer we discussed in depth above is called a SimpleRNN. However, there exist other
recurrent layers (layers that contain a loop) that work much better than a simple RNN layer.
The one we will talk about here is the LSTM (Long Short-Term Memory) layer. This layer works
very similarly to the SimpleRNN layer but adds a way to access inputs from any timestep in the
past. Whereas in our simple RNN layer input from previous timesteps gradually disappeared as
we got further through the input, with an LSTM we have a long-term memory data structure
storing all the previously seen inputs as well as when we saw them. This allows us to access any
previous value we want at any point in time. This adds to the complexity of our network and
allows it to discover more useful relationships between inputs and when they appear.
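In Keras, swapping a SimpleRNN layer for an LSTM layer is a one-line change; here is a quick illustrative comparison (the layer and input sizes are arbitrary):

import tensorflow as tf

# both layers take a batch of sequences of feature vectors, shape (batch, timesteps, features),
# and by default return only their final output, shape (batch, units)
simple_layer = tf.keras.layers.SimpleRNN(32)
lstm_layer = tf.keras.layers.LSTM(32)

x = tf.random.normal((4, 10, 8))  # 4 sequences, 10 timesteps, 8 features each
print(simple_layer(x).shape)  # (4, 32)
print(lstm_layer(x).shape)    # (4, 32)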
For the purpose of this course we will refrain from going any further into the math or details
behind how these layers work.
Sentiment Analysis
And now it's time to see a recurrent neural network in action. For this example, we are going to
do something called sentiment analysis:
the process of computationally identifying and categorizing opinions expressed in a piece of text,
especially in order to determine whether the writer's attitude towards a particular topic, product,
etc. is positive, negative, or neutral.
The example we'll use here is classifying movie reviews as either positive, negative, or neutral.
%tensorflow_version 2.x # this line is not required unless you are in a notebook
from keras.datasets import imdb
from keras.preprocessing import sequence
import keras
import tensorflow as tf
import os
import numpy as np
VOCAB_SIZE = 88584
MAXLEN = 250
BATCH_SIZE = 64
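The cell that actually loads the reviews isn't visible in this extract; presumably it uses the IMDB movie review dataset that ships with Keras (imported above), something like:

# each review comes already encoded as a list of integer word ids
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=VOCAB_SIZE)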
More Preprocessing
If we have a look at some of our loaded in reviews, we'll notice that they are different lengths.
This is an issue. We cannot pass different length data into our neural network. Therefore, we
must make each review the same length. To do this we will follow the procedure below:
if the review is greater than 250 words then trim off the extra words
if the review is less than 250 words add the necessary amount of 0's to make it equal to
250.
Luckily for us, Keras has a function that can do this for us:
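That function is sequence.pad_sequences (imported above); the call that belongs here presumably looks like this:

# pad (or trim) every review to exactly MAXLEN integers
train_data = sequence.pad_sequences(train_data, MAXLEN)
test_data = sequence.pad_sequences(test_data, MAXLEN)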
32 stands for the output dimension of the vectors generated by the embedding layer. We can
change this value if we'd like!
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

model.summary()
Training
Now it's time to compile and train the model.
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=['acc'])
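The training call itself isn't shown in this extract; a minimal sketch (the epoch count and validation split are illustrative choices):

history = model.fit(train_data, train_labels, epochs=10, validation_split=0.2)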
And we'll evaluate the model on the test data to see how well it performs on reviews it hasn't seen.
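The evaluation cell isn't shown either; it would be something like:

results = model.evaluate(test_data, test_labels)
print(results)  # [loss, accuracy]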
So we're scoring somewhere in the mid-high 80's. Not bad for a simple recurrent network.
Making Predictions
Now let's use our network to make predictions on our own reviews.
Since our reviews are encoded, we'll need to convert any review that we write into that form so
the network can understand it. To do that, we'll load the encodings from the dataset and use
them to encode our own data.
word_index = imdb.get_word_index()
def encode_text(text):
    tokens = keras.preprocessing.text.text_to_word_sequence(text)
    tokens = [word_index[word] if word in word_index else 0 for word in tokens]
    return sequence.pad_sequences([tokens], MAXLEN)[0]
# build a reverse lookup so we can turn integers back into words
reverse_word_index = {value: key for (key, value) in word_index.items()}

def decode_integers(integers):
    PAD = 0
    text = ""
    for num in integers:
        if num != PAD:
            text += reverse_word_index[num] + " "
    return text[:-1]

encoded = encode_text("that movie was just amazing, so amazing")  # example text to round-trip; any short review works here
print(decode_integers(encoded))
def predict(text):
    encoded_text = encode_text(text)
    pred = np.zeros((1, 250))
    pred[0] = encoded_text
    result = model.predict(pred)
    print(result[0])
# the ends of these review strings were cut off in this extract; the endings used here are illustrative
positive_review = "That movie was! really loved it and would great watch it again because it was amazingly great"
predict(positive_review)

negative_review = "that movie really sucked. I hated it and wouldn't watch it again. Was one of the worst things I've ever watched"
predict(negative_review)
Character Generation
Now it's time for one of the coolest examples we've seen so far. We are going to use an RNN to
generate a play. We will simply show the RNN an example of something we want it to recreate
and it will learn how to write a version of it on its own. We'll do this using a character-predictive
model that takes a variable-length sequence as input and predicts the next character. We can
then call the model many times in a row, with the output from the last prediction as the input for
the next call, to generate a sequence.
%tensorflow_version 2.x # this line is not required unless you are in a notebook
from keras.preprocessing import sequence
import keras
import tensorflow as tf
import os
import numpy as np
Dataset
For this example, we only need one piece of training data. In fact, we can write our own poem or
play and pass that to the network for training if we'd like. However, to make things easy we'll use
an extract from a Shakespeare play.
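The download cell isn't shown in this extract; a minimal sketch using the Shakespeare file from the TensorFlow text-generation tutorial cited in the sources:

path_to_file = tf.keras.utils.get_file(
    'shakespeare.txt',
    'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

# read the file and decode it into one long string
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
print('Length of text: {} characters'.format(len(text)))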
Encoding
Since this text isn't encoded yet, we'll need to do that ourselves. We are going to encode each
unique character as a different integer.
vocab = sorted(set(text))
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
def text_to_int(text):
return np.array([char2idx[c] for c in text])
text_as_int = text_to_int(text)
And here we will make a function that can convert our numeric values to text.
def int_to_text(ints):
try:
ints = ints.numpy()
except:
pass
return ''.join(idx2char[ints])
print(int_to_text(text_as_int[:13]))
The training examples we will prepare will use a seq_length sequence as input and a seq_length
sequence as the output, where the output sequence is the original sequence shifted one letter to
the right. For example:
input: Hell | output: ello
Our first step will be to create a stream of characters from our text data.
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
Next we can use the batch method to turn this stream of characters into batches of desired
length.
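The cell that does this isn't visible here; assuming a sequence length of 100 (consistent with the length-101 batches mentioned below), it would look like:

seq_length = 100  # length of each training example
# batches of 101 characters: 100 for the input plus 1 extra so we can build the shifted output
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)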
Now we need to use these sequences of length 101 and split them into input and output.
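The splitting cell also isn't shown; a minimal sketch that produces the dataset used in the loop below:

def split_input_target(chunk):  # e.g. "hello" -> ("hell", "ello")
    input_text = chunk[:-1]   # everything except the last character
    target_text = chunk[1:]   # everything except the first character
    return input_text, target_text

dataset = sequences.map(split_input_target)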
for x, y in dataset.take(2):
    print("\n\nEXAMPLE\n")
    print("INPUT")
    print(int_to_text(x))
    print("\nOUTPUT")
    print(int_to_text(y))
BATCH_SIZE = 64
VOCAB_SIZE = len(vocab)  # the number of unique characters in the text
EMBEDDING_DIM = 256
RNN_UNITS = 1024
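The cells that shuffle the data into training batches and build the model aren't visible in this extract. The sketch below follows the architecture implied by the constants above and by the TensorFlow text-generation tutorial in the sources (an Embedding layer feeding an LSTM feeding a Dense layer over the vocabulary); the BUFFER_SIZE value and the build_model helper are assumptions.

BUFFER_SIZE = 10000
data = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    # wrapping this in a function lets us rebuild the same model later with batch_size=1
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units,
                             return_sequences=True,
                             stateful=True,
                             recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])

model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, BATCH_SIZE)
model.summary()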
However, before we train it, let's have a look at a sample input and the output from our untrained
model. This is so we can understand what the model is giving us.
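The cell that produces example_batch_predictions isn't shown; assuming the batched dataset from the sketch above is called data, it would be something like:

# run one batch of 64 input sequences through the untrained model
for input_example_batch, target_example_batch in data.take(1):
    example_batch_predictions = model(input_example_batch)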
# we can see that the prediction is an array of 64 arrays, one for each entry in the batch
print(len(example_batch_predictions))
print(example_batch_predictions)
# if we want to determine the predicted character we need to sample the output distribution
pred = example_batch_predictions[0]  # the prediction for the first sequence in the batch
sampled_indices = tf.random.categorical(pred, num_samples=1)

# now we can reshape that array and convert all the integers to characters to see the actual predicted text
sampled_indices = np.reshape(sampled_indices, (1, -1))[0]
predicted_chars = int_to_text(sampled_indices)
predicted_chars # and this is what the model predicted for training sequence 1
So now we need to create a loss function that can compare that output to the expected output
and give us some numeric value representing how close the two were.
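The definition of that loss function isn't shown in this extract, but model.compile below refers to a function named loss; since the model outputs raw logits over the vocabulary at every character position, it is presumably sparse categorical cross-entropy computed from logits:

def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)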
model.compile(optimizer='adam', loss=loss)
Creating Checkpoints
Now we are going to set up and configure our model to save checkpoints as it trains. This will
allow us to load our model from a checkpoint and continue training it.
# the checkpoint directory matches the path used below when we reload specific checkpoints
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")  # one checkpoint file per epoch

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
Training
Finally, we will start training the model.
If this is taking a while go to Runtime > Change Runtime Type and choose "GPU" under
hardware accelerator.
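The training cell itself isn't shown; a minimal sketch, where data is the batched dataset from the earlier sketch and the epoch count is an illustrative choice:

history = model.fit(data, epochs=40, callbacks=[checkpoint_callback])  # more epochs generally means better generated text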
Once the model is finished training, we can find the latest checkpoint that stores the model's
weights using the following lines.
# rebuild the model with a batch size of 1 (using the build_model helper sketched earlier)
# so we can feed it one sequence at a time, then load the trained weights into it
model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

We can also load any specific checkpoint we want by pointing at its file prefix directly:

checkpoint_num = 10
model.load_weights("./training_checkpoints/ckpt_" + str(checkpoint_num))
model.build(tf.TensorShape([1, None]))
Generating Text
Now we can use a function based on the one provided by TensorFlow to generate some text
using any starting string we'd like. It works exactly as described at the start of this section:
encode the starting string, get the model's prediction for the next character, sample from that
distribution, and feed the sampled character back in as the next input.
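Only a few lines of that cell survived in this extract (the tf.squeeze, tf.expand_dims and text_generated.append calls), so here is a minimal sketch of the full loop, following the TensorFlow text-generation tutorial cited in the sources; num_generate and temperature are illustrative values.

def generate_text(model, start_string, num_generate=800, temperature=1.0):
    # vectorize the starting string
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    text_generated = []
    model.reset_states()  # clear the stateful LSTM before generating

    for _ in range(num_generate):
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)  # remove the batch dimension

        # scale by temperature (lower = more predictable) and sample the next character id
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()

        # feed the predicted character back in as the next input
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])

    return start_string + ''.join(text_generated)

inp = input("Type a starting string: ")
print(generate_text(model, inp))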
And that's pretty much it for this module! I highly recommend messing with the model we just
created and seeing what you can get it to do!
Sources
1. Chollet, François. Deep Learning with Python. Manning Publications Co., 2018.
2. "Text Classification with an RNN: TensorFlow Core." TensorFlow,
www.tensorflow.org/tutorials/text/text_classification_rnn.
3. "Text Generation with an RNN: TensorFlow Core." TensorFlow,
www.tensorflow.org/tutorials/text/text_generation.
4. "Understanding LSTM Networks." Understanding LSTM Networks -- Colah's Blog,
https://colah.github.io/posts/2015-08-Understanding-LSTMs/.