Model5 partial
• You can see that, to represent a word this way, we waste a lot of memory just storing 0s (a sparse matrix). One-hot encodings also don’t reflect any relation between similar words; each one is just a representation of a word with a single ‘1’. Two similar words such as “accurate” and “exact” might sit at very different positions in their one-hot encodings.
Embedding
• What if we could represent a word using less space, with a representation that carries meaning we can learn from?
• Word embeddings also represent words as arrays, but as continuous vectors rather than 0s and 1s.
• They can represent any word in a few dimensions, far fewer than the one-hot encoding, whose length equals the number of unique words in our text.
• They are dense, low-dimensional vectors.
• They are not hardcoded but are “learned” from data.
Embedding
• An embedding is a dense vector that represents a word (or a symbol).
• By default, the embedding vectors are randomly initialized and then gradually improved during the training phase, via the gradient descent algorithm at each back-propagation step, so that similar words (words in the same lexical field, words sharing a common stem, …) end up close to each other in the new vector space.
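A minimal sketch of such a layer, assuming PyTorch (the vocabulary and embedding sizes below are made up): the vectors start random and are updated like any other weights during back-propagation.

import torch
import torch.nn as nn

vocab_size, embed_dim = 10000, 64                 # hypothetical vocabulary and embedding size
embedding = nn.Embedding(vocab_size, embed_dim)   # randomly initialized dense vectors

word_ids = torch.tensor([12, 7, 341])             # indices of three words in the vocabulary
vectors = embedding(word_ids)                     # shape (3, 64): one dense vector per word

# embedding.weight is a trainable parameter, so gradient descent gradually
# moves related words closer together in the vector space during training.
print(vectors.shape, embedding.weight.requires_grad)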
Embedding
• One-hot encoding representation is sparse and doesn’t capture the relationship between words. For example, if your model learns from the training data that the word after “pumpkin” in the first sentence is “pie,” can it fill the second sentence’s blank with “pie”?
1. [training data] My favorite Christmas dessert is pumpkin pie.
2. [testing data] My favorite Christmas dessert is apple ____.
• The algorithm cannot learn the relationship between “pumpkin” and “apple” from the distance between the one-hot encodings of the two words. If we can find a way to create features to represent the words, we may teach the model that pumpkin and apple are both foods, and that the distance between the feature representations of these two words is smaller than, for example, the distance between “apple” and “nurse.” So when the model sees “apple” in a new sentence and needs to predict the following word, it is more likely to choose “pie,” since it has seen “pumpkin pie” in the training data. The idea of word embedding is to learn a set of features to represent words.
Embedding
• For example, you can score each word in the dictionary
according to a group of features like this:
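As a toy illustration (the feature names and scores below are invented for this example):

Feature     apple   pumpkin   nurse
food         0.95     0.97     0.01
person       0.01     0.00     0.97
dessert      0.60     0.70     0.00

With such features, “apple” and “pumpkin” end up close to each other, while both are far from “nurse.”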
Bidirectional RNNs
• A bi-directional recurrent neural network (Bi-RNN) is a type of recurrent
neural network (RNN) that processes input data in both forward and
backward directions. The goal of a Bi-RNN is to capture the contextual
dependencies in the input data by processing it in both directions, which
can be useful in various natural language processing (NLP) tasks.
• In a Bi-RNN, the input data is passed through two separate RNNs: one
processes the data in the forward direction, while the other processes it in
the reverse direction. The outputs of these two RNNs are then combined in
some way to produce the final output.
• One common way to combine the outputs of the forward and reverse
RNNs is to concatenate them. Still, other methods, such as element-wise
addition or multiplication, can also be used. The choice of combination
method can depend on the specific task and the desired properties of the
final output.
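A minimal sketch, assuming PyTorch, of a bidirectional RNN whose forward and backward outputs are concatenated (the sizes are made up):

import torch
import torch.nn as nn

# Hypothetical sizes: 8 input features, 16 hidden units per direction
birnn = nn.LSTM(input_size=8, hidden_size=16, bidirectional=True, batch_first=True)

x = torch.randn(4, 20, 8)          # batch of 4 sequences, 20 time steps, 8 features
output, (h_n, c_n) = birnn(x)

# The forward and backward outputs are concatenated at each time step,
# so the last dimension is 2 * hidden_size.
print(output.shape)                # torch.Size([4, 20, 32])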
Bidirectional RNNs: Need
• A uni-directional recurrent neural network (RNN) processes input
sequences in a single direction, either from left to right or right to left.
• This means the network can only use information from earlier time
steps when making predictions at later time steps.
• This can be limiting, as the network may not capture important
contextual information relevant to the output prediction.
• For example, in natural language processing tasks, a uni-directional RNN may not accurately predict a missing word in a sentence when the words that come after it provide important context.
Bidirectional RNNs
• Consider an example where we could use the recurrent network to predict
the masked word in a sentence.
• I am ___.
• I am ___ hungry.
• I am ___ hungry, and I can eat half a pig.
• In the first sentence, “happy” seems to be a likely candidate. The words “not” and “very” seem plausible in the second sentence, but “not” seems incompatible with the third sentence.
• A Recurrent Neural Network that can only process the inputs from left to
right may not accurately predict the right answer for sentences discussed
above.
• To perform well on natural language tasks, the model must be able to
process the sequence in both directions.
Bidirectional RNNs
• This means that the network has two separate RNNs:
1.One that processes the input sequence from left to right
2.Another one that processes the input sequence from right to left.
• These two RNNs are typically called forward and backward RNNs,
respectively.
Bidirectional RNNs
• In the case of a bidirectional RNN, two separate Backpropagation
passes are involved: one for the forward RNN and one for the
backward RNN.
• During the forward pass, the forward RNN processes the input
sequence in the usual way and makes predictions for the output
sequence.
• These predictions are then compared to the target output sequence,
and the error is backpropagated through the network to update the
weights of the forward RNN.
Bidirectional RNNs
• The backward RNN processes the input sequence in reverse order during
the backward pass and predicts the output sequence. These predictions
are then compared to the target output sequence in reverse order, and the
error is backpropagated through the network to update the weights of the
backward RNN.
• Once both passes are complete, the weights of the forward and backward
RNNs are updated based on the errors computed during the forward and
backward passes, respectively. This process is repeated for multiple
iterations until the model converges and the predictions of the
bidirectional RNN are accurate.
• This allows the bidirectional RNN to consider information from past and
future time steps when making predictions, which can significantly improve
the model's accuracy.
Bidirectional RNNs
• There are several ways in which the outputs of the forward and backward RNNs can be
merged, depending on the specific needs of the model and the task it is being used for.
Some common merge modes include:
1. Concatenation: In this mode, the outputs of the forward and backward RNNs are concatenated
together, resulting in a single output tensor that is twice as long as the original input.
2. Sum: In this mode, the outputs of the forward and backward RNNs are added together element-
wise, resulting in a single output tensor that has the same shape as the original input.
3. Average: In this mode, the outputs of the forward and backward RNNs are averaged element-
wise, resulting in a single output tensor that has the same shape as the original input.
4. Maximum: In this mode, the maximum value of the forward and backward outputs is taken at
each time step, resulting in a single output tensor with the same shape as the original input.
• Which merge mode to use will depend on the specific needs of the model and the task it
is being used for.
• Concatenation is generally a good default choice and works well in many cases, but other
merge modes may be more appropriate for certain tasks.
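A sketch of the four merge modes, assuming the forward and backward outputs have already been computed as NumPy arrays of shape (time_steps, hidden_size):

import numpy as np

T, H = 20, 16                       # hypothetical sequence length and hidden size
fwd = np.random.randn(T, H)         # output of the forward RNN
bwd = np.random.randn(T, H)         # output of the backward RNN (re-aligned in time)

concat  = np.concatenate([fwd, bwd], axis=-1)   # shape (T, 2H)
summed  = fwd + bwd                             # shape (T, H)
average = (fwd + bwd) / 2.0                     # shape (T, H)
maximum = np.maximum(fwd, bwd)                  # shape (T, H)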
Bidirectional RNNs
• Given a dictionary containing all potential words, our neural network
will take the sequence of words as seed input : 1: “the”, 2: “man”, 3:
“is”, …
• Its output will be a matrix providing the probability for each word
from the dictionary to be the next one of the given sequence.
• Based on the training data, it could maybe guess the next word will
be "the"…
• Then, how will we generate the whole text? Simply by iterating the process: once the next word is drawn from the dictionary, we add it at the end of the sequence, then guess a new word for this new sequence, as sketched below.
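A sketch of this generation loop, assuming a hypothetical next_word_probs(sequence) function that returns one probability per dictionary word (greedy selection is used here for simplicity):

import numpy as np

def generate(seed, next_word_probs, dictionary, max_len=20):
    # Repeatedly pick the most probable next word and append it to the sequence.
    sequence = list(seed)
    for _ in range(max_len):
        probs = next_word_probs(sequence)            # one probability per dictionary word
        next_word = dictionary[int(np.argmax(probs))]
        sequence.append(next_word)
    return sequence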
Bidirectional RNNs
• Training data
• Riya thinks the same way as Puja.
• Puja has been working in the same job for years.
• Riya has been working for the same company since 2009.
• You still get the same money.
• It is necessary to focus on two things at the same time.
• Testing data
• Rahul is still _______________ [at the same company].
Seq2seq Problem
• In Natural Language Processing (NLP), it is important to detect the
relationship between two sequences or to generate a sequence of
tokens given another observed sequence.
• We call this type of problem, where we model sequence pairs, a sequence-to-sequence (seq2seq) mapping problem.
Beam Search
•Beam search is a search algorithm, frequently used in machine translation tasks,
to generate the most likely sequence of words given a particular input or context. It
is an efficient algorithm that explores multiple possibilities and retains the most
likely ones, based on a pre-defined parameter called the beam size.
•Beam search is widely used in sequence-to-sequence models, including recurrent
neural networks and transformers, to improve the quality of the output by
exploring different possibilities while being computationally efficient.
•The core idea of beam search is that on each step of the decoder, we keep track of the k most probable partial candidates/hypotheses (such as partial translations in a machine translation task), where k is the beam size (usually 5–10 in practice). Put simply, we book-keep the top k predictions, generate the next word for each of these predictions, and keep the expanded hypotheses with the highest cumulative probability (the least error).
•The image (next slide) shows how the algorithm works with a beam size of 2.
Beam Search
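A minimal sketch of beam search over log-probabilities, assuming a hypothetical step(prefix) function that returns (token, log_prob) pairs for the possible next tokens:

import heapq

def beam_search(step, start, beam_size=2, max_len=10):
    # Each hypothesis is (cumulative log-probability, token sequence)
    beams = [(0.0, [start])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for token, logp in step(seq):             # expand every partial hypothesis
                candidates.append((score + logp, seq + [token]))
        # Keep only the k most probable partial hypotheses
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])             # best hypothesis found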
RNN – Hidden States
• In traditional neural networks, all the inputs and outputs are independent of each
other.
• However, in cases where we need to predict the next word of a sentence, the previous words are required, and hence there is a need to remember them.
• Thus RNNs came into existence, solving this issue with the help of a hidden layer.
• The main and most important feature of RNN is its Hidden state, which
remembers some information about a sequence.
• The state is also referred to as the memory state, since it remembers the previous inputs to the network. The RNN uses the same parameters at every time step, because it performs the same task on each input and hidden state to produce the output.
• This keeps the number of parameters small, unlike other neural networks.
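A sketch of the hidden (memory) state update at one time step, assuming a simple Elman-style RNN cell in NumPy; note that the same W_xh, W_hh, and b are reused at every step:

import numpy as np

input_size, hidden_size = 8, 16                      # hypothetical sizes
W_xh = np.random.randn(hidden_size, input_size) * 0.01
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
b    = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The new hidden state mixes the current input with the previous memory
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

h = np.zeros(hidden_size)                            # initial hidden state
for x_t in np.random.randn(5, input_size):           # a toy sequence of 5 inputs
    h = rnn_step(x_t, h)                             # same parameters at every step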
Perplexity
• Perplexity is the common evaluation metric for a language model.
Generally, it measures how well the proposed probability model
represents the target data.
• Lower perplexity is better.
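For a test sequence w_1, …, w_N, perplexity is the inverse probability of the sequence normalized by its length, equivalently the exponential of the average negative log-likelihood:

PP(W) = P(w_1, …, w_N)^(-1/N) = exp( -(1/N) Σ_{i=1}^{N} log P(w_i | w_1, …, w_{i-1}) )

A model that assigns higher probability to the test data therefore has lower perplexity.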
Character-level Language Models
• Character-level language modeling is an approach to natural language
processing (NLP) and text generation that operates at the level of
individual characters, rather than words or tokens. Instead of
predicting the next word in a sentence, a character-level language
model predicts the next character based on the preceding characters.
• Here’s how character-level language modeling typically works:
• Data preparation: The training data consists of a large corpus of text. This
corpus can be preprocessed by splitting it into individual characters or
sequences of characters, depending on the desired context window.
Character-level Language Models
• Here’s how character-level language modeling typically works:
• Model architecture: The model architecture used for character-level language
modeling can vary, but recurrent neural networks (RNNs) are commonly employed.
RNNs, such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU), are
well-suited for capturing sequential dependencies in text data.
• Input representation: Each character is usually encoded using a one-hot encoding
scheme, where each character corresponds to a unique vector representation. These
one-hot encoded vectors serve as input to the model.
• Training: The model is trained to predict the next character given the previous
characters in the input sequence. During training, the model learns the patterns,
dependencies, and probabilities of character transitions in the training data.
• Text generation: Once the model is trained, it can be used to generate new text by
sampling from the learned probability distribution of characters. Starting with an
initial seed sequence, the model predicts the next character and appends it to the
sequence. This process can be repeated iteratively to generate longer sequences of
text.
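A minimal sketch of these steps, assuming PyTorch, a toy corpus, and one-hot character inputs fed to a GRU (the corpus, hidden size, and number of iterations are made up):

import torch
import torch.nn as nn
import torch.nn.functional as F

text = "hello world "                               # toy corpus
chars = sorted(set(text))
char2idx = {c: i for i, c in enumerate(chars)}
V = len(chars)

# (input, target) pairs: predict the next character from the previous ones
ids = torch.tensor([char2idx[c] for c in text])
x = F.one_hot(ids[:-1], V).float().unsqueeze(0)     # (1, T-1, V) one-hot inputs
y = ids[1:]                                         # next-character targets

rnn = nn.GRU(V, 32, batch_first=True)
head = nn.Linear(32, V)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

for _ in range(100):                                # training loop
    out, _ = rnn(x)                                 # (1, T-1, 32)
    loss = F.cross_entropy(head(out).squeeze(0), y) # compare predictions to targets
    opt.zero_grad(); loss.backward(); opt.step()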
Character-level Language Models
• Suppose we only had a vocabulary of four possible letters “helo”, and wanted to train an RNN on the training sequence “hello”. This training sequence is in fact a source of 4 separate training examples: 1. the probability of “e” should be likely given the context of “h”; 2. “l” should be likely in the context of “he”; 3. “l” should also be likely given the context of “hel”; and finally 4. “o” should be likely given the context of “hell”.
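A sketch of that setup: each character in the vocabulary “helo” becomes a one-hot vector, and the sequence “hello” yields the four (context, next-character) training pairs listed above.

import numpy as np

vocab = ['h', 'e', 'l', 'o']
one_hot = {c: np.eye(len(vocab))[i] for i, c in enumerate(vocab)}
# 'h' -> [1,0,0,0], 'e' -> [0,1,0,0], 'l' -> [0,0,1,0], 'o' -> [0,0,0,1]

word = "hello"
pairs = [(word[:i], word[i]) for i in range(1, len(word))]
# [('h','e'), ('he','l'), ('hel','l'), ('hell','o')]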