CH4_AA1.1-Sequence Models (1)
Learning Outcomes
• At the end of this chapter, you will:
• Understand how to build and train Recurrent Neural Networks (RNNs), and commonly used variants such as GRUs and LSTMs.
• Be able to apply sequence models to natural language problems.
• Be able to apply sequence models to audio applications, including speech recognition.
What is sequential data?
• Sequences of numbers in time:
• Stock prices
• Earthquake sensor data
• Sequences of words: text
• Sound: speech and other audio
• Video: a sequence of image frames
What tasks can be done with sequential data?
• Time series: prediction problems such as stock market forecasting.
• Image captioning: caption an image by analyzing the action taking place in it.
• Machine translation: translate a sentence from one language into another.
• Speech recognition: transcribe spoken audio into text.
Different types of Sequence Modeling Tasks
• Examples: image classification, music generation, sentiment classification, machine translation, speech recognition, image captioning, video activity recognition.
CNN limitations
• Fixed input size: CNNs require a fixed input size. This can be a problem
when dealing with sequences of varying length, such as in natural
language processing or speech recognition.
• Lack of memory: CNNs have no memory of previous inputs, which
means that they can't easily capture temporal dependencies in
sequential data.
• Order invariance: CNNs are order-invariant, which means that they treat
all inputs equally regardless of their position in the sequence. This makes
it difficult for the network to capture the order-dependent patterns that
are often present in sequential data.
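These limitations can be seen in a short sketch. The example below is illustrative and assumes PyTorch; the layer sizes and tensor shapes are arbitrary choices, not from the slides. A fixed-size dense layer needs exactly the input size it was built for, while a recurrent layer accepts sequences of any length.

```python
import torch
import torch.nn as nn

# Illustrative contrast (assumed example, not from the slides):
# a fixed-size dense layer vs. a recurrent layer.
fixed = nn.Linear(10, 2)                                     # expects exactly 10 input features
rnn = nn.RNN(input_size=3, hidden_size=4, batch_first=True)

print(fixed(torch.randn(1, 10)).shape)                       # works: exactly 10 features
# fixed(torch.randn(1, 5))                                   # would raise a shape-mismatch error

short_seq = torch.randn(1, 5, 3)                             # 5 time steps
long_seq = torch.randn(1, 50, 3)                             # 50 time steps
out_short, _ = rnn(short_seq)                                # the same RNN handles both lengths
out_long, _ = rnn(long_seq)
print(out_short.shape, out_long.shape)                       # (1, 5, 4) and (1, 50, 4)
```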
Recurrent Neural Network (RNN)
• RNNs are designed to handle sequences of varying length
• They have memory that allows them to capture temporal
dependencies in the data.
• They also have the ability to process inputs in a sequential order,
which allows them to capture order-dependent patterns.
• RNNs have become the standard approach for processing sequential
data such as speech, text, and time series data.
Graphical Representation of RNN
• $x_i$: input at time step $t_i$
• $h_i$: hidden state at $t_i$
• $y_i$: output at $t_i$
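The diagram above corresponds to a simple recurrence, sketched below in NumPy; the weight names ($W_{xh}$, $W_{hh}$, $W_{hy}$) and the dimensions are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Vanilla RNN forward pass over a short toy sequence (illustrative assumptions:
# the weight names and sizes below are not taken from the slides).
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, T = 3, 5, 2, 4

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))    # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))   # hidden -> hidden (the recurrence)
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))   # hidden -> output

xs = rng.normal(size=(T, input_dim))    # toy inputs x_1 ... x_T
h = np.zeros(hidden_dim)                # initial hidden state h_0

for t in range(T):
    h = np.tanh(W_xh @ xs[t] + W_hh @ h)    # h_t depends on x_t and on h_{t-1}
    y = W_hy @ h                            # y_t is read out from the hidden state
    print(f"t={t + 1}: y_t = {y}")
```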
Back Propagation Through Time (BPTT)
• Forward pass of a vanilla RNN: $h_t = f(W_{hh} h_{t-1} + W_{xh} x_t)$ and $\hat{y}_t = g(W_{hy} h_t)$, where $f$ is typically a sigmoid or tanh activation.
• The total loss is the sum of the per-step losses: $L = \sum_{t=1}^{T} L_t(\hat{y}_t, y_t)$.
• BPTT unrolls the network over time and applies the chain rule at every time step. For the recurrent weights:
$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial L_t}{\partial \hat{y}_t}\,\frac{\partial \hat{y}_t}{\partial h_t}\,\frac{\partial h_t}{\partial h_k}\,\frac{\partial h_k}{\partial W_{hh}}$$
• Assume that the factor linking distant hidden states is a product of one-step terms:
$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}$$
• Each one-step term is the Jacobian
$$\frac{\partial h_i}{\partial h_{i-1}} = \mathrm{diag}\!\big(f'(W_{hh} h_{i-1} + W_{xh} x_i)\big)\, W_{hh},$$
so backpropagating over many steps multiplies repeatedly by $W_{hh}$ and by the activation derivative $f'$.
RNN limitations
• Two common problems occur during backpropagation through time: vanishing and exploding gradients.
• Each factor $\frac{\partial h_i}{\partial h_{i-1}} = \mathrm{diag}(f'(\cdot))\, W_{hh}$ is scaled by the derivative of the activation function:
• If $f$ is a sigmoid function, $0 < f'(\cdot) \le \tfrac{1}{4}$.
• If $f$ is a tanh function, $0 < f'(\cdot) \le 1$.
• Assume that $\bigl\lVert \frac{\partial h_i}{\partial h_{i-1}} \bigr\rVert \le \gamma$ for every step; then $\bigl\lVert \frac{\partial h_t}{\partial h_k} \bigr\rVert \le \gamma^{\,t-k}$.
• Over long sequences the gradient therefore shrinks exponentially (vanishes) when $\gamma < 1$ and grows exponentially (explodes) when $\gamma > 1$ (see the numerical sketch below).
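A quick numerical sketch of this effect (illustrative assumptions only: random matrices stand in for the step Jacobians, and the sizes and scales are arbitrary): when the per-step factor has norm below 1 the product collapses toward 0, and when it is above 1 the product blows up.

```python
import numpy as np

# Illustrative demo (not from the slides): products of many step "Jacobians"
# either vanish or explode depending on their typical norm.
rng = np.random.default_rng(0)

def product_norm(scale, steps=50, dim=8):
    """Norm of a product of `steps` random matrices with a given per-step scale."""
    product = np.eye(dim)
    for _ in range(steps):
        product = product @ (scale * rng.normal(size=(dim, dim)) / np.sqrt(dim))
    return np.linalg.norm(product)

print("per-step norm < 1:", product_norm(scale=0.5))   # collapses toward 0 (vanishing)
print("per-step norm > 1:", product_norm(scale=1.5))   # blows up (exploding)
```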
Exploding Gradients with vanilla RNNs
• To avoid exploding gradients:
• Gradient clipping: at each time step, check whether the gradient norm exceeds a threshold and, if it does, normalize (rescale) it:
$$g \leftarrow \frac{\text{threshold}}{\lVert g \rVert}\, g \qquad \text{if } \lVert g \rVert > \text{threshold}$$
(as sketched in code below).
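The clipping rule can be written in a few lines; the sketch below uses NumPy with placeholder names (clip_gradient, threshold). In practice, deep-learning libraries ship an equivalent utility (e.g. torch.nn.utils.clip_grad_norm_ in PyTorch).

```python
import numpy as np

def clip_gradient(grad, threshold):
    """Rescale the gradient so its L2 norm never exceeds `threshold` (illustrative sketch)."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)   # keep the direction, shrink the magnitude
    return grad

g = np.array([3.0, 4.0])                   # norm 5
print(clip_gradient(g, threshold=1.0))     # -> [0.6 0.8], norm clipped to 1
```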
Vanishing Gradients with vanilla RNNs
• To tackle the vanishing gradient problem, these are the possible solutions:
• Use ReLU instead of the tanh or sigmoid activation function.
• Proper initialization of the weights can reduce the effect of
vanishing gradients.
• Truncated BPTT
• Using gated cells such as LSTM or GRUs
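As a quick illustration of the first point in the list above (illustrative code, not from the slides): the tanh derivative saturates toward 0 for large inputs, while the ReLU derivative stays at 1 for every positive input, so repeated multiplication shrinks the gradient far less.

```python
import numpy as np

x = np.linspace(-5, 5, 11)
tanh_grad = 1 - np.tanh(x) ** 2            # always <= 1, close to 0 when |x| is large
relu_grad = (x > 0).astype(float)          # exactly 1 for every positive input
print(np.round(tanh_grad, 3))
print(relu_grad)
```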
Truncated BPTT
• A moving window over the training sequence: forward propagation and backward propagation are run over fixed-length chunks rather than over the whole sequence (see the sketch after this list).
• Advantages:
• Helps to avoid exploding/vanishing gradients.
• Much faster and less complex than full BPTT.
• Disadvantage:
• Dependencies longer than the chunk length are not learned during training.
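A minimal sketch of truncated BPTT, assuming PyTorch and a toy sine-wave next-step prediction task; chunk_len, the layer sizes, and the data are illustrative choices, not from the slides. The key step is detaching the hidden state at each chunk boundary so gradients never flow further back than one chunk.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=1, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

# Toy data: predict the next value of a sine wave.
seq = torch.sin(torch.linspace(0, 20, 400)).reshape(1, -1, 1)   # (batch, time, features)

chunk_len = 20
h = None                                                 # hidden state carried across chunks
for start in range(0, seq.size(1) - chunk_len, chunk_len):
    x = seq[:, start:start + chunk_len]                  # inputs for this chunk
    y = seq[:, start + 1:start + chunk_len + 1]          # next-step targets
    out, h = rnn(x, h)                                   # forward pass over the chunk only
    loss = ((head(out) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()                                      # backprop stops at the chunk boundary
    opt.step()
    h = h.detach()                                       # cut the graph: no dependencies beyond the chunk
print(f"final chunk loss: {loss.item():.4f}")
```

Dependencies longer than chunk_len steps cannot be learned, which is exactly the trade-off noted above.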
Long Short-Term Memory (LSTM)
LSTM : Input Gate
• The input gate decides what new information will be stored in the long-term
memory.
• It only works with the information from the current input and the short-term
memory from the previous time step.
• $i_1$: a filter that selects which information can pass through and which information is discarded.
• The sigmoid function transforms the values to be between 0 and 1:
• 0 indicates that the corresponding information is unimportant,
• 1 indicates that the information will be used.
Forget Gate
• The forget gate decides which information from the long-term memory should
be kept or discarded.
• It multiplies the incoming long-term memory by a forget vector generated from the current input and the incoming short-term memory.
• The forget vector is produced using a sigmoid function.
• The outputs from the Input gate and the Forget gate undergo a pointwise addition to give a new version of the long-term memory.
Output Gate
• The output gate will take the current input, the previous short-term memory,
and the newly computed long-term memory to produce the new short-term
memory.
• The previous short-term memory and the current input are passed into a sigmoid function to create the third and final filter.
• The new long-term memory is passed through a tanh activation function.
• Multiplying these two results element-wise yields the new short-term memory.
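Putting the three gates together, a standard formulation of the LSTM cell update is sketched below; the weight names ($W_i, W_f, W_o, W_c$) are generic placeholders rather than notation taken from the slides, $\sigma$ is the sigmoid, and $\odot$ denotes element-wise multiplication.

$$
\begin{aligned}
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) && \text{(candidate long-term memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(pointwise update of the long-term memory)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(new short-term memory)}
\end{aligned}
$$

The pointwise addition described on the Forget Gate slide corresponds to the $c_t$ update above.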
Gated Recurrent Units (GRU)
• A variant of the RNN architecture, introduced in 2014 by Cho et al.
• Uses gating mechanisms to control and manage the flow of
information between cells in the neural network.
Reset Gate
• The current input $x_t$ and the previous hidden state $h_{t-1}$ are multiplied by their respective weights and summed before the result is passed through a sigmoid function, producing the reset vector.
• $h_{t-1}$ is then multiplied by a trainable weight and undergoes an element-wise multiplication (Hadamard product) with the reset vector.
• This operation decides which information from the previous time steps is kept and combined with the new inputs.
Update Gate
• Both the Update and Reset gate vectors are created using the same formula, but the weights multiplied with the input and hidden state are unique to each gate.
• The purpose of the Update gate here is to help the model determine how much
of the past information stored in the previous hidden state needs to be retained
for the future.
Combining the outputs
• Finally, the previous hidden state and the candidate hidden state (built with the help of the Reset gate) are blended element-wise, with the mixing weights given by the Update gate vector.
• The purpose of this operation is for the Update gate to determine what portion of the new information should be stored in the hidden state.
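For reference, one common formulation of the full GRU cell consistent with the description above is sketched below; the weight names ($W_r, U_r, W_z, U_z, W_h, U_h$) are generic placeholders, and some presentations swap the roles of $z_t$ and $1 - z_t$ in the final blend.

$$
\begin{aligned}
r_t &= \sigma(W_r x_t + U_r h_{t-1}) && \text{(reset gate)} \\
z_t &= \sigma(W_z x_t + U_z h_{t-1}) && \text{(update gate)} \\
\tilde{h}_t &= \tanh\!\big(W_h x_t + r_t \odot (U_h h_{t-1})\big) && \text{(candidate hidden state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(combined output)}
\end{aligned}
$$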