
DS303: Introduction to Machine Learning

Sequence Modeling Techniques: RNNs and LSTMs

Manjesh K. Hanawal

April 16, 2025


Recurrent Neural Network (RNN)
Motivation

▶ Traditional neural networks (e.g., feedforward networks, CNNs) struggle with sequential data due to their lack of temporal modeling.
▶ Inputs are treated independently, ignoring order and context, so they are not suitable for time-dependent tasks.
▶ Real-world problems often involve sequences with dependencies across time (e.g., language, speech, stock prices).
▶ Need for a model that can:
  ▶ Handle variable-length sequences
  ▶ Capture temporal patterns and contextual dependencies
▶ RNNs are designed specifically for sequential data, enabling memory of past inputs and dynamic behavior over time.
Examples

▶ Videos as image sequences: recognizing activities like Surya Namaskar requires capturing temporal context.
▶ Part-of-speech tagging: tag prediction depends on the current and previous words.
Introduction: RNN

▶ Recurrent Neural Networks (RNNs) are a class of neural networks designed to process sequential data.
▶ RNNs maintain a hidden state that captures information from previous time steps, allowing them to learn temporal dependencies.
▶ They are widely used in tasks such as language modeling, text generation, speech recognition, and time-series forecasting.
▶ The recurrent structure enables the network to use its memory to influence current outputs based on past inputs.

Figure: Basic RNN structure for sequence learning
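
To make the idea of a hidden state carried across time concrete, here is a minimal illustrative sketch using PyTorch's nn.RNN layer; the framework choice, layer sizes, and tensor shapes are assumptions made for this example and are not part of the slides.

```python
# Minimal sketch of a recurrent layer carrying a hidden state across time.
# All sizes and shapes below are illustrative assumptions.
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)  # tanh recurrence

x = torch.randn(4, 10, 8)    # 4 sequences, 10 time steps, 8 features each
h0 = torch.zeros(1, 4, 16)   # initial hidden state: (num_layers, batch, hidden)

outputs, h_n = rnn(x, h0)    # outputs: hidden state at every time step
print(outputs.shape)         # torch.Size([4, 10, 16])
print(h_n.shape)             # torch.Size([1, 4, 16]) -- final hidden state
```

The final hidden state h_n summarizes the whole input sequence, which is the "memory of past inputs" referred to above.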


Important Design Patterns of RNN: (1/3)

▶ Recurrent networks that produce an output at each time step and have recurrent connections between hidden units.

Figure: Type 1 RNN: An input sequence is mapped to an output sequence, with recurrent connections across time. The network uses shared parameters (U for input-to-hidden, W for hidden-to-hidden, V for hidden-to-output) and computes a loss at each time step. The right side shows the unrolled computational graph.
Important Design Patterns of RNN: (2/3)

▶ Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step.

Figure: Type 2 RNN: Recurrence exists only from the output to the hidden layer. Each time step is trained independently, enabling parallelization. While less expressive than fully recurrent RNNs, this structure is simpler and easier to train.
Important Design Patterns of RNN: (3/3)

▶ Recurrent networks with recurrent connections between hidden units that read an entire sequence and then produce a single output.

Figure: Type 3 RNN: A sequence-to-one model that processes the entire input sequence and produces a single output at the end. Useful for tasks like classification, where a fixed-size representation summarizes the sequence.
Backpropagation Through Time (BPTT) (1/2)

▶ Recurrent neural networks compute hidden states and outputs for sequences using shared weights across time.
▶ At each time step t = 1 to τ, we apply (see the sketch after this slide):

  a^(t) = W h^(t−1) + U x^(t) + b        (pre-activation)
  h^(t) = tanh(a^(t))                    (hidden state)
  o^(t) = V h^(t) + c                    (output logits)
  ŷ^(t) = softmax(o^(t))                 (predicted output)

▶ Parameters: U (input-to-hidden), W (hidden-to-hidden), V (hidden-to-output), with biases b, c.
▶ Total loss over the sequence:

  L = − Σ_{t=1}^{τ} log p_model( y^(t) | x^(1), ..., x^(t) )

▶ BPTT unfolds the RNN through time and applies backpropagation to compute gradients over all time steps.
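
A minimal NumPy sketch of this forward recurrence and loss. The random initialization, tiny dimensions, and the helper name rnn_forward are illustrative assumptions, not from the slides.

```python
# Forward pass of a vanilla RNN, following the equations above.
# All shapes and initial values below are illustrative assumptions.
import numpy as np

def rnn_forward(x_seq, y_seq, U, W, V, b, c, h0):
    """x_seq: list of input vectors x^(t); y_seq: list of target class indices y^(t)."""
    h_prev, hs, probs, loss = h0, [h0], [], 0.0
    for x_t, y_t in zip(x_seq, y_seq):
        a_t = W @ h_prev + U @ x_t + b      # pre-activation
        h_t = np.tanh(a_t)                  # hidden state
        o_t = V @ h_t + c                   # output logits
        e = np.exp(o_t - o_t.max())
        yhat_t = e / e.sum()                # softmax
        loss += -np.log(yhat_t[y_t])        # negative log-likelihood term
        hs.append(h_t)
        probs.append(yhat_t)
        h_prev = h_t
    return hs, probs, loss

# Tiny usage example with random data.
rng = np.random.default_rng(0)
n_in, n_h, n_out, T = 5, 8, 3, 4
U = rng.normal(0.0, 0.1, (n_h, n_in))
W = rng.normal(0.0, 0.1, (n_h, n_h))
V = rng.normal(0.0, 0.1, (n_out, n_h))
b, c, h0 = np.zeros(n_h), np.zeros(n_out), np.zeros(n_h)
x_seq = [rng.normal(size=n_in) for _ in range(T)]
y_seq = [int(rng.integers(n_out)) for _ in range(T)]
hs, probs, loss = rnn_forward(x_seq, y_seq, U, W, V, b, c, h0)
print(loss)
```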
Backpropagation Through Time (BPTT) (2/2)

▶ BPTT is used to compute gradients for RNNs by unrolling the network over time.
▶ Gradients are computed by backpropagating from the final time step to the beginning.
▶ Assuming a softmax output and negative log-likelihood loss:

  ∂L / ∂o_i^(t) = ŷ_i^(t) − 1_{i, y^(t)}

▶ Gradient at the hidden layer at the final step:

  ∇_{h^(τ)} L = V^⊤ ∇_{o^(τ)} L

▶ Recursive update for t < τ:

  ∇_{h^(t)} L = W^⊤ ( ∇_{h^(t+1)} L ⊙ (1 − (h^(t+1))^2) ) + V^⊤ ∇_{o^(t)} L

▶ Parameter gradients (e.g., for W and U) are accumulated over time steps:

  ∇_W L = Σ_t diag(1 − (h^(t))^2) (∇_{h^(t)} L) h^(t−1)⊤
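
Continuing the forward-pass sketch from the previous slide, the following NumPy function implements this gradient recursion; the variable names and the list-of-arrays layout are assumptions carried over from that sketch.

```python
# Backpropagation through time for the vanilla RNN sketched earlier.
# hs = [h^(0), ..., h^(tau)] and probs = [yhat^(1), ..., yhat^(tau)] come from rnn_forward.
import numpy as np

def bptt(x_seq, y_seq, hs, probs, U, W, V):
    T = len(x_seq)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros(W.shape[0]), np.zeros(V.shape[0])
    dh_next = np.zeros(W.shape[0])          # gradient arriving from step t+1
    for t in reversed(range(T)):
        do = probs[t].copy()
        do[y_seq[t]] -= 1.0                 # dL/do^(t) = yhat^(t) - one_hot(y^(t))
        h_t, h_prev = hs[t + 1], hs[t]
        dV += np.outer(do, h_t)
        dc += do
        dh = V.T @ do + dh_next             # dL/dh^(t): output path + recurrent path
        da = dh * (1.0 - h_t ** 2)          # back through tanh: diag(1 - (h^(t))^2)
        dW += np.outer(da, h_prev)          # accumulate over time steps
        dU += np.outer(da, x_seq[t])
        db += da
        dh_next = W.T @ da                  # W^T term passed back to step t-1
    return dU, dW, dV, db, dc
```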
Challenge of Long-Term Dependencies (1/2)

▶ Main Problem: Gradients in RNNs tend to:
  ▶ Vanish most of the time
  ▶ Explode rarely, but severely
▶ Why It Happens:
  ▶ Repeated application of functions at each time step causes gradient decay/explosion
  ▶ Long-term dependencies involve products of many Jacobians → exponentially smaller gradients
▶ Simplified Example (see the numeric sketch after this list):
  ▶ h^(t) = W^⊤ h^(t−1), which recursively gives h^(t) = (W^⊤)^t h^(0)
  ▶ If W = Q Λ Q^⊤ with orthogonal Q, then h^(t) = Q Λ^t Q^⊤ h^(0)
  ▶ Eigenvalues:
    ▶ |λ| < 1 → vanishing gradients
    ▶ |λ| > 1 → exploding gradients
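
A quick numeric check of the eigenvalue argument: the eigenvalues 0.9 and 1.1 and the horizon lengths below are arbitrary illustrative choices, not from the slides.

```python
# Repeatedly applying W scales each eigen-component by lambda^t.
import numpy as np

for lam in (0.9, 1.1):                           # |lambda| < 1 vs |lambda| > 1
    W = np.diag([lam, lam])                      # diagonal W with eigenvalue lam
    h0 = np.ones(2)
    for t in (10, 50, 100):
        h_t = np.linalg.matrix_power(W.T, t) @ h0    # h^(t) = (W^T)^t h^(0)
        print(f"lambda={lam}, t={t}, ||h_t|| = {np.linalg.norm(h_t):.3e}")
# lambda=0.9: the norm shrinks toward 0 (vanishing); lambda=1.1: it blows up (exploding)
```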
Challenge of Long-Term Dependencies (2/2)

▶ Comparison to Non-Recurrent Networks:
  ▶ With separate weights at each step, careful variance control can stabilize gradients
  ▶ Sussillo (2014) shows deep feedforward networks avoid vanishing gradients via scaling
▶ Empirical Findings:
  ▶ Bengio et al. (1994): SGD fails to train standard RNNs on sequences of just 10–20 steps
▶ Why It's Hard to Avoid:
  ▶ Even stable memory representations cause gradients to vanish
  ▶ Long-term dependencies yield smaller gradients than short-term ones
▶ Conclusion:
  ▶ Learning long-term dependencies is fundamentally hard for RNNs
  ▶ It remains one of the core challenges in deep learning
Long Short-Term Memory (LSTM)
LSTM: Long Short-Term Memory

Key Idea: LSTM introduces self-loops to allow gradients to flow for long durations.

Why LSTM?
▶ Standard RNNs struggle with long-term dependencies due to vanishing/exploding gradients.
▶ LSTM uses gated mechanisms to control information flow.

Self-loop Weighting:
▶ The self-loop weight is dynamically controlled by another hidden unit.
▶ This allows flexible memory retention based on context.

LSTM has proven successful in applications such as:
▶ Handwriting and speech recognition
▶ Machine translation and image captioning
LSTM Cell Structure

Components:
▶ Forget gate f_t
▶ Input gate i_t
▶ Output gate o_t
▶ Cell state c_t (long-term memory)
▶ Hidden state h_t (short-term memory)

Figure: LSTM cell block diagram

Gating Equations:

  f_t = σ(W_f [h_{t−1}, x_t] + b_f)
  i_t = σ(W_i [h_{t−1}, x_t] + b_i)
LSTM Cell Update and Output

State and Output Updates:

  c̃_t = tanh(W_c [h_{t−1}, x_t] + b_c)
  c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
  o_t = σ(W_o [h_{t−1}, x_t] + b_o)
  h_t = o_t ⊙ tanh(c_t)
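
A minimal NumPy sketch of a single cell step implementing these gate and state equations; the weight shapes (each W acting on the concatenation [h_{t−1}, x_t]) and the sigmoid helper are assumptions made for this example.

```python
# One LSTM cell step, following the gate and state equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)               # forget gate
    i_t = sigmoid(Wi @ z + bi)               # input gate
    c_tilde = np.tanh(Wc @ z + bc)           # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state (long-term memory)
    o_t = sigmoid(Wo @ z + bo)               # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state (short-term memory)
    return h_t, c_t
```

The additive update of c_t is the self-loop mentioned earlier: it lets information, and hence gradients, pass across many time steps with far less attenuation than a purely multiplicative tanh recurrence.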
Key Benefits:
▶ Learns long-term dependencies
▶ Gated structure mitigates vanishing/exploding gradients
▶ Strong empirical performance in many NLP and vision tasks
