DL4CV: Sequences & Attention

Images (static) vs. Sequences (time-dependent)
Deep Learning for Sequences
[Diagram: five input/output topologies: one-to-one, one-to-many, many-to-one, many-to-many (seq2seq), many-to-many (one output per input step)]
Recurrent Neural Networks
[Diagram: from one-to-one to many-to-many: the same recurrent cell, with shared weights, is applied at every timestep, passing a hidden state forward]
Example: Language Modeling
Task: given characters $c_0, c_1, \dots, c_{t-1}$, predict $c_t$.
Training sequence: “hello”
Vocabulary: [‘h’, ‘e’, ‘l’, ‘o’]
Input sequence: ‘h’ ‘e’ ‘l’ ‘l’
Embedding layer: each input character is one-hot encoded over the vocabulary (e.g. ‘h’ → [1 0 0 0], ‘e’ → [0 1 0 0], ‘l’ → [0 0 1 0]) and enters the hidden layer through $W_{xh}$.
Slide credit: Justin Johnson (EECS-498-007, UMich)
Example: Language Modeling
Task: given characters $c_0, c_1, \dots, c_{t-1}$, predict $c_t$.
Target chars: ‘e’ ‘l’ ‘l’ ‘o’
Hidden layer: $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$
Output layer: $y_t = W_{hy} h_t$
[Diagram: at each timestep the one-hot input passes through $W_{xh}$ into a 3-unit hidden state (e.g. [0.3, -0.1, 0.9] at the first step), the previous hidden state feeds back through $W_{hh}$, and $W_{hy}$ maps the hidden state to 4 output scores, one per vocabulary character (e.g. [1.0, 2.2, -3.0, 4.1] at the first step)]
Slide credit: Justin Johnson (EECS-498-007, UMich)
Example: Language Modeling
The network is unrolled over the training sequence, making one prediction per step:
Given “h”, predict “e”
Given “he”, predict “l”
Given “hel”, predict “l”
Given “hell”, predict “o”
Example: Language Modeling
Softmax: at each step, the output scores are passed through a softmax to give a probability distribution over the vocabulary, e.g. scores [1.0, 2.2, -3.0, 4.1] become probabilities [.04, .12, .01, .83]; the next character (target ‘e’ at the first step) is predicted from this distribution, and training maximizes the probability assigned to the target.
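A minimal NumPy sketch of this forward pass (the weights are random stand-ins rather than trained values; hidden size 3 and vocabulary size 4 match the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ['h', 'e', 'l', 'o']
char_to_idx = {c: i for i, c in enumerate(vocab)}
hidden_size, vocab_size = 3, len(vocab)

# random stand-ins for the learned weight matrices
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(hidden_size)
for c, target in zip('hell', 'ello'):
    x = np.zeros(vocab_size)
    x[char_to_idx[c]] = 1.0                  # one-hot encoding of the input char
    h = np.tanh(W_hh @ h + W_xh @ x)         # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    p = softmax(W_hy @ h)                    # y_t = W_hy h_t, then softmax
    print(f"after {c!r}: P({target!r}) = {p[char_to_idx[target]]:.2f}")
```

Training minimizes the cross-entropy loss, i.e. the negative log-probability that each step assigns to its target character.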
Searching for Interpretable Cells
[Figures: the activation of a single hidden cell visualized over text; some cells turn out to be interpretable, e.g. tracking whether the model is inside a quote, the position within a line, or code nesting depth]
Karpathy, Johnson, and Fei-Fei: Visualizing and Understanding Recurrent Networks, ICLR Workshop 2016. Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission.
Agenda
Images (static) vs. Sequences (time-dependent)
RNN: Gradient Flow
$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t) = \tanh\left(W \cdot \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix}\right)$
[Diagram: one RNN cell: $h_{t-1}$ and $x_t$ are multiplied by $W$ and passed through tanh to produce $h_t$]
RNN: Gradient Flow
[Diagram: the cell unrolled over many timesteps: $\dots \to h_{t-1} \to h_t \to h_{t+1} \to \dots$, each step multiplying by $W$ and applying tanh]
Backpropagating through $k$ timesteps multiplies the gradient by $W^T$ (and by a tanh Jacobian) $k$ times, so gradients tend to explode when the largest singular value of $W$ is above 1 and to vanish when it is below 1.
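A quick numerical sketch of the effect (the recurrent matrix here is random and purely illustrative): after 50 steps the gradient norm scales roughly like the largest singular value of $W$ raised to the 50th power.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 32))
grad = rng.normal(size=32)

for sigma_max in (0.5, 1.5):                 # largest singular value below / above 1
    Ws = sigma_max * W / np.linalg.svd(W, compute_uv=False)[0]
    g = grad.copy()
    for _ in range(50):                      # backprop through 50 unrolled timesteps
        g = Ws.T @ g                         # one factor of W^T per step
    print(f"sigma_max={sigma_max}: |grad| after 50 steps = {np.linalg.norm(g):.2e}")
```

The tanh Jacobians (entries at most 1) are ignored here; including them only makes the vanishing case worse.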
Long Short-Term Memory (LSTM)
[Diagram: the vanilla RNN cell for comparison: input x and hidden state h pass through $W$ and tanh, $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$]
Long Short-Term Memory (LSTM)
The LSTM keeps a cell state c (the “memory”) alongside the hidden state h. From $[h_{t-1}; x_t]$, one matrix $W$ produces four vectors:
i = sigmoid(...): input gate
f = sigmoid(...): forget gate
o = sigmoid(...): output gate
g = tanh(...): candidate update
The gates control the data flow:
$c_t = f_t \odot c_{t-1} + i_t \odot g_t$
$h_t = o_t \odot \tanh(c_t)$
[Diagram: x (input) and h (hidden) feed the four gates; f scales the previous cell state, i scales the candidate g, their sum is the new cell state c, and o gates what tanh(c) exposes as the new hidden state]
Long Short-Term Memory (LSTM)
A gradient path without matrix multiplication: the update $c_t = f_t \odot c_{t-1} + i_t \odot g_t$ touches the cell state only through elementwise operations, so the gradient flows from $c_t$ back to $c_{t-1}$ without passing through $W$ or a tanh nonlinearity. This uninterrupted path is what lets LSTMs preserve information, and gradients, over long time spans.
$h_t = o_t \odot \tanh(c_t)$
[Diagram: the horizontal highway $c_{t-1} \to \odot \to + \to c_t$ across the top of the cell]
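A single LSTM step in NumPy, following the equations above; stacking the four gate pre-activations into one matrix $W$ is a common implementation convention assumed here, not something the slide prescribes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, D + H): it maps [x_t; h_{t-1}]
    to the stacked pre-activations of the gates i, f, o and the update g."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0*H:1*H])              # input gate
    f = sigmoid(z[1*H:2*H])              # forget gate
    o = sigmoid(z[2*H:3*H])              # output gate
    g = np.tanh(z[3*H:4*H])              # candidate cell update
    c = f * c_prev + i * g               # c_t = f (*) c_{t-1} + i (*) g, elementwise only
    h = o * np.tanh(c)                   # h_t = o (*) tanh(c_t)
    return h, c

# usage with made-up sizes
rng = np.random.default_rng(0)
D, H = 8, 16
W, b = rng.normal(scale=0.1, size=(4 * H, D + H)), np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
```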
Agenda
Images (static) vs. Sequences (time-dependent)
Sequence to Sequence
Example: translating English to Hebrew:
We → אנו (ANU)
Learn → לומדים (LOMDIM)
Computer → ממוחשבת (MEMUCHSHEVET)
Vision → ראיה (RE’EYA)
Sequence to Sequence: RNN
[Diagram: an encoder RNN reads the input sequence; its final hidden state is passed as a single context vector to a decoder RNN, which generates the output sequence]
The “context” is a bottleneck: the entire input must be squeezed into one fixed-size vector. What if the sequence is very long?
Slide credit: Justin Johnson (EECS-498-007, UMich)
Sequence to Sequence: RNN & Attention
At every decoding step, the current decoder state acts as a query $Q$ that is compared against all encoder states (keys $K$); a softmax over the similarity scores yields attention weights, and the weighted sum of encoder states (values $V$) forms a fresh context vector for that step.
No alignments need to be provided: the attention weights are learned end-to-end from the translation loss alone.
[Diagram: $K$ and $Q$ produce scores, softmax normalizes them, the weights multiply ($\odot$) the values $V$, and the results are summed ($+$) into the context]
Slide credit: Justin Johnson (EECS-498-007, UMich)
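A minimal sketch of one such attention step (names are illustrative; a dot-product score is used for simplicity, whereas Bahdanau et al. compute scores with a small MLP):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(dec_h, enc_hs):
    """dec_h: (D,) current decoder state; enc_hs: (T, D) encoder states."""
    scores = enc_hs @ dec_h          # one alignment score per input position
    weights = softmax(scores)        # attention weights, sum to 1
    context = weights @ enc_hs       # weighted sum of encoder states, shape (D,)
    return context, weights

rng = np.random.default_rng(0)
context, w = attend(rng.normal(size=8), rng.normal(size=(5, 8)))
print(w.round(2), w.sum())           # the weights form a distribution over positions
```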
Sequence to Sequence: RNN & Attention
Input (English): The agreement on the European Economic Area was signed in August 1992.
Output (French): L’accord sur la zone économique européenne a été signé en août 1992.
[Figure: the matrix of learned attention weights forms a soft word alignment between the two sentences: largely diagonal, but reordered where French word order differs, as in “zone économique européenne” vs. “European Economic Area”]
Bahdanau et al., “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Slide credit: Justin Johnson (EECS-498-007, UMich)
Attention Layer
Inputs:
Query: $Q$ (shape: $N_q \times D_q$)
Input: $X$ (shape: $N_x \times D_x$)
Layer’s parameters:
$X \to K$: $W_k$ (shape: $D_x \times D_q$)
$X \to V$: $W_v$ (shape: $D_x \times D_v$)
Compute:
Keys: $K = X W_k$ (shape: $N_x \times D_q$)
Values: $V = X W_v$ (shape: $N_x \times D_v$)
Similarities: $E = Q K^T / \sqrt{D_q}$ (shape: $N_q \times N_x$)
Attention: $A = \mathrm{softmax}(E)$, normalized over the key axis (shape: $N_q \times N_x$)
Outputs: $Y = A V$ (shape: $N_q \times D_v$)
[Diagram: each query $q_j$ is scored against every key $k_i$ to give $e_{j,i}$, the scores are softmaxed into weights $a_{j,i}$, and the weights mix the values $v_i$ into the output $y_j$]
Slide credit: Justin Johnson (EECS-498-007, UMich)
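The whole layer is a few lines of NumPy (the sizes in the usage example are arbitrary):

```python
import numpy as np

def attention(Q, X, W_k, W_v):
    """Q: (N_q, D_q) queries; X: (N_x, D_x) inputs."""
    K = X @ W_k                                  # keys    (N_x, D_q)
    V = X @ W_v                                  # values  (N_x, D_v)
    E = Q @ K.T / np.sqrt(Q.shape[1])            # similarities (N_q, N_x)
    A = np.exp(E - E.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)         # softmax over the key axis
    return A @ V                                 # outputs (N_q, D_v)

rng = np.random.default_rng(0)
N_q, N_x, D_x, D_q, D_v = 4, 3, 8, 6, 5
Y = attention(rng.normal(size=(N_q, D_q)), rng.normal(size=(N_x, D_x)),
              rng.normal(size=(D_x, D_q)), rng.normal(size=(D_x, D_v)))
print(Y.shape)                                   # (4, 5)
```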
Self-Attention Layer
Input: $X$ (shape: $N_x \times D_x$)
Layer’s parameters:
$X \to Q$: $W_q$ (shape: $D_x \times D_q$)
$X \to K$: $W_k$ (shape: $D_x \times D_q$)
$X \to V$: $W_v$ (shape: $D_x \times D_v$)
The queries now come from the input itself, $Q = X W_q$; the rest is the attention layer above, so every element of the sequence attends to every other element.
[Diagram: as before, but $q_i$, $k_i$, and $v_i$ are all computed from the same $x_i$]
Slide credit: Justin Johnson (EECS-498-007, UMich)
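As a sketch, the same computation with $Q = X W_q$:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Self-attention: queries, keys, and values all come from the same X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    E = Q @ K.T / np.sqrt(Q.shape[1])            # all-pairs similarities (N_x, N_x)
    A = np.exp(E - E.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)         # softmax over the key axis
    return A @ V                                 # (N_x, D_v)
```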
Self-Attention in Vision
[Diagram: a CNN backbone produces features $x$ of shape $D_x \times T \times H \times W$; $1 \times 1 \times 1$ convolutions $W_q$, $W_k$, $W_v$ map them to $q, k$ (shape $D_q \times T \cdot H \cdot W$) and $v$ (shape $D_y \times T \cdot H \cdot W$); a softmax over the $(T \cdot H \cdot W) \times (T \cdot H \cdot W)$ similarity matrix weights the values, giving the output $y$ of shape $D_y \times T \times H \times W$]
Every spatio-temporal position attends to every other position.
Wang, X., Girshick, R., Gupta, A. and He, K., “Non-local neural networks”, CVPR 2018
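A sketch of the core of this block in NumPy (the paper's residual connection and output projection are omitted, and the $\sqrt{D_q}$ scaling follows the dot-product-attention convention rather than the paper's exact formulation). Once the positions are flattened, each $1 \times 1 \times 1$ convolution is just a matrix multiply:

```python
import numpy as np

def non_local(x, W_q, W_k, W_v):
    """x: (D_x, T, H, W) feature map; W_q, W_k: (D_q, D_x); W_v: (D_y, D_x)."""
    D_x, T, H, W = x.shape
    flat = x.reshape(D_x, T * H * W)             # each position becomes a "token"
    q, k, v = W_q @ flat, W_k @ flat, W_v @ flat # 1x1x1 convs == matmuls
    E = q.T @ k / np.sqrt(q.shape[0])            # (THW, THW) similarity matrix
    A = np.exp(E - E.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)         # each position's weights sum to 1
    y = v @ A.T                                  # (D_y, THW) attended features
    return y.reshape(-1, T, H, W)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4, 14, 14))             # made-up backbone feature sizes
out = non_local(x, rng.normal(size=(32, 64)), rng.normal(size=(32, 64)),
                rng.normal(size=(64, 64)))
print(out.shape)                                 # (64, 4, 14, 14)
```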
Self-Attention in Vision
[Diagram: the non-local (self-attention) block inserted into a CNN backbone; the network ends with average pooling and an MLP head]
Wang, X., Girshick, R., Gupta, A. and He, K., “Non-local neural networks”, CVPR 2018
Self-Attention Layer: Properties
Self-attention is permutation-equivariant: permuting the inputs permutes the outputs in the same way,
$\mathrm{SelfAtt}(\pi(x_1, \dots, x_n)) = \pi(\mathrm{SelfAtt}(x_1, \dots, x_n))$
In other words, the layer is unaware of the order of its inputs.
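The property is easy to verify numerically, reusing the `self_attention` sketch from above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                      # 5 tokens of dimension 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
perm = rng.permutation(5)

# attend-then-permute equals permute-then-attend
print(np.allclose(self_attention(X, W_q, W_k, W_v)[perm],
                  self_attention(X[perm], W_q, W_k, W_v)))   # True
```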
Self-Attention Layer: Properties
Positional Encoding: to make the layer order-aware, a positional encoding is added to (or concatenated with) each input $x_i$ before the $W_q$, $W_k$, $W_v$ projections, so identical tokens at different positions yield different queries, keys, and values.
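One standard choice, not specified on the slide, is the fixed sinusoidal encoding of Vaswani et al. (2017), added to the inputs; a sketch:

```python
import numpy as np

def sinusoidal_positions(n_positions, dim):
    """Fixed sinusoidal positional encoding (Vaswani et al., 2017)."""
    pos = np.arange(n_positions)[:, None]        # (n, 1) position indices
    i = np.arange(0, dim, 2)[None, :]            # (1, dim/2) frequency indices
    angles = pos / (10000 ** (i / dim))
    pe = np.zeros((n_positions, dim))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

X = np.zeros((5, 8))                             # 5 identical tokens (made-up sizes)
X = X + sinusoidal_positions(5, 8)               # positions now distinguish them
```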
Multi-head Self-Attention
The input dimension is split across $H$ independent “heads”; each head runs self-attention in parallel with its own $W_q$, $W_k$, $W_v$, and the head outputs are concatenated and linearly projected.
[Diagram: the input $x$ fans out to several self-attention heads whose outputs are concatenated]
Slide credit: Justin Johnson (EECS-498-007, UMich)
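A sketch with made-up sizes; the final projection $W_o$ is the standard output linear layer of multi-head attention:

```python
import numpy as np

def softmax_rows(E):
    A = np.exp(E - E.max(axis=1, keepdims=True))
    return A / A.sum(axis=1, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (N, D); W_q, W_k, W_v, W_o: (D, D); D must be divisible by n_heads."""
    N, D = X.shape
    Dh = D // n_heads
    # project, then split the feature dimension into (n_heads, N, Dh)
    split = lambda M: M.reshape(N, n_heads, Dh).transpose(1, 0, 2)
    Qh, Kh, Vh = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    heads = [softmax_rows(q @ k.T / np.sqrt(Dh)) @ v     # attention per head
             for q, k, v in zip(Qh, Kh, Vh)]
    return np.concatenate(heads, axis=1) @ W_o           # concat heads, project

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))
Ws = [rng.normal(scale=0.1, size=(16, 16)) for _ in range(4)]
print(multi_head_self_attention(X, *Ws, n_heads=4).shape)   # (6, 16)
```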
Three Ways of Processing Sequences
Recurrent Neural Network, 1D Convolution, Self-Attention.
Self-attention is highly scalable and highly parallelizable: all positions are processed simultaneously, with no sequential dependency between timesteps.
[Diagram: the transformer block: $x_1, \dots, x_n \to$ Layer Norm $\to$ Multi-head Self-Attention $\to$ Layer Norm $\to$ MLP, with residual connections around the attention and MLP sub-blocks]
Slide credit: Justin Johnson (EECS-498-007, UMich)
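A sketch of such a block in the pre-norm arrangement shown above, reusing `multi_head_self_attention` from the previous snippet (LayerNorm's learnable scale and shift are omitted for brevity; the residual connections are the standard transformer design):

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    # normalize each token's features to zero mean and unit variance
    mu, var = X.mean(axis=1, keepdims=True), X.var(axis=1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def transformer_block(X, attn_weights, W1, b1, W2, b2, n_heads=4):
    """Pre-norm block: x + MHSA(LN(x)), then x + MLP(LN(x))."""
    X = X + multi_head_self_attention(layer_norm(X), *attn_weights, n_heads=n_heads)
    H = np.maximum(0.0, layer_norm(X) @ W1 + b1)     # MLP hidden layer (ReLU)
    return X + H @ W2 + b2

rng = np.random.default_rng(0)
D = 16
X = rng.normal(size=(6, D))
attn_w = [rng.normal(scale=0.1, size=(D, D)) for _ in range(4)]
W1, b1 = rng.normal(scale=0.1, size=(D, 4 * D)), np.zeros(4 * D)
W2, b2 = rng.normal(scale=0.1, size=(4 * D, D)), np.zeros(D)
print(transformer_block(X, attn_w, W1, b1, W2, b2).shape)    # (6, 16)
```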
Sequence to Sequence
[Diagram: a full encoder-decoder model: the encoder processes the input sequence, the decoder generates the output, and an attention layer (softmax over encoder states, weighted sum) connects the two]
Transformer Networks
Pretraining:
Download a LOT of text from the internet
Train a transformer network using self-supervision
Finetuning:
Fine-tune the transformer to the specific NLP task at hand
[Diagram: a stack of transformer blocks mapping inputs $x_1, \dots, x_n$ to outputs $y_1, \dots, y_n$]
Example of GPT-3-generated text
chat.openai.com
Final Project: team-up deadline
What’s next?
Tomorrow: AI & Robotics Seminar (not for credit), Assaf Shocher
Next lecture: Vision Transformers (ViT), Shai