DL4CV: Sequences & Attention

Images (static) vs. Sequences (time-dependent)
Deep Learning for Sequences
[Diagram: five input/output topologies: one-to-one, one-to-many, many-to-one, many-to-many (seq2seq), many-to-many (one output per input step)]
Recurrent Neural Networks
[Diagram: from one-to-one to many-to-many: the same recurrent cell, with shared weights, is applied at every timestep, passing a hidden state forward]
Example: Language Modeling
Task: given characters $c_0, c_1, \dots, c_{t-1}$, predict $c_t$.
Training sequence: “hello”
Vocabulary: [‘h’, ‘e’, ‘l’, ‘o’]
Input sequence: ‘h’ ‘e’ ‘l’ ‘l’
Embedding layer: each input character is one-hot encoded over the vocabulary (e.g. ‘h’ → [1 0 0 0], ‘e’ → [0 1 0 0], ‘l’ → [0 0 1 0]) and enters the hidden layer through $W_{xh}$.
Slide credit: Justin Johnson (EECS-498-007, UMich)
Example: Language Modeling
Task: given characters $c_0, c_1, \dots, c_{t-1}$, predict $c_t$.
Target chars: ‘e’ ‘l’ ‘l’ ‘o’
Hidden layer: $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$
Output layer: $y_t = W_{hy} h_t$
[Diagram: at each timestep the one-hot input passes through $W_{xh}$ into a 3-unit hidden state (e.g. [0.3, -0.1, 0.9] at the first step), the previous hidden state feeds back through $W_{hh}$, and $W_{hy}$ maps the hidden state to 4 output scores, one per vocabulary character (e.g. [1.0, 2.2, -3.0, 4.1] at the first step)]
Slide credit: Justin Johnson (EECS-498-007, UMich)
Example: Language Modeling
The network is unrolled over the training sequence, making one prediction per step:
Given “h”, predict “e”
Given “he”, predict “l”
Given “hel”, predict “l”
Given “hell”, predict “o”
Example: Language Modeling
Softmax: at each step, the output scores are passed through a softmax to give a probability distribution over the vocabulary, e.g. scores [1.0, 2.2, -3.0, 4.1] become probabilities [.04, .12, .01, .83]; the next character (target ‘e’ at the first step) is predicted from this distribution, and training maximizes the probability assigned to the target.
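A minimal NumPy sketch of this forward pass (the weights are random stand-ins rather than trained values; hidden size 3 and vocabulary size 4 match the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ['h', 'e', 'l', 'o']
char_to_idx = {c: i for i, c in enumerate(vocab)}
hidden_size, vocab_size = 3, len(vocab)

# random stand-ins for the learned weight matrices
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(hidden_size)
for c, target in zip('hell', 'ello'):
    x = np.zeros(vocab_size)
    x[char_to_idx[c]] = 1.0                  # one-hot encoding of the input char
    h = np.tanh(W_hh @ h + W_xh @ x)         # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    p = softmax(W_hy @ h)                    # y_t = W_hy h_t, then softmax
    print(f"after {c!r}: P({target!r}) = {p[char_to_idx[target]]:.2f}")
```

Training minimizes the cross-entropy loss, i.e. the negative log-probability that each step assigns to its target character.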
Searching for Interpretable Cells
[Figures: the activation of a single hidden cell visualized over text; some cells turn out to be interpretable, e.g. tracking whether the model is inside a quote, the position within a line, or code nesting depth]
Karpathy, Johnson, and Fei-Fei: Visualizing and Understanding Recurrent Networks, ICLR Workshop 2016. Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission.
Agenda
Images (static) vs. Sequences (time-dependent)
RNN: Gradient Flow
$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t) = \tanh\left(W \cdot \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix}\right)$
[Diagram: one RNN cell: $h_{t-1}$ and $x_t$ are multiplied by $W$ and passed through tanh to produce $h_t$]
RNN: Gradient Flow
[Diagram: the cell unrolled over many timesteps: $\dots \to h_{t-1} \to h_t \to h_{t+1} \to \dots$, each step multiplying by $W$ and applying tanh]
Backpropagating through $k$ timesteps multiplies the gradient by $W^T$ (and by a tanh Jacobian) $k$ times, so gradients tend to explode when the largest singular value of $W$ is above 1 and to vanish when it is below 1.
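A quick numerical sketch of the effect (the recurrent matrix here is random and purely illustrative): after 50 steps the gradient norm scales roughly like the largest singular value of $W$ raised to the 50th power.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 32))
grad = rng.normal(size=32)

for sigma_max in (0.5, 1.5):                 # largest singular value below / above 1
    Ws = sigma_max * W / np.linalg.svd(W, compute_uv=False)[0]
    g = grad.copy()
    for _ in range(50):                      # backprop through 50 unrolled timesteps
        g = Ws.T @ g                         # one factor of W^T per step
    print(f"sigma_max={sigma_max}: |grad| after 50 steps = {np.linalg.norm(g):.2e}")
```

The tanh Jacobians (entries at most 1) are ignored here; including them only makes the vanishing case worse.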
Long Short-Term Memory (LSTM)
[Diagram: the vanilla RNN cell for comparison: input x and hidden state h pass through $W$ and tanh, $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$]
Long Short-Term Memory (LSTM)
The LSTM keeps a cell state c (the “memory”) alongside the hidden state h. From $[h_{t-1}; x_t]$, one matrix $W$ produces four vectors:
i = sigmoid(...): input gate
f = sigmoid(...): forget gate
o = sigmoid(...): output gate
g = tanh(...): candidate update
The gates control the data flow:
$c_t = f_t \odot c_{t-1} + i_t \odot g_t$
$h_t = o_t \odot \tanh(c_t)$
[Diagram: x (input) and h (hidden) feed the four gates; f scales the previous cell state, i scales the candidate g, their sum is the new cell state c, and o gates what tanh(c) exposes as the new hidden state]
Long Short-Term Memory (LSTM)
A gradient path without matrix multiplication: the update $c_t = f_t \odot c_{t-1} + i_t \odot g_t$ touches the cell state only through elementwise operations, so the gradient flows from $c_t$ back to $c_{t-1}$ without passing through $W$ or a tanh nonlinearity. This uninterrupted path is what lets LSTMs preserve information, and gradients, over long time spans.
$h_t = o_t \odot \tanh(c_t)$
[Diagram: the horizontal highway $c_{t-1} \to \odot \to + \to c_t$ across the top of the cell]
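A single LSTM step in NumPy, following the equations above; stacking the four gate pre-activations into one matrix $W$ is a common implementation convention assumed here, not something the slide prescribes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, D + H): it maps [x_t; h_{t-1}]
    to the stacked pre-activations of the gates i, f, o and the update g."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0*H:1*H])              # input gate
    f = sigmoid(z[1*H:2*H])              # forget gate
    o = sigmoid(z[2*H:3*H])              # output gate
    g = np.tanh(z[3*H:4*H])              # candidate cell update
    c = f * c_prev + i * g               # c_t = f (*) c_{t-1} + i (*) g, elementwise only
    h = o * np.tanh(c)                   # h_t = o (*) tanh(c_t)
    return h, c

# usage with made-up sizes
rng = np.random.default_rng(0)
D, H = 8, 16
W, b = rng.normal(scale=0.1, size=(4 * H, D + H)), np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
```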
Agenda
Images (static) vs. Sequences (time-dependent)
Sequence to Sequence
Example: translating English to Hebrew:
We → אנו (ANU)
Learn → לומדים (LOMDIM)
Computer → ממוחשבת (MEMUCHSHEVET)
Vision → ראיה (RE’EYA)
Sequence to Sequence: RNN
[Diagram: an encoder RNN reads the input sequence; its final hidden state is passed as a single context vector to a decoder RNN, which generates the output sequence]
The “context” is a bottleneck: the entire input must be squeezed into one fixed-size vector. What if the sequence is very long?
Slide credit: Justin Johnson (EECS-498-007, UMich)
Sequence to Sequence: RNN & Attention
At every decoding step, the current decoder state acts as a query $Q$ that is compared against all encoder states (keys $K$); a softmax over the similarity scores yields attention weights, and the weighted sum of encoder states (values $V$) forms a fresh context vector for that step.
No alignments need to be provided: the attention weights are learned end-to-end from the translation loss alone.
[Diagram: $K$ and $Q$ produce scores, softmax normalizes them, the weights multiply ($\odot$) the values $V$, and the results are summed ($+$) into the context]
Slide credit: Justin Johnson (EECS-498-007, UMich)
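A minimal sketch of one such attention step (names are illustrative; a dot-product score is used for simplicity, whereas Bahdanau et al. compute scores with a small MLP):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(dec_h, enc_hs):
    """dec_h: (D,) current decoder state; enc_hs: (T, D) encoder states."""
    scores = enc_hs @ dec_h          # one alignment score per input position
    weights = softmax(scores)        # attention weights, sum to 1
    context = weights @ enc_hs       # weighted sum of encoder states, shape (D,)
    return context, weights

rng = np.random.default_rng(0)
context, w = attend(rng.normal(size=8), rng.normal(size=(5, 8)))
print(w.round(2), w.sum())           # the weights form a distribution over positions
```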
Sequence to Sequence: RNN & Attention
Input (English): The agreement on the European Economic Area was signed in August 1992.
Output (French): L’accord sur la zone économique européenne a été signé en août 1992.
[Figure: the matrix of learned attention weights forms a soft word alignment between the two sentences: largely diagonal, but reordered where French word order differs, as in “zone économique européenne” vs. “European Economic Area”]
Bahdanau et al., “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Slide credit: Justin Johnson (EECS-498-007, UMich)
Attention Layer
Inputs:
Query: $Q$ (shape: $N_q \times D_q$)
Input: $X$ (shape: $N_x \times D_x$)
Layer’s parameters:
$X \to K$: $W_k$ (shape: $D_x \times D_q$)
$X \to V$: $W_v$ (shape: $D_x \times D_v$)
Compute:
Keys: $K = X W_k$ (shape: $N_x \times D_q$)
Values: $V = X W_v$ (shape: $N_x \times D_v$)
Similarities: $E = Q K^T / \sqrt{D_q}$ (shape: $N_q \times N_x$)
Attention: $A = \mathrm{softmax}(E)$, normalized over the key axis (shape: $N_q \times N_x$)
Outputs: $Y = A V$ (shape: $N_q \times D_v$)
[Diagram: each query $q_j$ is scored against every key $k_i$ to give $e_{j,i}$, the scores are softmaxed into weights $a_{j,i}$, and the weights mix the values $v_i$ into the output $y_j$]
Slide credit: Justin Johnson (EECS-498-007, UMich)
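The whole layer is a few lines of NumPy (the sizes in the usage example are arbitrary):

```python
import numpy as np

def attention(Q, X, W_k, W_v):
    """Q: (N_q, D_q) queries; X: (N_x, D_x) inputs."""
    K = X @ W_k                                  # keys    (N_x, D_q)
    V = X @ W_v                                  # values  (N_x, D_v)
    E = Q @ K.T / np.sqrt(Q.shape[1])            # similarities (N_q, N_x)
    A = np.exp(E - E.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)         # softmax over the key axis
    return A @ V                                 # outputs (N_q, D_v)

rng = np.random.default_rng(0)
N_q, N_x, D_x, D_q, D_v = 4, 3, 8, 6, 5
Y = attention(rng.normal(size=(N_q, D_q)), rng.normal(size=(N_x, D_x)),
              rng.normal(size=(D_x, D_q)), rng.normal(size=(D_x, D_v)))
print(Y.shape)                                   # (4, 5)
```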
Self-Attention Layer
Input: $X$ (shape: $N_x \times D_x$)
Layer’s parameters:
$X \to Q$: $W_q$ (shape: $D_x \times D_q$)
$X \to K$: $W_k$ (shape: $D_x \times D_q$)
$X \to V$: $W_v$ (shape: $D_x \times D_v$)
The queries now come from the input itself, $Q = X W_q$; the rest is the attention layer above, so every element of the sequence attends to every other element.
[Diagram: as before, but $q_i$, $k_i$, and $v_i$ are all computed from the same $x_i$]
Slide credit: Justin Johnson (EECS-498-007, UMich)
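As a sketch, the same computation with $Q = X W_q$:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Self-attention: queries, keys, and values all come from the same X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    E = Q @ K.T / np.sqrt(Q.shape[1])            # all-pairs similarities (N_x, N_x)
    A = np.exp(E - E.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)         # softmax over the key axis
    return A @ V                                 # (N_x, D_v)
```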
Self-Attention in Vision
[Diagram: a CNN backbone produces features $x$ of shape $D_x \times T \times H \times W$; $1 \times 1 \times 1$ convolutions $W_q$, $W_k$, $W_v$ map them to $q, k$ (shape $D_q \times T \cdot H \cdot W$) and $v$ (shape $D_y \times T \cdot H \cdot W$); a softmax over the $(T \cdot H \cdot W) \times (T \cdot H \cdot W)$ similarity matrix weights the values, giving the output $y$ of shape $D_y \times T \times H \times W$]
Every spatio-temporal position attends to every other position.
Wang, X., Girshick, R., Gupta, A. and He, K., “Non-local neural networks”, CVPR 2018
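A sketch of the core of this block in NumPy (the paper's residual connection and output projection are omitted, and the $\sqrt{D_q}$ scaling follows the dot-product-attention convention rather than the paper's exact formulation). Once the positions are flattened, each $1 \times 1 \times 1$ convolution is just a matrix multiply:

```python
import numpy as np

def non_local(x, W_q, W_k, W_v):
    """x: (D_x, T, H, W) feature map; W_q, W_k: (D_q, D_x); W_v: (D_y, D_x)."""
    D_x, T, H, W = x.shape
    flat = x.reshape(D_x, T * H * W)             # each position becomes a "token"
    q, k, v = W_q @ flat, W_k @ flat, W_v @ flat # 1x1x1 convs == matmuls
    E = q.T @ k / np.sqrt(q.shape[0])            # (THW, THW) similarity matrix
    A = np.exp(E - E.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)         # each position's weights sum to 1
    y = v @ A.T                                  # (D_y, THW) attended features
    return y.reshape(-1, T, H, W)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4, 14, 14))             # made-up backbone feature sizes
out = non_local(x, rng.normal(size=(32, 64)), rng.normal(size=(32, 64)),
                rng.normal(size=(64, 64)))
print(out.shape)                                 # (64, 4, 14, 14)
```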
Self-Attention in Vision
[Diagram: the non-local (self-attention) block inserted into a CNN backbone; the network ends with average pooling and an MLP head]
Wang, X., Girshick, R., Gupta, A. and He, K., “Non-local neural networks”, CVPR 2018
Self-Attention Layer: Properties
Self-attention is permutation-equivariant: permuting the inputs permutes the outputs in the same way,
$\mathrm{SelfAtt}(\pi(x_1, \dots, x_n)) = \pi(\mathrm{SelfAtt}(x_1, \dots, x_n))$
In other words, the layer is unaware of the order of its inputs.
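The property is easy to verify numerically, reusing the `self_attention` sketch from above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                      # 5 tokens of dimension 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
perm = rng.permutation(5)

# attend-then-permute equals permute-then-attend
print(np.allclose(self_attention(X, W_q, W_k, W_v)[perm],
                  self_attention(X[perm], W_q, W_k, W_v)))   # True
```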
Self-Attention Layer: Properties
Positional Encoding: to make the layer order-aware, a positional encoding is added to (or concatenated with) each input $x_i$ before the $W_q$, $W_k$, $W_v$ projections, so identical tokens at different positions yield different queries, keys, and values.
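One standard choice, not specified on the slide, is the fixed sinusoidal encoding of Vaswani et al. (2017), added to the inputs; a sketch:

```python
import numpy as np

def sinusoidal_positions(n_positions, dim):
    """Fixed sinusoidal positional encoding (Vaswani et al., 2017)."""
    pos = np.arange(n_positions)[:, None]        # (n, 1) position indices
    i = np.arange(0, dim, 2)[None, :]            # (1, dim/2) frequency indices
    angles = pos / (10000 ** (i / dim))
    pe = np.zeros((n_positions, dim))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

X = np.zeros((5, 8))                             # 5 identical tokens (made-up sizes)
X = X + sinusoidal_positions(5, 8)               # positions now distinguish them
```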
Multi-head Self-Attention
The input dimension is split across $H$ independent “heads”; each head runs self-attention in parallel with its own $W_q$, $W_k$, $W_v$, and the head outputs are concatenated and linearly projected.
[Diagram: the input $x$ fans out to several self-attention heads whose outputs are concatenated]
Slide credit: Justin Johnson (EECS-498-007, UMich)
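A sketch with made-up sizes; the final projection $W_o$ is the standard output linear layer of multi-head attention:

```python
import numpy as np

def softmax_rows(E):
    A = np.exp(E - E.max(axis=1, keepdims=True))
    return A / A.sum(axis=1, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (N, D); W_q, W_k, W_v, W_o: (D, D); D must be divisible by n_heads."""
    N, D = X.shape
    Dh = D // n_heads
    # project, then split the feature dimension into (n_heads, N, Dh)
    split = lambda M: M.reshape(N, n_heads, Dh).transpose(1, 0, 2)
    Qh, Kh, Vh = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    heads = [softmax_rows(q @ k.T / np.sqrt(Dh)) @ v     # attention per head
             for q, k, v in zip(Qh, Kh, Vh)]
    return np.concatenate(heads, axis=1) @ W_o           # concat heads, project

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))
Ws = [rng.normal(scale=0.1, size=(16, 16)) for _ in range(4)]
print(multi_head_self_attention(X, *Ws, n_heads=4).shape)   # (6, 16)
```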
Three Ways of Processing Sequences
Recurrent Neural Network, 1D Convolution, Self-Attention.
Self-attention is highly scalable and highly parallelizable: all positions are processed simultaneously, with no sequential dependency between timesteps.
[Diagram: the transformer block: $x_1, \dots, x_n \to$ Layer Norm $\to$ Multi-head Self-Attention $\to$ Layer Norm $\to$ MLP, with residual connections around the attention and MLP sub-blocks]
Slide credit: Justin Johnson (EECS-498-007, UMich)
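A sketch of such a block in the pre-norm arrangement shown above, reusing `multi_head_self_attention` from the previous snippet (LayerNorm's learnable scale and shift are omitted for brevity; the residual connections are the standard transformer design):

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    # normalize each token's features to zero mean and unit variance
    mu, var = X.mean(axis=1, keepdims=True), X.var(axis=1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def transformer_block(X, attn_weights, W1, b1, W2, b2, n_heads=4):
    """Pre-norm block: x + MHSA(LN(x)), then x + MLP(LN(x))."""
    X = X + multi_head_self_attention(layer_norm(X), *attn_weights, n_heads=n_heads)
    H = np.maximum(0.0, layer_norm(X) @ W1 + b1)     # MLP hidden layer (ReLU)
    return X + H @ W2 + b2

rng = np.random.default_rng(0)
D = 16
X = rng.normal(size=(6, D))
attn_w = [rng.normal(scale=0.1, size=(D, D)) for _ in range(4)]
W1, b1 = rng.normal(scale=0.1, size=(D, 4 * D)), np.zeros(4 * D)
W2, b2 = rng.normal(scale=0.1, size=(4 * D, D)), np.zeros(D)
print(transformer_block(X, attn_w, W1, b1, W2, b2).shape)    # (6, 16)
```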
Sequence to Sequence
[Diagram: a full encoder-decoder model: the encoder processes the input sequence, the decoder generates the output, and an attention layer (softmax over encoder states, weighted sum) connects the two]
Transformer Networks
Pretraining:
Download a LOT of text from the internet
Train a transformer network using self-supervision
Finetuning:
Fine-tune the transformer to the specific NLP task at hand
[Diagram: a stack of transformer blocks mapping inputs $x_1, \dots, x_n$ to outputs $y_1, \dots, y_n$]
Example of GPT-3-generated text
chat.openai.com
Final Project: team-up deadline
What’s next?
Tomorrow: AI & Robotics Seminar (not for credit), Assaf Shocher
Next lecture: Vision Transformers (ViT), Shai