
Deep Learning for Computer Vision:

Sequences and Attention


Shai Bagon
Agenda

Images (static): Perceptron, Convolution, CNN
Sequences (time-dependent): Recurrent NN, Memory, Attention

DL4CV @ Weizmann
Deep Learning for Sequences
One to one: feed-forward, e.g., classification (image -> label)
One to many: e.g., image captioning (image -> sequence of words)
Many to one: e.g., video classification (sequence of frames -> label)
Many to many: e.g., per-frame video classification (sequence of frames -> sequence of labels)
Many to many: e.g., machine translation (sequence of words -> sequence of words)
DL4CV @ Weizmann
Slide credit: Justin Johnson (EECS-498-007, UMich)
Recurrent Neural Networks
One to one Many to many

DL4CV @ Weizmann
Recurrent Neural Networks
One to one Many to many

(+) Parameter efficient


(-) No temporal dependency

DL4CV @ Weizmann
Recurrent Neural Networks
One to one Many to many

(+?) Temporal dependency (via trained parameters)


(-) Parameter inefficiency
(-) Fixed sequence length

DL4CV @ Weizmann
Recurrent Neural Networks
One to one Many to many

(+) Temporal dependency (via “hidden state”)


(+) Parameter efficiency
(+) Arbitrary sequence length

DL4CV @ Weizmann
Recurrent Neural Networks
One to one Many to many

$h_t = f(h_{t-1}, x_t; W) = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$

$y_t = g(h_t; W) = W_{hy} h_t + b_y$

DL4CV @ Weizmann
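A minimal NumPy sketch of this recurrence; the toy sizes and random weights below are assumptions, only the update rules come from the slide:

import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

def rnn_output(h_t, W_hy, b_y):
    # y_t = W_hy h_t + b_y
    return W_hy @ h_t + b_y

# toy sizes (assumed): hidden 3, input 4, output 4
rng = np.random.default_rng(0)
W_hh = rng.standard_normal((3, 3))
W_xh = rng.standard_normal((3, 4))
W_hy = rng.standard_normal((4, 3))
b_h, b_y = np.zeros(3), np.zeros(4)

h = np.zeros(3)                          # initial hidden state
for x in rng.standard_normal((5, 4)):    # a length-5 input sequence
    h = rnn_step(h, x, W_hh, W_xh, b_h)  # the same weights are reused at every step
    y = rnn_output(h, W_hy, b_y)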
Example: Language Modeling
Task:
Given characters $c_0, c_1, \dots, c_{t-1}$
Predict $c_t$

Training sequence: "hello"
Vocabulary: ['h', 'e', 'l', 'o']
Embedding layer: each input character is encoded as a one-hot vector over the vocabulary
Input sequence: 'h' 'e' 'l' 'l'
DL4CV @ Weizmann
Slide credit: Justin Johnson (EECS-498-007, UMich)
Example: Language Modeling
Task:
Given characters $c_0, c_1, \dots, c_{t-1}$
Predict $c_t$

$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$
$y_t = W_{hy} h_t$

Training sequence: "hello"
Vocabulary: ['h', 'e', 'l', 'o']
Embedding layer: each input character is encoded as a one-hot vector over the vocabulary
Input sequence: 'h' 'e' 'l' 'l'

[Figure: the one-hot inputs feed through $W_{xh}$ into a hidden layer that is also updated through $W_{hh}$ from the previous hidden state.]
DL4CV @ Weizmann
Slide credit: Justin Johnson (EECS-498-007, UMich)
Example: Language Modeling
Task:
Given characters $c_0, c_1, \dots, c_{t-1}$
Predict $c_t$

$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$
$y_t = W_{hy} h_t$

Training sequence: "hello"
Vocabulary: ['h', 'e', 'l', 'o']
Input sequence: 'h' 'e' 'l' 'l'
Target chars: 'e' 'l' 'l' 'o'

[Figure: at each step the one-hot input feeds through $W_{xh}$ into the hidden layer (also updated via $W_{hh}$), and the hidden state feeds through $W_{hy}$ into an output layer with one score per vocabulary character.]
DL4CV @ Weizmann
Slide credit: Justin Johnson (EECS-498-007, UMich)
Example: Language Modeling

At training time, each prefix of the sequence is used to predict the next character:
Given "h" predict "e"
Given "he" predict "l"
Given "hel" predict "l"
Given "hell" predict "o"

DL4CV @ Weizmann
Slide credit: Justin Johnson (EECS-498-007, UMich)
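A short sketch of how these training pairs can be constructed in Python; the vocabulary and the "hello" sequence come from the slide, the variable names are made up:

import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_idx = {c: i for i, c in enumerate(vocab)}

text = "hello"
input_ids  = [char_to_idx[c] for c in text[:-1]]   # 'h', 'e', 'l', 'l'
target_ids = [char_to_idx[c] for c in text[1:]]    # 'e', 'l', 'l', 'o'

# one-hot encode the inputs, one row per time step
X = np.eye(len(vocab))[input_ids]                  # shape (4, 4)
print(input_ids, target_ids)                       # [0, 1, 2, 2] [1, 2, 2, 3]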
Example: Language Modeling
Task:
Given characters $c_0, c_1, \dots, c_{t-1}$
Predict $c_t$

At test time: generate new text.
Apply a softmax to the output scores, sample one character at a time, and feed the sampled character back in as the next input.

Training sequence: "hello"
Vocabulary: ['h', 'e', 'l', 'o']

[Figure: starting from 'h', the model samples 'e', then 'l', 'l', 'o'; each step shows the softmax distribution over the vocabulary, and the sampled character becomes the next one-hot input.]
DL4CV @ Weizmann
Slide credit: Justin Johnson (EECS-498-007, UMich)
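A sketch of this test-time sampling loop in NumPy; the function signature and weight names are assumptions mirroring the earlier notation, while the softmax-and-sample step follows the slide:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_text(first_char, n_chars, h, W_hh, W_xh, W_hy, vocab, char_to_idx, seed=0):
    rng = np.random.default_rng(seed)
    out = [first_char]
    x = np.eye(len(vocab))[char_to_idx[first_char]]   # one-hot input
    for _ in range(n_chars):
        h = np.tanh(W_hh @ h + W_xh @ x)              # hidden-state update
        p = softmax(W_hy @ h)                         # distribution over the vocabulary
        idx = rng.choice(len(vocab), p=p)             # sample one character
        out.append(vocab[idx])
        x = np.eye(len(vocab))[idx]                   # feed the sample back as the next input
    return "".join(out)

# e.g., sample_text('h', 4, np.zeros(3), W_hh, W_xh, W_hy, vocab, char_to_idx)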
Searching for Interpretable Cells

[Figures: activations of individual hidden cells visualized over text; some cells turn out to track interpretable properties, e.g., being inside a quote or the position within a line.]

DL4CV @ Weizmann
Karpathy, Johnson, and Fei-Fei: Visualizing and Understanding Recurrent Networks, ICLR Workshop 2016
Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission
Agenda

Images (static): Perceptron, Convolution, CNN
Sequences (time-dependent): Recurrent NN, Memory

DL4CV @ Weizmann
RNN: Gradient Flow
$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t) = \tanh\left( W \cdot \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix} \right)$

[Figure: a single RNN cell: $h_{t-1}$ and $x_t$ enter, are multiplied by $W$ and passed through tanh to produce $h_t$.]

DL4CV @ Weizmann
RNN: Gradient Flow
[Figure: the cell unrolled in time: each step multiplies by $W$ and applies tanh, chaining $h_{t-1} \to h_t \to h_{t+1} \to \dots$]

Forward and backward passes through time for long sequences amount to a very deep network.
This easily leads to exploding or vanishing gradients.

DL4CV @ Weizmann
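A tiny NumPy illustration of the effect: backpropagation through T steps multiplies the gradient by the transpose of $W_{hh}$ repeatedly, so its norm scales roughly like the largest singular value of $W_{hh}$ raised to the power T (the example matrix below is a deliberately simple stand-in):

import numpy as np

rng = np.random.default_rng(0)
for scale in (0.5, 1.0, 1.5):           # controls the largest singular value of W_hh
    W_hh = scale * np.eye(8)            # deliberately simple recurrent weights
    g = rng.standard_normal(8)          # some upstream gradient w.r.t. the last hidden state
    for _ in range(50):                 # backprop through 50 time steps
        g = W_hh.T @ g                  # (ignoring the tanh' factor, which is <= 1 anyway)
    print(scale, np.linalg.norm(g))     # roughly 1e-15, 3, 1e9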
Long Short-Term Memory (LSTM)
[Figure: for comparison, the vanilla RNN cell: x = input, h = hidden state; $h_t = \tanh(W \cdot [h_{t-1}; x_t])$.]
DL4CV @ Weizmann
Long Short-Term Memory (LSTM)
Data flow: x = input, h = hidden state, c = cell state ("memory")
Gates (control): i = input gate, f = forget gate, o = output gate
From $[h_{t-1}; x_t]$ and the weights $W$: $g$ is computed with tanh, while $i$, $f$, $o$ are computed with sigmoids.

$c_t = f_t \odot c_{t-1} + i_t \odot g_t$
$h_t = o_t \odot \tanh(c_t)$
DL4CV @ Weizmann
Long Short-Term Memory (LSTM)

[Figure: the LSTM cell wiring: $i$, $f$, $o$ (sigmoid) and $g$ (tanh) are computed from $W [h_{t-1}; x_t]$; the cell state is updated as $c_t = f_t \odot c_{t-1} + i_t \odot g_t$ and the hidden state as $h_t = o_t \odot \tanh(c_t)$.]

DL4CV @ Weizmann
Long Short-Term Memory (LSTM)

A gradient path without matrix multiplication: the cell state $c_t$ is updated only by elementwise operations ($\odot$, $+$), so gradients can flow back through time along $c_t$ without repeatedly passing through $W$.

$c_t = f_t \odot c_{t-1} + i_t \odot g_t$
$h_t = o_t \odot \tanh(c_t)$

DL4CV @ Weizmann
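A minimal NumPy sketch of a single LSTM step implementing exactly these equations; packing all four gates into one stacked weight matrix acting on $[h_{t-1}; x_t]$ is an assumption about the parameterization:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W, b):
    # W: (4H, H + X), b: (4H,) -- one stacked matrix producing all four gates
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i = sigmoid(z[0*H:1*H])           # input gate
    f = sigmoid(z[1*H:2*H])           # forget gate
    o = sigmoid(z[2*H:3*H])           # output gate
    g = np.tanh(z[3*H:4*H])           # candidate cell values
    c = f * c_prev + i * g            # c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
    h = o * np.tanh(c)                # h_t = o_t ⊙ tanh(c_t)
    return h, c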
Agenda

Images (static): Perceptron, Convolution, CNN
Sequences (time-dependent): Recurrent NN, Memory, Attention

DL4CV @ Weizmann
Sequence to Sequence

We → אנו (ANU)
Learn → לומדים (LOMDIM)
Computer → ממוחשבת (MEMUCHSHEVET)
Vision → ראיה (RE'EYA)

DL4CV @ Weizmann
Sequence to Sequence: RNN
Encoder RNN Decoder RNN

Initial hidden state

DL4CV @ Weizmann
Slide credit: Justin Johnson (EECS-498-007, UMich)
Sequence to Sequence: RNN
Encoder RNN Decoder RNN

context

Initial hidden state

The "context" vector is a bottleneck: the entire input sequence must be squeezed into it.
What if the sequence is very long?

DL4CV @ Weizmann
Slide credit: Justin Johnson (EECS-498-007, UMich)
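A small PyTorch sketch just to make the bottleneck concrete: however long the source sequence is, the vanilla encoder-decoder hands the decoder only the encoder's final hidden state (all sizes here are made up):

import torch
import torch.nn as nn

enc = nn.GRU(input_size=32, hidden_size=64, batch_first=True)   # encoder RNN
dec = nn.GRU(input_size=32, hidden_size=64, batch_first=True)   # decoder RNN

src = torch.randn(1, 100, 32)        # a long source sequence: 100 steps
_, context = enc(src)                # context: (1, 1, 64) -- everything is squeezed in here
tgt_in = torch.randn(1, 10, 32)      # decoder inputs (e.g., shifted target embeddings)
dec_out, _ = dec(tgt_in, context)    # the decoder sees the source only through `context`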
Sequence to Sequence: RNN & Attention

No alignment needs to be provided as supervision: the attention weights are learned end-to-end.

[Figure: at each decoder step, a query Q is compared with the encoder states (keys K); a softmax over the similarities produces attention weights, and the weighted sum of the values V forms a fresh context vector for that output token.]

DL4CV @ Weizmann
Slide credit: Justin Johnson (EECS-498-007, UMich)
Sequence to Sequence: RNN & Attention

Example: English to French translation

Input (English):
The agreement on the European Economic Area
was signed in August 1992.

Output (French):
L’accord sur la zone économique européenne
a été signé en août 1992.

DL4CV @ Weizmann
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Slide credit: Justin Johnson (EECS-498-007, UMich)
Attention Layer

[Figure: the attention mechanism from the previous slides: a query Q is compared against keys K, and a softmax produces weights that combine the values V.]

DL4CV @ Weizmann
Slide credit: Justin Johnson (EECS-498-007, UMich)
Attention Layer
Inputs:
Query: $Q$ (shape: $N_q \times D_q$)
Input: $X$ (shape: $N_x \times D_x$)

Layer parameters:
$X \to K$: $W_k$ (shape: $D_x \times D_q$)
$X \to V$: $W_v$ (shape: $D_x \times D_v$)

Compute:
Keys: $K = X W_k$ (shape: $N_x \times D_q$)
Values: $V = X W_v$ (shape: $N_x \times D_v$)
Similarities: $E = Q K^T / \sqrt{D_q}$ (shape: $N_q \times N_x$)
Attention: $A = \mathrm{softmax}(E)$, normalized over the keys (shape: $N_q \times N_x$)
Outputs: $Y = A V$ (shape: $N_q \times D_v$)

[Figure: a grid of similarities $e_{i,j}$ between queries $q_i$ and keys $k_j$; the softmax turns each query's column into weights $a_{i,j}$ that combine the values $v_j$ into outputs $y_i$.]

DL4CV @ Weizmann
Slide credit: Justin Johnson (EECS-498-007, UMich)
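A direct NumPy transcription of the computation above; variable names mirror the slide:

import numpy as np

def softmax(E, axis):
    E = E - E.max(axis=axis, keepdims=True)
    P = np.exp(E)
    return P / P.sum(axis=axis, keepdims=True)

def attention(Q, X, W_k, W_v):
    # Q: (N_q, D_q), X: (N_x, D_x), W_k: (D_x, D_q), W_v: (D_x, D_v)
    K = X @ W_k                          # keys:    (N_x, D_q)
    V = X @ W_v                          # values:  (N_x, D_v)
    E = Q @ K.T / np.sqrt(Q.shape[1])    # similarities: (N_q, N_x)
    A = softmax(E, axis=1)               # attention weights, normalized over the inputs
    return A @ V                         # outputs: (N_q, D_v)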
Self-Attention Layer
Input:
Input: $X$ (shape: $N_x \times D_x$)

Layer parameters:
$X \to Q$: $W_q$ (shape: $D_x \times D_q$)
$X \to K$: $W_k$ (shape: $D_x \times D_q$)
$X \to V$: $W_v$ (shape: $D_x \times D_v$)

Compute:
Queries: $Q = X W_q$ (shape: $N_x \times D_q$)
Keys: $K = X W_k$ (shape: $N_x \times D_q$)
Values: $V = X W_v$ (shape: $N_x \times D_v$)
Similarities: $E = Q K^T / \sqrt{D_q}$ (shape: $N_x \times N_x$)
Attention: $A = \mathrm{softmax}(E)$, normalized over the keys (shape: $N_x \times N_x$)
Outputs: $Y = A V$ (shape: $N_x \times D_v$)

[Figure: the same grid as before, but now the queries are also computed from the input $X$.]

DL4CV @ Weizmann
Slide credit: Justin Johnson (EECS-498-007, UMich)
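The same computation as a self-attention layer, sketched as a small PyTorch module; implementing $W_q$, $W_k$, $W_v$ as linear layers is an assumption, and the dimensions are made up:

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_x, d_q, d_v):
        super().__init__()
        self.W_q = nn.Linear(d_x, d_q, bias=False)
        self.W_k = nn.Linear(d_x, d_q, bias=False)
        self.W_v = nn.Linear(d_x, d_v, bias=False)

    def forward(self, X):                        # X: (N_x, d_x)
        Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)
        E = Q @ K.T / K.shape[-1] ** 0.5         # (N_x, N_x) similarities
        A = E.softmax(dim=-1)                    # each row: attention over all positions
        return A @ V                             # (N_x, d_v)

y = SelfAttention(16, 8, 16)(torch.randn(5, 16))  # 5 tokens in -> 5 tokens out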
Self-Attention in Vision

[Figure: a non-local block: features $x$ ($D_x \times T \times H \times W$) from a CNN backbone are mapped by $1{\times}1{\times}1$ convolutions $W_q$, $W_k$, $W_v$ to queries and keys ($D_q \times T \cdot H \cdot W$) and values ($D_y \times T \cdot H \cdot W$); a softmax over the $(T \cdot H \cdot W) \times (T \cdot H \cdot W)$ similarity matrix combines the values into the output $y$ ($D_y \times T \times H \times W$).]

DL4CV @ Weizmann
Wang, X., Girshick, R., Gupta, A. and He, K., “Non-local neural networks” CVPR (2018)
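A rough PyTorch sketch of such a block for 2D feature maps (spatial-only, single attention head; the residual connection and channel choices are assumptions, see Wang et al. for the exact design):

import torch
import torch.nn as nn

class NonLocalBlock2d(nn.Module):
    def __init__(self, d_x, d_q):
        super().__init__()
        self.q = nn.Conv2d(d_x, d_q, kernel_size=1)   # W_q as a 1x1 conv
        self.k = nn.Conv2d(d_x, d_q, kernel_size=1)   # W_k as a 1x1 conv
        self.v = nn.Conv2d(d_x, d_x, kernel_size=1)   # W_v as a 1x1 conv

    def forward(self, x):                  # x: (B, d_x, H, W)
        B, C, H, W = x.shape
        q = self.q(x).flatten(2)           # (B, d_q, H*W)
        k = self.k(x).flatten(2)           # (B, d_q, H*W)
        v = self.v(x).flatten(2)           # (B, d_x, H*W)
        e = q.transpose(1, 2) @ k / q.shape[1] ** 0.5   # (B, H*W, H*W) similarities
        a = torch.softmax(e, dim=-1)                    # attention over all positions
        y = (v @ a.transpose(1, 2)).view(B, C, H, W)    # weighted sum of the values
        return x + y                        # residual connection (assumed)

out = NonLocalBlock2d(64, 32)(torch.randn(2, 64, 14, 14))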
Self-Attention in Vision

[Figure: a full network: CNN backbone with self-attention (non-local) blocks, followed by average pooling and an MLP head.]
DL4CV @ Weizmann
Wang, X., Girshick, R., Gupta, A. and He, K., “Non-local neural networks” CVPR (2018)
Self-Attention Layer: Properties
$\mathrm{SelfAtt}(\pi(x_1, \dots, x_n)) = \pi\big(\mathrm{SelfAtt}(x_1, \dots, x_n)\big)$

Self-attention is permutation equivariant: permuting the inputs simply permutes the outputs.
So, by itself, it cannot distinguish "I am studying" from "Am I studying".

[Figure: the self-attention grid from the previous slide; nothing in it depends on the order of the inputs.]

DL4CV @ Weizmann
Self-Attention Layer: Properties
Positional Encoding

To restore order information, each input $x_i$ is combined with an encoding of its position $i$ (1, 2, 3, ...) before the self-attention layer.

[Figure: the same self-attention grid, with position indices 1, 2, 3 attached to the inputs $x_1$, $x_2$, $x_3$.]

DL4CV @ Weizmann
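The slide does not specify which encoding is used; one common choice is the sinusoidal encoding from "Attention Is All You Need". A short NumPy sketch:

import numpy as np

def sinusoidal_positions(n_positions, d_model):
    # pe[p, 2i] = sin(p / 10000^(2i/d)), pe[p, 2i+1] = cos(p / 10000^(2i/d))
    assert d_model % 2 == 0
    pos = np.arange(n_positions)[:, None]              # (n, 1)
    i = np.arange(0, d_model, 2)[None, :]              # (1, d/2)
    angles = pos / np.power(10000.0, i / d_model)      # (n, d/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# x_with_pos = x + sinusoidal_positions(x.shape[0], x.shape[1])  # added to the token features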
Multi-head Self-Attention

DL4CV @ Weizmann
Slide credit: Justin Johnson (EECS-498-007, UMich)
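For completeness, PyTorch ships a multi-head attention layer; a minimal self-attention usage (query = key = value = x), with made-up sizes:

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)              # (batch, tokens, embedding dim)
y, attn_weights = mha(x, x, x)          # self-attention: Q, K, V all come from x
print(y.shape, attn_weights.shape)      # (2, 10, 64), (2, 10, 10)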
Three Ways of Processing Sequences

Recurrent Neural Network: works on ordered sequences
(+) Large and adaptive receptive field via the hidden state
(-) Not parallelizable: states must be processed sequentially

1D Convolution: works on multidimensional grids
(-) Fixed receptive field; many layers must be stacked to get a decent one
(+) Highly parallelizable

Self-Attention: works on sets
(+) Receptive field = the entire sequence
(+) Parallelizable
(-) Very memory intensive

DL4CV @ Weizmann
Slide credit: Justin Johnson (EECS-498-007, UMich)
Three Ways of Processing Sequences

The same comparison, with a pointer to the NeurIPS 2017 paper ("Attention Is All You Need") that builds sequence models entirely from self-attention.

DL4CV @ Weizmann
Slide credit: Justin Johnson (EECS-498-007, UMich)
Transformer Layer

Input: $x_1, \dots, x_n$ ($n$ tokens in $D$ dimensions)
Output: $y_1, \dots, y_n$ ($n$ tokens in $D$ dimensions)

Highly scalable
Highly parallelizable

[Figure: a Transformer block: multi-head self-attention and an MLP, each wrapped with Layer Norm and a residual connection.]
DL4CV @ Weizmann
Slide credit: Justin Johnson (EECS-498-007, UMich)
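A compact PyTorch sketch of such a block; the pre-norm placement and the hidden sizes are assumptions, only the overall structure comes from the slide:

import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d=64, n_heads=8, d_mlp=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, d_mlp), nn.GELU(), nn.Linear(d_mlp, d))

    def forward(self, x):                      # x: (batch, n tokens, d)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]          # multi-head self-attention + residual
        x = x + self.mlp(self.norm2(x))        # MLP + residual
        return x

y = TransformerLayer()(torch.randn(2, 10, 64))  # same shape in and out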
Sequence to Sequence

[Figure: an encoder-decoder sequence-to-sequence model in which the decoder attends to the encoder outputs: a softmax over key/query similarities weights the encoder values.]

DL4CV @ Weizmann

[Figure: the same structure built from attention blocks: an encoder stack and a decoder stack connected by attention.]
DL4CV @ Weizmann
Transformers Network

Pretraining:
Download a LOT of text from the internet
Train a Transformer network using self-supervision

Finetuning:
Fine-tune the Transformer to the specific NLP task at hand

Model       Layers   Width (D)   #Heads   #Params   Data
BERT-base   12       768         12       110M      13GB
BERT-large  24       1,024       16       340M      13GB
GPT-2       48       1,600       ?        1.5B      40GB
GPT-3       96       12,288      96       175B      694GB

DL4CV @ Weizmann
Example of GPT-3 generated text

chat.openai.com

DL4CV @ Weizmann
Final Project – team up deadline

DL4CV @ Weizmann
What’s next?
Tomorrow:
AI & Robotics Seminar (not for credit)
Assaf Shocher

Next lecture:
Vision Transformers – ViT (Shai)

DL4CV @ Weizmann
