Lecture 2
14, 2022
Ming Li
01. Word2Vec
Plan: We trace back through history to see how attention and transformers emerged
02 Attention and Transformers

2. CNN
A CNN is a neural network with some convolutional layers
(and some other layers). A convolutional layer has a number
of filters that perform the convolution operation.
Each filter detects a small pattern (here 3 x 3) in the input.
[Figure: Convolution operation. Filter 1 = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]] slides over a 6 x 6 binary input with stride 1; the dot product of the filter with the first 3 x 3 patch of the input is 3, and with the next patch -1.]
[Figure: Sliding Filter 1 over the whole 6 x 6 input with stride 1 produces the 4 x 4 feature map [[3, -1, -3, -1], [-3, 1, 0, -3], [-3, -3, 0, 1], [3, -2, -2, -1]].]
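A minimal sketch of this computation in NumPy; the input and filter values are taken from the figure above, and `conv2d_valid` is just an illustrative helper name:

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Valid (no-padding) 2D convolution as in the slide's example."""
    h, w = kernel.shape
    out_h = (image.shape[0] - h) // stride + 1
    out_w = (image.shape[1] - w) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + h, j * stride:j * stride + w]
            out[i, j] = np.sum(patch * kernel)   # dot product of patch and filter
    return out

image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])
filter1 = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])

print(conv2d_valid(image, filter1))
# [[ 3. -1. -3. -1.]
#  [-3.  1.  0. -3.]
#  [-3. -3.  0.  1.]
#  [ 3. -2. -2. -1.]]
```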
3. RNN
Parameters to be learned: U, V, W
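The slide's RNN diagram is not reproduced here; as a reference, a minimal sketch of one step of a vanilla (Elman-style) RNN, assuming the usual convention that U maps the input, W the previous hidden state, and V the hidden state to the output:

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V):
    """One time step of a vanilla RNN (assumed standard formulation)."""
    h_t = np.tanh(U @ x_t + W @ h_prev)   # new hidden state from input and previous state
    y_t = V @ h_t                         # output at time t
    return h_t, y_t

# U, W, V are shared across all time steps -- these are the parameters to learn.
d_in, d_h, d_out = 100, 64, 100
U = np.random.randn(d_h, d_in)
W = np.random.randn(d_h, d_h)
V = np.random.randn(d_out, d_h)

h = np.zeros(d_h)
for x_t in np.random.randn(5, d_in):      # unroll over a length-5 input sequence
    h, y = rnn_step(x_t, h, U, W, V)
```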
LSTM
Attention
Image caption generation using attention
[Figure: A CNN produces a feature vector for each image region. At each decoding step, the state (z0, z1, z2, ...) attends to the regions; the attention weights (e.g. 0.7, 0.1, 0.1, 0.1, 0.0, 0.0 when generating Word 1, and 0.0, 0.8, 0.2, 0.0, 0.0, 0.0 when generating Word 2) define a weighted sum of the region vectors that conditions the next word.]
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio,
“Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015
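A minimal sketch of the attention-weighted sum over region vectors; the region vectors are random placeholders and only the weights come from the figure:

```python
import numpy as np

# One feature vector per image region (values are illustrative, not from the paper).
regions = np.random.randn(6, 512)                        # 6 regions, 512-dim each

# Attention weights when generating Word 1, as in the figure.
weights_word1 = np.array([0.7, 0.1, 0.1, 0.1, 0.0, 0.0])

# The context fed to the decoder is the weighted sum of region vectors.
context_word1 = weights_word1 @ regions                  # shape (512,)
```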
Transformer
Encoder
[Figure: each 512-dimensional input embedding x_i is projected to 64-dimensional query, key, and value vectors:]
q_i = x_i W_Q
k_i = x_i W_K
v_i = x_i W_V
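A minimal sketch of these projections, assuming the dimensions shown (512-dimensional embeddings, 64-dimensional queries, keys, and values; three input tokens are used for illustration):

```python
import numpy as np

d_model, d_k = 512, 64

# Learned projection matrices (randomly initialized here for illustration).
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

X = np.random.randn(3, d_model)   # three input embeddings x_1, x_2, x_3

Q = X @ W_Q    # queries, shape (3, 64)
K = X @ W_K    # keys,    shape (3, 64)
V = X @ W_V    # values,  shape (3, 64)
```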
Self Attention

Formula: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,
where d_k = 64 is the dimension of the key vectors.
The output for each position is a weighted sum of the value vectors, e.g. z_1 = 0.88 v_1 + 0.12 v_2.
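A minimal sketch of scaled dot-product self-attention matching this formula (random values, two tokens for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V, d_k=64):
    scores = Q @ K.T / np.sqrt(d_k)      # scaled dot products between queries and keys
    weights = softmax(scores, axis=-1)   # each row sums to 1 (e.g. [0.88, 0.12])
    return weights @ V                   # z_i = weighted sum of the value vectors

Q = np.random.randn(2, 64)   # two tokens, 64-dim queries
K = np.random.randn(2, 64)
V = np.random.randn(2, 64)
Z = self_attention(Q, K, V)  # Z[0] plays the role of z_1 on the slide
```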
Multiple heads
1. It expands the model's ability to focus on different positions.
2. It gives the attention layer multiple "representation subspaces".
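A minimal sketch of multi-head attention, assuming 8 heads of dimension 64 over a 512-dimensional model; the head outputs are concatenated and projected back with a matrix W_O (names and sizes are illustrative):

```python
import numpy as np

d_model, num_heads = 512, 8
d_k = d_model // num_heads                        # 64 dimensions per head

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

# One set of Q/K/V projections per head, plus the output projection W_O.
W_Q = np.random.randn(num_heads, d_model, d_k)
W_K = np.random.randn(num_heads, d_model, d_k)
W_V = np.random.randn(num_heads, d_model, d_k)
W_O = np.random.randn(num_heads * d_k, d_model)

def multi_head_attention(X):
    # Each head attends in its own 64-dim "representation subspace".
    heads = [attention(X @ W_Q[h], X @ W_K[h], X @ W_V[h]) for h in range(num_heads)]
    # Concatenate the heads and project back to d_model with W_O.
    return np.concatenate(heads, axis=-1) @ W_O

X = np.random.randn(2, d_model)        # two input tokens
Z = multi_head_attention(X)            # shape (2, 512)
```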
The next layer is expecting a single matrix (one row per word, e.g. 2 x 4 in the figure), hence the outputs of the heads are concatenated and multiplied by an additional weight matrix W_O.
The encoder-decoder attention is just like self-attention, except it uses K and V from the top of the encoder output, and its own Q.
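A minimal sketch of that difference: queries come from the decoder, keys and values from the encoder output (all tensors here are random placeholders):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, d_k = 512, 64
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

enc_out = np.random.randn(5, d_model)    # top of the encoder stack (5 source tokens)
dec_in  = np.random.randn(3, d_model)    # decoder states (3 target tokens so far)

Q = dec_in  @ W_Q                        # queries come from the decoder itself
K = enc_out @ W_K                        # keys ...
V = enc_out @ W_V                        # ... and values come from the encoder output
Z = softmax(Q @ K.T / np.sqrt(d_k)) @ V  # each target position attends over the source
```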
Decoder's Output: the Linear Layer

How it works

But what about self-attention in the decoder? (It is masked, so each position can only attend to earlier positions.)
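A minimal sketch of this final step, assuming an illustrative vocabulary of 10,000 words: the linear layer maps the decoder output to one score per word, and a softmax turns the scores into probabilities:

```python
import numpy as np

d_model, vocab_size = 512, 10_000
W_vocab = np.random.randn(d_model, vocab_size)   # the final linear layer

dec_output = np.random.randn(d_model)            # decoder output for the current position
logits = dec_output @ W_vocab                    # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax: probability of each next word
next_word = int(np.argmax(probs))                # greedy choice of the next word
```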
We can also optimize over two words at a time using BEAM search: keep a few alternatives for the first word.
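A minimal sketch of beam search over next-word probabilities; the `step_probs` function standing in for a decoder pass is a hypothetical placeholder:

```python
import numpy as np

def beam_search(step_probs, beam_size=3, max_len=10):
    """Keep the beam_size most probable partial outputs at every step.

    step_probs(prefix) must return a probability distribution over the next word;
    it stands in for a full decoder pass and is an assumed placeholder.
    """
    beams = [([], 0.0)]                               # (word ids so far, log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            probs = step_probs(prefix)                # decoder's next-word distribution
            for w in np.argsort(probs)[-beam_size:]:  # expand only the most likely words
                candidates.append((prefix + [int(w)], logp + np.log(probs[w])))
        # Keep only the beam_size best alternatives overall.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]                                # the most probable sequence found

# Example with a dummy uniform model over a 100-word vocabulary:
# beam_search(lambda prefix: np.full(100, 1 / 100))
```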
Transformer Results
Next Lecture: Pretraining
BERT, GPT
Resources:
https://nlp.seas.harvard.edu/2018/04/03/attention.html (an excellent explanation of the transformer model, with code)
Jay Alammar, The Illustrated Transformer (from which I borrowed many pictures): http://jalammar.github.io/illustrated-transformer/