lecture15_transformer
• Attention output: outt = ∑j αt,j vj
• Yt = g2(outt; θ)
Issues of Vanilla Self-Attention
• Attention is order-invariant
• Lack of non-linearities
• The output is just a simple weighted average of the values
• Attention output: outt = ∑j αt,j vj
• Idea: position encoding:
• pi: an embedding vector (feature) of position i
• (kt, vt, qt ) = g1([Xt, pt]; θ)
• Fix:
• Add an MLP to process outi
• mi = MLP(outi) = W2 ReLU(W1 outi + b1) + b2
• Usually no activation layer right before the softmax
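A minimal NumPy sketch of the two fixes combined (position encoding concatenated to the input, plus an output MLP); the function name, weight names, and shapes are illustrative assumptions, not the lecture's reference code.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_with_fixes(X, P, Wq, Wk, Wv, W1, b1, W2, b2):
    # X: (T, d) input features; P: (T, dp) position embeddings p_1..p_T.
    H = np.concatenate([X, P], axis=-1)      # [X_t, p_t] as in the slide
    Q, K, V = H @ Wq, H @ Wk, H @ Wv         # (k_t, v_t, q_t) = g1([X_t, p_t]; theta)
    alpha = softmax(Q @ K.T, axis=-1)        # alpha[t, j]
    out = alpha @ V                          # out_t = sum_j alpha[t, j] v_j
    # The MLP adds the missing non-linearity: m_t = W2 ReLU(W1 out_t + b1) + b2
    return np.maximum(out @ W1 + b1, 0.0) @ W2 + b2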
Masked Attention
• In language model decoder: P(Yt | Xi<t )
• outt cannot look at future Xi>t
• Masked attention
• Compute ei,j = qi⊤kj as usual
• Mask out future positions by setting ei,j = −∞ for j > i
• e ⊙ (1 − M) ← −∞, i.e., entries with M = 0 become −∞
• M is a fixed 0/1 mask matrix
• Then compute αi = softmax(ei)
• Remarks:
• M = all ones gives full self-attention
• Set M to encode an arbitrary dependency ordering
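A small NumPy sketch of masked attention with a 0/1 mask matrix M; the function and variable names are illustrative, not from the lecture.

import numpy as np

def masked_attention(Q, K, V, M):
    # Q, K, V: (T, d); M: (T, T) 0/1 matrix, M[i, j] = 1 iff i may attend to j.
    e = Q @ K.T                                # e[i, j] = q_i^T k_j, as usual
    e = np.where(M == 1, e, -np.inf)           # masked entries -> -inf
    e = e - e.max(axis=-1, keepdims=True)      # numerically stable softmax per row
    a = np.exp(e)
    alpha = a / a.sum(axis=-1, keepdims=True)  # alpha_i = softmax(e_i)
    return alpha @ V

# Causal decoder mask: position i may only look at j <= i, e.g.
# M = np.tril(np.ones((T, T))); M = all ones recovers full self-attention.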
Transformer
Transformer-based sequence-to-sequence modeling
Key-query-value attention
• Obtain qt, vt, kt from Xt
• qt = Wq Xt; vt = Wv Xt; kt = Wk Xt (position encoding omitted)
• Wq, Wv, Wk are learnable weight matrices
• αi,j = softmax(qi⊤kj); outi = ∑j αi,j vj
• Intuition: key, query, and value can focus on different parts of input
Multi-headed attention
• Standard attention: single-headed attention
• Xt ∈ ℝd, Q, K, V ∈ ℝd×d
• We only look at a single position j with high αi,j
• What if we want to look at different j for different reasons?
• Idea: define h separate attention heads
• h different attention distributions, keys, values, and queries
• Qℓ, Kℓ, Vℓ ∈ ℝd×(d/h) for 1 ≤ ℓ ≤ h
• αi,jℓ = softmax((qiℓ)⊤kjℓ); outiℓ = ∑j αi,jℓ vjℓ
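A rough NumPy sketch of h-headed attention, splitting the d projected dimensions into h heads of size d/h; the layout and the final concatenation step are standard-practice assumptions rather than text from the slide.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, h):
    # X: (T, d); Wq, Wk, Wv: (d, d). Columns are split into h heads of size d/h,
    # i.e. Q^l, K^l, V^l are the d x (d/h) slices of the full matrices.
    T, d = X.shape
    dh = d // h
    Q = (X @ Wq).reshape(T, h, dh).transpose(1, 0, 2)   # (h, T, d/h)
    K = (X @ Wk).reshape(T, h, dh).transpose(1, 0, 2)
    V = (X @ Wv).reshape(T, h, dh).transpose(1, 0, 2)
    alpha = softmax(Q @ K.transpose(0, 2, 1), axis=-1)  # one (T, T) distribution per head
    out = alpha @ V                                     # out_i^l = sum_j alpha^l_{i,j} v_j^l
    return out.transpose(1, 0, 2).reshape(T, d)         # concatenate the h head outputs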
Transformer
Transformer-based sequence-to-sequence modeling
• Enhancements:
• Key-query-value attention
• Multi-headed attention
• Architecture modifications:
• Residual connection
• Layer normalization
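An illustrative sketch of where the residual connections and layer normalization sit in one encoder block (post-norm arrangement, as in the original Transformer); attn and mlp stand for the multi-headed attention and position-wise MLP pieces defined earlier, and the learnable LayerNorm gain/bias are omitted.

import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(X, attn, mlp):
    # attn: (T, d) -> (T, d) multi-headed self-attention; mlp: per-position MLP.
    H = layer_norm(X + attn(X))    # residual connection + LayerNorm around attention
    return layer_norm(H + mlp(H))  # residual connection + LayerNorm around the MLP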
Transformer
Machine translation with transformer
Transformer
• Limitations of transformer: Quadratic computation cost
• Linear for RNNs
• Large cost for long sequences, e.g., L > 10^4 (see the quick cost estimate after this list)
• Follow-ups:
• Large-scale training: Transformer-XL, XLNet ('19)
• Projection tricks to O(L): Linformer ('20)
• Math tricks to O(L): Performer (‘20)
• Sparse interactions: Big Bird (‘20)
• Deeper transformers: DeepNet (’22)
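A quick back-of-the-envelope estimate of the quadratic cost (numbers illustrative): self-attention forms an L × L score matrix, so for L = 10^4 it computes L^2 = 10^8 attention scores per head per layer, whereas an RNN performs only L = 10^4 (sequential) state updates.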
Transformer for Images
• Vision Transformer (’21)
• Decompose an image into 16×16 patches and then apply a transformer encoder
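A rough NumPy sketch of the ViT front end: split the image into 16×16 patches and linearly project each flattened patch into a token; the image size, projection matrix, and (H, W, C) layout are assumptions for illustration.

import numpy as np

def patchify(img, patch=16):
    # img: (H, W, C) with H and W divisible by `patch`; returns one row per patch.
    H, W, C = img.shape
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)      # (num_patches, p*p*C)

img = np.random.rand(224, 224, 3)
tokens = patchify(img)                           # (196, 768) flattened patches
W_embed = 0.01 * np.random.randn(16 * 16 * 3, 768)
X = tokens @ W_embed                             # patch tokens fed to the transformer encoder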
Transformer for Images
• Swin Transformer (’21)
• Build hierarchical feature maps at different resolutions
• Self-attention only within each local window (block)
• Shifted window partitions let information flow between neighboring blocks
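A toy NumPy sketch of the Swin-style window partition: self-attention is computed inside each M×M window, and in the next block the feature map is cyclically shifted before partitioning so that information crosses window boundaries; window size and shapes are illustrative.

import numpy as np

def window_partition(feat, M=7):
    # feat: (H, W, C) feature map with H and W divisible by M.
    H, W, C = feat.shape
    x = feat.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)               # (num_windows, M*M, C): tokens per window

def shifted_window_partition(feat, M=7):
    # Cyclic shift by half a window before partitioning (shifted partition).
    shifted = np.roll(feat, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
    return window_partition(shifted, M)

feat = np.random.rand(56, 56, 96)
wins = window_partition(feat)                    # self-attention runs within each window
wins_shifted = shifted_window_partition(feat)    # next block uses the shifted partition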
CNN vs. RNN vs. Attention
Summary
• Language models & sequence-to-sequence models:
• Fundamental ideas and methods for sequence modeling
• Attention mechanism
• So far the most successful idea for sequence data in deep learning
• A scale/order-invariant representation
• Transformer: a fully attention-based architecture for sequence data
• Transformer + pretraining: the core idea behind today’s NLP systems