
Attention Mechanism

Seq2Seq with Attention


Seq2Seq with Attention
Summary
• Input sequence X, encoder f_enc, and decoder f_dec
• f_enc(X) produces hidden states h_1^enc, h_2^enc, …, h_N^enc
• On time step t, we have decoder hidden state h_t
• Compute attention scores e_i = h_t^⊤ h_i^enc
• Compute attention distribution α_i = P_att(X_i) = softmax(e)_i
• Attention output: h_att^enc = Σ_i α_i h_i^enc
• Y_t ∼ g(h_t, h_att^enc; θ)
• Sample an output using both h_t and h_att^enc (a small sketch of this step follows)
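A minimal NumPy sketch of this attention step, assuming the encoder states and the decoder state are already given as arrays (all function and variable names here are illustrative, not from the lecture):

import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def seq2seq_attention(h_dec_t, h_enc):
    # h_dec_t: (d,) decoder state h_t; h_enc: (N, d) encoder states h_i^enc
    e = h_enc @ h_dec_t            # attention scores e_i = h_t^T h_i^enc, shape (N,)
    alpha = softmax(e)             # attention distribution over the N source positions
    h_att = alpha @ h_enc          # attention output: weighted sum of encoder states, (d,)
    return h_att, alpha

rng = np.random.default_rng(0)
h_enc = rng.normal(size=(5, 8))    # N = 5 source positions, d = 8
h_dec_t = rng.normal(size=8)
h_att, alpha = seq2seq_attention(h_dec_t, h_enc)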
Key-query-value attention
• Obtain q_t, v_t, k_t from X_t
• q_t = W^q X_t; v_t = W^v X_t; k_t = W^k X_t
• W^q, W^v, W^k are learnable weight matrices
• α_{i,j} = softmax_j(q_i^⊤ k_j); out_i = Σ_j α_{i,j} v_j (see the sketch below)
• Intuition: key, query, and value can focus on different parts of the input
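A short sketch of the key-query-value projections and one attention output; the matrix shapes and names are illustrative assumptions:

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

d = 8
rng = np.random.default_rng(1)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
X = rng.normal(size=(6, d))              # a sequence of 6 input vectors X_t

Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T   # q_t = W^q X_t, k_t = W^k X_t, v_t = W^v X_t

i = 2                                    # pick one output position
alpha_i = softmax(Q[i] @ K.T)            # alpha_{i,j} = softmax_j(q_i^T k_j)
out_i = alpha_i @ V                      # out_i = sum_j alpha_{i,j} v_j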
Attention is all you need (Vaswani ’17)
• A pure attention-based architecture for sequence modeling
• No RNN at all!
• Basic component: self-attention, Y = f_SA(X; θ)
  • X_t uses attention over the entire X sequence
  • Y_t is computed from X_t and the attention output
• Computing Y_t (sketched in matrix form below):
  • Key k_t, value v_t, query q_t from X_t
  • (k_t, v_t, q_t) = g_1(X_t; θ)
  • Attention distribution α_{t,j} = softmax_j(q_t^⊤ k_j)
  • Attention output out_t = Σ_j α_{t,j} v_j
  • Y_t = g_2(out_t; θ)
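The whole layer in matrix form, as a sketch: g_1 is taken to be linear projections and g_2 a linear output map, which is one common choice but an assumption here, and the 1/√d scaling used in the paper is omitted to match the slides:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, Wo):
    # X: (T, d); returns Y: (T, d)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv   # g1: queries, keys, values from X
    alpha = softmax(Q @ K.T)           # (T, T); row t holds alpha_{t,j}
    out = alpha @ V                    # out_t = sum_j alpha_{t,j} v_j
    return out @ Wo                    # g2: map the attention output to Y_t

T, d = 6, 8
rng = np.random.default_rng(2)
X = rng.normal(size=(T, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
Y = self_attention(X, Wq, Wk, Wv, Wo)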
Issues of Vanilla Self-Attention
• Attention is order-invariant (see the small numerical check after this list)
• Lack of non-linearities
  • The outputs are just weighted averages of the values
• No built-in support for autoregressive modeling
  • In generation tasks, the model cannot “look at the future”
  • e.g. text generation: Y_t can only depend on X_{i<t}
  • But vanilla self-attention attends over the entire sequence
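The order-invariance point can be checked numerically: with a plain self-attention (queries = keys = values = X, no position information), permuting the inputs just permutes the outputs, so the representation carries no order information. A small illustrative check:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def plain_self_attention(X):
    # queries = keys = values = X: pure weighted averaging, no position information
    return softmax(X @ X.T) @ X

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 4))
perm = rng.permutation(5)
Y, Y_perm = plain_self_attention(X), plain_self_attention(X[perm])
assert np.allclose(Y[perm], Y_perm)   # shuffling the inputs only shuffles the outputs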
Position Encoding
• Vanilla self-attention
  • (k_t, v_t, q_t) = g_1(X_t; θ)
  • α_{t,j} = softmax_j(q_t^⊤ k_j)
  • Attention output out_t = Σ_j α_{t,j} v_j
• Idea: position encoding
  • p_i: an embedding vector (feature) of position i
  • (k_t, v_t, q_t) = g_1([X_t, p_t]; θ)
• In practice, additive is sufficient: k_t ← k̃_t + p_t, q_t ← q̃_t + p_t, v_t ← ṽ_t + p_t, where (k̃_t, ṽ_t, q̃_t) = g_1(X_t; θ) (both variants are sketched below)
• p_t is only included in the first layer
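A sketch of the two variants; the random matrix P stands in for whichever position-encoding design below is used, and the linear g_1 is an illustrative assumption:

import numpy as np

rng = np.random.default_rng(4)
T, d = 6, 8
X = rng.normal(size=(T, d))
P = rng.normal(size=(T, d))       # p_t = P[t]: position feature (any of the designs below)

# Option 1: concatenation, (k_t, v_t, q_t) = g1([X_t, p_t]; theta)
X_aug = np.concatenate([X, P], axis=-1)          # (T, 2d)

# Option 2 (additive, used in practice): compute k~, q~, v~ from X alone, then add p_t
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
K = X @ Wk + P                                   # k_t = k~_t + p_t
Q = X @ Wq + P                                   # q_t = q~_t + p_t
V = X @ Wv + P                                   # v_t = v~_t + p_t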
Position Encoding
p_t design 1: Sinusoidal position representation (implementation sketched below)
• Pros:
  • Simple
  • Naturally models “relative position”
  • Easily applied to long sequences
• Cons:
  • Not learnable
  • Generalizes poorly to sequences longer than those seen in training
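A common implementation of design 1, following the sin/cos formulation of the "Attention is all you need" paper (the constant 10000 is the paper's choice; d is assumed even):

import numpy as np

def sinusoidal_position_encoding(T, d):
    # P[t, 2i] = sin(t / 10000^(2i/d)), P[t, 2i+1] = cos(t / 10000^(2i/d)); assumes d is even
    positions = np.arange(T)[:, None]                   # (T, 1)
    rates = np.power(10000.0, np.arange(0, d, 2) / d)   # (d/2,)
    P = np.zeros((T, d))
    P[:, 0::2] = np.sin(positions / rates)
    P[:, 1::2] = np.cos(positions / rates)
    return P

P = sinusoidal_position_encoding(T=50, d=16)            # p_t = P[t]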
Position Encoding
p_t design 2: Learned representation
• Assume a maximum length L and learn a matrix p ∈ ℝ^{d×L}; p_t is the t-th column of p
• Pros:
  • Flexible
  • Learnable and more powerful
• Cons:
  • Needs to assume a fixed maximum length L
  • Does not work at all for lengths above L

• p_t design 3: Relative position representation (Shaw, Uszkoreit, Vaswani ’18)

Combine Self-Attention with Nonlinearity
• Vanilla self-attention
  • No element-wise activation (e.g., ReLU, tanh)
  • Only weighted averages and the softmax operator
• Fix: add an MLP to process out_i (sketched below)
  • m_i = MLP(out_i) = W_2 ReLU(W_1 out_i + b_1) + b_2
  • Usually no activation layer is placed right before the softmax
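A sketch of the position-wise MLP applied to each attention output out_i; the hidden width d_ff and the zero biases are illustrative choices:

import numpy as np

def position_wise_mlp(out, W1, b1, W2, b2):
    # applies m_i = W2 ReLU(W1 out_i + b1) + b2 to every position i independently
    hidden = np.maximum(out @ W1.T + b1, 0.0)   # ReLU non-linearity
    return hidden @ W2.T + b2

rng = np.random.default_rng(5)
T, d, d_ff = 6, 8, 32
out = rng.normal(size=(T, d))                   # attention outputs out_i
W1, b1 = rng.normal(size=(d_ff, d)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d, d_ff)), np.zeros(d)
m = position_wise_mlp(out, W1, b1, W2, b2)      # (T, d)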
Masked Attention
• In a language model decoder: P(Y_t | X_{i<t})
  • out_t cannot look at future inputs X_{i>t}
• Masked attention (sketched below)
  • Compute e_{i,j} = q_i^⊤ k_j as usual
  • Mask out the future by setting e_{i,j} = −∞ for j > i
  • Equivalently, e ⊙ (1 − M) ← −∞: entries where M = 0 are set to −∞
  • M is a fixed 0/1 mask matrix
  • Then compute α_i = softmax(e_i)
• Remarks:
  • M = all ones recovers full self-attention
  • Set M for an arbitrary dependency ordering
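A sketch of the masking step with a lower-triangular 0/1 mask, so position i only attends to j ≤ i; the −∞ is realized with -np.inf, which softmax turns into an exact zero weight:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V, M):
    # Q, K, V: (T, d); M: (T, T) 0/1 mask, M[i, j] = 1 means position i may attend to j
    e = Q @ K.T                            # e_{i,j} = q_i^T k_j, computed as usual
    e = np.where(M == 1, e, -np.inf)       # entries where M == 0 are set to -inf
    alpha = softmax(e)                     # masked entries get weight exactly 0
    return alpha @ V

rng = np.random.default_rng(6)
T, d = 5, 8
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
M_causal = np.tril(np.ones((T, T)))        # causal mask: position i sees only j <= i
out = masked_attention(Q, K, V, M_causal)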
Transformer
Transformer-based sequence-to-sequence modeling
Key-query-value attention
• Obtain q_t, v_t, k_t from X_t
• q_t = W^q X_t; v_t = W^v X_t; k_t = W^k X_t (position encoding omitted)
• W^q, W^v, W^k are learnable weight matrices
• α_{i,j} = softmax_j(q_i^⊤ k_j); out_i = Σ_j α_{i,j} v_j
• Intuition: key, query, and value can focus on different parts of the input
Multi-headed attention
• Standard attention: single-headed attention
  • X_t ∈ ℝ^d, Q, K, V ∈ ℝ^{d×d}
  • We only look at a single position j with high α_{i,j}
  • What if we want to look at different j for different reasons?
• Idea: define h separate attention heads (sketched below)
  • h different attention distributions, keys, values, and queries
  • Q^ℓ, K^ℓ, V^ℓ ∈ ℝ^{d×(d/h)} for 1 ≤ ℓ ≤ h
  • α^ℓ_{i,j} = softmax_j((q^ℓ_i)^⊤ k^ℓ_j); out^ℓ_i = Σ_j α^ℓ_{i,j} v^ℓ_j
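A sketch of h heads with per-head projections of size d × d/h; concatenating the per-head outputs at the end (and any final output projection, omitted here) follows common practice and is an assumption, not spelled out on the slide:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, h):
    # X: (T, d); Wq, Wk, Wv: lists of h matrices of shape (d, d/h)
    heads = []
    for l in range(h):
        Q, K, V = X @ Wq[l], X @ Wk[l], X @ Wv[l]   # per-head q, k, v of size d/h
        alpha = softmax(Q @ K.T)                    # alpha^l_{i,j}
        heads.append(alpha @ V)                     # out^l_i = sum_j alpha^l_{i,j} v^l_j
    return np.concatenate(heads, axis=-1)           # (T, d): concatenate the h heads

T, d, h = 6, 8, 2
rng = np.random.default_rng(7)
X = rng.normal(size=(T, d))
Wq = [rng.normal(size=(d, d // h)) for _ in range(h)]
Wk = [rng.normal(size=(d, d // h)) for _ in range(h)]
Wv = [rng.normal(size=(d, d // h)) for _ in range(h)]
Y = multi_head_attention(X, Wq, Wk, Wv, h)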
Transformer
Transformer-based sequence-to-sequence modeling

• Basic building blocks: self-attention


• Position encoding
• Post-processing MLP
• Attention mask

• Enhancements:
• Key-query-value attention
• Multi-headed attention
• Architecture modifications (a minimal block combining these is sketched after this list):
• Residual connection
• Layer normalization
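Putting the pieces together, a minimal single-head, post-norm encoder block might look like the sketch below; the exact ordering of residual connections and layer norm varies across implementations, so this is one plausible arrangement rather than the definitive one:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # normalize each position's feature vector to zero mean / unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(X, Wq, Wk, Wv, W1, W2):
    # X: (T, d); attention -> residual + layer norm -> MLP -> residual + layer norm
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    attn = softmax(Q @ K.T) @ V
    X = layer_norm(X + attn)                    # residual connection + layer norm
    mlp = np.maximum(X @ W1, 0.0) @ W2          # two-layer MLP with ReLU
    return layer_norm(X + mlp)                  # second residual + layer norm

T, d, d_ff = 6, 8, 32
rng = np.random.default_rng(8)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))
Y = transformer_block(X, Wq, Wk, Wv, W1, W2)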
Transformer
Machine translation with transformer
Transformer
• Limitation of the transformer: quadratic computation cost in sequence length
  • Linear for RNNs
  • Large cost for long sequences, e.g., L > 10^4
• Follow-ups:
  • Large-scale training: Transformer-XL, XLNet (’19)
  • Projection tricks to O(L): Linformer (’20)
  • Math tricks to O(L): Performer (’20)
  • Sparse interactions: Big Bird (’20)
  • Deeper transformers: DeepNet (’22)
Transformer for Images
• Vision Transformer (’21)
• Decompose an image into 16×16 patches and then apply a transformer encoder (a patchify sketch is shown below)
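A sketch of the patch-decomposition step: reshape an H × W × C image into a sequence of flattened 16 × 16 patches that the transformer encoder treats as tokens (the linear projection to the embedding dimension and the class token are omitted):

import numpy as np

def patchify(image, patch=16):
    # image: (H, W, C) with H, W divisible by `patch`
    # returns (num_patches, patch*patch*C): one flattened patch per row
    H, W, C = image.shape
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)                  # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)

tokens = patchify(np.zeros((224, 224, 3)))          # (196, 768) for a 224x224 RGB image
assert tokens.shape == (196, 768)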
Transformer for Images
• Swin Transformer (’21)
• Builds hierarchical feature maps at different resolutions
• Self-attention is computed only within each local block (window)
• Shifted block partitions exchange information across blocks
CNN vs. RNN vs. Attention
Summary
• Language models & sequence-to-sequence models:
• Fundamental ideas and methods for sequence modeling

• Attention mechanism
• So far the most successful idea for sequence data in deep learning
• A scale/order-invariant representation
• Transformer: a fully attention-based architecture for sequence data
• Transformer + Pretraining: the core idea in today’s NLP tasks

• LSTMs are still useful in lightweight scenarios


Other architectures
Graph Neural Networks
Graph Neural Networks
Geometric Deep Learning
