lecture15_transformer
• Attention output: outt = ∑j αt,j vj
• Yt = g2(outt; θ)
Issues of Vanilla Self-Attention
• Attention is order-invariant
• Lack of non-linearities
• The output is just a simple weighted average of the values
• Attention output: outt = ∑j αt,j vj
• Idea: position encoding:
• pi: an embedding vector (feature) of position i
• (kt, vt, qt ) = g1([Xt, pt]; θ)
• Fix:
• Add an MLP to process outi
• mi = MLP(outi) = W2 ReLU(W1 outi + b1) + b2
• Usually no activation layer right before the softmax
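A minimal NumPy sketch of the two fixes combined (position encoding concatenated to the input, plus an output MLP); the function name, weight names, and shapes are illustrative assumptions, not the lecture's reference code.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_with_fixes(X, P, Wq, Wk, Wv, W1, b1, W2, b2):
    # X: (T, d) input features; P: (T, dp) position embeddings p_1..p_T.
    H = np.concatenate([X, P], axis=-1)      # [X_t, p_t] as in the slide
    Q, K, V = H @ Wq, H @ Wk, H @ Wv         # (k_t, v_t, q_t) = g1([X_t, p_t]; theta)
    alpha = softmax(Q @ K.T, axis=-1)        # alpha[t, j]
    out = alpha @ V                          # out_t = sum_j alpha[t, j] v_j
    # The MLP adds the missing non-linearity: m_t = W2 ReLU(W1 out_t + b1) + b2
    return np.maximum(out @ W1 + b1, 0.0) @ W2 + b2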
Masked Attention
• In language model decoder: P(Yt | Xi<t )
• outt cannot look at future Xi>t
• Masked attention
• Compute ei,j = qi⊤kj as usual
• Mask out future positions by setting ei,j = −∞ for j > i
• e ⊙ (1 − M) ← −∞, i.e., entries with M = 0 become −∞
• M is a fixed 0/1 mask matrix
• Then compute αi = softmax(ei)
• Remarks:
• M = all ones gives full self-attention
• Set M to encode an arbitrary dependency ordering
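A small NumPy sketch of masked attention with a 0/1 mask matrix M; the function and variable names are illustrative, not from the lecture.

import numpy as np

def masked_attention(Q, K, V, M):
    # Q, K, V: (T, d); M: (T, T) 0/1 matrix, M[i, j] = 1 iff i may attend to j.
    e = Q @ K.T                                # e[i, j] = q_i^T k_j, as usual
    e = np.where(M == 1, e, -np.inf)           # masked entries -> -inf
    e = e - e.max(axis=-1, keepdims=True)      # numerically stable softmax per row
    a = np.exp(e)
    alpha = a / a.sum(axis=-1, keepdims=True)  # alpha_i = softmax(e_i)
    return alpha @ V

# Causal decoder mask: position i may only look at j <= i, e.g.
# M = np.tril(np.ones((T, T))); M = all ones recovers full self-attention.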
Transformer
Transformer-based sequence-to-sequence modeling
Key-query-value attention
• Obtain qt, vt, kt from Xt
• qt = Wq Xt; vt = Wv Xt; kt = Wk Xt (position encoding omitted)
• Wq, Wv, Wk are learnable weight matrices
• αi,j = softmax(qi⊤kj); outi = ∑j αi,j vj
• Intuition: key, query, and value can focus on different parts of input
Multi-headed attention
• Standard attention: single-headed attention
• Xt ∈ ℝd, Q, K, V ∈ ℝd×d
• We only look at a single position j with high αi,j
• What if we want to look at different j for different reasons?
• Idea: define h separate attention heads
• h different attention distributions, keys, values, and queries
• Qℓ, Kℓ, Vℓ ∈ ℝd×(d/h) for 1 ≤ ℓ ≤ h
• αi,jℓ = softmax((qiℓ)⊤kjℓ); outiℓ = ∑j αi,jℓ vjℓ
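A rough NumPy sketch of h-headed attention, splitting the d projected dimensions into h heads of size d/h; the layout and the final concatenation step are standard-practice assumptions rather than text from the slide.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, h):
    # X: (T, d); Wq, Wk, Wv: (d, d). Columns are split into h heads of size d/h,
    # i.e. Q^l, K^l, V^l are the d x (d/h) slices of the full matrices.
    T, d = X.shape
    dh = d // h
    Q = (X @ Wq).reshape(T, h, dh).transpose(1, 0, 2)   # (h, T, d/h)
    K = (X @ Wk).reshape(T, h, dh).transpose(1, 0, 2)
    V = (X @ Wv).reshape(T, h, dh).transpose(1, 0, 2)
    alpha = softmax(Q @ K.transpose(0, 2, 1), axis=-1)  # one (T, T) distribution per head
    out = alpha @ V                                     # out_i^l = sum_j alpha^l_{i,j} v_j^l
    return out.transpose(1, 0, 2).reshape(T, d)         # concatenate the h head outputs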
Transformer
Transformer-based sequence-to-sequence modeling
• Enhancements:
• Key-query-value attention
• Multi-headed attention
• Architecture modifications:
• Residual connection
• Layer normalization
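An illustrative sketch of where the residual connections and layer normalization sit in one encoder block (post-norm arrangement, as in the original Transformer); attn and mlp stand for the multi-headed attention and position-wise MLP pieces defined earlier, and the learnable LayerNorm gain/bias are omitted.

import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(X, attn, mlp):
    # attn: (T, d) -> (T, d) multi-headed self-attention; mlp: per-position MLP.
    H = layer_norm(X + attn(X))    # residual connection + LayerNorm around attention
    return layer_norm(H + mlp(H))  # residual connection + LayerNorm around the MLP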
Transformer
Machine translation with transformer
Transformer
• Limitations of transformer: Quadratic computation cost
• Linear for RNNs
• Large cost for long sequences, e.g., L > 10^4 (see the quick cost estimate after this list)
• Follow-ups:
• Large-scale training: Transformer-XL, XLNet ('19)
• Projection tricks to O(L): Linformer ('20)
• Math tricks to O(L): Performer (‘20)
• Sparse interactions: Big Bird (‘20)
• Deeper transformers: DeepNet (’22)
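A quick back-of-the-envelope estimate of the quadratic cost (numbers illustrative): self-attention forms an L × L score matrix, so for L = 10^4 it computes L^2 = 10^8 attention scores per head per layer, whereas an RNN performs only L = 10^4 (sequential) state updates.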
Transformer for Images
• Vision Transformer (’21)
• Decompose an image into 16×16 patches and then apply a transformer encoder
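A rough NumPy sketch of the ViT front end: split the image into 16×16 patches and linearly project each flattened patch into a token; the image size, projection matrix, and (H, W, C) layout are assumptions for illustration.

import numpy as np

def patchify(img, patch=16):
    # img: (H, W, C) with H and W divisible by `patch`; returns one row per patch.
    H, W, C = img.shape
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)      # (num_patches, p*p*C)

img = np.random.rand(224, 224, 3)
tokens = patchify(img)                           # (196, 768) flattened patches
W_embed = 0.01 * np.random.randn(16 * 16 * 3, 768)
X = tokens @ W_embed                             # patch tokens fed to the transformer encoder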
Transformer for Images
• Swin Transformer (’21)
• Build hierarchical feature maps at different resolutions
• Self-attention only within each local window (block)
• Shifted window partitions let information flow between neighboring blocks
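A toy NumPy sketch of the Swin-style window partition: self-attention is computed inside each M×M window, and in the next block the feature map is cyclically shifted before partitioning so that information crosses window boundaries; window size and shapes are illustrative.

import numpy as np

def window_partition(feat, M=7):
    # feat: (H, W, C) feature map with H and W divisible by M.
    H, W, C = feat.shape
    x = feat.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)               # (num_windows, M*M, C): tokens per window

def shifted_window_partition(feat, M=7):
    # Cyclic shift by half a window before partitioning (shifted partition).
    shifted = np.roll(feat, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
    return window_partition(shifted, M)

feat = np.random.rand(56, 56, 96)
wins = window_partition(feat)                    # self-attention runs within each window
wins_shifted = shifted_window_partition(feat)    # next block uses the shifted partition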
CNN vs. RNN vs. Attention
Summary
• Language models & sequence-to-sequence models:
• Fundamental ideas and methods for sequence modeling
• Attention mechanism
• So far the most successful idea for sequence data in deep learning
• A scale/order-invariant representation
• Transformer: a fully attention-based architecture for sequence data
• Transformer + pretraining: the core idea behind today’s NLP systems