
Report: Transformer

1 Introduction
The Transformer is a model architecture that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output.

2 Transformer Architecture

Fig. 1. The Transformer model architecture

The Transformer has an encoder-decoder structure. The encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n). Given z, the decoder then generates an output sequence of symbols (y_1, ..., y_m) one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.

2.1 Encoder and Decoder Stacks


Encoder The encoder is composed of a stack of N = 6 identical layers. Each layer has a multi-head self-attention sub-layer and a position-wise feed-forward sub-layer, each wrapped with a residual connection and followed by layer normalization. All sub-layers in the model produce outputs of dimension d_model = 512.
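
A minimal PyTorch sketch of one encoder layer is given below, assuming the arrangement described above; the class name EncoderLayer, the head count of 8, the inner feed-forward width of 2048, and the dropout rate are illustrative choices taken from the original paper rather than from this report.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and feed-forward sub-layers,
    each wrapped with a residual connection followed by layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)   # queries, keys, values all from x
        x = self.norm1(x + self.dropout(attn_out))    # residual + layer norm
        x = self.norm2(x + self.dropout(self.ff(x)))  # residual + layer norm
        return x

# The full encoder simply stacks N = 6 such layers.
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])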

Decoder The decoder is also composed of a stack of N = 6 identical layers. Each decoder layer adds a third sub-layer, which performs multi-head attention over the encoder output. As in the encoder, each sub-layer is wrapped with a residual connection followed by layer normalization. The decoder's self-attention is masked so that position i attends only to earlier positions and to itself, preserving the auto-regressive property.
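
A small sketch of the decoder's mask, assuming PyTorch; the convention that a True entry marks a position that must not be attended to matches the boolean attn_mask convention of torch.nn.MultiheadAttention.

import torch

def causal_mask(seq_len):
    """Boolean mask where True marks future positions that must be hidden:
    position i may attend to positions 0..i only."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])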

2.2 Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

Fig. 2. (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.

2.2.1 Scaled Dot-Product Attention The input consists of queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the query with all keys, divide each by √d_k, and apply a softmax function to obtain the weights on the values. In practice, we compute the attention function on a set of queries simultaneously; packing the queries, keys, and values into matrices Q, K, and V, the matrix of outputs is

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \qquad (1) \]
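
Equation (1) translates almost line for line into code; the sketch below assumes PyTorch, and the optional mask argument is an addition to support the masked decoder self-attention of Sect. 2.1.

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # (..., n_q, n_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))   # hide masked positions
    weights = F.softmax(scores, dim=-1)   # attention weights over the keys
    return weights @ V                    # weighted sum of the values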

2.2.2 Multi-Head Attention Instead of performing a single attention function with d_model-dimensional keys, values, and queries, it is more beneficial to linearly project them h times, with different learned linear projections, to d_k, d_k, and d_v dimensions respectively. On each of these projected versions of the queries, keys, and values we then perform the attention function in parallel, yielding d_v-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Fig. 2.
This allows the model to jointly attend to information from different representation subspaces at different positions. The formula is

\[ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O} \qquad (2) \]

where

\[ \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}) \qquad (3) \]

and the projections are the parameter matrices W_i^Q, W_i^K, W_i^V, and W^O.
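
A from-scratch sketch of Eqs. (2) and (3), reusing the scaled_dot_product_attention function above; folding the h per-head matrices W_i into one fused projection per role is an implementation convenience, and h = 8 with d_k = d_v = d_model / h = 64 follows the original paper rather than this report.

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # One fused projection per role; equivalent to h separate W_i matrices.
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)

    def split(self, x):  # (batch, seq, d_model) -> (batch, h, seq, d_k)
        b, n, _ = x.shape
        return x.view(b, n, self.h, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        heads = scaled_dot_product_attention(
            self.split(self.W_Q(Q)), self.split(self.W_K(K)),
            self.split(self.W_V(V)), mask)          # (batch, h, n_q, d_k)
        b, _, n_q, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(b, n_q, self.h * self.d_k)
        return self.W_O(concat)                     # final projection W^O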

2.2.3 Applications of Attention in Transformer The Transformer uses multi-head attention in three different ways (a short sketch of the three call patterns follows this list):
– In encoder-decoder attention, queries come from the decoder, while keys
and values come from the encoder output, allowing each decoder position to
attend to the entire input sequence.
– The encoder uses self-attention, where queries, keys, and values all come from
the previous encoder layer, allowing each position to attend to all others.
– Decoder self-attention lets each position attend to itself and earlier positions,
with future tokens masked to maintain the auto-regressive property.
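
Reusing the MultiHeadAttention and causal_mask sketches above, the three uses differ only in where Q, K, and V come from; a single attention module is reused here purely to show the call patterns, whereas a real model has separate, independently parameterized attention sub-layers.

import torch

attn = MultiHeadAttention(d_model=512, h=8)    # sketch from Sect. 2.2.2
enc_x = torch.randn(2, 10, 512)   # encoder input:  (batch, src_len, d_model)
dec_x = torch.randn(2, 7, 512)    # decoder input:  (batch, tgt_len, d_model)

# 1. Encoder self-attention: Q, K, V all from the previous encoder layer.
enc_out = attn(enc_x, enc_x, enc_x)

# 2. Masked decoder self-attention: the causal mask hides future positions.
dec_h = attn(dec_x, dec_x, dec_x, mask=causal_mask(dec_x.size(1)))

# 3. Encoder-decoder attention: queries from the decoder, keys and values from
#    the encoder output, so every target position can see the whole source.
cross = attn(dec_h, enc_out, enc_out)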

2.3 Position-wise Feed-Forward Networks


In addition to attention sub-layers, each of the layers in the encoder and decoder
contains a fully connected feed-forward network, which is applied to each position
separately and identically. This consists of two linear transformations with a
ReLU activation in between.
\[ \mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\, W_2 + b_2 \qquad (4) \]
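
Equation (4) corresponds to two nn.Linear layers with a ReLU in between; the inner dimension d_ff = 2048 is the original paper's value and is not stated in this report. Because nn.Linear acts on the last dimension, the same transformation is applied independently and identically at every position.

import torch.nn as nn

def position_wise_ffn(d_model=512, d_ff=2048):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied at each position separately."""
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                         nn.Linear(d_ff, d_model))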

2.4 Embeddings and Softmax


Learned embeddings are used to convert the input and output tokens to vectors of dimension d_model. We also use the usual learned linear transformation and softmax function to convert the decoder output to next-token probabilities. In this model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation. In the embedding layers, we multiply those weights by √d_model.
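
A sketch of the shared embedding and pre-softmax projection in PyTorch; tying is implemented by pointing the output projection's weight at the embedding table, and the class and method names are illustrative.

import math
import torch
import torch.nn as nn

class TiedEmbedding(nn.Module):
    """Token embedding scaled by sqrt(d_model); the same weight matrix also
    serves as the pre-softmax linear transformation (weight tying)."""
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        self.proj.weight = self.embed.weight      # share one weight matrix

    def forward(self, tokens):                    # (batch, seq) of token ids
        return self.embed(tokens) * math.sqrt(self.d_model)

    def logits(self, x):                          # decoder output -> vocabulary logits
        return self.proj(x)                       # softmax is applied in the loss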


2.5 Positional Encoding


For the model to make use of the order of the sequence, we add "positional encodings" to the outputs of the input and output embedding layers. These have the same dimension d_model as the embeddings. We use sine and cosine functions of different frequencies:

\[ PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{\mathrm{model}}}\right) \qquad (5) \]

\[ PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\mathrm{model}}}\right) \qquad (6) \]


where pos is the position and i is the dimension.
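
A direct transcription of Eqs. (5) and (6) in PyTorch, precomputing the table for a maximum length; even dimensions receive the sine and odd dimensions the cosine.

import torch

def positional_encoding(max_len, d_model=512):
    """Return a (max_len, d_model) table of sinusoidal positional encodings."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angles = pos / torch.pow(10000.0, i / d_model)                  # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even indices: sine
    pe[:, 1::2] = torch.cos(angles)   # odd indices: cosine
    return pe

# x: (batch, seq_len, d_model) token embeddings
# x = x + positional_encoding(x.size(1), x.size(2))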

3 Why self-attention
Convolutional or recurrent layers could be used in place of self-attention to map one variable-length sequence of representations to another of the same length. However, a self-attention layer connects all positions with a constant number of sequential operations and is computationally cheaper than a recurrent layer whenever the sequence length is smaller than the representation dimension, as is typical for sentence-level tasks, while achieving superior performance.

4 Training
4.1 Training Data
We train on the standard WMT 2014 English-German and WMT 2014 English-French datasets. Sentence pairs are batched together by approximate sequence length; each batch contains approximately 25,000 source tokens and 25,000 target tokens.

4.2 Optimizer
We use the Adam optimizer with β1 = 0.9, β2 = 0.98, and ϵ = 10^-9. We vary the learning rate over the course of training according to the formula

\[ lrate = d_{\mathrm{model}}^{-0.5} \cdot \min\!\left(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5}\right) \qquad (7) \]
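
Equation (7) as a plain Python function; the default of 4000 warm-up steps is the original paper's setting and is not given in this report.

def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    """Linear warm-up for the first warmup_steps steps, then decay as 1/sqrt(step)."""
    step_num = max(step_num, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# Typically plugged into torch.optim.lr_scheduler.LambdaLR together with an Adam
# optimizer configured as above (betas=(0.9, 0.98), eps=1e-9).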

4.3 Regularization
We utilize three types of regularization during training: dropout on the output of each sub-layer (before it is added to the sub-layer input and normalized), dropout on the sums of the embeddings and the positional encodings at the inputs of the encoder and decoder stacks, both with a rate of 0.1, and label smoothing, which improves accuracy and BLEU score.
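
A minimal sketch of how these two mechanisms appear in PyTorch; the label-smoothing value of 0.1 is the original paper's choice rather than a figure from this report, and the label_smoothing argument of nn.CrossEntropyLoss requires a reasonably recent PyTorch release.

import torch.nn as nn

dropout = nn.Dropout(p=0.1)   # applied to sub-layer outputs and to the
                              # embedding + positional-encoding sums
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # softens one-hot targets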

5 Conclusion
The Transformer is the first sequence transduction model based entirely on attention. We plan to apply the Transformer to other domains with large inputs and outputs, such as images, audio, and video.
