Transformer
1 Introduction
The Transformer is a model architecture that eschews recurrence and instead relies
entirely on an attention mechanism to draw global dependencies between input
and output.
2 Transformer Architecture
The model generates an output sequence of symbols $(y_1, \ldots, y_n)$ one element at a time. At each step, the model is
auto-regressive, consuming the previously generated symbols as additional input
when generating the next.
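
The auto-regressive loop described above can be sketched as follows. This is a minimal illustration, not the report's implementation: `next_symbol` is a hypothetical stand-in for a trained model, and greedy selection, the `<eos>` end marker, and the `max_len` cap are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_symbol: Callable[[Sequence[str], Sequence[str]], str],
    source: Sequence[str],
    eos: str = "<eos>",   # assumed end-of-sequence marker
    max_len: int = 50,    # assumed safety cap on output length
) -> List[str]:
    """Generate output symbols one at a time, feeding earlier outputs back in."""
    generated: List[str] = []
    for _ in range(max_len):
        # The model sees the source plus every symbol generated so far.
        symbol = next_symbol(source, generated)
        if symbol == eos:
            break
        generated.append(symbol)
    return generated


# Hypothetical toy "model": copies the source, then emits the end marker.
def copy_model(source: Sequence[str], generated: Sequence[str]) -> str:
    return source[len(generated)] if len(generated) < len(source) else "<eos>"


print(greedy_decode(copy_model, ["a", "b", "c"]))  # prints ['a', 'b', 'c']
```

The key point is that `generated` grows by one symbol per step and is passed back in as input on the next step.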
2.2 Attention
An attention function can be described as mapping a query and a set of key-value
pairs to an output, where the query, keys, values, and output are all vectors. The
output is computed as a weighted sum of the values, where the weight assigned
to each value is computed by a compatibility function of the query with the
corresponding key.
Report: Transformer
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{1}$$
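
As a rough illustration of formula (1), the sketch below computes scaled dot-product attention with plain Python lists. It assumes `Q`, `K`, and `V` are lists of row vectors and that `d_k` is the key dimension; the helper names `softmax` and `attention` are chosen for this example and are not taken from the report.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Normalize a list of scores into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Compatibility of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output


# One query attending over two key-value pairs.
print(attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]))
```

Because the query aligns with the first key, the first value receives the larger weight, so the output lies closer to `[1.0, 2.0]` than to `[3.0, 4.0]`.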
3 Why self-attention
Convolutional or recurrent layers can produce an output of the same size as a
self-attention layer. However, a self-attention layer has lower per-layer
computational complexity than a recurrent layer whenever the sequence length n
is smaller than the representation dimension d, and it achieves superior
performance.
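
To make the complexity comparison concrete, the sketch below evaluates the leading-order per-layer operation counts given in the original Transformer paper: n²·d for self-attention, n·d² for a recurrent layer, and k·n·d² for a convolutional layer, with sequence length n, representation dimension d, and kernel width k. These formulas are not stated in this report, and the values n = 50, d = 512, k = 3 are assumptions chosen only for illustration.

```python
from typing import Dict


def per_layer_ops(n: int, d: int, k: int = 3) -> Dict[str, int]:
    """Leading-order operation counts per layer (constant factors dropped)."""
    return {
        "self_attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
    }


# Assumed sizes: a 50-token sentence with 512-dimensional representations.
for layer, ops in per_layer_ops(n=50, d=512).items():
    print(f"{layer:>15}: {ops:,}")
```

Self-attention is the cheapest of the three only while n is smaller than d; for sequences longer than the representation dimension, the n² term dominates.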
4 Training
4.1 Training Data
We train on the standard WMT 2014 English-German and WMT 2014 English-French
datasets. Sentence pairs are batched together by approximate sequence length;
each batch contains approximately 25000 source tokens and 25000 target tokens.
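
One way to batch by approximate sequence length under a token budget is sketched below. The report does not describe the actual batching procedure, so the sort-then-fill strategy and the helper name `batch_by_length` are assumptions; only the 25000-token budget comes from the text.

```python
from typing import List, Tuple

Pair = Tuple[List[str], List[str]]  # (source tokens, target tokens)


def batch_by_length(pairs: List[Pair], max_tokens: int = 25000) -> List[List[Pair]]:
    """Group sentence pairs of similar length into token-budgeted batches."""
    # Sorting by length places similarly sized pairs next to each other.
    ordered = sorted(pairs, key=lambda p: max(len(p[0]), len(p[1])))
    batches: List[List[Pair]] = []
    current: List[Pair] = []
    src_tokens = tgt_tokens = 0
    for src, tgt in ordered:
        over_budget = (
            src_tokens + len(src) > max_tokens
            or tgt_tokens + len(tgt) > max_tokens
        )
        # Close the current batch once adding this pair would exceed the budget.
        if current and over_budget:
            batches.append(current)
            current, src_tokens, tgt_tokens = [], 0, 0
        current.append((src, tgt))
        src_tokens += len(src)
        tgt_tokens += len(tgt)
    if current:
        batches.append(current)
    return batches


# Tiny demonstration with a budget of 3 tokens per side.
toy = [(["a"] * 3, ["b"] * 3), (["a"], ["b"]), (["a"] * 2, ["b"] * 2)]
print([len(batch) for batch in batch_by_length(toy, max_tokens=3)])  # prints [2, 1]
```

Grouping similar lengths keeps padding low, since every sequence in a batch is padded to the longest one.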
4.2 Optimizer
We use the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. We vary the
learning rate over the course of training according to the formula:
$$\mathit{lrate} = d_{\mathrm{model}}^{-0.5} \cdot \min\left(\mathit{step\_num}^{-0.5},\ \mathit{step\_num} \cdot \mathit{warmup\_steps}^{-1.5}\right) \tag{7}$$
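
The schedule in formula (7) can be written as a short function. The defaults `d_model = 512` and `warmup_steps = 4000` are assumptions taken from the original Transformer paper's base configuration; this report does not state them.

```python
def lrate(step_num: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate at a given training step, following formula (7)."""
    if step_num < 1:
        raise ValueError("step_num must be at least 1")
    # Linear warm-up for the first warmup_steps steps, then decay
    # proportional to the inverse square root of the step number.
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)


for step in (100, 4000, 100000):
    print(step, lrate(step))
```

The two arguments of `min` are equal at `step_num == warmup_steps`, which is where the learning rate peaks.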
4.3 Regularization
We use three types of regularization during training: dropout applied to the
output of every sub-layer, dropout applied to the inputs of the encoder and
decoder stacks (both with a probability of 0.1), and label smoothing, which
improves accuracy and BLEU score.
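
Label smoothing replaces the one-hot training target with a slightly softened distribution. The sketch below uses the common formulation that keeps 1 − ε on the true class and spreads ε uniformly over all classes; both this particular variant and the value ε = 0.1 are assumptions, since the report does not specify them.

```python
from typing import List


def smooth_labels(target: int, num_classes: int, eps: float = 0.1) -> List[float]:
    """Return a smoothed target distribution over num_classes classes."""
    # Every class receives an equal share of eps ...
    dist = [eps / num_classes] * num_classes
    # ... and the true class additionally keeps the remaining 1 - eps.
    dist[target] += 1.0 - eps
    return dist


print(smooth_labels(target=2, num_classes=4))
```

Training against this softer target discourages the model from becoming over-confident in any single output symbol.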
5 Conclusion
The Transformer is the first sequence transduction model based entirely on attention.
We plan to apply the Transformer to other domains with large inputs and outputs,
such as images, audio, and video.