Transformer Design Report
Positional Encoding
Since the Transformer architecture contains no recurrence or convolution, it has no
inherent sense of sequence order. To enable the model to utilize the order of the
sequence, "positional encodings" are added to the input embeddings at the bottoms of
the encoder and decoder stacks. These encodings provide information about the
absolute and relative position of the tokens in the sequence.
The original Transformer paper proposed using sinusoidal functions of different
frequencies for positional encoding. This choice allows the model to easily learn to
attend by relative positions, since for any fixed offset \begin{math}k\end{math}, \begin{math}PE_{pos+k}\end{math} can be represented as a linear function of \begin{math}PE_{pos}\end{math}.
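This identity is not spelled out in the report, but it follows directly from the angle-addition formulas applied to the sinusoid definitions given below. Writing \begin{math}\omega_i = 10000^{-2i/d_{model}}\end{math}, each pair of dimensions of the encoding at position \begin{math}pos+k\end{math} is a rotation, depending only on \begin{math}k\end{math}, of the corresponding pair at position \begin{math}pos\end{math}:
\begin{equation}
\begin{pmatrix} PE(pos+k, 2i) \\ PE(pos+k, 2i+1) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} PE(pos, 2i) \\ PE(pos, 2i+1) \end{pmatrix}
\end{equation}
Because the rotation matrix is independent of \begin{math}pos\end{math}, the encoding of a shifted position is a fixed linear transform of the unshifted one.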
The positional encoding for position \begin{math}pos\end{math} and each pair of dimensions \begin{math}(2i, 2i+1)\end{math} of the embedding vector is calculated as follows:
• For the even dimensions \begin{math}2i\end{math} (with \begin{math}0 \le 2i < d_{model}\end{math}): \begin{equation} PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \end{equation}
• For the odd dimensions \begin{math}2i+1\end{math}: \begin{equation} PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \end{equation}
Here, \begin{math}pos\end{math} is the position and \begin{math}d_{model}\end{math} is the dimensionality of the embeddings (and the model's internal representation size). The wavelengths form a geometric progression from \begin{math}2\pi\end{math} to \begin{math}10000 \cdot 2\pi\end{math}. These positional encodings have the same dimension \begin{math}d_{model}\end{math} as the embeddings, so the two can be summed. The resulting combined vector (embedding + positional encoding) is what is fed into the subsequent layers of the encoder and decoder.
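A minimal NumPy sketch of this computation is shown below. The function name sinusoidal_positional_encoding and the toy sizes (sequence length 50, \begin{math}d_{model} = 512\end{math}) are illustrative assumptions, not part of the report.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Build the (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, np.newaxis]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]          # even dimension indices 2i
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)   # 1 / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions * angle_rates)           # PE(pos, 2i)
    pe[:, 1::2] = np.cos(positions * angle_rates)           # PE(pos, 2i+1)
    return pe

# The encoding is simply added to the (toy, randomly initialized) token embeddings
# before the first encoder/decoder layer.
embeddings = np.random.randn(50, 512) * 0.02
x = embeddings + sinusoidal_positional_encoding(50, 512)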
Multi-Head Attention
Instead of performing a single attention function with \begin{math}d_{model}\end{math}-dimensional queries, keys, and values, the Transformer uses "multi-head attention". This mechanism linearly projects the queries, keys, and values \begin{math}h\end{math} times with different, learned linear projections to \begin{math}d_k\end{math}, \begin{math}d_k\end{math}, and \begin{math}d_v\end{math} dimensions, respectively. The attention function is applied to each of these projected versions in parallel, yielding \begin{math}h\end{math} output values. These \begin{math}h\end{math} outputs are then concatenated and once again linearly projected to obtain the final output value, which has the desired dimension \begin{math}d_{model}\end{math}.
Mathematically, multi-head attention is defined as: \begin{equation} \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O \end{equation} where \begin{math}\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\end{math}. The projections are parameter matrices \begin{math}W_i^Q \in \mathbb{R}^{d_{model} \times d_k}\end{math}, \begin{math}W_i^K \in \mathbb{R}^{d_{model} \times d_k}\end{math}, \begin{math}W_i^V \in \mathbb{R}^{d_{model} \times d_v}\end{math}, and \begin{math}W^O \in \mathbb{R}^{hd_v \times d_{model}}\end{math}. The common choice \begin{math}d_k = d_v = d_{model}/h\end{math} ensures that the total computational cost is similar to that of a single attention head with full dimensionality.
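The following NumPy sketch illustrates these equations for a single, unbatched sequence; the per-head weight lists Wq, Wk, Wv, the output matrix Wo, and the toy sizes are assumptions made for illustration, and each head applies the usual scaled dot-product attention.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo, h):
    """Q, K, V: (seq_len, d_model). Wq/Wk/Wv: lists of h matrices of shape
    (d_model, d_k) or (d_model, d_v). Wo: (h * d_v, d_model)."""
    heads = []
    for i in range(h):
        q, k, v = Q @ Wq[i], K @ Wk[i], V @ Wv[i]       # project to d_k / d_v
        d_k = q.shape[-1]
        scores = q @ k.T / np.sqrt(d_k)                 # scaled dot products
        heads.append(softmax(scores) @ v)               # head_i: (seq_len, d_v)
    return np.concatenate(heads, axis=-1) @ Wo          # Concat(head_1..head_h) W^O

# Toy usage with d_model = 512, h = 8, d_k = d_v = d_model / h = 64.
d_model, h, d_k = 512, 8, 64
x = np.random.randn(10, d_model)
Wq = [np.random.randn(d_model, d_k) * 0.02 for _ in range(h)]
Wk = [np.random.randn(d_model, d_k) * 0.02 for _ in range(h)]
Wv = [np.random.randn(d_model, d_k) * 0.02 for _ in range(h)]
Wo = np.random.randn(h * d_k, d_model) * 0.02
out = multi_head_attention(x, x, x, Wq, Wk, Wv, Wo, h)  # (10, 512)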
The motivation behind multi-head attention is that it allows the model to jointly attend to
information from different representation subspaces at different positions. With a single
attention head, averaging inhibits this. Multiple heads allow the model to capture
different types of relationships or attend to different parts of the sequence concurrently.
In the decoder's self-attention, attention to future positions is prevented by masking the attention scores before the softmax: the scaled dot product \begin{math}s(i, j)\end{math} between the query at position \begin{math}i\end{math} and the key at position \begin{math}j\end{math} is set to \begin{math}-\infty\end{math} whenever \begin{math}j > i\end{math}. These \begin{math}-\infty\end{math} values ensure that the softmax probabilities for attending to future positions are zero. This masking is only applied in the decoder's self-attention and is essential during training to simulate the causal, step-by-step generation process of the decoder during inference.
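A minimal sketch of applying this mask, assuming a NumPy score matrix of shape (seq_len, seq_len) whose entry [i, j] corresponds to \begin{math}s(i, j)\end{math}; the helper name causal_mask_softmax is an illustrative assumption.

import numpy as np

def causal_mask_softmax(scores):
    """Set s(i, j) to -inf for j > i, then apply a row-wise softmax."""
    seq_len = scores.shape[-1]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True where j > i
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)            # numerical stability
    e = np.exp(masked)                                              # exp(-inf) = 0
    return e / e.sum(axis=-1, keepdims=True)

# Each row i now places zero probability on positions j > i.
weights = causal_mask_softmax(np.random.randn(5, 5))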
Label Smoothing
To enhance generalization and avoid overconfidence in predictions, Transformer
training integrates a technique called label smoothing. Instead of using a one-hot
encoding where the target class probability is 1 and all others are 0, label smoothing
assigns a slightly lower probability (e.g., 0.9) to the correct class and distributes the
remainder uniformly across the other classes. This approach effectively prevents the
model from becoming too confident in its predictions, which can reduce overfitting and
improve calibration.
Concretely, if the smoothing factor is \begin{math}\epsilon\end{math} and there are \begin{math}K\end{math} classes, the components of the target label vector are:
• Target class: \begin{math}1 - \epsilon\end{math}
• Non-target classes: \begin{math}\frac{\epsilon}{K-1}\end{math}
This softening of targets encourages the network to learn more robust feature
representations.
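A small NumPy sketch of this target construction follows; the helper name smoothed_targets and the example values (\begin{math}K = 5\end{math}, \begin{math}\epsilon = 0.1\end{math}) are assumptions chosen for illustration.

import numpy as np

def smoothed_targets(target_index, num_classes, epsilon=0.1):
    """Return a target distribution with 1 - epsilon on the true class
    and epsilon / (K - 1) spread uniformly over the remaining classes."""
    targets = np.full(num_classes, epsilon / (num_classes - 1))
    targets[target_index] = 1.0 - epsilon
    return targets

# e.g. K = 5, epsilon = 0.1 -> [0.025, 0.025, 0.9, 0.025, 0.025]
print(smoothed_targets(target_index=2, num_classes=5, epsilon=0.1))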