Dis7 Sol
1 Attention Mechanisms
For many of the NLP and visual tasks we train our deep models on, the features appearing in the input text or visual data
often contribute unevenly to the output task. For example, in a translation task, not all of the
input sentence is useful (and parts may even be confusing) for generating a particular output word,
and not all of the image contributes to a particular sentence generated in the caption.
While some RNN architectures we previously covered can maintain a memory of the
previous inputs/outputs, compute outputs, and modify that memory accordingly, these memory states
need to encompass information from many previous states, which can be difficult, especially when performing
tasks with long-term dependencies.
Attention mechanisms were developed to improve the network's ability to orient its perception onto parts
of the data, and to allow random access to the memory of previously processed inputs. In the context of
RNNs, attention mechanisms allow networks to use not only the current hidden state, but also the hidden
states of the network computed at previous time steps, as shown in Figure 5.
With the decoder hidden state $h_t$ and each encoder hidden state $h_s$, we compute an alignment score, then take a Softmax to obtain attention weights from the alignment scores.
With these attention weights, we compute a context vector $c_t$, which is a weighted sum of the $h_s$. Once we have $c_t$,
we compute a transformation $\tilde{h}_t = \tanh(W_f [h_t; c_t])$ applied to the concatenation of the original hidden state and the context vector. This is then used to compute the final output.
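To make this concrete, here is a minimal NumPy sketch of one such attention step (a sketch only: it assumes a simple dot-product alignment score, and the variable names are made up for illustration):

```python
import numpy as np

def luong_attention_step(h_t, H_s, W_f):
    """One decoder attention step in the style described above.

    h_t: (D,)     current decoder hidden state
    H_s: (S, D)   encoder hidden states h_s, one per source position
    W_f: (D, 2D)  transformation applied to the concatenation [h_t; c_t]
    """
    scores = H_s @ h_t                                    # dot-product alignment score for each h_s
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                     # softmax -> attention weights
    c_t = weights @ H_s                                   # context vector: weighted sum of the h_s
    h_tilde = np.tanh(W_f @ np.concatenate([h_t, c_t]))   # h~_t = tanh(W_f [h_t; c_t])
    return h_tilde, weights
```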
The motivation for this kind of machine translation system was to treat the encoder (the blue part) as
a memory which we can access later during the decoding process (colored red). Most of the early neural
machine translation systems ran the encoder over the input sentence to obtain a hidden state, and that
hidden state was then the input to the decoder, which had to generate the resulting sentence. With this model,
we are able to use the other hidden state outputs from the encoder, not just the last one.
Then, using $Q$ and $K$, we can compute a dot product as the 'score' of $K$ for $Q$, as shown in Figure 4.
Intuitively, $Q$ is the query term that you would like to find; its relation to each corresponding key-value pair
$(K, V)$ is computed using the key. Note that this dot product is computed across the
various time steps by a matrix multiplication, so we get a score for each $K$ for each $Q$. We then use a Softmax
function to get our attention weights.
Finally, using these weights, we compute our weighted sum by multiplying the weights with the values.
Compared to Luong attention, the query is analogous to the original $h_t$, and the keys and values are analogous to the
original $h_s$.
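For instance, here is a small NumPy sketch of this scoring step (the shapes and random inputs are arbitrary choices for illustration):

```python
import numpy as np

# Hypothetical sizes: t time steps, d dimensions.
t, d = 6, 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(t, d))   # one query per time step
K = rng.normal(size=(t, d))   # one key per time step
V = rng.normal(size=(t, d))   # one value per time step

scores = Q @ K.T                                          # score of each K for each Q
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
output = weights @ V                                      # weighted sum of the values
```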
Problem: Attention in RNNs
To incorporate self-attention, we can let the hidden states attend to themselves. In other words,
every hidden state attends to the previous hidden states. Put more formally, $h_t$ attends to previous
states by
$$e_{t,l} = \mathrm{score}(h_t, h_l)$$
We apply a Softmax to get an attention distribution over the previous states,
$$\alpha_{t,l} = \frac{\exp(e_{t,l})}{\sum_j \exp(e_{t,j})}$$
Consider a form of attention that matches a query $q$ to keys $k_1, \ldots, k_t$ in order to attend over associated
values $v_1, \ldots, v_t$.
If we have multiple queries $q_1, \ldots, q_l$, how can we write this version of attention in matrix notation?
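One way to write it (a sketch consistent with the softmax attention update used later in these notes, assuming dot-product scores and stacking the queries, keys, and values as rows of matrices) is to form $Q \in \mathbb{R}^{l \times d}$, $K \in \mathbb{R}^{t \times d}$, and $V \in \mathbb{R}^{t \times d_v}$, and compute
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top)\,V,$$
where the softmax is taken row-wise, i.e. over the keys for each query.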
In practice, Transformers use a Scaled Self-Attention. Suppose $q, k \in \mathbb{R}^d$ are two independent random vectors
with $q, k \sim \mathcal{N}(\mu, \sigma^2 I)$, where $\mu \in \mathbb{R}^d$ and $\sigma \in \mathbb{R}^+$.
1. $$\mathbb{E}[q^\top k] = \mathbb{E}\left[\sum_{i=1}^{d} q_i k_i\right] = \sum_{i=1}^{d} \mathbb{E}[q_i k_i] = \sum_{i=1}^{d} \mu_i^2 = \mu^\top \mu,$$
where independence gives $\mathbb{E}[q_i k_i] = \mathbb{E}[q_i]\,\mathbb{E}[k_i] = \mu_i^2$.
Then, a similar computation shows that the variance of $q^\top k$ grows linearly with $d$ (for instance, $\mathrm{Var}[q^\top k] = d\sigma^4$ when $\mu = 0$), which is why the dot products are scaled by $\frac{1}{\sqrt{d}}$ before the softmax.
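As a quick numerical sanity check on these statistics (a minimal sketch; the dimension, $\sigma$, and sample count are arbitrary choices for illustration):

```python
import numpy as np

d, sigma = 64, 1.0
mu = np.zeros(d)   # with mu = 0 we expect E[q.k] = 0 and Var[q.k] = d * sigma**4

rng = np.random.default_rng(0)
q = rng.normal(mu, sigma, size=(100_000, d))   # samples of q ~ N(mu, sigma^2 I)
k = rng.normal(mu, sigma, size=(100_000, d))   # independent samples of k

dots = np.einsum("nd,nd->n", q, k)   # q^T k for each sample
print(dots.mean())       # approximately mu^T mu = 0
print(dots.var())        # approximately d * sigma**4 = 64
print(dots.var() / d)    # scaling the dot product by 1/sqrt(d) keeps the variance ~ sigma**4
```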
Both operate similarly, except that the Transformer Decoder takes $x_{target}$ as input while the Transformer Encoder
takes $x_{source}$ as input. In addition, there are several differences between the cross-attention and self-attention
operations. In particular, transformers are novel in that they add components such as positional encodings and
multi-head attention, which are discussed in the questions below.
2.1 Notations
To ensure a level of clarity, we will let B be the batch size, Lsource represent the source sequence length,
Ltarget be the target sequence length, D represent the model hidden dimension and H represent the number
of attention heads.
In particular, transformers receive two sequences as input. The first is $x_{source} \in \mathbb{Z}^{B \times L_{source}}$ and the second
is $x_{target} \in \mathbb{Z}^{B \times L_{target}}$. These are integer tensors, and each integer represents a word or token.
$$Q = X_{source} W_Q, \qquad K = X_{source} W_K, \qquad V = X_{source} W_V$$
Using $Q, K, V$, we will compute the attention scores (a tensor in $\mathbb{R}^{B \times L_{source} \times L_{source}}$). For each element in the
batch, entry $(i, j)$ of the matrix is $\frac{q_i^\top k_j}{\sqrt{D}}$ for scaled dot-product attention. Alternatively, we can
compute $\frac{QK^\top}{\sqrt{D}}$ for the whole sequence at once. To produce weights over each position in the sequence, we want the weights to sum to one
over the keys $K$. To accomplish this, we take a softmax over the last dimension of the attention
scores. Then, to produce the attention update, we multiply these attention weights by our values $V$,
$$C_{update} = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{D}}\right) V$$
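As a concrete illustration, here is a minimal NumPy sketch of this self-attention update using the $B \times L_{source} \times D$ shapes from Section 2.1 (the function name, the random projection matrices, and the example sizes are assumptions for illustration only):

```python
import numpy as np

def scaled_dot_product_self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention over a batch of sequences.

    X:              (B, L_source, D) input activations
    W_Q, W_K, W_V:  (D, D) projection matrices
    returns:        (B, L_source, D) attention update C_update
    """
    Q = X @ W_Q                                       # (B, L_source, D)
    K = X @ W_K                                       # (B, L_source, D)
    V = X @ W_V                                       # (B, L_source, D)

    D = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(D)    # (B, L_source, L_source)

    # Softmax over the last dimension (the keys), so each row of weights sums to one.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ V                                # (B, L_source, D)

# Example usage with arbitrary sizes.
B, L_source, D = 2, 5, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(B, L_source, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))
C_update = scaled_dot_product_self_attention(X, W_Q, W_K, W_V)
print(C_update.shape)  # (2, 5, 8)
```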
Feedforward Layer The feedforward layer applies a linear transformation at each position, applies a nonlinear activation, and then applies a second linear transformation.
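A minimal sketch of such a position-wise feedforward layer (the hidden width and the ReLU activation below are common choices assumed for illustration, not something specified in these notes):

```python
import numpy as np

def feedforward(X, W1, b1, W2, b2):
    """Position-wise feedforward: linear -> nonlinearity -> linear.

    X: (B, L, D);  W1: (D, D_ff), b1: (D_ff,);  W2: (D_ff, D), b2: (D,)
    The same transformation is applied independently at every position.
    """
    hidden = np.maximum(X @ W1 + b1, 0.0)   # first linear map + ReLU
    return hidden @ W2 + b2                 # second linear map back to dimension D
```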
Encoder-Decoder Attention Encoder-Decoder attention operates similarly as well, except that we now have
two sequences: one generates the queries and the other generates the keys and values. Hence, we let $Q = X_{target} W_Q$, $K =
X_{source} W_K$, $V = X_{source} W_V$, where $X_{source}$ is the output of the transformer encoder on the source sequence.
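Continuing the NumPy sketch above (again an illustrative sketch, not the notes' implementation), only the source of the queries changes relative to self-attention:

```python
import numpy as np

def encoder_decoder_attention(X_target, X_source, W_Q, W_K, W_V):
    """Cross-attention: queries come from the target, keys/values from the source.

    X_target: (B, L_target, D), X_source: (B, L_source, D)
    returns:  (B, L_target, D)
    """
    Q = X_target @ W_Q                                         # (B, L_target, D)
    K = X_source @ W_K                                         # (B, L_source, D)
    V = X_source @ W_V                                         # (B, L_source, D)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])   # (B, L_target, L_source)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over source positions
    return weights @ V
```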
3. For input sequences of length $M$ and output sequences of length $N$, what are the complexities of
(1) Encoder Self-Attention, (2) Decoder-Encoder Attention, and (3) Decoder Self-Attention? Further,
let $k$ be the hidden dimension of the network.
4. Do the activations of the encoder depend on the decoder activations? How much additional computation
is needed to translate a source sequence into a different target language, in terms of $M$ and $N$?
1. Position encoding is used to ensure that word position is known. Because attention is applied
symmetrically to all input vectors from the layer below, the network would otherwise have no way to know
which positions were filtered through to the output of the attention block. Position encoding
also allows the network to compare word positions (nearby position encodings have a high inner product)
and find nearby words.
2. Multi-Head attention allows a single attention module to attend to multiple parts of an
input sequence. This is useful when the output depends on multiple inputs (such as the
tense of a verb in translation). Attention heads find features like the start of a sentence
or paragraph, subject/object relations, pronouns, etc.; a small multi-head sketch is given at the end of these notes.
3. (1) $O(M^2 k)$, (2) $O(MNk)$, (3) $O(N^2 k)$.
4. No. The encoder activations do not depend on the decoder activations. Thus, you only need
$O(MN + N^2)$ additional computation to decode into a new sequence.
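As referenced in answer 2, here is a minimal multi-head self-attention sketch following the notation of Section 2.1 (the projection matrices, the output projection $W_O$, and the per-head scaling by $\sqrt{D/H}$ are illustrative assumptions, not something specified in these notes):

```python
import numpy as np

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, H):
    """Multi-head self-attention sketch: split D into H heads of size D // H.

    X: (B, L, D);  W_Q, W_K, W_V, W_O: (D, D);  H: number of heads (must divide D).
    """
    B, L, D = X.shape
    d_head = D // H

    def split(Z):
        # (B, L, D) -> (B, H, L, d_head): each head sees its own slice of the features.
        return Z.reshape(B, L, H, d_head).transpose(0, 2, 1, 3)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_head)    # (B, H, L, L)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys per head
    heads = weights @ V                                       # (B, H, L, d_head)
    concat = heads.transpose(0, 2, 1, 3).reshape(B, L, D)     # concatenate the heads
    return concat @ W_O                                       # final output projection
```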