Dis7 Sol
1 Attention Mechanisms
For many of the NLP and visual tasks we train our deep models on, the features appearing in the input text or visual data
often contribute unevenly to the output task. For example, in a translation task, not all of the
input sentence is useful (and parts may even be confusing) for generating a particular output word,
and not all of the image contributes to a particular sentence generated in the caption.
While some RNN architectures we previously covered can maintain a memory of the
previous inputs/outputs, compute outputs, and modify that memory accordingly, these memory states
need to encompass information from many previous states, which can be difficult, especially when performing
tasks with long-term dependencies.
Attention mechanisms were developed to improve the network's ability to orient its perception onto parts
of the data, and to allow random access to the memory of previously processed inputs. In the context of
RNNs, attention mechanisms allow networks to use not only the current hidden state, but also the hidden
states of the network computed at previous time steps, as shown in Figure 5.
With the decoder hidden state $h_t$ and each encoder hidden state $h_s$, we compute an alignment score, then take a Softmax to obtain attention weights from the alignment scores.
With these attention weights, we compute a context vector $c_t$, which is a weighted sum of the $h_s$. Once we have $c_t$,
we compute a transformation $\tilde{h}_t = \tanh(W_f [h_t; c_t])$ applied to the concatenation of the original hidden state and the context vector. This is then used to compute the final output.
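To make this concrete, here is a minimal NumPy sketch of one such attention step (a sketch only: it assumes a simple dot-product alignment score, and the variable names are made up for illustration):

```python
import numpy as np

def luong_attention_step(h_t, H_s, W_f):
    """One decoder attention step in the style described above.

    h_t: (D,)     current decoder hidden state
    H_s: (S, D)   encoder hidden states h_s, one per source position
    W_f: (D, 2D)  transformation applied to the concatenation [h_t; c_t]
    """
    scores = H_s @ h_t                                    # dot-product alignment score for each h_s
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                     # softmax -> attention weights
    c_t = weights @ H_s                                   # context vector: weighted sum of the h_s
    h_tilde = np.tanh(W_f @ np.concatenate([h_t, c_t]))   # h~_t = tanh(W_f [h_t; c_t])
    return h_tilde, weights
```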
The motivation for this kind of machine translation system was to treat the encoder (the blue part) as
a memory which we can access later during the decoding process (colored red). Most of the early neural
machine translation systems ran the encoder over the input sentence to obtain a hidden state, and that
hidden state was then the input to the decoder, which had to generate the resulting sentence. With this model,
we are able to use the other hidden state outputs from the encoder, not just the last one.
Then, using $Q$ and $K$, we can compute a dot product as the 'score' of $K$ for $Q$, as shown in Figure 4.
Intuitively, $Q$ is the query term that you would like to find; its relation to each corresponding key-value pair
$(K, V)$ is computed using the key. Note that this dot product is computed across the
various time steps by a matrix multiplication, so we get a score for each $K$ for each $Q$. We then use a Softmax
function to get our attention weights.
Finally, using these weights, we compute our weighted sum by multiplying the weights with the values.
Compared to Luong attention, the query is analogous to the original $h_t$, and the keys and values are analogous to the
original $h_s$.
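For instance, here is a small NumPy sketch of this scoring step (the shapes and random inputs are arbitrary choices for illustration):

```python
import numpy as np

# Hypothetical sizes: t time steps, d dimensions.
t, d = 6, 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(t, d))   # one query per time step
K = rng.normal(size=(t, d))   # one key per time step
V = rng.normal(size=(t, d))   # one value per time step

scores = Q @ K.T                                          # score of each K for each Q
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
output = weights @ V                                      # weighted sum of the values
```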
Problem: Attention in RNNs
To incorporate self-attention, we can let the hidden states attend to themselves. In other words,
every hidden state attends to the previous hidden states. Put more formally, $h_t$ attends to previous
states by
$$e_{t,l} = \mathrm{score}(h_t, h_l)$$
We apply a Softmax to get an attention distribution over the previous states,
$$\alpha_{t,l} = \frac{\exp(e_{t,l})}{\sum_j \exp(e_{t,j})}$$
Consider a form of attention that matches a query $q$ to keys $k_1, \ldots, k_t$ in order to attend over associated
values $v_1, \ldots, v_t$.
If we have multiple queries $q_1, \ldots, q_l$, how can we write this version of attention in matrix notation?
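One way to write it (a sketch consistent with the softmax attention update used later in these notes, assuming dot-product scores and stacking the queries, keys, and values as rows of matrices) is to form $Q \in \mathbb{R}^{l \times d}$, $K \in \mathbb{R}^{t \times d}$, and $V \in \mathbb{R}^{t \times d_v}$, and compute
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top)\,V,$$
where the softmax is taken row-wise, i.e. over the keys for each query.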
In practice, Transformers use a Scaled Self-Attention. Suppose $q, k \in \mathbb{R}^d$ are two independent random vectors
with $q, k \sim \mathcal{N}(\mu, \sigma^2 I)$, where $\mu \in \mathbb{R}^d$ and $\sigma \in \mathbb{R}^+$.
1. $$\mathbb{E}[q^\top k] = \mathbb{E}\left[\sum_{i=1}^{d} q_i k_i\right] = \sum_{i=1}^{d} \mathbb{E}[q_i k_i] = \sum_{i=1}^{d} \mu_i^2 = \mu^\top \mu,$$
where independence gives $\mathbb{E}[q_i k_i] = \mathbb{E}[q_i]\,\mathbb{E}[k_i] = \mu_i^2$.
Then, a similar computation shows that the variance of $q^\top k$ grows linearly with $d$ (for instance, $\mathrm{Var}[q^\top k] = d\sigma^4$ when $\mu = 0$), which is why the dot products are scaled by $\frac{1}{\sqrt{d}}$ before the softmax.
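As a quick numerical sanity check on these statistics (a minimal sketch; the dimension, $\sigma$, and sample count are arbitrary choices for illustration):

```python
import numpy as np

d, sigma = 64, 1.0
mu = np.zeros(d)   # with mu = 0 we expect E[q.k] = 0 and Var[q.k] = d * sigma**4

rng = np.random.default_rng(0)
q = rng.normal(mu, sigma, size=(100_000, d))   # samples of q ~ N(mu, sigma^2 I)
k = rng.normal(mu, sigma, size=(100_000, d))   # independent samples of k

dots = np.einsum("nd,nd->n", q, k)   # q^T k for each sample
print(dots.mean())       # approximately mu^T mu = 0
print(dots.var())        # approximately d * sigma**4 = 64
print(dots.var() / d)    # scaling the dot product by 1/sqrt(d) keeps the variance ~ sigma**4
```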
Both operate similarly, except that the Transformer Decoder takes $x_{target}$ as input while the Transformer Encoder
takes $x_{source}$ as input. In addition, there are several differences between the cross-attention and self-attention
operations. In particular, transformers are novel in that they add components such as positional encodings and
multi-head attention, which are discussed in the questions below.
2.1 Notations
To ensure a level of clarity, we will let B be the batch size, Lsource represent the source sequence length,
Ltarget be the target sequence length, D represent the model hidden dimension and H represent the number
of attention heads.
In particular, transformers receive two sequences as input. The first is $x_{source} \in \mathbb{Z}^{B \times L_{source}}$ and the second
is $x_{target} \in \mathbb{Z}^{B \times L_{target}}$. These are integer tensors, and each integer represents a word or token.
$$Q = X_{source} W_Q, \qquad K = X_{source} W_K, \qquad V = X_{source} W_V$$
Using $Q, K, V$, we will compute the attention scores (a tensor in $\mathbb{R}^{B \times L_{source} \times L_{source}}$). For each element in the
batch, entry $(i, j)$ of the matrix is $\frac{q_i^\top k_j}{\sqrt{D}}$ for scaled dot-product attention. Alternatively, we can
compute $\frac{QK^\top}{\sqrt{D}}$ for the whole sequence at once. To produce weights over each position in the sequence, we want the weights to sum to one
over the keys $K$. To accomplish this, we take a softmax over the last dimension of the attention
scores. Then, to produce the attention update, we multiply these attention weights by our values $V$,
$$C_{update} = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{D}}\right) V$$
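As a concrete illustration, here is a minimal NumPy sketch of this self-attention update using the $B \times L_{source} \times D$ shapes from Section 2.1 (the function name, the random projection matrices, and the example sizes are assumptions for illustration only):

```python
import numpy as np

def scaled_dot_product_self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention over a batch of sequences.

    X:              (B, L_source, D) input activations
    W_Q, W_K, W_V:  (D, D) projection matrices
    returns:        (B, L_source, D) attention update C_update
    """
    Q = X @ W_Q                                       # (B, L_source, D)
    K = X @ W_K                                       # (B, L_source, D)
    V = X @ W_V                                       # (B, L_source, D)

    D = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(D)    # (B, L_source, L_source)

    # Softmax over the last dimension (the keys), so each row of weights sums to one.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ V                                # (B, L_source, D)

# Example usage with arbitrary sizes.
B, L_source, D = 2, 5, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(B, L_source, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))
C_update = scaled_dot_product_self_attention(X, W_Q, W_K, W_V)
print(C_update.shape)  # (2, 5, 8)
```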
Feedforward Layer The feedforward layer applies a linear transformation at each position, applies a nonlinear activation, and then applies a second linear transformation.
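A minimal sketch of such a position-wise feedforward layer (the hidden width and the ReLU activation below are common choices assumed for illustration, not something specified in these notes):

```python
import numpy as np

def feedforward(X, W1, b1, W2, b2):
    """Position-wise feedforward: linear -> nonlinearity -> linear.

    X: (B, L, D);  W1: (D, D_ff), b1: (D_ff,);  W2: (D_ff, D), b2: (D,)
    The same transformation is applied independently at every position.
    """
    hidden = np.maximum(X @ W1 + b1, 0.0)   # first linear map + ReLU
    return hidden @ W2 + b2                 # second linear map back to dimension D
```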
Encoder-Decoder Attention Encoder-Decoder attention operates similarly as well, except that we now have
two sequences: one generates the queries and the other generates the keys and values. Hence, we let $Q = X_{target} W_Q$, $K =
X_{source} W_K$, $V = X_{source} W_V$, where $X_{source}$ is the output of the transformer encoder on the source sequence.
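Continuing the NumPy sketch above (again an illustrative sketch, not the notes' implementation), only the source of the queries changes relative to self-attention:

```python
import numpy as np

def encoder_decoder_attention(X_target, X_source, W_Q, W_K, W_V):
    """Cross-attention: queries come from the target, keys/values from the source.

    X_target: (B, L_target, D), X_source: (B, L_source, D)
    returns:  (B, L_target, D)
    """
    Q = X_target @ W_Q                                         # (B, L_target, D)
    K = X_source @ W_K                                         # (B, L_source, D)
    V = X_source @ W_V                                         # (B, L_source, D)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])   # (B, L_target, L_source)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over source positions
    return weights @ V
```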
3. For input sequences of length $M$ and output sequences of length $N$, what are the complexities of
(1) Encoder Self-Attention, (2) Decoder-Encoder Attention, and (3) Decoder Self-Attention? Further,
let $k$ be the hidden dimension of the network.
4. Do the activations of the encoder depend on the decoder activations? How much additional computation
is needed to translate a source sequence into a different target language, in terms of $M$ and $N$?
1. Position encoding is used to ensure that word position is known. Because attention is applied
symmetrically to all input vectors from the layer below, the network would otherwise have no way to know
which positions were filtered through to the output of the attention block. Position encoding
also allows the network to compare word positions (nearby position encodings have a high inner product)
and find nearby words.
2. Multi-Head attention allows a single attention module to attend to multiple parts of an
input sequence. This is useful when the output depends on multiple inputs (such as the
tense of a verb in translation). Attention heads find features like the start of a sentence
or paragraph, subject/object relations, pronouns, etc.; a small multi-head sketch is given at the end of these notes.
3. (1) $O(M^2 k)$, (2) $O(MNk)$, (3) $O(N^2 k)$.
4. No. The encoder activations do not depend on the decoder activations. Thus, you only need
$O(MN + N^2)$ additional computation to decode into a new sequence.
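As referenced in answer 2, here is a minimal multi-head self-attention sketch following the notation of Section 2.1 (the projection matrices, the output projection $W_O$, and the per-head scaling by $\sqrt{D/H}$ are illustrative assumptions, not something specified in these notes):

```python
import numpy as np

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, H):
    """Multi-head self-attention sketch: split D into H heads of size D // H.

    X: (B, L, D);  W_Q, W_K, W_V, W_O: (D, D);  H: number of heads (must divide D).
    """
    B, L, D = X.shape
    d_head = D // H

    def split(Z):
        # (B, L, D) -> (B, H, L, d_head): each head sees its own slice of the features.
        return Z.reshape(B, L, H, d_head).transpose(0, 2, 1, 3)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_head)    # (B, H, L, L)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys per head
    heads = weights @ V                                       # (B, H, L, d_head)
    concat = heads.transpose(0, 2, 1, 3).reshape(B, L, D)     # concatenate the heads
    return concat @ W_O                                       # final output projection
```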