
Lecture 2

The lecture traces the history of attention and transformer models: from basic feedforward and convolutional neural networks, to recurrent models such as LSTMs, to current transformer architectures. Transformers use self-attention to compute representations without RNNs or CNNs, which makes computation easy to parallelize. Recent models such as AlphaFold2 use transformer architectures to train end-to-end models for applications like protein structure prediction.


Jan. 14, 2022

CS 886 Deep Learning for Biotechnology

Ming Li
CONTENT

01. Word2Vec
02. Attention / Transformers
03. Pretraining: GPT / BERT
04. Deep learning applications in proteomics
05. Student presentations begin


LECTURE TWO
Attention and Transformers

02 Attention and Transformers

Plan: we trace the history to see how attention and transformers emerged.

1. Basic models, related to transduction models and attention.
2. Encoder-decoder models, using recurrent networks such as LSTMs.
3. Transformer models are general models sufficient for almost all biotech
applications (graph models may be treated as special cases too).
4. For example, DeepMind's AlphaFold2 depends on a transformer
architecture to train an end-to-end model.
5. The transformer model also makes large-scale (pre)training on biological
data easy.
02 Attention and Transformers

1. Fully connected network, feedforward network

Goal: learn the weights on the edges.
02 Attention and Transformers

2. CNN

A CNN is a neural network with some convolutional layers (and some other
layers). A convolutional layer has a number of filters, each of which performs
a convolution operation.
02 Attention and Transformers

Convolutional layer: these are the network parameters to be learned.
Each filter detects a small (3 x 3) pattern.

Input (6 x 6):
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1 (3 x 3):
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2 (3 x 3):
-1  1 -1
-1  1 -1
-1  1 -1
Convolution operation (Filter 1, stride = 1):
Slide the filter over the input and take the dot product with each 3 x 3 patch.
For the top-left patch the dot product is 3; shifting one column to the right gives -1.
02 Attention and Transformers

Convolution (Filter 1, stride = 1) over the full 6 x 6 input gives the 4 x 4 feature map:

 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
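As a minimal sketch (not from the slides), the feature map above can be reproduced in NumPy, assuming stride 1 and no padding:

```python
import numpy as np

# 6x6 binary input from the slide
x = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])

# Filter 1 from the slide: detects a diagonal pattern
f1 = np.array([
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
])

def conv2d(image, kernel, stride=1):
    """Valid convolution (no padding): dot product of the kernel with each patch."""
    h = (image.shape[0] - kernel.shape[0]) // stride + 1
    w = (image.shape[1] - kernel.shape[1]) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = image[i*stride:i*stride+kernel.shape[0],
                          j*stride:j*stride+kernel.shape[1]]
            out[i, j] = np.sum(patch * kernel)
    return out

print(conv2d(x, f1))
# [[ 3. -1. -3. -1.]
#  [-3.  1.  0. -3.]
#  [-3. -3.  0.  1.]
#  [ 3. -2. -2. -1.]]
```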
02 Attention and Transformers

3. RNN
Parameters to be learned:
U, V, W
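The slide's figure is not reproduced here; as a hedged sketch, assuming the standard Elman-style RNN in which U maps the input to the hidden state, W maps the previous hidden state to the new one, and V maps the hidden state to the output:

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V, b_h, b_y):
    """One step of a simple (Elman) RNN.
    U: input-to-hidden, W: hidden-to-hidden, V: hidden-to-output."""
    h_t = np.tanh(U @ x_t + W @ h_prev + b_h)   # new hidden state
    y_t = V @ h_t + b_y                         # output (e.g. logits)
    return h_t, y_t

# Tiny example: input dim 3, hidden dim 4, output dim 2
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
b_h, b_y = np.zeros(4), np.zeros(2)

h = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):   # a sequence of 5 input vectors
    h, y = rnn_step(x_t, h, U, W, V, b_h, b_y)
```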
02 Attention and Transformers

Simple RNN vs LSTM


02 Attention and Transformers

Encoder-Decoder machine translation

LSTM
02 Attention and Transformers

Encoder-Decoder LSTM structure for chatting


02 Attention and Transformers

Attention
02 Attention and Transformers
Image caption generation using attention

A CNN produces a vector for each image region. An initial vector z0 (which is
also learned) is matched against each region vector to produce an attention
score (e.g., 0.7 for one region).
02 Attention and Transformers
Image caption generation using attention

The attention weights over the regions (e.g., 0.7, 0.1, 0.1, 0.1, 0.0, 0.0)
are used to form a weighted sum of the region vectors. This weighted sum,
together with z0, produces Word 1 and the next state z1.
02 Attention and Transformers
Image caption generation using attention

At the next step the attention weights shift (e.g., 0.0, 0.8, 0.2, 0.0, 0.0,
0.0), a new weighted sum is formed, and z1 produces Word 2 and the next state z2.
02 Attention and Transformers

Image caption generation using attention

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio,
“Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015
02 Attention and Transformers

More new ideas:

1. ULMFiT: pre-training and transfer learning in NLP.
2. Recurrent models require sequential computation that is linear in the
sequence length and hard to parallelize. ELMo uses a bidirectional LSTM.
3. To reduce this sequential computation, several CNN-based models were
introduced, such as ConvS2S and ByteNet. Relating two distant positions
requires linear depth in ConvS2S and logarithmic depth in ByteNet.
4. The transformer is the first transduction model relying entirely on
self-attention to compute the representations of its input and output,
without using RNNs or CNNs.
02 Attention and Transformers

Transformer
02 Attention and Transformers

An Encoder Block: same structure, different parameters


02 Attention and Transformers

Encoder

Note: the feed-forward network (FFNN) is applied to each word independently,
hence it can be parallelized.
02 Attention and Transformers

Self Attention

First we create three vectors by multiplying each input embedding xi (1 x 512)
with three matrices WQ, WK, WV (each 512 x 64):

qi = xi WQ
ki = xi WK
vi = xi WV
02 Attention and Transformers

Self Attention

Now we need to calculate a score to determine how much focus to place on the
other parts of the input.
02 Attention and Transformers

Self Attention

Formula (scaled dot-product attention):

Attention(Q, K, V) = softmax(Q K^T / sqrt(dk)) V

where dk = 64 is the dimension of the key vectors. For a two-word example the
output might be z1 = 0.88 v1 + 0.12 v2.
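A minimal NumPy sketch of this formula (illustrative only; the projection matrices are random stand-ins, with d_model = 512 and dk = 64 as on the slides):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one head.
    X: (seq_len, d_model); W_*: (d_model, d_k)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # project inputs to queries/keys/values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # how much each position attends to the others
    weights = softmax(scores, axis=-1)       # e.g. [0.88, 0.12] for the first position
    return weights @ V                       # z_i = sum_j weights[i, j] * v_j

# Two input embeddings of size 512, projected down to d_k = 64
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 512))
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) * 0.05 for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)   # shape (2, 64)
```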
02 Attention and Transformers

Multiple heads

1. It expands the model's ability to focus on different positions.
2. It gives the attention layer multiple "representation subspaces".
02 Attention and Transformers

The following layer is expecting a single matrix (one vector per word, e.g.,
2 x 4 in the running example). Hence we concatenate the outputs of all heads
and multiply them by an additional weight matrix WO.
If you want some more intuition on attention: watch https://www.youtube.com/watch?v=-9vVhYEXeyQ
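A small sketch of the concatenate-and-project step, assuming 8 heads and an output matrix WO (shapes are illustrative only):

```python
import numpy as np

# Suppose 8 heads each produced a (seq_len, d_k) output Z_h
# (see the self-attention sketch above).
seq_len, d_k, n_heads, d_model = 2, 64, 8, 512
rng = np.random.default_rng(0)
head_outputs = [rng.normal(size=(seq_len, d_k)) for _ in range(n_heads)]

# Concatenate along the feature axis and project back to d_model with W_O,
# so the following layer again sees one (seq_len, d_model) matrix.
W_O = rng.normal(size=(n_heads * d_k, d_model)) * 0.05
Z = np.concatenate(head_outputs, axis=-1) @ W_O   # shape (2, 512)
```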


02 Attention and Transformers

Representing the input order (positional encoding)

The transformer adds a vector to each input embedding. These vectors follow a
specific pattern that the model learns, which helps it determine the position
of each word, or the distance between different words in the sequence. The
intuition here is that adding these values to the embeddings provides
meaningful distances between the embedding vectors once they're projected into
Q/K/V vectors and during dot-product attention.

More on positional encoding:
https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
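One common choice is the fixed sinusoidal encoding from Vaswani et al. (2017); a short sketch is below (learned positional embeddings are an alternative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding (Vaswani et al., 2017).
    Each position gets a d_model-dimensional vector that is added to its embedding."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)    # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

embeddings = np.random.default_rng(0).normal(size=(10, 512))
inputs = embeddings + positional_encoding(10, 512)     # what the encoder actually consumes
```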
02 Attention and Transformers

Add and Normalize

To stabilize the computation, each sublayer is followed by a residual ("add")
connection and a layer-normalization step, so that each position's feature
vector has the same mean and standard deviation.
02 Attention and Transformers

Layer Normalization (Hinton)

Layer normalization normalizes the inputs across the features.
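A minimal sketch of layer normalization over the feature dimension (gamma and beta are the learnable scale and shift):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Layer normalization: normalize each row (one token's features) to zero mean
    and unit variance, then apply a learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).normal(size=(2, 512))   # two token representations
y = layer_norm(x)   # each row now has (approximately) mean 0 and variance 1
```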
02 Attention and Transformers

The complete transformer

The encoder-decoder attention is just like self-attention, except it takes K
and V from the top of the encoder output and uses its own Q.
02 Attention and Transformers

Note: in the decoder, the input is "incomplete" when calculating
self-attention, because future positions have not been generated yet.

The solution is to set the attention scores for future positions to "-inf"
(before the softmax), so they receive zero weight.
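A small sketch of this masking step, applied to the raw attention scores before the softmax:

```python
import numpy as np

def causal_mask(scores):
    """Mask future positions in decoder self-attention: entries above the diagonal
    (position j > i) are set to -inf so the softmax gives them zero weight."""
    seq_len = scores.shape[-1]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(mask, -np.inf, scores)

scores = np.random.default_rng(0).normal(size=(4, 4))   # raw Q K^T / sqrt(d_k) scores
masked = causal_mask(scores)   # row i only attends to positions 0..i
```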


02 Attention and Transformers

Decoder's output linear layer

02 Attention and Transformers

How it works

Decoder's output linear layer. But what about self-attention?
02 Attention and Transformers

Training and the Loss Function

We can use cross entropy as the loss.

We can also optimize two words at a time using beam search: keep a few
alternative candidates for the first word.
02 Attention and Transformers

Cross Entropy and KL (Kullback-Leibler) divergence

• Entropy: E(P) = - Σi P(i) log P(i), the expected (and optimal) prefix-free code length.
• Cross entropy: C(P, Q) = - Σi P(i) log Q(i), the expected coding length when using the optimal code for Q.
• KL divergence: DKL(P || Q) = Σi P(i) log [P(i)/Q(i)] = Σi P(i) [log P(i) - log Q(i)], the extra bits needed to code using Q rather than P.
• JSD(P || Q) = ½ DKL(P || M) + ½ DKL(Q || M), with M = ½ (P + Q); a symmetrized KL divergence.

* JSD = Jensen-Shannon Divergence
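A quick numeric check of these definitions (toy distributions, base-2 logarithms):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])    # "true" distribution
q = np.array([0.5, 0.3, 0.2])    # model distribution

entropy = -np.sum(p * np.log2(p))          # E(P)
cross_entropy = -np.sum(p * np.log2(q))    # C(P, Q)
kl = np.sum(p * np.log2(p / q))            # D_KL(P || Q)

m = 0.5 * (p + q)
jsd = 0.5 * np.sum(p * np.log2(p / m)) + 0.5 * np.sum(q * np.log2(q / m))

print(entropy, cross_entropy, kl, jsd)
assert np.isclose(cross_entropy, entropy + kl)   # cross entropy = entropy + KL divergence
```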
02 Attention and Transformers

Transformer Results
02 Attention and Transformers

Next Lecture: Pretraining

BERT, GPT
02 Attention and Transformers

Literature & Resources for Transformers

Vaswani et al., "Attention Is All You Need", 2017.

Resources:
https://nlp.seas.harvard.edu/2018/04/03/attention.html (an excellent
explanation of the transformer model, with code)
Jay Alammar, The Illustrated Transformer (from which I borrowed many
pictures):
http://jalammar.github.io/illustrated-transformer/
